Bidi: now I'm confused

Discussion:

Roy Badami

2003-09-07 14:08:56 UTC

Ok, I have a problem with what I understand to be the display model
for IDNA and IRI (and presumably by extension IMA).

I'm assuming that the display model is 'render using bidi in an LTR
context'.

Specifically, the IRI draft says:

When rendered, bidirectional IRIs MUST be rendered using the Unicode
Bidirectional Algorithm [UNIV4], [UNI9]. Bidirectional IRIs MUST be
rendered with an overall left-to-right (ltr) direction.

The latter requirement isn't specified in bidi-speak, but is
presumably to be interpreted as saying they must be rendered at an
even embedding level. Actually, this isn't quite enough in the
general case, since what comes before the string may affect weak type
resolution, but since IRIs generally start with a Latin letter
(generally 'h' :) this isn't really much of a problem.

So let's for the moment assume that the display model is that IDNs,
IRIs, IMAs are rendered at an even embedding level, such that the
IDN/IRI/IMA constitutes the sole text in the level run. (This can
easily be achieved by bracketing the string with LRE and PDF prior to
rendering.)
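
A minimal sketch of the bracketing just described, assuming the renderer
applies the Unicode Bidirectional Algorithm and honours embedding
controls. The helper name wrap_ltr is made up for illustration.

```python
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def wrap_ltr(text: str) -> str:
    """Bracket text with LRE ... PDF so that it sits in its own LTR embedding."""
    return f"{LRE}{text}{PDF}"


# Example: a domain whose middle label is two Hebrew letters (alef, bet).
wrapped = wrap_ltr("123.\u05D0\u05D1.com")
```

The wrapped string is two code points longer than the input; the
controls are invisible and only affect ordering at display time.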

Consider the domain:

123.ARAB.com (logical order)
123.BARA.com (display order)

now consider the domain:

ARAB.123.com (logical order)
123.BARA.com (display order)

Ergo, we need another display model; this one doesn't work, at least
not if we don't want two completely different domains to display
identically.
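
The collision can be reproduced with a deliberately tiny model of the
bidi rules. This is a sketch under stated assumptions, not an
implementation of the full algorithm: uppercase letters stand in for
right-to-left letters (as in the examples above), only letters, digits
and dots are modelled, the base direction is LTR, and only rules W4, W6,
W7, N1/N2, I1 and L2 are applied. All function names are illustrative.

```python
def classify(ch: str) -> str:
    """Toy bidi classes, following the notation used in the thread."""
    if ch.isdigit():
        return "EN"  # European number
    if ch.isupper():
        return "R"   # stand-in for a right-to-left letter
    if ch.islower():
        return "L"   # left-to-right letter
    if ch == ".":
        return "CS"  # common separator
    raise ValueError(f"character {ch!r} is outside this toy model")


def strong_side(t: str) -> str:
    """For neutral resolution, numbers count as R (rule N1)."""
    return "R" if t in ("R", "EN") else "L"


def display_order(logical: str) -> str:
    types = [classify(c) for c in logical]
    n = len(types)

    # W4: a single CS between two EN becomes EN; W6: any other CS becomes ON.
    for i, t in enumerate(types):
        if t == "CS":
            between = 0 < i < n - 1 and types[i - 1] == "EN" and types[i + 1] == "EN"
            types[i] = "EN" if between else "ON"

    # W7: EN becomes L when the nearest preceding strong type is L
    # (the start of the text counts as L because the base direction is LTR).
    last_strong = "L"
    for i, t in enumerate(types):
        if t in ("L", "R"):
            last_strong = t
        elif t == "EN" and last_strong == "L":
            types[i] = "L"

    # N1/N2: a run of neutrals takes the direction of its neighbours if they
    # agree, otherwise the embedding direction (L).
    i = 0
    while i < n:
        if types[i] != "ON":
            i += 1
            continue
        j = i
        while j < n and types[j] == "ON":
            j += 1
        before = strong_side(types[i - 1]) if i > 0 else "L"
        after = strong_side(types[j]) if j < n else "L"
        types[i:j] = [before if before == after else "L"] * (j - i)
        i = j

    # I1: at base level 0, L stays at 0, R goes to 1, numbers go to 2.
    levels = [{"L": 0, "R": 1, "EN": 2}[t] for t in types]

    # L2: from the highest level down to 1, reverse each maximal run of
    # characters at that level or higher.
    chars = list(logical)
    for level in range(max(levels, default=0), 0, -1):
        i = 0
        while i < n:
            if levels[i] < level:
                i += 1
                continue
            j = i
            while j < n and levels[j] >= level:
                j += 1
            chars[i:j] = chars[i:j][::-1]
            i = j
    return "".join(chars)


for name in ("123.ARAB.com", "ARAB.123.com"):
    print(name, "->", display_order(name))
```

Both inputs come out as 123.BARA.com in this model. Real Arabic letters
have class AL rather than R, which changes the intermediate number type
but, for these two strings, not the final order.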

I recall that there was a proposal on the IDN list that domains should
always be rendered with the labels appearing in order, least
significant to the left and top-level domain on the right. (This can
be trivially achieved by bracketing each label with LRE/PDF,
separating the labels with dots, and then bracketing the whole domain
with LRE/PDF.)
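
As a rough sketch of that per-label bracketing (the function name
wrap_labels_ltr is hypothetical, and the input is assumed to be a plain
dot-separated domain):

```python
LRE = "\u202A"  # LEFT-TO-RIGHT EMBEDDING
PDF = "\u202C"  # POP DIRECTIONAL FORMATTING


def wrap_labels_ltr(domain: str) -> str:
    """Bracket every label, rejoin with dots, then bracket the whole domain."""
    labels = domain.split(".")
    inner = ".".join(f"{LRE}{label}{PDF}" for label in labels)
    return f"{LRE}{inner}{PDF}"


marked_up = wrap_labels_ltr("\u05D0\u05D1.123.com")
```

Each label then reorders internally but cannot swap places with its
neighbours, because the dots sit between embeddings in the outer LTR run.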

This would solve the above problem, but potentially might be less
friendly to users of RTL languages in other ways.

It also clearly is not what the authors of stringprep had in mind,
since the bidi restrictions in stringprep are much stronger than would
be necessary if this was the model.

-roy

Roy Badami

2003-09-07 14:23:12 UTC

Post by Roy Badami
Ergo, we need another display model; this one doesn't work

There are also other real nasties with this display model:

ARABIC.3com.com (logical order)
3.CIBARAcom.com (display order)

ARABIC.3-com.com (logical order)
3.CIBARA-com.com (display order)

HEBREW.3-com.com (logical order)
3-.WERBEHcom.com (display order)

-roy

Roy Badami

2003-09-07 16:01:48 UTC

Post by Roy Badami
Ergo, we need another display model; this one doesn't work

Worse than that, I think the bidi restrictions in stringprep don't
actually achieve their goal of ensuring that you can't have two
different labels that render the same.

Consider the labels:

A-123,456B

and

A456,-123B

Here, A is HEBREW LETTER ALEF, B is HEBREW LETTER BET (or any
characters of bidi class R that you like, but *not* Arabic letters,
which are class AL) and the comma is actually ARABIC COMMA U+060C (or
any character of class CS or ES).
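
The bidi classes mentioned here can be looked up with Python's standard
unicodedata module. This is only a sketch: the database bundled with a
current interpreter is a later Unicode version than the one in use when
this was written, so the class reported for some punctuation may differ
from what the thread assumes. ARABIC LETTER ALEF is added purely for
contrast.

```python
import unicodedata

# Characters named in the example, plus an Arabic letter and a digit.
samples = {
    "HEBREW LETTER ALEF": "\u05D0",
    "HEBREW LETTER BET": "\u05D1",
    "ARABIC LETTER ALEF": "\u0627",
    "ARABIC COMMA": "\u060C",
    "DIGIT ONE": "1",
}

print("Unicode data version:", unicodedata.unidata_version)
for name, ch in samples.items():
    # unicodedata.bidirectional returns the bidi class as a short string
    print(f"U+{ord(ch):04X} {name}: {unicodedata.bidirectional(ch)}")
```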

As far as I can tell these both pass nameprep with UseSTD13ASCIIRules
set, and they both render identically under bidi as:

B-123,456A

If you don't care about UseSTD13ASCIIRules, you can replace
ARABIC COMMA with COMMA, SOLIDUS or COLON.

I fully expect someone to reply explaining why I'm mistaken, but I've
checked the above as best I can...

-roy

Matitiahu Allouche

2003-09-08 09:20:18 UTC

According to my understanding, and to testing against the Unicode C
reference implementation, you are correct in stating that the 2 strings
("A-123,456B" and "A456,-123B") will give the same display according to
the Unicode algorithm for Bidirectional text.

It proves that you have a more creative mind than the people who proposed
the limitations for Bidi names in IRIs, at least more than mine.

You will admit that your example is more than a little contrived. The
limitations set on IRIs are intended to avoid ambiguity when converting
from the display order to the logical order (which in this case is not
achieved, although the vast majority of users would assume the form
A-123,456B, because the other form, with the comma adjacent to a minus
sign, makes little sense in a domain name). But those limitations were
also designed not to restrict too much the potential for creating
interesting domain names, so a compromise had to be reached. I can find
other examples of names allowed by the rules which can mislead users
trying to induce the logical order based on the display order. All of
these examples are quite bizarre.

By the way, can you give a reference to "UseSTD13ASCIIRules", for an ignoramus like myself?

Shalom (Regards), Mati
Bidi Architect
Globalization Center Of Competency - Bidirectional Scripts
IBM Israel
Phone: +972 2 5888802 Fax: +972 2 5870333 Mobile: +972 52 554160

Sent by: public-iri-***@w3.org
To: ietf-***@imc.org, public-***@w3.org
cc:
Subject: Bidi: is stringprep broken?


Roy Badami

2003-09-08 10:26:29 UTC

Post by Matitiahu Allouche
According to my understanding, and to testing against the Unicode C
reference implementation, you are correct in stating that the 2 strings
("A-123,456B" and "A456,-123B") will give the same display according to
the Unicode algorithm for Bidirectional text.

Thanks for verifying it. Though it's still possible that I'm mistaken
about it passing nameprep.

Post by Matitiahu Allouche
You will admit that your example is more than a little contrived.

Yes, and it's probably unlikely ever to be registerable, since it
involves punctuation (and not only that, but punctuation associated
with the wrong script).

My other example (that ARAB.123.com and 123.ARAB.com render the same)
worries me more.

Post by Matitiahu Allouche
I can find other examples of names allowed by the rules which can
mislead users trying to induce the logical order based on the
display order. All of these examples are quite bizarre.

I'd be interested in the examples you have.

Post by Matitiahu Allouche
By the way, can you give a reference to "UseSTD13ASCIIRules", for an ignoramus like myself?

RFC3490. When the UseSTD13ASCIIRules flag is set, ASCII characters
other than alphanumerics and HYPHEN-MINUS are prohibited, in
accordance with traditional hostname rules. Hence my use of ARABIC
COMMA; I needed a character that was a number separator, was
non-ASCII, and didn't have a compatibility decomposition (since IDNA
uses NFKC).
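
A small sketch of the two properties described here, covering only the
character check as stated above. The published RFC 3490 spells the flag
UseSTD3ASCIIRules and also checks for leading and trailing hyphens,
which this sketch leaves out. The function names are made up for
illustration.

```python
import string
import unicodedata

# ASCII characters permitted by traditional hostname rules.
ALLOWED_ASCII = set(string.ascii_letters + string.digits + "-")


def has_forbidden_ascii(label: str) -> bool:
    """True if the label has an ASCII character other than a letter, digit or hyphen."""
    return any(ord(ch) < 128 and ch not in ALLOWED_ASCII for ch in label)


def survives_nfkc(ch: str) -> bool:
    """True if NFKC normalisation leaves the character unchanged."""
    return unicodedata.normalize("NFKC", ch) == ch


ARABIC_COMMA = "\u060C"
label = "\u05D0-123" + ARABIC_COMMA + "456\u05D1"  # A-123,456B with a non-ASCII comma

print(has_forbidden_ascii(label))            # non-ASCII comma is not caught
print(has_forbidden_ascii("a,b"))            # ASCII comma is caught
print(survives_nfkc(ARABIC_COMMA))           # no compatibility mapping
```

By contrast, a character such as FULLWIDTH COMMA (U+FF0C) would be
folded to the ASCII comma by NFKC and then rejected.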

-roy

Martin Duerst

2003-09-08 19:50:03 UTC

Hello Roy,

I think that in general, you are right about your analysis.
Having labels (or other components) with numbers only may
lead to ambiguous displays. I seem to remember that we were
actually aware of that fact, but there was not much to do
about it:

- There currently are labels with only digits in the DNS;
outlawing them is not an option. (It would have been nice
if we could have said that the same restrictions apply
for digits and LTR letters as they do for digits and RTL
letters.)
- Very explicitly for IDN, but also in many ways for IRIs,
it is highly undesirable to have enforced restrictions
on two or more labels/components. (Note that this may be
somewhat different for the LHS side.)

I have created an issue for this for the IRI draft, at
http://www.w3.org/International/iri-edit#bidiDigits-18.

I propose to address this by adding text that points out
such cases and warns against them (without going as far as
actually prohibiting them). I hope that this is acceptable
for you.

By the way, the alternative of having components displayed
strictly LTR was what we had for a long time. The two problems
with this approach are:
- It does not seem to correspond with what Arabic and Hebrew
writers do naturally, in particular for freestanding domain
names.
- It would require much more control over the contexts of
IRI display than we think will be available (if we get
an overall context of LTR reasonably widely implemented,
I think we already have achieved something).

Regards, Martin.


Roy Badami

2003-09-08 20:53:59 UTC

Post by Martin Duerst
I think that in general, you are right about your analysis.
Having labels (or other components) with numbers only may
lead to ambiguous displays. I seem to remember that we were
actually aware of that fact, but there was not much to do
about it:

More specifically, assuming labels (or syntactic components) obey the
stringprep bidi rules (as IDNA requires, and IRI has as a 'SHOULD')
then I think the problematic case is when a label contains one or more
digits, and no strong characters. These are just quick examples off
the top of my head; I haven't checked them carefully against the bidi
algorithm.

1-2.HEBREW.com (logical order)
1-2.WERBEH.com (display order)

HEBREW.1-2.com (logical order)
1-2.WERBEH.com (display order)

And another. N is a neutral character (presumably non-ASCII, since
there are no ASCII neutrals allowed by hostname rules).

1N2.ABC.com (logical order)
1N2.CBA.com (display order)

ABC.2N1.com (logical order)
1N2.CBA.com (display order)

Of course, this may need to be restated somewhat for IMA because the
LHS doesn't formally have structure.
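
The condition described above (a label with one or more digits and no
strong characters) is easy to test for mechanically. The following is a
sketch only: it flags candidate labels using the bidi classes in
Python's unicodedata module, and does not check whether a neighbouring
label is actually right-to-left, which is what makes such a label
ambiguous in practice. The function names are hypothetical.

```python
import unicodedata

STRONG = {"L", "R", "AL"}   # strong bidi classes
DIGITS = {"EN", "AN"}       # European and Arabic numbers


def is_digits_without_strong(label: str) -> bool:
    """True if the label has at least one digit and no strong character."""
    classes = [unicodedata.bidirectional(ch) for ch in label]
    has_digit = any(c in DIGITS for c in classes)
    has_strong = any(c in STRONG for c in classes)
    return has_digit and not has_strong


def risky_labels(domain: str) -> list:
    """Return the labels of a dot-separated domain that match the condition."""
    return [lab for lab in domain.split(".") if is_digits_without_strong(lab)]


print(risky_labels("1-2.\u05D0\u05D1.com"))
print(risky_labels("3com.com"))
```

A warning of the kind discussed below could then be attached to any
domain for which the returned list is non-empty and which also contains
a right-to-left label.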

Post by Martin Duerst
I propose to address this by adding text that points out
such cases and warns against them (without going as far as
actually prohibiting them). I hope that this is acceptable
for you.

If we can't change the display model (and I see why that may not be
desirable and/or practicable) then I guess that's all that can
realistically be done. I'm tempted to say it ought to be a 'SHOULD
NOT', and not just a recommendation (in line with the other 'SHOULD
NOTs' about bidi IRIs).

I think that IMA will need to contain a similar 'SHOULD NOT'.

In the case of IMA we need to warn people (at least) against addresses
such as:

***@ABC.com (logical order)
***@123.com (logical order)

both of which display as

***@CBA.com (display order)

but I think there are other cases as well.

-roy

Martin Duerst

2004-03-21 16:41:30 UTC

Hello Roy, others,

As proposed, I have added an additional example (example 10) in
http://www.w3.org/International/iri-edit/draft-duerst-iri-06.txt
(and also in http://www.w3.org/International/iri-edit/BidiExamples).

I'm herewith closing this issue
(http://www.w3.org/International/iri-edit/#bidiDigits-18).
Please tell me if you think that's not enough
(and in that case, if possible, what else is needed).

Regards, Martin.
