Discussion:
Normalisation and matching
Dan Oscarsson
2003-08-10 08:42:19 UTC
Below are parts from the latest draft with my changes and comments.
They contain the basic things I cannot accept in the current IMAA and
shift the viewpoint more towards the international user and the future
(instead of the ASCII view).

Dan



2. Terminology
An "internationalized local part" (ILP) is anything that satisfies
(1) It conforms to the same
syntax as a non-internationalized local part except that non-ASCII
Unicode characters are allowed wherever ASCII letters are allowed.
(2) It is normalised using NFKC and has no forbidden
characters (see NAMEPREP).
(3) Maximum length is the same number of characters as
in legacy ASCII local parts, though it is recommended that
the limit be so large that it will never matter.
The ToASCII operation must always succeed on the above.
(and Punycode cannot then have a 59-character limit)
An "internationalized mail address" (IMA) consists of an
internationalized local part, an ASCII at-sign, and an internationalized
domain name [IDNA], in that order.
This is the IMA. An IMA is always normalised using NFKC.
In free form text there may be e-mail addresses using
wide characters or other unnormalised forms.
An IMA is NOT case folded. An IMA is required to preserve case.

Equivalence of local parts is defined in terms of the dequoted form
(see above) and case-insensitive matching. All characters having
a one-to-one case mapping are matched, as well as any character
that has a one-to-one mapping with accents.
The character SHARP S (es-zet) does not match "ss". That is:
Lasse does NOT match La<sharp s>e.
An "IMA-aware mail address slot" is defined in this document to
be a mail address slot explicitly designated for carrying an
internationalized mail address as defined in this document. The
designation may be static (for example, in the specification of
the protocol or interface) or dynamic (for example, as a result of
negotiation in an interactive session).
As an IMA is always normalised, the slot will always contain
a well-defined, easy-to-handle e-mail address.


3. Requirements and applicability

3.1 Requirements
1) In an internationalized mail address, the following characters
MUST be recognized as at-signs for separating the local part
from the domain name: U+0040 (commercial at), U+FF20 (fullwidth
commercial at).
No! An IMA must be normalised using NFKC. No wide characters
will be recognised; only U+0040 may be recognised.
Applications using free text with embedded e-mail addresses, or
applications having free-form input fields, must always follow the
rules of the normalised form (IMA).
4) If two mail addresses are equivalent and either one refers to a
mailbox, then both MUST refer to the same mailbox, regardless of
whether they use the same form of at-sign.
If the IMAs match, the mailbox is the same.
Discussion: This implies that non-ASCII local parts cannot be
deployed in domains whose mail exchangers are case-sensitive.
IMAs are required to preserve case, so it is in principle possible.
But IMAs must be matched using case-insensitive matching, so it could
only be used as some internal scheme.
Adam M. Costello
2003-08-10 18:26:59 UTC
An IMA is NOT case folded. An IMA is required to preserve case.
Equivalence of local parts is defined in terms of the dequoted
form (see above) and case-insensitive matching.
I can think of two ways to make local parts case-insensitive and
case-preserving.

One way is to exclude the case-folding step from Nameprep, and to
define the comparison procedure as caseInsensitiveUnicodeCompare(
ToUnicode(X), ToUnicode(Y) ), as opposed to the current procedure which
is caseInsensitiveASCIICompare( ToASCII(X), ToASCII(Y) ). The drawback
of the UnicodeCompare approach is that it's not compatible with existing
mail servers. Users could not create internationalized local parts
until the server was updated to perform the new comparison. With the
ASCIICompare approach, that's the same operation that existing mail
servers already perform, so users could create internationalized local
parts immediately.
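
A minimal sketch of the two procedures in Python, where to_ascii and
to_unicode are hypothetical stand-ins for the draft's ToASCII/ToUnicode
operations (which involve Nameprep and Punycode and are not reproduced
here):

    def ascii_compare(x, y, to_ascii):
        # Current procedure: existing mail servers already compare
        # local parts case-insensitively in ASCII, so this works today.
        return to_ascii(x).lower() == to_ascii(y).lower()

    def unicode_compare(x, y, to_unicode):
        # Alternative procedure: compare the decoded Unicode forms
        # caselessly; not deployable until servers are updated.
        return to_unicode(x).casefold() == to_unicode(y).casefold()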

The other way to make local parts case-insensitive and case-preserving
is to leave the case-folding step in Nameprep, but use mixed-case
annotation to allow it to be undone. There are a handful of special
cases that this wouldn't cover (sharp s, titlecase), but it would be
compatible with existing mail servers. This can always be added later
as an option. But it's fairly complex, which is a reason to make it
optional rather than required. Why is it so important to preserve
case in identifiers that are case-insensitive? Even if case is not
preserved, the identifier will still work, and still be recognizable to
humans. Case preservation seems desirable, but not essential.
An "internationalized local part" (ILP) is anything that satisfies
(1) It conforms to the same
syntax as a non-internationalized local part except that non-ASCII
Unicode characters are allowed wherever ASCII letters are allowed.
(2) It is normalised using NFKC and has no forbidden
characters (see NAMEPREP).
(3) Maximum length is the same number of characters as
in legacy ASCII local parts, though it is recommended that
the limit be so large that it will never matter.
The ToASCII operation must always succeed on the above.
(and Punycode cannot then have a 59-character limit)
(3) is implied by (1), so there's no need to state it.

The 59-character limit applies to encoded segments of the local part,
not to the entire local part. That issue is in the could-be-reopened
category, because it hasn't been discussed much. If we raise the limit
on Punycode output size, then some existing implementations of Punycode
won't work for IMAA. If the limit is much larger than 59, then the
published Punycode algorithm becomes inappropriate, because it's an
O(n^2) algorithm. I think there exists a more complex algorithm that
is O(n log n). Once again, there is a tradeoff: is support for long
segments worth the added complexity?
An IMA is always normalised using NFKC. In free form text
there may be e-mail addresses using wide characters or other
unnormalised forms.
And those unnormalized forms are not IMAs? But they must be accepted by
application user interfaces and converted to IMAs? Then why not just
call them IMAs? If it looks like a duck and quacks like a duck...
Lasse does NOT match La<sharp s>e.
But does LASSE match La<sharp-s>e? The former is the proper all-caps
form of the latter, so they should match. Does LASSE match lasse? Yes,
they are all-ASCII strings that already match according to existing
rules. Therefore lasse matches La<sharp-s>e, unless matching is
non-transitive, which would be screwy (that would prevent the matching
from being implemented as exactCompare(Canonical(X),Canonical(Y)), which
is how matching is customarily implemented).
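
A sketch of that customary implementation in Python, using (purely for
illustration) NFKC plus full case folding as the canonical form, rather
than the exact Nameprep pipeline:

    import unicodedata

    def canonical(s):
        # Illustrative canonical form: NFKC, then full case folding.
        return unicodedata.normalize("NFKC", s).casefold()

    def match(x, y):
        # exactCompare(Canonical(X), Canonical(Y)): any relation
        # implemented this way is automatically transitive.
        return canonical(x) == canonical(y)

    # Full case folding collapses all three to "lasse":
    assert match("LASSE", "La\u00dfe")   # U+00DF is sharp s
    assert match("LASSE", "lasse")
    assert match("lasse", "La\u00dfe")   # forced by transitivity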

Besides, the Unicode Consortium already defines how to do
case-insensitive matching of Unicode strings; what makes us think we
know better?
An "IMA-aware mail address slot"...
As an IMA is always normalised, the slot will always contain
a well-defined, easy-to-handle e-mail address.
I don't see why we should impose that constraint. An IMA-aware slot
can impose as many or as few additional constraints as it likes. For
example, it could require normalized UTF-8, or it could require ASCII
(using ACE when necessary), or it could allow non-normalized UTF-8, or
it could allow non-Unicode charsets (like iso-2022-jp or Big5)
which are implicitly non-normalized because normalization is defined
only for Unicode.

Also, IDNA does not impose this constraint, so what makes IMAA different
in this respect?

AMC
Dan Oscarsson
2003-08-11 07:00:28 UTC
Post by Adam M. Costello
The other way to make local parts case-insensitive and case-preserving
is to leave the case-folding step in Nameprep, but use mixed-case
annotation to allow it to be undone. There are a handful of special
cases that this wouldn't cover (sharp s, titlecase), but it would be
compatible with existing mail servers.
Does titlecase exist after a text has been normalised using NFKC?
I would expect NFKC to remove titlecase, as it is a purely typographic matter.

sharp s should not be matched to ss (one of the reasons is simply that
you cannot know whether an uppercase SS is SS or sharp s. Unicode
should include a capital sharp s character code if that is wanted).
Post by Adam M. Costello
Why is it so important to preserve
case in identifiers that are case-insensitive? Even if case is not
preserved, the identifier will still work, and still be recognizable to
humans. Case preservation seems desirable, but not essential.
Case is for many a part of the identity. As an e-mail address will be shown
to many people, you want it to be shown as you write it.
Matching is case-insensitive - not everybody knows what case you prefer
for your e-mail address (and case is not spoken over the phone when you
give your address).

If we cannot handle ACE->UCS with case preservation we might as well use
a one-way hash for the ASCII world.
Post by Adam M. Costello
Post by Dan Oscarsson
An "internationalized local part" (ILP) is anything that satisfies
(1) It conforms to the same
syntax as a non-internationalized local part except that non-ASCII
Unicode characters are allowed wherever ASCII letters are allowed.
(2) It is normalised using NFKC and has no forbidden
characters (see NAMEPREP).
(3) Maximum length is the same number of characters as
in legacy ASCII local parts, though it is recommended that
the limit be so large that it will never matter.
The ToASCII operation must always succeed on the above.
(and Punycode cannot then have a 59-character limit)
(3) is implied by (1), so there's no need to state it.
OK. But that may break your 59-character limit.
An ASCII local part can contain more than 59 characters.
Post by Adam M. Costello
Post by Dan Oscarsson
An IMA is always normalised using NFKC. In free form text
there may be e-mail addresses using wide characters or other
unnormalised forms.
And those unnormalized forms are not IMAs? But they must be accepted by
application user interfaces and converted to IMAs? Then why not just
call them IMAs? If it looks like a duck and quacks like a duck...
No, we must separate what is used in protocols and what is used
in user interfaces. In a protocol a well defined simple format is very
much desired. IMAA should define how non-ASCII e-mail addresses are
used in protocols, not how a user interface should handle them.
Having IMAs normalised using NFKC in protocols makes
everything much simpler to handle while not removing anything vital.
Post by Adam M. Costello
Post by Dan Oscarsson
Lasse does NOT match La<sharp s>e.
But does LASSE match La<sharp-s>e?
NO
Post by Adam M. Costello
The former is the proper all-caps
form of the latter, so they should match.
Only in some cases in Germany. The discussions on the IDN list have shown
that it is not the general case even in Germany.

They do not match in Swedish.
Post by Adam M. Costello
Besides, the Unicode Consortium already defines how to do
case-insensitive matching of Unicode strings; what makes us think we
know better?
They do not say that ss and sharp s should match. They have a
single-to-single matching that is good and easy to use.
They have a selection of special casings, of which the sharp s handling
is one. You are not required to use it.
And I have seen so many examples of problems with sharp s and ss that
it would be bad to include it.
Post by Adam M. Costello
Post by Dan Oscarsson
An "IMA-aware mail address slot"...
As an IMA is always normalised, the slot will always contain
a well-defined, easy-to-handle e-mail address.
I don't see why we should impose that constraint. An IMA-aware slot
can impose as many or as few additional constraints as it likes. For
example, it could require normalized UTF-8, or it could require ASCII
(using ACE when necessary), or it could allow non-normalized UTF-8, or
it could allow non-Unicode charsets (like iso-2022-jp or Big5)
which are implicitly non-normalized because normalization is defined
only for Unicode.
An IMA-aware mail address slot must use UCS normalised with NFKC to make
things simple. You do not want to spend resources on matching @ with "wide @".
Slots with other encodings are something else
and should not be recommended. How you encode UCS is up to the protocol.
Post by Adam M. Costello
Also, IDNA does not impose this constraint, so what makes IMAA different
in this respect?
IDNA failed at this. It is one reason why I think the IDNA RFC is
unclear.

-
Maybe the question is: what is IMAA good for?
I want an RFC defining a clear, standardised, simple format to use
for e-mail addresses with non-ASCII characters together with clear
and simple rules of how they are to be compared. It should also
define how they are to be encoded when transmitted over legacy
e-mail systems.

I do not want something that says that in an international context
you can transmit characters unnormalised or in many character sets.
That does not make interoperability work.

Dan
Adam M. Costello
2003-08-11 09:11:13 UTC
Post by Dan Oscarsson
Does titlecase exist after a text has been normalised using NFKC?
Good question! Nope, NFKC-normalized text never contains titlecase
characters, so titlecase isn't a problem after all.
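
This is easy to verify with Python's unicodedata module, since the
titlecase digraphs carry compatibility decompositions that NFKC resolves
into plain letter pairs (a quick illustrative check):

    import unicodedata

    # U+01C5 'Dž' and U+01C8 'Lj' are titlecase characters; NFKC
    # decomposes them into ordinary two-letter sequences:
    assert unicodedata.normalize("NFKC", "\u01C5") == "D\u017E"
    assert unicodedata.normalize("NFKC", "\u01C8") == "Lj"
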
Post by Dan Oscarsson
Case is for many a part of the identity. As an e-mail address will be
shown to many people, you want it to be shown as you write it.
Yes, we want that, but there are other things we want too, and sometimes
we have to make choices.
Post by Dan Oscarsson
Matching is case-insensitive - not everybody knows what case you
prefer for your e-mail address (and case is not spoken over the phone
when you give your address).
If we cannot handle ACE->UCS with case preservation we might as well
use a one-way hash for the ASCII world.
Do you see what you're saying? Given:

1) José
2) josé
3) kmcsi5csxy

you're saying that if you can't show (1), then it doesn't matter whether
you show (2) or (3), because neither is any better than the other. Are
you serious?

We already went through all this for IDNA. Traditionally domain names
are case-insensitive and case-preserving, but the case-preserving
part was relaxed for IDNs because it wasn't deemed worth the
additional required complexity. We can add an optional mechanism for
case-preservation later.

We could repeat the same arguments for IMAA, but I see no reason why it
would play out any differently than it did for IDNA.
Post by Dan Oscarsson
An ASCII local part can contain more than 59 characters.
Yes, and the current draft supports that; it just doesn't support more
than 63 characters (59 + ACE infix) in any single encoded segment.

For example,

0iesg1n8jok5ay5dzabd5bym9f0cm5685rrjetr6pdxa-0iesg1b1abfaaepdrnnbgefbaDotcwatmq2g4l

is an 83-character local part that is valid under imaa-02, because neither
of its two segments exceeds the limit. But

0iesg1989aomsvi5e83db1d2a355cv1e0vak1dwrv93d5xbh15a0dt30a5jpsd879ccm6fea98c

would be invalid, even though it's shorter, because it's all one
segment.
Post by Dan Oscarsson
IMAA should define how non-ASCII e-mail addresses are used in
protocols, not how a user interface should handle them.
Just the opposite. Initially, non-ASCII mail addresses will not appear
in protocols at all (only ASCII addresses will), but they will appear in
user interfaces. The primary issue that IMAA needs to address is how
to bridge the gap between non-ASCII user interfaces and existing ASCII
protocols. Non-ASCII mail addresses won't appear in protocols until
new protocols are defined, and each new protocol can then specify how
non-ASCII mail addresses are to be represented in that protocol. If
normalized Unicode is the best way, then that's what they'll specify,
but there's no need for us to prescribe that choice now.
Obviously *you* don't. When you define an IMA-aware protocol, you are
welcome to restrict the syntax so that receivers don't need to bother
checking for fullwidth @.

But there may be some protocol designers who believe that it's important
to preserve not only mixed case, but also other presentational
details that would be destroyed by normalization (like a fullwidth
@, or a superscript 2). They might decide that it's better to
transmit/display/store the mail address exactly as it was originally
typed, and perform the normalization only momentarily for the purpose of
matching it to the appropriate mailbox, and then discard the normalized
form.

There is no need for us to decide this question now for all future
protocols. As long as the equivalence relation is standard, each
protocol can use whatever equivalent form it thinks is best. Gateways
between protocols will of course need to respect the requirements of
each protocol, and perform any necessary translations, but that's the
nature of a gateway.

And again, I see no reason why IMAA and IDNA should take different
courses on this issue. IDNA did not prescribe any particular form for
IDN-aware slots.
Post by Dan Oscarsson
Slots with other encodings are something else and should not be
recommended.
They are neither recommended nor discouraged.
Post by Dan Oscarsson
Post by Adam M. Costello
Besides, the Unicode Consortium already defines how to do
case-insensitive matching of Unicode strings; what makes us think we
know better?
They do not say that ss and sharp s should match. They have a
single-to-single matching that is good and easy to use. They have a
selection of special casings, of which the sharp s handling is one. You
are not required to use it.
SpecialCasing.txt is for case mapping, not caseless matching. The
relevant file for caseless matching is CaseFolding.txt. It contains the
following comment:

Note that where they can be supported, the full case foldings are
superior: for example, they allow "MASSE" and "Maße" to match.

That makes it very clear that SS and sharp-s are supposed to match.
They both get folded to ss.

There is nothing locale-specific about that mapping. The only
locale-specific mappings relate to Turkish i.
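
Python's str.casefold() implements that full case folding, so the claim
can be checked directly (an illustrative check, not part of the draft):

    # Under full case folding, SS, ss, and sharp s all coincide:
    assert "MASSE".casefold() == "masse"
    assert "Ma\u00dfe".casefold() == "masse"   # "Maße" folds to "masse"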

AMC
Martin Duerst
2003-08-11 15:34:30 UTC
Post by Adam M. Costello
Post by Dan Oscarsson
IMAA should define how non-ASCII e-mail addresses are used in
protocols, not how a user interface should handle them.
Just the opposite. Initially, non-ASCII mail addresses will not appear
in protocols at all (only ASCII addresses will), but they will appear in
user interfaces. The primary issue that IMAA needs to address is how
to bridge the gap between non-ASCII user interfaces and existing ASCII
protocols. Non-ASCII mail addresses won't appear in protocols until
new protocols are defined, and each new protocol can then specify how
non-ASCII mail addresses are to be represented in that protocol. If
normalized Unicode is the best way, then that's what they'll specify,
but there's no need for us to prescribe that choice now.
I don't think I agree here. We want protocols to work together
easily. We want them to be able to just say 'use this'. We don't
expect them to rehash the discussions we have here. If they do,
they probably won't get the best result, because they don't have
the internationalization expertise that we have.
Also, we want new protocols to pick up on IMAAs. We should make
it easy for them to just point to a definition in our spec.
Post by Adam M. Costello
Obviously *you* don't. When you define an IMA-aware protocol, you are
welcome to restrict the syntax so that receivers don't need to bother
checking for fullwidth @.
But there may be some protocol designers who believe that it's important
to preserve not only mixed case, but also other presentational
details that would be destroyed by normalization (like a fullwidth
@, or a superscript 2). They might decide that it's better to
transmit/display/store the mail address exactly as it was originally
typed, and perform the normalization only momentarily for the purpose of
matching it to the appropriate mailbox, and then discard the normalized
form.
There is no need for us to decide this question now for all future
protocols. As long as the equivalence relation is standard, each
protocol can use whatever equivalent form it thinks is best. Gateways
between protocols will of course need to respect the requirements of
each protocol, and perform any necessary translations, but that's the
nature of a gateway.
Let's make sure gateways are as easy as possible, and don't have
to deal with issues that protocols maybe even forgot to specify.


Regards, Martin.
Adam M. Costello
2003-08-12 01:49:03 UTC
each new protocol can then specify how non-ASCII mail addresses are
to be represented in that protocol. If normalized Unicode is the
best way, then that's what they'll specify, but there's no need for
us to prescribe that choice now.
We want protocols to work together easily. We should make it easy for
them to just point to a definition in our spec.
If we really want IMAA to make a recommendation about how IMAs are to
be represented in IMA-aware slots, then we'll need to agree on what
recommendation to make.

[What is the sound of a can of worms opening...]

Guess what recommendation I would argue for? Minimal restrictions. I
would recommend that IMA-aware protocols allow all valid IMAs (including
non-Nameprepped ones) in whatever charsets they want to support (and
they should at least support UTF-8, as recommended by BCP-whatever).
Applications should not apply Nameprep or NFC or anything before
putting the mail address into the slot; they should leave that to the
receiver to do if necessary. (It will be necessary if the receiver wants
to compare the address, or relay it into an IMA-unaware slot, but not if
the receiver merely wants to relay it into an IMA-aware slot, or display
it.)

I see two advantages to this approach. First, it allows presentational
details to be preserved (like fullwidth characters, superscript
characters, sharp-s, etc).

Second, it reduces superfluous computation at the sending end in
two cases. Case 1: If the receiver doesn't need the string to be
Nameprepped, then it would be a waste for the sender to apply Nameprep.

Case 2: If the receiver needs the string to be Nameprepped, then it
will probably apply Nameprep itself, even if the protocol says the
string should already be Nameprepped on arrival, in which case applying
Nameprep at the sending end is redundant. Unlike IRIs, where the
receiver doesn't know whether it's safe to perform normalization (as
described in section 5.3 of the IRI draft), it is always safe to perform
Nameprep on local parts and domain labels (safe in the sense that the
result is guaranteed to refer to the same domain/mailbox). Therefore,
just as web browsers try to interpret bad HTML rather than give up,
applications would in practice accept non-Nameprepped equivalent forms
even if the protocol said they were supposed to be already Nameprepped.
(And unlike the bad HTML case, there's not even any ambiguity; the
meaning is clearly defined by IMAA.)

So the approach I would recommend is to let applications take
responsibility for applying Nameprep when they themselves need it, don't
depend on other applications to pre-apply it for you, and don't bother
trying to pre-apply it for someone else.
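
A sketch of that discipline in Python; nameprep_approx is a hypothetical
stand-in for the real Nameprep profile (which also checks prohibited
characters), shown only to mark where folding does and does not happen:

    import unicodedata

    def nameprep_approx(s):
        # Stand-in for Nameprep: full case folding, then NFKC.
        return unicodedata.normalize("NFKC", s.casefold())

    def same_mailbox(a, b):
        # Comparison path: apply the folding here, when it is actually
        # needed, rather than expecting pre-folded input.
        return nameprep_approx(a) == nameprep_approx(b)

    def display(address):
        # Display path: no normalization needed; the address is shown
        # exactly as it arrived.
        print(address)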

This approach satisfies the properties you said you wanted: (1) It
is easy for protocols to interoperate; they just need to obey the
requirements in IMAA, applying Nameprep whenever they are required to
(that is, when performing comparisons and when converting to ASCII
form) and at no other time. (2) New protocols can simply point to a
definition in the IMAA spec, namely the definition of IMA.

So that's the recommendation I would prefer to make, and those are
my reasons. I would very much prefer not to make the opposite
recommendation (that is, to recommend that senders pre-apply Nameprep).
I'm quite willing to compromise by making no recommendation at all,
leaving the decision in the hands of the IMA-aware protocol designers,
but that's where we started...

AMC
John Cowan
2003-08-12 02:15:13 UTC
Post by Adam M. Costello
[What is the sound of a can of worms opening...]
/me stands still and brandishes a can opener in the air.
--
John Cowan ***@reutershealth.com www.reutershealth.com www.ccil.org/~cowan
If a soldier is asked why he kills people who have done him no harm, or a
terrorist why he kills innocent people with his bombs, they can always
reply that war has been declared, and there are no innocent people in an
enemy country in wartime. The answer is psychotic, but it is the answer
that humanity has given to every act of aggression in history. --Northrop Frye
Adam M. Costello
2003-08-12 02:21:59 UTC
Post by Adam M. Costello
I see two advantages to this approach. First, it allows
presentational details to be preserved (like fullwidth characters,
superscript characters, sharp-s, etc).
I forgot the most compelling presentational detail: mixed case.

My message focused on the issue of whether to pre-apply Nameprep. If
we consider the similar issue of whether to pre-apply normalization
(without the rest of Nameprep), the argument against doing so becomes
even stronger. If the receiver is going to compare the address or
convert it to ASCII, and the string did not arrive pre-folded, then
the receiver will need to apply case-folding, which is a sufficiently
heavyweight operation that there's no significant advantage in having
the string arrive pre-normalized; the receiver might as well just apply
all of Nameprep, in which case any pre-normalization is redundant. On
the other hand, if the receiver is not going to compare the address or
convert it to ASCII, then pre-normalization wouldn't help it at all.

AMC
J-F C. (Jefsey) Morfin
2003-08-11 18:00:47 UTC
Post by Adam M. Costello
Yes, we want that, but there are other things we want too, and sometimes
we have to make choices.
Sorry, but the only reason why would be that you do not know how to make
it work. It is true there is a problem, but the point of this effort is to
work out a solution. And if this demonstrates it is not possible, the
priorities in the choice may be priorities other than yours.
Post by Adam M. Costello
Post by Dan Oscarsson
Matching is case-insensitive - everybody do not know what case you
prefer on your e-mail address (and case is not spoken over a phone
when you give your address).
No. Matching is whatever matching system the application is set up for. Do not
think in old terms. We are discussing the future, not only support of the past.
Post by Adam M. Costello
Post by Dan Oscarsson
If we cannot handle ACE->UCS with case preservation we could as well
use a one way hash for the ASCII world.
1) José
2) josé
3) kmcsi5csxy
you're saying that if you can't show (1), then it doesn't matter whether
you show (2) or (3), because neither is any better than the other. Are
you serious?
We already went through all this for IDNA. Traditionally domain names
are case-insensitive and case-preserving, but the case-preserving
part was relaxed for IDNs because it wasn't deemed worth the
additional required complexity. We can add an optional mechanism for
case-preservation later.
We could repeat the same arguments for IMAA, but I see no reason why it
would play out any differently than it did for IDNA.
Adam, just because _you_ do not understand a need does not mean that this
need does not exist. Just because you made something limited for years does
not mean it must stay so for ever. Just because you chose one way for
something does not mean it must be followed for everything else.

We all agreed that IDNA was a try. This is why it was eventually accepted.
Don't try to make it a rule. The only result would be a network split.
Post by Adam M. Costello
Post by Dan Oscarsson
IMAA should define how non-ASCII e-mail addresses are used in
protocols, not how a user interface should handle them.
Just the opposite. Initially, non-ASCII mail addresses will not appear
in protocols at all (only ASCII addresses will), but they will appear in
user interfaces.
This is a demand of the DNS for RHS. There is no DNS on LHS.
Post by Adam M. Costello
And again, I see no reason why IMAA and IDNA should take different
courses on this issue. IDNA did not prescribe any particular form for
IDN-aware slots.
ditto.

I see no reason why future IDNA and IMAA should take different
courses on this issue. IMAA is not to prescribe any particular form
for LHS.
Post by Adam M. Costello
Post by Dan Oscarsson
Slots with other encodings are something else and should not be
recommended.
They are neither recommended nor discouraged.
Let us then have an IMAA and a RichIMAA.

Let us have a new RR, "RM" for RichMail, supported instead
of MX. MX will mean that RichMail will have to be degraded
to standard IMAA. It is likely that Bind versions not supporting
"RM" will not give access to RichMail applications.
Post by Adam M. Costello
The local part is no more a name than the domain part. Both are
identifiers designed to be memorable in association with a
person/organization, not equal to the name of that
person/organization. No one expects to exchange email with <John Q.
Public at Yahoo! Inc.>, they are accustomed to exchanging email with "John
Will repeating that change the fact that YES, there ARE people demanding it?

For 6,000 years men have used upper case when they could have used lower
case only. There are people crazy enough to set up language
organizations like Eurolinc to demand it. There are even States, like 25
European States, that put into their Constitutions that such things have
priority over international agreements and local laws.

1. our role is not to impose technical limitations but to work out the way
we will address their expectations. Or just say "I do not know how to do it".

2. just because some techies lacked the budget, the skills, or simply the
cultural interest, and accepted to degrade language, courtesy, and user
services, does not mean this should extend to the world for ever.

This world of ours is made of names, trade marks, IP rights, information,
knowledge, subscriptions, .. which not only come in upper and lower case,
which we obviously have to support, but also with many new signs, like
emoticons and dingbats, that we need to support. Unless we do not want
e-mail to stay compatible with SMS?

Internationalizing is an ASCII patch. Multilingualism is a necessity.
Vernacular e-names are the demand (on the network, names the way they are
used by everyone everywhere else). If RealNames was closed, it was by Bill
Gates' political decision, we all know why, not by lack of user demand.

jfc

PS. May I remind you that "@" is an old French character for "ad" in Latin
(as & is for "et") - also spelled "à" (same key on the keyboard) - which is
used exactly in the way you say we do not.
Dan Oscarsson
2003-08-11 11:09:23 UTC
Post by Adam M. Costello
Post by Dan Oscarsson
Matching is case-insensitive - not everybody knows what case you
prefer for your e-mail address (and case is not spoken over the phone
when you give your address).
If we cannot handle ACE->UCS with case preservation we might as well
use a one-way hash for the ASCII world.
1) José
2) josé
3) kmcsi5csxy
you're saying that if you can't show (1), then it doesn't matter whether
you show (2) or (3), because neither is any better than the other. Are
you serious?
3) is going to the extreme. Only 1) is good.
Post by Adam M. Costello
We already went through all this for IDNA. Traditionally domain names
are case-insensitive and case-preserving, but the case-preserving
part was relaxed for IDNs because it wasn't deemed worth the
additional required complexity. We can add an optional mechanism for
case-preservation later.
We could repeat the same arguments for IMAA, but I see no reason why it
would play out any differently than it did for IDNA.
It was not good in IDNA and it will break some programs.
But domain names converted to lower case are not as important as names.
Post by Adam M. Costello
Post by Dan Oscarsson
IMAA should define how non-ASCII e-mail addresses are used in
protocols, not how a user interface should handle them.
Just the opposite. Initially, non-ASCII mail addresses will not appear
in protocols at all (only ASCII addresses will), but they will appear in
user interfaces. The primary issue that IMAA needs to address is how
to bridge the gap between non-ASCII user interfaces and existing ASCII
protocols. Non-ASCII mail addresses won't appear in protocols until
new protocols are defined, and each new protocol can then specify how
non-ASCII mail addresses are to be represented in that protocol. If
normalized Unicode is the best way, then that's what they'll specify,
but there's no need for us to prescribe that choice now.
While this could be good, the characters allowed in the international form
may not be restricted by limits in the ASCII encoding.
Still, it is bad that IMAA (and IDNA) have a lot of impact on
the international form that would not exist if you started
from the international perspective.
Post by Adam M. Costello
But there may be some protocol designers who believe that it's important
to preserve not only mixed case, but also other presentational
details that would be destroyed by normalization (like a fullwidth
@, or a superscript 2). They might decide that it's better to
transmit/display/store the mail address exactly as it was originally
typed, and perform the normalization only momentarily for the purpose of
matching it to the appropriate mailbox, and then discard the normalized
form.
That may be so, but you will not get interoperability that way.
I can see no reason to support more than one code point for a
character. If I want to display a character using a wider size I will
do that, but the wide attribute does not belong in the character code.

Actually NFKC is not good in some cases, as it removes some characters
that are distinct and lose their meaning under NFKC. Wide characters
do not belong to that category.
Post by Adam M. Costello
SpecialCasing.txt is for case mapping, not caseless matching. The
relevant file for caseless matching is CaseFolding.txt. It contains the
following comment:
Note that where they can be supported, the full case foldings are
superior: for example, they allow "MASSE" and "Maße" to match.
That makes it very clear that SS and sharp-s are supposed to match.
They both get folded to ss.
I could accept that for matching, even though it is not correct in many
cases. But not when doing a lower case conversion.
"Laße" may not be converted by IMAA to "lasse" because it changes the
meaning of the name.
Post by Adam M. Costello
There is nothing locale-specific about that mapping. The only
locale-specific mappings relate to Turkish i.
Actually it is locale specific, even though Unicode does not say so.
There are several mistakes in Unicode, and code points that should never
have existed, making everything unnecessarily complex (like having to have
NFKC because the same character exists more than once for some characters).

Dan
Adam M. Costello
2003-08-11 12:25:05 UTC
But domain names converted to lower case are not as important as names.
The local part is no more a name than the domain part. Both
are identifiers designed to be memorable in association
with a person/organization, not equal to the name of that
person/organization. No one expects to exchange email with
<John Q. Public at Yahoo! Inc.>, they are accustomed to
Still, it is bad that IMAA (and IDNA) have a lot of impact on the
international form that would not exist if you started from the
international perspective.
Yes, but we don't get to choose our starting point. Our starting point
is the ASCII status quo.
Post by Adam M. Costello
But there may be some protocol designers who believe that it's
important to preserve not only mixed case, but also other
presentational details that would be destroyed by normalization
...and details that would be destroyed by case-folding, like sharp-s. :)
Post by Adam M. Costello
That makes it very clear that SS and sharp-s are supposed to match.
They both get folded to ss.
I could accept that for matching, even though it is not correct in
many cases. But not when doing a lower case conversion.
Exactly. Unicode specifies that ToLower(Laße) = laße, not lasse. But
IDNA & IMAA don't do lower case conversion; they have no use for it.
What they need is case folding, because caseless matching is defined
in terms of case folding. Unicode specifies that Fold(Laße) = lasse,
because Laße needs to match LASSE, and Fold(LASSE) = lasse.
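
The distinction is visible in Python, whose str.lower() and
str.casefold() correspond to Unicode case mapping and full case folding
respectively (illustrative):

    # Case mapping preserves sharp s; case folding maps it to "ss":
    assert "La\u00dfe".lower() == "la\u00dfe"    # ToLower("Laße") = "laße"
    assert "La\u00dfe".casefold() == "lasse"     # Fold("Laße") = "lasse"
    assert "LASSE".casefold() == "lasse"         # Fold("LASSE") = "lasse"
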
There are...code points that should never have existed, making
everything unnecessarily complex (like having to have NFKC because the
same character exists more than once for some characters).
Some of those redundant code points (including fullwidth @) are there
for a good reason: to enable lossless round-trips from a national
standard character set to Unicode and back again.

AMC
Dan Oscarsson
2003-08-13 12:23:34 UTC
Post by J-F C. (Jefsey) Morfin
Post by Adam M. Costello
And again, I see no reason why IMAA and IDNA should take different
courses on this issue. IDNA did not prescribe any particular form for
IDN-aware slots.
ditto.
I see no reason why future IDNA and IMAA should take different
courses on this issue. IMAA is not to prescribe any particular form
for LHS.
If IMAA does not want to say anything on the form of IMA, then it
should NOT require full width @ to be recognised. That is for
those defining how IMAs are to be used to decide.

Dan
Roy Badami
2003-08-13 12:43:01 UTC
Post by Dan Oscarsson
If IMAA does not want to say anything on the form of IMA, then it
should NOT require full width @ to be recognised. That is for
those defining how IMAs are to be used to decide.
But that requirement is symmetrical with IDNA's requirement to recognize
labels separated by full-width dots.

-roy
Adam M. Costello
2003-08-14 01:51:33 UTC
If IMAA does not want to say anything on the form of IMA, then it should
NOT require full width @ to be recognised. That is for those
defining how IMAs are to be used to decide.
As I now understand that IDNA does not define how IDNs should be encoded,
there is no reason to require anything about full-width dots. That will
be up to protocol implementors that create protocols with IDN-aware
slots.
IDNA/IMAA define the space of valid IDNs/IMAs and the equivalence
relation among them. Neither requires that a slot allow all forms.
IMA-unaware slots allow only ASCII forms, and a future IMA-aware slot
could choose to allow only normalized forms, in which case fullwidth
at-sign would not be allowed, and therefore obeying the requirement
that fullwidth at-sign be recognized would be effortless and automatic,
because a receiver cannot fail to recognize something that it never has
an opportunity to see. Similarly, a future IDN-aware slot might allow
only normalized forms devoid of ideographic full stop, in which case
it would be impossible for a receiver to violate the requirement that
ideographic full stops be recognized as dots.

Perhaps the specs could be more clear on this point. Here is the
current wording:

Whenever dots are used as label separators, the following characters
MUST be recognized as dots:

In an internationalized mail address, the following characters MUST
be recognized as at-signs for separating the local part from the
domain name:

Maybe adding "if they appear" would make them clearer:

Whenever dots are used as label separators, the following characters
MUST be recognized as dots if they appear in an IDN:

For separating the local part from the domain name, the following
characters MUST be recognized as at-signs if they appear in an
internationalized mail address:

There is no requirement that these characters be allowed in any given
slot; the intention is merely to require that if they appear they must
be treated as equivalent to the ASCII delimiters.
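
A sketch of that reading in Python (illustrative only; it ignores quoted
local parts, which may themselves contain an at-sign):

    AT_SIGNS = "\u0040\uff20"   # commercial at, fullwidth commercial at

    def split_ima(address):
        # The rightmost at-sign of either kind separates the local
        # part from the domain name.
        for i in range(len(address) - 1, -1, -1):
            if address[i] in AT_SIGNS:
                return address[:i], address[i + 1:]
        raise ValueError("no at-sign found")

    # Both forms split identically; a slot that forbids U+FF20 simply
    # never exercises the second case:
    assert split_ima("user@example.com") == split_ima("user\uff20example.com")
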
I am sure you will find that a lot of software capable of displaying
UCS will fail when it is not normalised, and also for things like full
width or circled forms.
That's an argument for normalizing UCS text before displaying it. That
doesn't imply that UCS text should be normalized any earlier than
that. Deferring normalization until it's really necessary would allow
applications with non-buggy display systems (and applications that don't
display the address at all) to opt out of the normalization and save
that cost.
If you use NFC all is preserved. But to simplify character handling
only ONE representation of a character should be allowed. This does
not mean NFKC - it unfortunately does more than that. I want sharp-s
and masculine ordinal indicator to be preserved.
NFKC preserves sharp-s; it's case-folding that destroys sharp-s. But
NFKC does destroy the ordinal indicators.
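
Both halves of that claim can be checked with Python's unicodedata
(illustrative):

    import unicodedata

    assert unicodedata.normalize("NFKC", "\u00df") == "\u00df"  # sharp s survives
    assert unicodedata.normalize("NFKC", "\u00ba") == "o"       # ordinal indicator lost
    assert unicodedata.normalize("NFC", "\u00ba") == "\u00ba"   # NFC preserves it
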
I do not want full width characters as not all letters can be full
width and it is just a second encoding of the standard width letter,
nor do I want ligatures.
So what you're saying is, you want to require pre-normalization, but not
NFC or NFKC, but rather NF-Dan. I guess your next step is to write a
spec for that.

Even assuming you write that spec and convince the Unicode Consortium or
this mailing list to consider NF-Dan, I still don't see the advantage
of requiring early normalization. The receiver knows what it needs,
but the sender doesn't know what the receiver needs. If a receiver
needs normalized strings for whatever reason, it can normalize them
itself, but if it doesn't need normalized strings, there's no need to
waste the sender's effort on pre-normalization. How often would the
receiver benefit from pre-normalization anyway? If the receiver is
going to display the address, there might be a benefit, if the display
system can't handle non-normalized strings (how common is that?). But
if the receiver is going to compare the address, or resolve the address,
or gateway it to an ASCII slot, then there is no benefit, because the
receiver needs to apply Nameprep's case-folding and NFKC, which means
any prior normalization is redundant.
From all I have read the best thing is if the sender does the
normalisation, not the receiver. It is often easy to normalise during
input without overhead. To write code that must normalise data every
time you get it, before it can be usable, will cost a lot more.
IRI uses NFC for this reason.
IRI recommends NFKC when creating IRIs and requires NFC when converting
IRIs to URIs. The reason given for having senders rather than receivers
perform normalization has nothing to do with performance or cost, it has
to do with correctness:

Equivalence of IRIs MUST rely on the assumption that IRIs are
appropriately pre-normalized, rather than applying normalization
when comparing two IRIs.

Because we do not know how a particular field is treated with
respect to text normalization, it would be inappropriate to
allow third parties to normalize an IRI arbitrarily. This
does not contradict the recommendation that if you create a
resource, and an IRI for that resource, you try to be as normalized
as possible (i.e. NFKC if possible). This is similar to the
upper-case/lower-case problems in URIs. Some parts of an URI are
case-insensitive (domain name). For others, it is unclear whether
they are case-sensitive or case-insensitive, or something in
between (e.g. case-sensitive, but if you use the wrong case, may
not directly get a result, but rather a 'Multiple choices'). The
best recipe we have there is that the generator uses a reasonable
capitalization, and when transfering the URI, you do not change
capitalization.

(Section 5.3.)

In summary, IRIs need to use pre-normalization because there is no
single well-known equivalence relation; only the creator of an IRI knows
for sure if any other strings are equivalent to it, and knows how to
compare that IRI against others.

That's not a problem for IDNs and IMAs, because IDNA and IMAA define
a standard well-known equivalence relation for all non-ASCII IDNs
and IMAs. Anyone can compute canonical forms and compare IDNs/IMAs,
not just the creator. Therefore there is no need for the creator to
pre-normalize; if and when normalization is needed, it can be performed
by whoever needs it.

AMC
Dan Oscarsson
2003-08-13 12:41:36 UTC
Post by Adam M. Costello
Guess what recommendation I would argue for? Minimal restrictions. I
would recommend that IMA-aware protocols allow all valid IMAs (including
non-Nameprepped ones) in whatever charsets they want to support (and
they should at least support UTF-8, as recommended by BCP-whatever).
Applications should not apply Nameprep or NFC or anything before
putting the mail address into the slot; they should leave that to the
receiver to do if necessary. (It will be necessary if the receiver wants
to compare the address, or relay it into an IMA-unaware slot, but not if
the receiver merely wants to relay it into an IMA-aware slot, or display
it.)
I am sure you will find that a lot of software capable of displaying
UCS will fail when it is not normalised, and also for things like
full width or circled forms.
Post by Adam M. Costello
I see two advantages to this approach. First, it allows presentational
details to be preserved (like fullwidth characters, superscript
characters, sharp-s, etc).
Normalised does not mean that that goes away. If you use NFC all is
preserved. But to simplify character handling only ONE representation
of a character should be allowed. This does not mean NFKC - it unfortunately
does more than that. I want sharp-s and masculine ordinal indicator
to be preserved.
I do not want full width characters as not all letters can be full width and
it is just a second encoding of the standard width letter, nor do I want
ligatures. I prefer simple forms.
Post by Adam M. Costello
Second, it reduces superfluous computation at the sending end in
two cases. Case 1: If the receiver doesn't need the string to be
Nameprepped, then it would be a waste for the sender to apply Nameprep.
I do not want it to be nameprepped. I want it to be normalised, with
no multiple representations of the same character.
Post by Adam M. Costello
So the approach I would recommend is to let applications take
responsibility for applying Nameprep when they themselves need it, don't
depend on other applications to pre-apply it for you, and don't bother
trying to pre-apply it for someone else.
From all I have read the best thing is if the sender does the normalisation,
not the receiver. It is often easy to normalise during input without overhead.
To write code that must normalise data every time you get it, before it can
be usable, will cost a lot more.

IRI uses NFC for this reason.

Dan
Mark Davis
2003-08-13 13:45:35 UTC
Post by Dan Oscarsson
From all I have read the best thing is if the sender does the normalisation,
not the receiver. It is often easy to normalise during input without
overhead.
To write code that must normalise data every time you get it, before it can
be usable, will cost a lot more.
The "will cost a lot more" is a bit overstated. Because most text, in
practice, is already normalized, the usual practice is to check
whether the text is already normalized, using a very fast algorithm
like that in http://www.unicode.org/reports/tr15/#Annex8, or
enhancements thereof. Only if unnormalized sequences are detected does
full normalization need to be invoked.
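
Python 3.8+ exposes exactly this quick check as
unicodedata.is_normalized(), so the strategy looks like this (a sketch):

    import unicodedata

    def nfc(s):
        # Fast path: most text is already normalized, and the quick
        # check detects that without rewriting the string.
        if unicodedata.is_normalized("NFC", s):
            return s
        # Slow path: full normalization, only when actually needed.
        return unicodedata.normalize("NFC", s)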

It does cost a bit more, since you do have to check each character
[only in House Republican fancy does increasing (government spending)
reduce size (of government)]. But depending on how much other
processing is going on, it may or may not be significant. In parsing
XML, for example, it is not.

For more descriptions of different strategies, see the end of
http://www.unicode.org/reports/tr15/#Canonical_Equivalence.

Mark
__________________________________
http://www.macchiato.com
► “Eppur si muove” ◄

J-F C. (Jefsey) Morfin
2003-08-13 15:06:00 UTC
Post by Mark Davis
It does cost a bit more, since you do have to check each character
[only in House Republican fancy does increasing (government spending)
reduce size (of government)]. But depending on how much other
processing is going on, it may or may not be significant. In parsing
XML, for example, it is not.
Was there not at some time an attempt to model the different layers
involved in such things (without going into a complete network architecture)?

All these problems seem addressable by OPES (open pluggable edge services)
now under finalization.

I agree most consider the internet as a dumb network/smart host approach vs
the smart network/dumb host telecoms vision. Could we not think again, as
NGN will help doing "smart network/smart host"? OPES just starts permitting
it (see PS of this post). We can imagine mail services will simply
subscribe to a converting OPES until they update and aggregate the
necessary solutions.

I bored the WG-OPES with the requirement to support the DNS. IDNA does not
address the real operators demand: ML.ML. There is no problem in having
ISPs including the filtering in/out of the Chinese TLDs into ".com" or
".net" without using DNAMES. What OPES can do in the RHS they can also do
it on the LHS.

My own old proposition (my job before IETF was created :-) is just the next
step ahead; it is to internetwork OPES into ONES (open network extended
services) i.e. interactive networks/host services to the relations
of community. They started in the 80s (Swift, SITA, Visa, Amadeus, etc.)
and were culturally blocked by OSI because the network continuity was no
longer win/win (sorry: smart/smart). This means that OPES can be networked
and provide an upper inter-application layer for fancy community services
such as anti-spam meta directory, classment, ACKs, etc. what might help a
quick dissemination.

IMHO we have to renovate the internet a little bit. Mail is the
leading application. This is an opportunity which does not shake the old
RFCs and permits addressing the real user demands.

In this regard IDNA is an extremely interesting contribution. It has shown
that we can really enhance the network services (real operations) by
encapsulating the current solution into an innovative layer. This is the key
to the network evolution. This permits the technology to grow and lets
different levels of evolution be maintained together. When you need IDNA
you have the IDNA layer; when it is built into the OS you do not need it
anymore; if you need the ML.ML, the OPES "out-plug" will take care of it
until the DNS servers support it...

jfc

For those not familiar with OPES, just think of them as a smart wall plug.
There is a "filter" (the dispatcher) where you put the rules you want. When
a rule is triggered the data are sent under OCP to one or several servers
to be massaged and returned. The process is transparent to the other end, or
not.

My proposed generalization as ONES initially created discussions about
whether or not they were part of the OPES. They are above. It means that
OPES servers (those managing the information) are networked and may
interact as a function of the information they work on, and may even reroute
the data among themselves. There is also another controversy about the
possible place of the dispatcher with respect to the application and the
socket (the IAB asked OPES to address the proxy-based applications).

The debate was also about the level of OCP, from above XML down to a
replacement of IP. The final approach is to keep this as neutral/adaptable
as possible - I must say that I partly skipped the finalization of OCP
through personal overload - and I am more interested in the global
architecture aspects/definition.

What is of interest in there is that OPES was initially supposed to address
HTTP flows, and that from the start everyone added SMTP. IMHO this permits
us to _investigate_ (I say no more) all the possible forms of RMTP (Rich Mail)
Dan Oscarsson
2003-08-13 12:48:34 UTC
Post by Roy Badami
Post by Dan Oscarsson
If IMAA does not want to say anything on the form of IMA, then it
should NOT require full width @ to be recognised. That is for
those defining how IMAs are to be used to decide.
But that requirement is symmetrical with IDNA's requirement to recognize
labels separated by full-width dots.
I have never understood it. As I now understand that IDNA does not define
how IDNs should be encoded, there is no reason to require anything
about full-width dots. That will be up to protocol implementors that create
protocols with IDN-aware slots.

Dan
Dan Oscarsson
2003-08-14 11:48:41 UTC
Post by Adam M. Costello
That's an argument for normalizing UCS text before displaying it. That
doesn't imply that UCS text should be normalized any earlier than
that. Deferring normalization until it's really necessary would allow
applications with non-buggy display systems (and applications that don't
display the address at all) to opt out of the normalization and save
that cost.
Well, why not allow UTF-8 encoding of ASCII using overlong sequences?
That is no problem at all, you just normalise when you need it.
From what I have heard, the reasons for me wanting normalised UCS are the
same as the reasons for people wanting normalised UTF-8.

Dan
Adam M. Costello
2003-08-14 15:23:40 UTC
Post by Dan Oscarsson
Well, why not allow UTF-8 encoding of ASCII using overlong sequences?
That is no problem at all, you just normalise when you need it.
From what I have heard, the reasons for me wanting normalised UCS are
the same as the reasons for people wanting normalised UTF-8.
It took me a few minutes to recognize the issue you are referring to.
In case anyone else is still wondering: The basic mechanism of UTF-8
provides several encodings of each code point, where each encoding has
a different length, but only the shortest encoding is allowed, and all
others are forbidden.
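
For example, '/' (U+002F) is the single byte 0x2F in shortest form, but
the two-byte sequence 0xC0 0xAF would decode to the same code point;
conformant decoders reject the overlong form, as a quick Python check
shows (illustrative):

    # Shortest-form UTF-8 for '/' is one byte:
    assert b"\x2f".decode("utf-8") == "/"

    # The overlong two-byte form is forbidden and rejected:
    try:
        b"\xc0\xaf".decode("utf-8")
    except UnicodeDecodeError:
        print("overlong encoding rejected")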

The question is why forbid the longer encodings? Why not let receivers
of UTF-8 strings shorten the encodings if they need to? The answer lies
in the original motivation for UTF-8, as indicated in RFC 2279:

US-ASCII values do not appear otherwise in a UTF-8 encoded character
stream. This provides compatibility with file systems or other
software (e.g. the printf() function in C libraries) that parse
based on US-ASCII values but are transparent to other values.

UTF-8 was originally a project of the X/Open Joint
Internationalization Group XOJIG with the objective to specify
a File System Safe UCS Transformation Format [FSS-UTF] that is
compatible with UNIX systems, supporting multilingual text in a
single encoding.

There was a lot of existing software based on the "extended ASCII"
model, in which byte values 0-127 represent ASCII characters, and values
128-255 represent locale-dependent characters that are treated as
opaque and compared exactly. The primary goal of UTF-8 was to encode
Unicode in a way that could be fed to this existing software. That
means UTF-8 was designed for use in UTF-8-unaware slots. Obviously the
existing software reading such slots would not know how to shorten the
encodings, so the UTF-8 strings needed to be in shortest form before
being put into the slots.
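
As a concrete check (my example), a strict UTF-8 decoder such as Python's
built-in codec enforces exactly this shortest-form rule:

    # The overlong two-byte encoding of '/' (0xC0 0xAF) is refused outright.
    print(b"\x2f".decode("utf-8"))      # '/'  -- shortest form accepted
    try:
        b"\xc0\xaf".decode("utf-8")
    except UnicodeDecodeError as err:
        print(err)                      # overlong form rejected: invalid start byte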

IDNA and IMAA take the same approach for IDN-unaware and IMA-unaware
slots. The existing software reading such slots doesn't know how to
convert between the ASCII and non-ASCII forms, so the IDNs and IMAs need
to be in ASCII form before being put into such slots.

But for IDN-aware and IMA-aware slots, there is no need to use any
particular form, because the software reading the slots knows how to do
all the conversions.

One might ask why the UTF-8 shortest-encoding rule is not relaxed for
UTF-8-aware slots. It certainly could be, but then there would be two
forms of UTF-8, strict UTF-8 (for UTF-8-unaware slots) and loose UTF-8
(for UTF-8-aware slots), which is conceptually more complex, and offers
no advantage.

For IDNs and IMAs, on the other hand, there is a real advantage to
having two forms: ASCII (for IDN/IMA-unaware slots) and non-ASCII (for
IDN/IMA-aware slots). The ASCII form is compatible with old software,
while the non-ASCII form is much friendlier to humans.

AMC
Dan Oscarsson
2003-08-15 08:21:20 UTC
Permalink
Post by Adam M. Costello
There was a lot of existing software based on the "extended ASCII"
model, in which byte values 0-127 represent ASCII characters, and values
128-255 represent locale-dependent characters that are treated as
opaque and compared exactly. The primary goal of UTF-8 was to encode
Unicode in a way that could be fed to this existing software. That
means UTF-8 was designed for use in UTF-8-unaware slots. Obviously the
existing software reading such slots would not know how to shorten the
encodings, so the UTF-8 strings needed to be in shortest form before
being put into the slots.
But for IDN-aware and IMA-aware slots, there is no need to use any
particular form, because the software reading the slots knows how to do
all the conversions.
There is no software today that knows how to read those slots, as there
is no defined protocol for them.
But there is a lot of software handling UTF-8 and expecting it to
be in normalisation form NFC. That is the form Unix/Linux selected for
use with UTF-8.
I want to use IDN and IMA without breaking that software. I can see
no reason to extend my software with the complex and unnecessary code
to do normalisation when it is so easy to send it normalised between
systems.
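
A minimal illustration (my example, not from Dan's post) of the breakage
he is describing: two canonically equivalent strings compare unequal
unless somebody normalises first:

    import unicodedata

    nfc = "\u00e9"     # U+00E9 LATIN SMALL LETTER E WITH ACUTE (NFC form)
    nfd = "e\u0301"    # 'e' + U+0301 COMBINING ACUTE ACCENT (decomposed form)

    print(nfc == nfd)                                 # False: raw comparison fails
    print(unicodedata.normalize("NFC", nfd) == nfc)   # True: after normalising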
Post by Adam M. Costello
One might ask why the UTF-8 shortest-encoding rule is not relaxed for
UTF-8-aware slots. It certainly could be, but then there would be two
forms of UTF-8, strict UTF-8 (for UTF-8-unaware slots) and loose UTF-8
(for UTF-8-aware slots), which is conceptually more complex, and offers
no advantage.
That is also the reason that I want only ONE form of UCS data.
More than one form/encoding is conceptually more complex, and more
complex for software.

Dan
Adam M. Costello
2003-08-15 10:07:30 UTC
Permalink
Post by Dan Oscarsson
Post by Adam M. Costello
But for IDN-aware and IMA-aware slots, there is no need to use any
particular form, because the software reading the slots knows how to
do all the conversions.
There is no software today that knows how to read those slots, as there
is no defined protocol for them.
Very true. All IDN/IMA slots today are ASCII-only. Until new protocols
are introduced, the outcome of this debate has no consequence.
Post by Dan Oscarsson
But there is a lot of software handling UTF-8 and expecting it to be
in normalisation form NFC. I want to use IDN and IMA without breaking
that software. I can see no reason to extend my software with the complex
and unnecessary code to do normalisation when it is so easy to send it
normalised between systems.
That argument applies equally to all Unicode text, not just IDNs and
IMAs. For example, it applies equally well to message bodies and web
pages. The software you speak of sees IDNs and IMAs as generic text,
not as identifiers. [Why? Because what distinguishes identifiers from
generic text is that identifiers come with precise matching rules. But
matching IDNs and IMAs requires the use of Nameprep, which includes case
folding followed by normalization. You have described the software in
question as lacking the ability to do normalization, so it must not be
doing anything with the IDNs and IMAs beyond what it could do with any
generic text.]
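
A rough sketch of such "precise matching rules" (my approximation of
case folding followed by normalisation; this is not Nameprep itself,
which also prohibits and maps additional characters):

    import unicodedata

    def identifier_equal(a: str, b: str) -> bool:
        """Compare after case folding, then NFKC normalisation."""
        fold = lambda s: unicodedata.normalize("NFKC", s.casefold())
        return fold(a) == fold(b)

    # 'Å' (U+00C5) matches 'a' followed by U+030A COMBINING RING ABOVE:
    print(identifier_equal("\u00c5sa", "a\u030asa"))    # True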

Your paragraph quoted above argues that UTF-8 text should always use
NFC. If that's true, it would imply that IDNA and IMAA should recommend
NFC *if* they recommend UTF-8. But they don't recommend UTF-8. Why
should they? The whole point of using textual identifiers is so that
they can go wherever text can go, in whatever encoding is used for
text in that place. The choice of UTF-8 (versus some other Unicode
transformation format) is a low-level text-encoding issue independent of
IDNA/IMAA. If NFC goes hand-in-hand with UTF-8, then I would conclude
that the choice of NFC (versus NFD, or neither) is likewise a low-level
text-encoding issue independent of IDNA/IMAA.

Perhaps what you really want is a new charset, utf-8-c, which would
be just like utf-8 except that only strings in normalization form C
are valid. Everywhere that utf-8 is used today, you would like to see
utf-8-c used instead. Yes?

That might not be a bad idea.
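
A sketch of what a utf-8-c validity check might look like (the charset is
hypothetical and is_valid_utf8_c is an invented name; Python 3.8+ is
assumed for unicodedata.is_normalized):

    import unicodedata

    def is_valid_utf8_c(data: bytes) -> bool:
        """Accept only strict UTF-8 whose decoded text is already in NFC."""
        try:
            text = data.decode("utf-8")    # strict: overlong forms rejected
        except UnicodeDecodeError:
            return False
        return unicodedata.is_normalized("NFC", text)

    print(is_valid_utf8_c("\u00e9".encode("utf-8")))    # True  (already NFC)
    print(is_valid_utf8_c("e\u0301".encode("utf-8")))   # False (decomposed)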

But I think it's really a charset issue concerning Unicode text in
general, not an issue for any particular kinds of identifiers (like IDNs
or IMAs).

Even before utf-8-c is registered, a protocol could easily specify that
a slot contains "Unicode text in normalization form C encoded as UTF-8".
That could be done for any slot containing Unicode text, including slots
that happen to contain IDNs or IMAs. It's not a restriction on IDNs or
IMAs per se, it's a restriction on how Unicode text is represented in
that protocol; therefore it doesn't belong in the definitions of IDN and
IMA, it belongs in that protocol spec, or in a spec for utf-8-c. Maybe
what you really want to fight for is to change the BCP that currently
recommends utf-8 to recommend utf-8-c instead.

AMC
Paul Hoffman / IMC
2003-08-15 17:16:48 UTC
Permalink
Dan, I'm going to say the same thing to you here that I said in the
IDN WG: write an Internet Draft so people can see what you are
talking about. Just repeatedly saying on a mailing list "I want
UTF-8" is not sufficient for people to compare the positive and
negative aspects of your proposal. Until you write something that can
be compared to the IMAA document, this thread (a conversation between
you and Adam, really) is useless.

--Paul Hoffman, Director
--Internet Mail Consortium
J-F C. (Jefsey) Morfin
2003-08-16 11:11:28 UTC
Permalink
Paul, Adam, Dan,
may I just give a papyweb suggestion?

All this reminds me of the disputes at the beginning of the Minitel story
with different manufacturers (the character set was hardcoded). There was
no agreement, so the different equipments came with somewhat different
character/graphical support (much the same as IE, Netscape, Opera), which
did not satisfy us (Tymnet International), this being in addition to French
(one format and character set oriented). This wound up in a common
developer agreement to support what was in common and not to use what was
not compatible. And, for some big orderers, in dropped orders (we had
ordered 150,000 "Scansets" for US market tests and stopped buying at
40,000) to force the features everyone wanted.

After that we tried to understand whether we could have saved a lot of time
(probably 2 years and a lot of money) by not delaying web-like applications
by 8 or 10 years (we had most of the proposition ready, and the father of
hypertext with us). The review showed that everyone had well documented
their positions in front of the CNET (FT Labs), and that what we should
have done was to work on the specifications _together_.

Why not have Dan and Adam (and maybe you) try to work _together_ on a
possible alternative draft? So that there is no opposition, but an attempt
at a joint alternative proposition? I am not good enough at what Dan says
to judge, but I feel he sounds right, and Adam also seems to have moved in
a lot of good directions. Everyone would be very glad if you could come up
with a joint draft (that is, as long as it supports case-sensitiveness at
least as an option).

I am sure it would also help a lot with any plans towards an RHS-consistent
evolution, since what everyone wants is a unique, consistent vernacular
system we can use everywhere, to address and sort in databases and
dictionaries, taking case into consideration or not at will.

jfc

PS. vernacular:

- standard native language of a country or locality.
- everyday language spoken by a people as distinguished from the literary
language.
- variety of such everyday language specific to a social group or region:
the vernaculars of New York City.
- idiom of a particular trade or profession: in the legal vernacular.
- idiomatic word, phrase, or expression.
- common, nonscientific name of a plant or animal.

I accept it in its scripting approach, which is indeed sometimes very
innovative too.