Standard mail address

Discussion:

Dan Oscarsson

2003-04-23 06:53:07 UTC

Reading through the IMAA draft I get the same mess of protocol and UI
that I got in IDNA. Also the focus is very much on legacy handling.

I would like to start with an international focus.

I can see three basic areas:
- legacy protocol context (current ASCII context)
- international protocol context
- user interface

I will not say anything more about the user interface. The important area to
start with is the international context, at the protocol level.

When I look at mail addresses, I start by looking at what is needed
to make mail addresses as easy to use in an international context as
in the legacy context. When that is done, I can start looking at how
to encode mail addresses from the international context so that they can
be sent through the legacy context.

In an international context we have: Standard mail address
in the legacy context: Legacy mail address

The Standard mail address must be as easy to use as a legacy mail address:
- local-part@domain
- UCS
- precomposed required if available for a character
- unambiguous code points
- simple case-insensitive matching

The above does not mean NFKC. It means, rather, NFC with every character
that has multiple code points in UCS replaced by a single one (or with the
alternative forms forbidden). It also means that case-insensitive matching
is done by simple one-to-one character case mapping, probably including
the SC/TC matching.
NFKC cannot be used as it destroys data.
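
A minimal Python sketch of the distinction drawn above, using the standard
unicodedata module. The sample characters are illustrative assumptions and
are not taken from the IMAA draft. NFC only composes, while NFKC also folds
compatibility characters that arguably are not "the same character":

```python
import unicodedata

def nfc(text: str) -> str:
    """Canonical composition only (precomposed form where available)."""
    return unicodedata.normalize("NFC", text)

def nfkc(text: str) -> str:
    """Compatibility decomposition followed by canonical composition."""
    return unicodedata.normalize("NFKC", text)

# Illustrative samples (assumptions, not from the draft).
samples = {
    "decomposed e-acute": "e\u0301",  # 'e' + combining acute accent
    "fullwidth A": "\uff21",
    "superscript two": "\u00b2",
    "fi ligature": "\ufb01",
}

for name, text in samples.items():
    # ascii() makes the code points visible regardless of terminal encoding.
    print(f"{name:20} NFC={ascii(nfc(text)):12} NFKC={ascii(nfkc(text))}")
```

Both forms compose the decomposed e-acute to U+00E9. Only NFKC rewrites the
fullwidth A to "A", the superscript two to "2" and the ligature to "fi", and
the original code points cannot be recovered afterwards.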

The above means that there is only the ASCII @ code point separating
local-part from domain part. And no full width dots separating domain
labels. Only ONE code point per character. Use of full width (or green colour)
is a user interface matter and does not belong in a protocol context.

With the above requirements, mail addresses are as easy to handle in the
international context as in the legacy context.

When you have to change between international and legacy context you have
to encode non-ASCII into ASCII or the other way around. An ACE needs to
be used. It need not be Punycode, which is complex. It could be SCSU with
hex encoding, which also gives a fairly compact encoding. For the
domain part IDNA will have to be used even if it cannot encode all
domain names.
During encoding into ASCII a Standard mail address must not be changed,
whether by lowercasing or in any other way, so that all semantics are
preserved.

Two mail addresses are equal if they are equal in Standard mail address
form.

Using the above form there are no problems in having both case-sensitive
and case-insensitive mail addresses.

I hope this can get us to start at the needs of the international
user instead of the needs of the legacy protocols and applications.

Dan

NOTE: I do not use the terms "internationalised mail address" or
"traditional mail address" used in the IMAA draft.
The traditional mail address could be "local-part@domain" and not restricted
to ASCII. "Internationalised" is usually applied to applications and means
"made able to handle international characters". You could call
the IDNA form of domain names internationalised legacy domain names.
But the mail address, domain name or URL are all in international format
from the beginning. They cannot be internationalised because they already
are.

Adam M. Costello

2003-04-23 08:25:21 UTC

Post by Dan Oscarsson
Reading through the IMAA draft I get the same mess of protocol and UI
that I got in IDNA.

Not surprising, since IMAA is strongly patterned after IDNA.

Post by Dan Oscarsson
Also the focus is very much on legacy handling.

Yes, same as in IDNA, the goal is to allow non-ASCII mail addresses
to be used in end-user applications without needing to upgrade any
infrastructure.

Post by Dan Oscarsson
- legacy protocol context (current ASCII context)
- international protocol context
- user interface
I will not say anything more about the user interface. The important area
to start with is the international context, at the protocol level.

Internationalization (support for non-ASCII characters) is for the
benefit of humans. Humans benefit from it when they can see the
non-ASCII characters in the user interfaces, regardless of what's on the
wire or what's being passed through library interfaces.

BCP-18 says:

Internationalization is for humans. This means that protocols
are not subject to internationalization; text strings are. Where
protocol elements look like text tokens, such as in many IETF
application layer protocols, protocols MUST specify which parts are
protocol and which are text.

Names are a problem, because people feel strongly about them, many
of them are mostly for local usage, and all of them tend to leak
out of the local context at times. RFC 1958 [RFC 1958] recommends
US-ASCII for all globally visible names.

Mail addresses are both protocol elements and text strings. When
they're on the wire, they're primarily protocol elements, and the
software looking at them doesn't care whether they're pretty or ugly.
When they're being displayed to users, they're playing the role of text
strings, and then it matters whether they're pretty or ugly.

The very first place that IDNs will appear is in user interfaces. It
may be a while before non-ASCII IDNs appear in any protocols, because no
existing protocols understand non-ASCII IDNs, and there's no pressing
need to create new protocols that do understand them. IDNA didn't spend
any effort defining how to use non-ASCII IDNs in protocols because
there's no pressing need for that.

The situation will be the same for IMAs.

People who want to design new protocols that use non-ASCII mail
addresses can do that, but there's no need to delay IMAA while waiting
for that, just as there was no need to delay IDNA.

Post by Dan Oscarsson
The above does not mean NFKC. It means, rather, NFC with every character
that has multiple code points in UCS replaced by a single one (or with
the alternative forms forbidden).

But that's what NFKC is for. It's like NFC, except that characters
having multiple code points (like A and fullwidth A) get replaced by
one.

Post by Dan Oscarsson
It also means that case-insensitive matching is done by simple one-to-one
character case mapping,

The Unicode Consortium has a standard for doing case-insensitive
comparisons. Who are we to "fix" it? The decision was made for IDNA,
and I don't see why we should do it differently for IMAA.

Post by Dan Oscarsson
probably including the SC/TC matching.

IDNA has no provisions for SC/TC matching in the domain part, so there's
no point in IMAA providing for SC/TC matching in the local part. People
are working on server-side solutions for domain names (using aliases,
for example), and the same people can work on server-side solutions for
the local part (again, using aliases, for example).

Post by Dan Oscarsson
An ACE needs to be used. It need not be Punycode, which is complex. It
could be SCSU with hex encoding, which also gives a fairly compact
encoding.

Since every mail address includes a domain name, and the domain name
will already be using Punycode, the simplest way to encode the local
part is to reuse Punycode. Adding a second encoding, even hex, would be
more complex, not less.

Post by Dan Oscarsson
During encoding into ASCII a Standard mail address must not be changed,
whether by lowercasing or in any other way, so that all semantics are
preserved.
Using the above form there are no problems in having both
case-sensitive and case-insensitive mail addresses.

Yes, there are. Consider the local parts josé and JOSÉ. You say they
are not case-folded before being converted to ASCII. That means
they map to two distinct ACEs, like iesg--jos-dma and iesg--JOS-pia.
Any mail server that now exists will treat those as unrelated local
parts belonging to two distinct mailboxes. For example, one person
might create iesg--jos-***@yahoo.com, and another person might create
IESG--JOS-***@YAHOO.COM. Now josé@yahoo.com and JOSÉ@YAHOO.COM are two
different people. Who wants that?
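
The two ACEs above can be reproduced with a short Python sketch. It relies
on the standard library's "punycode" codec (RFC 3492). The helper name
to_ace is hypothetical, and the iesg-- prefix is simply the placeholder used
in this message, not a registered prefix:

```python
def to_ace(local_part: str, prefix: str = "iesg--") -> str:
    """Punycode-encode a local part and prepend a placeholder ACE prefix.

    Hypothetical helper for illustration; no normalisation or case folding
    is applied, which is exactly the situation being discussed.
    """
    return prefix + local_part.encode("punycode").decode("ascii")

lower = "jos\u00e9"  # josé
upper = "JOS\u00c9"  # JOSÉ

print(to_ace(lower))  # iesg--jos-dma
print(to_ace(upper))  # iesg--JOS-pia

# Folding case before encoding makes both spellings map to one ACE.
print(to_ace(upper.casefold()) == to_ace(lower))  # True
```

Without folding, the two outputs differ in more than letter case (dma versus
pia), so a legacy server that compares ASCII local parts, even
case-insensitively, sees two unrelated mailboxes.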

AMC

Dan Oscarsson

2003-04-23 09:40:28 UTC

Post by Adam M. Costello
Mail addresses are both protocol elements and text strings. When
they're on the wire, they're primarily protocol elements, and the
software looking at them doesn't care whether they're pretty or ugly.
When they're being displayed to users, they're playing the role of text
strings, and then it matters whether they're pretty or ugly.

Software cares a lot about whether they can be handled easily. And it is
we humans who create the software. Internationalisation is very important
for software, not just for humans.

Post by Adam M. Costello
The very first place that IDNs will appear is in user interfaces.

What is an IDN? An ACE version of a domain name?
A domain name does not need to be internationalised as it is just a
sequence of characters (any characters), but software needs to be
internationalised if it cannot handle non-ASCII.

Post by Adam M. Costello
It may be a while before non-ASCII IDNs appear in any protocols, because
no existing protocols understand non-ASCII IDNs, and there's no pressing
need to create new protocols that do understand them. IDNA didn't spend
any effort defining how to use non-ASCII IDNs in protocols because
there's no pressing need for that.

There is a pressing need to be able to handle normal domain
names (and URLs, and e-mail addresses) directly in protocols and
software. Encoding them into ASCII and sending them over legacy
protocol contexts is NOT acceptable, except as a transition mechanism.

Post by Adam M. Costello

Post by Dan Oscarsson
The above does not mean NFKC. It means, rather, NFC with every character
that has multiple code points in UCS replaced by a single one (or with
the alternative forms forbidden).

But that's what NFKC is for. It's like NFC, except that characters
having multiple code points (like A and fullwidth A) get replaced by
one.

Not at all. NFKC changes many code points that do not represent
the same character into one. Only code points that represent the
same character must be replaced by one code point.

Post by Adam M. Costello

Post by Dan Oscarsson
It also means that case-insensitive matching is done by simple one-to-one
character case mapping,

The Unicode Consortium has a standard for doing case-insensitive
comparisons. Who are we to "fix" it? The decision was made for IDNA,
and I don't see why we should do it differently for IMAA.

No need to fix it. Unicode includes all upper/lower case mappings in the
standard character database. Those are the ones you should use.
You should not use the additional special mappings, which include
one-to-many character mappings. Those are defined by Unicode in a separate
file that can be used in addition to the basic character database.
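
A rough Python sketch of the difference between one-to-one and one-to-many
case mappings. Python's str.casefold() applies Unicode full case folding,
and as far as I know the standard library does not expose the simple
one-to-one mappings directly, so simple_lower below is a hypothetical
approximation: it lowercases a character only when the result is a single
character.

```python
def simple_lower(text: str) -> str:
    """Approximate one-to-one lowercasing (an assumption, not Unicode's
    own simple-mapping table): keep a character unchanged whenever its
    lowercase form would expand to more than one character."""
    out = []
    for ch in text:
        low = ch.lower()
        out.append(low if len(low) == 1 else ch)
    return "".join(out)

sharp = "Stra\u00dfe"  # Straße, with LATIN SMALL LETTER SHARP S
plain = "STRASSE"

# Full case folding maps sharp s to "ss": the strings match, but the
# folded string is longer than the original.
print(sharp.casefold() == plain.casefold(), len(sharp), len(sharp.casefold()))

# One-to-one mapping keeps sharp s as it is, so the strings stay distinct
# and the length never changes.
print(simple_lower(sharp) == simple_lower(plain), len(simple_lower(sharp)))
```

Whether the two spellings should match is the policy question under
discussion; the sketch only shows that the two kinds of mapping give
different answers.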

Post by Adam M. Costello

Post by Dan Oscarsson
probably including the SC/TC matching.

Just because IDNA ignored this does not mean that it should not be here.
Matching of domain names should also include SC/TC.
IDNA only defines a way to encode some, not all, domain names into ASCII.

Post by Adam M. Costello

If you want to use non-ASCII mailboxes you should fix your software to
handle non-ASCII. Only the server holding the mailboxes needs to be fixed.
We should not create limits because somebody does not want to fix their
software. It is not needed in the mail area. If you want to support
mailboxes with non-ASCII on your server, fix your software first.

Any comparison of e-mail addresses for equality must be done using the
Standard mail address form. This removes all difficulties with
case differences or encoded forms.
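
One way to picture such a comparison is the Python sketch below. It is only
an illustration under stated assumptions: the function names are
hypothetical, the iesg-- prefix is the placeholder from earlier in the
thread, the local part is assumed to be Punycode-encoded after that prefix,
decoding of the domain part is left out, and str.lower() stands in for
whatever case mapping is finally chosen.

```python
import unicodedata

ACE_PREFIX = "iesg--"  # placeholder prefix from the thread (assumption)

def to_standard(address: str) -> str:
    """Decode an ACE local part, if present, and normalise to NFC."""
    local, _, domain = address.rpartition("@")
    if local.lower().startswith(ACE_PREFIX):
        # The stdlib "punycode" codec preserves the case of ASCII letters.
        local = local[len(ACE_PREFIX):].encode("ascii").decode("punycode")
    return unicodedata.normalize("NFC", local) + "@" + domain

def same_address(a: str, b: str, case_insensitive: bool = False) -> bool:
    """Compare two addresses in their decoded, normalised form."""
    sa, sb = to_standard(a), to_standard(b)
    if case_insensitive:
        sa, sb = sa.lower(), sb.lower()
    return sa == sb

# Encoded and plain forms of the same address compare equal.
print(same_address("iesg--jos-dma@example.com", "jos\u00e9@example.com"))
# Case matters unless the mailbox is declared case-insensitive.
print(same_address("JOS\u00c9@example.com", "jos\u00e9@example.com"))
print(same_address("JOS\u00c9@example.com", "jos\u00e9@example.com",
                   case_insensitive=True))
```

Because the comparison happens after decoding, the same function serves
both case-sensitive and case-insensitive mailboxes, and the encoded form
never has to be folded.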

We cannot impose a lot of restrictions on mail addresses that look stupid
to people just because of legacy software. The needs of the
international community must come first.

Dan