Leaving the legacy world

Dan Oscarsson

2003-03-02 11:12:34 UTC

Reading through the massive amount of messages on the IMAA list
I start getting the same type of hopelessness I got while being
part of the IDN list.

Is it not time to start leaving the legacy world behind us?

The number of messages is so large that I cannot comment on them
individually. Below I give my view of things. It covers several
different topics, so you may split comments into separate threads.

To avoid unclear semantics with words like internationalisation, ASCII
and non-ASCII, I will use "legacy" to mean systems/programs
that see the world in ASCII.

Standard form of names
-------------------------------------
In the world today names are used a lot. Mail addresses, domain names,
host names, URIs and URLs are all names.
A name is composed of characters, any UCS character.
A legacy name is composed of ASCII characters.

Users do not see names as ACE, UTF-8, %-encoding or other encoded forms.
For them a name is a sequence of characters. And by characters they mean
any character from UCS.

The Unicode/ISO 10646 groups made one very bad design decision: they
allowed more than one representation of a single character.
For example there are "full width" forms, which do not belong in a
character set because width is a display feature.
NFKC looks like a partial attempt to fix that, but fails because it
also removes semantically different characters.

I have, for a long time, studied ways to have a standard form
for names (and text). I have read and looked at discussions and code
about normalised/unnormalised text, decomposed/precomposed,
form NFC/NFKC and others. I have looked at UTF-8, SCSU, ISO 8859-1,
UCS-2, UCS-4, UTF-16 and many others. I have looked at the impact on
code. Some of my conclusions are:
- UTF-8 is among the better choices for interoperability because it has
a simple format and no endian problems.
- UTF-8 is not good for character handling. It should not be used
inside programs handling characters. UCS-1, UCS-2 or UCS-4 is
much better (due to less complex handling and less CPU usage).
- UTF-8 is not good in all protocols. It would be fine in the
e-mail protocol but not in the DNS protocol due to space constraints.
In DNS, SCSU would be fine.
- Unnormalised text is not good as it allows multiple forms of a
character.
- Decomposed text is not good as it takes a lot of space, does not
match legacy character set handling and breaks semantics of some
characters.
- NFC preserves all data in the text but allows multiple forms.
- NFKC does not preserve all data and changes semantics of some
characters.
- A lot of people are saying UTF-8 but do not define what they encode
using UTF-8. To interoperate you have to state the encoding, the
character set and the form used.
- The best standard form is therefore text normalised using NFC, with
all equivalent characters replaced by one character code.
This means that Kelvin sign -> K
Ligature ij -> characters i and j
Full width a -> a

But superscript 2 is not replaced by 2.

Transmission of a name in a protocol should use the above form
encoded with UTF-8.
This will give us a single simple form that is easy to handle while
preserving all important data.

At the protocol level you never need to know that some users have a
"full width" @, because at the protocol level there is only one
form of @.
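
The standard form described above could be sketched roughly as follows
in Python. This is only an illustration under stated assumptions: the
set of decomposition tags that are folded (FOLDED_TAGS), the small
table for the ij ligature (EXTRA_FOLDS) and the function names are
made up for this example. The exact set of "equivalent" characters
would have to be defined in a specification.

```python
import unicodedata

# Assumption: only width variants are folded, using the tag that the
# Unicode Character Database puts on their decomposition mapping.
FOLDED_TAGS = {"<wide>", "<narrow>"}

# Assumption: a hand-picked table for other equivalent characters.
EXTRA_FOLDS = {"\u0132": "IJ", "\u0133": "ij"}  # ligature IJ / ij


def standard_form(name: str) -> str:
    """Return the name in NFC with selected equivalent characters merged."""
    folded = []
    for ch in name:
        if ch in EXTRA_FOLDS:
            folded.append(EXTRA_FOLDS[ch])
            continue
        # decomposition() returns e.g. "<wide> 0061", "<super> 0032" or "".
        parts = unicodedata.decomposition(ch).split()
        if parts and parts[0] in FOLDED_TAGS:
            folded.append("".join(chr(int(cp, 16)) for cp in parts[1:]))
        else:
            # Superscripts and other tagged characters are left alone.
            folded.append(ch)
    # NFC composes combining sequences and also maps canonical
    # singletons such as KELVIN SIGN to their plain equivalent.
    return unicodedata.normalize("NFC", "".join(folded))


def wire_form(name: str) -> bytes:
    """Encode the standard form with UTF-8 for use in a protocol."""
    return standard_form(name).encode("utf-8")
```

With these assumptions a full width @ (U+FF20) becomes the ordinary @
before the name ever reaches the protocol, while superscript 2 stays
as it is.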

Case sensitivity/case insensitivity
---------------------------------------
I see no reason not to have case insensitivity in name matching. This
includes all parts of an e-mail address, URLs, domain names and file names.
That is what most users expect.

I have studied different forms of matching and think the following is
best on a global level:
- All single character to single character case insensitive matching
defined by Unicode is used.
- SC/TC (Simplified/Traditional Chinese) matching.
- No single to multiple character matching (like s-sharp to ss).
I have seen so many examples of how this results in a failure
and it also makes code much more complex.
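
A rough Python sketch of the matching rules listed above. It is an
approximation, not a definition: Python has no direct "simple case
folding" function, so the sketch uses the full case fold only when it
yields a single character, and the SC/TC table (TC_TO_SC) holds one
illustrative pair where a real implementation would need a complete
mapping. All names are made up for this example.

```python
# Assumption: a tiny illustrative Traditional -> Simplified table.
TC_TO_SC = {"\u570b": "\u56fd"}


def fold_char(ch: str) -> str:
    """Fold one character to one character; never expand to several."""
    folded = ch.casefold()
    if len(folded) == 1:
        return TC_TO_SC.get(folded, folded)
    # The full fold expands (e.g. s-sharp -> "ss"), so fall back to the
    # lowercase mapping when it is a single character, else keep ch.
    lowered = ch.lower()
    return lowered if len(lowered) == 1 else ch


def names_match(a: str, b: str) -> bool:
    """Compare two names character by character after folding."""
    return [fold_char(c) for c in a] == [fold_char(c) for c in b]
```

Here names_match("STRASSE", "strasse") is true, but s-sharp never
matches "ss", so the two names are always compared at the same length.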

The matching/mangling of names used in IDNA is unacceptable.

Keep to protocol level
------------------------
In the discussions both user interface and protocol level are discussed.
Could we try to leave user interface matters until later on? Or at least
separate the threads.

For example it is very important in a protocol to have ONE well
defined form of protocol elements.
A mail address is: local-***@domain
In a user interface the @-sign could be displayed using bold, wide,
narrow, green or other display feature.
In a protocol only one form of @ should be allowed.

While a MUA may recognize a green or extra wide @ sign as the @-sign,
the MTA should only recognize the standard @-sign.
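
The split between the two layers could look roughly like this in
Python. The table of display variants (AT_VARIANTS) and both function
names are assumptions made for this sketch. The point is only that the
user interface does the mapping and the protocol side accepts nothing
but the standard @-sign.

```python
from typing import Tuple

# Assumption: display variants a user interface might choose to accept.
AT_VARIANTS = {
    "\uff20": "@",  # FULLWIDTH COMMERCIAL AT
    "\ufe6b": "@",  # SMALL COMMERCIAL AT
}


def mua_to_protocol(typed: str) -> str:
    """User interface side: map display variants to the standard @."""
    return "".join(AT_VARIANTS.get(ch, ch) for ch in typed)


def mta_split(address: str) -> Tuple[str, str]:
    """Protocol side: split on the standard @-sign only."""
    local, at, domain = address.rpartition("@")
    if not at or not local or not domain:
        raise ValueError("no standard @-sign between local part and domain")
    return local, domain
```

An address typed with a full width @ is rejected by mta_split() unless
it has first passed through mua_to_protocol().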

E-mail supporting full UCS
-----------------------------
While we have to interoperate with legacy e-mail, I think it is high
time to take a step forward and use the full UCS.

To avoid unneeded mess, we design it so that a system handling
non-legacy mailboxes is not a legacy system.

Let us take a step forward with SMTP:
Add an ESMTP extension that switches to UCS mode.
Like this:
EHLO mail.xxx.com
250 UCS
Meaning the server supports UCS

MAIL FROM: <***@yy.com> UCS
Meaning that the transmitting client switches to UCS mode.
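
On the client side the exchange above could be handled roughly as in
this Python sketch. The "UCS" keyword is the one proposed in this
message, not a registered extension, and the function names are made
up for the example. No network code is included; the functions only
inspect reply lines and build the command.

```python
from typing import Iterable


def server_supports_ucs(ehlo_reply: Iterable[str]) -> bool:
    """Look for the proposed UCS keyword in the lines of an EHLO reply."""
    lines = list(ehlo_reply)
    # The first line carries the server greeting, not an extension.
    for line in lines[1:]:
        fields = line[4:].split()  # skip "250-" or "250 "
        if fields and fields[0].upper() == "UCS":
            return True
    return False


def mail_from(address: str, ucs: bool) -> str:
    """Build MAIL FROM, adding the UCS parameter to switch modes."""
    command = f"MAIL FROM:<{address}>"
    return command + " UCS" if ucs else command
```

A client would call server_supports_ucs() on the reply to EHLO and add
the UCS parameter only when it returns true. Otherwise it has to
downgrade the message.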

In UCS mode of SMTP the following is used:
- All addresses in the protocol use the standard UCS form (this includes
MAIL FROM, RCPT TO, ...)
- All headers are in the standard UCS form.
(they may not contain MIME-encoded headers).
- Default text body parts are in UCS/NFC/UTF-8.
- Other text body format encodings are not recommended and not
required to be supported.

The standard UCS form used in the protocol is:
UCS/NFC with multiple definitions removed/UTF-8.

Matching of the "local part" should (or must) be done case insensitively but
case should be preserved in the protocol.

If the server or client does not support UCS, the e-mail will be
downgraded or upgraded, depending on direction. Downgrading is done by
converting headers to MIME encoding. E-mail addresses have the local part
encoded, using a form-preserving encoding, into an opaque ASCII part, and
the domain name encoded using IDNA (which will destroy data in the domain
name). Upgrading does the reverse.
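
A minimal Python sketch of downgrading and upgrading, under some loud
assumptions. The encoding of the local part (a hypothetical "ucs--"
prefix followed by the UTF-8 bytes in hex) is invented here only to
show a form-preserving, reversible ASCII encoding. No such encoding is
defined in this message. The headers use the standard library's MIME
encoded-word support and the domain uses Python's built-in "idna"
codec.

```python
from email.header import Header, decode_header, make_header

LOCAL_PREFIX = "ucs--"  # hypothetical marker, not part of any standard


def downgrade_header(value: str) -> str:
    """Convert a UCS header value to a MIME encoded-word."""
    return Header(value, "utf-8").encode()


def upgrade_header(value: str) -> str:
    """Decode MIME encoded-words back to plain text."""
    return str(make_header(decode_header(value)))


def downgrade_local(local: str) -> str:
    """Encode a non-ASCII local part as opaque ASCII (hypothetical)."""
    if local.isascii():
        return local
    return LOCAL_PREFIX + local.encode("utf-8").hex()


def upgrade_local(local: str) -> str:
    """Reverse downgrade_local(); case and form are preserved."""
    if local.startswith(LOCAL_PREFIX):
        return bytes.fromhex(local[len(LOCAL_PREFIX):]).decode("utf-8")
    return local


def downgrade_domain(domain: str) -> str:
    """Encode the domain with IDNA; this step is lossy."""
    return domain.encode("idna").decode("ascii")


def upgrade_domain(domain: str) -> str:
    """Decode an IDNA domain; the original case cannot be recovered."""
    return domain.encode("ascii").decode("idna")
```

The local part survives a round trip unchanged, while a domain such as
"Bücher.example" comes back as "bücher.example", which is the loss of
data mentioned above.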

Summary
------------
- Use a standard form for names:
UCS/NFC with multiple definitions removed/UTF-8.
- Case insensitive matching of names using single character equivalence
plus SC/TC matching.
- Update the SMTP protocol to use UCS, with ASCII as the legacy downgrade.

That is all. I have probably forgotten a lot of things.
I can write drafts for the update to SMTP and the standard form for
text/names, alone or with someone. But I will only do so if I feel
that there is a real will from many people that it is time
to go beyond the legacy ASCII world.
I have not yet written one for DNS, even though I know most of what
needs to be in it, due to lack of time and a lack of feeling that people
really want to leave legacy DNS.

Regards,

Dan