Discussion: draft-klensin-emailaddr-i18n-00
Adam M. Costello
2003-10-11 21:16:58 UTC
I have now read draft-klensin-emailaddr-i18n-00. For convenience,
here's a link:

http://www.ietf.org/internet-drafts/draft-klensin-emailaddr-i18n-00.txt

I'm glad that John took the time to write this. Whatever we end
up with, we'll have more confidence in it knowing that specific
alternatives were explored.

Here are my initial reactions.

The solution proposed in the draft presents itself as an MTA-level
extension to SMTP (RFC 2821). But it implicitly assumes a corresponding
extension to the message header format (RFC 2822). The messages carried
by ESMTP+I18N would not conform to RFC 2822, and existing MUAs could not
be expected to be able to reply to them. I18N support is negotiated at
the MTA level, so that messages bounce if the recipient MTA does not
support the extension, but there is no negotiation at the MUA level;
the message will be successfully delivered, but replying might be
impossible. Deployment of this approach would be critically dependent
on changes in both MTAs and MUAs. Deployment of the IMAA approach could
proceed without changes to MTAs, and would be less critically dependent
on changes in MUAs, because old MUAs could still reply to messages
containing ACE forms, and users could still copy ACE forms between old
MUAs.

One case not addressed in the draft is a mailing list server that
supports the I18N extension. Suppose some of the subscribers' MTAs
support it, and some don't. When the list server receives a message
that depends on the extension (in the From or Cc field, for example),
what should happen?
2.2.1 Obtaining an Internationalized Email Address
In general, users cannot create email accounts, or aliases controlling
delivery of messages from external systems.
Yahoo Mail and similar services let you create a mailbox and choose
its name. With IMAA, you could simply give it an ACE local part and
start taking advantage of IMAA without waiting for Yahoo to do anything.
Similarly, domain name registrars often provide mail hosting, and the
registrant could use an ACE local part without any IMA-awareness at the
hosting server.
2.3.1 MX diversion
If the domain part of an email address is associated with several
MX records and the mail is delivered to one of them that is not the
best preference host, the receiving host is not required to use
SMTP. If, instead, it performs some gateway function, it may need to
inspect or alter the local part to determine how to route and deliver
the message. If the local part were encoded in some fashion that
prevented that inspection process, and the MTA was not aware that it
needed to apply special techniques, mail delivery might well fail.
I don't think this problem is specific to non-primary mail exchangers,
or SMTP-to-other gateways. The general problem is that the local
part structuring conventions used in a domain (by primary exchangers,
secondary exchangers, or gateways) could be incompatible with the IMAA
encoding. For example, if the local parts in this domain use the letter
"x" as a delimiter, it won't be possible to use ACE local parts where
"x" appears as a result of the encoding, unless you upgrade the mail
exchangers to be IMA-aware.

Therefore, although IMAA is designed to allow the creation of
internationalized local parts without upgrades of MTAs, there could be
a few domains in which users will in fact have to wait for an upgrade
of their MTA before creating internationalized local parts. This
inconvenience can arise only in domains that use letters or digits or
positions as delimiters.

The solution proposed in the draft would make users in all domains
(rather than just a few domains) wait for upgrades of their MTA before
they could create internationalized local parts.
2.4 Encoding the Whole Address String
2. Imposing a requirement that MTAs "understand" local-parts so that
they can be partially decoded as part of mail routing would seem to
defeat the main goal of encoding internationalized strings into a
compact ASCII-compatible form, i.e., to keep MTAs from needing to
understand the extended naming system
IMAA does not expect sending/relaying MTAs to understand local parts
at all. IMAA encodes the whole local part at once, with no knowledge
of any local part structuring conventions; the encoding just happens
to have the convenient property that non-alphanumeric ASCII characters
are neither deleted nor inserted nor reordered. Therefore, in any
domain whose existing local part structuring conventions use only
non-alphanumeric delimiters, the encoding will not interfere, and
users can immediately create ACE local parts without any upgrade to
the MTA. Users who know the structuring conventions will be able to
parse/construct/manipulate non-ASCII local parts, while software will
parse/construct/manipulate ASCII local parts, and the two views will be
consistent.
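
To make that pass-through property concrete, here is a minimal sketch
in Python (my own illustration, not text from either draft), treating
every non-alphanumeric ASCII character as protected, per the
definition discussed later in this thread:

    import re

    # Protected characters: every ASCII character that is not a letter
    # or digit.  Splitting on a capturing group keeps the protected
    # runs in the result, untouched, so they pass through verbatim.
    PROTECTED = r'([\x00-\x2f\x3a-\x40\x5b-\x60\x7b-\x7f]+)'

    def split_segments(local_part):
        return re.split(PROTECTED, local_part)

    print(split_segments("josé+música"))   # ['josé', '+', 'música']

Encoding each unprotected segment separately, and copying the protected
runs through unchanged, is what keeps a delimiter like "+" meaning the
same thing before and after encoding.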

Domains with local part structuring conventions that use letters,
digits, or positions as delimiters will have to upgrade their MTA before
internationalized local parts can be safely created in that domain, so
that both the software and the users will parse/construct/manipulate
non-ASCII local parts.

Perhaps that caveat deserves mention in the IMAA draft.

AMC
John C Klensin
2003-10-13 01:33:56 UTC
Adam,

Thanks for the careful reading. You raise several points I
should address in the draft. I will try to do so, and get
another version out, before the posting deadline, but I have a
lot on my plate right now.

Your notes identify one additional issue you should probably
address explicitly: The notion of not encoding ASCII delimiters
in order to permit special addressing arrangements to go through
not only doesn't help with position-based and ASCII-letter-based
address splitting, it doesn't permit one to have an address that
is entirely in some non-ASCII script and still use any of these
special address treatments. I.e., a delimiter-based approach
that uses a non-ASCII delimiter would encounter the same problem
as a string that used an ASCII letter as a delimiter rather than
a conventional ASCII delimiter.

And I am certainly assuming some changes in MUAs: without an
internationalized MUA, simply having internationalized addresses
makes little sense to me in the general case (although I can
come up with edge cases in which that would not be true, as, I'm
sure, can you).

However, as I am sure you understand but others may not, the
primary tradeoff here is really at the level of a fundamental
architectural and strategic issue, not about fine-tuning either
approach. One way of stating it (you may have others, and they
may be better) is that we are looking at a tradeoff between

* A model that optimizes deployment that is as rapid as
possible, especially for individuals who wish to use the
facilities but are operating in areas of the network
that are indifferent to them and

* A model that tries to move toward the best
internationalized use of Email that we can get, even if
it sacrifices short-term deployment in some situations.

There are sacrifices and tradeoffs either way. Your model has,
in my opinion, poor consequences for address presentation and
imposes some constraints on email local-parts I don't believe we
should have to live with in the long term. Mine does require
MTA changes (as well as the MUA ones that I think are
inevitable), but the MUA changes are going to be very natural on
those systems that are Unicode-native (especially if they are
UTF-8-native). By contrast, yours, IMO, requires more fussing
in the MUAs (especially those of UTF-8 native environments),
perhaps to the extent of actually delaying deployment... we can
all speculate, but there is really no way to know. By
requiring MTA changes, my approach essentially forces an "either
you internationalize or you don't" situation. Your approach
permits, I think, several intermediate points -- do nothing,
display and/or process the specially-coded form, display and
process "normal script".

One can argue which of those is better either way, although,
clearly, if one's main concern is about somehow getting
addresses in and out of individual, e.g., Hotmail, accounts,
yours clearly has the advantage. Even there, however, we may
disagree a bit about impacts: putting a non-ASCII address into
one of those accounts would probably be easy for you, or me, or
most of the readers of this list... but we don't seem to be
using Hotmail accounts. Teaching the casual user who depends on
a Hotmail account where to go on the network to get a Unicode
string encoded into the encoding needed here, then to paste that
address into a Hotmail signup script, then to teach one's
friends to use that string as necessary (since it would
presumably have little mnemonic value in any language) would, I
think, discourage most such users. Conversely, when all of the
MUAs are upgraded, and Hotmail's web page is upgraded
sufficiently to support more or less direct Unicode input, all
this trouble will probably be for nothing, but we will be stuck
with the circumventions in the email system.

While there are details on which we disagreed, and will continue
to disagree --more of them about what disclaimers should have
been made explicitly, and how the protocol should have been
defined, than about the protocol details-- I really do think
IDNA, or something _very_ like it, was the right approach for
the DNS. But it seems to me that the tradeoffs and
considerations are different for email addresses and that
"worked there" isn't sufficient justification for "the right
thing to do here".

regards,
john


Adam M. Costello
2003-10-15 03:25:24 UTC
Post by John C Klensin
a delimiter-based approach that uses a non-ASCII delimiter would
encounter the same problem as a string that used an ASCII letter as a
delimiter rather than a conventional ASCII delimiter.
I wouldn't call it the same problem. Consider an arbitrary non-ASCII
punctuation character, like U+00B7 (middle dot). This character cannot
possibly have any special meaning in local parts in any domain, because
it has never been allowed in local parts in any domain. After IMAA is
introduced, it will be possible for a domain to define (for the first
time ever) a special meaning for middle dot, but only if the domain's
MTA is IMA-aware. That's hardly a surprising limitation.

Now consider an ASCII punctuation character, like U+002B (plus sign).
This character has a long history of having a special meaning in some
domains. After IMAA is introduced, it will be possible to use non-ASCII
local parts in domains served by old MTAs (which might start happening
without any invitation from the administrator of the MTA). But will
ASCII plus signs have the same special meaning in non-ASCII local parts
that they already have in ASCII local parts? If IMAA isn't careful in
its encoding algorithm, the answer will be no. Within a single domain,
plus signs will continue to have their historic effect in ASCII local
parts, and will have a different (null or chaotic) effect in non-ASCII
local parts. Nothing like that can happen with middle dot. This is a
different and worse problem.

With care, that problem can be avoided. IMAA can ensure that the
encoding of text on one side of a protected ASCII character is not
influenced by the text on the other side of the protected character.
This guarantees that any protected characters that were already in use
as delimiters in ASCII local parts will have the same effect in both
ASCII and non-ASCII local parts.
Post by John C Klensin
I really do think IDNA, or something _very_ like it, was the right
approach for the DNS. But it seems to me that the tradeoffs and
considerations are different for email addresses
I would like to hear more details about how they are different.
Meanwhile, here are some ways in which they are similar:

Domain names are used in many protocols, so an incompatible change in
domain name syntax would entail changes to all those protocols, not
just DNS. Similarly, mail addresses are used in several protocols. If
message header syntax is extended and SMTP is extended to negotiate
support for extended mail headers, don't POP and IMAP need analogous
negotiation extensions? And news headers and NNTP? And mailto URIs
and the protocols that carry them (HTML, HTTP)? And whatever else I've
forgotten or don't know about...

If IDNs were downright inaccessible (not merely ugly) to existing
protocols, interfaces, software, etc, then people would be quite
reluctant to create IDNs (which could have worse implications than just
slow deployment; if the activation energy exceeds some threshold, the
reaction will not even start). Similarly, if IMAs are inaccessible to
existing MTAs, MUAs, mailing list software, news readers, web browsers,
etc, then people will be quite reluctant to create them.
Post by John C Klensin
the two approaches are not necessarily incompatible.
Technically speaking, the transport infrastructure could accommodate
specially-encoded local parts as well as UTF-8 ones, just as it could
recognize and accommodate punycode domain names as well as UTF-8 ones.
Yes, the same is true of IDNs. Any protocol that currently uses ASCII
domain names could be extended to support non-ASCII domain names
directly. For example, IRIs are being defined as an extension of URIs,
and IRIs will allow a non-ASCII host name where URIs allow only an ASCII
host name.

That is not the only difference between IRIs and URIs; other fields are
also being extended to allow non-ASCII. If an extended message header
format is to be defined, I would expect it to allow non-ASCII in many
places (like display-name and unstructured), not just addresses.

It is possible to define an extended DNS protocol that supports
non-ASCII domain names directly. This can be done at any time, or
never. There was no need to hold up IDNA while the details of non-ASCII
DNS were worked out. Similarly, there is no need to hold up IMAA while
the details of non-ASCII message headers are worked out.

AMC
John C Klensin
2003-10-13 10:24:11 UTC
Adam and others,

One thing I should have added to my previous note...

While it would add implementation work up and down the line, and
would create a "transition strategy" we would never be able to
get rid of, and hence might not be desirable, the two
approaches are not necessarily incompatible.

Technically speaking, the transport infrastructure could
accommodate specially-encoded local parts as well as UTF-8 ones,
just as it could recognize and accommodate punycode domain names
as well as UTF-8 ones. Of course, some local-parts that would
be meaningful in UTF-8 could not be recoded without loss of
information (e.g., those positional cases and cases dependent on
delimiter characters that were not ASCII "specials"), but, as a
transition/backward compatibility strategy, that might be more
tolerable than it would be as the long-term (exclusive) plan.
One could even revisit the none-too-successful "downgrading"
options that are part of 8BITMIME.

Is it a good idea? I'm personally biased against temporary/
transition strategies which we don't have a good plan for
getting rid of... and I don't see a plausible plan for this
case. But it might be worth some consideration.

john
Dave Crocker
2003-10-13 16:27:34 UTC
Folks,

AMC> Therefore, although IMAA is designed to allow the creation of
AMC> internationalized local parts without upgrades of MTAs, there could be
AMC> a few domains in which users will in fact have to wait for an upgrade
AMC> of their MTA before creating internationalized local parts. This
AMC> inconvenience can arise only in domains that use letters or digits or
AMC> positions as delimiters.


IMAA attempts to deal with internal structure of the local-part, by segmenting
the encoded string into multiple sub-parts. Internal structure is a local
matter, except for some very constrained IETF standards. That is why there are
so many different conventions for the internal structure of local-parts.

Internal structure is, in fact, an MUA/MTA convention, for the target system.
To the extent that IMAA feels compelled to produce an IETF standard that
purports to offer compatibility with local conventions for local-part structure,
it is
fundamental that the IMAA specification deal with the MUA/MTA interaction.

It would be quite a bit simpler if the IMAA spec did _not_ attempt to juggle
local convention issues, but instead tried to avoid dealing with them
explicitly.

This becomes quite easy, if IMAA uses the perspective of data encoding, in the
same sense as MIME content-transfer-encoding. This is, after all, all that
IMAA is attempting to do: Put fat characters into a space only designed for
thin characters. (IE, encoding characters that use more bits, into a character
space that uses fewer.)

IMAA should list a set of reserved ASCII characters. ToASCII translation
should never create a string that uses any of those characters, and ToUnicode
should simply pass those characters unchanged.

As a starting point, I'll suggest that the set of reserved characters be all
ASCII graphic characters. If folks feel compelled to be more clever than
that, then I suggest:

,./\;'":[]{}=+-()*&$#@!|?`~

I believe every one of those has gotten used in local convention structuring
or global email syntactic standards. (And from what I can tell, this turns out
to be the full set of ascii graphics...)

d/

ps. the use of an infix, rather than prefix, "signal" string, seems pretty
strange. As I understand it, the claimed reason for using it is specifically
to deal with segmentation. This requirement goes away if translation simply
uses the "reserved character" model, thereby having IMAA essentially simplified
into ignoring local convention issues.

--
Dave Crocker <dcrocker-at-brandenburg-dot-com>
Brandenburg InternetWorking <www.brandenburg.com>
Sunnyvale, CA USA <tel:+1.408.246.8253>
Adam M. Costello
2003-10-13 19:57:29 UTC
Post by Dave Crocker
IMAA should list a set of reserved ASCII characters.
It lists a set of "protected" characters.
Post by Dave Crocker
I'll suggest that the set of reserved characters be all ASCII graphic
characters.
The protected characters are defined in section 2 as all ASCII
characters except letters and digits, which is the set you propose
plus the invisible ASCII characters. IMAA has no desire to muck
with invisible characters, so they might as well be included in the
don't-touch set.
Post by Dave Crocker
ToASCII translation should never create a string that uses any of
those characters,
ToASCII never introduces new instances of those characters and never
reorders them. But it also never deletes or hides them. Why should it?
The purpose of ToASCII is to smuggle non-ASCII characters into an ASCII
local part, not to smuggle special non-alphanumeric ASCII characters
into a purely alphanumeric ASCII local part. Let ASCII characters be
themselves to the greatest extent possible. (Quoted-printable takes a
similar approach.)
Post by Dave Crocker
and ToUnicode should simply pass those characters unchanged.
It does. ToUnicode, like ToASCII, never inserts, deletes, or reorders
protected characters.
Post by Dave Crocker
the use of an infix, rather than prefix, "signal" string, seems pretty
strange.
It is necessary if Punycode is used and hyphen is protected. If hyphen
is protected, then the encoding must not be allowed to introduce hyphens.
But Punycode introduces a hyphen in the middle of the encoded string.
Therefore the hyphen needs to be replaced by an alphanumeric signal
string, which might as well serve as the ACE signal itself.
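
You can see the offending hyphen with the punycode codec in Python's
standard library, and the replacement is then mechanical (a sketch
only, taking "0iesg1" as the signal string and omitting Nameprep and
the validity checks):

    # Punycode puts its single delimiter hyphen between the literal
    # ASCII run and the encoded deltas:
    p = "josé".encode("punycode").decode("ascii")   # 'jos-dma'

    # Hyphen is protected, so swap it for the alphanumeric signal,
    # which doubles as the ACE marker:
    ace = p.replace("-", "0iesg1")                  # 'jos0iesg1dma'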

It looks to me like IMAA is already doing what you propose: It steers
clear of nonalphanumeric ASCII characters so that it will never
accidentally invoke local structuring conventions that use those
characters as delimiters.

Plus, it is well-behaved in one additional respect: Not only are
protected characters left alone, but the substrings between the
protected characters are encoded independently. For example, if
jos0iesg1dma is the encoding of josé, and msica0iesg17ua is the
encoding of música, then jos0iesg1dma+msica0iesg17ua is the encoding
of josé+música (there is no "crosstalk"). IMAA not only avoids
accidentally invoking local structuring conventions, it is neutral
enough to allow those conventions to be deliberately invoked in
internationalized local parts, even in domains using old MTAs.

In IMAA, a "segment" is simply a substring of a local part whose
encoding is independent of the rest of the local part. IMAA does not
know whether those boundaries have any significance to higher layers; it
avoids crosstalk across those boundaries just in case some of them do
have significance.
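
Putting the pieces together, here is a sketch of whole-local-part
encoding in Python (helper names are mine; Nameprep, length limits,
and the signal-collision check are all omitted) that reproduces the
examples above:

    from itertools import groupby

    SIGNAL = "0iesg1"

    def protected(c):
        # Protected: non-alphanumeric ASCII; non-ASCII is never protected.
        return ord(c) < 128 and not c.isalnum()

    def encode_segment(seg):
        if all(ord(c) < 128 for c in seg):
            return seg                 # pure-ASCII segments pass through
        p = seg.encode("punycode").decode("ascii")
        if "-" in p:
            i = p.rindex("-")          # Punycode's delimiter hyphen
            return p[:i] + SIGNAL + p[i + 1:]
        return SIGNAL + p              # segment had no ASCII at all; the
                                       # signal's placement here is my
                                       # guess, not something this thread
                                       # specifies

    def encode_local_part(local):
        out = []
        for prot, group in groupby(local, key=protected):
            run = "".join(group)
            out.append(run if prot else encode_segment(run))
        return "".join(out)

    assert encode_local_part("josé") == "jos0iesg1dma"
    assert encode_local_part("música") == "msica0iesg17ua"
    # No crosstalk: the whole is the concatenation of the parts.
    assert (encode_local_part("josé+música")
            == encode_local_part("josé") + "+" + encode_local_part("música"))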

AMC
Dave Crocker
2003-10-14 00:17:13 UTC
Adam,

I was suggesting a simplifying approach to IMAA. Apparently I did not make my
point clearly enough. Or, more likely of course, I am misunderstanding
something pretty basic.

Let's see how this round goes:

AMC> The protected characters are defined in section 2 as all ASCII
AMC> characters except letters and digits, which is the set you propose

Defined, yes. The problem is why they have syntactic import to a global
standard.

Sections 4.1 and 4.2 define mapping algorithms, between pure Unicode and
IMAA-encoded Unicode. Doing segmentation at protected boundaries makes the
imaa mechanism significantly more complicated.

(Normally, this sort of character translation is defined by a grammar, rather
than an algorithm. Using a grammar, reference to special, lexical items is
straightforward. The same point applies to the syntactic definition of
boundaries, such as the *fix operator.)

As long as we are discussing the algorithms:

4.1/#1 is not needed. A Unicode string may contain some Ascii characters
normally, yes? Hence, a string of all-Ascii is just a special case of a
Unicode string of partial Ascii. Yet the special case does not require special
handling.

4.1/#3 does not say what to do if the result is empty. For that matter, it
does not say what to do if it is _not_ empty.

4.1/#4 is where IMAA gets into the problem of attempting to process local
conventions. Note that simply removing steps #4, #5, #6, and #7 makes the
processing of the entire string work just fine. Assuming that "protected"
characters are never translated, then their possible role as delimiters is
preserved, without IMAA having to be cognizant of that role.

4.1/#6 does not specify what to do if the verification works or fails. Also,
why is it important that a segment is altered?
Post by Dave Crocker
ToAscii translation should never create a string that uses any of
those characters,
AMC> ToASCII never introduces new instances of those characters and never
AMC> reorders them. But it also never deletes or hides them.

Good.
Post by Dave Crocker
the use of an infix, rather than prefix, "signal" string, seems pretty
strange.
AMC> It is necessary if Punycode is used and hyphen is protected. If hyphen
AMC> is protected, then the encoding must not be allowed to introduce hyphens.
AMC> But Punycode introduces a hyphen in the middle of the encoded string.


Having re-scanned the punycode specification, I find myself not understanding
why hyphen gets special concern, for local-part. If you want it to be just
one more "protected" character, that's fine. However, nothing about protected
characters requires an infix.

It seems to me that this should all be done as a simple extension to RFC2822:

addr-spec = local-part "@" domain

local-part = ascii-local / ace-local

ascii-local = dot-atom / quoted-string / obs-local-part

ace-local = ima-prefix ace-encoded-unicode

ima-prefix = "0iesg1"

ace-encoded-unicode = {here's where your toAscii, etc. algorithm goes}



AMC> Therefore the hyphen needs to be replaced by an alphanumeric signal
AMC> string, which might as well serve as the ACE signal itself.

It seems like there is something wrong with a translation algorithm, when it
needs to protect the translated string from the translation mechanism, itself.


AMC> Plus, it is well-behaved in one additional respect: Not only are
AMC> protected characters left alone, but the substrings between the
AMC> protected characters are encoded independently.

This incurs extra storage and processing overhead and complexity. And it
appears to be only for the purpose of trying to support local conventions --
ie, for doing partial global support for a local convention.

The benefit of all this is.... what?


AMC> For example, if
AMC> jos0iesg1dma is the encoding of josé, and msica0iesg17ua is the
AMC> encoding of música, then jos0iesg1dma+msica0iesg17ua is the encoding
AMC> of josé+música (there is no "crosstalk").

Oh. You are attempting to provide a quoted-printable kind of selective
translation, rather than a total, base64 type of translation?

So Unicode, itself, does not have a means of switching from one set of
characters to another? So it is required that an IETF standard for email
local-part contain this explicitly?

Or perhaps it should simply be possible to have the whole string be in
Unicode, without the complexity and overhead of switching back and forth...


AMC> IMAA not only avoids
AMC> accidentally invoking local structuring conventions, it is neutral
AMC> enough to allow those conventions to be deliberately invoked in
AMC> internationalized local parts, even in domains using old MTAs.

Only if the conventions entail characters from the protected set. But in that
case, the structuring characters are passed transparently.

As to "transparently using old MTA's",

1) the whole string is now in ascii, so old mtas doing simple relaying don't
care about the structuring, and

2) the target MTA already must know about both the local conventions and any
special characteristics of the string characters, in order to register and
process the string.


d/
--
Dave Crocker <dcrocker-at-brandenburg-dot-com>
Brandenburg InternetWorking <www.brandenburg.com>
Sunnyvale, CA USA <tel:+1.408.246.8253>
Adam M. Costello
2003-10-14 03:50:10 UTC
Post by Dave Crocker
Doing segmentation at protected boundaries makes the imaa mechanism
significantly more complicated.
True, but remember that dividing a local part into segments for separate
encoding/decoding is not significantly more complex than dividing the
domain part into labels for separate encoding/decoding.
Post by Dave Crocker
4.1/#1 is not needed. A Unicode string may contain some Ascii
characters normally, yes? Hence, a string of all-Ascii is just a
special case of a Unicode string of partial Ascii. Yet the special
case does not require special handling.
I assume you mean "4.2" (ToUnicode) rather than "4.1" (ToASCII) here and
below.

Your comment applies equally to IDNA and IMAA. ToUnicode step 1 is
not strictly necessary, but it avoids gratuitously lowercasing ASCII
letters. For example, the internationalized label "xn--Jos-dma" is
converted to "José" by ToUnicode. Without step 1, it would be converted
to "josé".

Another reason for having this bypass in ToUnicode is to mimic the same
bypass from ToASCII, where the bypass is needed to make sure that
pure ASCII strings are never altered in any way.
Post by Dave Crocker
4.1/#3 does not say what to do if the result is empty. For that
matter, it does not say what to do if it is _not_ empty.
4.1/#6 does not specify what to do if the verification works or fails.
All the verification steps are governed by this paragraph:

ToUnicode never fails. If any step fails, then the original input
sequence is returned immediately in that step.
Post by Dave Crocker
Also, why is it important that a segment is altered?
That check is needed in order to comply with the stated function of
ToUnicode:

If the input sequence is a dequoted local part in ACE form, then
the result is an equivalent dequoted internationalized local part
that is not in ACE form, otherwise the original sequence is returned
unaltered.

If no segment was altered in step 5, then the original input was not
an ACE, and therefore it is to be returned unaltered. Without this
check, the original input would be returned altered (specifically,
Nameprepped).
Post by Dave Crocker
Having re-scanned the punycode specification, I find myself not
understanding why hyphen gets special concern, for local-part. If
you want it to be just one more "protected" character, that's fine.
However nothing about protected characters requires infix.
Yes, IMAA wants hyphen to be just another protected character. But the
Punycode encoder introduces a hyphen where there was none. For example,
if you feed "niño" to the Punycode encoder, it outputs "nio-8ma".

Punycode was designed for domain names, where introducing hyphens was
not a problem. It uses all 37 LDH characters to maximize efficiency.

We could define a new encoding very similar to Punycode that uses "9"
instead of "-" (at a slight cost in efficiency), but I thought it would
be simpler to put a wrapper around Punycode that removes/restores the
hyphen.
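
For what it's worth, the punycode codec in Python's standard library
reproduces that output, and the wrapper is then a one-line transform
in each direction (a sketch, again assuming "0iesg1" as the signal):

    p = "niño".encode("punycode")                    # b'nio-8ma'
    ace = p.decode("ascii").replace("-", "0iesg1")   # 'nio0iesg18ma'

    # Restoring the hyphen and decoding inverts the wrapper:
    back = ace.replace("0iesg1", "-").encode("ascii").decode("punycode")
    assert back == "niño"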
Post by Dave Crocker
ace-local = ima-prefix ace-encoded-unicode
ima-prefix = "0iesg1"
ace-encoded-unicode = {here's where your toAscii, etc. algorithm goes}
This is an attempt to pull one step (the addition/removal of the ACE
prefix) out of the middle of a multi-step algorithm (which includes
Nameprep, Punycode, checking the absence of the ACE prefix, checking the
length) and present it using a different kind of spec (a grammar). I
think it's simpler to present the whole thing together in one kind of
spec.

Also, the grammar above doesn't recognize fullwidth characters in the
ACE prefix, but ToUnicode does (because it performs Nameprep before
looking for and removing the prefix).
Post by Dave Crocker
Not only are protected characters left alone, but the substrings
between the protected characters are encoded independently.
This incurs extra storage and processing overhead and complexity.
And it appears to be only for the purpose of trying to support local
conventions -- ie, for doing partial global support for a local
convention.
The benefit of all this is.... what?
To avoid astonishing users.

Suppose a user obtains an ACE mailbox at example.net that displays as
josé, and his friends with IMA-aware browsers start sending mail to
josé@example.net, which works fine, even though the powers-that-be at
example.net have not upgraded their MTA. Now suppose that example.net
accepts mail for user+tag and delivers it to user. People will
naturally expect to be able to send mail to josé+tag@example.net.
That will work just fine if there is no crosstalk across protected
characters, but it will fail if there is crosstalk.

Suppose the manager of the aliases file at example.net creates an ACE
alias that displays as niño and expands to multiple addresses. The
IMA-unaware MTA, whenever it expands the ACE alias, automatically looks
for a companion alias owner-ACE to use as the envelope From address.
That address will display as owner-niño if there is no crosstalk across
protected characters, but it will display as ASCII garbage if there is
crosstalk.

One could question whether these kinds of benefits are worth the
complexity. In fact, we raised exactly that question in imaa-00
(Feb-05), which did not do segmentation but proposed it as an open
issue. A discussion ensued in which the pros and cons were explored,
and there was more support than opposition for the idea, so it was added
in imaa-01 (Apr-18).

AMC
Dave Crocker
2003-10-28 06:53:30 UTC
Adam, et al,

This evening's clarification -- that imaa wants to modify basic Internet mail
parsing rules -- goes a long way towards making clear a basic flaw with IMAA.
It is entirely in line with the concerns I've expressed over the current
specifications:

This specification needs to narrow its scope, not expand it. It
also needs to specify things much more precisely and clearly.

For starters:

1. Do not mess with global parsing rules.

2. Do not mess with local parsing rules. It is fine to try to avoid
well-known lexical separators, used in various local venues, but keep the heck
away from doing anything clever.

3. Encode a larger character set for _user_ data, into the existing,
permissible set for local-part

4. Do nothing else.
Doing segmentation at protected boundaries makes the imaa mechanism
significantly more complicated.
AMC> True, but remember that dividing a local part into segments for separate
AMC> encoding/decoding is not significantly more complex than dividing the
AMC> domain part into labels for separate encoding/decoding.

For domains, the syntactic rules are global, rigid, and well-specified. For
local-part, they are varied, unspecified. And, of course, they are local.

Internet mail has gotten quite a bit of benefit from avoiding global knowledge
about local-part internals. Please do not mess with that strategic benefit.


AMC> Your comment applies equally to IDNA and IMAA. ToUnicode step 1 is
AMC> not strictly necessary, but it avoids gratuitously lowercasing ASCII
AMC> letters.

IMAA had better not lowercase ASCII, whether gratuitously or not. Local-part
is defined as being case sensitive. IMAA needs to work within that reality.

Maybe your point is that IMAA, in fact, does not do case-mapping. That's fine,
though it is yet-another distinguishing point that I could not tell from the
current specification.


AMC> Another reason for having this bypass in ToUnicode is to mimic the same
AMC> bypass from ToASCII, where the the bypass is needed to make sure that
AMC> pure ASCII strings are never altered in any way.

Parsing/encoding algorithms that need these sorts of special-case, look-ahead
processing invite mis-implementation. They certainly suggest excessive
complexity for a task that is already plenty complicated.


AMC> All the verification steps are governed by this paragraph:

AMC> ToUnicode never fails. If any step fails, then the original input
AMC> sequence is returned immediately in that step.

Then write that into the algorithm.

Do not force implementors to juggle meta-rules when reading algorithms.



AMC> Yes, IMAA wants hyphen to be just another protected character. But the
AMC> Punycode encoder introduces a hyphen where there was none. For example,
AMC> if you feed "niño" to the Punycode encoder, it outputs "nio-8ma".
AMC> Punycode was designed for domain names, where introducing hyphens was
AMC> not a problem. It uses all 37 LDH characters to maximize efficiency.
AMC> We could define a new encoding very similar to Punycode that uses "9"
AMC> instead of "-" (at a slight cost in efficiency), but I thought it would
AMC> be simpler to put a wrapper around Punycode that removes/restores the
AMC> hyphen.

So, Punycode is not a general-purpose module, but you can hack around it to
adapt it to the more complex requirements of mail local-part, by making things
even more complex...
ace-local = ima-prefix ace-encoded-unicode
ima-prefix = "0iesg1"
ace-encoded-unicode = {here's where your toAscii, etc. algorithm goes}
AMC> This is an attempt to pull one step (the addition/removal of the ACE
AMC> prefix) out of the middle of a multi-step algorithm (which includes
AMC> Nameprep, Punycode, checking the absence of the ACE prefix, checking the
AMC> length) and present it using a different kind of spec (a grammar). I
AMC> think it's simpler to present the whole thing together in one kind of
AMC> spec.

Then please do that.

As of now, you have enough nesting and indirect reference to make the
specification be a long way from transparent.

It's possible that the real problem is that I simply don't know how to read a
spec, but I put enough effort into reading the imaa draft to suspect that that
is not the problem.


AMC> Also, the grammar above doesn't recognize fullwidth characters in the
AMC> ACE prefix, but ToUnicode does (because it performs Nameprep before
AMC> looking for and removing the prefix).

Oh, good. There is more than one way to do the prefix, too?
Not only are protected characters left alone, but the substrings
between the protected characters are encoded independently.
This incurs extra storage and processing overhead and complexity.
And it appears to be only for the purpose of trying to support local
conventions -- ie, for doing partial global support for a local
convention.
The benefit of all this is.... what?
AMC> To avoid astonishing users.

AMC> Suppose a user obtains an ACE mailbox at example.net that displays as
AMC> josé, and his friends with IMA-aware browsers start sending mail to
AMC> josé@example.net, which works fine, even though the powers-that-be at
AMC> example.net have not upgraded their MTA. Now suppose that example.net
AMC> accepts mail for user+tag and delivers it to user.

Yes, trying to solve this problem certainly is an enticing trap to fall into.


AMC> People will
AMC> naturally expect to be able to send mail to josé+tag@example.net.

First of all, there is no public standard for segmented local-parts. (Of
course, I'm not telling the whole truth, but the exceptions are for special
purposes.)

The extent to which random users can expect to generate a segmented local-part
for a particular recipient is entirely outside the current scope for existing
Internet mail. (I tried to get interest in a global standard for local-part
segmentation, some years ago, but folks didn't take the bait.)

So the string that someone should expect to have work is whatever the intended
recipient originally sent.

To the extent that there is consensus to have the local-part be a mixture of
ace-encoding and classic ascii, then define the ace-encoded strings with
left/right framing.

Something like:

local-part = 1*(ascii-local / ace-local)

ascii-local = dot-atom / quoted-string / obs-local-part

ace-local = ima-prefix ace-encoded-unicode ima-suffix

will do the trick.


AMC> That will work just fine if there is no crosstalk across protected
AMC> characters, but it will fail if there is crosstalk.

"Crosstalk"?


AMC> Suppose the manager of the aliases file at example.net creates an ACE
AMC> alias that displays as niño and expands to multiple addresses. The
AMC> IMA-unaware MTA, whenever it expands the ACE alias, automatically looks
AMC> for a companion alias owner-ACE to use as the envelope From address.
AMC> That address will display as owner-niño if there is no crosstalk across
AMC> protected characters, but it will display as ASCII garbage if there is
AMC> crosstalk.
AMC> One could question whether these kinds of benefits are worth the
AMC> complexity.
They aren't.

1. Alias expansion is a function of list processing, not classic MTA
processing. Yes it is useful and popular, but let's be clear about where it
fits in the architecture. Let's not confuse architecture with implementation.

2. Folks probably have not noticed just how much effort you are having to put
into working around all sorts of features in the real world, in order to
deliver all sorts of generalities. Although the intentions behind these
contortions are laudable, the meta-issue is that this much bobbing and weaving
in a specification is usually a good sign of implementation and adoption
difficulties later.


AMC> In fact, we raised exactly that question in imaa-00
AMC> (Feb-05), which did not do segmentation but proposed it as an open
AMC> issue. A discussion ensued in which the pros and cons were explored,
AMC> and there was more support than opposition for the idea, so it was added
AMC> in imaa-01 (Apr-18).

Isn't it nice that we get to review all that, in preparation for bringing this
work into the IETF?


d/
--
Dave Crocker <dcrocker-at-brandenburg-dot-com>
Brandenburg InternetWorking <www.brandenburg.com>
Sunnyvale, CA USA <tel:+1.408.246.8253>
Adam M. Costello
2003-10-28 22:32:50 UTC
Post by Dave Crocker
This evening's clarification -- that imaa wants to modify basic
Internet mail parsing rules
IMAA does not modify the parsing rules in any existing context. In
message headers and SMTP commands, IMAA does not alter the syntax of
mail addresses; they continue to be ASCII-only, and the old parsing
rules still work.

But there need to be some contexts where a new syntax is used, a syntax
that allows non-ASCII characters. If there are no such contexts, then
we haven't internationalized anything.

Where does this new syntax get used? In new user interfaces of new
internationalized mail applications. And maybe in new protocols (IMAA
neither encourages nor discourages the introduction of new protocols
that use non-ASCII mail addresses).

When a user types a non-ASCII mail address into a user agent, the agent
needs to parse the address into a non-ASCII local part and a non-ASCII
domain name, and then parse the latter into non-ASCII labels, before it
can perform the encoding. Obviously, the old parsing rules won't work,
because those rules accept only ASCII characters. Therefore we must
have new parsing rules for use in this new context.

When the new parsing rules are applied to ASCII-only mail addresses,
they degenerate into the old parsing rules. Therefore a new application
doesn't actually need to switch between two sets of parsing rules. The
new rules are a backward-compatible extension of the old rules.

Dave> Doing segmentation at protected boundaries makes the imaa mechanism
Dave> significantly more complicated.

AMC> True, but remember that dividing a local part into segments
AMC> for separate encoding/decoding is not significantly more
AMC> complex than dividing the domain part into labels for separate
AMC> encoding/decoding.

Dave> For domains, the syntactic rules are global, rigid, and
Dave> well-specified. For local-part, they are varied, unspecified.
Dave> And, of course, they are local.

That's true, but you're changing the subject. You said segmentation was
complex, and I responded to that criticism by pointing out that we're
already doing an operation of nearly identical complexity on the other
side of the at-sign.
Post by Dave Crocker
Internet mail has gotten quite a bit of benefit from avoiding global
knowledge about local-part internals. Please do not mess with that
strategic benefit.
IMAA does not expect anyone to know how local parts are structured. It
merely uses an encoding that avoids throwing a wrench into the works
whenever possible.
Post by Dave Crocker
ToUnicode step 1 is not strictly necessary, but it avoids
gratuitously lowercasing ASCII letters.
IMAA had better not lowercase ASCII, whether gratuitously or not.
Local-part is defined as being case sensitive. IMAA needs to work
within that reality.
More precisely, ASCII local parts are defined to have the following
tricky properties regarding case sensitivity:

Local parts MAY be case-sensitive, and therefore MUST be treated as
case-sensitive by anyone who doesn't know for sure; however, the
authoritative servers who finally decide the issue are discouraged
from being case-sensitive. [RFC 2821]

For non-ASCII local parts, we found that this model was just too
tricky to pull off without undue complexity in the spec. (You think
segmentation is complex? You should have seen this...) When defining a
new class of local parts (non-ASCII local parts), we had the opportunity
to use a simpler case-sensitivity model, and we did. We could have
either defined non-ASCII local parts to be always case-sensitive,
or always case-insensitive. We chose the latter because it is more
consistent with actual practice and with user expectations. IMAA makes
no change to the case-sensitivity model for ASCII local parts.
Post by Dave Crocker
Another reason for having this bypass in ToUnicode is to mimic the
same bypass from ToASCII, where the bypass is needed to make
sure that pure ASCII strings are never altered in any way.
Parsing/encoding algorithms that need these sorts of special-case,
look-ahead processing invite mis-implementation. They certainly
suggest excessive complexity for a task that is already plenty
complicated.
The fact that ASCII local parts and non-ASCII local parts use two
different case-sensitivity models (a simple one for non-ASCII local
parts and a trickier one for ASCII local parts) causes ToASCII to need
to check whether its input is pure ASCII, and avoid Nameprep (which
includes case-folding) if it is.

The IDNA ToASCII also contains this bypass, even though domain names are
always case-insensitive and doing case-folding on ASCII labels would not
have altered the domain. Still, why squash the case if you don't have
to?

By the way, another approach, rather than having the bypass in ToASCII,
would have been to define Nameprep to fold non-ASCII letters but leave
ASCII letters unfolded. I suggested this, but was overruled. As I
recall, the main argument against this idea was that deviating from the
Unicode case-folding algorithm in any way would open up a can of worms
for the numerous proposed tweaks and fixes of Unicode case-folding and
normalization, and we'd never reach consensus.
Post by Dave Crocker
Punycode was designed for domain names, where introducing hyphens
was not a problem. It uses all 37 LDH characters to maximize
efficiency. We could define a new encoding very similar to Punycode
that uses "9" instead of "-" (at a slight cost in efficiency), but
I thought it would be simpler to put a wrapper around Punycode that
removes/restores the hyphen.
So, Punycode is not a general-purpose module, but you can hack around
it to adapt it to the more complex requirements of mail local-part, by
making things even more complex...
Right. But I still think it was the simplest of the options, which
were:

* Put a wrapper around Punycode.
* Create a slight variant of Punycode.
* Introduce an entirely new encoding algorithm.

Given that a Punycode implementation is already needed for the domain
part of the address, I think the wrapper idea is the simplest.
Post by Dave Crocker
Also, the grammar above doesn't recognize fullwidth characters in
the ACE prefix, but ToUnicode does (because it performs Nameprep
before looking for and removing the prefix).
Oh, good. There is more than one way to do the prefix, too?
Yes. One of the guiding principles in the design of IDNA (and therefore
IMAA) is that if two strings are equivalent Unicode strings, they had
better be treated the same. Unicode defines two kinds of equivalence:
canonical equivalence, and compatible equivalence. It was decided that
compatible equivalence is what we wanted.

"xn--jos-dma" and fullwidth "xn--jos-dma" are equivalent Unicode
strings. Therefore, if the former gets displayed as "josé", the latter
had better get displayed as "josé" too.

We hope users won't need to type ACE forms, but occasionally they will.
Curiously, while CJK users tend to be quite careful about upper case
versus lower case (much more so than most English speakers), they tend
to be quite careless about fullwidth versus regular width. (At least,
that's my observation from Japanese web pages.) I think they would be
mystified if the fullwidth version of the ACE form didn't work.
Post by Dave Crocker
People will naturally expect to be able to send mail to
josé+tag@example.net.
The extent to which random users can expect to generate a segmented
local-part for a particular recipient is entirely outside the current
scope for existing Internet mail.
If example.net has an existing policy of accepting mail for user+tag and
delivering it to user, then José himself might expect to be able to tell
his friends to send mail to josé+tag, or at least might be disappointed
to realize that he can't.

Also, if a third party already corresponds with several users at
example.net, and is therefore familiar with the user+tag convention
at example.net, they will be astonished that it doesn't work for
josé+tag@example.net.
Post by Dave Crocker
"Crosstalk"?
Leakage between two channels. For example, if you're having a
conversation on an analog cell phone, and you hear some other cell phone
conversation, that's crosstalk. In the context of encoding strings, if
FOO1 gets encoded as bar1, and FOO2 gets encoded as bar2, and FOO1-FOO2
gets encoded as bar1-bar2, then there has been no crosstalk across
the boundary marked by the hyphen. But if FOO1-FOO2 gets encoded as
something other than bar1-bar2, that means information must somehow have
leaked across the boundary during the encoding process.
Post by Dave Crocker
To the extent that there is consensus to have the local-part be a
mixture of ace-encoding and classic ascii, then define the ace-encoded
strings with left/right framing.
local-part = 1*(ascii-local / ace-local)
ascii-local = dot-atom / quoted-string / obs-local-part
ace-local = ima-prefix ace-encoded-unicode ima-suffix
will do the trick.
I considered that. It was a precursor to the current IMAA encoding.

Let's explore this idea further. First, we need to fix it up a bit.
The above grammar allows multiple quoted strings, which is no longer
allowed except in obsolete syntax. Let's ignore obsolete syntax for
this discussion.

Since we can't have multiple quoted-strings, the ACE parts are going
to need to go inside the single quoted-string. It's too hard to write
a grammar for that unless we view the quoting as a separate layer. In
other words, we define a grammar for a dequoted-ascii-local-part, and
then quote that as necessary.

We don't want the grammar to be ambiguous, since the purpose is to
indicate how to parse a string, not merely to decide whether a string
is valid, right? The above grammar is ambiguous unless dot-atom is
restricted to not contain ima-prefix and ima-suffix.

Here's a grammar that is intended to be the same in spirit as the one
you proposed, while dealing with the above concerns:

dequoted-ascii-local-part = empty-string / 1*segment
segment = literal-segment / encoded-segment
literal-segment = 1*ascii-char ; must not contain prefix or suffix
encoded-segment = prefix nonascii-encoded-as-alphanumeric-ascii suffix

("encoded" refers to more than just Punycode. It's Nameprep and
Punycode.)

What is the nice feature of this syntax? The encoded form is likely
to have the same "structure" as the original form, with a one-to-one
correspondence between the encoded "components" and the original
"components", with each "component" encoded independently of the others
(no crosstalk). Even though we don't know any details of what the
structure or the components might be, we get this nice feature because
the encoding does not muck around with ASCII characters. It doesn't
delete them, reorder them, or introduce them. Therefore, as long as the
structure is based on ASCII delimiters, the encoding hasn't interfered.

Oh wait--the encoding *does* introduce some ASCII characters, namely
alphanumeric ASCII characters. Therefore we do interfere if the
unknown structure uses alphanumeric ASCII characters as delimiters. But
we don't interfere if only non-alphanumeric ASCII characters are used as
delimiters.

Well, as long as we're mucking with alphanumeric ASCII characters by
introducing them, we might as well reorder them too, if it improves the
encoding. Indeed, we can make the encoding more compact by tweaking the
grammar to lump alphanumeric ASCII characters together with non-ASCII
characters:

dequoted-ascii-local-part = empty-string / 1*segment
segment = protected-segment / unprotected-segment
protected-segment = 1*nonalphanumeric-ascii-char
unprotected-segment = 1*alphanumeric-ascii-char

An unprotected-segment might be literal or might be encoded; you just
try to decode it and see what happens. As before, "encoding" and
"decoding" refer to more than just Punycode; this time they include not
only Nameprep and Punycode but also a special substring. The first
step of decoding is looking for a special substring; if it's not there
then the decoding fails and the segment is literal. Even if the special
substring is present, the decoding might still fail, indicating that the
segment is literal (and misleading, and therefore discouraged). If the
decoding succeeds, then the segment was encoded.
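
A sketch of that decoding rule in Python (names are mine, and the
re-encoding check that guards against misleading literals is reduced
to a comment):

    SIGNAL = "0iesg1"

    def decode_segment(seg):
        # Step 1: look for the signal substring (taking the last
        # occurrence, by analogy with Punycode's last-hyphen rule).
        i = seg.rfind(SIGNAL)
        if i < 0:
            return seg                 # no signal: the segment is literal
        candidate = seg[:i] + "-" + seg[i + len(SIGNAL):]
        try:
            decoded = candidate.encode("ascii").decode("punycode")
        except UnicodeError:
            return seg                 # decoding failed: literal after all
        # A full implementation would also re-encode `decoded` and insist
        # that it round-trip back to `seg` before accepting it.
        return decoded

    assert decode_segment("jos0iesg1dma") == "josé"
    assert decode_segment("plain") == "plain"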

In the previous grammar, the special substring (prefix/suffix combo)
served two purposes: (1) it marked the boundaries of the segments, and
(2) it made it unlikely that a string intended to be literal would
accidentally be interpreted as an encoded string.

In the revised grammar, the special substring serves only the second
purpose. We don't need it to mark the boundaries of the segments,
because the segment boundaries are already marked by the adjacency of an
alphanumeric character and a nonalphanumeric character. Therefore we
are free to use an infix rather than a prefix or suffix, which turns out
to be convenient.

The revised grammar yields a more compact encoding than the previous
grammar, for Latin-based local parts. For example, "résumé" would have
needed two prefixes, two suffixes, and 6 characters of Punycode, whereas
now it needs only one infix and 4 characters of Punycode.
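
Those counts check out against Python's punycode codec, under the same
wrapper assumptions as before:

    "résumé".encode("punycode")   # b'rsum-bpad': one delimiter and 4
                                  # code characters -> 'rsum0iesg1bpad'
    "é".encode("punycode")        # b'9ca': 3 code characters, needed
                                  # twice under the framed grammar -> 6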
Post by Dave Crocker
Suppose the manager of the aliases file at example.net creates
an ACE alias that displays as niño and expands to multiple
addresses. The IMA-unaware MTA, whenever it expands the ACE alias,
automatically looks for a companion alias owner-ACE to use as the
envelope From address. That address will display as owner-niño
if there is no crosstalk across protected characters, but it will
display as ASCII garbage if there is crosstalk.
1. Alias expansion is a function of list processing, not classic MTA
processing.
Sendmail is a classic MTA, and I'm pretty sure it has been performing
alias expansion as far back as I have been using email (14 years). All
MTAs I've ever heard of perform alias expansion.

Until recently, majordomo was one of the most popular mailing list
processors, and it is still in fairly widespread use. It has never
performed alias expansion; it has always relied on the MTA to do it.
Post by Dave Crocker
Yes it is useful and popular, but let's be clear about where it
fits in the architecture. Let's not confuse architecture with
implementation.
Even if we make the distinction between MTA functions and list
processing functions, they're both infrastructure functions, as opposed
to user-agent functions. IMAA's goal is to work well even without
changes to infrastructure.

AMC
