Discussion:
Comments on draft-klensin-emailaddr-i18n-00
Dan Oscarsson
2003-10-26 13:45:22 UTC
I have read John's draft-klensin-emailaddr-i18n-00 and I am glad
to see a draft where we start by looking at extending the
current Internet protocols from ASCII to UCS, instead of just
encoding all non-ASCII into ASCII without fixing the protocols.

As Adam has pointed out, there are many things that can be discussed
and problems that can be seen in doing it this way instead of the
IMAA way. But the IMAA and IDNA approach also introduces many problems.

I think this way is the right one. Yes, both MUAs and MTAs need to
be upgraded. But without that, the use of non-ASCII e-mail addresses
will be so ugly that nobody will use them. Just like MIME was not
very popular before enough clients could handle it.
Nor will IDNA be very useful before DNS clients and servers are upgraded.
And it would have been no problem to require upgraded DNS servers
to support non-ASCII in domain names in order to get them used;
those who want them would have forced the upgrade.

The draft has many open questions. I recommend that we go for full
internationalisation of all headers, so that all headers are in UTF-8
and MIME-encoded headers are forbidden. This would greatly simplify
handling in MTAs/MUAs and give a clear signal to everybody that
the ASCII-only world is going away.
Local parts, domain names, subjects etc. must all be in UTF-8.
Allowing some parts in Punycode or some other encoded form
would just add a lot of unnecessary complexity.

To simplify handling of UTF-8 text, the standard should require
Unicode normalisation form C (NFC). This does not destroy
any information, while making things much simpler for the
implementor.
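
A minimal sketch of that NFC step, using Python's standard unicodedata
module (the example string is illustrative only):

    import unicodedata

    # 'Rene' + U+0301 COMBINING ACUTE ACCENT: the decomposed spelling.
    decomposed = 'Rene\u0301'
    composed = unicodedata.normalize('NFC', decomposed)
    print(composed == 'Ren\u00e9')           # True: one precomposed code point
    print(len(decomposed), len(composed))    # 5 4 -- nothing is lost, just recombined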

For e-mail addresses we should probably recommend that only
the smallest code point for a given character be used.
Otherwise some may use, for example, the wide form of some
Latin characters, which cannot be expected to match
most addresses with Latin characters, as those will be using
the standard-width code points.
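
A short sketch of why this extra recommendation is needed on top of NFC,
assuming Python's unicodedata module: NFC leaves the fullwidth (wide)
Latin letters alone, while NFKC is one way to fold them to the ordinary
code points.

    import unicodedata

    wide = '\uff24\uff21\uff2e'                          # 'DAN' in fullwidth Latin capitals
    print(unicodedata.normalize('NFC', wide) == 'DAN')   # False: NFC keeps the wide forms
    print(unicodedata.normalize('NFKC', wide) == 'DAN')  # True: NFKC maps them to ASCII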

When this draft is ready we can look at a way to downgrade
to legacy ASCII e-mail. But as we here require all MTAs/MUAs
handling international mailboxes to be upgraded, we can
use a simpler downgrading to ASCII than IMAA, as we only need
the ASCII world to see international local part as an
opaque ASCII string.


Dan
Simon Josefsson
2003-10-26 15:35:07 UTC
Dan Oscarsson <***@kiconsulting.se> writes:

> The draft has many open questions. I recommend that we go for full
> internationalisation of all headers, so that all headers are in UTF-8
> and MIME-encoded headers are forbidden. This would greatly simplify
> handling in MTAs/MUAs and give a clear signal to everybody that
> the ASCII-only world is going away.
> Local parts, domain names, subjects etc. must all be in UTF-8.
> Allowing some parts in Punycode or some other encoded form
> would just add a lot of unnecessary complexity.

I would like to give my support to this, to Klensin's draft, and echo
some other things you say.

Truly internationalized mail software is already required to
implement Unicode via IDNA. Adding UTF-8 to the list of requirements
for internationalized software would only recognize what applications
already do. (Out of curiosity, are there any widely used and open
standards that encode Unicode into anything but UTF-8 during network
transport or disk storage, except IDNA?)

We need IDNA/IMAA as a fall back mechanism for deployed systems, but
let's not make it more than that. Going forward with the "encode as
ASCII" approach without, in parallel, offering a better approach for
new protocols, and new implementations of old protocols, is bad.

Another argument, which I have only recently come to fully understand, is
that adopting the ASCII-encode approach risks decreasing the quality of
protocols that do not have the same restrictions for which IDNA/IMAA
was developed. Instead of using Unicode and UTF-8 directly for
internationalized strings, it could be tempting to propose that
protocols should use ASCII-encoded Unicode a la IDNA/IMAA because it
is used elsewhere. The argument is that IDNA/IMAA will "leak" into
the protocol, so you better handle it somehow. Two current examples
would be the Usenet news headers and Kerberos.

Thanks,
Simon
Arnt Gulbrandsen
2003-10-26 16:44:38 UTC
Simon Josefsson writes:
> (Out of curiosity, are there any widely used and open standards that
> encode Unicode into anything but UTF-8 during network transport or
> disk storage, except IDNA?)

Yes. See RFC 3501 section 5.1.3.

Since UTF-7 is smaller than UTF-8+QP for some/many strings, I assume
there are MUAs that use UTF-7 for body text.
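
For reference, a rough sketch of the RFC 3501 section 5.1.3 encoding
("modified UTF-7" for mailbox names), covering only a single run of
non-ASCII characters; a real encoder also passes printable ASCII through
unchanged and encodes '&' itself as '&-':

    import base64

    def encode_non_ascii_run(run: str) -> str:
        # UTF-16BE, then modified BASE64 (',' instead of '/'), wrapped in '&' ... '-'.
        b64 = base64.b64encode(run.encode('utf-16-be')).decode('ascii')
        return '&' + b64.rstrip('=').replace('/', ',') + '-'

    print(encode_non_ascii_run('\u00fc'))   # '&APw-' for U+00FC (u with diaeresis)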

--Arnt
Simon Josefsson
2003-10-26 17:57:51 UTC
Arnt Gulbrandsen <***@gulbrandsen.priv.no> writes:

> Simon Josefsson writes:
>> (Out of curiosity, are there any widely used and open standards that
>> encode Unicode into anything but UTF-8 during network transport or
>> disk storage, except IDNA?)
>
> Yes. See RFC 3501 section 5.1.3.

Ah. How could I forget.

> Since UTF-7 is smaller than UTF-8+QP for some/many strings, I assume
> there are MUAs that use UTF-7 for body text.

Right. Good examples.

This doesn't counter Dan's or my arguments, though, since UTF-8 is
still commonly implemented. These examples just prove other
alternatives exist, which I really shouldn't have doubted.

Thanks,
Simon
Adam M. Costello
2003-10-26 20:11:44 UTC
This message responds to messages from both Dan Oscarsson and Simon
Josefsson.

Dan Oscarsson <***@kiconsulting.se> wrote:

> IDNA do not support all international domain names due to being made
> to work using unaware DNS servers and clients.

What do you mean by "all international domain names"? There was no
such thing as an "internationalized domain name" until the IETF defined
that term. The definition appears in the IDNA spec. Therefore, by
definition, IDNA supports "all internationalized domain names". You
must have some other definition in mind. What is it, and why is it a
problem that IDNA does not support that definition?

> As the IMAA draft stands today it will not handle all e-mail
> addresses.

Same question. There is no such thing as an internationalized mail
address until we define it.

> Yes, both MUAs and MTAs need to be upgraded. But without that, the use
> of non-ASCII e-mail addresses will be so ugly that nobody will use
> them. Just like MIME was not very popular before enough clients could
> handle it.

And UTF-8 headers will not be very popular before enough clients *and*
servers can handle it. If speed of deployment is an issue, it looks
pretty clear to me that the "in applications" approach has the edge.

> I recommend that we go for full internationalisation of all headers so
> all headers are in UTF-8, and MIME-encoded headers are forbidden.

I have some ideas for how to do that, but I'm not sure that this is
the appropriate mailing list for that discussion (this mailing list is
about internationalized mail addresses, not headers in general). If
someone starts a mailing list for discussion of UTF-8 headers, please
announce it here. I still think there is no need for IMAA to wait for
any results from that discussion, and I think IMAA would still be needed
as a fallback even if a UTF-8 header solution existed.

> To simplify handling of UTF-8 text, the standard should require Unicode
> normalisation form C (NFC).

In any case, I think it would be good if receivers do not assume that
text is already normalized; they should perform normalization whenever
they want text to be normalized. Then it will not be necessary for
senders to perform normalization. The implementation cost is the
same whether the code is inside the senders or inside the receivers.
Forbidding unnormalized text on the wire doesn't do much except to make
troubleshooting more difficult for humans, who cannot see the difference
between normalized and unnormalized text. It's not as if unnormalized
text is ambiguous. It makes sense to forbid ambiguous constructions
like "***@foo@bar", but unnormalized text is not ambiguous--you
can determine exactly what was intended by normalizing it yourself.
Anal-retentive receivers that refuse to accept unnormalized text would
just frustrate users who telnet to port 25 for troubleshooting, if their
input method doesn't output normalized text. Imagine how that would go:

> RCPT To: <***@host>
> 550 Unknown local part user in <***@host>

"Unknown local part? But I can see it right there in the aliases
file..." Later, after much hair-pulling: "Oh, the supposedly
'internationalized' MTA is doing a brain-dead bytewise comparison, and
the representation produced by my input method is slightly different
from the representation in the aliases file. Why doesn't the MTA
normalize them first?! Lazy-ass piece of... Why didn't the spec
designers require the MTA to normalize its input, to save me this
headache?"
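
A minimal sketch of the comparison being asked for here, assuming Python's
unicodedata module (the mailbox name is purely illustrative): normalize
both sides before matching, instead of comparing bytes.

    import unicodedata

    def mailbox_match(submitted: str, configured: str) -> bool:
        nfc = lambda s: unicodedata.normalize('NFC', s)
        return nfc(submitted) == nfc(configured)

    # 'jose' + combining acute vs. the precomposed spelling:
    print('jose\u0301' == 'jos\u00e9')               # False: bytewise comparison
    print(mailbox_match('jose\u0301', 'jos\u00e9'))  # True: compared after NFC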

> When this draft is ready we can look at a way to downgrade to
> legacy ASCII e-mail. But as we here require all MTAs/MUAs handling
> international mailboxes to be upgraded, we can use a simpler
> downgrading to ASCII than IMAA as we only need the ASCII world to see
> international local part as an opaque ASCII string.

I don't see how the downgrading is going to get a lot simpler than IMAA.
You're going to need IDNA for the domain part so that ASCII-only MUAs
can reply to messages (which involves looking up the domain in DNS).
That means you already need Punycode and Nameprep, which are the two
main sources of complexity in IMAA.
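
For illustration, the Nameprep-plus-Punycode step looks roughly like this
with Python's built-in IDNA (2003) codec; 'bücher.example' is just a
sample domain, not one from this discussion:

    ace = 'bücher.example'.encode('idna')
    print(ace)                 # b'xn--bcher-kva.example'
    print(ace.decode('idna'))  # back to the Unicode form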

Simon Josefsson <***@extundo.com> wrote:

> adopting the ASCII-encode approach risks decreasing the quality of
> protocols that do not have the same restrictions for which IDNA/IMAA
> was developed. Instead of using Unicode and UTF-8 directly for
> internationalized strings, it could be tempting to propose that
> protocols should use ASCII-encoded Unicode a la IDNA/IMAA because it
> is used elsewhere. The argument is that IDNA/IMAA will "leak" into
> the protocol, so you better handle it somehow.

In protocols that are accessible to legacy software, ACE forms *will*
leak in, and therefore they better be handled correctly.

In new protocols that are accessible only to new software, it would be
possible to require that software to act as a gatekeeper to prevent ACE
forms from leaking in. That might make life easier for internal agents
that don't normally interact with users, because they wouldn't need to
have ToASCII and ToUnicode. User-visible agents, however, would still
want those operations, to handle cases where users manually copy ACE
forms from other applications.

AMC
John Cowan
2003-10-27 01:58:37 UTC
Adam M. Costello scripsit:

> In any case, I think it would be good if receivers do not assume that
> text is already normalized; they should perform normalization whenever
> they want text to be normalized. Then it will not be necessary for
> senders to perform normalization. The implementation cost is the
> same whether the code is inside the senders or inside the receivers.

This is true for person-to-person email, but not for mailing list
postings, which are formally equivalent. There, the efficiency is
much superior if senders MUST normalize and receivers SHOULD check
normalization, checking being much less costly than normalizing in
most cases.

> Forbidding unnormalized text on the wire doesn't do much except to make
> troubleshooting more difficult for humans, who cannot see the difference
> between normalized and unnormalized text. It's not as if unnormalized
> text is ambiguous. It makes sense to forbid ambiguous constructions
> like "***@foo@bar", but unnormalized text is not ambiguous--you
> can determine exactly what was intended by normalizing it yourself.

But if it is signed, you will destroy the digital signature. What is
worse, a sender of unnormalized text may be able to spoof a receiver.
See the W3C's CharMod draft (http://www.w3.org/TR/charmod) for details.

--
What is the sound of Perl? Is it not the John Cowan
sound of a [Ww]all that people have stopped ***@reutershealth.com
banging their head against? --Larry http://www.ccil.org/~cowan
Paul Hoffman / IMC
2003-10-27 02:09:02 UTC
At 8:58 PM -0500 10/26/03, John Cowan wrote:
>> In any case, I think it would be good if receivers do not assume that
>> text is already normalized; they should perform normalization whenever
>> they want text to be normalized. Then it will not be necessary for
>> senders to perform normalization. The implementation cost is the
>> same whether the code is inside the senders or inside the receivers.
>
>This is true for person-to-person email, but not for mailing list
>postings, which are formally equivalent. There, the efficiency is
>much superior if senders MUST normalize and receivers SHOULD check
>normalization, checking being much less costly than normalizing in
>most cases.

It is fairly "costly" both for the sender and for the receiver if the
receiver (which is an MTA, not an MUA) bounces messages because the
sender didn't use the same form as the receiving MTA demanded. That
is, the sender would see an error that he/she could not figure out,
and the receiver would, well, never receive.

Or are you saying that all receivers should have aliases for all
possible un-normalized versions into the intended mailbox name?

>> Forbidding unnormalized text on the wire doesn't do much except to make
>> troubleshooting more difficult for humans, who cannot see the difference
>> between normalized and unnormalized text. It's not as if unnormalized
>> text is ambiguous. It makes sense to forbid ambiguous constructions
>> like "***@foo@bar", but unnormalized text is not ambiguous--you
>> can determine exactly what was intended by normalizing it yourself.
>
>But if it is signed, you will destroy the digital signature.

The mailbox name is kept in the headers, which are not signed.

> What is
>worse, a sender of unnormalized text may be able to spoof a receiver.

Could you give an example?

>See the W3C's CharMod draft (http://www.w3.org/TR/charmod) for details.

Maybe I'm dense, but I don't see any spoofing examples there that
would apply to Internet mail.

--Paul Hoffman, Director
--Internet Mail Consortium
Adam M. Costello
2003-10-27 03:35:14 UTC
John Cowan <***@mercury.ccil.org> wrote:

> > In any case, I think it would be good if receivers do not assume
> > that text is already normalized; they should perform normalization
> > whenever they want text to be normalized. Then it will not be
> > necessary for senders to perform normalization. The implementation
> > cost is the same whether the code is inside the senders or inside
> > the receivers.
>
> This is true for person-to-person email, but not for mailing list
> postings, which are formally equivalent. There, the efficiency is
> much superior if senders MUST normalize and receivers SHOULD check
> normalization, checking being much less costly than normalizing in
> most cases.

When I say "implementation cost", I'm not talking about machine cycles.
Cycles are cheap and getting cheaper all the time. I'm talking about
the cost of writing the code. Sender and receiver implementations are
generally one-to-one. Even if there are many receive events for every
send event (as in one-to-many communication), each side gets written
once. The implementation cost of normalization is the same whether the
normalization code is written into the sender or the receiver.

If you want to conserve machine cycles, try this model: Senders SHOULD
normalize, and receivers MUST accept unnormalized text. Receivers can
check whether the text is already normalized (which is cheap, as you
say) and avoid doing the normalization in most cases.
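
A sketch of that model on the receiving side, assuming Python 3.8+ for
unicodedata.is_normalized(): check first (cheap), normalize only when the
check fails.

    import unicodedata

    def accept(text: str) -> str:
        if unicodedata.is_normalized('NFC', text):
            return text                              # common case: no work at all
        return unicodedata.normalize('NFC', text)    # rare case: fix it on receipt

    print(accept('Ren\u00e9') == 'Ren\u00e9')        # already NFC, returned as-is
    print(accept('Rene\u0301') == 'Ren\u00e9')       # True: normalized on receipt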

> But if it is signed, you will destroy the digital signature.

Only if I destroy the original unnormalized text, which I'm under no
obligation to do. I can construct a normalized copy of the text to make
it easier for me to interpret the text, and leave the signature attached
to the original unnormalized text. Depending on the situation, I might
soon discard the copy and keep the original & signature, or I might keep
the copy and discard the original & signature, or keep both, or discard
both.

> What is worse, a sender of unnormalized text may be able to spoof a
> receiver.

The receiver will be fooled only if it blindly assumes that the text
is normalized. I don't think receivers should make that assumption.
They can't be fooled if they force the text to be normalized whenever it
matters.

AMC
Arnt Gulbrandsen
2003-10-27 10:23:46 UTC
Adam M. Costello writes, about receiving signed text:
> Only if I destroy the original unnormalized text, which I'm under no
> obligation to do. I can construct a normalized copy of the text to
> make it easier for me to interpret the text, and leave the signature
> attached to the original unnormalized text. Depending on the
> situation, I might soon discard the copy and keep the original &
> signature, or I might keep the copy and discard the original &
> signature, or keep both, or discard both.

Oh please, not _another_ obstacle to signature support. The state is bad
enough as it is - we really don't need more obstacles.

--Arnt
Paul Hoffman / IMC
2003-10-27 18:01:09 UTC
At 11:23 AM +0100 10/27/03, Arnt Gulbrandsen wrote:
>Oh please, not _another_ obstacle to signature support. The state is
>bad enough as it is - we really don't need more obstacles.

Again: this has nothing to do with "signature support". The headers
in email messages are *not* covered by the signature (or by
encryption).

--Paul Hoffman, Director
--Internet Mail Consortium
Arnt Gulbrandsen
2003-10-28 10:03:23 UTC
Paul Hoffman / IMC writes:
> At 11:23 AM +0100 10/27/03, Arnt Gulbrandsen wrote:
>> Oh please, not _another_ obstacle to signature support. The state is
>> bad enough as it is - we really don't need more obstacles.
>
> Again: this has nothing to do with "signature support". The headers in
> email messages are *not* covered by the signature (or by encryption).

Sorry, I misunderstood then. I thought this (also?) applied to body text.

But my point stands (in the case of sanely designed MUAs). When a
message/rfc822 is contained by a multipart/signed, a header is
signed/encrypted. Any code that treats a top-level message like a
message/rfc822 bodypart will need to deal with the
signature/normalization problems for the header as well as for the
body.

--Arnt
Paul Hoffman / IMC
2003-10-28 15:08:32 UTC
At 11:03 AM +0100 10/28/03, Arnt Gulbrandsen wrote:
>Paul Hoffman / IMC writes:
>>At 11:23 AM +0100 10/27/03, Arnt Gulbrandsen wrote:
>>>Oh please, not _another_ obstacle to signature support. The state
>>>is bad enough as it is - we really don't need more obstacles.
>>
>>Again: this has nothing to do with "signature support". The headers
>>in email messages are *not* covered by the signature (or by
>>encryption).
>
>Sorry, I misunderstood then. I thought this (also?) applied to body text.

The specification says in section 2 "for example, a mail address
appearing in the plain text body of a message is not occupying a mail
address slot". If there are other parts of the specification that you
find confusing about changes to the body, by all means please let us
know so we can clarify them.

>But my point stands (in the case of sanely designed MUAs). When a
>message/rfc822 is contained by a multipart/signed, a header is
>signed/encrypted. Any code that treats a top-level message like a
>message/rfc822 bodypart will need to deal with the
>signature/normalization problems for the header as well as for the
>body.

Sorry, we fully disagree. Changing the body of a message is a
terrible idea for many reasons, security being one of them.

--Paul Hoffman, Director
--Internet Mail Consortium
John Cowan
2003-10-27 15:36:03 UTC
Adam M. Costello scripsit:

> The receiver will be fooled only if it blindly assumes that the text
> is normalized. I don't think receivers should make that assumption.
> They can't be fooled if they force the text to be normalized whenever it
> matters.

Oh yes they can, and all the worse. Consider the classical birthday attack:
see http://www.x5.net/faqs/crypto/q96.html for details if you need them.
It depends on the ability to generate 2^(n/2) variants (where n is the
number of bits in the n-bit signature hash function) of a message to be
used for the spoof. Typically this involves things like altering whitespace,
but a close inspection will detect these.

If we play with Unicode canonical equivalence in a world where receivers
normalize, however, we can create variants that are quite undetectable by
the receiver. Typical German text contains about 5% accented characters,
so a 20K message can be given 2^1000 variants, more than enough to break
reasonable hash functions.
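
A small illustration of the underlying observation, assuming Python's
hashlib and unicodedata (the German word is just an example): two
canonically equivalent spellings look identical to a reader, yet hash
differently unless the text is normalized before hashing.

    import hashlib, unicodedata

    a = 'gepr\u00fcft'        # precomposed u-umlaut (U+00FC)
    b = 'gepru\u0308ft'       # 'u' followed by U+0308 COMBINING DIAERESIS
    print(unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b))  # True
    print(hashlib.sha1(a.encode('utf-8')).hexdigest()
          == hashlib.sha1(b.encode('utf-8')).hexdigest())                      # False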

--
John Cowan ***@reutershealth.com www.ccil.org/~cowan www.reutershealth.com
"The competent programmer is fully aware of the strictly limited size of his own
skull; therefore he approaches the programming task in full humility, and among
other things he avoids clever tricks like the plague." --Edsger Dijkstra
Adam M. Costello
2003-10-27 19:08:48 UTC
John Cowan <***@reutershealth.com> wrote:

> > The receiver will be fooled only if it blindly assumes that the text
> > is normalized. They can't be fooled if they force the text to be
> > normalized whenever it matters.
>
> Oh yes they can, and all the worse. Consider the classical birthday
> attack: see http://www.x5.net/faqs/crypto/q96.html for details if you
> need them. It depends on the ability to generate 2^(n/2) variants
> (where n is the number of bits in the n-bit signature hash function)
> of a message to be used for the spoof.
>
> If we play with Unicode canonical equivalence in a world where
> receivers normalize, however, we can create variants that are quite
> undetectable by the receiver.

A cryptographic hash is considered secure if it is infeasible to
reconstruct any part of the message given its hash value and it is
infeasible to find any two messages with the same hash value. That's
*any* two messages, not just two messages that are somehow equivalent.
If the attacker cannot find any two messages with the same hash value,
then he cannot find two equivalent messages with the same hash value
either.

Besides, even if you could find two canonically equivalent messages
(that is, two messages that become equal when normalized) with the same
hash value, how could you base an attack on that? You could substitute
one message for the other without being detected, but if they mean the
same thing, what harm is done? For an effective attack, wouldn't you
want two truly different messages with the same hash value?

> Typical German text contains about 5% accented characters, so a 20K
> message can be given 2^1000 variants, more than enough to break
> reasonable hash functions.

Just because 2^1000 variants exist doesn't mean you can enumerate them
and compute the hash value of each one. I think there are proofs that
doing so would require more energy than is thought to exist in the
universe.

SHA-1 has 160 bits. The computing resources to compute 2^80 hashes
won't exist for another 30 years or so, assuming Moore's law continues
that long. By that time people will have hopefully stopped accepting
signatures based on 160-bit hashes and demand longer ones. 256-bit and
512-bit versions of SHA have already been defined.

AMC
John Cowan
2003-10-27 20:33:14 UTC
Adam M. Costello scripsit:

> Besides, even if you could find two canonically equivalent messages
> (that is, two messages that become equal when normalized) with the same
> hash value, how could you base an attack on that? You could substitute
> one message for the other without being detected, but if they mean the
> same thing, what harm is done? For an effective attack, wouldn't you
> want two truly different messages with the same hash value?

Yes. The method of the birthday attack is that you generate 2^(n/2) variants
of the message you wish to forge, and 2^(n/2) variants of an innocuous message.
By the birthday paradox, some pair will match hash values. You then convince
Alice to sign the chosen variant of the innocuous message (perhaps a
petition for some cause), copy the signature to the corresponding variant of
the message to be forged, and forward it to Bob.

> SHA-1 has 160 bits. The computing resources to compute 2^80 hashes
> won't exist for another 30 years or so, assuming Moore's law continues
> that long. By that time people will have hopefully stopped accepting
> signatures based on 160-bit hashes and demand longer ones.

However, old contracts will still exist and their authenticity may need to be
verified.

--
John Cowan ***@reutershealth.com www.ccil.org/~cowan www.reutershealth.com
"The competent programmer is fully aware of the strictly limited size of his own
skull; therefore he approaches the programming task in full humility, and among
other things he avoids clever tricks like the plague." --Edsger Dijkstra
Adam M. Costello
2003-10-27 22:35:39 UTC
John Cowan <***@reutershealth.com> wrote:

> The method of the birthday attack is that you generate 2^(n/2)
> variants of the message you wish to forge, and 2^(n/2) variants of an
> innocuous message. By the birthday paradox, some pair will match hash
> values.

Ah, thanks. Still, I think the secure defense against that attack is to
use a hash function with a great enough n to make the attack infeasible,
not to rely on the message format to prevent variants.

> > The computing resources to compute 2^80 hashes won't exist for
> > another 30 years or so, assuming Moore's law continues that long.
> > By that time people will have hopefully stopped accepting signatures
> > based on 160-bit hashes and demand longer ones.
>
> However, old contracts will still exist and their authenticity may
> need to be verified.

Good point. If I had a contract with a signature that looked like it
might become forgeable before the contract expired, I'd ask the signer
to re-sign it with a stronger signature. If they refused, I'd sue them
soon, before new technology appeared that would allow them to plausibly
claim that the old signature was forged.

AMC
Roy Badami
2003-10-27 23:24:54 UTC
I know this is going off-topic, but FWIW I think both Adam Costello
and John Cowan have valid points here. (Disclaimer: IANAC)

Adam's definition of a cryptographically strong hash tallies with my
understanding. If it is computationally feasible (by exploiting the
birthday paradox or otherwise) to generate two strings that hash to
the same value, then the hash function is not cryptographically
strong. And what John is proposing is somewhat more computationally
intensive than simply generating two random strings that hash to the
same value. Once the hash function fails to satisfy the traditional
definition, you almost certainly have many other problems besides the
one John describes.

On the other hand, cryptography is all about prudent design choices.
Since you can never prove that a cryptographic system is secure, you
always want to design it to be as secure as possible. (ie you're
never _sure_ it's secure enough, so the last thing you want to do is
to weaken it unnecessarily.)

Signing normalized text would seem to be a prudent design choice, even
though signing unnormalized text is still secure, provided the hash is
secure. If the other costs associated with this approach aren't
prohibitive, it would seem to be the correct course of action.

-roy
Paul Hoffman / IMC
2003-10-27 23:56:03 UTC
At 11:24 PM +0000 10/27/03, Roy Badami wrote:
>I know this is going off-topic, but FWIW I think both Adam Costello
>and John Cowan have valid points here. (Disclaimer: IANAC)

You do not need to be a cryptographer here. As I have said in
multiple posts: the headers are not signed. We can talk about this
until we are blue in our collective faces, but it is irrelevant.

For those folks who don't understand this, please read RFC 1847. You
will see that the part of the message that is signed is just the
body, not the header. This is good, because RFC 2821 *requires* the
header to be changed at every MTA. Arguing "nothing should change the
header because it is signed" is clearly just silly.

--Paul Hoffman, Director
--Internet Mail Consortium
Dan Oscarsson
2003-10-27 12:17:32 UTC
Adam M. Costello wrote:

>
>> IDNA do not support all international domain names due to being made
>> to work using unaware DNS servers and clients.
>
>What do you mean by "all international domain names"? There was no
>such thing as an "internationalized domain name" until the IETF defined
>that term. The definition appears in the IDNA spec. Therefore, by
>definition, IDNA supports "all internationalized domain names". You
>must have some other definition in mind. What is it, and why is it a
>problem that IDNA does not support that definition?
>
>> As the IMAA draft stands today it will not handle all e-mail
>> addresses.
>
>Same question. There is no such thing as an internationalized mail
>address until we define it.

By international domain names I mean domain names containing non-ASCII
characters. The same for e-mail addresses.

The problem with IDNA and IMAA is that the definition is given in terms
of an ASCII form and the rules applied when converting to ASCII.

A domain name with mixed case is an international domain name, but that
is not covered by IDNA's definition.
IMAA also makes changes to the e-mail address, resulting in a subset of
the possible international e-mail addresses.

I think an international e-mail address or domain name should be defined
in terms of character semantics, not ASCII encoding rules.

>And UTF-8 headers will not be very popular before enough clients *and*
>servers can handle it. If speed of deployment is an issue, it looks
>pretty clear to me that the "in applications" approach has the edge.

By fixing my MTA or DNS server I can support both legacy and
internationalised applications, as well as give transition support
for applications. New applications do not need to handle the legacy
forms, as the enhanced MTA/DNS server does the up/downgrading for them.

>
>> To simplify handling of UTF-8 text, the standard should require Unicode
>> normalisation form C (NFC).
>
>In any case, I think it would be good if receivers do not assume that
>text is already normalized; they should perform normalization whenever
>they want text to be normalized. Then it will not be necessary for
>senders to perform normalization. The implementation cost is the
>same whether the code is inside the senders or inside the receivers.

The implementation cost is much lower if we agree to use the same format
"on the wire". At each end we only need to implement translation between
the "on the wire" format and the "internal" format.
By having many "on the wire" formats, the code gets much more complex and
the CPU and memory needs increase.
I think the W3C has written some information about "early" normalisation.
Also, as messages may pass through many systems/applications on the way
from sender to destination, requiring normalised data means that only the
sender, and possibly the final recipient (if it does not want to work with
normalised data), needs to perform normalisation. The hops in between need
not do additional normalisation work, and if they want to do some
filtering using some internal format they only have one "on the wire"
format to convert from.

So I can clearly see that the implementation cost is much lower, and the
use of system resources is lower, if we only have one "on the wire" format.

Dan
John C Klensin
2003-10-27 14:56:35 UTC
Dan, Adam,

I'm tempted to stay out of this and take a "you and him fight"
approach. But, in the interest of focusing the
discussion, it seems to me that the issue of ease of
implementation and ease of deployment depends entirely on where
in the system one looks, makes predictions or measurements, and
what one cares about.

For example, if one is trying to require changes in the minimum
number of modules, and to have something that can get through,
somehow, to unconverted systems, then Adam is, I think, right.
But Adam's approach requires one other assumption, which is that
"my responsibilities end with getting the message into, across,
and out of the network". That isn't an irrational position.

But I, and I think you, are postulating a different condition
for success, a much more difficult one. In my version, if the
end user doesn't see her local characters, and the local
characters of the message, correctly, then, whatever we have
done, we haven't provided internationalization. Indeed, we may
have made things worse: my Chinese colleagues don't like writing
their names as some transliterated or phonetic ASCII strings,
especially in communicating with each other, but those strings do
at least have significant mnemonic value. Giving them Chinese characters
would be great, and is the desired target. But forcing them to
see, or use, or even to cope with some complex, non-Chinese
encoding that has no mnemonic value is probably a step
backwards. So I put pretty strong value in avoiding that case.

Similarly, one of the things we have observed about email over
the years is that messages that say, somewhere in the body, "my
email address is fooXbar+***@bogus.domain.name" or, worse,
"George's address is..." are pretty common. Adam's approach
assumes, I think, that one doesn't need to worry about all of
the issues associated with such embedded addresses in order to
design an email address protocol. I disagree -- I think those
situations are important in practice, and that having them
"work" had best not depend on either heuristics that search
through a message trying to figure out what is, or is not, an
email address or on the user typing in
funny-***@punycode-encoding.example.com. The transport
approach says that the email addresses get written into message
bodies in whatever the character set of the message body is.
And, I assume that, if the MUA can't handle the relevant
character set, then, in practice, it is all gibberish anyway and
maybe no one cares.

So, with Adam's constraints and success criteria (at least as
I'm imagining them) the "native characters mapped to UTF-8"
approach is vastly more complicated. The transport needs
tuning, and everything in the path has to be willing to play to
get i18n addresses through. The MUA has to be i18n-compliant at
a fairly high standard. The MTA-MUA interface has to be tuned a
bit. IMAP and POP need to be adjusted to pass the right stuff
around (a topic that draft-klensin-emailaddr-i18n-01 doesn't
address, but -02 should). He is absolutely correct: if that has
to be done de novo, it is a _Big_ deal. Certainly it requires a
lot of protocol changes as well as a lot of code changes if we
were starting from scratch.

However, with my constraints and view of the same problem,
Adam's solution is "finished" only when the users are seeing an
i18n environment. At that point, a variation of your argument
sets in. Once we conclude that the job hasn't been done, and
that it is all pointless, until the infrastructure (not just the
MTAs, but also the MUAs, the presentation layer, etc.) is
upgraded, and therefore don't count those costs, one gets to
look only at the marginal cost of address upgrading, and, well,
that marginal cost is pretty low. It would be fair of Adam to
argue that this is a very strange way to define the problem or
its measurement. But, at some level, it is reasonable, too.

So I think, personally, that the important questions are about
the total resources, code changes, deployment costs, etc.,
required to get it right --as seen by the end user (which
almost certainly involves seeing "native" characters and little
or no leakage of internal codings). If one particular approach
doesn't get it right, defined in those user terms, then how much
easier/ cheaper/ faster it can be implemented is really not a
terribly interesting question. And "right" may be a matter of
user perception or religion, unfortunately.

john


Mark Davis
2003-10-27 16:50:56 UTC
I have always viewed IDNA as a (clever) way to work around current
limitations and relatively quickly allow people to create domain
names in their own languages, instead of continuing to impose English
restrictions on them. And the same technique can probably be used in a
number of other areas for the same purpose.

But I agree with John that the end goal should be to upgrade protocols to allow
the use of Unicode (as represented in the UTF-8 encoding scheme) from end to
end, without having to go through special-purpose transformations.

Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄
