Discussion:
First strawman for UTF-8 headers proposal
Paul Hoffman / IMC
2003-11-26 18:57:57 UTC
Greetings again. At the Minneapolis meeting, I proposed that if
people were interested in John's proposal to encode the addresses in
a message as UTF-8, they might be interested in making all the
headers UTF-8. (The proposal was initially sparked by Pete Resnick.)
Following the thread from the past few weeks, I have come up with the
following strawman. If no one finds any huge problems with this, I'll
turn it into an Internet Draft in a few weeks.

All comments welcome!

--Paul Hoffman



- The dual motivations are to allow UTF-8 everywhere in the headers and
to not bounce any messages just because they originated with UTF-8
headers.

- Allows current users who have all-ASCII mailbox names to step up
to UTF-8 headers easily.

- Updated sending MUAs will create all headers in UTF-8.

- Transmission is protected by a new ESMTP command, UTF-8-HEADERS.

- Everyone who has a UTF-8 mailbox name MUST also have an all-ASCII
mailbox name that is equivalent.

- The terminal SMTP server is responsible for knowing whether or not the
message store can handle UTF-8 headers.

- If a receiving SMTP server does not support UTF-8-HEADERS, the sending
SMTP client downgrades all headers and continues to send the message.

- Free text fields are downgraded using quoted-printable encoding; the
charset SHOULD be UTF-8. Downgrading MUST only be done when necessary.

- Downgrading email addresses that only contain UTF-8 in the domain name
is done with IDNA.

- For every address in a message with a UTF-8 mailbox name, the mail
initiator tries to create a mapping in a new header, Address-maps:. A
message only has one Address-map: header; the header has a string of
maps. The header is only for addresses that have a UTF-8 mailbox name;
it SHOULD NOT be used for addresses that have all-ASCII mailbox names,
even if those addresses have UTF-8 domain names.

- If the initiator has a UTF-8 mailbox name, the initiator MUST also
have an all-ASCII mailbox, and the all-ASCII address MUST appear in the
map header.

- If the initiator knows the mapping for any recipient (through caching
or an address book), they SHOULD put it in the map header. If they
don't include a mapping and the message hits a non-UTF-8-HEADERS
SMTP server, the message will bounce.

- The Address-map: header is downgraded using Base64 for mailbox
names, IDNA for domain names.

- Example:
Address-map: José@example.com,jose-***@example.com;
törbjø***@fältström.se,***@fältström.se
If passed to a non-UTF-8-HEADERS system, this header gets downgraded
to:
Address-map: Sm9zw6k=@example.com,jose-***@example.com;
dMO2cmJqw7hybg==@xn--fltstrm-5wa1o.se,***@xn--fltstrm-5wa1o.se

- Intermediate SMTP servers MAY change the values in the Address-map:
header (such as to add one that is missing or to correct a mapping), but
SHOULD only do so for local domains. This might be a bad idea and might
be removed.

- Terminal SMTP servers should write messages addressed to either the
UTF-8 address or the all-ASCII address into the same mailbox, but this
is not mandatory.

- POP and IMAP might be updated to allow one request to bring in two or
more mailboxes; otherwise, users will have to do two separate requests.

- Digital certificates for addresses that have UTF-8 LHSs (left-hand
sides) should contain both addresses; this is already supported in PKIX
and OpenPGP.

- Other headers that include mailbox names and domain names will need
further definition for downgrading.

- MUAs are encouraged to cache address mappings they see, probably with
a user-settable time-to-live.

- Terminal SMTP servers MAY look into the headers of a message to
determine whether they should upgrade a downgraded set of headers to
UTF-8. This is easy to determine: if the Address-map: header contains
only ASCII, it was downgraded. Upgrading is particularly useful on
bounce messages caused by bad mappings.

- It might be good to have a protocol for determining mappings, but it
is not defined here.

- It might be better to have just one mapping per Address-map: header
and have multiple Address-map: headers per message.
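As a concrete illustration of the Base64 downgrading step in the strawman, here is a minimal encoder sketch; the function name `b64_encode` is mine, not part of the proposal. Feeding it the five UTF-8 octets of "José" (4A 6F 73 C3 A9) yields the "Sm9zw6k=" form shown in the Address-map example.

```c
#include <stddef.h>

/* Minimal Base64 encoder (RFC 2045 alphabet), sketching how a UTF-8
   mailbox name would be downgraded.  dst needs 4*((len+2)/3)+1 bytes
   and is NUL-terminated. */
static const char b64tab[] =
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";

void b64_encode(const unsigned char *src, size_t len, char *dst)
{
    size_t i = 0;
    for (; i + 3 <= len; i += 3) {      /* full 3-octet groups */
        *dst++ = b64tab[src[i] >> 2];
        *dst++ = b64tab[((src[i] & 0x03) << 4) | (src[i + 1] >> 4)];
        *dst++ = b64tab[((src[i + 1] & 0x0F) << 2) | (src[i + 2] >> 6)];
        *dst++ = b64tab[src[i + 2] & 0x3F];
    }
    if (i < len) {                      /* one or two octets remain */
        *dst++ = b64tab[src[i] >> 2];
        if (len - i == 1) {
            *dst++ = b64tab[(src[i] & 0x03) << 4];
            *dst++ = '=';
        } else {
            *dst++ = b64tab[((src[i] & 0x03) << 4) | (src[i + 1] >> 4)];
            *dst++ = b64tab[(src[i + 1] & 0x0F) << 2];
        }
        *dst++ = '=';
    }
    *dst = '\0';
}
```

The domain part of an address would instead go through IDNA, which is not sketched here.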


--Paul Hoffman, Director
--Internet Mail Consortium
Simon Josefsson
2003-11-27 19:50:27 UTC
Post by Paul Hoffman / IMC
All comments welcome!
- The dual motivations are to allow UTF-8 everywhere in the headers and
to not bounce any messages just because they originated with UTF-8
headers.
...
Post by Paul Hoffman / IMC
- If a receiving SMTP server does not support UTF-8-HEADERS, the sending
SMTP client downgrades all headers and continues to send the message.
Following the example of 8BITMIME, I believe implementations should be
allowed to bounce messages if they do not implement the fallback
mechanism. Otherwise, in 20 years, all systems would still be forced
to implement a downgrade mechanism that nobody uses or tests. Users
will require that implementors support downgrading today, but
eventually they won't have to bother with it.
Post by Paul Hoffman / IMC
- Free text fields are downgraded using quoted-printable encoding; the
charset SHOULD be UTF-8. Downgrading MUST only be done when necessary.
Does this intentionally forbid non-QP RFC 2047 encodings? E.g.,
strings like =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=.
Post by Paul Hoffman / IMC
- The Address-map: header is downgraded using Base64 for mailbox
names, IDNA for domain names.
If passed to a non-UTF-8-HEADERS system, this header gets downgraded
It might be nice to use the RFC 2047 encoding instead, so that the
header is rendered properly in MIME-aware clients. It would make
cut'n'paste of non-ASCII email addresses possible, even from MUAs that
don't support this new standard. Qualify it to MUST use the UTF-8
charset and the "B" encoding if you wish. A possible disadvantage
would be if gateways convert RFC 2047 data from one charset to
another, although I think the advantages are larger.
Post by Paul Hoffman / IMC
- Other headers that include mailbox names and domain names will need
further definition for downgrading.
Here there may be dragons. There are many headers, standard and
non-standard ones, that contain mailboxes, although without using the
RFC 2822 BNF 'mailbox'. References: is one. Various List-* headers
are others.

In general, I think the Address-map idea needs some further pondering,
especially with regard to modifying it in transit and populating it
from address book caches, but also the encoding.

Thanks,
Simon
Paul Hoffman / IMC
2003-11-29 20:43:20 UTC
Post by Simon Josefsson
Post by Paul Hoffman / IMC
All comments welcome!
- The dual motivations are to allow UTF-8 everywhere in the headers and
to not bounce any messages just because they originated with UTF-8
headers.
...
Post by Paul Hoffman / IMC
- If a receiving SMTP server does not support UTF-8-HEADERS, the sending
SMTP client downgrades all headers and continues to send the message.
Following the example of 8BITMIME, I believe implementations should be
allowed to bounce messages if they do not implement the fallback
mechanism. Otherwise, in 20 years, all systems would still be forced
to implement a downgrade mechanism that nobody uses or tests. Users
will require that implementors support downgrading today, but
eventually they won't have to bother with it.
This sounds good. One thing I didn't say in the first message, which
I probably should have, is that it is a fairly-obvious extension of
8BITMIME. All the lessons we have learned in the past decade (!) from
8BITMIME should be applied with whatever I propose here.
Post by Simon Josefsson
Post by Paul Hoffman / IMC
- Free text fields are downgraded using quoted-printable encoding; the
charset SHOULD be UTF-8. Downgrading MUST only be done when necessary.
Does this intentionally forbid non-QP RFC 2047 encodings? E.g.,
strings like =?ISO-8859-2?B?dSB1bmRlcnN0YW5kIHRoZSBleGFtcGxlLg==?=.
That was my intention. Maybe it is too drastic.
Post by Simon Josefsson
Post by Paul Hoffman / IMC
- The Address-map: header is downgraded using Base64 for mailbox
names, IDNA for domain names.
If passed to a non-UTF-8-HEADERS system, this header gets downgraded
It might be nice to use the RFC 2047 encoding instead, so that the
header is rendered properly in MIME-aware clients. It would make
cut'n'paste of non-ASCII email addresses possible, even from MUAs that
don't support this new standard. Qualify it to MUST use the UTF-8
charset and the "B" encoding if you wish. A possible disadvantage
would be if gateways convert RFC 2047 data from one charset to
another, although I think the advantages are larger.
As long as we choose UTF-8 for the inner encoding of the QP, that's
OK with me. It just seemed like extra characters, but I'm open to
either. (But I'm not open to "the inner encoding can be anything"
because that leads to the same lack of interop we are battling now.)
Post by Simon Josefsson
Post by Paul Hoffman / IMC
- Other headers that include mailbox names and domain names will need
further definition for downgrading.
Here there may be dragons. There are many headers, standard and
non-standard ones, that contain mailboxes, although without using the
RFC 2822 BNF 'mailbox'. References: is one. Various List-* headers
are others.
Standard ones we can deal with easily (as long as we get all of
them). We will need to have a single way of changing non-standard
names.
Post by Simon Josefsson
In general, I think the Address-map idea needs some further pondering,
especially with regard to modifying it in transit and populating it
from address book caches, but also the encoding.
Fully agree.

--Paul Hoffman, Director
--Internet Mail Consortium
Keith Moore
2003-12-01 01:35:18 UTC
One thing I didn't say in the first message, which I probably should
have, is that it is a fairly-obvious extension of 8BITMIME. All the
lessons we have learned in the past decade (!) from 8BITMIME should be
applied with whatever I propose here.
I'm not sure how much the 8BITMIME experience applies. At the time
8BITMIME was adopted, the Internet was much smaller, there were many
fewer UAs, MTAs, and other mail-handling tools (and thus less variety),
messages travelled a simpler path (fewer firewalls, spam filters, virus
checkers, etc.) and the vast majority of Internet users still spoke
English - though this was quickly changing.

Also, 8BITMIME was a much less drastic change than negotiation of UTF-8
would be now. Partially this is because many mail readers in use at
the time of 8BITMIME introduction were still intended for use with text
terminals or terminal emulators (so MUAs that copied 8bit text to the
screen often "did the right thing" even if only by accident, or because
the user had configured the terminal emulator to use the right
charset). Partially this is because MTAs and other intermediaries that
predated 8BITMIME generally did not look at message bodies - they
looked only at the headers of messages that transited their systems.
Since the headers of 8bit MIME messages are still ASCII, supporting
8BITMIME didn't necessarily require any change to a tool's
header-parsing code.

One simple example. Bernstein and others have pointed out that it's
easier to parse header fields with address lists from the right to the
left rather than from the left to the right, because this requires less
lookahead. It's still possible to do this with UTF-8 (particularly if
you do lexical analysis left-to-right and parsing right-to-left), but
it's probably not a trivial change to existing code.
Martin Duerst
2004-01-02 22:20:28 UTC
Hello Keith,
Post by Keith Moore
One simple example. Bernstein and others have pointed out that it's
easier to parse header fields with address lists from the right to the
left rather than from the left to the right, because this requires less
lookahead. It's still possible to do this with UTF-8 (particularly if you
do lexical analysis left-to-right and parsing right-to-left), but it's
probably not a trivial change to existing code.
Can you give more details? As long as lexing or parsing treats anything
non-ascii the same, things shouldn't change at all (as long as the code
is 8-bit clean). If different non-ASCII characters have to lex or parse
differently, then you have to use tables, do some conversion, or do some
hand-coding with a byte-by-byte approach, and the complexity of this is
virtually the same whether you go one way or the other. If you already
have the UTF-8 forward code, then that's not trivial to change to
reverse scanning code. But if you only have ASCII, the changes to move
to UTF-8 are about the same for both directions, except that you
probably have a bigger chance to find already existing code that
goes forward.


Regards, Martin.
Keith Moore
2004-01-03 02:53:31 UTC
Post by Martin Duerst
Hello Keith,
Post by Keith Moore
One simple example. Bernstein and others have pointed out that it's
easier to parse header fields with address lists from the right to
the left rather than from the left to the right, because this
requires less lookahead. It's still possible to do this with UTF-8
(particularly if you do lexical analysis left-to-right and parsing
right-to-left), but it's probably not a trivial change to existing
code.
Can you give more details?
yes. when parsing ASCII you can look at one octet at a time. so when
parsing

To: Martin Duerst <***@w3.org>

right to left the parser sees ">" then "g", then "r", etc. as soon as
the parser sees ">" it knows that this is a production of the form

[ phrase ] "<" addr-spec ">"

(forgive me for using 822 rather than 2822 - I have never memorized the
latter)

if you're parsing utf-8 then you can't look at one octet at a time -
you first have to parse octets into characters. you can do it, but
it's more of a pain - for instance, you have to do more checking for
boundary conditions. it's certainly not as simple as something like

    if (ptr <= bufstart)
        break;
    c = *ptr--;

i.e. it's not a trivial change to code written to assume that a
character is a fixed width and fits into a single octet.
Post by Martin Duerst
As long as lexing or parsing treats anything
non-ascii the same, things shouldn't change at all (as long as the code
is 8-bit clean). If different non-ASCII characters have to lex or parse
differently, then you have to use tables, do some conversion, or do some
hand-coding with a byte-by-byte approach, and the complexity of this is
virtually the same whether you go one way or the other. If you already
have the UTF-8 forward code, then that's not trivial to change to
reverse scanning code. But if you only have ASCII, the changes to move
to UTF-8 are about the same for both directions, except that you
probably have a bigger chance to find already existing code that
goes forward.
uh, no. not even close. and experience with 2047 indicates that
people don't want to make large changes to their existing codebases.
Keld Jørn Simonsen
2004-01-03 15:33:26 UTC
Post by Keith Moore
Post by Martin Duerst
Hello Keith,
Post by Keith Moore
One simple example. Bernstein and others have pointed out that it's
easier to parse header fields with address lists from the right to
the left rather than from the left to the right, because this
requires less lookahead. It's still possible to do this with UTF-8
(particularly if you do lexical analysis left-to-right and parsing
right-to-left), but it's probably not a trivial change to existing
code.
Can you give more details?
yes. when parsing ASCII you can look at one octet at a time. so when
parsing
right to left the parser sees ">" then "g", then "r", etc. as soon as
the parser sees ">" it knows that this is a production of the form
[ phrase ] "<" addr-spec ">"
(forgive me for using 822 rather than 2822 - I have never memorized the
latter)
if you're parsing utf-8 then you can't look at one octet at a time -
you first have to parse octets into characters. you can do it, but
it's more of a pain - for instance, you have to do more checking for
boundary conditions. it's certainly not as simple as something like
    if (ptr <= bufstart)
        break;
    c = *ptr--;
i.e. it's not a trivial change to code written to assume that a
character is a fixed width and fits into a single octet.
Well, UTF-8 is made so that all characters in the 7-bit ASCII range
have the same codes as in ASCII, so if your grammar only has ASCII
meta-characters, then you can parse the UTF-8 string as ASCII.
This was a design goal for UTF-8.

Best regards
Keld
Keith Moore
2004-01-03 16:05:56 UTC
Post by Keld Jørn Simonsen
Well, UTF-8 is made so that all characters in the 7-bit ASCII range
have the same codes as in ASCII, so if your grammar only has ASCII
meta-characters, then you can parse the UTF-8 string as ASCII.
no you can't, because none of the octets in a multiple-octet utf-8
character are valid components of an atom.
Pete Resnick
2004-01-03 16:32:03 UTC
Post by Keith Moore
Post by Keld Jørn Simonsen
Well, UTF-8 is made so that all characters in the 7-bit ASCII range
have the same codes as in ASCII, so if your grammar only has ASCII
meta-characters, then you can parse the UTF-8 string as ASCII.
no you can't, because none of the octets in a multiple-octet utf-8
character are valid components of an atom.
One simple example. Bernstein and others have pointed out that it's
easier to parse header fields with address lists from the right to
the left rather than from the left to the right, because this
requires less lookahead. It's still possible to do this with UTF-8
(particularly if you do lexical analysis left-to-right and parsing
right-to-left), but it's probably not a trivial change to existing
code.
If we're talking about "trivial changes to existing code", then yes,
the change is trivial: You add 128-255 to comment, atom, and
quoted-string (or more specifically in 2822, atext, ctext, dtext,
qtext, and text) and you're done. You can still treat the field
contents as octets. And in fact, if your code is just looking for
specials and has an 'else' clause for all the other octets, it might
need no coding changes at all.

Of course, if you want to treat the field as characters, you'll need
some buffering for UTF-8, but you can still parse right-to-left: Any
octet with '10' in the high two bits is a trailing octet in a UTF-8
sequence, and any octet with '11' in the high two bits is the first
octet in a UTF-8 sequence. That change is non-trivial, but that's not
what you asked.
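Pete's bit test translates directly into a backward-stepping routine. The following is a sketch under my own naming (`utf8_prev` is not from the thread), assuming well-formed UTF-8 input:

```c
/* Move back to the start of the previous character in well-formed
   UTF-8: skip octets with '10' in the two high bits (trailing octets)
   until an octet with any other top-bit pattern (a lead octet or a
   plain ASCII octet) is reached. */
const unsigned char *utf8_prev(const unsigned char *p,
                               const unsigned char *start)
{
    if (p <= start)
        return start;
    p--;
    while (p > start && (*p & 0xC0) == 0x80)
        p--;
    return p;
}
```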

pr
--
Pete Resnick <http://www.qualcomm.com/~presnick/>
QUALCOMM Incorporated - Direct phone: (858)651-4478, Fax: (858)651-1102
Keith Moore
2004-01-03 16:52:30 UTC
Post by Pete Resnick
Post by Keith Moore
One simple example. Bernstein and others have pointed out that it's
easier to parse header fields with address lists from the right to
the left rather than from the left to the right, because this
requires less lookahead. It's still possible to do this with UTF-8
(particularly if you do lexical analysis left-to-right and parsing
right-to-left), but it's probably not a trivial change to existing
code.
If we're talking about "trivial changes to existing code", then yes,
the change is trivial: You add 128-255 to comment, atom, and
quoted-string (or more specifically in 2822, atext, ctext, dtext,
qtext, and text) and you're done. You can still treat the field
contents as octets. And in fact, if your code is just looking for
specials and has an 'else' clause for all the other octets, it might
need no coding changes at all.
yes, this will work in some cases, though you might get bitten if some
kinds of atoms (or atext, whatever) can contain utf-8 and other kinds
cannot. it even appears to work for gb18030.
Martin Duerst
2004-01-04 17:41:31 UTC
Post by Keith Moore
If we're talking about "trivial changes to existing code", then yes, the
change is trivial: You add 128-255 to comment, atom, and quoted-string
(or more specifically in 2822, atext, ctext, dtext, qtext, and text) and
you're done. You can still treat the field contents as octets. And in
fact, if your code is just looking for specials and has an 'else' clause
for all the other octets, it might need no coding changes at all.
yes, this will work in some cases, though you might get bitten if some
kinds of atoms (or atext, whatever) can contain utf-8 and other kinds
cannot. it even appears to work for gb18030.
No, it does not at all work for GB18030. GB18030 uses virtually all of
the US-ASCII bytes not only for denoting US-ASCII characters, but also
in the second (or fourth) position when encoding other characters.
See e.g. the section "Structure" in
http://www-106.ibm.com/developerworks/unicode/library/u-china.html?dwzone=unicode

Regards, Martin.
Keith Moore
2004-01-05 00:08:41 UTC
Post by Martin Duerst
Post by Keith Moore
Post by Pete Resnick
If we're talking about "trivial changes to existing code", then yes,
the change is trivial: You add 128-255 to comment, atom, and
quoted-string (or more specifically in 2822, atext, ctext, dtext,
qtext, and text) and you're done. You can still treat the field
contents as octets. And in fact, if your code is just looking for
specials and has an 'else' clause for all the other octets, it might
need no coding changes at all.
yes, this will work in some cases, though you might get bitten if
some kinds of atoms (or atext, whatever) can contain utf-8 and other
kinds cannot. it even appears to work for gb18030.
No, it does not at all work for GB18030. GB18030 uses virtually all of
the US-ASCII bytes not only for denoting US-ASCII characters, but also
in the second (or fourth) position when encoding other characters.
See e.g. the section "Structure" in
http://www-106.ibm.com/developerworks/unicode/library/u-china.html?dwzone=unicode
you're right - I misread this earlier. the four octet sequences are
okay because the 2nd and 4th positions are in the range 30-39 (ascii
digits). but the two octet sequences can have values from 40-7e as
their 2nd octet, which includes several 2822 specials.

even if you're scanning utf-8 you can't just scan individual octets if
any of the characters outside the repertoire have special meaning -
e.g. if full width at sign is taken as an equivalent for '@' then the
scanner needs to be able to recognize this as a multiple octet
character.
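Keith's point can be made concrete with a naive octet scanner; `find_at` is a hypothetical helper of mine, and 0x81 0x40 is a valid GB18030 two-octet character whose second octet happens to be the '@' code. The same scan is safe on UTF-8, since multi-octet UTF-8 characters never reuse octets below 0x80:

```c
#include <stddef.h>

/* Naive octet-at-a-time scan for '@'.  Safe for UTF-8 (multi-octet
   characters use only octets >= 0x80), but on GB18030 text the second
   octet of a two-octet character may fall in 0x40-0x7E, so this scan
   can report a '@' that is really the middle of a character. */
int find_at(const unsigned char *s, size_t len)
{
    for (size_t i = 0; i < len; i++)
        if (s[i] == '@')
            return (int)i;
    return -1;
}
```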

Adam M. Costello
2003-11-28 00:45:25 UTC
Post by Paul Hoffman / IMC
- The dual motivations are to allow UTF-8 everywhere in the headers
and to not bounce any messages just because they originated with UTF-8
headers.
- If the initiator knows the mapping for any recipient (through
caching or an address book), they SHOULD put it in the map header. If
they don't include a mapping and the message hits a non-UTF-8-HEADERS
SMTP server, the message will bounce.
I don't like that bouncing.

I don't understand why there should be such different policies for the
domain part and the local part. For the domain part, your proposal is
willing to downgrade to an ACE rather than bounce the message. We might
as well define an ACE for the local part too, so that there would never
be a need to bounce messages.

On the other hand, your proposal is willing to map local parts, so
that recipients who can't handle non-ASCII local parts can see a
human-friendly (non-ACE) ASCII local part. But they still have to see
an ugly ACE domain part. If the mapping feature is there anyway, we
might as well allow it to be used for the whole address. The proposed
syntax could support that.
Post by Paul Hoffman / IMC
- Updated sending MUAs will create all headers in UTF-8.
What exactly does that mean? There are details to be worked out.
Given an old Foo: header with its old ASCII grammar, what exactly is
the new grammar? Will canonically (or compatibly) equivalent strings
necessarily be parsed the same way?
Post by Paul Hoffman / IMC
- Transmission is protected by a new ESMTP command, UTF-8-HEADERS.
Every protocol that carries messages will need an analogous tagging
mechanism.
Post by Paul Hoffman / IMC
- The terminal SMTP server is responsible for knowing whether or not
the message store can handle UTF-8 headers.
Maybe the message store can handle them (or doesn't care), but what
about the things that retrieve messages from the message store? Or
manipulate messages in the message store? If the message store is a
plain text file, what chaos might ensue? Perhaps the UTF-8 headers
should be segregated somehow, so they don't accidentally fool old
software into thinking it knows what to do with them. For example, they
could use different field names, or they could be inside a shim header.
Post by Paul Hoffman / IMC
- Free text fields are downgraded using quoted-printable encoding;
SHOULD be into UTF-8 charset. Downgrading MUST only be done if
necessary.
I assume you mean encoded-words, which can use either Q or B encoding.

There is something that encoded-words can do that your UTF-8 header
proposal cannot do: encoded-words can indicate the language of the
text. A single field can contain multiple encoded-words, each tagged
with a different language. If one goal of the UTF-8 proposal is
to make encoded-words unnecessary, does the UTF-8 proposal need a
language-tagging capability? Or is this not useful enough to warrant
the added complexity?

Here's one idea I had: Rather than introduce yet another escaping
mechanism (to escape the language tags), extend an existing escaping
mechanism: the folding mechanism. Everywhere a field is folded it could
change the language, something like this:

Subject;en: hello,
  jp: konnichi wa,
  fr: bonjour

(I'm not bothering to use the proper non-ASCII characters, because
that's irrelevant to this example.)

If the language tag is absent, there is no change. The colon is still
required. You can tell which fields use the extended folding mechanism
because they have a semicolon in the field name (which has always been
allowed, but has never appeared in practice).

Because of the explicit terminator (the colon), the extended folding
mechanism can allow folding anywhere, not just before white space:

Subject;en: supercalifra
  :gilisticexpialidocious

But maybe there would still be a recommendation that folding should not
happen within words/atoms when it can be avoided.
Post by Simon Josefsson
It might be nice to use the RFC 2047 encoding instead, so that the
header is rendered properly in MIME-aware clients.
Would it be? I think a MIME-compliant MUA cannot decode things that
look like encoded-words in unrecognized header fields, because the
client has no way of knowing whether encoded-words are allowed in that
field.

AMC
John Cowan
2003-11-28 03:39:02 UTC
Post by Adam M. Costello
If the message store is a
plain text file, what chaos might ensue? Perhaps the UTF-8 headers
should be segregated somehow, so they don't accidentally fool old
software into thinking it knows what to do with them.
Note that "plain text" should not be used as a synonym for "ASCII
plain text"; UTF-8 files are plain text if they contain a sequence of
characters each used for its ordinary meaning, neither binary nor
markup.
Post by Adam M. Costello
There is something that encoded-words can do that your UTF-8 header
proposal cannot do: encoded-words can indicate the language of the
text.
Does anyone actually do anything useful with this part of RFC 2231?
Post by Adam M. Costello
If one goal of the UTF-8 proposal is
to make encoded-words unnecessary, does the UTF-8 proposal need a
language-tagging capability? Or is this not useful enough to warrant
the added complexity?
Language-tagging plain text is rarely necessary. When necessary, it is
rarely sufficient.
--
With techies, I've generally found John Cowan
If your arguments lose the first round http://www.reutershealth.com
Make it rhyme, make it scan http://www.ccil.org/~cowan
Then you generally can ***@reutershealth.com
Make the same stupid point seem profound! --Jonathan Robie
Mark Davis
2003-11-29 00:01:38 UTC
ditto on both

Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

----- Original Message -----
From: "John Cowan" <***@mercury.ccil.org>
To: "IETF IMAA list" <ietf-***@imc.org>
Sent: Thu, 2003 Nov 27 19:39
Subject: Re: First strawman for UTF-8 headers proposal
Post by John Cowan
Post by Adam M. Costello
If the message store is a
plain text file, what chaos might ensue? Perhaps the UTF-8 headers
should be segregated somehow, so they don't accidentally fool old
software into thinking it knows what to do with them.
Note that "plain text" should not be used as a synonym for "ASCII
plain text"; UTF-8 files are plain text if they contain a sequence of
characters each used for its ordinary meaning, neither binary nor
markup.
Post by Adam M. Costello
There is something that encoded-words can do that your UTF-8 header
proposal cannot do: encoded-words can indicate the language of the
text.
Does anyone actually do anything useful with this part of RFC 2231?
Post by Adam M. Costello
If one goal of the UTF-8 proposal is
to make encoded-words unnecessary, does the UTF-8 proposal need a
language-tagging capability? Or is this not useful enough to warrant
the added complexity?
Language-tagging plain text is rarely necessary. When necessary, it is
rarely sufficient.
--
With techies, I've generally found John Cowan
If your arguments lose the first round http://www.reutershealth.com
Make it rhyme, make it scan http://www.ccil.org/~cowan
Make the same stupid point seem profound! --Jonathan Robie
Adam M. Costello
2003-11-29 02:08:49 UTC
Post by John Cowan
If the message store is a plain text file, what chaos might ensue?
Perhaps the UTF-8 headers should be segregated somehow, so they
don't accidentally fool old software into thinking it knows what to
do with them.
Note that "plain text" should not be used as a synonym for "ASCII
plain text"; UTF-8 files are plain text if they contain a sequence
of characters each used for its ordinary meaning, neither binary nor
markup.
Sorry, the "plain text" part was irrelevant to the point I was trying to
make. If the message store is a passive file (as opposed to an active
database) then it is not able to negotiate with the program accessing
it; there is nothing analogous to the proposed ESMTP UTF-8-HEADERS
extension to verify that the recipient (the program accessing the file)
understands the new syntax. If an existing program is pointed at an
mbox file and it sees a header fields named "From", "To", etc, it's
going to assume that the field values ought to obey the RFC-822 syntax
for such fields (which allows only ASCII characters). If the field
values violate that syntax, who knows what will happen?

Hence I think it might be a good idea to use new field-names for the
UTF-8-enabled fields, to reduce the chance of accidentally misleading
old software. If there is an algorithmic way to determine the syntax
of the new-style Foo field given the syntax of the old-style Foo field,
then there should also be a way to algorithmically associate the name of
the new-style Foo field with the name Foo.

AMC
Paul Hoffman / IMC
2003-11-29 20:50:24 UTC
Post by Adam M. Costello
Hence I think it might be a good idea to use new field-names for the
UTF-8-enabled fields, to reduce the chance of accidentally misleading
old software. If there is an algorithmic way to determine the syntax
of the new-style Foo field given the syntax of the old-style Foo field,
then there should also be a way to algorithmically associate the name of
the new-style Foo field with the name Foo.
This might be OK, but it makes message processing by displaying MUAs
and by other applications much harder.

--Paul Hoffman, Director
--Internet Mail Consortium
Keith Moore
2003-12-01 01:19:43 UTC
If you're going to use new field names, you need to include the old
fields (with ASCII equivalent addresses) also, for compatibility with
existing mail handling tools.

And if you're going to do that, you might as well encode the UTF-8
fields somehow, to keep them from causing trouble with existing tools
(though fewer in number) that barf on currently-illegal input even in
header fields that they do not use.

Keith
Post by Adam M. Costello
Hence I think it might be a good idea to use new field-names for the
UTF-8-enabled fields, to reduce the chance of accidentally misleading
old software. If there is an algorithmic way to determine the syntax
of the new-style Foo field given the syntax of the old-style Foo field,
then there should also be a way to algorithmically associate the name of
the new-style Foo field with the name Foo.
Steve Hole
2003-12-01 15:59:30 UTC
Permalink
Post by Keith Moore
And if you're going to do that, you might as well encode the UTF-8
fields somehow, to keep them from causing trouble with existing tools
(though fewer in number) that barf on currently-illegal input even in
header fields that they do not use.
This is really the issue with the UTF-8 proposal. Anything that
*requires* a new SMTP extension is going to take a LONG TIME to deploy on
the internet. You should really seek to keep the solution space in the
land of the MUA and possibly the delivery agent, as much as possible.

Cheers.

---
Steve Hole
Chief Technology Officer - Billing and Payment Systems
ACI Worldwide
<mailto:***@ACIWorldwide.com>
Phone: 780-424-4922
Paul Hoffman / IMC
2003-11-29 21:08:07 UTC
Permalink
Post by Adam M. Costello
Post by Paul Hoffman / IMC
- The dual motivations are to allow UTF-8 everywhere in the headers
and to not bounce any messages just because they originated with UTF-8
headers.
- If the initiator knows the mapping for any recipient (through
caching or an address book), they SHOULD put it in the map header. If
they don't include a mapping and the message hits a non-UTF-8-HEADERS
SMTP server, the message will bounce.
I don't like that bouncing.
I don't understand why there should be such different policies for the
domain part and the local part.
As John Klensin has said before, email message processing is *very*
different than DNS name lookups.
Post by Adam M. Costello
For the domain part, your proposal is
willing to downgrade to an ACE rather than bounce the message. We might
as well define an ACE for the local part too, so that there would never
be a need to bounce messages.
If we add that to my current proposal, then there are *three*
possible names that a mailbox might have; two of them are readable,
one of them isn't. I think that is too complicated.
Post by Adam M. Costello
Post by Paul Hoffman / IMC
- Updated sending MUAs will create all headers in UTF-8.
What exactly does that mean? There are details to be worked out.
Given an old Foo: header with its old ASCII grammar, what exactly is
the new grammar? Will canonically (or compatibly) equivalent strings
necessarily be parsed the same way?
An "old" header has no grammar that produces non-ASCII characters.
New headers will have new rules.
Post by Adam M. Costello
Post by Paul Hoffman / IMC
- Transmission is protected by a new ESMTP command, UTF-8-HEADERS.
Every protocol that carries messages will need an analogous tagging
mechanism.
I'm not clear what you mean here. Please elucidate.
Post by Adam M. Costello
There is something that encoded-words can do that your UTF-8 header
proposal cannot do: encoded-words can indicate the language of the
text.
I agree with the other folks who said that language tagging is
probably not needed.

--Paul Hoffman, Director
--Internet Mail Consortium
Adam M. Costello
2003-11-30 02:28:18 UTC
Permalink
For the domain part, your proposal is willing to downgrade to an ACE
rather than bounce the message. We might as well define an ACE for
the local part too, so that there would never be a need to bounce
messages.
If we add that to my current proposal, then there are *three* possible
names that a mailbox might have; two of them are readable, one of them
isn't.
Sorry, I didn't follow that. Could you please spell it out for me?
Every protocol that carries messages will need an analogous tagging
mechanism.
I'm not clear what you mean here.
POP and IMAP, for example. They transfer messages from one agent
to another, like SMTP does. Therefore, if SMTP needs a negotiation
mechanism (UTF-8-HEADERS) to verify that the receiving agent can
handle the new header format, then POP and IMAP will need an analogous
negotiation mechanism for the same reason. Maybe NNTP too, though I'm
not clear on the relationship between news article headers and mail
headers. And any other protocol that transfers mail messages from one
agent to another.

AMC
Simon Josefsson
2003-11-30 13:14:12 UTC
Permalink
Post by Adam M. Costello
Post by Paul Hoffman / IMC
Post by Adam M. Costello
Every protocol that carries messages will need an analogous tagging
mechanism.
I'm not clear what you mean here.
POP and IMAP, for example. They transfer messages from one agent
to another, like SMTP does. Therefore, if SMTP needs a negotiation
mechanism (UTF-8-HEADERS) to verify that the receiving agent can
handle the new header format, then POP and IMAP will need an analogous
negotiation mechanism for the same reason. Maybe NNTP too, though I'm
not clear on the relationship between news article headers and mail
headers. And any other protocol that transfers mail messages from one
agent to another.
Non-ASCII (even raw binary) is safe in NNTP, and putting UTF-8 in the
"Newsgroups" header has been tested in many clients and servers. I
don't believe NNTP would need a negotiation mechanism; rather, it
would be sufficient to document that headers are UTF-8 in the next
Usenet message syntax document (if the IETF ever manages to publish it).

As for IMAP, there have been good discussions about this on the IMAP
list, because IMAP is also used to access netnews. I could not find
any up to date mailing list archive of the IMAP list, but if you can
find it, look for the threads 'IMAP and Netnews' by Charles Lindsey
(<***@clw.cs.man.ac.uk>) and 'Unicode newsgroup name
options' by Russ Allbery (<***@windlord.stanford.edu>).

Both IMAP and POP(3) support capabilities, so it would not be
difficult to add a UTF8HEADER capability. I'm skeptical about that
approach, though: there are many protocols that transfer e-mail, and
adding a capability negotiation mechanism, and a new capability, to
all of them is not practical.
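
As a sketch of what the client-side check might look like, assuming the hypothetical UTF8HEADER capability name from this thread (Python's imaplib exposes a server's advertised capabilities as a tuple of upper-case strings after connecting):

```python
def supports_utf8_headers(capabilities) -> bool:
    # `capabilities` is a tuple of upper-case capability names, as
    # imaplib.IMAP4.capabilities provides after the CAPABILITY exchange.
    # "UTF8HEADER" is this thread's hypothetical name, not a registered
    # extension.
    return "UTF8HEADER" in capabilities
```

A client would check this once per session and fall back to downgraded (ASCII) headers otherwise.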

Perhaps some protocols are not worth over-engineering a solution for.

If RFC 2822bis says headers are UTF-8, which I understand is the point
of this proposal even though it has been focused on the SMTP
consequences, then if POP3 servers start to send 2822bis messages, the
clients will be updated. Adding a UTF8HEADER capability and computing
downgrades for existing POP3 clients would probably lead to worse
results overall, even though it would prevent breakage of someone's
hand-tailored POP client from '85.

Thanks,
Simon
Keith Moore
2003-12-01 01:46:52 UTC
Permalink
Post by Adam M. Costello
POP and IMAP, for example. They transfer messages from one agent
to another, like SMTP does. Therefore, if SMTP needs a negotiation
mechanism (UTF-8-HEADERS) to verify that the receiving agent can
handle the new header format, then POP and IMAP will need an analogous
negotiation mechanism for the same reason. Maybe NNTP too, though I'm
not clear on the relationship between news article headers and mail
headers. And any other protocol that transfers mail messages from one
agent to another.
I was going to point that out, but you beat me to it.

Yes, POP and IMAP servers would need to be able to tell whether their
clients supported UTF-8 headers on a per-session basis (since a lot of
people use multiple mail clients) and to perform appropriate
translation.

Then again, a lot of mail transfer protocols don't have any way to do
negotiation at all. Batch mail transmission (like UUCP) is still used
in some places, and there are a lot of UNIX-style filters in use (like
procmail), where mail is piped to or through a filter with no good way
of doing such negotiation.

This is part of why I claim that if you're going to make that drastic a
change to the message format, you need to change the format so much
that it will be obvious to everyone that it's a completely different
format that has to be handled with a completely different signal path.
(Personally I'd prefer a regular, binary format that was designed for
easy processing and extensibility. I'm sure a lot of people would
prefer XML, which would still require us to encode non-text body parts.)

Again, I really don't think having UTF-8 headers puts us much closer to
a solution to the problem at hand - which is to allow multiple
representations of addresses in different languages and scripts. (to
which I might add -- without significant disruption of the mail
system). At best, providing unencoded UTF-8 headers would be
orthogonal to a solution to the problem - actually I suspect it would
impede adoption of a solution.
Simon Josefsson
2003-12-01 02:56:23 UTC
Permalink
Post by Keith Moore
Again, I really don't think having UTF-8 headers puts us much closer
to a solution to the problem at hand - which is to allow multiple
representations of addresses in different languages and scripts. (to
which I might add -- without significant disruption of the mail
system). At best, providing unencoded UTF-8 headers would be
orthogonal to a solution to the problem - actually I suspect it would
impede adoption of a solution.
Could you define the problem you are thinking of here, more closely?
Being able to send UTF-8 in headers, after "fixing" SMTP, POP3 etc,
between aware applications, would appear to give me non-ASCII e-mail
addresses (and also get rid of RFC 2047, which is a nice side effect).
If this can be made to work, it would solve my internationalization
needs for e-mail, but it sounds as if it wouldn't satisfy your needs.

You say you want multiple representations of addresses in different
languages and scripts. Is the "multiple" a goal in itself, that must
be present at the protocol level? Why do you want to support multiple
scripts? What is missing from Unicode that warrants the added
complexities of character-set tagging of data? Applications on
non-Unicode platforms can convert to and from their native encoding.

As for language tagging, I'm not sure I see the benefits from language
tagging e-mail addresses. They are normally treated by humans as
identifiers. However, it wouldn't be difficult to add a language tag
to the UTF-8 strings. I wonder if this is a critical feature though.
There are many human languages that use ASCII or trivial extensions of
ASCII (e.g., most European languages), and language tagging ASCII
strings in headers isn't a popular request, even though e-mail has
been in use in those languages for years. So I agree with the other
people earlier in this thread, who suggested language tagging in RFC
2047 isn't critical to the problem discussed here.

Language tagging sounds like over-engineering to me, at this point.
Let's support UTF-8 directly first. If users come screaming for
language tagging, it can be added. Unless, of course, someone can
provide more insight as to why language tagging is critical to
non-ASCII e-mail addresses...

Thanks,
Simon
Keith Moore
2003-12-01 05:26:34 UTC
Permalink
Post by Simon Josefsson
Post by Keith Moore
Again, I really don't think having UTF-8 headers puts us much closer
to a solution to the problem at hand - which is to allow multiple
representations of addresses in different languages and scripts. (to
which I might add -- without significant disruption of the mail
system). At best, providing unencoded UTF-8 headers would be
orthogonal to a solution to the problem - actually I suspect it would
impede adoption of a solution.
Could you define the problem you are thinking of here, more closely?
I did so a couple of weeks ago in a thread called "what is the real
problem?"
Post by Simon Josefsson
Being able to send UTF-8 in headers, after "fixing" SMTP, POP3 etc,
between aware applications, would appear to give me non-ASCII e-mail
addresses (and also get rid of RFC 2047, which is a nice side effect).
Non-ASCII email addresses are worse than useless if you can't
transcribe them - which means at a minimum being able to display them,
read them, write them down, and type them back in. So either you have
to use different addresses depending on whom you're corresponding with
(and you need a way to keep track of who can use which address), or you
need a means for mapping between different addresses for the same
mailbox. Without a means for mapping between equivalent addresses,
non-ASCII addresses would essentially be used only by people who can be
confident that ALL of their correspondents can display, read, write,
and type those addresses. This could exclude, for instance, your kids
who happen to be studying in another country whose computers use
different keyboards. It certainly makes them impractical for use in
most international businesses.

The problem of providing multiple forms of an address is the same
regardless of whether you encode the addresses in UTF-8 or in some
other (say ASCII-compatible) encoding. In other words, encoding
addresses in raw UTF-8 doesn't help you solve this problem at all. All
it does is impose additional barriers to adoption and cause additional
failures.

(and no, it doesn't even get rid of RFC 2047, however nice that might
be, because it will still be necessary to read old messages long after
all MUAs support the new format)
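
For reference, the RFC 2047 machinery that would remain necessary for reading old messages; a minimal example using Python's standard email.header module:

```python
from email.header import decode_header, make_header

# An RFC 2047 encoded-word as it appears in a legacy Subject: field.
raw = "=?UTF-8?Q?caf=C3=A9?="

# decode_header splits the value into (bytes, charset) pairs;
# make_header reassembles them into a displayable Unicode string.
decoded = str(make_header(decode_header(raw)))
print(decoded)  # café
```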

And even if you argue that the address mapping isn't needed, addresses
encoded in ASCII are still more universally transcribable than
addresses encoded in raw UTF-8. I suspect that such addresses are too
ugly to use and that we'll want a mapping service that will translate
between UTF-8 and "less ugly" ASCII equivalents. But either way this
is a lot simpler than upgrading or replacing every single part of the
email system, which is what going to raw UTF-8 implies.

You might say that MUAs can display an ASCII-encoded version of the
UTF-8 address if the recipient doesn't understand that language. But
then you would be proposing to upgrade every mail handling program in
the Internet just to get a functionality that could be had much more
easily and quickly, and with far less expense and disruption, simply by
encoding the addresses in the message header.
Post by Simon Josefsson
You say you want multiple representations of addresses in different
languages and scripts. Is the "multiple" a goal in itself, that must
be present at the protocol level?
The goal is to allow every recipient of the message to see each address
in the message header in a form that he/she can remember and/or
transcribe. This won't happen, of course, unless there is such a form
of the address for each recipient, and unless there is some way of
providing a suitable form to each recipient.
Post by Simon Josefsson
Why do you want to support multiple scripts?
Because some languages use more than one script, and I'm assuming that
it might be useful to map between addresses that are in the same
language but written in different scripts. Saying "languages and
scripts" is more general than just saying "languages". As I see it, if
you can supply alternates in different languages, you can supply
alternates in different scripts for the same language just as easily.
Post by Simon Josefsson
What is missing from Unicode, that warrant the added
complexities of character set tagging of data? Applications on
non-Unicode platforms can convert to and from their native encoding.
I didn't say anything about character set tagging, I've been assuming
the Unicode repertoire is sufficient (maybe it is, maybe not, but I'm
assuming it is for now). You do need language tags because the
decision of which address to present to a recipient should probably be
based on language, and you can't always infer language from looking at
the sequence of Unicode characters.
Steve Hole
2003-12-01 16:12:15 UTC
Permalink
Post by Simon Josefsson
Both IMAP and POP(3) support capabilities, so it would not be
difficult to add a UTF8HEADER capability. I'm skeptical about that
approach though, there are many protocols that transfer e-mail, adding
a capability negotiation mechanism, and a new capability, to all of
them is not practical.
Yes, it would be difficult. Just because a protocol supports extensions
does not mean that it is easy to extend ... particularly in ways like
this. It will take a long time to deploy code that both handles the
display issues (which we have to do no matter what) AND requires protocol
extensions. This is because both the client and the server have to be
appropriately extended. In 13 years of implementation experience with
IMAP clients and servers, not one extension has been deployed either
quickly or universally.

You would be much better served to work within the constraints of the
existing protocol if possible. I'm sure that could be done with IMAP
which already supports UTF-7 encodings. You'd likely have to do a
mapping of some kind. Good luck with POP.

Please don't get me wrong. An extension is possible and maybe is the
right thing to do. But please don't pretend that it is either simple
to engineer or (more importantly) easy to deploy, because it isn't.
The same goes for SMTP as well ... even more so.

Cheers.
---
Steve Hole
Chief Technology Officer - Billing and Payment Systems
ACI Worldwide
<mailto:***@ACIWorldwide.com>
Phone: 780-424-4922
Paul Hoffman / IMC
2003-12-01 18:30:57 UTC
Permalink
Post by Steve Hole
You would be much better served to work within the constraints of the
existing protocol if possible. I'm sure that could be done with IMAP
which already supports UTF-7 encodings. You'd likely have to do a
mapping of some kind. Good luck with POP.
Given the massive preference for POP over IMAP in the marketplace, it
would be unwise to adopt something that assumes wide use of IMAP.
Post by Steve Hole
Please don't get me wrong. An extension is possible and maybe is the
right thing to do. But please don't pretend that it is either simple to
engineer or (more importantly) to deploy, because they aren't. The same
goes for SMTP as well ... even more so.
No one is pretending any of that. All IMA options will require lots
of upgrading, and all have side-effects. The question is whether we
can pick a reduction in side-effects that doesn't come at too high a
cost. That's why I have two proposals on the table.

--Paul Hoffman, Director
--Internet Mail Consortium
Martin Duerst
2004-01-02 22:16:50 UTC
Permalink
Post by Adam M. Costello
For the domain part, your proposal is
willing to downgrade to an ACE rather than bounce the message. We might
as well define an ACE for the local part too, so that there would never
be a need to bounce messages.
If we add that to my current proposal, then there are *three* possible
names that a mailbox might have; two of them are readable, one of them
isn't. I think that is too complicated.
I'm thinking about the tradeoff mentioned here a lot. I haven't made
up my mind, but I'm currently leaning to agreeing with Adam on this
point. The main reason is that it should significantly reduce
bounces, which I think is very important for acceptance of the
new protocol.

As for three vs. two mailbox names, I'm not sure that's that bad.
First, it can provide an obvious choice for downgrading for cases
where people don't care at all about ASCII-only alternative mailbox
names. Second, there is the usual saying "zero, one, or many", i.e. for
many issues, the difference between two and three will be minimal,
once we get from one to two (i.e. many).

I would also like to note that this is something we have to consider
extremely carefully. We can always add a special lookup protocol for
alternative addresses (as proposed by Keith) later in the game if it
turns out that there is a need. But we can't add an ACE conversion
for downgrading to avoid bouncing later in the game, because it would
be very difficult to go out to everybody and tell them to change their
settings (receiving SMTP alias,...) from two addresses to three.

As for the argument that people don't want to see ACE, I fully and
totally agree with that. But I think there is a big difference if
you can tell them "this is only temporary, it will really truly
go away" and "if you upgrade your software, it will completely
go away".

The disadvantages are that it will lower the pressure on upgrades,
and we might get halfway implementations (in particular, MUAs may
get quite a bit more sloppy at providing Address-map headers),
because you cannot go somewhere and tell them "if we want this
mail to go through, we have to upgrade".

Again, as I said, I haven't made up my mind on this really, yet.

Regards, Martin.
Thomas Roessler
2003-11-28 09:33:38 UTC
Permalink
Post by Paul Hoffman / IMC
- If a receiving SMTP server does not support UTF-8-HEADERS, the
sending SMTP client downgrades all headers and continues to send
the message.
...
Post by Paul Hoffman / IMC
- If the initiator knows the mapping for any recipient (through caching
or an address book), they SHOULD put it in the map header. If they
don't include a mapping and the message hits a non-UTF-8-HEADERS
SMTP server, the message will bounce.
What happens to the envelope? I'm reading the current proposal to
mean that mail transfer agents would have to rewrite the envelope
based on parsing the address-map header in the message, and bounce
if no address-map is present.

What to do about BCCs, then? Adapted mapping headers for each
instance, in order to avoid information leakage?


I'm also having some doubts about what will happen when messages are
sent to a mixed-universe (utf8/non-utf8) recipient set (or are
automatically bounced to systems outside the utf8 universe; think
webmail systems). I'll try to elaborate on this later today.


Regards,
--
Thomas Roessler · Personal soap box at <http://log.does-not-exist.org/>.
Arnt Gulbrandsen
2003-11-28 12:48:14 UTC
Permalink
What to do about BCCs, then? Adapted mapping headers for each
instance, in order to avoid information leakage?
Just send to the ascii address right away (if available to the sender).

For a bcc'd recipient, an address is not displayed. Because it's not
displayed anywhere, whether ascii or utf-8 is used is irrelevant.

--Arnt
Paul Hoffman / IMC
2003-11-29 20:48:34 UTC
Permalink
Post by Thomas Roessler
Post by Paul Hoffman / IMC
- If a receiving SMTP server does not support UTF-8-HEADERS, the
sending SMTP client downgrades all headers and continues to send
the message.
...
Post by Paul Hoffman / IMC
- If the initiator knows the mapping for any recipient (through caching
or an address book), they SHOULD put it in the map header. If they
don't include a mapping and the message hits a non-UTF-8-HEADERS
SMTP server, the message will bounce.
What happens to the envelope? I'm reading the current proposal to
mean that mail transfer agents would have to rewrite the envelope
based on parsing the address-map header in the message, and bounce
if no address-map is present.
That was my intention, yes.
Post by Thomas Roessler
What to do about BCCs, then? Adapted mapping headers for each
instance, in order to avoid information leakage?
The originating client would just do their own local fallback. If
none is available, the message doesn't get sent.

Yes, we would need to deal with Bcc handling carefully.
Post by Thomas Roessler
I'm also having some doubts about what will happen when messages are
sent to a mixed-universe (utf8/non-utf8) recipient set (or are
automatically bounced to systems outside the utf8 universe; think
webmail systems). I'll try to elaborate on this later today.
Please do. I didn't see a problem with this in my first guesses, but
I easily could have missed something.

--Paul Hoffman, Director
--Internet Mail Consortium
Adam M. Costello
2003-12-01 04:23:08 UTC
Permalink
Paul's strawman proposal avoids defining an ACE form for local
parts. I'd like to remind everyone that UTF-8-supporting MTAs and
address-mapping servers are harder to deploy than MUAs, and hence there
would be a price to pay for not defining ACE local parts.

Suppose I'm not familiar with ASCII characters, so you tell me your
non-ASCII address, which I remember and later type into my MUA. If
there is no ACE form for my MUA to use, then the only way for that
message to find you is if you can associate your domain name with a mail
exchanger that supports non-ASCII directly, or with an address-mapping
server. Either way, you need to wait for some sort of new server to
appear before you can have a usable non-ASCII address to tell me. (And
even then, if I'm behind a firewall, I might not have access to those
servers; if I can access only a local SMTP gateway then you'll have to
wait for that to be upgraded.)

If an ACE form is defined, then you can register an IDN and give it
an MX record pointing at any existing mail-hosting service, and it
will just work (assuming I have a new MUA, but that's needed in any
approach). You don't need to wait for any new servers to appear.

Address-mapping and UTF-8 headers can add value beyond what ACE
provides, but they don't render ACE superfluous, because they don't
match its ease of incremental deployment.
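
The domain-part ACE that the proposal already relies on exists as IDNA/Punycode; Python's built-in "idna" codec (RFC 3490) shows the downgrade, here on the example domain bücher.example:

```python
# IDNA downgrade of a non-ASCII domain to its ACE form; only the
# non-ASCII label gains the "xn--" Punycode prefix.
ace = "bücher.example".encode("idna")
print(ace)  # b'xn--bcher-kva.example'

# The downgrade is reversible, which is what makes ACE deployable
# without waiting for any new servers to appear.
assert ace.decode("idna") == "bücher.example"
```

An analogous ACE for the local part, as Adam suggests, would let an all-ASCII mail path carry the address the same way.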

AMC
Paul Hoffman / IMC
2003-12-01 18:25:31 UTC
Permalink
Post by Adam M. Costello
Suppose I'm not familiar with ASCII characters, so you tell me your
non-ASCII address, which I remember and later type into my MUA. If
there is no ACE form for my MUA to use, then the only way for that
message to find you is if you can associate your domain name with a mail
exchanger that supports non-ASCII directly, or with an address-mapping
server.
Er, no. If your MUA is enabled for UTF-8 headers, it works fine,
assuming that the intervening SMTP servers also support UTF-8
headers. You don't *need* to have a map, you only need one if you
want to assure no bounces.

--Paul Hoffman, Director
--Internet Mail Consortium