Discussion:
I-D ACTION:draft-hoffman-utf8headers-00.txt
Paul Hoffman / IMC
2003-12-16 22:01:27 UTC
Permalink
OK, so here's my first pass at the "UTF-8 headers" strawman I put up
a few weeks ago. I kinda rushed it together, so I might have missed
some suggestions that I want to put in eventually.

Please remember to start threads with new Subject lines.

--Paul Hoffman
To: IETF-Announce: ;
Subject: I-D ACTION:draft-hoffman-utf8headers-00.txt
Date: Tue, 16 Dec 2003 16:01:56 -0500
A New Internet-Draft is available from the on-line Internet-Drafts
directories.
Title : SMTP Service Extensions for Transmission of
Headers in UTF-8 Encoding
Author(s) : P. Hoffman
Filename : draft-hoffman-utf8headers-00.txt
Pages : 0
Date : 2003-12-16
Mailbox names often represent the names of human users. Many of these
users throughout the world have names that cannot normally be represented
with just the ASCII repertoire of characters, and would
therefore like to use their real names in their mailbox names. These
users are also likely to use non-ASCII text in their common names
and subjects of email messages, both in what they send and what they
receive. This protocol specifies how to represent all headers
of email messages encoded in UTF-8.
http://www.ietf.org/internet-drafts/draft-hoffman-utf8headers-00.txt
Paul Hoffman / IMC
2003-12-22 17:14:25 UTC
Permalink
Er, any comments at all?

--Paul Hoffman, Director
--Internet Mail Consortium
Martin Duerst
2003-12-22 18:16:28 UTC
Permalink
Hello Paul,

I'm surprised too that there haven't been any comments so far on your
draft. I have read about half of your draft, and already have quite a few
comments (mostly positive/clarifying), but I want to finish reading
it before writing things up.

Regards, Martin.
Post by Paul Hoffman / IMC
Er, any comments at all?
--Paul Hoffman, Director
--Internet Mail Consortium
James Seng
2003-12-23 01:32:42 UTC
Permalink
perhaps it is the christmas season? i know i've been running like crazy
the last few days.

x-mas :-)

james
Post by Martin Duerst
Hello Paul,
I'm surprised too that there haven't been any comments so far on your
draft. I have read about half of your draft, and already have quite a few
comments (mostly positive/clarifying), but I want to finish reading
it before writing things up.
Regards, Martin.
Post by Paul Hoffman / IMC
Er, any comments at all?
--Paul Hoffman, Director
--Internet Mail Consortium
Charles Lindsey
2003-12-22 23:54:53 UTC
Permalink
Post by Paul Hoffman / IMC
Er, any comments at all?
OK, I'll bite. It has taken me a few days to get onto this mailing list,
and some time to digest the document.

First let me introduce myself as the Editor of the Usefor Working Group.
As some of you likely know, Usefor had intended UTF-8 headers to become
the norm in Usenet, allowing for internationalized newsgroup-names. But it
got bogged down in gatewaying into email, and forwards/backwards
compatibility arguments, and "why didn't we invent yet another 8bit->7bit
encoding". So our arms were twisted and we have now agreed to remove all
that from the draft and, instead, produce an Experimental Protocol to deal
with I18N issues. In the meantime, Usenet will have to get by with RFC
2047 and RFC 2231.

So I was delighted to see this proposal, because if it gets accepted for
Email, there is a greater chance that our Experimental I18N Protocol will
be able to build upon it.

Now I note that the proposal comes in two parts. How to deal with local
parts, and how to introduce UTF-8 in headers. These are somewhat
orthogonal issues, so I will restrict my comments to the UTF-8 part, though
I do have some concerns about local-parts too.

So here is your section 3, with my (indented) remarks.

3.1 UTF-8-HEADERS extension

[snip]

The terminal SMTP server is responsible for knowing whether or not the
message store can handle UTF-8 headers. A terminal SMTP server MUST NOT
advertise the UTF-8-HEADERS extension if the message store for which it
is responsible cannot handle UTF-8 headers.

If an SMTP client does not see the UTF-8-HEADERS extension advertised
by an SMTP server, the SMTP client MUST downgrade the
non-ASCII contents of all header bodies before continuing to send
the message. The SMTP client SHOULD send the message with the downgraded
header bodies as a normal message.
If any header body cannot be downgraded, the SMTP client
MUST bounce the message with an error code of 558.
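
Read literally, the quoted rule boils down to something like this
sketch (Python; UTF-8-HEADERS and reply code 558 come from the draft,
everything else, including downgrade_header_bodies(), is an
illustrative assumption):

    class CannotDowngrade(Exception):
        pass

    def downgrade_header_bodies(message):
        raise NotImplementedError   # section 3.2's rules would go here

    def relay(smtp_client, smtp_server, message):
        if 'UTF-8-HEADERS' in smtp_server.extensions():
            smtp_client.send(message)    # pass UTF-8 through untouched
        else:
            try:
                smtp_client.send(downgrade_header_bodies(message))
            except CannotDowngrade:
                smtp_client.bounce(message, code=558)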

No, I don't think that works, since the concept of a "terminal" server
is not well defined. A typical SMTP server might be able to do
UTF-8-HEADERS for some addresses, but not others. It needs to see a
RCPT TO before it really knows. For example, if it is acting as the
secondary MX relay for another server it might be able to accept the
message, but not so it if was for local delivery, or maybe only for
certain known users. Again, if the server is a 'smarthost' willing to
deliver mail anywhere worldwide (but also to some local users/stores),
what is it to say?

So I think you need to say something more like:

A server which advertises the UTF-8-HEADERS extension accepts
responsibility for forwarding to other servers with that capability,
or to enabled POP3/IMAP stores, or to enabled MUAs. Absent such
capability in those other servers/stores/MUAs, it MUST/SHOULD/MAY
downgrade before forwarding. If it cannot downgrade (for whatever
reason), it MUST respond with 558.

Note that I do not believe downgrading is as easy as you have
suggested (see below), hence the MUST/SHOULD/MAY. I would be happy to
regard a server which never downgraded as being minimally compliant
(and rather easy to implement, to get us started).

All UTF-8 headers bodies can be downgraded to being all-ASCII.
However, any header body that contains a non-ASCII mailbox name might
not be able to be downgraded if there is no Address-map header that
gives a mapping for the downgrading.

BTW, I see that you use the term "header body". Can I persuade you to
use the term "header content"? That was used in Son-of-1036, and is
being used in Usefor. And the term "body" has the widely recognized
connotation of the message body. Also, you are using the term "header"
when you really mean "header field". That is very naughty, and your
wrist should be slapped (I entirely approve of your usage, of course
:-) ).

3.2 Downgrading header bodies

This section defines how to downgrade header bodies. Note that
downgrading MUST only be done if necessary. That is, downgrading
MUST never be done on fields or bodies that are all-ASCII.

3.2.1 Mailboxes

Mailboxes appear in many standard headers, such as To:, From:, Sender:,
Reply-to:, Cc:, Bcc:, Received:, and some of the Resent-: headers.
Downgrading mailboxes is done as follows:

Yes, but that list is not exhaustive. There are many headers
containing mailboxes not in that list (Approved: is the obvious
example), and many more will be invented over time. How is a server
supposed to know which headers contain them and with what syntax? For
example, do you know, off the top of your head, whether
Mail-Copies-To: includes a mailbox (in fact, it does)?

Note that there is no easy solution to this problem (essentially the
same problem that makes RFC 2047 unusable as written). Maybe a system
will try to recognize mailboxes by their syntax and will get it right
often enough to be useful. Maybe it would help if you were to insist
that all Non-ASCII addr-specs were REQUIRED to have <...> around them.
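
A rough sketch of that recognition heuristic (Python; the pattern is
illustrative and nowhere near the full RFC 2822 grammar):

    import re

    # Treat <...> as the required marker for an addr-spec, per the
    # suggestion above.
    ADDR_SPEC = re.compile(r'<([^<>\s@]+@[^<>\s@]+)>')

    def find_addr_specs(field_body):
        return ADDR_SPEC.findall(field_body)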

1) If necessary, convert the domain using IDNA.

2) If necessary, convert the local-parts using values from an
Address-map: header in the message

3) If necessary, convert any display-name or comment using
quoted-printable with UTF-8 encoding
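
The three steps could look roughly like this (Python; the helper name
and the address_map format are assumptions, and Python's Header class
may pick the B encoding where the draft says quoted-printable):

    from email.header import Header

    def downgrade_mailbox(display_name, local_part, domain, address_map):
        ascii_domain = domain.encode('idna').decode('ascii')  # step 1
        try:
            local_part.encode('ascii')
            ascii_local = local_part
        except UnicodeEncodeError:
            # step 2: without an Address-map entry the mailbox cannot
            # be downgraded at all (a KeyError here)
            ascii_local = address_map[local_part]
        ascii_name = Header(display_name, 'utf-8').encode()   # step 3
        return '%s <%s@%s>' % (ascii_name, ascii_local, ascii_domain)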

3.3.2 Message-ids

Downgrading message-ids is done as follows

AAAAARRRRRRRRRRRGGGGGGGGGGGGGGGHHHHHHHHHHHHHHHHHHHHHHHHHHH!

PLEASE no downgrading of Message-ids. No Non-ASCII in Message-ids.
Netnews is the main user of Message-ids (they are hardly useful in
email except for copying into References so as to make threading
work). But if the slightest amount of
munging/downgrading/rewriting/whatever were ever allowed in
Message-ids, the whole of Usenet would collapse in a heap, even before
it was time to show the Film at 11. The RFC 2822 msg-ids are already
too liberal to work on Usenet.

Generally speaking, it would be far better to restrict the use of
UTF-8 in headers to those contexts where it is explicitly allowed. In
pure RFC 2822 mail, that would be local-parts, domains, phrases,
unstructureds and comments. Period. Extensions might specify other
contexts (parameters in Content-Type raises its ugly head) and would
also specify how to downgrade (if at all). I believe some headers
currently allow URIs; they could be extended to allow IRIs (for which
a downgrading is already defined). Usenet would add Newsgroups. And
so on.

[snip]

3.3.3 Informational headers

If necessary, downgrading the bodies of informational headers (Subject:,
Comments:, and Keywords:) is done using quoted-printable with UTF-8
encoding.

Yes, but it might be wiser (though uglier) to use the existing RFC
2047 downgrade, which is at least understood by many/most MUAs now.
Otherwise, you have to define when to upgrade (and maybe the man who
wrote
Subject: =20 considered harmful
really didn't want it to be upgraded).

3.3.4 Address-map headers

If necessary, the Address-map: header is downgraded using Base64 for
local-parts, and IDNA for domain names.

[snip]

As another example:

Address-map: bjö***@räksmörgås.se,
bjorn-***@rksmrgs-5wao1o.se

would be downgraded to:

Address-map: ***@rksmrgs-5wao1o.se,
bjorn-***@rksmrgs-5wao1o.se

All right, but how do you know when/whether to upgrade again? If the
LHS of an Address-Map pair is
***@rksmrgs-5wao1o.se
how do you know that 'frederic' is not a base64 representation of some
unpronounceable Mongolian name?
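
The ambiguity is easy to demonstrate (Python):

    import base64

    # 'frederic' happens to be eight characters from the base64
    # alphabet, so it decodes cleanly; nothing in the string itself
    # says whether it was ever encoded.
    base64.b64decode('frederic')   # six opaque bytes, no error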

3.3 Things not changed from RFC 2822

No, before you do that you need to consider all the other headers that
might contain Non-ASCII. For example,

Content-Disposition: attachment; filename="José's_file"

To which the answer might (or might not) be RFC 2231. Yes, it is ugly,
but it is already in the field.

And I am sure there are lots of other problem cases to be considered
(not forgetting X-headers).

3.3 Things not changed from RFC 2822

Note that this protocol does change the definition of header field
names. That is, only the bodies of headers are allowed to have non-ASCII
characters; the rules in RFC 2822 for header names are not changed.

Similarly, this protocol does not change the date and time specification
in RFC 2822.

Agreed about those cases but, as I said above, it is better to specify
where Non-ASCII IS allowed, rather than where it ISN'T.

3.4 Additional processing rules

[snip]

Terminal SMTP servers MAY look into the headers of a message to
determine whether they should upgrade a downgraded set of headers to
UTF-8. This is easy to determine: if the Address-map: header contains
only ASCII, it was downgraded earlier in the chain of SMTP servers.
Upgrading is particularly useful on bounce messages caused by bad
mappings.

No, that doesn't work. It may be that the message contained no
Non-ASCII local-parts or domains. Maybe it had been downgraded because
of UTF-8 in the Subject, or in some comment or display-name.


Indeed, the next big problem is how servers and other agents are to
recognize whether any of the headers of a message contain any Non-ASCII.
Yes, you could scan the headers of every message looking for an octet
> 127, but that is a great expenditure of effort considering that 99.9% of
the world's emails will have pure ASCII headers for several years to come.
Far better to have some indication in the message that it contains 8bit
stuff (most likely an extra header to say so). Indeed, Mark Crispin is on
record as saying that, if he is to have his arm twisted into having UTF-8
headers in IMAP, he would insist on such a header.

In addition to that, SMTP is not the only mechanism for transporting email
(or netnews). There is UUCP. There is NNTP. There is X.400 (complete
with complex gatewaying rules in and out). There are satellites and
carrier pigeons and goodness knows what. Not all of these protocols will
want to implement a UTF-8-HEADERS extension. Indeed, for UUCP and NNTP it
is quite unnecessary, because they are 8bit clean already, and the
upcoming NNTP draft already assumes UTF-8 (in the few places where it
would notice).

So if a message passes through one of these protocols, it must carry
something with it that warns of Non-ASCII characters should it enter a
"normal"/SMTP environment at the far end.

But far more than that is the political advantage in having such a header.
Today, the great bulk of the internet message system uses ASCII headers
and nothing else. A few brave souls are determined to use UTF-8 (or,
shudder! GBxxxx) in their headers. OK. They should bear the cost of
bringing it in. That includes the trouble of having to mark their
messages as "unclean". Of causing suitable user agents to be implemented.
Of persuading their server admins to provide enabled POP3 and IMAP
servers. But, most of all, to persuade SMTP servers around the world to
carry their stuff at least without destroying/munging it. Their own user
agents and local servers are more or less under their control. Not so the
uncaring SMTP relays through which their messages may have to pass (we may
assume that the bulk of the people they want to communicate with will be
speaking their own languages, and will thus also have enabled software
available). But to get random SMTP servers worldwide to upgrade will be a
hard slog, and it will only be the dedicated people who want to use the
facility who will have reason to apply the pressure to make it happen.

Which is why I think it better for this to be an Experimental Protocol in
the first instance. It is less "threatening" to the IETF establishment; it
silences those people who will not allow anything incompatible with what
is already deployed without workarounds and kludges and yet more encodings
already in place. By all means, if you can get it through on the standards
track, then good luck to you, but not at the price of holding it up for 5
years. Time is not on our side. People are already using UTF-8 (and,
shudder!, GBxxxx) in headers because "it works for them". They are not
going to wait.

Usefor has already been through this. Internationalized newsgroup-names
were to have been the major advance of the project. But we have been
persuaded to remove them from the draft and to bring them forth later as
an experimental protocol. Even though they had been shown to work without
problem within the existing Usenet without any server upgrades.

So let me suggest a header so that UTF-8 users can mark their messages as
"unclean".

Header-Transfer-Encoding : "Header-Transfer-Encoding:" ( "8bit" / "7bit" )
*( ";" parameter )

OK, it needs CFWS and all that jazz in the proper places. We can argue
later whether the operative keyword is "8bit" or "utf-8". Note the
optional parameters (syntax as RFC 2045) which allow extensibility. The
only parameter I would propose initially is "language = <language-code>".
I explicitly OMIT a charset parameter, because the REQUIRED charset for
Non-ASCII headers is UTF-8. And I make that omission very EXPLICIT because
it indicates to the Chinese how they could workaround using GBxxxx within
their own borders supposing that they refuse to use UTF-8, as they most
assuredly will.
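
Generating the proposed field is then a one-pass check over the header
(Python sketch; the field name and parameter syntax are the proposal's,
the function itself is illustrative):

    def header_transfer_encoding(fields, language=None):
        # fields: mapping of field name to unfolded field body
        eight_bit = any(ord(ch) > 127
                        for body in fields.values() for ch in body)
        value = '8bit' if eight_bit else '7bit'
        if language is not None:
            value += '; language=%s' % language
        return 'Header-Transfer-Encoding: %s' % value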

It might be argued that this header SHOULD precede any use of Non-ASCII in
the headers (but given the propensity for transports to reorder headers, I
doubt that would survive).

Some people have doubts about including a language header. I put it there
to forestall Bruce Lilly who will otherwise come before us pointing out
that the word "boot" has different meanings in German and English, and
more importantly by pointing out that there is an IETF requirement to
include language specifications in all protocols. And even with that
parameter in place, he will still complain that it does not allow
different languages to be specified in different headers :-( .

So you now say that this header MUST be present, with "8bit", in any
message which includes any Non-ASCII character in any of its headers (or
included body part header fields or message types). And it MAY be present,
with "7bit" in pure ASCII messages. And that "MUST" MAY be removed in a
future version of the document (in 20 years time when all non-compliant
implementations are long dead).

And with a header like that in place, I think this idea might very well
fly.
--
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: ***@clerew.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5
Adam M. Costello
2003-12-27 08:02:12 UTC
Permalink
Post by Charles Lindsey
Far better to have some indication in the message that it contains
8bit stuff (most likely an extra header to say so).
So let me suggest a header so that UTF-8 users can mark their messages
as "unclean".
Header-Transfer-Encoding : "Header-Transfer-Encoding:" ( "8bit" /
"7bit" ) *( ";" parameter )
It might be argued that this header SHOULD precede any use of
Non-ASCII in the headers (but given the propensity for transports to
reorder headers, I doubt that would survive).
Consider this:

UTF-8-header-field = "8:" field-name ":" utf-8-field-body

where field-name is the same as always, and utf-8-field-body is like
the normal field body for that field-name except that certain Unicode
characters are allowed in certain places (encoded as UTF-8) (details to
be worked out).

For example:

8:From: blah blah <***@blah>
Date: Fri, 26 Dec 2003 12:00:00 -0000
8:Subject: blah blah blah
8:Reply-To: blah blah <***@blah>
In-Reply-To: <***@bar>

(Pretend "blah" is non-ASCII text. At the moment I'm using a crippled
terminal and cannot generate such examples.)

This would automatically satisfy the goals you describe above. Every
message that contained any non-ASCII header text would contain a
particular field whose presence could be easily checked for ("8:"),
and this special field would automatically appear before the first
occurrence of non-ASCII text, even if the fields were reordered.

User agents might want to elide the "8:" for display purposes. (That
probably won't be the only alteration made for display purposes. For
example, I imagine that a regular Date: field would get displayed with
the word "Date" and the date itself translated into the local language.)

There is room for future expansion simply by creating a new special
field (like "8a:"). Or we could insert an extra colon in the syntax
now:

8::From: blah blah <***@blah>

and allow parameters between the first two colons.

There would be two methods for downgrading. For fields whose syntax
is known, you can remove the "8:" and use encoded-words, IDNA, IMAA,
Address-Map, and/or whatever. For fields whose syntax is unknown, you
can use another special field:

downgraded-header-field = "7:" FWS field-name ":" downgraded-field-body

for example, given an unrecognized UTF-8 field:

8:Prior-Subject: blah blah blah

it could be downgraded to:

7:Prior-Subject: ASCII-ENCODED-GARBAGE

The conversion from UTF-8 to 7bit would need to be worked out, but it
would be an opaque reversible string conversion. Any user agent that
understands 8: would easily understand 7: as well.
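
A small sketch of how a receiver might classify lines under this
scheme (Python; base64 merely stands in for the opaque reversible
conversion left open above):

    import base64

    def classify(header_line):
        if header_line.startswith('8:'):
            name, _, body = header_line[2:].partition(':')
            return ('utf-8', name.strip(), body.strip())
        if header_line.startswith('7:'):
            name, _, body = header_line[2:].partition(':')
            decoded = base64.b64decode(body.strip()).decode('utf-8')
            return ('upgraded', name.strip(), decoded)
        name, _, body = header_line.partition(':')
        return ('ascii', name.strip(), body.strip())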

AMC
John C Klensin
2003-12-27 16:07:52 UTC
Permalink
Adam,

While this idea is an interesting one in principle, the
particular proposal you make would break a very large fraction
of the RFC822/2822 parsers in the world, which assume
Header = *C ":"
where "C" is an instance of a permitted character.

They may then treat the character after the colon, if it is
not a space, as an error or as the first character in the field
that follows the header. They will break either way.

While Paul and I continue to disagree about the level of badness
associated with transport bouncing of an extension, I think
that "deliver and then fail badly" is the worst of all possible
cases, since it may not even permit delivering a competent error
message.

john


--On Saturday, 27 December, 2003 08:02 +0000 "Adam M. Costello"
Post by Adam M. Costello
Post by Charles Lindsey
Far better to have some indication in the message that it
contains 8bit stuff (most likely an extra header to say so).
So let me suggest a header so that UTF-8 users can mark their
messages as "unclean".
Header-Transfer-Encoding : "Header-Transfer-Encoding:" (
"8bit" / "7bit" ) *( ";" parameter )
It might be argued that this header SHOULD precede any use of
Non-ASCII in the headers (but given the propensity for
transports to reorder headers, I doubt that would survive).
UTF-8-header-field = "8:" field-name ":" utf-8-field-body
where field-name is the same as always, and utf-8-field-body
is like the normal field body for that field-name except that
certain Unicode characters are allowed in certain places
(encoded as UTF-8) (details to be worked out).
Date: Fri, 26 Dec 2003 12:00:00 -0000
8:Subject: blah blah blah
(Pretend "blah" is non-ASCII text. At the moment I'm using a
crippled terminal and cannot generate such examples.)
This would automatically satisfy the goals you describe above.
Every message that contained any non-ASCII header text would
contain a particular field whose presence could be easily
checked for ("8:"), and this special field would automatically
appear before the first occurrence of non-ASCII text, even if
the fields were reordered.
User agents might want to elide the "8:" for display purposes.
(That probably won't be the only alteration made for display
purposes. For example, I imagine that a regular Date: field
would get displayed with the word "Date" and the date itself
translated into the local language.)
There is room for future expansion simply by creating a new
special field (like "8a:"). Or we could insert an extra colon
and allow parameters between the first two colons.
There would be two methods for downgrading. For fields whose
syntax is known, you can remove the "8:" and use
encoded-words, IDNA, IMAA, Address-Map, and/or whatever. For
fields whose syntax is unknown, you can use another special
downgraded-header-field = "7:" FWS field-name ":"
downgraded-field-body
8:Prior-Subject: blah blah blah
7:Prior-Subject: ASCII-ENCODED-GARBAGE
The conversion from UTF-8 to 7bit would need to be worked out,
but it would be an opaque reversible string conversion. Any
user agent that understands 8: would easily understand 7: as
well.
AMC
Adam M. Costello
2003-12-27 22:21:51 UTC
Permalink
Post by John C Klensin
While this idea is an interesting one in principle, the
particular proposal you make would break a very large fraction
of the RFC822/2822 parsers in the world, which assume
Header = *C ":"
where "C" is an instance of a permitted character.
Either I don't understand your objection, or I didn't make my proposal
clear enough.

The previous UTF-8-headers proposals have been redefining the syntax
of existing fields (like From:, Subject:, etc.) to allow UTF-8. I am
suggesting creating a new field, 8: (which may appear multiple times),
and allowing UTF-8 only inside 8:, and leaving existing field syntax
unchanged. Existing software, which does not recognize 8:, will not try
to interpret it (unrecognized fields are ignored). New software that
recognizes 8: will know that the first thing inside the contents of an
8: field is a sub-field-name, with the same semantics as a top-level
field name, but with slightly different syntax beyond that (UTF-8 is
allowed).

The intention of this proposal is to make breakage less likely than it
would be if UTF-8 were used directly inside today's standard fields
(From:, To:, etc.).

We can't expect existing parsers to understand non-ASCII fields, so
the best we can do is try to hide the non-ASCII fields from them. An
SMTP extension keyword is one line of defense, but Charles Lindsey was
concerned that it wouldn't be enough. This new 8: field would serve as
a second line of defense and as a convenient flag.

AMC
Nathaniel Borenstein
2003-12-28 17:37:47 UTC
Permalink
I would worry a bit that there may still be mailers out there that
don't always convey all instances of a header field that appears more
than once -- e.g. if they convert into, say, a database representation
and can only have one value indexed to the field name "8", there might
be information lost when that gets converted back into RFC [2]822
format. -- Nathaniel
Post by Adam M. Costello
Post by John C Klensin
While this idea is an interesting one in principle, the
particular proposal you make would break a very large fraction
of the RFC822/2822 parsers in the world, which assume
Header = *C ":"
where "C" is an instance of a permitted character.
Either I don't understand your objection, or I didn't make my proposal
clear enough.
The previous UTF-8-headers proposals have been redefining the syntax
of existing fields (like From:, Subject:, etc.) to allow UTF-8. I am
suggesting creating a new field, 8: (which may appear multiple times),
and allowing UTF-8 only inside 8:, and leaving existing field syntax
unchanged. Existing software, which does not recognize 8:, will not try
to interpret it (unrecognized fields are ignored). New software that
recognizes 8: will know that the first thing inside the contents of an
8: field is a sub-field-name, with the same semantics as a top-level
field name, but with slightly different syntax beyond that (UTF-8 is
allowed).
The intention of this proposal is to make breakage less likely than it
would be if UTF-8 were used directly inside today's standard fields
(From:, To:, etc.).
We can't expect existing parsers to understand non-ASCII fields, so
the best we can do is try to hide the non-ASCII fields from them. An
SMTP extension keyword is one line of defense, but Charles Lindsey was
concerned that it wouldn't be enough. This new 8: field would serve as
a second line of defense and as a convenient flag.
AMC
John Cowan
2003-12-28 19:02:44 UTC
Permalink
Post by Nathaniel Borenstein
I would worry a bit that there may still be mailers out there that
don't always convey all instances of a header field that appears more
than once -- e.g. if they convert into, say, a database representation
and can only have one value indexed to the field name "8", there might
be information lost when that gets converted back into RFC [2]822
format. -- Nathaniel
Such mailers are obviously broken, and can't represent the "Received:"
header as commonly used -- not to mention that the RFC allows most
headers to be repeated.
--
Schlingt dreifach einen Kreis vom dies! John Cowan <***@reutershealth.com>
Schliesst euer Aug vor heiliger Schau, http://www.reutershealth.com
Denn er genoss vom Honig-Tau, http://www.ccil.org/~cowan
Und trank die Milch vom Paradies. -- Coleridge (tr. Politzer)
Martin Duerst
2003-12-31 21:41:55 UTC
Permalink
Post by Charles Lindsey
Indeed, the next big problem is how servers and other agents are to
recognize whether any of the headers of a message contain any Non-ASCII.
Yes, you could scan the headers of every message looking for an octet
> 127, but that is a great expenditure of effort considering that 99.9% of
the world's emails will have pure ASCII headers for several years to come.
Far better to have some indication in the message that it contains 8bit
stuff (most likely an extra header to say so). Indeed, Mark Crispin is on
record as saying that, if he is to have his arm twisted into having UTF-8
headers in IMAP, he would insist on such a header.
I think such a header is not a bad idea. I don't think it's particularly
important, but if it helps, why not. As for actually scanning the headers,
I'm not sure about the 'great expediture'. If you have to scan all
headers to find the header that says it's UTF-8, doing the > 127 check
on the side is almost free.
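
Something like this single pass is all it takes (Python sketch; the
flag field name is the one under discussion here, not a standard):

    def scan_headers(raw_headers):
        # raw_headers: the undecoded header block, as bytes
        has_8bit = has_flag = False
        for line in raw_headers.splitlines():
            has_8bit = has_8bit or any(octet > 127 for octet in line)
            has_flag = has_flag or line.lower().startswith(
                b'header-transfer-encoding:')
        return has_8bit, has_flag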
Post by Charles Lindsey
In addition to that, SMTP is not the only mechanism for transporting email
(or netnews). There is UUCP. There is NNTP. There is X.400 (complete
with complex gatewaying rules in and out). There are satellites and
carrier pigeons and goodness knows what. Not all of these protocols will
want to implement a UTF-8-HEADERS extension. Indeed, for UUCP and NNTP it
is quite unnecessary, because they are 8bit clean already, and the
upcoming NNTP draft already assumes UTF-8 (in the few places where it
would notice).
For X.400 and UUCP, my assumption would be that things would be
downgraded anyway, which would mean to remove the header. Satellites
are not a protocol, and carrier pigeons carry paper, where we don't
even need UTF-8 :-). But in connection with NNTP, and for certain kinds
of local processing (procmail,...), it would probably make sense.
It may also ease implementation because it gives guidance for
internal (mail spool) formats.

I definitely like a header much more than the 8: header prefix
proposal, because it looks to me that it is much more
straightforward to implement. There are no issues such as "what happens
if there is a To: and an 8:To: header?", and 8-bit-clean software
can just work on headers without having to care about 7-bit/8-bit
issues except at very specific points (downgrading/upgrading).
Post by Charles Lindsey
But far more than that is the political advantage in having such a header.
Today, the great bulk of the internet message system uses ASCII headers
and nothing else. A few brave souls are determined to use UTF-8 (or,
shudder! GBxxxx) in their headers. OK. They should bear the cost of
bringing it in. That includes the trouble of having to mark their
messages as "unclean". Of causing suitable user agents to be implemented.
Of persuading their server admins to provide enabled POP3 and IMAP
servers. But, most of all, to persuade SMTP servers around the world to
carry their stuff at least without destroying/munging it. Their own user
agents and local servers are more or less under their control. Not so the
uncaring SMTP relays through which their messages may have to pass (we may
assume that the bulk of the people they want to communicate with will be
speaking their own languages, and will thus also have enabled software
available). But to get random SMTP servers worldwide to upgrade will be a
hard slog, and it will only be the dedicated people who want to use the
facility who will have reason to apply the pressure to make it happen.
I can see the 'political advantage' of such a header. But I don't see
the relationship to server upgrade patterns.
Post by Charles Lindsey
Which is why I think it better for this to be an Experimental Protocol in
the first instance. It is less "threatening" to the IETF establishment; it
silences those people who will not allow anything incompatible with what
is already deployed without workarounds and kludges and yet more encodings
already in place. By all means, if you can get it through on the standards
track, then good luck to you, but not at the price of holding it up for 5
years. Time is not on our side. People are already using UTF-8 (and,
shudder!, GBxxxx) in headers because "it works for them". They are not
going to wait.
Usefor has already been through this. Internationalized newsgroup-names
were to have been the major advance of the project. But we have been
persuaded to remove them from the draft and to bring them forth later as
an experimental protocol. Even though they had been shown to work without
problem within the existing Usenet without any server upgrades.
I think there are various ways to see this. You seem to be saying
"we didn't get further than experimental for usefor, so better not
try to get it for email". But I think it is better to see this as
"usefor alone didn't make it, but email and usefor together should
make it". Email carries a lot more weight within the IETF. The main
issue with the UTF-8 extension for usefor only going to experimental,
as far as I understand, was the interaction with email. This of course
is gone once email is also moving towards UTF-8.
Post by Charles Lindsey
So let me suggest a header so that UTF-8 users can mark their messages as
"unclean".
I don't see anything 'unclean' in UTF-8.
Post by Charles Lindsey
Header-Transfer-Encoding : "Header-Transfer-Encoding:" ( "8bit" / "7bit" )
*( ";" parameter )
OK, it needs CFWS and all that jazz in the proper places. We can argue
later whether the operative keyword is "8bit" or "utf-8".
Allow me to start now: I think the name "Header-Transfer-Encoding"
is problematic, because it will further increase confusion about
the various encoding layers. Second, I very much think the
distinction should be between US-ASCII and UTF-8, not 8bit and 7bit.
Post by Charles Lindsey
Note the
optional parameters (syntax as RFC 2045) which allow extensibility. The
only parameter I would propose initially is "language = <language-code>".
I explicitly OMIT a charset parameter, because the REQUIRED charset for
Non-ASCII headers is UTF-8. And I make that omission very EXPLICIT because
it indicates to the Chinese how they could workaround using GBxxxx within
their own borders supposing that they refuse to use UTF-8, as they most
assuredly will.
Some people have doubts about including a language header. I put it there
to forestall Bruce Lilly who will otherwise come before us pointing out
that the word "boot" has different meanings in German and English, and
more importantly by pointing out that there is an IETF requirement to
include language specifications in all protocols. And even with that
parameter in place, he will still complain that it does not allow
different languages to be specified in different headers :-( .
I'm definitely very doubtful about this. There is already a
Content-Language: header, and except for the odd case where all
the headers are in one language, and the body in another, this
parameter would not add anything.


Regards, Martin.
Paul Hoffman / IMC
2004-01-01 00:12:03 UTC
Permalink
Glad to see this being discussed more heavily. Adam's "8:" proposal
is interesting and easy to describe. However, I think Martin has
Post by Martin Duerst
I definitely like a header much more than the 8: header prefix
proposal, because it looks to me that it is much more
straightforward to implement. There are no issues such as "what happens
if there is a To: and an 8:To: header?", and 8-bit-clean software
can just work on headers without having to care about 7-bit/8-bit
issues except at very specific points (downgrading/upgrading).
Comments on this balance are most welcome!
Post by Martin Duerst
Post by Charles Lindsey
Header-Transfer-Encoding : "Header-Transfer-Encoding:" ( "8bit" / "7bit" )
*( ";" parameter )
OK, it needs CFWS and all that jazz in the proper places. We can argue
later whether the operative keyword is "8bit" or "utf-8".
Allow me to start now: I think the name "Header-Transfer-Encoding"
is problematic, because it will further increase confusion about
the various encoding layers. Second, I very much think the
distinction should be between US-ASCII and UTF-8, not 8bit and 7bit.
Martin is correct. If we go with a new "flagging" header, it could
simply be "Has-UTF-8-headers: yes".

--Paul Hoffman, Director
--Internet Mail Consortium
Keith Moore
2004-01-01 05:15:34 UTC
Permalink
Post by Martin Duerst
Post by Charles Lindsey
Indeed, the next big problem is how servers and other agents are to
recognize whether any of the headers of a message contain any Non-ASCII.
Yes, you could scan the headers of every message looking for an octet
> 127, but that is a great expenditure of effort considering that 99.9% of
the world's emails will have pure ASCII headers for several years to come.
Far better to have some indication in the message that it contains 8bit
stuff (most likely an extra header to say so). Indeed, Mark Crispin is on
record as saying that, if he is to have his arm twisted into having UTF-8
headers in IMAP, he would insist on such a header.
I think such a header is not a bad idea. I don't think it's particularly
important, but if it helps, why not. As for actually scanning the headers,
I'm not sure about the 'great expenditure'. If you have to scan all
headers to find the header that says it's UTF-8, doing the > 127 check
on the side is almost free.
having a single flag to say that fields are in utf-8 is ridiculous -
first because the fields aren't all generated at the same place, and
second because (as you point out) you potentially have to scan the
whole header anyway to find the new header field.

but as far as I'm concerned putting utf-8 in headers is a nonstarter
anyway. there's simply no justification for it.
Charles Lindsey
2004-01-01 15:29:13 UTC
Permalink
Post by Keith Moore
having a single flag to say that fields are in utf-8 is ridiculous -
first because the fields aren't all generated at the same place, and
second because (as you point out) you potentially have to scan the whole
header anyway to find the new header field.
He who knowingly generates a UTF-8 header would be responsible for
ensuring that the "Foobar" header was added, if not already present. IOW,
those who want to use this new-fangled UTF-8 stuff would be the ones to
bear the cost.
Post by Keith Moore
but as far as I'm concerned putting utf-8 in headers is a nonstarter
anyway. there's simply no justification for it.
Maybe not, but the problem is that it is going to happen whether you like
it or not, because people will find that it "just works" (well, mostly).
Indeed it is already happening, except that the code used is usually not
UTF-8.
--
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: ***@clerew.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5
Keith Moore
2004-01-01 16:20:11 UTC
Permalink
Post by Charles Lindsey
Post by Keith Moore
having a single flag to say that fields are in utf-8 is ridiculous -
first because the fields aren't all generated at the same place, and
second because (as you point out) you potentially have to scan the
whole header anyway to find the new header field.
He who knowingly generates a UTF-8 header would be responsible for
ensuring that the "Foobar" header was added, if not already present.
IOW, those who want to use this new-fangled UTF-8 stuff would be the
ones to bear the cost.
that's missing the point. adding an extra field is easy, making sure
that all non-ascii text that is in the header is in utf-8 at the time
that that field is added is hard. making sure that any nonascii text
that is subsequently added by other agents is also in utf-8 is
impossible.
Post by Charles Lindsey
Post by Keith Moore
but as far as I'm concerned putting utf-8 in headers is a nonstarter
anyway. there's simply no justification for it.
Maybe not, but the problem is that it is going to happen whether you
like it or not, because people will find that it "just works" (well,
mostly).
lots of people do stupid things. it's naive to believe that IETF can
stop people from doing stupid things by defining other ways to do those
things.
Post by Charles Lindsey
Indeed it is already happening, except that the code used is usually
not UTF-8.
which is exactly why tagging the entire header as either being utf-8 or
not doesn't work.
Charles Lindsey
2004-01-01 19:18:10 UTC
Permalink
Post by Keith Moore
Post by Charles Lindsey
He who knowingly generates a UTF-8 header would be responsible for
ensuring that the "Foobar" header was added, if not already present.
IOW, those who want to use this new-fangled UTF-8 stuff would be the
ones to bear the cost.
that's missing the point. adding an extra field is easy, making sure
that all non-ascii text that is in the header is in utf-8 at the time
that that field is added is hard. making sure that any nonascii text
that is subsequently added by other agents is also in utf-8 is
impossible.
No, the proposal is for a standard which says all Non-ASCII in headers
MUST be in UTF-8 (well, you can still use RFC 2047 if you really need more
flexibility). So if anyone includes other charsets naked in a set of
headers that contains the Foobar header, then he is non-compliant (as is
anybody who uses even UTF-8 without Foobar). He who generates
non-compliant messages must put up with the consequences (and that
includes anyone who invents some local variant of the Foobar header
allowing other charsets, if his message escapes from his local
environment).
Post by Keith Moore
Post by Charles Lindsey
Indeed it is already happening, except that the code used is usually
not UTF-8.
which is exactly why tagging the entire header as either being utf-8 or
not doesn't work.
Which is exactly why allowing the header to be tagged as being in UTF-8
*might* just encourage them to change their ways, because I don't see
anything else that will stop the rot.
--
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: ***@clerew.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5
Keith Moore
2004-01-02 05:02:36 UTC
Permalink
Post by Charles Lindsey
Post by Keith Moore
Post by Charles Lindsey
He who knowingly generates a UTF-8 header would be responsible for
ensuring that the "Foobar" header was added, if not already present.
IOW, those who want to use this new-fangled UTF-8 stuff would be the
ones to bear the cost.
that's missing the point. adding an extra field is easy, making sure
that all non-ascii text that is in the header is in utf-8 at the time
that that field is added is hard. making sure that any nonascii text
that is subsequently added by other agents is also in utf-8 is
impossible.
No, the proposal is for a standard which says all Non-ASCII in headers
MUST be in UTF-8 (well, you can still use RFC 2047 if you really need
more flexibility). So if anyone includes other charsets naked in a set
of headers that contains the Foobar header, then he is non-compliant
(as is anybody who uses even UTF-8 without Foobar).
the question is not whether an implementation that used some other
charset without encoding it would violate the standard. the question
is whether this would work well in practice given that various other
charsets are already being used without encoding, and also given that
even within the same header field different bits of text can come from
different places and be in different charsets.

as for the extra header, I suspect it would be about as useless as
MIME-Version.
Post by Charles Lindsey
He who generates non-compliant messages must put up with the
consequences (and that includes anyone who invents some local variant
of the Foobar header allowing other charsets, if his message escapes
from his local environment).
Post by Keith Moore
Post by Charles Lindsey
Indeed it is already happening, except that the code used is usually
not UTF-8.
which is exactly why tagging the entire header as either being utf-8
or not doesn't work.
Which is exactly why allowing the header to be tagged as being in
UTF-8 *might* just encourage them to change their ways, because I
don't see anything else that will stop the rot.
I don't see how adding more rot is going to stop the existing rot.
Paul Hoffman / IMC
2004-01-03 01:31:12 UTC
Permalink
Post by Keith Moore
having a single flag to say that fields are in utf-8 is ridiculous -
first because the fields aren't all generated at the same place, and
second because (as you point out) you potentially have to scan the
whole header anyway to find the new header field.
Neither of those arguments seems that relevant.

- It doesn't matter if all are generated in the same place, just that
they are all generated the same way. Non-updated MUAs and MTAs
generate headers in UTF-8 (that is, in ASCII, a proper subset of
UTF-8), and updated MUAs and MTAs generate headers in UTF-8.
Non-compliant MUAs and MTAs will mess up whatever we do.

- What's the problem with having to scan the whole header? Why is
this onerous for a terminal MTA? (It is already done by the MUA.)
Post by Keith Moore
but as far as I'm concerned putting utf-8 in headers is a nonstarter
anyway. there's simply no justification for it.
The justification is that the only proposal that doesn't involve
non-ASCII in the headers, draft-hoffman-imaa-03.txt, has two fairly
significant side-effects, namely that senders who have not updated
their MUAs will not sanely be able to initiate mail to non-ASCII
mailboxes and that recipients who have not updated their MUAs will
see gibberish.
Post by Keith Moore
there's no justification given for utf-8 headers. the desired
functionality can be accomplished by the address-map fields and
encoding the fields in ascii.
Maybe I'm being dense, but I don't see how. Are you saying that this
would be an MUA-only type protocol (like IMAA), except that the
sender would use an "upgrade" address in the address map?
Post by Keith Moore
there's no explanation as to where the address-map information would
be obtained.
I'll make that clearer. They would be bootstrapped from incoming
address-map headers. That is, your downgrade map would be in your
outgoing mail, and receiving MUAs would be able to build a cache of it.
Of course, you can also simply tell people your downgrade address.
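
The bootstrap could be as simple as this (Python sketch; the pair
format and names are assumptions, since the draft leaves the cache
unspecified):

    downgrade_cache = {}   # UTF-8 address -> all-ASCII address

    def learn(address_map_pairs):
        # pairs of (utf8_addr, ascii_addr) parsed from an incoming
        # Address-map: field
        downgrade_cache.update(address_map_pairs)

    def downgraded(addr):
        try:
            addr.encode('ascii')
            return addr                       # already ASCII
        except UnicodeEncodeError:
            return downgrade_cache.get(addr)  # None: ask the user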
Post by Keith Moore
there's no cost analysis for a proposal which would appear to have a
huge cost.
There's no cost analysis for your statement that it appears to have a
huge cost.
Post by Keith Moore
even accepting that it's a good idea to allow email addresses in raw
utf-8 (and this is a stretch) many fields should remain ascii so
that they can be read anywhere. it will often make more sense to
put ascii-encoded addresses, message-ids, etc. into log files than
to put raw utf-8 there.
There is nothing in the protocol that prevents that, of course. The
thing that writes into the log file can convert from UTF-8 to its
desired encoding.
Post by Keith Moore
there are too many mail transport boundaries that don't use SMTP and
thus may have no way to negotiate utf-8.
What you are saying is that the 2822 format is locked into stone
because different protocols that use 2822 do not talk. Others would
disagree with that assessment.
Post by Keith Moore
whether email addresses are in raw utf-8 or encoded in ascii there
is still a need to define how they are compared, because there will
often be more than one utf-8 representation of an address.
Just to be clear, are you talking about stringprep-based comparison,
or some other "utf-8 representation" issue?
Post by Keith Moore
nit: the document repeatedly says that non-ASCII text is encoded in
quoted-printable; this is incorrect. RFC 2047 allows either a
variant of quoted-printable ("Q" encoding, which isn't quite the
same thing) or base64 ("B" encoding).
Right, I realized that after I sent it in. I'll be much more careful
on the next version.

--Paul Hoffman, Director
--Internet Mail Consortium
Keith Moore
2004-01-03 02:46:21 UTC
Permalink
Post by Paul Hoffman / IMC
Post by Keith Moore
having a single flag to say that fields are in utf-8 is ridiculous -
first because the fields aren't all generated at the same place, and
second because (as you point out) you potentially have to scan the
whole header anyway to find the new header field.
Neither of those arguments seems that relevant.
- It doesn't matter if all are generated in the same place, just that
they are all generated the same way. Non-updated MUAs and MTAs
generate headers in UTF-8 (that is, in ASCII, a proper subset of
UTF-8),
actually, they generate headers in a variety of charsets.
Post by Paul Hoffman / IMC
Post by Keith Moore
but as far as I'm concerned putting utf-8 in headers is a nonstarter
anyway. there's simply no justification for it.
The justification is that the only proposal that doesn't involve
non-ASCII in the headers, draft-hoffman-imaa-03.txt, has two fairly
significant side-effects, namely that senders who have not updated
their MUAs will not sanely be able to initiate mail to non-ASCII
mailboxes
and that recipients who have not updated their MUAs will see gibberish.
both of those side-effects also exist for your utf-8 header proposal.
Post by Paul Hoffman / IMC
Post by Keith Moore
there's no justification given for utf-8 headers. the desired
functionality can be accomplished by the address-map fields and
encoding the fields in ascii.
Maybe I'm being dense, but I don't see how. Are you saying that this
would be an MUA-only type protocol (like IMAA), except that the sender
would use an "upgrade" address in the address map?
more-or-less, yes.
Post by Paul Hoffman / IMC
Post by Keith Moore
there's no explanation as to where the address-map information would
be obtained.
I'll make that clearer. They would be bootstrapped from incoming
address-map headers. That is, your downgrade map would be in your
outgoing mail, and receiving MUAs would be able to build a cache of it.
Of course, you can also simply tell people your downgrade address.
I suspect that we will still need the address mapping lookup server,
but that's a separate issue.
Post by Paul Hoffman / IMC
Post by Keith Moore
there's no cost analysis for a proposal which would appear to have a
huge cost.
There's no cost analysis for your statement that it appears to have a
huge cost.
how hard is it to figure out that this impacts every component of the
mail system, and that the cost is therefore huge?
Post by Paul Hoffman / IMC
Post by Keith Moore
even accepting that it's a good idea to allow email addresses in raw
utf-8 (and this is a stretch) many fields should remain ascii so that
they can be read anywhere. it will often make more sense to put
ascii-encoded addresses, message-ids, etc. into log files than to put
raw utf-8 there.
There is nothing in the protocol that prevents that, of course. The
thing that writes into the log file can convert from UTF-8 to its
desired encoding.
then the logs are meaningless. there's no reason for message-ids to be
in utf-8.
Post by Paul Hoffman / IMC
Post by Keith Moore
there are too many mail transport boundaries that don't use SMTP and
thus may have no way to negotiate utf-8.
What you are saying is that the 2822 format is locked into stone
because different protocols that use 2822 do not talk. Others would
disagree with that assessment.
no, I'm saying that SMTP is the wrong place to try to negotiate a
change in message format (massive leakage of 8bit MIME into non-8-bit
SMTP demonstrates this), that expecting SMTP to handle the conversion
is moving complexity in the wrong direction (and bouncing is a
nonstarter), and trying to cram utf-8 into a format designed for ASCII
(and pretending that this is a minor change) is a much worse idea than
designing a format that is obviously distinct (and which can actually
result in a simplification, unlike 2822 with utf-8 header fields).
Post by Paul Hoffman / IMC
Post by Keith Moore
whether email addresses are in raw utf-8 or encoded in ascii there is
still a need to define how they are compared, because there will
often be more than one utf-8 representation of an address.
Just to be clear, are you talking about stringprep-based comparison,
or some other "utf-8 representation" issue?
stringprep would be one way of doing it. but it's not sufficient to
say "all addresses are in utf-8" and be done with that.

Keith
Paul Hoffman / IMC
2004-01-03 03:25:13 UTC
Permalink
Post by Keith Moore
Post by Paul Hoffman / IMC
Post by Keith Moore
having a single flag to say that fields are in utf-8 is ridiculous
- first because the fields aren't all generated at the same place,
and second because (as you point out) you potentially have to scan
the whole header anyway to find the new header field.
Neither of those arguments seems that relevant.
- It doesn't matter if all are generated in the same place, just
that they are all generated the same way. Non-updated MUAs and MTAs
generate headers in UTF-8 (that is, in ASCII, a proper subset of
UTF-8),
actually, they generate headers in a variety of charsets.
I'm not sure what you mean. All headers are in the ASCII character
set currently.
Post by Keith Moore
Post by Paul Hoffman / IMC
Post by Keith Moore
but as far as I'm concerned putting utf-8 in headers is a
nonstarter anyway. there's simply no justification for it.
The justification is that the only proposal that doesn't involve
non-ASCII in the headers, draft-hoffman-imaa-03.txt, has two fairly
significant side-effects, namely that senders who have not updated
their MUAs will not sanely be able to initiate mail to non-ASCII
mailboxes
and that recipients who have not updated their MUAs will see gibberish.
both of those side-effects also exist for your utf-8 header proposal.
That is false on both counts. Please show examples.
Post by Keith Moore
Post by Paul Hoffman / IMC
Post by Keith Moore
there's no justification given for utf-8 headers. the desired
functionality can be accomplished by the address-map fields and
encoding the fields in ascii.
Maybe I'm being dense, but I don't see how. Are you saying that
this would be an MUA-only type protocol (like IMAA), except
that the sender would use an "upgrade" address in the address map?
more-or-less, yes.
OK, this is an interesting proposal. Lemme think about it more. (Feel
free to post relevant protocol stuff for it, but I think I see where
you are going.)
Post by Keith Moore
Post by Paul Hoffman / IMC
Post by Keith Moore
there's no explanation as to where the address-map information
would be obtained.
I'll make that clearer. They would be bootstrapped from incoming
address-map headers. That is, your downgrade map would be in your
outgoing mail, and receiving MUAs would be able to build a cache
it. Of course, you can also simply tell people your downgrade
address.
I suspect that we will still need the address mapping lookup server,
but that's a separate issue.
If you can show that need, I would certainly have to deal with it in
this draft. But I don't see the need, just the desire.
Post by Keith Moore
Post by Paul Hoffman / IMC
Post by Keith Moore
even accepting that it's a good idea to allow email addresses in
raw utf-8 (and this is a stretch) many fields should remain ascii
so that they can be read anywhere. it will often make more sense
to put ascii-encoded addresses, message-ids, etc. into log files
than to put raw utf-8 there.
There is nothing in the protocol that prevents that, of course. The
thing that writes into the log file can convert from UTF-8 to its
desired encoding.
then the logs are meaningless. there's no reason for message-ids to
be in utf-8.
You're the second one to say this, so I'm happy to have Message-ID:
remain ASCII, like Date:.
Post by Keith Moore
Post by Paul Hoffman / IMC
Post by Keith Moore
there are too many mail transport boundaries that don't use SMTP
and thus may have no way to negotiate utf-8.
What you are saying is that the 2822 format is locked into stone
because different protocols that use 2822 do not talk. Others would
disagree with that assessment.
no, I'm saying that SMTP is the wrong place to try to negotiate a
change in message format (massive leakage of 8bit MIME into
non-8-bit SMTP demonstrates this), that expecting SMTP to handle the
conversion is moving complexity in the wrong direction (and bouncing
is a nonstarter), and trying to cram utf-8 into a format designed
for ASCII (and pretending that this is a minor change) is a much
worse idea than designing a format that is obviously distinct (and
which can actually result in a simplification, unlike 2822 with
utf-8 header fields).
OK, that's much clearer than what you said before. Don't do
"pretty-much-like-2822", do "really new header format".

--Paul Hoffman, Director
--Internet Mail Consortium
Keith Moore
2004-01-03 03:46:26 UTC
Permalink
I'm not sure what you mean. All headers are in the ASCII character set
currently.
according to the standards, yes. that's not entirely true in the real
world.
Post by Keith Moore
Post by Paul Hoffman / IMC
Post by Keith Moore
but as far as I'm concerned putting utf-8 in headers is a
nonstarter anyway. there's simply no justification for it.
The justification is that the only proposal that doesn't involve
non-ASCII in the headers, draft-hoffman-imaa-03.txt, has two fairly
significant side-effects, namely that senders who have not updated
their MUAs will not sanely be able to initiate mail to non-ASCII
mailboxes
and that recipients who have not updated their MUAs will see
gibberish.
both of those side-effects also exist for your utf-8 header proposal.
That is false on both counts.
existing MUAs can't sanely initiate mail to non-ASCII mailboxes.
they're not set up to accept UTF-8 input. they're not set up to look
up IDNs. they're not set up to stringprep local parts or to encode
them in a way that's compatible with either SMTP or other MUAs.

existing MUAs can't display non-ASCII mailboxes as anything but
gibberish. if they're in utf-8 form, they look like gibberish unless
the output device happens to display utf-8. if they're in ACE form,
they still look like gibberish, but for a different reason.

this is going to be true regardless of whether the revised message
format ends up representing addresses in raw utf-8 or encoded in ASCII.
the on-the-wire encoding is completely orthogonal to these issues.
Post by Keith Moore
I suspect that we will still need the address mapping lookup server,
but that's a separate issue.
If you can show that need, I would certainly have to deal with it in
this draft. But I don't see the need, just the desire.
I don't know how much "need" there is either. Maybe we need market
research.

though I would want to think about the security implications of caching
address maps from previous messages. getting them from an oracle
associated with the recipient's domain seems much safer.
Thomas Roessler
2004-01-03 10:47:00 UTC
Permalink
Post by Keith Moore
though I would want to think about the security implications of
caching address maps from previous messages. getting them from
an oracle associated with the recipient's domain seems much
safer.
Using a well-defined encoding to calculate them would seem even
safer, and also works for clients not directly connected to the
Internet.

Regards,
--
Thomas Roessler · Personal soap box at <http://log.does-not-exist.org/>.
Keith Moore
2004-01-04 06:39:12 UTC
Permalink
Post by Thomas Roessler
Post by Keith Moore
though I would want to think about the security implications of
caching address maps from previous messages. getting them from
an oracle associated with the recipient's domain seems much
safer.
Using a well-defined encoding to calculate them would seem even
safer, and also works for clients not directly connected to the
Internet.
that lets them be conveyed in protocol fields that expect ASCII but it
doesn't really make them easy to remember or transcribe, which seems to
be part of the goal.
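For the domain part, at least, such a computed encoding already exists
in IDNA's ACE form; Python's built-in codec shows both properties at
once - derivable anywhere, but hardly memorable:

    # The ACE form is computed by an algorithm, so no lookup service is
    # needed - but the result is not easy to remember or transcribe.
    print("bücher.example".encode("idna"))   # b'xn--bcher-kva.example'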
Charles Lindsey
2004-01-01 11:17:57 UTC
Permalink
Post by Martin Duerst
Post by Charles Lindsey
In addition to that, SMTP is not the only mechanism for transporting email
(or netnews). There is UUCP. There is NNTP. ... Not all of these
protocols will
want to implement a UTF-8-HEADERS extension.
For X.400 and UUCP, my assumption would be that things would be
downgraded anyway, which would mean removing the header. Satellites
are not a protocol, and carrier pigeons carry paper, where we don't
even need UTF-8 :-). But in connection with NNTP, and for certain kinds
of local processing (procmail,...), it would probably make sense.
It may also ease implementation because it gives guidance for
internal (mail spool) formats.
I don't think you would need to downgrade for UUCP, because it is already
8-bit clean. But my point was that a message might happily wander around
within one protocol (UUCP or NNTP) without anybody needing to care about
the encoding or to check for "UTF-8-HEADERS". Then suddenly it arrives at
a gateway into something else (e.g. SMTP or an IMAP store) where the
distinction really matters. So the implementor of the gateway needs some
quick way to discover whether this particular message needs special
handling, and the presence of an extra header is probably the simplest way
to do it.

That is also the reason why I don't like the "8:" header prefix. In some
environments (notably Netnews) it would be much simpler to leave the
headers in their present form (otherwise, all agents will have to learn to
recognise a new set of headers which are really just synonyms for existing
ones - that could be true of mail user agents too). The advantage of the
special header is that agents that don't need to be aware of the
distinction can just ignore it.
Post by Martin Duerst
Post by Charles Lindsey
...But to get random SMTP servers worldwide to upgrade will be a
hard slog, and it will only be the dedicated people who want to use the
facility who will have reason to apply the pressure to make it happen.
I can see the 'political advantage' of such a header. But I don't see
the relationship to server upgrade patterns.
My point was that a message may pass through several servers en route, and
the intermediate ones are unlikely to be under the control of the end
users (who are the ones who will actually benefit from having headers
written in their own languages). But it is still desirable that those
intermediate servers be upgraded so that UTF-8 stuff passes straight
through them without unnecessary down- and up-gradings or, worse, 5xx
bounces. Therefore, it is in our interests to make upgrading a server as
simple and straightforward as possible, at least so far as stuff that is
just passed through to other servers is concerned. That is why I spoke of
a 'political advantage'.
Post by Martin Duerst
I think there are various ways to see this. You seem to be saying
"we didn't get further than experimental for usefor, so better not
try to get it for email". But I think it is better to see this as
"usefor alone didn't make it, but email and usefor together should
make it". Email carries a lot more weight within the IETF. The main
issue with the UTF-8 extension for usefor only going to experimental,
as far as I understand, was the interaction with email. This of course
is gone once email is also moving towards UTF-8.
Yes, Email carries more weight within IETF, and if that means this can be
brought straight to standards track, then I would be delighted. But I am
not so sure. It is a matter of timescale, and if an Experimental Protocol
can get it in the field sooner, then that might be better. Again, it is a
matter of politics - we should just go ahead, make our proposal, and then
take soundings as to which way to play it.
Post by Martin Duerst
Post by Charles Lindsey
So let me suggest a header so that UTF-8 users can mark their messages as
"unclean".
I don't see anything 'unclean' in UTF-8.
You know that, and I know that, but some others out there don't. So maybe
these messages need to go around waving handbells and shouting 'unclean',
just so that other people can keep out of their way :-) .
Post by Martin Duerst
Allow me to start now: I think the name "Header-Transfer-Encoding"
is problematic, because it will further increase confusion about
the various encoding layers. Second, I very much think the
distinction should be between US-ASCII and UTF-8, not 8bit and 7bit.
Yes, maybe we should just call it the "Foobar header" until we have decided
exactly what it is to contain. As to whether the distinction is on the
basis of "UTF-8" or of "8bit", there is just one problem, and that is the
Chinese.

UTF-8 is official IETF policy. The French, the Scandinavians and even the
Japanese will probably go along with it. But if you look at the 8bit
headers that are already sloshing around the internet (and certainly in
Usenet) you will observe that the code most commonly employed is some
GBxxxx, and people are very reluctant to give up things that are "already
working".

Yes, our standard will say that the code used in headers MUST be UTF-8,
and other codes MUST NOT be used. That, sadly, is not sufficient to
prevent it from happening. Which is why I suggest that our Foobar header
should contain a possible handle to indicate other usages, though clearly
that handle "MUST NOT be used".
Post by Martin Duerst
Post by Charles Lindsey
Some people have doubts about including a language header. I put it
there ........
I'm definitely very doubtful about this. There is already a
Content-Language: header, and except for the odd case where all
the headers are in one language, and the body in another, this
parameter would not add anything.
It may be that the Content-Language header is sufficient. I was just
pointing out that Bruce Lilly will be along presently, and that he has
some IETF BCPs on his side :-( .
--
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: ***@clerew.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5
Keith Moore
2004-01-01 14:04:11 UTC
Permalink
Post by Charles Lindsey
I don't think you would need to downgrade for UUCP, because it is
already 8-bit clean. But my point was that a message might happily
wander around within one protocol (UUCP or NNTP) without anybody
needing to care about the encoding or to check for "UTF-8-HEADERS".
Then suddenly it arrives at a gateway into something else (e.g. SMTP
or an IMAP store) where the distinction really matters.
the possibility exists that the gateway isn't aware of the utf-8
extension, so it injects messages with utf-8 headers and addresses into
the legacy mail system without doing a conversion.
Post by Charles Lindsey
Yes, Email carries more weight within IETF, and if that means this
can be brought straight to standards track, then I would be delighted.
I think it's exactly the opposite. email is viewed as an essential
service; usenet isn't. also, many people feel that usenet is already a
hopeless mess and they haven't quite gotten to feeling that way about
email (though there is a trend in this direction). so there is
considerable reluctance to making disruptive changes to email, whereas
with usenet, the attitude is more likely to be "who cares?" or "why
are you bothering to upgrade usenet anyway?"
Post by Charles Lindsey
Yes, our standard will say that the code used in headers MUST be
UTF-8, and other codes MUST NOT be used. That, sadly, is not
sufficient to prevent it from happening. Which is why I suggest that
our Foobar header should contain a possible handle to indicate other
usages, though clearly that handle "MUST NOT be used".
at the time we were working on what became RFC 1342 we realized that a
single header field to tag the charset used throughout the header would
not be sufficient, because different parts of the header are generated
by different agents on different machines. one of the reasons for
1342 was to be able to encode such things in ASCII, but another reason
was to be able to tag each bit of human-readable text with a separate
charset. what we might be finding out is that it's not reasonable to
expect everyone to use utf-8, and that we're going to continue to need
to deal with multiple charsets (though perhaps fewer than are in use
now) perhaps including different charsets in different parts of the
message header.
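What that per-part tagging looks like with today's machinery, sketched
using Python's email.header module:

    # RFC 1342/2047 encoded-words: each run of human-readable text
    # carries its own charset tag, and the whole result is pure ASCII.
    from email.header import Header

    h = Header("Grüße", charset="iso-8859-1")
    h.append("こんにちは", charset="iso-2022-jp")
    print(h.encode())
    # e.g. =?iso-8859-1?q?Gr=FC=DFe?= =?iso-2022-jp?b?GyRCJDMk...?=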
John C Klensin
2004-01-01 18:05:02 UTC
Permalink
--On Thursday, 01 January, 2004 09:04 -0500 Keith Moore
Post by Keith Moore
Post by Charles Lindsey
I don't think you would need to downgrade for UUCP, because
it is already 8-bit clean. But my point was that a message
might happily wander around within one protocol (UUCP or
NNTP) without anybody needing to care about the encoding or
to check for "UTF-8-HEADERS". Then suddenly it arrives at a
gateway into something else (e.g. SMTP or an IMAP store)
where the distinction really matters.
the possibility exists that the gateway isn't aware of the
utf-8 extension, so it injects messages with utf-8 headers and
addresses into the legacy mail system without doing a
conversion.
Or, worse, that it "downgrades" the UTF-8 by zeroing out all of
the high bits. For anyone who doesn't know (I know Keith
does), we have seen both behaviors many times. And what this
really says is that, if UUCP-based mail is now defined as "8-bit
clean", it is a requirement of a gateway that conforms to RFC
2821 that it detect the presence of 8bit characters and do
something intelligent. Now that requirement exists today, and
existed long before this particular discussion and mailing list
got started. If a UUCP-based mail message that contains 8bit
information in the body gets to a gateway into an [E]SMTP
environment, it must (sorry, MUST) tag that information
appropriately with MIME headers and must either generate an
8BITMIME negotiation or convert the relevant body parts with
some appropriate content-transfer-encoding.

Similarly, if there are eight bit fields in the headers, they
had better be fields that can be tagged and converted according
to RFC 2047, and the gateway must perform those actions. If
elements of the headers, such as address fields or elements that
are not known to the gateway and defined as "text" or "word",
cannot be converted, then the gateway must decide between some
form of encapsulation and dropping the field -- as RFC 2047 puts
it (end of section 1):

It specifically DOES NOT define any translation between
"8-bit headers" and pure ASCII headers, nor is any such
translation assumed to be possible.
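In outline, that gateway decision might look like this (a sketch only;
the set of "text" fields shown is illustrative, not a specification):

    # A conforming gateway sorts 8-bit header content into what RFC 2047
    # can encode (free text) and what it cannot (addresses and other
    # structured fields), which must be encapsulated or dropped.
    def gateway_action(field_name, value):
        try:
            value.decode("ascii")
            return "pass through unchanged"
        except UnicodeDecodeError:
            pass
        if field_name.lower() in ("subject", "comments"):  # illustrative
            return "encode as RFC 2047 encoded-words"
        return "encapsulate the message or drop the field"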
Post by Keith Moore
Post by Charles Lindsey
Yes, Email carries more weight within IETF, and if that
means this can be brought straight to standards track, then
I would be delighted.
I think it's exactly the opposite. email is viewed as an
essential service; usenet isn't. also, many people feel that
usenet is already a hopeless mess and they haven't quite
gotten to feeling that way about email (though there is a
trend in this direction). so there is considerable reluctance
to making disruptive changes to email, whereas with usenet,
the attitude is more likely to be "who cares?" or "why are
you bothering to upgrade usenet anyway?"
I would have said, instead,... The Internet's email standards
have been designed to be as accommodating to the requirements
of other environments that might generate mail that will be
injected into the Internet as reasonably possible. However, the
responsibility for ensuring that mail that is injected into the
SMTP environment really conforms to that environment's rules
rests, of necessity, with the gateways that do the conversion
and injection. And that is true whether the "other" system is a
"hopeless mess" or the most wonderfully-designed environment
around.
Post by Keith Moore
Post by Charles Lindsey
Yes, our standard will say that the code used in headers MUST
be UTF-8, and other codes MUST NOT be used. That, sadly, is
not sufficient to prevent it from happening. Which is why I
suggest that our Foobar header should contain a possible
handle to indicate other usages, though clearly that handle
"MUST NOT be used".
at the time we were working on what became RFC 1342 we
realized that a single header field to tag the charset used
throughout the header would not be sufficient, because
different parts of the header are generated by different
agents on different machines. one of the reasons for 1342
was to be able to encode such things in ASCII, but another
reason was to be able to tag each bit of human-readable text
with a separate charset. what we might be finding out is that
it's not reasonable to expect everyone to use utf-8, and that
we're going to continue to need to deal with multiple charsets
(though perhaps fewer than are in use now) perhaps including
different charsets in different parts of the message header.
Keith, while I agreed with (and strongly supported) that
reasoning at the time, in the ensuing eleven or so years Unicode
(and maybe UTF-8) have achieved sufficient adoption that it
might now be reasonable to say "systems injecting non-ASCII
characters into header fields or equivalent contexts are
required to take responsibility for converting to UTF-8"...
rather than tagging everything with what it started out being
and then hoping that the receiving system can sort things out.
As you are aware, MIME has gotten a bad reputation in some
quarters as a mechanism for well-documented incompatibility
rather than assuring interoperability. Or, as I and others keep
saying in other contexts: fewer options, and fewer profiles,
lead to better interoperability. More options tend in the other
direction. If we are going to do something new here, it may be an
appropriate time to draw the line, at least to the extent of
"all headers in the same character set for a given message" and,
ideally, to "all headers in _one_ character set".

john
Keith Moore
2004-01-01 19:11:34 UTC
Permalink
Post by John C Klensin
Post by Keith Moore
at the time we were working on what became RFC 1342 we
realized that a single header field to tag the charset used
throughout the header would not be sufficient, because
different parts of the header are generated by different
agents on different machines. one of the reasons for 1342
was to be able to encode such things in ASCII, but another
reason was to be able to tag each bit of human-readable text
with a separate charset. what we might be finding out is that
it's not reasonable to expect everyone to use utf-8, and that
we're going to continue to need to deal with multiple charsets
(though perhaps fewer than are in use now) perhaps including
different charsets in different parts of the message header.
Keith, while I agreed with (and strongly supported) that
reasoning at the time, in the ensuing eleven or so years Unicode
(and maybe UTF-8) have achieved sufficient adoption that it
might now be reasonable to say "systems injecting non-ASCII
characters into header fields or equivalent contexts are
required to take responsibility for converting to UTF-8"...
that's onerous but more-or-less doable. what doesn't seem doable is
to prevent subsequent systems that handle the message (or reply
to it) from adding their own, non-utf-8, header contents.
John C Klensin
2004-01-01 19:42:13 UTC
Permalink
--On Thursday, 01 January, 2004 14:11 -0500 Keith Moore
Post by Keith Moore
Post by John C Klensin
Post by Keith Moore
at the time we were working on what became RFC 1342 we
realized that a single header field to tag the charset used
throughout the header would not be sufficient, because
different parts of the header are generated by different
agents on different machines. one of the reasons for 1342
was to be able to encode such things in ASCII, but another
reason was to be able to tag each bit of human-readable text
with a separate charset. what we might be finding out is
that it's not reasonable to expect everyone to use utf-8,
and that we're going to continue to need to deal with
multiple charsets (though perhaps fewer than are in use
now) perhaps including different charsets in different
parts of the message header.
Keith, while I agreed with (and strongly supported) that
reasoning at the time, in the ensuing eleven or so years
Unicode (and maybe UTF-8) have achieved sufficient adoption
that it might now be reasonable to say "systems injecting
non-ASCII characters into header fields or equivalent
contexts are required to take responsibility for converting
to UTF-8"...
that's onerous but more-or-less doable. what doesn't seem
doable is to prevent subsequent systems that handle the
message (or reply to it) from adding their own, non-utf-8,
header contents.
I'm not sure I see the issue. At one level, nothing can prevent
anything or anyone from adding trash, anywhere they like. In
particular, nothing prevents someone from putting things into
2047 (or content-type: text/plain, charset=foo) form and lying
about the charset in use today. If they do, they are violating
the standard and screwing their users, but, obviously, some
folks won't care. At another level, if there is a
specification that says "if you add 8bit header content, it must
be UTF-8; anything else must either be converted into RFC 2047
form or must be converted to UTF-8", then we are probably ok.
If receiving systems are going to interpret any 8bit content
they get as UTF-8 (or invalid, if it doesn't meet UTF-8 coding
rules) then people (or sending MUAs or MTAs) who violate that
rule are just going to screw their users. This is a "today, any
8bit info is invalid; when we make it valid, we are going to
define the one valid case" situation, not one in which codings
that are valid today are suddenly given a new meaning or
interpretation.
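That receiving-side rule, written out as a classifier over raw header
bytes (assuming, as above, that untagged 8-bit content is either UTF-8
or invalid):

    def classify(raw):
        try:
            raw.decode("ascii")
            return "ascii"        # legal today
        except UnicodeDecodeError:
            pass
        try:
            raw.decode("utf-8")
            return "utf-8"        # the one valid 8-bit case
        except UnicodeDecodeError:
            return "invalid"      # e.g. bare Latin-1 or GB-coded bytes

    assert classify("Dürst".encode("utf-8")) == "utf-8"
    assert classify("Dürst".encode("latin-1")) == "invalid"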

We _could_ have adopted a rigid, Unicode-only, rule a decade ago
rather than doing charset-specific tagging in text content types
and 2047 encodings. It would have made many things more simple.
But it would not have been practical, since, despite the claims
and optimism of their advocates, Unicode (in UTF-8 form or
otherwise) wasn't nearly widely enough deployed. But, in the
ensuing years, it has gotten more widely deployed and, at least
as important, we have made a number of other decisions and
standards, including IDNA, that assume it. So a "UTF-8 only"
decision just takes a step further down a path to which we are
clearly already committed.

regards,
john
Keith Moore
2004-01-01 20:55:32 UTC
Permalink
Post by John C Klensin
Post by Keith Moore
Post by John C Klensin
Keith, while I agreed with (and strongly supported) that
reasoning at the time, in the ensuing eleven or so years
Unicode (and maybe UTF-8) have achieved sufficient adoption
that it might now be reasonable to say "systems injecting
non-ASCII characters into header fields or equivalent
contexts are required to take responsibility for converting
to UTF-8"...
that's onerous but more-or-less doable. what doesn't seem
doable is to prevent subsequent systems that handle the
message (or reply to it) from adding their own, non-utf-8,
header contents.
I'm not sure I see the issue. At one level, nothing can prevent
anything or anyone from adding trash, anywhere they like.
no, but the "reply" operation is fairly normal - in particular,
taking existing to/cc/reply-to and subject header fields, adding
new things to them, removing other things, and generally rearranging
them are all common operations - and these "new things" can be
from machine or human sources.
Post by John C Klensin
At another level, if there is a
specification that says "if you add 8bit header content, it must
be UTF-8; anything else must either be converted into RFC 2047
form or must be converted to UTF-8", then we are probably ok.
well, we already have widespread practice of taking rfc 2047 and
decoding it into whatever charset the MUA happens to want to use -
mixing utf-8 with that probably produces unpredictable results,
and insisting that all non-tagged non-ASCII text is utf-8 is probably
just naive.
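the "unpredictable results" in miniature: a client that decodes header
bytes into its own favorite charset turns raw utf-8 into gibberish.

    # Raw UTF-8 read by a Latin-1-only MUA: the reader sees mojibake.
    wire = "Dürst".encode("utf-8")   # b'D\xc3\xbcrst' on the wire
    print(wire.decode("latin-1"))    # 'DÃ¼rst'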
Post by John C Klensin
We _could_ have adopted a rigid, Unicode-only, rule a decade ago
rather than doing charset-specific tagging in text content types
and 2047 encodings.
IIRC, unicode wasn't known to be stable at that time. it certainly
wasn't widely adopted, and it's hard to imagine that we could have
gotten consensus on such a rule. it's only after 10 years' experience
with unicode that we have some confidence in its character repertoire,
and we have even less experience with other aspects of it.
Post by John C Klensin
It would have made many things more simple.
But it would not have been practical, since, despite the claims
and optimism of their advocates, Unicode (in UTF-8 form or
otherwise) wasn't nearly widely enough deployed. But, in the
ensuing years, it has gotten more widely deployed and, at least
as important, we have made a number of other decisions and
standards, including IDNA, that assume it. So a "UTF-8 only"
decision just takes a step further down a path to which we are
clearly already committed.
arguably utf-8 only isn't even workable today, because of Chinese
government regulations. but I'm not arguing that we should further
encourage diversity in character encodings; rather I'm arguing that
we should avoid disrupting the installed base - and that, to the
extent that we are going to disrupt it, IMAA is very dubious as a
sole justification for doing so.
n***@mrochek.com
2004-01-01 23:24:59 UTC
Permalink
Post by Keith Moore
Post by John C Klensin
We _could_ have adopted a rigid, Unicode-only, rule a decade ago
rather than doing charset-specific tagging in text content types
and 2047 encodings.
IIRC, unicode wasn't known to be stable at that time. it certainly
wasn't widely adopted, and it's hard to imagine that we could have
gotten consensus on such a rule. it's only after 10 years' experience
with unicode that we have some confidence in its character repertoire,
and we have even less experience with other aspects of it.
A decade ago isn't a particularly relevant date in the development of MIME --
at that point in time MIME was already a draft standard, making such a change
quite difficult to make.

A much more relevant date is November 18-22, 1991. This is the date of the last
IETF meeting prior to the approval of MIME as a proposed standard.
Realistically, this was the last point in time at which a change as major as
uncategorical endorsement of a single, universal charset specification could
have been made. The actual MIME specifications were subsequently submitted to
the IESG for approval in January 1992.

According to the Unicode web site the complete specification of Unicode 1.0
wasn't published until June, 1992. (Amusingly enough, that was the same month
in which RFC 1341 appeared.) In November 1991 the universal charset situation
was far from clear: What was then called 10646 seemed to be on the way
out and Unicode seemed to be on the way in but no conclusions had been reached.

This led to the following text appearing in RFC 1341:

NOTE: Beyond US-ASCII, an enormous proliferation of
character sets is possible. It is the opinion of the IETF
working group that a large number of character sets is NOT a
good thing. We would prefer to specify a single character
set that can be used universally for representing all of the
world's languages in electronic mail. Unfortunately,
existing practice in several communities seems to point to
the continued use of multiple character sets in the near
future. For this reason, we define names for a small number
of character sets for which a strong constituent base
exists. It is our hope that ISO 10646 or some other
effort will eventually define a single world character set
which can then be specified for use in Internet mail, but in
the advance of that definition we cannot specify the use of
ISO 10646, Unicode, or any other character set whose
definition is, as of this writing, incomplete.

Even with 20:20 hindsight I fail to see any other reasonable course of action
we could have taken at the time.

Ned
John C Klensin
2004-01-02 00:05:06 UTC
Permalink
Ned,

Thanks for clarifying the dates in my deliberately-vague
"decade".

We are, I think, in complete agreement: we couldn't have
rationally done anything else than what we did, and the text you
cite explains exactly why we made that decision. I was only
suggesting that, faced with similar decisions, but a new
context, today, we are not, and should not be, obligated to
replicate the decision of what is now thirteen or fourteen years
ago.

john


--On Thursday, 01 January, 2004 15:24 -0800
Post by n***@mrochek.com
Post by Keith Moore
Post by John C Klensin
We _could_ have adopted a rigid, Unicode-only, rule a
decade ago rather than doing charset-specific tagging in
text content types and 2047 encodings.
IIRC, unicode wasn't known to be stable at that time. it
certainly wasn't widely adopted, and it's hard to imagine
that we could have gotten consensus on such a rule. it's
only after 10 years' experience with unicode that we have
some confidence in its character repertoire, and we have even
less experience with other aspects of it.
A decade ago isn't a particularly relevant date in the
development of MIME -- at that point in time MIME was already
a draft standard, making such a change quite difficult to make.
A much more relevant date is November 18-22, 1991. This is the
date of the last IETF meeting prior to the approval of MIME as
a proposed standard. Realistically, this was the last point in
time at which a change as major as uncategorical endorsement
of a single, universal charset specification could have been
made. The actual MIME specifications were subsequently
submitted to the IESG for approval in January 1992.
According to the Unicode web site the complete specification
of Unicode 1.0 wasn't published until June, 1992. (Amusingly
enough, that was the same month in which RFC 1341 appeared.)
In November 1991 the universal charset situation was far from
clear: What was then called 10646 seemed to be on the way out
and Unicode seemed to be on the way in but no conclusions had
been reached.
NOTE: Beyond US-ASCII, an enormous
proliferation of character sets is possible.
It is the opinion of the IETF working group that a
large number of character sets is NOT a good
thing. We would prefer to specify a single character
set that can be used universally for representing all of the
world's languages in electronic mail. Unfortunately,
existing practice in several communities seems to point to
the continued use of multiple character sets in the near
future. For this reason, we define names for a small number
of character sets for which a strong constituent base
exists. It is our hope that ISO 10646 or some other
effort will eventually define a single world character set
which can then be specified for use in Internet mail, but in
the advance of that definition we cannot specify the use of
ISO 10646, Unicode, or any other character set whose
definition is, as of this writing, incomplete.
Even with 20:20 hindsight I fail to see any other reasonable
course of action we could have taken at the time.
Ned
Keith Moore
2004-01-02 04:58:27 UTC
Permalink
We are, I think, in complete agreement: we couldn't have rationally
done anything else than what we did, and the text you cite explains
exactly why we made that decision. I was only suggesting that, faced
with similar decisions, but a new context, today, we are not, and
should not be, obligated to replicate the decision of what is now
thirteen or fourteen years ago.
Of course not, but nobody has really suggested that we do so. At the
same time there are some things about email that were true then that
remain true now - one of which is that different pieces of the message
header are generated at different places by different agents, which
won't all use the same native character encoding and won't all get
upgraded to utf-8 at the same time.
John C Klensin
2004-01-02 16:47:36 UTC
Permalink
--On Thursday, 01 January, 2004 23:58 -0500 Keith Moore
Post by Keith Moore
Post by John C Klensin
We are, I think, in complete agreement: we couldn't have
rationally done anything else than what we did, and the text
you cite explains exactly why we made that decision. I was
only suggesting that, faced with similar decisions, but a
new context, today, we are not, and should not be, obligated
to replicate the decision of what is now thirteen or
fourteen years ago.
Of course not, but nobody has really suggested that we do so.
At the same time there are some things about email that were
true then that remain true now - one of which is that
different pieces of the message header are generated at
different places by different agents, which won't all use the
same native character encoding and won't all get upgraded to
utf-8 at the same time.
Of course not.

Keith, might I respectfully suggest that you stop firing off
responses and, instead, try reading the notes to which you are
responding carefully enough to understand what they really say
before reacting to them.

I'm certainly not naive enough to believe in "all get upgraded
to UTF-8 at the same time" and didn't suggest that. It is
obvious to me that we will be living with (and _should_ be
living with) 2047 formats for some time, probably indefinitely.
I am suggesting only that:

(1) Anything that puts in an 8-bit header field must do
so using UTF-8, so that it is not necessary to tag those
header fields as to which "charset" they represent. In
other words, the question isn't whether everyone
upgrades to UTF-8 at once, only whether people are
encouraged to upgrade to other things on the way to
UTF-8.

(2) MTAs wanting to send UTF-8 headers be required to
negotiate that capability, using ESMTP options, with
recipients. This eliminates what would otherwise be a
requirement for message-body-scanning and heuristics
about whether "binary" headers are present. If properly
defined, it should also eliminate having to wonder
whether some header containing 8-bit information is
really UTF-8 and not some random, non-conforming,
nonsense.

(3) If the "UTF-8 header" capability cannot be
negotiated, the MTA wishing to send that information be
required to either bounce or encapsulate the message.
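Point (2), sketched from the client side (assuming the extension
keyword advertised in the EHLO response is the draft's UTF-8-HEADERS):

    # Only send raw UTF-8 headers if the next hop advertises the
    # capability; otherwise bounce or encapsulate, per point (3).
    import smtplib

    def next_hop_accepts_utf8_headers(host):
        with smtplib.SMTP(host) as smtp:
            smtp.ehlo()
            return smtp.has_extn("UTF-8-HEADERS")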

What I am arguing against is getting involved with any more
tagging at this stage, e.g., the sort of "put any charset you
like there, just identify it" logic that 2047 uses (and for
which I argued very strongly at the time, as did almost everyone
else). If binary stuff goes in, let's restrict it to UTF-8.
As Ned pointed out, that decision was controversial when we made
it. I was on the negative side, for several reasons, and still
think I was right. But, at this point, incompatibility and
optionality, IMO, would be a worse choice for the Internet than
permitting UTF-8 and some other binary option, even if the
latter were objectively better.

Note that (3) is approximately what we have today for 8BITMIME;
it is not a new strategy or one that has been generally
rejected. Indeed, it seems to have worked fairly well as a
transition strategy. And, personally, I prefer (for the reasons
described in my previous note) an encapsulate strategy for
downgrades to a fancy-coding one. One can argue for fancy
encoding instead, but more fancy encodings feels like more
options and complexity to me, and I think we are going to need
an encapsulation option anyway.

Also note that we shouldn't go overboard about "different pieces
of the message header are generated at different places by
different agents". The reality, you will recall, is that
"headers" are generated or altered only under the following
circumstances:

(1) All base and optional headers by the generating MUA

(2) A Received header and some fix-up and substitution
headers by the initial submission MTA.

(3) Received headers (_only_) by relay MTAs

(4) Received headers and a Return-path, plus
non-conforming and extension headers to accommodate
communication with the message store or receiving MUA,
by the delivery MTA.

(5) Any header rewritten by an actual gateway to bring
it into conformance with the spec.

Anything else is a protocol violation and, while we can (and
should) sensibly talk about damage control in the context of
existing practices, we shouldn't do crazy and permanent things
in order to make (hopefully-temporary) non-conforming behavior
work better.

john
Martin Duerst
2004-01-02 17:03:50 UTC
Permalink
Of course not, but nobody has really suggested that we do so. At the same
time there are some things about email that were true then that remain
true now - one of which is that different pieces of the message header are
generated at different places by different agents, which won't all use the
same native character encoding and won't all get upgraded to utf-8 at the
same time.
This is a valid point, and it seems to lead to an interesting question
that I haven't seen discussed yet: For the SMTP extension proposed in
Paul's draft, and for Charles' header, what's the policy with respect
to stuff encoded with RFC 2047? In detail:

- Is a message tagged with Charles' header allowed to contain RFC 2047 stuff?
(I would propose we say: MAY contain RFC 2047-encoded stuff)
- Is a message passed over SMTP with UTF-8-HEADERS allowed to contain
RFC 2047 stuff? The way I understand SMTP extensions (experts on this
list, please correct me if I'm wrong), this is a somewhat moot question,
because it's the server that says what extensions it supports; the client
doesn't say which extensions it uses (unless through the use of
parameters in commands, but there are none for UTF-8-HEADERS).
So my understanding is that RFC 2047-encoded headers are not disallowed.
- Does 'upgrade' include conversion from RFC 2047-encoded headers to
raw UTF-8 (even if the RFC 2047 encoding doesn't use UTF-8)?
I didn't find this in Paul's current draft; there is at the moment
not yet much about upgrading overall. I would propose we say
"upgrading MUST convert RFC 2047-encoded text to UTF-8 if the charset
used in the RFC 2047-encoding is UTF-8, and SHOULD (or MAY?) convert
RFC 2047-encoded text to UTF-8 if the charset used in the
RFC 2047-encoding is not UTF-8."
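That upgrade step, sketched with the standard decode machinery (Python):

    # Decode RFC 2047 encoded-words, whatever charset they used, and
    # re-emit the text as raw UTF-8.
    from email.header import decode_header

    def upgrade_to_utf8(raw_header):
        parts = []
        for text, charset in decode_header(raw_header):
            if isinstance(text, bytes):
                text = text.decode(charset or "us-ascii")
            parts.append(text)
        return "".join(parts)

    print(upgrade_to_utf8("=?iso-8859-1?Q?Keld_J=F8rn_Simonsen?="))
    # Keld Jørn Simonsen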

So overall, it seems to me that there is neither a need nor an intention
to forbid cases where headers are added in different encodings, although
of course streamlining the encoding is highly advantageous.

Regards, Martin.
Charles Lindsey
2004-01-02 17:53:11 UTC
Permalink
Post by Martin Duerst
- Is a message tagged with Charles' header allowed to contain RFC 2047 stuff?
(I would propose we say: MAY contain RFC 2047-encoded stuff)
Yes, I would think so.
Post by Martin Duerst
- Is a message passed over SMTP with UTF-8-HEADERS allowed to contain
RFC 2047 stuff?
Yes, I don't see why not. In the fullness of time, one hopes that the use
of RFC 2047 will gradually disappear, but it is for the marketplace to
determine when.
Post by Martin Duerst
The way I understand SMTP extensions (experts on this
list, please correct me if I'm wrong), this is a somewhat moot question,
because it's the server that says what extensions it supports; the client
doesn't say which extensions it uses (unless through the use of
parameters in commands, but there are none for UTF-8-HEADERS).
So my understanding is that RFC 2047-encoded headers are not disallowed.
But here I have a problem. RFC 2821 seems written on the assumption that
the client is just the last server in the chain, so of course it can
announce whether it supports UTF-8-HEADERS or not.

But, in practice, what happens is that the last SMTP server in the chain
hands off to a "local delivery agent", which may turn out to be a POP3
store or an IMAP store or a huge MBOX file or whatever else. So in
practice the local delivery agent needs to be able to say the equivalent
of "I do/do not do UTF-8-HEADERS" and then the server knows whether to
downgrade/bounce/etc. Alternatively, the local delivery agent is smart
enough to downgrade/bounce itself (that sort of thing could be configured
into procmail, for example).

However, there don't seem to be any standards covering this area (you just
have to learn how to hack sendmail.cf or its equivalents), so there is
nowhere for us to specify such expectations.
Post by Martin Duerst
- Does 'upgrade' include conversion from RFC 2047-encoded headers to
raw UTF-8 (even if the RFC 2047 encoding doesn't use UTF-8)?
My view is that upgrading is best left to the latest possible point in the
chain. The less that intermediate servers do to messages, the better
(because they usually do more harm than good). So leave it to the IMAP
store, or the MUA if at all possible.

The one possible exception is where it was clear that the header in
question has previously been downgraded in accordance with this same
standard.
--
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: ***@clerew.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5
Dave Crocker
2004-01-02 07:05:59 UTC
Permalink
Thanks for clarifying the dates in my deliberately-vague "decade".
We are, I think, in complete agreement: we couldn't have rationally done
anything else than what we did, and the text you cite explains exactly why
we made that decision.
I believe the situation was stronger than "we couldn't have rationally done
anything else." My recollection is that MIME was delayed roughly a year due
to the claims that Unicode was the One True solution, in spite of it not yet
having attained stability or installed base. Indeed, we could not delay MIME
longer.

d/
--
Dave Crocker <dcrocker-at-brandenburg-dot-com>
Brandenburg InternetWorking <http://brandenburg.com>
Keld Jørn Simonsen
2004-01-02 00:23:46 UTC
Permalink
Post by n***@mrochek.com
Post by Keith Moore
Post by John C Klensin
We _could_ have adopted a rigid, Unicode-only, rule a decade ago
rather than doing charset-specific tagging in text content types
and 2047 encodings.
IIRC, unicode wasn't known to be stable at that time. it certainly
wasn't widely adopted, and it's hard to imagine that we could have
gotten consensus on such a rule. it's only after 10 years' experience
with unicode that we have some confidence in its character repertoire,
and we have even less experience with other aspects of it.
A decade ago isn't a particularly relevant date in the development of MIME --
at that point in time MIME was already a draft standard, making such a change
quite difficult to make.
A much more relevant date is November 18-22, 1991. This is the date of the last
IETF meeting prior to the approval of MIME as a proposed standard.
Realistically, this was the last point in time at which a change as major as
uncategorical endorsement of a single, universal charset specification could
have been made. The actual MIME specifications were subsequently submitted to
the IESG for approval in January 1992.
According to the Unicode web site the complete specification of Unicode 1.0
wasn't published until June, 1992. (Amusingly enough, that was the same month
in which RFC 1341 appeared.) In November 1991 the universal charset situation
was far from clear: What was then called 10646 seemed to be on the way
out and Unicode seemed to be on the way in but no conclusions had been reached.
Well, we could have built MIME on ISO 10646, and indeed the character
set support in MIME was built on 10646 as the reference character set,
as is also indicated by RFC 1345, which is the MIME-related RFC that
laid the groundwork for the charset definition. Later on we did advocate
10646 UTF-8 as the building block for all new IETF protocols, eg in
RFC 2130.

I know that the UTF-8 RFC now references Unicode as the normative
specification, but I think that was a mistake from IETF.
We should refer to international standards where they exist, and not to
specifications which are merely industry standards.

Best regards
keld
n***@mrochek.com
2004-01-02 04:35:24 UTC
Permalink
Post by Keld Jørn Simonsen
Post by n***@mrochek.com
Post by Keith Moore
Post by John C Klensin
We _could_ have adopted a rigid, Unicode-only, rule a decade ago
rather than doing charset-specific tagging in text content types
and 2047 encodings.
IIRC, unicode wasn't known to be stable at that time. it certainly
wasn't widely adopted, and it's hard to imagine that we could have
gotten consensus on such a rule. it's only after 10 years' experience
with unicode that we have some confidence in its character repertoire,
and we have even less experience with other aspects of it.
A decade ago isn't a particularly relevant date in the development of MIME --
at that point in time MIME was already a draft standard, making such a change
quite difficult to make.
A much more relevant date is November 18-22, 1991. This is the date of the last
IETF meeting prior to the approval of MIME as a proposed standard.
Realistically, this was the last point in time at which a change as major as
uncategorical endorsement of a single, universal charset specification could
have been made. The actual MIME specifications were subsequently submitted to
the IESG for approval in January 1992.
According to the Unicode web site the complete specification of Unicode 1.0
wasn't published until June, 1992. (Amusingly enough, that was the same month
in which RFC 1341 appeared.) In November 1991 the universal charset situation
was far from clear: What was then called 10646 seemed to be on the way
out and Unicode seemed to be on the way in but no conclusions had been reached.
Well, we could have built MIME on ISO 10646, and indeed the character
set support in MIME was built on 10646 as the reference character set,
as is also indicated by RFC 1345, which is the MIME-related RFC that
laid the groundwork for the charset definition.
Keld: Please reread my message. This was a complete nonstarter at the time; the
10646 draft in 1991 had failed its ballot and there was no clear indication of
what it would turn into. Had we tried to move forward with a "just use 10646"
solution the IESG at the time would have flat-out rejected MIME. And they would
have been completely correct in doing so.
Post by Keld Jørn Simonsen
Later on we did advocate
10646 UTF-8 as the building block for all new IETF protocols, eg in
RFC 2130.
Yes, in April 1997. That's over five years later. And it was a VERY contentious
recommendation at that point.
Post by Keld Jørn Simonsen
I know that the UTF-8 RFC now references Unicode as the normative
specification, but I think that was a mistake from IETF.
We should refer to international standards where they exist, and not to
specifications which are merely industry standards.
Well, all I can say is that the consensus in the IETF went against you on
this.

Ned
Keld Jørn Simonsen
2004-01-02 17:19:11 UTC
Permalink
Post by n***@mrochek.com
Post by Keld Jørn Simonsen
Post by n***@mrochek.com
Post by Keith Moore
Post by John C Klensin
We _could_ have adopted a rigid, Unicode-only, rule a decade ago
rather than doing charset-specific tagging in text content types
and 2047 encodings.
IIRC, unicode wasn't known to be stable at that time. it certainly
wasn't widely adopted, and it's hard to imagine that we could have
gotten consensus on such a rule. it's only after 10 years' experience
with unicode that we have some confidence in its character repertoire,
and we have even less experience with other aspects of it.
A decade ago isn't a particularly relevant date in the development of MIME --
at that point in time MIME was already a draft standard, making such a change
quite difficult to make.
A much more relevant date is November 18-22, 1991. This is the date of the last
IETF meeting prior to the approval of MIME as a proposed standard.
Realistically, this was the last point in time at which a change as major as
uncategorical endorsement of a single, universal charset specification could
have been made. The actual MIME specifications were subsequently submitted to
the IESG for approval in January 1992.
According to the Unicode web site the complete specification of Unicode 1.0
wasn't published until June, 1992. (Amusingly enough, that was the same month
in which RFC 1341 appeared.) In November 1991 the universal charset situation
was far from clear: What was then called 10646 seemed to be on the way
out and Unicode seemed to be on the way in but no conclusions had been reached.
Well, we could have built MIME on ISO 10646, and indeed the character
set support in MIME was built on 10646 as the reference character set,
as is also indicated by RFC 1345, which is the MIME-related RFC that
laid the groundwork for the charset definition.
Keld: Please reread my message. This was a complete nonstarter at the time; the
10646 draft in 1991 had failed its ballot and there was no clear indication of
what it would turn into. Had we tried to move forward with a "just use 10646"
solution the IESG at the time would have flat-out rejected MIME. And they would
have been completely correct in doing so.
I am in complete agreement that we could only have done what we did at
that time, that is install a regime of multiple charsets, properly
labelled.

I am not sure that we today should only go for UTF-8 for the
enhancements on email addresses, as UTF-8 as used in IETF specs is not
an international standard. We should probably then rather use what we
already did for names, and the functions to handle this are already
there. I also fear that just sending 8 bit will harm existing conforming
implementations.

Best regards
keld
John C Klensin
2004-01-02 17:49:07 UTC
Permalink
--On Friday, 02 January, 2004 18:19 +0100 Keld Jørn Simonsen
Post by Keld Jørn Simonsen
...
I am not sure that we today should only go for UTF-8 for the
enhancements on email addresses, as UTF-8 as used in IETF
specs is not an international standard. We should probably
then rather use what we already did for names, and the
functions to handle this are already there. I also fear that
just sending 8 bit will harm existing conforming
implementations.
Keld,

I'll let others respond to others of your points if they feel
like it. To them, the strongest thing I can say is what I have
said before -- fewer options promote better interoperability
and, while I'm not a huge fan of UTF-8, I consider, in the
absence of overwhelming arguments and something _clearly_
better, UTF-8 alone to be a better choice than UTF-8 plus
something.

I don't know that anyone has proposed "just sending 8 bit". An
existing implementation that has any 8 bit characters in its
headers is not conforming, so the question of a conforming
implementation interpreting 8 bit strings as other than UTF-8
(or whatever a standard specifies) is just not an issue. And,
because of the history of non-conforming implementations sending
assorted unlabeled 8bit stuff (much more in message bodies than
in headers in my experience), existing conforming
implementations that cannot accept such things and do something
plausible (including optionally cleanly rejecting it) would
have self-destructed years ago.

john
Martin Duerst
2004-01-02 18:27:00 UTC
Permalink
Post by Keld Jørn Simonsen
I am not sure that we today should only go for UTF-8 for the
enhancements on email addresses, as UTF-8 as used in IETF specs is not
an international standard.
Please explain what the difference is between "UTF-8 as used in IETF
specs" and UTF-8 as defined in ISO 10646? And in what way is this
difference relevant to the task at hand?
Post by Keld Jørn Simonsen
I also fear that just sending 8 bit will harm existing conforming
implementations.
As far as I understand, that's not what the UTF-8-HEADERS extension
is about.


Regards, Martin.
Martin Duerst
2004-01-02 21:51:40 UTC
Permalink
Post by Keith Moore
Post by John C Klensin
At another level, if there is a
specification that says "if you add 8bit header content, it must
be UTF-8; anything else must either be converted into RFC 2047
form or must be converted to UTF-8", then we are probably ok.
well, we already have widespread practice of taking rfc 2047 and
decoding it into whatever charset the MUA happens to want to use -
mixing utf-8 with that probably produces unpredictable results,
and insisting that all non-tagged non-ASCII text is utf-8 is probably
just naive.
This is an interesting point. But it doesn't exactly work that way.
There are MUAs that can handle only one charset, or a few very
related ones (I still use such a beast, but I'm thinking hard
about how to get rid of it). Those won't fare very well with UTF-8,
but they also won't fare well with IDNA, and a few other things.
The MUAs that can handle a wide variety of encodings will just
use Unicode inside. Being able to handle something like
Subject: =?iso-8859-1?Q?...?= =?iso-2022-jp?B?...?=
in a way that the user can actually look at means using
Unicode (or it means using some very messy code that would get a
lot easier if they switched to Unicode).
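For instance, a Subject built from two differently-tagged
encoded-words (here spelling "Grüße" and "こんにちは") collapses into
one internal Unicode string:

    from email.header import decode_header

    # Whitespace between encoded-words is insignificant per RFC 2047,
    # so the decoded chunks simply abut.
    subject = ("=?iso-8859-1?Q?Gr=FC=DFe?= "
               "=?iso-2022-jp?B?GyRCJDMkcyRLJEEkTxsoQg==?=")
    print("".join(
        t.decode(cs) if isinstance(t, bytes) else t
        for t, cs in decode_header(subject)
    ))
    # Grüßeこんにちは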
Post by Keith Moore
IIRC, unicode wasn't known to be stable at that time. it certainly
wasn't widely adopted, and it's hard to imagine that we could have
gotten consensus on such a rule. it's only after 10 years' experience
with unicode that we have some confidence in its character repertoire,
and we have even less experience with other aspects of it.
Others have discussed the first part of this paragraph.
For the second part (from the "it's only"), I think the
'some confidence' and 'even less experience' is a clear
understatement if the 'we' means the overall Internet or
IT community.


Regards, Martin.
Keith Moore
2004-01-03 02:58:20 UTC
Permalink
Post by Martin Duerst
Post by Keith Moore
well, we already have widespread practice of taking rfc 2047 and
decoding it into whatever charset the MUA happens to want to use -
mixing utf-8 with that probably produces unpredictable results,
and insisting that all non-tagged non-ASCII text is utf-8 is probably
just naive.
This is an interesting point. But it doesn't exactly work that way.
There are MUAs that can handle only one charset, or a few very
related ones (I still use such a beast, but I'm thinking hard
about how to get rid of it). Those won't feel very well with UTF-8,
but they also won't feel well with IDNA, and a few other things.
The MUAs that can handle a wide variety of encodings will just
use Unicode inside.
MUAs don't have to handle a wide variety of encodings in order to
translate encodings that they do understand to a character encoding
other than utf-8 and put *that* in the message header.
Post by Martin Duerst
Post by Keith Moore
It's only after 10 years' experience with unicode that we have some
confidence in its character repertoire, and we have even less
experience with other aspects of it.
Others have discussed the first part of this paragraph.
For the second part (from the "it's only"), I think the
'some confidence' and 'even less experience' is a clear
understatement if the 'we' means the overall Internet or
IT community.
I disagree. It has taken most of those 10 years to get enough software
written and widely deployed enough to have a good sense of how well
Unicode works. The same would have been true for any other solution.
Adam M. Costello
2004-01-03 10:11:59 UTC
Permalink
Post by Martin Duerst
I definitely like a header much more than the 8: header prefix
proposal, because ... 8-bit-clean software can just work on headers
without having to care about 7-bit/8-bit issues except at very
specific points (downgrading/upgrading).
Post by Charles Lindsey
...I don't like the "8:" header prefix. In some environments (notably
Netnews) it would be much simpler to leave the headers in their
present form (otherwise, all agents will have to learn to recognise a
new set of headers which are really just synonyms for existing ones
- that could be true of mail user agents too). The advantage of the
special header is that agents that don't need to be aware of the
distinction can just ignore it.
There seems to be an assumption here that existing "8-bit clean" software
will automagically understand "UTF-8 header fields" that use the same
field-names as existing ASCII header fields. But "UTF-8 header fields"
have not even been defined yet, and there are plenty of important
details to work out. All standard header fields (like To:) are defined
by grammars that currently allow only ASCII characters. UTF-8 header
fields would have different grammars. Exactly which Unicode characters
would be allowed, and where? The Unicode standard recommends that
equivalent strings be treated the same. Will that be true for UTF-8
header fields? If so, it means normalization needs to be done at some
point. At what point? When the field is created, or when it is parsed?
Which normalization, canonical or compatible? Or some profile of
Stringprep? What profile?
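
To make the equivalence problem concrete, here is a minimal Python
sketch (not from any draft; purely an illustration) of two canonically
equivalent strings that differ byte-for-byte in UTF-8:

    import unicodedata

    # U+00E9 (precomposed) vs U+0065 U+0301 (e + combining acute):
    # canonically equivalent per Unicode, different UTF-8 bytes.
    composed = "\u00e9"
    decomposed = "e\u0301"
    print(composed == decomposed)              # False: code points differ
    print(composed.encode("utf-8"))            # b'\xc3\xa9'
    print(decomposed.encode("utf-8"))          # b'e\xcc\x81'
    print(unicodedata.normalize("NFC", decomposed) == composed)  # True

Any rule that compares UTF-8 header fields octet-by-octet has to decide
what to do with pairs like these.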

Given that none of these questions have been answered yet, how can we
expect existing "8-bit clean" software agents to interoperate if we
throw "UTF-8 header fields" at them? I don't think we can. I think
if we define UTF-8 header fields, we'll need to acknowledge that they
are new header fields with a new syntax and new requirements, and that
they can be properly handled only by new software that is aware of the
new rules. If we need new software anyway, to accomodate the various
changes mentioned above, then it's no big deal to accomodate one more
slight change, like an "8:" prefix. Encapsulating each UTF-8 header
field inside a new 8: header field would ensure that existing "8-bit
clean" software does not attempt to digest things that it has no proper
understanding of.
Post by Martin Duerst
the possibility exists that the gateway isn't aware of the utf-8
extension, so it injects messages with utf-8 headers and addresses
into the legacy mail system without doing a conversion.
Yes, that's a good argument that UTF-8 header fields are likely to
fall unexpectedly into the hands of software that doesn't know how
to handle them. If UTF-8 header fields use the same field names as
the corresponding ASCII header fields, there's no telling what will
happen. By using new never-before-used field names for the UTF-8 header
fields (or even an entirely different header format) we could avoid that
pitfall. The 8: idea is one way to do it; we could certainly imagine
others.

By the way, although I'm trying to help sort out what would be a good
way to define non-ASCII headers, I have no strong position on the
question of whether they should be defined at all. I see good arguments
on both sides of the issue, and I'm biased toward sticking with the
status quo unless the alternative clearly has more benefit than cost.

AMC
Keith Moore
2004-01-03 15:36:12 UTC
Permalink
Post by Adam M. Costello
Post by Keith Moore
the possibility exists that the gateway isn't aware of the utf-8
extension, so it injects messages with utf-8 headers and addresses
into the legacy mail system without doing a conversion.
Yes, that's a good argument that UTF-8 header fields are likely to
fall unexpectedly into the hands of software that doesn't know how
to handle them. If UTF-8 header fields use the same field names as
the corresponding ASCII header fields, there's no telling what will
happen. By using new never-before-used field names for the UTF-8 header
fields (or even an entirely different header format) we could avoid that
pitfall.
but using new field names for utf-8 versions of existing fields has
other pitfalls - e.g. that the utf-8 and ascii versions can get out of
sync, that the utf-8 fields will get converted to 2047 or bit-stripped,
etc.
Adam M. Costello
2004-01-05 01:08:37 UTC
Permalink
Post by Keith Moore
UTF-8 header fields are likely to fall unexpectedly into the hands
of software that doesn't know how to handle them. If UTF-8 header
fields use the same field names as the corresponding ASCII header
fields, there's no telling what will happen.
but using new field names for utf-8 versions of existing fields has
other pitfalls - e.g. that the utf-8 and ascii versions can get out of
sync,
That is indeed a concern that does not arise if UTF-8 header fields use
the same field names as ASCII header fields.
Post by Keith Moore
that the utf-8 fields will get converted to 2047 or bit-stripped, etc.
That concern applies equally regardless of whether UTF-8 header fields
use the same field-names as ASCII header fields.
Post by Keith Moore
There seems to be an assumption here that existing "8-bit clean"
software will automagically understand "UTF-8 header fields" that
use the same field-names as existing ASCII header fields.
Eh? Of course they will, because those ASCII header fields are already
correct UTF-8. Nothing automagic needed there.
Non-ASCII field contents are invalid according to the spec that was
in force when all existing implementations were written. Therefore
feeding non-ASCII field contents to those implementations is asking for
unpredictable behavior. Some implementations will say "sorry, can't
parse that", but others will think they can parse it, and what are the
chances that they parse it correctly, and do the correct thing with
whatever protocol elements they find, according to a spec that didn't
exist when the implementation was created?
Post by Keith Moore
But "UTF-8 header fields" have not even been defined yet, and there
are plenty of important details to work out.
Right, these are issues not discussed in Paul's present draft, but
[[ many suggestions for new rules and new implementations ]]
You suggest several things that new implementations will need to do to
properly handle UTF-8 headers. As long as they're going to be doing all
that, it would be a negligible additional burden to handle a prefix like
"8:". In exchange for that tiny burden, we would get the benefit that
old software, which knows nothing of the new rules, will be much less
likely to blindly charge ahead and try to process data that it was never
designed to handle properly.
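
For illustration, a tiny Python sketch of the encapsulation idea (the
exact syntax of an "8:" field is not defined anywhere yet; everything
here is hypothetical):

    # Hypothetical "8:" encapsulation: the UTF-8 field rides verbatim as
    # the body of an "8" field.  Unaware agents see one unknown field
    # name and can ignore it; aware agents unwrap and apply the new rules.
    def wrap(name: str, value: str) -> str:
        return "8: " + name + ": " + value

    def unwrap(line: str):
        if not line.startswith("8: "):
            return None                       # not an encapsulated field
        name, _, value = line[3:].partition(": ")
        return name, value

    print(wrap("From", "ü@example.de"))       # 8: From: ü@example.de
    print(unwrap("8: From: ü@example.de"))    # ('From', 'ü@example.de')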

AMC
Keith Moore
2004-01-05 04:16:53 UTC
Permalink
Post by Adam M. Costello
Post by Keith Moore
but using new field names for utf-8 versions of existing fields has
other pitfalls - e.g. that the utf-8 and ascii versions can get out of
sync,
That is indeed a concern that does not arise if UTF-8 header fields use
the same field names as ASCII header fields.
no, the concern is when the utf-8 fields are expected to contain the
same information as the ascii fields, but in a different format.
Charles Lindsey
2004-01-03 19:30:39 UTC
Permalink
On Sat, 3 Jan 2004 10:11:59 +0000, Adam M. Costello
Post by Adam M. Costello
Post by Martin Duerst
...I don't like the "8:" header prefix. In some environments (notably
Netnews) it would be much simpler to leave the headers in their
present form (otherwise, all agents will have to learn to recognise a
new set of headers which are really just synonyms for existing ones
- that could be true of mail user agents too). The advantage of the
special header is that agents that don't need to be aware of the
distinction can just ignore it.
There seems to be an assumption here that existing "8-bit clean" software
will automagically understand "UTF-8 header fields" that use the same
field-names as existing ASCII header fields.
Eh? Of course they will, because those ASCII header fields are already
correct UTF-8. Nothing automagic needed there.
Post by Adam M. Costello
But "UTF-8 header fields"
have not even been defined yet, and there are plenty of important
details to work out. All standard header fields (like To:) are defined
by grammars that currently allow only ASCII characters. UTF-8 header
fields would have different grammars. Exactly which Unicode characters
would be allowed, and where? The Unicode standard recommends that
equivalent strings be treated the same. Will that be true for UTF-8
header fields? If so, it means normalization needs to be done at some
point. At what point? When the field is created, or when it is parsed?
Which normalization, canonical or compatible? Or some profile of
Stringprep? What profile?
Right, these are issues not discussed in Paul's present draft, but they
need to be. So here is a stab at it:

For RFC 2822 headers, you allow UTF-8 in all phrases, comments and
unstructureds, and maybe in quoted-strings too.

You allow them in domains, subject possibly to some limitations regarding
allowed characters and normalization/nameprep. Precise details would need
to be looked at rather carefully, but it does not seem inherently
difficult. Maybe 'atom' gets redefined in the process.

You allow them in local-parts, subject to some similar limitations.

And for RFC 2822, that is ALL. Everything else (header-names, date-times,
msgids) remains in ASCII.

Then you look at the MIME headers, and maybe you find some places where
UTF-8 would be useful, though I think allowing them in quoted-strings
would probably suffice. But you would have to say something about body
part header fields (e.g. that they can use UTF-8 subject to whatever
downgrading rules you have set for top-level headers). Note that transport
of such headers is no problem in transports that already support 8BITMIME.

And then you look at headers defined in other assorted documents (e.g.
headers that currently allow URIs might now allow IRIs, and Usefor would
presumably provide a suitable rule for Newsgroups).


Next problem is implementation of MUAs.

1. You ensure that all internal data paths are 8bit clean. This is
probably true already in most MUAs.

2. You arrange to display (and print) UTF-8 wherever it occurs in headers.
Modern MUAs tend to use Unicode internally, so accepting UTF-8 on the
front should be rather easy (indeed, some present agents would likely do
it out of the box). If the agent does not currently use Unicode
internally, then you do the best you can (what do you currently do with
=?utf-8?...?= ?). If that means downgrading to hex or QP or rows of
"???????"s, then that will do for a start.

3. If you provide value-added services, like displaying lists of Subjects
in alphabetic order, then you do what you can. Note that sorting UTF-8
based on octet order actually produces tolerable results.
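
(That tolerability follows from a designed-in property of UTF-8:
byte-wise comparison of UTF-8 strings gives the same order as
code-point comparison. A quick Python check, purely illustrative:

    words = ["zebra", "Ärger", "éclair", "apple", "日本"]
    by_octets = sorted(words, key=lambda w: w.encode("utf-8"))
    by_codepoints = sorted(words)        # Python compares str by code point
    print(by_octets == by_codepoints)    # True

Of course code-point order is not alphabetic order for most languages,
hence only "tolerable".)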

4. You deal with domains and local-parts. This AFAICS is the only bit that
might be "hard". But you are going to have to do that hard bit for IDNA
and for whatever other solution this WG invents for local-parts, and doing
it this way does not seem inherently harder than that.
--
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: ***@clerew.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5
Keith Moore
2004-01-01 06:06:38 UTC
Permalink
Post by Paul Hoffman / IMC
Er, any comments at all?
there's no justification given for utf-8 headers. the desired
functionality can be accomplished by the address-map fields and
encoding the fields in ascii.

there's no explanation as to where the address-map information would be
obtained.

there's no cost analysis for a proposal which would appear to have a
huge cost. of course this is just a -00 version, but it's hard to
evaluate the desirability of this proposal without considering the cost
fairly quickly.

even accepting that it's a good idea to allow email addresses in raw
utf-8 (and this is a stretch) many fields should remain ascii so that
they can be read anywhere. it will often make more sense to put
ascii-encoded addresses, message-ids, etc. into log files than to put
raw utf-8 there.

there are too many mail transport boundaries that don't use SMTP and
thus may have no way to negotiate utf-8.

whether email addresses are in raw utf-8 or encoded in ascii there is
still a need to define how they are compared, because there will often
be more than one utf-8 representation of an address.

nit: the document repeatedly says that non-ASCII text is encoded in
quoted-printable; this is incorrect. RFC 2047 allows either a variant
of quoted-printable ("Q" encoding, which isn't quite the same thing) or
base64 ("B" encoding).
Arnt Gulbrandsen
2003-12-29 14:33:25 UTC
Permalink
Post by Nathaniel Borenstein
I would worry a bit that there may still be mailers out there that
don't always convey all instances of a header field that appears more
than once -- e.g. if they convert into, say, a database
representation and can only have one value indexed to the field name
"8", there might be information lost when that gets converted back
into RFC [2]822 format. -- Nathaniel
Sure. I've seen such code convert

To: ***@b.com
To: ***@d.com

into

To: ***@b.com, ***@d.com

I don't think it's a problem. The former is illegal to start with IIRC,
so forwarding it unchanged is just as illegal as forwarding it changed.

I can't, right now, remember seeing problems caused by such code. I'm
sure there are some, for example with non-standard X- fields.
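
A sketch of the failure mode Nathaniel worries about, in Python
(illustrative only; the field values are invented):

    # A store keyed by field name silently keeps one value per name.
    raw_fields = [("Received", "from a by b"), ("Received", "from b by c"),
                  ("8", "From: ü@example.de"), ("8", "Subject: grüß dich")]
    db = dict(raw_fields)                   # later entries overwrite earlier
    print(len(raw_fields), "->", len(db))   # 4 -> 2: two fields lost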

--Arnt
Keith Moore
2003-12-29 16:16:46 UTC
Permalink
Post by Arnt Gulbrandsen
Sure. I've seen such code convert
into
I don't think it's a problem. The former is illegal to start with
IIRC, so forwarding it unchanged is just as illegal as forwarding it
changed.
nope, that's bad layering. the job of an MTA is to forward each
message intact, not to potentially change every message so that the
message it forwards is correct syntax. the latter behavior is much
more error prone as the code that detects and corrects errors may be
buggy, and often errors cannot be corrected without resorting to
unreliable heuristics. it also makes it much harder to upgrade the
message format. imagine what happens when utf-8 headers leak (as they
inevitably will) into an MTA that tries to insist that all headers on
messages that it forwards are 7bit only. the output is not likely to
be usable even if it is valid syntax.

we do recognize that there are some conversions and gateway operations
that require well-formed input in order to produce well-formed output.
such conversions are occasionally necessary but should be avoided when
possible.
Arnt Gulbrandsen
2003-12-29 16:40:35 UTC
Permalink
Post by Keith Moore
Post by Arnt Gulbrandsen
Sure. I've seen such code convert
...
Post by Keith Moore
nope, that's bad layering.
It is now. Back when the internet had lots of mail gateways that needed
to do format conversion it wasn't, IMO.
Post by Keith Moore
such conversions are occasionally necessary but should be avoided when
possible.
Agree 100%.

Luckily I've never seen this particular sort of conversion mess things
up. Perhaps the consequences are so visible and so bad that problems
get fixed before release.

Anyway, unless people have seen worse problems, the "8" field approach
("8: from: ü@ü.de") shouldn't cause run into canonicalization problems.
It may be zealously removed, it may be mishandled, but I don't see any
reason that it'll be mashed together with other 8s.

--Arnt
John Cowan
2003-12-29 18:51:26 UTC
Permalink
Post by Arnt Gulbrandsen
Sure. I've seen such code convert
into
I don't think it's a problem. The former is illegal to start with IIRC,
so forwarding it unchanged is just as illegal as forwarding it changed.
Multiple TO fields are discouraged by RFC 822, illegal according to
RFC 2822. But the Comments, Keywords, Received, and Resent-* fields MAY
(and in some circumstances MUST) occur multiple times even according to
the stricter standards of 2822.
--
Her he asked if O'Hare Doctor tidings sent from far John Cowan
coast and she with grameful sigh him answered that www.ccil.org/~cowan
O'Hare Doctor in heaven was. Sad was the man that word www.reutershealth.com
to hear that him so heavied in bowels ruthful. All ***@reutershealth.com
she there told him, ruing death for friend so young,
algate sore unwilling God's rightwiseness to withsay. _Ulysses_, "Oxen"
Dan Oscarsson
2004-01-04 09:59:29 UTC
Permalink
Post by Martin Duerst
This is a valid point, and it seems to lead to an interesting question
that I haven't seen discussed yet: For the SMTP extension proposed in
Paul's draft, and for Charles' header, what's the policy with respect to:
- Is a message tagged with Charles' header allowed to contain RFC 2047 stuff?
(I would propose we say: MAY contain RFC 2047-encoded stuff)
I prefer not to have a header, only negotiation before transfer of data.
If a header is used (there is a problem with a header due to it having
to be before all other headers to simplify message handling), the rules
should be the same as for SMTP-negotiated UTF-8.
Post by Martin Duerst
- Is a message passed over SMTP with UTF-8-HEADERS allowed to contain
RFC 2047 stuff? The way I understand SMTP extensions (experts on this
list, please correct me if I'm wrong), this is a somewhat moot question,
because it's the server that says what extensions it supports; the client
doesn't say which extensions it uses (unless through the use of
parameters in commands, but there are none for UTF-8-HEADERS).
So my understanding is that RFC 2047-encoded headers are not disallowed.
When MTAs have negotiated for UTF-8, only UTF-8 should be used - not RFC 2047.
(the only exception to that rule could be to send characters not in UCS).
The reason for this is to simplify handling of headers (parsing, decoding etc).
When UTF-8 is negotiated no RFC2047 handling should be needed.
Post by Martin Duerst
- Does 'upgrade' include conversion from RFC 2047-encoded headers to
raw UTF-8 (even if the RFC 2047 encoding doesn't use UTF-8)?
I didn't find this in Paul's current draft; there is at the moment
not yet much about upgrading overall. I would propose we say
"upgrading MUST convert RFC 2047-encoded text to UTF-8 if the charset
used in the RFC 2047-encoding is UTF-8, and SHOULD (or MAY?) convert
RFC 2047-encoded text to UTF-8 if the charset used in the
RFC 2047-encoding is not UTF-8.
As I said above about negotiation, when you go UTF-8, only UTF-8 shall
be used, otherwise much of the good of UTF-8 goes away (and some will
implement UTF-8 as RFC 2047-encoded text to avoid going UTF-8).
So when a gateway goes from legacy to UTF-8 it must decode all
RFC 2047 (and other into-ASCII encodings) into UTF-8.
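
The decoding step itself is mechanical where conversion tables exist; a
Python sketch using the standard email.header module (illustrative
only; see John's reply below for why the general case is harder):

    from email.header import decode_header

    def upgrade_to_utf8(field_body: str) -> str:
        # Decode every RFC 2047 encoded-word and re-emit the whole
        # field body as raw UTF-8 text.  Raises LookupError for any
        # charset the gateway has no conversion table for.
        parts = []
        for chunk, charset in decode_header(field_body):
            if isinstance(chunk, bytes):
                chunk = chunk.decode(charset or "us-ascii")
            parts.append(chunk)
        return "".join(parts)

    print(upgrade_to_utf8("=?iso-8859-1?q?Gr=FCezi_mitenand?="))  # Grüezi mitenand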

Dan
John C Klensin
2004-01-04 13:13:05 UTC
Permalink
--On Sunday, 04 January, 2004 10:59 +0100 Dan Oscarsson
Post by Dan Oscarsson
Post by Martin Duerst
- Is a message tagged with Charles' header allowed to contain
RFC 2047 stuff? (I would propose we say: MAY contain RFC
2047-encoded stuff)
I prefer not to have a header, only negotiation before
transfer of data. If a header is used (there is a problem with
a header due to it having to be before all other headers to
simplify message handling), the rules should be the same as
for SMTP-negotiated UTF-8.
Yes. And some of the comments below are ultimately the
strongest motivation for this. A sending MTA has to be able to
say "I'm about to do something strange, i.e., send non-ASCII in
the headers. Is that ok and do you know enough to avoid screwing
things up?" and get back an affirmative response.
Interestingly, if affirmative responses rarely occur, that would
be an in-the-field demonstration of what I take to be Keith's
key argument, i.e., that, despite all of the interest in the
engineering community and discussion on this list, implementers
and users really don't care and sticking with ASCII, or other
things encoded into ASCII, is quite adequate. Like you, I don't
believe that: I think that non-English-speaking communities, and
especially non-Roman-script communities, will deploy this stuff
relatively quickly in both MUAs and MTAs because they are
convinced that they need it (just as (at least minimal)
implementations of 8BITMIME deployed fairly quickly in
communities that thought it was important). But, ultimately,
only the marketplace can figure that one out: we should try to
focus on sensible ways to let people do things -- which they are
convinced are important enough that they will do them in
non-standard ways if we don't supply a standard -- but do them
in a safe and rational way. And that brings me to...
Post by Dan Oscarsson
When MTAs have negotiated for UTF-8, only UTF-8 should be used
- not RFC 2047. (the only exception to that rule could be to
send characters not in UCS). The reason for this is to
simplify handling of headers (parsing, decoding etc). When
UTF-8 is negotiated no RFC2047 handling should be needed.
Dan, I don't think this is realistic, for the reasons Keith has
cited repeatedly (even if I don't agree with his other
conclusions). In practice, headers are supplied in different
software at different points in the system. Our traditional
rule for MTAs has been "don't tamper with what you get, but pass
it on", a rule that recognizes the experience that, when MTAs
try to correct or reformat message text (including headers), a
lot of them get it wrong and create a mess. That is also why I
was trying to work through an "encapsulate, rather than convert"
strategy a few days ago.

In this particular case, requiring an MTA to convert is a
requirement that it support conversion from an arbitrary
character encoding (in 2047 Q or B form) to Unicode and UTF-8.
That implies that it must have tables to convert from every
2047-valid encoding to UTF-8, which is a near-impossibility,
especially since some of those charsets may not have unambiguous
conversions.

The situation is obviously somewhat better when the 2047
encoding is of UTF-8 or some other Unicode flavor. But, still,
I think that asking MTAs to start sorting through headers they
receive, looking for translations to perform and then applying
them, is just looking for trouble. We have just had far too
much trouble in the past with systems declaring themselves
gateways on the slightest pretense and then thoroughly messing
up the header environment on the assumption that they know
better about what was really intended than the original
submission process did.

john
Keith Moore
2004-01-04 16:07:39 UTC
Permalink
if affirmative responses rarely occur, that would be an in-the-field
demonstration of what I take to be Keith's key argument, i.e., that,
despite all of the interest in the engineering community and
discussion on this list, implementers and users really don't care and
sticking with ASCII, or other things encoded into ASCII, is quite
adequate.
well, that's NOT my key argument, and it's not even close to my key
argument.

my key argument is that if you want to implement IMAAs, the lack of
transparency in mail transport is the least of the problems, and trying
to negotiate transparency in mail transport doesn't bring you closer to
a solution to the IMAA problem or even closer to a desirable state in
the long term. it's simply counterproductive.
John C Klensin
2004-01-04 16:08:56 UTC
Permalink
--On Sunday, 04 January, 2004 11:07 -0500 Keith Moore
Post by Keith Moore
if affirmative responses rarely occur, that would be an
in-the-field demonstration of what I take to be Keith's key
argument, i.e., that, despite all of the interest in the
engineering community and discussion on this list,
implementers and users really don't care and sticking with
ASCII, or other things encoded into ASCII, is quite adequate.
well, that's NOT my key argument, and it's not even close to
my key argument.
My apologies.
Post by Keith Moore
my key argument is that if you want to implement IMAAs, the
lack of transparency in mail transport is the least of the
problems, and trying to negotiate transparency in mail
transport doesn't bring you closer to a solution to the IMAA
problem or even closer to a desirable state in the long term.
it's simply counterproductive.
Ok. That is clear. Now I at least understand, for the first
time, where we disagree.

thanks,
john
Charles Lindsey
2004-01-04 17:35:19 UTC
Permalink
On Sun, 4 Jan 2004 10:59:29 +0100 (CET), Dan Oscarsson
Post by Dan Oscarsson
I prefer not to have a header, only negotiation before transfer of data.
If a header is used (there is a problem with a header due to it having
to be before all other headers to simplify message handling), the rules
should be the same as for SMTP-negotiated UTF-8
You can make negotiation before transfer a REQUIREMENT of some extended
SMTP, but SMTP is not the only mail transport protocol around, and you
cannot impose such a blanket requirement on _every_ transport protocol,
both existing and not yet invented.

That is why you need both the negotiation (for those protocols which you
are in a position to specify) AND a header (for those you can't).

Those who write gateways between protocols will just have to do the Right
Thing (yes, that may be hard to achieve, but gateways are going to exist
whether you like it or not, and yes gateways may well be the weakest link
in the chain - so what's new?)
--
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133   Web: http://www.cs.man.ac.uk/~chl
Email: ***@clerew.man.ac.uk      Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9      Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5
Keith Moore
2004-01-05 04:23:18 UTC
Permalink
Post by Charles Lindsey
You can make negotiation before transfer a REQUIREMENT of some
extended SMTP, but SMTP is not the only mail transport protocol
around, and you cannot impose such a blanket requirement on _every_
transport protocol, both existing and not yet invented.
correct.
Post by Charles Lindsey
That is why you need both the negotiation (for those protocols which
you are in a position to specify) AND a header (for those you can't).
no, that's why folks should stop trying to enforce a boundary by
negotiation in the mail transport, because that clearly won't work.
putting utf-8 information in separate fields probably won't work
either, for different reasons.
Keld Jørn Simonsen
2004-01-05 12:56:31 UTC
Permalink
Post by Charles Lindsey
On Sun, 4 Jan 2004 10:59:29 +0100 (CET), Dan Oscarsson
Post by Dan Oscarsson
I prefer not to have a header, only negotiation before transfer of data.
If a header is used (there is a problem with a header due to it having
to be before all other headers to simplify message handling), the rules
should be the same as for SMTP-negotiated UTF-8
You can make negotiation before transfer a REQUIREMENT of some extended
SMTP, but SMTP is not the only mail transport protocol around, and you
cannot impose such a blanket requirement on _every_ transport protocol,
both existing and not yet invented.
That is why you need both the negotiation (for those protocols which you
are in a position to specify) AND a header (for those you can't).
This illustrates that this is not a safe way to extend the mail
protocol. You cannot just tell the other end that they need to
understand what you say, e.g. that this is an 8-bit UTF-8 header.
The other end may not be prepared to do so. The safest way to extend
the mail protocol - which causes the least interoperability problems
with the current installed conforming base - is to have all headers
still in 7 bit.

Thus we need to encode into 7 bit, and the encoding into 7 bit should be
done at the originating MUA. This is a requirement that is not too
difficult to get working, as this is also where the information on
receiver and sender addresses, with possible non-ASCII characters, is
entered, and this is also where the current MIME encoding of names in
comment fields is done (which are encoded in 7 bit), so the functions
are already available.

Whether the 7 bit encoding should just be what we already have in MIME,
or always some form of 10646 is not fully clear, but I tend to think
that some form of 10646 would be the best, for interoperability.
Maybe Punycode could be used, or UTF-7 or my mnemonic downgrading from
RFC 1345, if anybody still remembers that, or something new.
It would be nice if the 7-bit encoding were kind of readable in raw form,
and I think it should be based on 10646, not unicode, for standards
conformance reasons (as the technical differences are almost
negligible).
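
For a feel of two of the candidates, Python's built-in codecs show what
they produce (raw Punycode here, without IDNA's "xn--" prefix or any
nameprep; purely illustrative):

    label = "bücher"
    print(label.encode("punycode"))   # b'bcher-kva'
    print(label.encode("utf-7"))      # b'b+APw-cher'

Punycode keeps the ASCII letters readable and pushes the rest to the
tail; UTF-7 interrupts the text with base64 runs.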

Best regards
keld
Adam M. Costello
2004-01-06 01:15:01 UTC
Permalink
10646, not unicode, ... (as the technical differences are almost
negligible).
I'm under the impression that the technical differences are quite great,
because Unicode includes a great deal of technical information that is
absent from 10646 (but whatever does exist in 10646 is identical with or
at least compatible with Unicode).

I haven't seen 10646 (it's too expensive), but does it include much more
than a code chart and specs for UTF-8 and UTF-16? Unicode includes a
lot of character properties and algorithms for doing useful things based
on those properties. See sections C.4 and C.7 of the Unicode standard:

http://www.unicode.org/versions/Unicode4.0.0/appC.pdf

Does 10646 include normalization forms (NFC, etc)? Those are likely to
be important in any effort to use the UCS in identifiers (like email
addresses and web links). Does 10646 include case folding data? That
is needed in any effort to use the UCS in case-insensitive identifiers.
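
(Case folding is more than lowercasing; the standard German example, in
Python:

    print("straße".lower())                              # straße -- unchanged
    print("straße".casefold())                           # strasse
    print("STRASSE".casefold() == "straße".casefold())   # True

Without the folding data, "STRASSE" and "straße" never match.)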

AMC
John Cowan
2004-01-06 02:37:30 UTC
Permalink
Post by Adam M. Costello
I haven't seen 10646 (it's too expensive), but does it include much more
than a code chart and specs for UTF-8 and UTF-16? Unicode includes a
lot of character properties and algorithms for doing useful things based
It defines a few properties such as the mirrored property and the
combining-character property. It also defines a large variety of
named subsets, which are not in Unicode. But for the most part it's
just a big list of codes, names, and glyphs.
--
"May the hair on your toes never fall out!" John Cowan
--Thorin Oakenshield (to Bilbo) ***@reutershealth.com
Mark Davis
2004-01-06 02:56:37 UTC
Permalink
Yes, Keld's statement is completely inaccurate.

Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

Keld Jørn Simonsen
2004-01-06 10:07:43 UTC
Permalink
Post by Mark Davis
Yes, Keld's statement is completely inaccurate.
True, there is a lot of information on character properties etc. in
Unicode. Some of that info can be found in other ISO standards such as
ISO 14651 on sorting and ISO TR 14652 with its enhanced POSIX-like
locales.

What I meant is that the character table that defines the charset
in IETF terms is almost the same, and the UTF-8 spec is almost the
same, and these are the specs that we probably will be using in
our enhanced mail spec.

Best regards
keld
John C Klensin
2004-01-06 13:29:11 UTC
Permalink
Keld,

As you know, I took the same "reference 10646, not Unicode"
position for many years. I have given up, and suggest you do so
also. For better or worse, the fight is lost:

* The Unicode mapping and normalization tables, which
ISO has not incorporated, are critical to key IETF
standards, especially those that are connected to, or
depend on, Stringprep.

* Changes in the structure and management of ISO/IEC
JTC1/SC2, and the transfer of formal responsibility for
key work from SC22 to SC2, pretty clearly put the
ISO-based management process and direction for this work
in the hands of people who will ensure that nothing of
significance is or will be approved in the ISO context
that is inconsistent with Unicode (and, probably, that
has not already reached consensus in UTC). So, for
better or worse, the concern about important divergence
is no longer important: the only difference of substance
between 10646 and friends and Unicode, in the areas of
overlap, is likely to be lag time.

* Where UTF-8 differs between the two, important
computer and software vendors, and IETF standards work,
seem to be tracking Unicode. I suspect, but have not
been following the progress of the work (largely because
I no longer consider it worth the effort), that the
differences are also a matter of lag time, i.e., that
SC2 will, sooner or later, catch up.

* I wish UTC would be more orderly about handling and
identifying revisions, etc., especially to materials
that are critical to the practice of the standard like
some of the TRs. However, being in a situation of
having to reference one or two standards from one
organization and some technical reports from another is
worse than a situation in which we at least can be
reasonably assured of synchrony in the references. In
other words, those critical TRs refer back to [versions
of] the base Unicode standard, not to 10646. If we
reference 10646 as the base, and they reference the TRs,
we risk serious confusion about what is being specified
if there are any differences, even temporarily. And,
while an identification process that does not provide
concise and precise identification of versions of TRs
(or keep earlier versions readily accessible) adds
complication, the fact that ISO doesn't have those
documents as standards is a showstopper on the ISO path.

This, IMO, just isn't worth going around any more. ISO and its
member bodies have decided -- in this area and others -- to
yield its authority, development, and review process to an
outside body, limiting its role to formal final review
(typically now primarily by the same people). I don't like that
outcome in this area, and like it still less in others. But
insisting on recognition of documents just because they have
gone through that weakened process doesn't seem to me to enhance
anything other than ISO's dwindling reputation. So, in this
area and others, perhaps the best thing other standards bodies
can do is to say "Ok, the key standard is really being developed
in another place, and ISO is rubber-stamping and reprinting
(usually at high price and long delay) its key content. We
should reference the other documents unless there is obvious
value-added (and, in this case, comprehensive coverage) in the
ISO work. And assignment of a number and reprinting in another
format is _not_, at least in my opinion, sufficient value-added
to justify the time lag and risk of non-synchrony between, e.g.,
the coding standard and the normalization rules."

I don't like this outcome but the wounds in ISO's feet are
self-inflicted and we can't ignore them any more in this area.

regards,
john


Mark Davis
2004-01-06 15:27:59 UTC
Permalink
Now I am one of the first to say that referencing the Unicode standards is the
right direction, in no small part because of the wealth of specifications and
data over and above the character repertoire itself, items that are simply not
found in ISO standards.

However, your message goes a bit too far in that direction. ISO does not merely
rubber-stamp the Unicode standard; there is an extremely successful degree of
cooperation between the Unicode consortium and the ISO subcommittees. This has
allowed the resulting character repertoire to be more thoroughly vetted than if
either organization had done it alone.

In terms of collation, cooperation between the consortium and ISO has produced
synchronization between the UCA and ISO 14651 on a base level (although the UCA
does add substantial capabilities). While we have not been happy at the speed
with which ISO/IEC SC22 was able to address the issues involved in collation,
the main reason for the transfer of authority of 14651 to ISO/IEC SC2 is that at
this point the bulk of the work is in refining the approaches to more
specialized scripts, and SC2 has many, many times the representation in these
areas.

As to UTF-8, there is no effective difference remaining between ISO,
Unicode, or the IETF (http://www.ietf.org/rfc/rfc3629.txt).
Post by John C Klensin
* I wish UTC would be more orderly about handling and
identifying revisions, etc., especially to materials
that are critical to the practice of the standard like
some of the TRs.

The UTC has a well-defined process for producing and identifying all revisions
to all specifications. You may find the FAQ at
http://www.unicode.org/faq/reports_process.html useful. If you have any concerns
about the processes that it uses, or suggestions for improvements, please let me
know and I will communicate those to the committee.

Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

Keld Jørn Simonsen
2004-01-06 17:27:15 UTC
Permalink
Post by Mark Davis
In terms of collation, cooperation between the consortium and ISO has produced
synchronization between the UCA and ISO 14651 on a base level (although the UCA
does add substantial capabilities).
What are the substantial facilities that UCA has added over ISO 14651?

Best regards
keld
Mark Davis
2004-01-06 18:01:25 UTC
Permalink
Take a look at http://www.unicode.org/reports/tr10/

Very broadly, it is something like the following. This is only a sketch; for
details see the above document.

- a much more thorough introduction to multilingual sorting issues
- much more information about performance and implementation practices
- how to apply collation to searching and matching
- uniform handling of canonical equivalents
- automatic rearrangement for Thai, Lao
- completely ignorable characters and irrelevant combining characters don't
interfere with contractions
- well-formedness criteria for tables (disallowing tables that would produce
results where X < Y and yet XY == YX)
- variable weighting (allowing punctuation to be ignored or not)

Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

Keld Jørn Simonsen
2004-01-07 21:35:38 UTC
Permalink
Post by John C Klensin
Keld,
As you know, I took the same "reference 10646, not Unicode"
position for many years. I have given up, and suggest you do so
Yes, I know. I don't think the fight is lost, but I am probably a
die-hard. The last defender of the free and open world :-)
Anyway, the dominance of American multinational companies in the cultural
specifications for IT in the 1970s was what motivated me into all my i18n
work, and I actually think we are better off now, with some ways to be in
charge of our own culture (e.g. Danish), than we were then.

I am more involved than you as I am the editor of some of the ISO
specifications that are not directly controlled by Unicode, but I do
cooperate with the Unicode people, and although they are quite negative
toward this work at large, they are some of the biggest contributors to
these specs too, and the ISO specs, including mine, are well aligned
with the Unicode specs.
Post by John C Klensin
* The Unicode mapping and normalization tables, which
ISO has not incorporated, are critical to key IETF
standards, especially those that are connected to, or
depend on, Stringprep.
I know, this is a hard one. I think it was the wrong decision from IETF,
and I hope we can avoid that decision for the specs we are discussing
here. There are viable ISO alternatives (IMHO).
Post by John C Klensin
* Changes in the structure and management of ISO/IEC
JTC1/SC2, and the transfer of formal responsibility for
key work from SC22 to SC2, pretty clearly put the
ISO-based management process and direction for this work
in the hands of people who will ensure that nothing of
significance is or will be approved in the ISO context
that is inconsistent with Unicode (and, probably, that
has not already reached consensus in UTC). So, for
better or worse, the concern about important divergence
is no longer important: the only difference of substance
between 10646 and friends and Unicode, in the areas of
overlap, is likely to be lag time.
That is probably right, although SC2 has some independence from Unicode.
Anyway, it is very normal in the ISO process that specifications
originated somewhere else and that ISO then standardizes them.
Think of POSIX, which grew out of AT&T's UNIX. And many character sets
originated in ECMA, e.g. iso-8859-1.
Post by John C Klensin
* Where UTF-8 differs between the two, important
computer and software vendors, and IETF standards work,
seem to be tracking Unicode. I suspect, but have not
been following the progress of the work (largely because
I no longer consider it worth the effort), that the
differences are also a matter of lag time, i.e., that
SC2 will, sooner or later, catch up.
There is in practice not much difference between ISO and Unicode UTF-8.
ISO is still 31 bit, while Unicode is 21 bits only. The IETF spec was
thus changed from 31 bits to 21 bits under the hood, when it said that
the Unicode version was the reference version. I am not so happy
about that. Unicode UTF-8 also has some more restrictions on the
coding of some characters, like the ASCII range. I believe that ISO UTF-8
would catch up on the more restrictive spec, while I am not sure whether
ISO will restrict it to 21 bits. In reality no characters are allocated
beyond the 21 bits, neither in ISO nor in Unicode.
But some clever software could use some of the extra defined user space
in 10646; I remember a proposal from Markus Kuhn using this to represent
all other charsets.
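
The practical difference is small, as a quick Python check illustrates
(Python's codec follows the 21-bit RFC 3629 definition):

    # U+10FFFF, the last code point, is four UTF-8 bytes:
    print(chr(0x10FFFF).encode("utf-8"))   # b'\xf4\x8f\xbf\xbf'

    # An old-style 31-bit five-byte sequence is rejected by a 21-bit decoder:
    try:
        b"\xf8\x88\x80\x80\x80".decode("utf-8")
    except UnicodeDecodeError as err:
        print("rejected:", err.reason)      # e.g. 'invalid start byte'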
Post by John C Klensin
* I wish UTC would be more orderly about handling and
identifying revisions, etc., especially to materials
that are critical to the practice of the standard like
some of the TRs. However, being in a situation of
having to reference one or two standards from one
organization and some technical reports from another is
worse than a situation in which we at least can be
reasonably assured of synchrony in the references. In
other words, those critical TRs refer back to [versions
of] the base Unicode standard, not to 10646. If we
reference 10646 as the base, and they reference the TRs,
we risk serious confusion about what is being specified
if there are any differences, even temporarily. And,
while an identification process that does not provide
concise and precise identification of versions of TRs
(or keep earlier versions readily accessible) adds
complication, the fact that ISO doesn't have those
documents as standards is a showstopper on the ISO path.
Yes, that is why I advocate using the ISO references only.
ISO does have specs that can do the work, AFAICT.
Post by John C Klensin
This, IMO, just isn't worth going around any more. ISO and its
member bodies have decided -- in this area and others -- to
yield its authority, development, and review process to an
outside body, limiting its role to formal final review
(typically now primarily by the same people).
As others have said, this is not truly the case. ISO has not
yielded its authority in these matters. Anyway, it is natural
in the ISO process that work is done outside ISO
and then brought to ISO, as described earlier.

best regards
keld
Martin Duerst
2004-01-07 22:19:48 UTC
Permalink
Post by Keld Jørn Simonsen
Post by John C Klensin
* The Unicode mapping and normalization tables, which
ISO has not incorporated, are critical to key IETF
standards, especially those that are connected to, or
depend on, Stringprep.
I know, this is a hard one. I think it was the wrong decision from IETF,
and I hope we can avoid that decision for the specs we are discussing
here. There are viable ISO alternatives (IMHO).
I would be interested to know which alternatives you are thinking of.
Post by Keld Jørn Simonsen
Post by John C Klensin
* Where UTF-8 differs between the two, important
computer and software vendors, and IETF standards work,
seem to be tracking Unicode. I suspect, but have not
been following the progress of the work (largely because
I no longer consider it worth the effort), that the
differences are also a matter of lag time, i.e., that
SC2 will, sooner or later, catch up.
There is in practice not much difference between ISO and Unicode UTF-8.
ISO is still 31 bit, while Unicode is 21 bits only. The IETF spec was
thus changed from 31 bits to 21 bits under the hood, when it said that
the Unicode version was the reference version. I am not so happy
about that. Unicode UTF-8 also has some more restrictions on the
coding of some characters, like the ASCII range. I believe that ISO UTF-8
would catch up on the more restrictive spec,
That would be great. It is important for the IETF for security
reasons.
Post by Keld Jørn Simonsen
while I am not sure whether
ISO will restrict it to 21 bits. In reality no characters are allocated
beyond the 21 bits, neither in ISO nor in Unicode.
But some clever software could use some of the extra defined user space
in 10646; I remember a proposal from Markus Kuhn using this to represent
all other charsets.
That's the main problem with huge amounts of empty space:
Somebody will come up with an idea of how to use it that they
think is very clever, but that bothers everybody else.
Restricting the amount of space available, even if artificially,
has the beneficial effect of guiding people toward efficient use of
space; too much space available just leads to 'sprawl'.
In this sense, I think the limitation of the original Unicode
approach to 16 bits was a good thing, and the current limitation
to 21 bits hopefully at least provides a bit of pressure in the
right direction, even if things still look really very empty
at the moment, and it will take at the very least this century
to fill up this space.

As for the IETF, I don't think that a proposal to reflect
all legacy charsets inside the UCS would meet the IETF's
interoperability goals in any way.


Regards, Martin.
James Seng
2004-01-09 17:15:44 UTC
Permalink
This discussion about 10646 vs. Unicode is interesting, but shouldn't we
go back to the topic of IMAA?

-James Seng
Jony Rosenne
2004-01-06 07:45:29 UTC
Permalink
This is the most important difference, and it relegates ISO standards to
secondary importance vis-à-vis industry standards, such as Unicode and many
others, that are freely available.

ISO should be aware of this, because it has been discussed many times,
but they do not do anything about it.

Jony
-----Original Message-----
Sent: Tuesday, January 06, 2004 3:15 AM
Subject: 10646 & Unicode
..
I haven't seen 10646 (it's too expensive),
...
AMC
Keld Jørn Simonsen
2004-01-06 10:03:52 UTC
Permalink
Post by Jony Rosenne
This is the most important difference, and it relegates ISO standards to
secondary importance vis-à-vis industry standards, such as Unicode and many
others, that are freely available.
Do you mean that 10646 not being freely available is the most
important problem? (It was not fully clear from your statement.)
Actually, there is a survey going on in ISO about this issue at the
moment, and there are other means of making 10646 freely available
that could be pursued.
Post by Jony Rosenne
ISO should be aware of this, because it has been discussed many times,
but they do not do anything about it.
Yes, ISO is a bit hard to get moving on this issue, but maybe we can
rock the boat.

Best regards
keld
Michel Suignard
2004-01-07 00:30:52 UTC
Permalink
A bit off the topic of IMAA, but so much has been said on the matter,
and not always correctly.

Mark Davis has already presented most of the points I would have
mentioned concerning the synchronization between ISO 10646 and
Unicode/UTC, so I don't need to repeat any of that. I would just add the
following points:
- Cost-wise, 10646 doesn't cost an arm and a leg anymore: 83 CHF (check
the ISO web site) for Part 1 is still too much, but it is on the same
order of magnitude as the Unicode book. With the merged version coming
up, there is a good chance that the total will be less than that of the
two previous parts.
- As time goes on, 10646 tends to incorporate more and more of the
important features of the Unicode Standard. For example, it normatively
references the Unicode Bidi algorithm and the Unicode Normalization
UAX. Basically, when folks in SC2/WG2 recognized the need for these
definitions, it was found, not surprisingly, that there was no need to
reinvent the wheel (see the sketch after this list).
- A side effect of this dual referencing is that it enforces additional
checking when the UTC wants to update any of these technical reports.
- It may surprise some, but in effect the same tool and team are
maintaining both the Unicode data and the ISO 10646 data. This is to
avoid synchronization errors as much as possible.
- There is more in ISO 10646 than just glyphs and names. There are
almost 100 pages of textual content which, for the most part, is also
in the Unicode book, but it is worth checking. For example, 10646 tends
to be terser, which I consider a feature.
- ISO 10646 is the primary issuer of source reference information for
CJK characters (both Unified and Compatibility). Because of the
synchronization effort, that data is reflected in the Unicode Unihan
database.
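(To illustrate the shared character database behind those normative
references, a sketch of my own using Python's standard unicodedata
module, which exposes the same property data that the Bidi algorithm
and the normalization forms consume:)

  import unicodedata

  for ch in ("A", "\u05D0", "1"):
      # bidirectional() returns the Bidi category: L, R, EN, ...
      print(ch, unicodedata.bidirectional(ch), unicodedata.name(ch))
  # A  L   LATIN CAPITAL LETTER A
  # א  R   HEBREW LETTER ALEF
  # 1  EN  DIGIT ONE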

Like many, I wish the ISO work were more accessible, but in fact
anybody working either in an ISO national body as a contributing expert
to SC2 or through the UTC (using its liaison status) has access to the
working drafts.

Michel Suignard
ISO/IEC 10646 project editor.