Discussion:
if you really want utf-8 headers...
Keith Moore
2004-01-01 06:34:38 UTC
Okay, I still see zero justification for utf-8 headers. The
improvement in transmission and storage efficiency is minuscule. They
make both user agents and mail transports more complex and less
reliable, because MTAs need to have conversion code (which will break
messages and cause delivery failures) and UAs need to be able to handle
old messages that use RFC 2047 (resulting in multiple code paths and
additional failure modes).

(That, and they don't address the problem that this group is trying to
solve...)

But if you believe that the very long term benefit of utf-8 headers (by
which I mean that whatever benefit might result from using utf-8 - and
it's by no means certain - won't be realized for a very long time)
somehow outweighs the very high near-term cost, then may I suggest that
the place to do the upgrade and negotiation is not in the mail
transport, but at the message store and message submission.

That is, the major benefit of using utf-8 headers would be to make life
easier for user agents and IMAP servers (for searching). They don't
benefit the transport at all. But I could imagine POP and IMAP options
that said "give me utf-8 headers instead of headers with RFC 2047
and/or IMAAs in them", and I could imagine simplified UAs that would
only talk to POP and IMAP servers that implemented that option. (I'd
hate the lack of interoperability between new simplified UAs and old
POP and IMAP servers, but there's already some precedent for UAs
insisting on nonstandard or optional features in POP and IMAP.)

Message stores could implement this in a variety of ways - they could
store the message as received and convert on-the-fly as necessary; they
could convert the header to utf-8 on receipt; etc. I could also
imagine a SUBMISSION server option that said "translate utf-8 headers
to proper on-the-wire format before forwarding them to their
destination" and UAs that would only submit messages to SUBMISSION
servers that advertised that option via EHLO. Messages sent through
SMTP or other transports would still, for the time being, be in ASCII.
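
As a rough illustration of the "translate utf-8 headers to proper
on-the-wire format" step, here is a minimal Python sketch for an
unstructured field such as Subject. The function names are hypothetical
and only the RFC 2047 encoded-word conversion is shown; address fields
would also need an IMAA/ACE step, which is not attempted here.

```python
from email.header import Header, decode_header, make_header


def to_wire_format(value: str) -> str:
    """Encode a Unicode header value as RFC 2047 encoded-words (ASCII only)."""
    return Header(value, charset="utf-8").encode()


def from_wire_format(value: str) -> str:
    """Decode RFC 2047 encoded-words back into a Unicode string."""
    return str(make_header(decode_header(value)))


if __name__ == "__main__":
    wire = to_wire_format("Grüße aus Zürich")
    print(wire)                    # ASCII-only form for the transport
    print(from_wire_format(wire))  # what a utf-8-only UA would be handed
```

A message store offering the POP/IMAP option described above could apply
the decoding direction on retrieval, so the same pair of conversions
covers both ends.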

I see several "nice" things about doing it this way:
- it isolates the complexity to portions of the system (the message
store and submission server) that are "close to" the portions of the
system (UAs and message stores) that benefit the most, which means that
users who benefit (if they do realize a benefit) will be in a better
position to get those portions upgraded.
- it is less disruptive because it affects fewer components of the mail
system at once.
- it isolates conversion to a small number of interfaces rather than
allowing conversion to potentially occur at any interface between one
MTA, gateway, firewall, filter, etc. and another, some of which offer
no opportunity for feature negotiation.
- It bounds the number of conversions that a message will undergo, and
thus bounds the potential for delivery failure and message corruption.
- it's easy to try on an experimental basis without impacting the
infrastructure

And if you also wanted to experiment with transporting utf-8
end-to-end, you could always define a SRV record for "direct utf-8 mail
delivery" and have the utf-8 SUBMISSION servers be aware of it, using
that in preference to MX. You could even use this as a means to
replace SMTP with something simpler, rather than making SMTP more
complex.

Keith

p.s. I said something like this in Minneapolis but some amplification
might be useful. I still think that even in the long term there's very
marginal benefit in going to utf-8 headers as long as we've got so many
other baroque irregularities in 2822 and MIME. Of course I understand
the potential for second-system effect, but if you just do utf-8
headers without changing anything else in the message format you're
paying a lot in upgrade cost to only simplify one fairly minor aspect
of the system.

Universal adoption of IMAAs is anything but assured. The largest age
group of the world population is fairly young (say, less than 21 years
old). Many of these people have grown up with cheap travel and good
communications, and an international popular culture. They are used
to dealing with people from other countries, and in multiple languages.
Many of these people may find that IMAAs don't benefit them so much
and that it's easier to get all email at an ASCII address (or for that
matter at an E164 number using ENUM) than it is to deal with IMAAs.

I'm not trying to argue that we shouldn't try to define IMAAs - clearly
they will be useful to some people - I'm saying that IMAAs by
themselves probably don't justify a vast upgrade to the infrastructure.
John C Klensin
2004-01-01 17:40:59 UTC
Keith (and many others),

I've been sitting out this discussion because I've been trying
to work through the long-term and transition cases and sort them
out in a coherent way. I've also been struck by the degree to
which much of this discussion reflects an apparent lack of
operational experience with the way email works in practice.

So let's take a few steps back:

Anything we do will cause some interoperability problems
--whether impact on transport or software systems or as
perceived by users -- somewhere. "8:" headers will mess
something up, somewhere, because of the issues you and others
have identified (including parsing issues, header consolidation
algorithms, special headers coming through but getting trashed
without warnings to the recipient, and so on), even if the
assumptions that cause the problems aren't justified in the
standards. And users will be furious if they see IMAA/ACE
local-parts, even if the mail goes through, and will also be
furious if mail that they consider to be well-formed bounces.
While RFC 1342 and its successors were a brilliant solution
given the constraints of the network at that time, you've
certainly got enough operational experience, and are a keen
enough observer, to know how much users hate actually seeing
them (and how much abuse Quoted-printable took). Any of these
changes will cause problems, and will make people unhappy --
probably, in the short term, more people will be unhappy than
happy about them.

That is, superficially, a really strong argument for "keep it
all in readable ASCII" and "don't try to see what else can be
squeezed into header formats" positions. With MIME, there was
clearly value-added because we had no specification for how to
handle multimedia or non-ASCII mail bodies. And, in that
context, 1342 was a really nice hack because it solved the
problems for a specific set of fields and subfields that we
could identify as specifically intended for humans, with few
protocol implications.

But the argument won't work for this case, not because it is
irrational or logically wrong, but because there is _immense_
user pressure out there for a fully multilingual Internet, one
in which, as I have said a few times, English is just another
language and ASCII is just another script. That pressure won't
accept, at least in local, homogeneous, environments, an
English/ASCII network with various kludges --which work if all
of the environments are perfectly aligned, but are unworkable or
at least ugly otherwise-- attached for other languages and
scripts.

That leaves us with a different problem, one I think we need to
get very serious about rather than distracting ourselves with
discussions about how we would behave in a better, more ideal,
world. We are going to have UTF-8 headers, and probably EUC and
BIG-8 and KOI and 8859-1 and 8859-5, headers, and many more.
We are also going to have those characters in local-part
addresses. And, in some parts of the world, we already do,
typically justified by "we need them", "they work in our
environment", and other variations of the old "just send 8"
story. They are not going away, if only because (a) the modal
interpersonal email message goes between people who share
languages and scripts and (b) because the obvious proprietary
alternatives to SMTP/822/MIME are Unicode-clean today or shortly
will be.

The question is not whether or not people will have and use
non-ASCII local parts and non-ASCII header fields, nor about
whether or not that is really necessary. The question is how we
will deal with that fact in a way that:

(i) Maximizes global interoperability of the mail
infrastructure, especially when it is actually important
in practice (not just in theory).

(ii) Minimizes damage when things leak out of Unicode or
local CCS environments.

(iii) Avoids driving users and mail systems toward
proprietary environments because they provide a better
experience.

I think those are our goals, or should be. If they are not,
then we should, IMO, be discussing that issue, not solutions.

Now, it seems to me that there are two main possible models for
getting there. And, much as I hate (and have been resisting)
putting alternatives out there, they are probably compatible
(although at some cost):

(1) We accept the conclusion that the proprietary, local
CCS (which might be Unicode in UTF-8 or some other form),
local-header-definition, systems are out there and are going to
be with us forever. We then view this strictly as a gateway
problem. Given all of our other constraints, that gateway
problem is probably best dealt with by encapsulation, e.g.,

1.1 We insist that gateway systems work in terms of
Unicode and UTF-8, keeping local character sets out of
the public Internet. Note that this condition is _not_
necessary (see below), but it would avoid lots of problems.

1.2 We invent message/rfcNNNN, where "rfcNNNN" basically
says "just like RFC2822, but all header fields are
defined as being in UTF-8, not ASCII".

1.3 The gateway converts all envelope addresses to IMAA
form and encapsulates the original message using
message/rfcNNNN, so we have a MIME body of...

From: "1342/2047 PersonalName"
<IMAA-local-***@IDNA-domain>
To: "1342/2047 PersonalName2"
<IMAA-local-***@IDNA-domain2>
Date: RFC2822-date
MIME-Version: 1.0
content-type: message/rfcNNNN
content-transfer-encoding: <as needed>

<original message, with original headers, in original
form>

I hope we can avoid it, but a charset parameter for
message/rfcNNNN would certainly not be rocket science to
define.

1.4 Clever receiving systems notice "message/rfcNNNN" and
unwind the situation in some appropriate way, with no
information loss. And note that the model above is
pure, unextended, MIME and hence causes no Received or
Return-path issues at all. Non-clever receiving systems
are going to make users unhappy.

(2) We really work on a Unicode-clean environment, supported by
transport option negotiation. In that environment, the sender
accepts the notion that i18n communication is going to occur
only with fully internationalized environments (target systems,
intermediate relays, etc., and for addresses, mailbox names,
headers, and so on). If an environment is encountered that is
not fully internationalized, the transport at the boundary is
going to either take on a gateway role and adopt the conversion
above or will bounce the mail (as with 8BITMIME, etc.).

But the second option will permit all of the edge cases for
subaddresses with non-ASCII delimiters, bidi according to strict
Unicode rules, etc., to work in predictable and obvious ways.
And, within environments that would otherwise shift to
proprietary solutions, it would deliver full i18n functionality
while permitting staying with "real" Internet mail, albeit with
upgraded (but backward compatible with existing messages and
addressing) MTAs and MUAs, while the IMAA conversions may still
lose or distort some information (although the original
information would presumably be preserved in the encapsulated
message).

Is either of those two options wonderfully attractive from an
architectural standpoint? Nope. Will we end up with as much
interoperability as we would if everyone in the world could be
persuaded to stay with pure ASCII (or encoded-to-ASCII)
SMTP/MIME? Nope. But that second question isn't relevant in
practice --"they" can't be persuaded-- and we either face
reality or give up the game.

john

p.s. For those who may be surprised or confused, option (1)
above represents a significant change of position for me. I
think it identifies a role for IMAA encoding in transport that
is not long-term unacceptably harmful... and it specifies, as
draft-hoffman-imaa-03.txt does not, what really happens to the
message headers and body to preserve real operational
compatibility, not just squeeze in an addressing variation. I
hope it helps demonstrate that some of us are really trying to
learn from these list messages, not just dig in and repeat the
same arguments over and over again. I know I'm not the only
one, but the combination of repeated arguments and epicycle-like
models from a few people is getting tedious.



--On Thursday, 01 January, 2004 01:34 -0500 Keith Moore
Post by Keith Moore
Okay, I still see zero justification for utf-8 headers. The
improvement in transmission and storage efficiency is
minuscule. They make both user agents and mail transports more
complex and less reliable, because MTAs need to have
conversion code (which will break messages and cause delivery
failures) and UAs need to be able to handle old messages that
use RFC 2047 (resulting in multiple code paths and additional
failure modes).
...
Thomas Roessler
2004-01-02 17:18:47 UTC
Post by John C Klensin
1.2 We invent message/rfcNNNN, where "rfcNNNN" basically
says "just like RFC2822, but all header fields are
defined as being in UTF-8, not ASCII".
1.3 The gateway converts all envelope addresses to IMAA
form and encapsulates the original message using
message/rfcNNNN, so we have a MIME body of...
From: "1342/2047 PersonalName"
To: "1342/2047 PersonalName2"
Date: RFC2822-date
MIME-Version: 1.0
content-type: message/rfcNNNN
content-transfer-encoding: <as needed>
<original message, with original headers, in original
form>
I hope we can avoid it, but a charset parameter for
message/rfcNNNN would certainly not be rocket science to
define.
1.4 Clever receiving systems notice "message/rfcNNNN" and
unwind the situation in some appropriate way, with no
information loss. And note that the model above is
pure, unextended, MIME and hence causes no Received or
Return-path issues at all. Non-clever receiving systems
are going to make users unhappy.
Wouldn't that construction violate MIME's "no nested encodings"
rule when transferred in a 7bit environment?

Regards,
--
Thomas Roessler · Personal soap box at <http://log.does-not-exist.org/>.
John C Klensin
2004-01-03 16:01:12 UTC
--On Friday, 02 January, 2004 18:18 +0100 Thomas Roessler
Post by Thomas Roessler
Post by John C Klensin
1.2 We invent message/rfcNNNN, where "rfcNNNN" basically
says "just like RFC2822, but all header fields are
defined as being in UTF-8, not ASCII".
1.3 The gateway converts all envelope addresses to IMAA
form and encapsulates the original message using
message/rfcNNNN, so we have a MIME body of...
From: "1342/2047 PersonalName"
To: "1342/2047 PersonalName2"
Date: RFC2822-date
MIME-Version: 1.0
content-type: message/rfcNNNN
content-transfer-encoding: <as needed>
<original message, with original headers, in original
form>
I hope we can avoid it, but a charset parameter for
message/rfcNNNN would certainly not be rocket science to
define.
1.4 Clever receiving systems notice "message/rfcNNNN" and
unwind the situation in some appropriate way, with no
information loss. And note that the model above is
pure, unextended, MIME and hence causes no Received or
Return-path issues at all. Non-clever receiving systems
are going to make users unhappy.
Wouldn't that construction violate MIME's "no nested encodings"
rule when transferred in a 7bit environment?
I didn't think so, but I'm not an adequate expert on the
convolutions of that rule. If it did, moving to something like

multipart/used-to-be-utf8-headers; boundary = "--foo"
--foo
message/utf-8-headers
content-transfer-encoding: ...
<header text>
--foo
message/ ???

<message text, possibly encoded>
--foo--

would seem to work, although it would unquestionably be less
attractive.

john
Adam M. Costello
2004-01-05 07:05:30 UTC
Post by John C Klensin
1.3 The gateway converts all envelope addresses to IMAA
form and encapsulates the original message using
message/rfcNNNN, so we have a MIME body of...
Date: RFC2822-date
MIME-Version: 1.0
content-type: message/rfcNNNN
content-transfer-encoding: <as needed>
<original message, with original headers, in original form>
Wouldn't that construction violate MIME's "no nested encodings"
rule when transferred in a 7bit environment?
Yes. RFC 2045 section 6.4 says:

it is EXPRESSLY FORBIDDEN to use any encodings other than "7bit",
"8bit", or "binary" with any composite media type, i.e. one that
recursively includes other Content-Type fields.

Therefore, if you try to encapsulate the 8-bit header and pass the
message to a 7-bit MTA, you won't be able to apply quoted-printable or
base64 encoding, and you're stuck.
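
The rule quoted above can be expressed as a small check. This is only a
sketch: the function names are hypothetical, and "composite" is taken to
mean the multipart and message top-level types.

```python
# Encodings RFC 2045 section 6.4 permits on composite media types.
ALLOWED_ON_COMPOSITE = {"7bit", "8bit", "binary"}


def is_composite(content_type: str) -> bool:
    """True for the composite top-level types (multipart/*, message/*)."""
    main_type = content_type.split("/", 1)[0].strip().lower()
    return main_type in {"multipart", "message"}


def encoding_allowed(content_type: str, cte: str) -> bool:
    """Check a Content-Transfer-Encoding against the composite-type rule."""
    if is_composite(content_type):
        return cte.strip().lower() in ALLOWED_ON_COMPOSITE
    return True


if __name__ == "__main__":
    # The proposed message/rfcNNNN wrapper cannot be made 7-bit safe:
    print(encoding_allowed("message/rfcNNNN", "quoted-printable"))
    # A leaf part such as text/plain can be:
    print(encoding_allowed("text/plain; charset=utf-8", "quoted-printable"))
```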
Post by John C Klensin
If it did, moving to something like
multipart/used-to-be-utf8-headers; boundary = "--foo"
--foo
message/utf-8-headers
content-transfer-encoding: ...
<header text>
--foo
message/ ???
<message text, possibly encoded>
--foo--
would seem to work, although it would unquestionably be less
attractive.
You have almost arrived at a structure I thought of a few weeks ago but
never finished writing up. If you make one more tweak you get something
more attractive than the initial message/rfcNNNN idea:

#### begin example message ####

From: ***@ASCII
To: ***@ASCII
Subject: ENCODED_WORD
Content-Type: multipart/header8; boundary=boundary123

--boundary123
Content-Disposition: inline
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit or quoted-printable or base64 as needed

8:From: ***@UTF8
8:To: ***@UTF8
8:Subject: UTF8

--boundary123
Content-Disposition: inline
Content-Type: WHATEVER; charset=WHATEVER
Content-Transfer-Encoding: WHATEVER

BODY
--boundary123--

#### end example message ####

The tweak I spoke of is in the content-types of the two inner parts.
The second part has the appropriate content-type for the body, the same
content-type that would have gone in the top-level header before we
inserted this multipart/header8 shim. The first part has content-type
text/plain, which is not a lie because a message header is indeed plain
text. The fact that it is not only plain text but also a message header
is conveyed by the content-type in the outer header: multipart/header8
is defined to contain exactly two parts, of which the first is a UTF-8
header and the second is an arbitrary message body.

What makes this style of encapsulation more attractive than the
message/rfcNNNN style is that it can be displayed by today's MUAs.
An MUA today will not know what to do with message/rfcNNNN, but it
will be able to cope with multipart/header8: it will treat it as
multipart/mixed, according to RFC 2046 section 5.1.7. And it will also
be able to cope with the UTF-8 header tagged as text/plain: it will
simply display it. I've tried this with my own MUA (mutt) and indeed
it makes no attempt to display message/foo but does correctly display
multipart/foo.
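
For anyone who wants to try this against their own MUA, the following
Python sketch builds such a message with the standard email package.
The multipart/header8 subtype is the proposal from this message, not a
registered type; the function name, addresses and boundary are made up
for illustration. The first part is base64-encoded by the library so
that the whole message stays 7-bit clean.

```python
from email import message_from_string
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText


def build_header8(fallback: dict, residual_header: str, body: str) -> str:
    """Wrap a UTF-8 header block and a body in a multipart/header8 shim."""
    outer = MIMEMultipart("header8", boundary="boundary123")
    for name, value in fallback.items():
        outer[name] = value  # ASCII fallback fields
    first = MIMEText(residual_header, "plain", "utf-8")  # the 8: fields
    first["Content-Disposition"] = "inline"
    second = MIMEText(body, "plain", "us-ascii")  # the real body
    second["Content-Disposition"] = "inline"
    outer.attach(first)
    outer.attach(second)
    return outer.as_string()


if __name__ == "__main__":
    wire = build_header8(
        {"From": "joerg@example.com", "Subject": "Greetings"},
        "8:From: jörg@example.com\n8:Subject: Grüße\n",
        "BODY\n",
    )
    parsed = message_from_string(wire)
    print(parsed.get_content_type())
    print(parsed.get_payload()[0].get_payload(decode=True).decode("utf-8"))
```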

This structure is rather ugly to propose as the next generation message
format, but it can instead be proposed as the downgraded form of the
next generation format. In this model, there are two classes of header
fields: old-style (what we have today) and new-style (similar, but with
direct support for non-ASCII text and maybe some other extensions). An
old-style header is a sequence of old-style fields, and a new-style
header is a sequence of either-style fields; that is, a new-style header
can contain both old-style and new-style fields.

A new-style message would simply be a new-style header and a body, but
it could be downgraded to an old-style message by splitting the header
into an old-style fallback header, a new-style residual header, and an
old-style content header, using the structure described above. For
example:

#### begin new-style message ####

Date: Mon, 5 Jan 2004 05:14:38 +0000
8:From: ***@UTF8
8:To: ***@UTF8
8:Subject: UTF8
In-Reply-To: <***@blah>
Content-Type: text/plain; charset=iso-2022-jp

BODY

#### end new-style message ####

That could be downgraded to:

#### begin old-style message ####

Date: Mon, 5 Jan 2004 05:14:38 +0000
From: ***@ASCII
To: ***@ASCII
Subject: ENCODED_WORD
In-Reply-To: <***@blah>
Content-Type: multipart/header8; boundary=boundary123

--boundary123
Content-Disposition: inline
Content-Type: text/plain; charset=utf-8

8:From: ***@UTF8
8:To: ***@UTF8
8:Subject: UTF8

--boundary123
Content-Disposition: inline
Content-Type: text/plain; charset=iso-2022-jp

BODY
--boundary123--

#### end old-style message ####

The fallback header is the outer header minus the Content-* fields
(which are part of the shim). The residual header is the body of
the first part of the multipart/header8. The content header is
the header of the second part of the multipart/header8 minus the
Content-Disposition field (which is part of the shim).

A downgraded message can be upgraded back into a new-style message,
but before I discuss that I need to clarify the relationship between
old-style and new-style fields.

Given an old-style field Foo:, there does not automatically exist a
new-style field 8:Foo:. The new-style field does not exist without its
own specification. Similarly, if someone defines a new new-style field
8:Bar:, they are not obligated to specify a corresponding old-style
field.

However, if both new-style and old-style versions of a field are
specified, then they must agree on whether multiple instances of the
field are allowed in a header. If multiple instances are allowed, then
there is no special significance to the occurrence of both old-style and
new-style forms within a header; they are simply independent instances
of the field, same as they would be if they were all old-style or all
new-style. But if multiple instances are not allowed, and both forms
occur in the same header, then they are alternates, and one must be
respected over the other. Obviously, old software will respect the
old-style form (because the new-style form won't be recognized), but new
software that understands new-style header fields should respect the
new-style form.
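
For a single-instance field, the choice between alternates could look
like the sketch below. The header is modelled as a list of (name, value)
pairs and the function name is hypothetical.

```python
def effective_value(fields, name, understands_new_style):
    """Pick the value to respect for a single-instance field.

    fields is a list of (name, value) pairs; the new-style alternate of
    field Foo is assumed to be named 8:Foo.
    """
    old = [v for n, v in fields if n.lower() == name.lower()]
    new = [v for n, v in fields if n.lower() == "8:" + name.lower()]
    if understands_new_style and new:
        return new[0]  # new software respects the new-style form
    return old[0] if old else None  # old software only sees the old form


if __name__ == "__main__":
    header = [("Subject", "Greetings"), ("8:Subject", "Grüße")]
    print(effective_value(header, "Subject", understands_new_style=True))
    print(effective_value(header, "Subject", understands_new_style=False))
```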

The specification of the new-style field may define a downgrade
conversion to the old-style form, possibly using encoded-words
and/or ACEs and/or lookups to special servers. Downgrade
conversions would be defined by at least the standard fields
8:From:, 8:Sender:, 8:Reply-To:, 8:To:, 8:Cc:, 8:Bcc:, and 8:Subject:.

The procedure for downgrading a message is as follows: The
Content-* fields go into the content header (in the second part
of the multipart/header8), the other old-style fields go into the
fallback header (in the outer header), and the new-style fields
go into the residual header (in the body of the first part of the
multipart/header8). Furthermore, old-style copies of some of the
new-style fields are created and put into the fallback header. A copy
is made if and only if the following four conditions are met:

1. the corresponding old-style field was not already present in the
original new-style header
2. the new-style field is recognized
3. multiple instances of the field are not allowed
4. a downgrade conversion is defined for the field

Finally, the shim structure is created around all of that.

Condition 1 allows the original message creator to supply a precomputed
downgraded field in the original new-style header, possibly different
from the one that would result from the standard downgrade algorithm for
that field.

There is a small exception to the rule that old-style fields go into
the fallback header, motivated by the unique role of Received: as a
trace field added in a particular order by multiple agents. If the
original new-style header contains both Received: and 8:Received:
fields, then they all go into the residual header, so that their order
can be preserved.
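
The splitting step, including the Received: exception, can be sketched
in Python as follows. Headers are modelled as lists of (name, value)
pairs. The DOWNGRADE registry is an assumption of this sketch: it stands
in for "recognized, single-instance, with a defined downgrade
conversion" (conditions 2 to 4) and covers only 8:Subject, because
address fields would also need an ACE/IMAA conversion that is not
attempted here. Building the shim itself is left out.

```python
from email.header import Header


def encode_words(value):
    """Downgrade free text to RFC 2047 encoded-words (pure ASCII)."""
    return Header(value, "utf-8").encode()


# Hypothetical registry standing in for conditions 2-4.
DOWNGRADE = {"8:subject": encode_words}


def downgrade(fields):
    """Split a new-style header into (fallback, residual, content)."""
    names = {name.lower() for name, _ in fields}
    # Received exception: if both forms occur, all of them go into the
    # residual header so that their relative order is preserved.
    trace_in_residual = "received" in names and "8:received" in names
    fallback, residual, content = [], [], []
    for name, value in fields:
        lname = name.lower()
        if lname.startswith("8:"):
            residual.append((name, value))
            convert = DOWNGRADE.get(lname)
            # Condition 1: the creator supplied no old-style form.
            if convert is not None and lname[2:] not in names:
                fallback.append((name[2:], convert(value)))
        elif lname.startswith("content-"):
            content.append((name, value))
        elif lname == "received" and trace_in_residual:
            residual.append((name, value))
        else:
            fallback.append((name, value))
    return fallback, residual, content


if __name__ == "__main__":
    for part in downgrade([
        ("Date", "Mon, 5 Jan 2004 05:14:38 +0000"),
        ("8:Subject", "Grüße"),
        ("Content-Type", "text/plain; charset=iso-2022-jp"),
    ]):
        print(part)
```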

The procedure for upgrading a downgraded message is simple: Concatenate
the fallback header, the residual header, and the content header to form
the new-style header, and discard the shim.
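
In the same (name, value) list model, the upgrade is a short function.
This is a sketch with a hypothetical function name; it assumes the three
headers have already been parsed out of the shim.

```python
def upgrade(outer_header, residual_header, second_part_header):
    """Rebuild a new-style header from a downgraded message.

    outer_header       -- fields of the outer (top-level) header
    residual_header    -- fields parsed from the body of the first part
    second_part_header -- fields of the second part's header
    """
    # Fallback header: outer header minus the shim's Content-* fields.
    fallback = [(n, v) for n, v in outer_header
                if not n.lower().startswith("content-")]
    # Content header: second part's header minus Content-Disposition.
    content = [(n, v) for n, v in second_part_header
               if n.lower() != "content-disposition"]
    return fallback + residual_header + content


if __name__ == "__main__":
    print(upgrade(
        [("Date", "Mon, 5 Jan 2004 05:14:38 +0000"),
         ("Content-Type", "multipart/header8; boundary=boundary123")],
        [("8:Subject", "Grüße")],
        [("Content-Disposition", "inline"),
         ("Content-Type", "text/plain; charset=iso-2022-jp")],
    ))
```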

By definition, every old-style header is also a new-style header, so if
you want to add a new-style field to a header, you can in general just
do it, and let the header get downgraded later if necessary. However,
we don't want to create nested shims; therefore new-style fields must
not be added to fallback headers (headers containing Content-Type:
multipart/header8). In that case, first upgrade the message, then add
the new-style field.

By the way, for anyone concerned about having multiple instances of
the 8: field, the prefix could be "8;" instead of "8:", resulting in
distinct field names "8;From", "8;To", "8;Subject", etc. Also, we could
leave room for backward-compatible expansion by using a prefix of "8::"
or "8;;", so that non-critical parameters could be inserted between
the (semi)colons, which would be ignored by implementations that don't
understand them.

As for the "other extensions" I alluded to when introducing the
term "new-style field"... as long as we're defining a new header
field syntax, we might as well consider making other changes besides
allowing non-ASCII. For example, people have expressed a desire for
alternate addresses. One approach is semi-stable cachable-or-lookupable
equivalent addresses via new servers and/or an Address-Map field.
Another approach is one-shot inline alternate addresses via an extension
of the grammar.

The current grammar defines the mailbox token as:

mailbox = name-addr / addr-spec

Suppose that in new-style fields the mailbox token is redefined as:

mailbox = single-addr / any-of-addr
single-addr = name-addr / addr-spec
any-of-addr = [display-name] "[" single-addr-list "]"
single-addr-list = single-addr *("," single-addr)

For example:

8:To: [ Joe1 <***@1.com>, Joe2 <***@2.com> ], Foo <***@bar>

(Of course the expected common usage is addresses of different scripts
or languages.)

Unlike the Address-Map or server-based approaches, there is no claim
here that the addresses are equivalent; this is simply a way to write
"address1 or address2 or address3..." in a particular place in a header.
Anyone can create such a multiple-choice list for whatever purpose they
wish; there is no question of whether an any-of-addr is authentic or
bogus. An any-of-addr can be added to an address book, but should not
be automatically cached and reused, because there is no reason to assume
that an any-of-addr that made sense for one field of one message will be
appropriate in any other context.

The intent here is *not* to create an illusion of a single address with
multiple appearances, but rather to invite a choice among multiple
distinct addresses. An MUA could display the any-of-addr literally, and
the user could simply ignore the scripts that are hard to remember and
remember the one that's easy to remember. The MUA could also provide an
option to auto-hide all but one of the choices (the one estimated to be
the best suited to the user), just as MUAs provide options to hide some
header fields.

When an MUA sends a message to an any-of-addr, it should make the
envelope recipient match whichever single-addr was displayed to the
user; if multiple addresses were displayed, it should use the one that
was first in the displayed list, or let the user choose one.

New-style fields that contain mailboxes and define downgrade conversions
will need to specify how to downgrade an any-of-addr to a single-addr,
perhaps by simply discarding all but the first single-addr in the list.
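
A deliberately simplified Python sketch of that "keep the first
single-addr" downgrade is shown below. It is not a full parser for the
grammar: quoted strings, comments and domain literals such as
user@[192.0.2.1] are not handled, the optional display-name before the
bracket is dropped, and the function names and addresses are
hypothetical.

```python
def split_top_level(text, sep=","):
    """Split on sep, but not inside a [...] group."""
    parts, depth, current = [], 0, []
    for ch in text:
        if ch == "[":
            depth += 1
        elif ch == "]":
            depth -= 1
        if ch == sep and depth == 0:
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    parts.append("".join(current).strip())
    return [p for p in parts if p]


def parse_mailboxes(value):
    """Return one list of alternative single-addrs per mailbox."""
    result = []
    for item in split_top_level(value):
        start, end = item.find("["), item.rfind("]")
        if start != -1 and end > start:
            result.append(split_top_level(item[start + 1:end]))  # any-of-addr
        else:
            result.append([item])  # plain single-addr
    return result


def downgrade_mailboxes(value):
    """Keep only the first single-addr of every any-of-addr."""
    return ", ".join(alternatives[0] for alternatives in parse_mailboxes(value))


if __name__ == "__main__":
    field = "[ Joe1 <joe1@1.example>, Joe2 <joe2@2.example> ], Foo <foo@bar.example>"
    print(parse_mailboxes(field))
    print(downgrade_mailboxes(field))
```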

AMC
Dan Oscarsson
2004-01-02 14:28:31 UTC
Post by John C Klensin
(i) Maximizes global interoperability of the mail
infrastructure, especially when it is actually important
in practice (not just in theory).
(ii) Minimizes damage when things leak out of Unicode or
local CCS environments.
(iii) Avoids driving users and mail systems toward
proprietary environments because they provide a better
experience.
I think those are our goals, or should be.
They are goals I have worked for, and they are the reason I only
want ONE way to encode characters. I will come back to that later on.
Post by John C Klensin
Now, it seems to me that there are two main possible models for
getting there.
(1) We accept the conclusion that that the proprietary, local
CCS (which might be Unicode in UTF-8 or some other form),
local-header-definition, systems are out there and are going to
be with us forever. We then view this strictly as a gateway
problem. Given all of our other constraints, that gateway
problem is probably best dealt with by encapsulation, e.g.
This is like MIME/IDNS/IMAA - encapsulate non-ASCII inside ASCII,
preserving most of the problems with interoperability.
Post by John C Klensin
(2) We really work on a Unicode-clean environment, supported by
transport option negotiation.
Which I prefer.

Back to the goals above.
I could upgrade the goals to be applied to all Internet communication.
Global interoperability is needed for all protocols.

Keith has wondered why we need UTF-8 in headers.
The major problem with how things work today is making
interoperability work without misunderstood or misidentified
characters. Today we have many ways to encode characters.
There are many different character encodings in use (for example
ASCII, ISO 8859-1, ISO 10646, ISO 8859-2), and many ways
to transfer these encodings between systems (for example,
URL %-encoding, MIME header encoding, IDNA), and these ways
can be intermixed inside the same character string.
This results in complex parsers and encoders, with lots of
opportunities to make mistakes in parsing, decoding or encoding,
and lots of code and data to handle all the encoding/decoding
formats and character sets.
When I write my software I have to spend a lot of time trying to
get everything right.

If we instead agreed on using a single character set (like UCS) and
a single transfer encoding on the wire (like UTF-8) things would be
a lot easier! For example:
- You only need one decoder to translate from transport encoding
into internal encoding (the one you use in your system).
- You only need one encoder to translate from internal encoding
to transport encoding.
- A parser for e-mail headers need only parse e-mail related data.
There is no need to parse for embedded encoded character sets,
which removes a lot of the failures that we get today.
- It is a lot easier to write applications handling e-mail, as all
problems of translating character sets are much simpler due to the
fact that only one encoding is used for all character data.
- Software gets smaller due to simpler code and translation data,
and quicker due to easier character handling.
- In summary: simpler.
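
To make the comparison concrete, here is a small Python sketch of the
two decoding paths. The function names are hypothetical; the legacy path
handles only RFC 2047 encoded-words (not %-encoding or ACEs), which
already requires per-chunk charset handling.

```python
from email.header import decode_header


def decode_legacy(raw: str) -> str:
    """Decode a header value that may mix several encoded-word charsets."""
    out = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            # Each chunk carries its own charset label (or none at all).
            out.append(chunk.decode(charset or "ascii", errors="replace"))
        else:
            out.append(chunk)
    return "".join(out)


def decode_utf8_only(raw: bytes) -> str:
    """Decode a header value in a world where only UTF-8 is on the wire."""
    return raw.decode("utf-8")


if __name__ == "__main__":
    print(decode_legacy("=?iso-8859-1?q?Gr=FC=DFe?="))
    print(decode_utf8_only("Grüße".encode("utf-8")))
```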

Moving to clean usage of a single character set for interoperability
will take time and will require gateways between the legacy world and
the single character set world. In my view gateways are best because
then you are either in the legacy world or in the single character set
world. Using any type of embedding means never getting a clean, simple
environment.

A simple, clean gateway point between old and new could be to
define that protocols over IPv6 only use UTF-8. This would result
in applications that switch between IPv6 and IPv4 handling the gateway
function, while applications that are IPv6-only would have to handle
only the single character set. Unfortunately I suspect it is too late
to define this simple solution, so we have to do negotiation in each
protocol instead.

Those are my reasons for wanting a single character set to be used for
interoperability.

Dan
Keith Moore
2004-01-02 16:45:06 UTC
Post by Dan Oscarsson
Keith has wondered why we need UTF-8 in headers.
No, that's not what I'm wondering at all. What I'm wondering is why
people think that merely saying "it's okay to use utf-8 in message
headers" will result in reduced complexity, when even a simple analysis
indicates that it will increase complexity of every part of the mail
system. And I'm wondering why, given that mail is already getting less
and less useful and less and less reliable due to spam and viruses and
the countermeasures for these, that people think that having unencoded
utf-8 in email addresses is somehow worth making email even less
reliable. We can provide IMAAs with far less disruption and
transition pain without introducing utf-8 in message headers.
Post by Dan Oscarsson
If we instead agreed on using a single character set (like UCS) and
a single transfer encoding on the wire (like UTF-8) things would be
a lot easier!
No it wouldn't, because we're still going to need to accept all of
those various character encodings and ACEs in mail from legacy MUAs and
we're still going to need to accept all of those things in old mail.

But that's part of why I suggested making this change first at the MUA
and message store, without burdening mail transport at least initially -
because the message store is in a good position to convert legacy
messages to the new format, and because POP or IMAP provide the means to
allow simplified MUAs to refuse to deal with legacy messages,
and because it doesn't burden the rest of the mail transport in the near
term.
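
A rough sketch of what that could look like at the message store, in Python. It assumes the store keeps headers as received (RFC 2047 form) and converts only when the client has asked for UTF-8. The function name `fetch_header` and the `client_wants_utf8` flag are hypothetical; no such POP or IMAP option is defined here.

```python
from email.header import decode_header, make_header


def fetch_header(stored: str, client_wants_utf8: bool) -> bytes:
    """Return a header value in the form the client asked for.

    `stored` is the value as it arrived over legacy transport, i.e.
    ASCII that may contain RFC 2047 encoded-words.
    """
    if not client_wants_utf8:
        # Legacy client: hand back exactly what was received.
        return stored.encode("ascii")
    # Simplified client: decode the encoded-words once, in the store,
    # and serve raw UTF-8 so the client needs no RFC 2047 code.
    return str(make_header(decode_header(stored))).encode("utf-8")
```

The transport never sees the UTF-8 form in this arrangement; the conversion happens only on the last hop between store and client.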

Keith
Martin Duerst
2004-01-02 19:11:12 UTC
Post by Keith Moore
But that's part of why I suggested making this change first at the MUA
and message store, without burdening mail transport at least initially -
because the message store is in a good position to convert legacy
messages to the new format, and because POP or IMAP provide the means to
allow simplified MUAs to refuse to deal with legacy messages,
and because it doesn't burden the rest of the mail transport in the near
term.
I think getting UTF-8 for the message store and for POP and IMAP is
great. But I don't think Paul or John (and the rest of us) are starting
at the wrong end, I just see it as them doing one part of the overall work.
The work on downgrading/upgrading has to be done anyway, and better
only once. And looking at all the pieces has various advantages:

- We make sure that things work together
- We get some increased push on adoption, because people have
various ways and places to start upgrading. It's often difficult
to predict in which sequence upgrading will happen; the best thing
to me seems to have consistent/compatible upgrades available
across the board.
- We know that adoption won't necessarily be that quick. If we
think it takes five years or more, we shouldn't start at one
end and wait for that to be fairly upgraded and then start at
the other end.
- A simplified MUA is an excellent example of the benefit of streamlining
to UTF-8. But for that to work, the MUA not only has to be able
to receive messages with UTF-8 headers, it also has to be able
to send them. For that, it seems that an SMTP extension is just
about right.

Regards, Martin.
Keith Moore
2004-01-02 20:32:45 UTC
Post by Martin Duerst
I think getting UTF-8 for the message store and for POP and IMAP is
great. But I don't think Paul or John (and the rest of us) are starting
at the wrong end, I just see it as them doing one part of the overall work.
The work on downgrading/upgrading has to be done anyway, and better
only once.
I agree that a specification for how to do the conversion is necessary,
and that it should only be done once. The disagreement is really about
whether it's a good idea to try to send these messages through SMTP
(making SMTP more complex and error-prone in the process) and also
whether SMTP negotiation is a good way to define the boundary between
the legacy mail system and the mail system that supports utf-8.
Post by Martin Duerst
- We make sure that things work together
- We get some increased push on adoption, because people have
various ways and places to start upgrading. It's often difficult
to predict in which sequence upgrading will happen; the best thing
to me seems to have consistent/compatible upgrades available
across the board.
- We know that adoption won't necessarily be that quick. If we
think it takes five years or more, we shouldn't start at one
end and wait for that to be fairly upgraded and then start at
the other end.
- A simplified MUA is an excellent example of the benefit of streamlining
to UTF-8. But for that to work, the MUA not only has to be able
to receive messages with UTF-8 headers, it also has to be able
to send them. For that, it seems that an SMTP extension is just
about right.
Mumble. Yes, we need to "look at" all of the pieces, at least to
understand how they are affected by IMAA. However I doubt we can
afford to fully specify all of the pieces before we start implementing
some of them, and I'm fairly certain that we can't afford to insist
that the entire signal path be upgraded before IMAAs can work. This
means that we must find a way to make IMAAs work without upgrading the
mail transport, and probably, without upgrading the message stores.
If, separately, we can upgrade the mail transport to make it more
efficient to transport such messages, that's fine - and that upgrade
can be evaluated on its own merits,
and it can happen independently of IMAA deployment.

If we're going to upgrade the mail infrastructure, of course we want to
improve transparency and add the ability to handle IMAAs, but there are
a lot more important things to consider than that.

And no, an SMTP extension is not "just about right" for sending utf-8
headers, because this means that we're making mail transport more
complex and less reliable for the sake of making MUAs simpler. That's
moving the complexity in the wrong direction, and SMTP already has
enough problems without trying to make it bear the burden of enforcing
a border between utf-8 header mail and ascii header mail.
Keith Moore
2004-01-02 16:48:25 UTC
Post by Dan Oscarsson
A simple, clean gateway point between old and new could be to
define that protocols over IPv6 only use UTF-8.
this is a really lousy idea, unless you want to make BOTH transitions
(IPv4 -> IPv6 and ascii -> utf-8) more difficult. it's far easier to
upgrade one part of the system at a time than to require that you have
to upgrade all of your MTAs, MUAs, gateways, firewalls, etc., at the
same time you introduce IPv6.
J-F C. (Jefsey) Morfin
2004-01-02 19:45:50 UTC
From my view gateways are best because then you either are in legacy
world or in single character set world.
Unfortunately there is also a real world.
Steve Hole
2004-01-02 21:33:13 UTC
Post by Keith Moore
Okay, I still see zero justification for utf-8 headers. The
improvement in transmission and storage efficiency is minuscule. They
make both user agents and mail transports more complex and less
reliable, because MTAs need to have conversion code (which will break
messages and cause delivery failures) and UAs need to be able to handle
old messages that use RFC 2047 (resulting in multiple code paths and
additional failure modes).
I am completely in agreement with this. I still see a huge delay in
transport upgrades for features like DSN, 8BITMIME and PIPELINING because,
basically, everybody has to play to make it useful. If you *require* a
transport upgrade by SMTP extension you will be looking at a decade-long
deployment.

---
Steve Hole
Chief Technology Officer - Billing and Payment Systems
ACI Worldwide
<mailto:***@ACIWorldwide.com>
Phone: 780-424-4922
Martin Duerst
2004-01-05 20:10:33 UTC
Post by Keith Moore
Universal adoption of IMAAs is anything but assured. The largest age
group of the world population is fairly young (say, less than 21 years
old) . Many of these people have grown up with cheap travel and good
communications, and an international popular culture. They are used to
dealing with people from other countries, and in multiple languages. Many
of these people may find that IMAAs don't benefit them so much and that
it's easier to get all email at an ASCII address (or for that matter at an
E164 number using ENUM) than it is to deal with IMAAs.
There are definitely a lot of young people who travel cheaply.
But there are also a lot of people who can't afford to travel,
but who still might be able to use a computer at some public
place. And while we can't do anything to make travel cheaper,
we can work on making computers easier to use, for everybody.

Regards, Martin.
Keith Moore
2004-01-05 20:29:14 UTC
Post by Martin Duerst
Post by Keith Moore
Universal adoption of IMAAs is anything but assured. The largest age
group of the world population is fairly young (say, less than 21
years old) . Many of these people have grown up with cheap travel
and good communications, and an international popular culture. They
are used to dealing with people from other countries, and in multiple
languages. Many of these people may find that IMAAs don't benefit
them so much and that it's easier to get all email at an ASCII
address (or for that matter at an E164 number using ENUM) than it is
to deal with IMAAs.
There are definitely a lot of young people who travel cheaply.
But there are also a lot of people who can't afford to travel,
but who still might be able to use a computer at some public
place. And while we can't do anything to make travel cheaper,
we can work on making computers easier to use, for everybody.
Indeed, and seen at that level, it's a laudable goal. The question is
whether it's worth it to upgrade the entire email infrastructure in
order to provide the specific feature of IMAs, when those IMAs might not
be widely used. Especially given that
a) we can provide the same service without such an expensive upgrade, and
probably without nearly so much disruption, or
b) we could use the expensive upgrade to drastically improve email
service in many other ways than just to provide IMAs.
Martin Duerst
2004-01-06 20:08:53 UTC
Post by Keith Moore
Indeed, and seen at that level, it's a laudable goal. The question is
whether it's worth it to upgrade the entire email infrastructure in
order to provide the specific feature of IMAs, when those IMAs might not
be widely used. Especially given that
a) we can provide the same service without such an expensive upgrade, and
probably without nearly so much disruption, or
Even potentially without an SMTP extension, I think that defining
a new header format and a way to down- and upgrade is important,
because only this will allow things such as simplified clients
based on upgraded delivery agents,... And once we are there,
defining an SMTP extension isn't really a big deal, although
adoption may take quite a while.
Post by Keith Moore
b) we could use the expensive upgrade to drastically improve email
service in many other ways than just to provide IMAs.
If it were up to you alone to choose, what would you do in that upgrade?

Regards, Martin.
Keith Moore
2004-01-07 00:10:58 UTC
Post by Martin Duerst
Even potentially without an SMTP extension, I think that defining
a new header format and a way to down- and upgrade is important,
because only this will allow things such as simplified clients
based on upgraded delivery agents,...
I think we should think in terms of a new message format, not merely a
new header format, because MIME is so complex and so irregular (some
would say baroque) that the amount of simplification you get from
having utf-8 in message headers is only a small part of that which
could be gained. even if you have utf-8 transparency, mail readers
still need to know how to parse address fields, normalize/canonicalize
addresses, look up IDNs, etc. you still need to deal with RFC 2047 in
received messages and old messages. you still have to support a
different syntax for each different header or bodypart field. compared
to all of this cruft, the extra overhead required to translate
addresses between ACE and raw UTF-8 is minimal.
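
For a sense of how small that translation step is, here is a sketch in Python covering only the domain part of an address. It assumes Python's built-in `idna` codec (IDNA as in RFC 3490) as the ACE. The local-part encoding that IMAA would need is not shown, and the helper names are illustrative only.

```python
def domain_to_ace(domain: str) -> str:
    """Unicode domain -> ASCII-compatible encoding (xn-- labels)."""
    return domain.encode("idna").decode("ascii")


def domain_from_ace(ace: str) -> str:
    """ASCII-compatible encoding -> Unicode domain."""
    return ace.encode("ascii").decode("idna")


# Round trip for one internationalized label; plain ASCII passes through.
ace = domain_to_ace("b\u00fccher.example")
assert ace == "xn--bcher-kva.example"
assert domain_from_ace(ace) == "b\u00fccher.example"
assert domain_to_ace("example.org") == "example.org"
```

Each direction is a single call, which is small next to address-field parsing and the per-field MIME syntax mentioned above.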
Post by Martin Duerst
And once we are there,
defining an SMTP extension isn't really a big deal, although
adoption may take quite a while.
the big deal is the leakage and damage to messages that we can expect
from the extension.
Post by Martin Duerst
Post by Keith Moore
b) we could use the expensive upgrade to drastically improve email
service in many other ways than just to provide IMAs.
If it were up to you alone to choose, what would you do in that upgrade?
off the top of my head?

for mail transport:
- clean separation between relaying and submission - MUAs talk only to
submission servers
- mutual authentication (say, based on TLS) would be required
- all messages reliably traceable to origin submission server
- sending MTAs would require external registration (say, in DNS)
- binary transparency assured
- true pipelining
- checkpoint/restart
- engineered so that most failures and configuration errors are
detected and reported at the point where corrective action can be taken
- well-defined behavior for multiple servers
- discourage store-and-forward processing - most messages should make
one hop from the submission server to the recipient's message store
(pass-through processing would be available for routing through local
proxies/firewalls)
- option to request immediate (pass-through) delivery
- ability for submission server to poll for delivery completion
(submission server is responsible for ensuring that the message is
delivered rather than expecting a message to be mailed back to the
sender)
- explicit support for proxies/firewalls/filters (i.e. the behavior is
defined as part of the protocol)
- fast, easy to parse PDU format
- built-in ability to query recipient capabilities, including recipient
filtering preferences
- built-in content negotiation ability
- [maybe] ability to use e.164 addresses as recipient addresses (for
voice mail, fax, sms, etc.)

for message format:
- text header fields in utf-8
- binary transparency for all body parts and extension fields
- extremely regular, easy to parse, probably binary, format
- directories for multipart messages to allow quick access to
individual message components
- alignment with [2]822/MIME data model, field names, content-type
names, charset names, etc.
- s/mime compatibility (yes, I think I know how to make this work)
- clean separation between: envelope/trace information, information for
the recipient UA, information supplied and/or used by recipient message
store
- ability to associate additional information with header addresses -
multiple names (in different languages), spoken name, photo, alternate
addresses (including E.164 & IM), web page URL.
- ability to specify which recipients the message author(s) think
should be included, by default, in a reply-to-all
- all messages traceable to origin domain, with opaque "nonce"
sender-ID traceable to message author (with provision for author to be
anonymous, but recipients don't have to accept anonymous mail)
- well-defined spec for downgrading to 2822/MIME such that there is a
repeatable, "canonical" translation
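
As an illustration of the last item only, here is a minimal Python sketch of a repeatable downgrade for a single text header value. It assumes RFC 2047 encoded-words as the target and relies on the standard library's `email.header` module. The names `downgrade` and `upgrade` are hypothetical, and a real spec would have to cover address fields, folding and much more.

```python
from email.header import Header, decode_header, make_header


def downgrade(value: str) -> str:
    """UTF-8 header text -> ASCII-only form usable on legacy transport."""
    if value.isascii():
        # Nothing to encode; leave plain ASCII untouched.
        return value
    # Always label the encoded-word as utf-8, so the same input gives
    # the same output no matter which gateway does the downgrade.
    return Header(value, charset="utf-8").encode()


def upgrade(wire: str) -> str:
    """Inverse direction: decode any RFC 2047 encoded-words to text."""
    return str(make_header(decode_header(wire)))
```

Fixing the charset label is what makes the translation repeatable in this sketch: two gateways downgrading the same value produce identical bytes.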

of course, the devil is in the details, and I'm sure I wouldn't get
agreement on all of these features. but I'm convinced most of this is
technically feasible.
