Charles Lindsey
2004-03-18 10:32:55 UTC
Discussion of UTF-8 headers as a way forward seems to have gone quiet.
Here are some thoughts intended to revive it. This is not a solution,
but more a demonstration of viability and a checklist of issues that
proposals would have to address.
I could work it up into an ID if people think that would be useful.
A Model for proposals for I18N of email.
----------------------------------------
Proposals so far discussed on the IMAA list have been of two types:
1. End-to-end models which require only the user agents to be upgraded.
2. 8bit-clean models which require the transport machinery to be upgraded
also.
Examples of models of type 2 are:
draft-hoffman-utf8headers-00.txt
draft-klensin-emailaddr-i18n-02.txt
draft-klensin-idn-tld-00.txt
It is presumed that the charset used in proposals of the 2nd type will be
UTF-8, though in principle it could be done with other charsets.
An "i18n-extended message" is a message which contains some headers in
UTF-8, or which is to be sent from, or to, an "i18n-address", which is one
whose local-part and/or domain is expressed in UTF-8.
We consider such email messages that are sent from A(lice) to B(øb). Both
A and B need to possess upgraded MUAs with at least the capability of
generating and displaying headers in UTF-8 (in fact such agents already
exist). In addition, both A and B need access to an upgraded MTA (though
some limited communication may be possible if only A is upgraded). It is
not assumed that intermediate MTAs have been upgraded.
The "initiator" of the exchange is whichever party has
(i) chosen to publish an i18n-address (e.g. Bøb), or
(ii) chooses to dispatch an i18n-extended message (e.g. Alice) for
whatever reason.
The initiator should not be surprised if others fail to email him (case
i), or if her messages are bounced (case ii). That is simply the price to
be paid for using features that are not yet widely deployed on the
Internet.
An "i18n-scheme" is a draft proposal (intended for standards track or
experimental) for enabling the sending and receiving of i18n-extended
messages. The purpose of this model is to facilitate the construction
of such schemes by setting out requirements for them. It is hoped that
this will help to ensure that such schemes are complete and consistent,
and to enable different schemes to be compared with one another.
All i18n-schemes compliant with this model MUST ensure that, in all cases,
EITHER the message gets through
OR it gets bounced reliably.
THE MODEL
---------
Each scheme consists of extensions to various protocols (notably RFC 2822
and RFC 2821).
1. RFC 2822
1.1 RFC 2822 is extended by defining syntax allowing certain components of
certain headers to be expressed in UTF-8.
For example:
a) the local-part of all mailboxes (this would seem to be the minimal
extension);
b) the domain in each mailbox;
c) unstructureds, in some or all of Subject, phrases, comments, etc.
But probably not:
y) date-times, as in DATE headers, etc;
z) msgids, as in Message-ID;
these being better left in strict ASCII for the foreseeable future.
The extended syntax of <address> MAY provide for alternative addresses,
mailboxes or local-parts to be specified.
1.2 With each such extended header, there SHOULD also be specified:
a) normalization requirements, e.g. NFC, NFKC, stringprep, nameprep,
etc (or an explicit statement that normalization is not needed);
b) a downsizing mechanism, preferably one that would be understood by
current agents (or an explicit statement that no downsizing is
provided for that header).
E.g. unstructured might be downsized as per RFC 2047;
IRIs might be downsized to URIs;
domains might be downsized as in IDNA;
local-parts might be downsized with the aid of Address-Map headers,
or by looking up in some mapping database;
<address>es might be restricted to their all-ASCII alternatives.
Particular care needs to be exercised if the scheme allows Received
headers to use UTF-8, since such headers may need to be examined in their
downsized form at intermediate sites. Requiring the Received header to
remain in all-ASCII (by downsizing all domains using IDNA and omitting the
FOR field if necessary) is an alternative strategy.
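As a rough illustration of two of the downsizing examples in 1.2, the
following Python sketch downsizes an unstructured value as per RFC 2047
and a domain as in IDNA. The function names are hypothetical and belong to
no particular scheme. It assumes the scheme would accept the IDNA rules
implemented by Python's built-in "idna" codec (RFC 3490).

```python
from email.header import Header, decode_header


def downsize_unstructured(text: str) -> str:
    """Return text unchanged if all-ASCII, else as RFC 2047 encoded-words."""
    if text.isascii():
        return text
    return Header(text, "utf-8").encode()


def upsize_unstructured(value: str) -> str:
    """Reverse downsize_unstructured, e.g. for display by an upgraded MUA."""
    parts = []
    for chunk, charset in decode_header(value):
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii")
        parts.append(chunk)
    return "".join(parts)


def downsize_domain(domain: str) -> str:
    """Convert the non-ASCII labels of a domain to their IDNA (ACE) form."""
    return domain.encode("idna").decode("ascii")
```

Both downsized forms are plain ASCII that current agents already accept,
which is why 2.7 allows upsizing to be omitted for these two mechanisms.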
1.3 There SHOULD be a new header to be included in all i18n-extended
messages. This will be referred to as the "Foobar-header" for now, since
its exact significance will depend on the particular scheme, and it is not
clear at this stage what additional parameters and features it might
require. Its purpose is to enable other agents that receive it and
need to do special processing on i18n-extended messages to enter their
special processing mode (or, more particularly, to avoid doing so with
normal all-ASCII messages, thus avoiding the necessity to scan every
message looking for an 8th bit set).
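The trade-off in 1.3 can be sketched in Python as follows. The header name
"Foobar" is only the placeholder used in this model, and both function
names are hypothetical. The first check stops at the first matching field
name, whereas the second has to look at every byte of the header block.

```python
def has_marker_header(header_block: bytes, name: bytes = b"Foobar") -> bool:
    """Report whether the header block contains the marker header."""
    wanted = name.lower()
    for line in header_block.splitlines():
        if line.startswith((b" ", b"\t")):
            continue  # folded continuation of the previous header
        field, sep, _ = line.partition(b":")
        if sep and field.strip().lower() == wanted:
            return True
    return False


def has_8bit(header_block: bytes) -> bool:
    """The alternative: scan every byte looking for an 8th bit set."""
    return any(byte > 0x7F for byte in header_block)
```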
1.4 Headers defined by other standards might be extended in similar ways.
For example:
a) parameters, as they arise in Content-Type and other MIME headers,
with downsizing according to RFC 2231;
b) if the scheme is extended to Netnews, newsgroup-names, as in
Newsgroups-headers.
But
z) tokens might be better left in ASCII for now.
1.5 It is expected that additional UTF-8 extensions to headers will be
defined over time. The infrastructure of the scheme MUST be able to
accommodate all such extensions.
1.6 For cases where there is no downsizing specified, there MUST be a
fallback downsizing suitable for all cases (this will likely be some form
of encapsulation). This downsizing is to be used where either none was
specified for some particular header, or where the agent performing the
downsizing was unaware of the proper downsizing for that header (but had
presumably detected the presence of UTF-8 characters).
1.7 Where the fallback downsizing method consists of encapsulation of the
entire message, the scheme MUST specify the headers to be included in the
wrapper message (naturally, these will include a Content-Type header
indicating the method of encapsulation). In particular, it must be made
clear whether the wrapper message is to be sent to the final recipient of
the enclosed message, or whether it is to be sent to some special address
(e.g. ***@recipient's-domain) where the encapsulation can be undone
prior to final delivery.
Observe that a minimal instantiation of this model would use the fallback
downsizing for all cases, although this would give poor robustness where
the receiving party was not suitably prepared.
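To make the fallback of 1.6 and 1.7 concrete, here is a minimal Python
sketch of downsizing by encapsulating the entire message. It is an
assumption-laden illustration, not a proposal: the media type
application/octet-stream, the wrapper Subject text, and the function names
are all placeholders, and a real scheme would have to define its own
Content-Type and wrapper headers. The wrapper addresses are assumed to be
all-ASCII already.

```python
from email import message_from_bytes
from email.mime.application import MIMEApplication


def encapsulate(raw_message: bytes, wrapper_from: str, wrapper_to: str) -> bytes:
    """Wrap a complete i18n-extended message in an all-ASCII wrapper."""
    # MIMEApplication base64-encodes its payload by default, so the
    # wrapper contains no octets with the 8th bit set.
    wrapper = MIMEApplication(raw_message, "octet-stream")
    wrapper["From"] = wrapper_from
    wrapper["To"] = wrapper_to
    wrapper["Subject"] = "Encapsulated i18n-extended message"
    return wrapper.as_bytes()


def unwrap(wrapper_bytes: bytes) -> bytes:
    """Undo the encapsulation and recover the original octets."""
    return message_from_bytes(wrapper_bytes).get_payload(decode=True)
```

Whether wrapper_to names the final recipient or some special address at
the recipient's domain is exactly the question that 1.7 requires a scheme
to settle.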
2. RFC 2821
2.1 It is in the nature of this model that at least the end-point MTAs
will need to be upgraded. SMTP servers MUST therefore advertize (in
response to EHLO) one or more service extensions such as the following,
which we provisionally label for the purposes of this model:
UTF-8-HEADERS for use when the i18n-extended message contains UTF-8
characters in one or more of its headers.
UTF-8-ENVELOPES for use when either or both of the FROM and RCPT
addresses includes UTF-8 characters.
There might in fact be only one extension advertized encompassing both
these purposes, but it is convenient to distinguish between them for the
purposes of this model.
It might be appropriate to insist that, if either of these extensions is
offered, 8BITMIME must be offered as well.
2.2 Clearly, at the very least, upgraded servers MUST be "8bit-clean" as
regards transmission of headers, and SHOULD be 8bit-clean as regards
headers in MIME body parts and message/rfc822 headers (though this is
pretty well guaranteed if 8BITMIME is supported).
2.2 Each originating MUA and each extended MTA MUST NOT offer an
i18n-extended message to an SMTP server unless that server advertizes the
appropriate extension (failure to observe this removes any guarantee that
the message will not be garbled or lost en route).
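The check required here might look roughly like this in Python. The
keywords UTF-8-HEADERS and UTF-8-ENVELOPES are the provisional labels from
2.1, and the function names are hypothetical. The sketch assumes a
well-formed multi-line 250 reply to EHLO whose first line is the greeting.

```python
def parse_ehlo_keywords(reply: str) -> set:
    """Collect the extension keywords from a multi-line EHLO reply."""
    keywords = set()
    for line in reply.splitlines()[1:]:  # first line is the greeting
        if not line.startswith("250") or len(line) < 5:
            continue
        # Skip the "250-" or "250 " prefix, then drop any parameters.
        keywords.add(line[4:].split(" ", 1)[0].upper())
    return keywords


def may_offer(keywords: set, utf8_headers: bool, utf8_envelope: bool) -> bool:
    """Decide whether an i18n-extended message may be offered as it stands."""
    if utf8_headers and "UTF-8-HEADERS" not in keywords:
        return False
    if utf8_envelope and "UTF-8-ENVELOPES" not in keywords:
        return False
    return True
```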
2.3 The "normal" operation of any scheme will be where all the MTAs
between Alice and Bøb have been extended, and the message can pass through
the chain without downsizing or other modification, just as all-ASCII
messages do at the present time. It is expected that systems will be
optimized for this normal case, and thus there will be a performance
penalty wherever downsizing or upsizing is needed.
2.4 It is the responsibility of any server offering these extensions to
ensure that any i18n-extended message that is received is either:
a) forwarded to another extended server that will honour that
responsibility, or
b) downsized so that it may be forwarded over servers compliant with
the current standards, or
c) delivered to the mailbox of the ultimate recipient, in a format
understood by that mailbox, or
d) bounced up the Return-Path.
A server cannot be expected to know the capabilities of the ultimate MUA
of the recipient (if only because recipients tend to use different MUAs on
different occasions, as when accessing their mailboxes from different
hosts). Therefore, the server's responsibility ends when the message is
safely stored in the proper mailbox.
2.5 If a server advertizes the UTF-8-HEADERS capability and the message is
to be forwarded to a server not advertizing that capability, it MUST
either downsize the message using one of the downsizing mechanisms
provided by the scheme, or else it MUST bounce the message.
It MAY rely on the presence of the Foobar-header, if provided by the
scheme, to determine whether the possibility of downsizing needs to be
considered, or alternatively it MAY rely on some special parameter in the
FROM command. Note that utilizing the Foobar header is contrary to the
spirit of RFC 2821 3.7, but that bridge has probably already been crossed
with the specification of 8BITMIME.
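The forwarding rule of 2.4 and 2.5 reduces to a small decision, sketched
here in Python for the headers case only. The names Action and
choose_action are hypothetical. Whether the message is i18n-extended is
assumed to have been determined already, for instance from the Foobar
header or from a parameter in the FROM command.

```python
from enum import Enum
from typing import AbstractSet


class Action(Enum):
    FORWARD = "forward unchanged"
    DOWNSIZE = "downsize, then forward"
    BOUNCE = "bounce up the Return-Path"


def choose_action(is_extended: bool,
                  next_hop_keywords: AbstractSet[str],
                  can_downsize: bool) -> Action:
    """Pick what an extended server does with a message it must relay."""
    if not is_extended:
        return Action.FORWARD  # ordinary all-ASCII message
    if "UTF-8-HEADERS" in next_hop_keywords:
        return Action.FORWARD  # the next server takes over responsibility
    if can_downsize:
        return Action.DOWNSIZE  # using a mechanism provided by the scheme
    return Action.BOUNCE
```

Every path ends in one of the outcomes listed in 2.4, so the message
either gets through or gets bounced reliably.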
2.6 For a server that advertizes the UTF-8-ENVELOPES capability, the
scheme MUST provide an extended syntax for addresses in the FROM and/or
RCPT commands (and maybe EHLO, VRFY and EXPN also), and MUST specify how
those addresses are to be interpreted. This may or may not involve
converting domains using IDNA, in order to determine which server to send
the message to next, and it may or may not involve transforming the
local-part using information provided in the message (e.g. Address-Map
headers) or in some external database. Any
<address> so transformed MAY be conveyed back to the client by means of a
251 response.
Note that, pedantically speaking, allowing UTF-8 in the envelope (even in
a service extension) violates RFC 2821 Ap. A. Will John Klensin please
advise on that one?
The server MAY (and presumably will) use those extended commands when
communicating with any server downstream that advertizes the same
capability.
2.7 The scheme MUST specify who is responsible for upsizing any downsized
message (except that upsizing MAY be omitted if the downsizing used only
mechanisms, such as RFC 2047 or IDNA, that are compatible with current
systems). It MAY specify that upsizing be performed upon arrival at the
next server with the necessary capability, though it might be considered
more reasonable to leave it in downsized form until arrival at the
end-point.
It MAY even specify that upsizing is to be the responsibility of the final
MUA.
2.8 If a message is downsized by encapsulation, the scheme MUST provide
some mechanism whereby any Received headers within the encapsulation are
combined with any Received headers added to the wrapper en route so that
the final recipient sees the same set of Received headers as if the
message had passed directly to him over the same overall route.
2.9 In any case, the scheme MUST ensure that Received headers are
downsized as necessary when passed to non-upgraded servers. However, since
the domain name of the server itself, as will be recorded in the FROM
fields of the Received headers that it adds, will likely be all-ASCII (for
that server will also be in the business of handling non-i18n-upgraded
messages), there is no pressing need for UTF-8 in the Received header at
all except for the FOR field which could, as already suggested, be
omitted.
2.10 All service extensions applied to RFC 2821 SHOULD be applied to RFC
2476 also.
3. Other transport mechanisms.
It is possible that other transport mechanisms may not offer the
possibility to advertize upgraded capabilities. But if they are already
8bit-clean, that is no bar to their use for transporting i18n-extended
messages. However, there will then be no possibility for downsizing en
route so that, at a gateway to an SMTP transport, some mechanism will be
needed to determine whether that transport is suited to the message.
Therefore, the scheme SHOULD specify requirements for gateways between
transport mechanisms (such as by requiring inspection of the Foobar
header).
4. Delivery agents.
4.1 Delivery agents hand off messages to message stores, which can
range from IMAP and POP3 servers through simple files in
mbox format and procmail filters. Mailing list expanders are also delivery
agents for the purpose of this model. In essence, their purpose (mailing
lists excepted) is to store received messages in a "mailbox" which is the
property of the ultimate recipient of a message. It is possible that the
final upsizing of a message may take place within the delivery agent.
4.2 The scheme MUST specify some means whereby the final transport server
can assure itself that the delivery agent leading to the recipient's
mailbox is capable of accepting i18n-extended messages (this may not be as
easy as it sounds, since the standards defining such agents tend not to
specify the interface with the server, but only the interface with
recipients' MUAs).
4.3 The scheme SHOULD specify some means whereby the delivery agent can be
aware that the message is an i18n-extended one. The presence of a Foobar
header would suffice for the purpose.
4.4 The scheme SHOULD specify suitable extensions for at least the IMAP
and POP3 protocols. These extensions SHOULD include the ability to deliver
the message to MUAs with the headers in UTF-8, as they were originally
sent.
--
Charles H. Lindsey ---------At Home, doing my own thing------------------------
Tel: +44 161 436 6131 Fax: +44 161 436 6133 Web: http://www.cs.man.ac.uk/~chl
Email: ***@clerew.man.ac.uk Snail: 5 Clerewood Ave, CHEADLE, SK8 3JU, U.K.
PGP: 2C15F1A9 Fingerprint: 73 6D C2 51 93 A0 01 E7 65 E8 64 7E 14 A4 AB A5