Post by Roy Badami
I wonder whether the simpler and more general approach would be to
apply canonical decompositions to the internationalized address before
dequoting?
This would also have the effect that parenthesized characters (such
as U+2474) would be mapped to sequences that might be treated as
comments.
Canonical decomposition has no effect on U+2474. I think you meant
compatibility decomposition.
The effect you describe can be generalized: Any character whose
decomposition contains an ASCII character might create issues if that
ASCII character is a metacharacter.
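A minimal sketch of that generalization, using Python's standard unicodedata module. The helper name hidden_metachars and the METACHARS set are assumptions made up for illustration; neither comes from the IMAA draft, and the real set of metacharacters depends on the syntax being parsed.

```python
import unicodedata

# Assumed set of ASCII metacharacters, for illustration only.
METACHARS = set('()<>@,;:\\".[] ')

def hidden_metachars(text):
    """Return (char, metachar) pairs where a non-ASCII character's
    compatibility decomposition (NFKD) contains an ASCII metacharacter."""
    found = []
    for ch in text:
        if ord(ch) < 0x80:
            continue  # plain ASCII is visible, not hidden
        for d in unicodedata.normalize("NFKD", ch):
            if d in METACHARS:
                found.append((ch, d))
    return found

# U+2474 PARENTHESIZED DIGIT ONE: canonical decomposition (NFD) leaves it
# alone, while compatibility decomposition (NFKD) yields "(1)".
print(unicodedata.normalize("NFD", "\u2474") == "\u2474")
print(unicodedata.normalize("NFKD", "\u2474"))
print(hidden_metachars("a\u2474b"))
```

Because NFKD applies canonical decompositions as well as compatibility ones, the same check also flags characters whose canonical decomposition starts with an ASCII metacharacter.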
This idea is worth considering; I just haven't given it much thought
yet.
The decomposition might also need to be done before requoting, in order
to catch hidden metacharacters that need to be protected from the more
aggressive dequoting. I found the following gotchas:
    226E NOT LESS-THAN    -> 003C 0338
    2260 NOT EQUAL TO     -> 003D 0338
    226F NOT GREATER-THAN -> 003E 0338
These characters can be present in Nameprepped strings, and their
decompositions contain the ASCII characters < = > (two of which are not
allowed unquoted in local parts in message headers and SMTP commands).
Maybe one way to avoid some gotchas is to use NFKC before dequoting,
rather than NFKD, because Nameprep uses NFKC.
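The difference between the two forms for these three characters can be checked with a short Python sketch (standard unicodedata module; the helper name codepoints is just an illustrative assumption):

```python
import unicodedata

def codepoints(s):
    """Format a string as space-separated four-digit hex code points."""
    return " ".join(f"{ord(c):04X}" for c in s)

for cp in (0x226E, 0x2260, 0x226F):
    ch = chr(cp)
    nfkd = unicodedata.normalize("NFKD", ch)  # decomposes, exposing < = >
    nfkc = unicodedata.normalize("NFKC", ch)  # recomposes to the original
    print(f"{cp:04X}  NFKD: {codepoints(nfkd)}  NFKC: {codepoints(nfkc)}")
```

NFKC only avoids the gotchas that come from canonical decompositions that get recomposed. Characters with compatibility decompositions, such as U+2474, still turn into ASCII sequences under NFKC.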
Post by Roy Badami
This would remove having to worry about full-width at, full-width
quote, etc, and would seem to have no effect on the subsequent
processing of the domain by IDNA.
You seem to be suggesting that the entire mail address be normalized,
although IMAA currently talks about dequoting only the local part (the
dequoting of the domain name was implicit in IDNA).
IDNA is mostly silent about how domain names are separated into labels.
In particular, if the domain name contains a compatibility character
whose decomposition includes an ASCII full stop (.), does that delimit
labels or not? I think we punted on this because it was considered a
user-interface issue.
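To make the ambiguity concrete, here is a small Python sketch using U+2024 ONE DOT LEADER, whose compatibility decomposition is a single ASCII full stop. The function split_labels is a hypothetical helper, not something defined by IDNA; it only shows that the label count depends on whether normalization happens before or after splitting.

```python
import unicodedata

def split_labels(domain, normalize_first):
    """Split a domain on ASCII full stop, optionally after applying NFKC."""
    if normalize_first:
        domain = unicodedata.normalize("NFKC", domain)
    return domain.split(".")

name = "example\u2024com"  # contains U+2024 ONE DOT LEADER, no ASCII dot

print(len(split_labels(name, normalize_first=False)))  # one label
print(len(split_labels(name, normalize_first=True)))   # two labels
```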
If IMAA required normalization of the entire mail address, then it would
settle the question for domain names appearing in mail addresses. But
should such a decision be made by IMAA, or by a future update to IDNA?
Hmmm, the motivation for requiring recognition of fullwidth
metacharacters was consistency with the requirement to recognize
fullwidth at-sign, but now that I think about it, IDNA doesn't have
that sort of consistency. For example, an extended DNS master-file
format might allow non-ASCII domain names, and IDNA would require that
fullwidth dots be recognized as dots in such names, but it does not
require that fullwidth backslash be recognized as beginning an escape
sequence. That's a private interface issue for the designer of the
extended master-file format.
So maybe IMAA, like IDNA, should not try to dictate exactly how
dequoting is done, and should, like IDNA, limit any discussion of
fullwidth characters to those delimiters that remain after dequoting
(namely, at-sign).
It could mention that applications might want to perform some sort of
normalization before dequoting (and requoting?), but could say that that
is a user interface issue ultimately left to the application to decide.
Post by Roy Badami
However, if we have to worry about whitespace, then we have to worry
about whether we should keep the (2)822 definition of whitespace, or
generalize it to all whitespace chars.
Applying NFKC or NFKD would take care of it, because almost all
whitespace characters become ASCII space (and the few that don't
presumably don't for a reason--they aren't considered regular
whitespace).
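A quick way to see which space characters fold to ASCII space is sketched below in Python. Treating "whitespace" as general category Zs is an assumption made for this example, and the helper names are hypothetical. The exact list printed depends on the Unicode version behind the interpreter's unicodedata module.

```python
import sys
import unicodedata

def space_separators():
    """All code points in general category Zs (space separator)."""
    return [cp for cp in range(sys.maxunicode + 1)
            if unicodedata.category(chr(cp)) == "Zs"]

def folds_to_ascii_space(cp):
    """True if NFKC maps the character to a single U+0020."""
    return unicodedata.normalize("NFKC", chr(cp)) == " "

for cp in space_separators():
    status = "-> 0020" if folds_to_ascii_space(cp) else "not 0020"
    print(f"{cp:04X} {unicodedata.name(chr(cp), '?')}: {status}")
```

U+1680 OGHAM SPACE MARK is one of the few that has no decomposition and so stays as it is.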
But if we decide against normalizing before dequoting, then we should
still consider generalizing the whitespace. Stringprep provides a table
of space characters that we could refer to.
Post by Roy Badami
One approach that would avoid the issue entirely would be to require
non-traditional addresses to avoid using the obsolete syntax of 2822.
This would avoid any use of unquoted whitespace in a non-traditional
addr-spec.
RFC 2822 defines the syntax of mail addresses appearing in
message headers, but does not define the syntax of mail addresses
entered/displayed in user interfaces, or appearing in config files.
Mail addresses in those other contexts might or might not use the syntax
of RFC 2822 (or 2821), and it is in precisely those contexts where the
definition of an IMA is relevant. (The definition of an IMA is not
relevant in message headers because message headers are ASCII-only.) So
the definition of an IMA cannot rely on the RFC 2822 syntax; it needs to
be more general than that.
Post by Roy Badami
(Oops, and there's no such thing as FULL WIDTH SPACE, I guess the
space character in JIS X 0208 is IDEOGRAPHIC SPACE.)
Good catch! The fullwidth versions of ASCII characters ought to be
defined as the characters that have <wide> decompositions to ASCII
characters. The characters fitting that description all lie in the
range FF01..FF5E, except for U+3000 (ideographic space). The definition
of "fullwidth version" in the IMAA draft should have included U+3000,
but didn't.
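That description can be checked against the Unicode character database. The following is a minimal Python sketch, with wide_to_ascii as a hypothetical helper name; it relies on unicodedata.decomposition(), which returns strings such as "<wide> 0021".

```python
import sys
import unicodedata

def wide_to_ascii():
    """Map every code point whose decomposition is <wide> plus a single
    ASCII character onto that ASCII code point."""
    result = {}
    for cp in range(sys.maxunicode + 1):
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) == 2 and parts[0] == "<wide>":
            target = int(parts[1], 16)
            if target < 0x80:  # keep only ASCII targets
                result[cp] = target
    return result

table = wide_to_ascii()
outside = sorted(cp for cp in table if not 0xFF01 <= cp <= 0xFF5E)
print(len(table), [f"{cp:04X}" for cp in outside])
```

The other characters with <wide> decompositions (FF5F, FF60, and FFE0..FFE6) map to non-ASCII characters, so the filter drops them.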
AMC