Full-width/half-width issues

Discussion:

Roy Badami

2003-04-21 23:55:23 UTC

I wonder whether the simpler and more general approach would be to
apply canonical decompositions to the internationalized address before
dequoting?

This would remove having to worry about full-width at, full-width
quote, etc, and would seem to have no effect on the subsequent
processing of the domain by IDNA.

This would also have the effect that parenthesized characters (such as
U+2474) would be mapped to sequences that might be treated as
comments. I'm not sure whether this would be good or bad (but see
below).

However, if we have to worry about whitespace, then we have to worry
about whether we should keep the (2)822 definition of whitespace, or
generalize it to all whitespace chars. (The draft generalizes it only
to full-width space). One approach that would avoid the issue
entirely would be to require non-traditional addresses to avoid using
the obsolete syntax of 2822. This would avoid any use of unquoted
whitespace in a non-traditional addr-spec.

(If the requirement was that the address avoided the use of obsolete
syntax after canonical decompositions were applied, this would also
prohibit the unquoted use of characters such as parenthesized digits.)

-roy

Roy Badami

2003-04-22 00:44:59 UTC

Post by Roy Badami
However, if we have to worry about whitespace, then we have to worry
about whether we should keep the (2)822 definition of whitespace, or
generalize it to all whitespace chars. (The draft generalizes it only
to full-width space). One approach that would avoid the issue
entirely would be to require non-traditional addresses to avoid using
the obsolete syntax of 2822. This would avoid any use of unquoted
whitespace in a non-traditional addr-spec.
(If the requirement was that the address avoided the use of obsolete
syntax after canonical decompositions were applied, this would also
prohibit the unquoted use of characters such as parenthesized digits.)

Forget I said that. RFC 2822 appears to still allow whitespace and
comments adjacent to the at-sign. (Though I still think there's merit
in prohibiting the use of the obsolete syntax with non-traditional
addresses.)

(Oops, and there's no such thing as FULL WIDTH SPACE, I guess the
space character in JIS X 0208 is IDEOGRAPHIC SPACE.)

So the question remains: should one allow all Unicode whitespace
characters, rather than just SPACE and HT?

It strikes me as confusing (and ultimately wrong) that the validity of
an address can depend on the kind of whitespace used. Though I've
never actually seen anyone use unquoted whitespace in an address, so
maybe this can just be ignored....

-roy

Adam M. Costello

2003-04-22 03:14:48 UTC

Post by Roy Badami
I wonder whether the simpler and more general approach would be to
apply canonical decompositions to the internationalized address before
dequoting?
This would also have the effect that parenthesized characters (such
as U+2474) would be mapped to sequences that might be treated as
comments.

Canonical decomposition has no effect on U+2474. I think you meant
compatibility decomposition.

The effect you describe can be generalized: Any character whose
decomposition contains an ASCII character might create issues if that
ASCII character is a metacharacter.
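
A minimal sketch of this point in Python, assuming the standard
unicodedata module. The SPECIALS set (the RFC 2822 "specials") and the
helper name hidden_metachars are illustrative choices, not anything
taken from the IMAA draft.

```python
import unicodedata

# RFC 2822 "specials"; used here only as an illustrative metacharacter set.
SPECIALS = set('()<>[]:;@\\,."')

def hidden_metachars(ch):
    """Return specials that show up only after compatibility decomposition."""
    decomposed = unicodedata.normalize("NFKD", ch)
    return {c for c in decomposed if c in SPECIALS} - set(ch)

paren_one = "\u2474"  # PARENTHESIZED DIGIT ONE

# Canonical decomposition (NFD) leaves U+2474 untouched.
nfd = unicodedata.normalize("NFD", paren_one)
# Compatibility decomposition (NFKD) exposes ASCII parentheses.
nfkd = unicodedata.normalize("NFKD", paren_one)

print(nfd == paren_one)
print(nfkd)
print(sorted(hidden_metachars(paren_one)))
```

The same helper flags U+FF20 (fullwidth at-sign) as hiding "@", which is
the case the draft already treats specially.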

This idea is worth considering; I just haven't given it much thought
yet.

The decomposition might also need to be done before requoting, in order
to catch hidden metacharacters that need to be protected from the more
aggressive dequoting. I found the following gotchas:

226E NOT LESS-THAN -> 003C 0338
2260 NOT EQUAL TO -> 003D 0338
226F NOT GREATER-THAN -> 003E 0338

These characters can be present in Nameprepped strings, and their
decompositions contain the ASCII characters < = > (two of which are not
allowed unquoted in local parts in message headers and SMTP commands).

Maybe one way to avoid some gotchas is to use NFKC before dequoting,
rather than NFKD, because Nameprep uses NFKC.
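
The difference between the two forms for these three characters can be
checked with a short Python sketch, assuming the standard unicodedata
module. The variable names are hypothetical; the point is only that
NFKD exposes the ASCII character while NFKC recomposes it.

```python
import unicodedata

# NOT LESS-THAN, NOT EQUAL TO, NOT GREATER-THAN
GOTCHAS = ["\u226e", "\u2260", "\u226f"]

results = {}
for ch in GOTCHAS:
    nfkd = unicodedata.normalize("NFKD", ch)  # decomposed: ASCII char + U+0338
    nfkc = unicodedata.normalize("NFKC", ch)  # recomposed: original character
    results[ch] = (nfkd, nfkc)
    print(f"U+{ord(ch):04X}: NFKD starts with {nfkd[0]!r}, "
          f"NFKC unchanged: {nfkc == ch}")
```

Under NFKC the "<", "=" and ">" never appear as separate code points, so
a dequoting step that runs after NFKC would not see them.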

Post by Roy Badami
This would remove having to worry about full-width at, full-width
quote, etc, and would seem to have no effect on the subsequent
processing of the domain by IDNA.

You seem to be suggesting that the entire mail address be normalized,
although IMAA currently talks about dequoting only the local part (the
dequoting of the domain name was implicit in IDNA).

IDNA is mostly silent about how domain names are separated into labels.
In particular, if the domain name contains a compatibility character
whose decomposition includes an ASCII full stop (.), does that delimit
labels or not? I think we punted on this because it was considered a
user-interface issue.
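
To make the question concrete, here is a small Python sketch, assuming
the standard unicodedata module. The sample characters are my own
picks for illustration; whether any of them should delimit labels is
exactly the open question.

```python
import unicodedata

# Sample characters chosen for illustration only.
candidates = {
    "\uff0e": "FULLWIDTH FULL STOP",
    "\u2024": "ONE DOT LEADER",
    "\u2488": "DIGIT ONE FULL STOP",
    "\u3002": "IDEOGRAPHIC FULL STOP",
}

# True where NFKC normalization produces an ASCII full stop.
exposes_dot = {
    ch: "." in unicodedata.normalize("NFKC", ch) for ch in candidates
}

for ch, name in candidates.items():
    print(f"U+{ord(ch):04X} {name}: {exposes_dot[ch]}")
```

U+3002 has no decomposition, so normalization alone would not turn it
into an ASCII dot; it would still need to be listed explicitly.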

If IMAA required normalization of the entire mail address, then it would
settle the question for domain names appearing in mail addresses. But
should such a decision be made by IMAA, or by a future update to IDNA?

Hmmm, the motivation for requiring recognition of fullwidth
metacharacters was consistency with the requirement to recognize
fullwidth at-sign, but now that I think about it, IDNA doesn't have
that sort of consistency. For example, an extended DNS master-file
format might allow non-ASCII domain names, and IDNA would require that
fullwidth dots be recognized as dots in such names, but it does not
require that fullwidth backslash be recognized as beginning an escape
sequence. That's a private interface issue for the designer of the
extended master-file format.

So maybe IMAA, like IDNA, should not try to dictate exactly how
dequoting is done, and should, like IDNA, limit any discussion of
fullwidth characters to those delimiters that remain after dequoting
(namely, at-sign).

It could mention that applications might want to perform some sort of
normalization before dequoting (and requoting?), but could say that that
is a user interface issue ultimately left to the application to decide.

Post by Roy Badami
However, if we have to worry about whitespace, then we have to worry
about whether we should keep the (2)822 definition of whitespace, or
generalize it to all whitespace chars.

Applying NFKC or NFKD would take care of it, because almost all
whitespace characters become ASCII space (and the few that don't
presumably don't for a reason--they aren't considered regular
whitespace).
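
A quick Python check of a few space-like characters, assuming the
standard unicodedata module. The sample list is an illustrative
selection, not a complete table of whitespace.

```python
import unicodedata

# Illustrative sample: NO-BREAK SPACE, EM SPACE, IDEOGRAPHIC SPACE,
# OGHAM SPACE MARK, LINE SEPARATOR.
samples = ["\u00a0", "\u2003", "\u3000", "\u1680", "\u2028"]

# True where NFKC maps the character to a plain ASCII space.
becomes_space = {
    ch: unicodedata.normalize("NFKC", ch) == " " for ch in samples
}

for ch in samples:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}: {becomes_space[ch]}")
```

The first three collapse to U+0020; OGHAM SPACE MARK and LINE SEPARATOR
have no decomposition and pass through unchanged.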

But if we decide against normalizing before dequoting, then we should
still consider generalizing the whitespace. Stringprep provides a table
of space characters that we could refer to.

Post by Roy Badami
One approach that would avoid the issue entirely would be to require
non-traditional addresses to avoid using the obsolete syntax of 2822.
This would avoid any use of unquoted whitespace in a non-traditional
addr-spec.

RFC 2822 defines the syntax of mail addresses appearing in
message headers, but does not define the syntax of mail addresses
entered/displayed in user interfaces, or appearing in config files.
Mail addresses in those other contexts might or might not use the syntax
of RFC 2822 (or 2821), and it is in precisely those contexts where the
definition of an IMA is relevant. (The definition of an IMA is not
relevant in message headers because message headers are ASCII-only.) So
the definition of an IMA cannot rely on the RFC 2822 syntax; it needs to
be more general than that.

Post by Roy Badami
(Oops, and there's no such thing as FULL WIDTH SPACE, I guess the
space character in JIS X 0208 is IDEOGRAPHIC SPACE.)

Good catch! The fullwidth versions of ASCII characters ought to be
defined as the characters that have <wide> decompositions to ASCII
characters. The characters fitting that description all lie in the
range FF01..FF5E, except for U+3000 (ideographic space). The definition
of "fullwidth version" in the IMAA draft should have included U+3000,
but didn't.
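
That definition can be computed directly from the Unicode character
database. The following Python sketch assumes the standard unicodedata
module (whose Unicode version depends on the Python build); the
function name wide_ascii_versions is hypothetical.

```python
import sys
import unicodedata

def wide_ascii_versions():
    """Map each code point with a <wide> decomposition to one ASCII char."""
    found = {}
    for cp in range(sys.maxunicode + 1):
        decomp = unicodedata.decomposition(chr(cp))
        if not decomp.startswith("<wide> "):
            continue
        targets = [int(t, 16) for t in decomp.split()[1:]]
        # Keep only decompositions to a single ASCII code point.
        if len(targets) == 1 and targets[0] < 0x80:
            found[cp] = targets[0]
    return found

FULLWIDTH = wide_ascii_versions()
outside = sorted(cp for cp in FULLWIDTH if not 0xFF01 <= cp <= 0xFF5E)
print(len(FULLWIDTH), [f"U+{cp:04X}" for cp in outside])
```

The only entry outside FF01..FF5E is U+3000, which maps to U+0020.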

AMC

Roy Badami

2003-04-23 22:10:03 UTC

Post by Adam M. Costello
Canonical decomposition has no effect on U+2474. I think you meant
compatibility decomposition.

I did indeed.

Post by Adam M. Costello
You seem to be suggesting that the entire mail address be normalized,
although IMAA currently talks about dequoting only the local part (the
dequoting of the domain name was implicit in IDNA).

I was indeed suggesting that.

Post by Adam M. Costello
IDNA is mostly silent about how domain names are separated into labels.
In particular, if the domain name contains a compatibility character
whose decomposition includes an ASCII full stop (.), does that delimit
labels or not? I think we punted on this because it was considered a
user-interface issue.
If IMAA required normalization of the entire mail address, then it would
settle the question for domain names appearing in mail addresses. But
should such a decision be made by IMAA, or by a future update to IDNA?

I don't think it makes sense for IDNA to punt it to the user
interface, per se. IDNA can punt the issue to the user of IDNA, which
in this case is IMAA. IMAA then has to make a decision (which could
be to punt it to the user of the IMAA, which could in turn be a user
interface).

Post by Adam M. Costello
Hmmm, the motivation for requiring recognition of fullwidth
metacharacters was consistency with the requirement to recognize
fullwidth at-sign, but now that I think about it, IDNA doesn't have
that sort of consistency. For example, an extended DNS master-file
format might allow non-ASCII domain names, and IDNA would require that
fullwidth dots be recognized as dots in such names, but it does not
require that fullwidth backslash be recognized as beginning an escape
sequence. That's a private interface issue for the designer of the
extended master-file format.

Ordinary users never encounter zone files, so it is reasonable to
leave that as a private issue for the name server implementor.

Ordinary users deal with RFC 822/2822 addresses (or at least
addr-specs) every day, and such constructs contain a variety of
metacharacters including (sometimes) quoting.

Post by Adam M. Costello
So maybe IMAA, like IDNA, should not try to dictate exactly how
dequoting is done, and should, like IDNA, limit any discussion of
fullwidth characters to those delimiters that remain after dequoting
(namely, at-sign).

IDNA addresses the issue for every metacharacter that an end-user will
ever see (there is only one, of course: full stop). I think therefore
it's consistent for IMAA to consider doing likewise (as the
current draft does).

Post by Adam M. Costello
RFC 2822 defines the syntax of mail addresses appearing in
message headers, but does not define the syntax of mail addresses
entered/displayed in user interfaces, or appearing in config files.
Mail addresses in those other contexts might or might not use the syntax
of RFC 2822 (or 2821), and it is in precisely those contexts where the
definition of an IMA is relevant.

See my separate post "On humanly-readable (printable) e-mail
addresses", which started out (largely) as a response to this
paragraph.

For any e-mail system to be useful, it needs an external
(humanly-readable, printable) representation of addresses. I would
argue that the Internet has one, even though the original authors of
RFC 822 never saw the need to formalize it.

-roy

Jeffrey J Zahari

2003-05-02 05:25:10 UTC

Dear all,

i-DNS.net will launch Punycode multilingual domain names on 2 June 2003. Punycode will be used for all existing and new multilingual domain names.

The Migration Plan, starting in May, will prepare all existing multilingual domain names for Punycode, and the company has also upgraded all IDN-related software to support this new IETF standard.

i-DNS.net, the IDN leader that, through its beginnings at the National University of Singapore, pioneered the IDN movement in early 1998, is once again proud to be among the few, if not the first, IDN providers to adopt this new IETF standard, three years in the making.

regards
Jeffrey J Zahari