Question: Fullwidth double-quote and fullwidth backslash

Discussion:

Roy Badami

2003-02-14 12:07:08 UTC

When dequoting/requoting localparts, should we consider recognizing
fullwidth double quotes and fullwidth backslash (and any other
double-quote-like and backslash-like characters)?

It seems to me that the arguments for this are similar to those for
fullwidth dot and fullwidth at, and once we decide to recognize
metacharacters in fullwidth form, we should apply this consistently to
*all* metacharacters.

-roy

Claus Färber

2003-02-14 13:33:00 UTC

Post by Roy Badami
When dequoting/requoting localparts, should we consider recognizing
fullwidth double quotes and fullwidth backslash (and any other
double-quote-like and backslash-like characters)?

Just do a NFKC normalisation at the very beginning and then additionally
map U+3002 to U+002E. This will handle all of these special cases.
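
A minimal sketch of this suggestion, assuming Python's standard
unicodedata module. The function name is hypothetical and not part of
any draft; it only illustrates that NFKC folds the fullwidth forms to
ASCII, while U+3002 needs the extra mapping because it has no
compatibility decomposition.

```python
import unicodedata


def nfkc_with_ideographic_stop(text: str) -> str:
    """Apply NFKC, then map IDEOGRAPHIC FULL STOP (U+3002) to U+002E."""
    normalized = unicodedata.normalize("NFKC", text)
    # NFKC leaves U+3002 unchanged, so it is mapped explicitly afterwards.
    return normalized.replace("\u3002", "\u002e")


# Fullwidth quote, backslash and at-sign, plus an ideographic full stop.
sample = "\uff02a\uff3cb\uff02\uff20example\u3002com"
print(nfkc_with_ideographic_stop(sample))
```

Note that this treats the whole address as one string, so it also
rewrites characters inside the local part.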

Claus

--
http://www.faerber.muc.de/

Simon Josefsson

2003-02-14 14:17:28 UTC

Post by Claus Färber

Post by Roy Badami
When dequoting/requoting localparts, should we consider recognizing
fullwidth double quotes and fullwidth backslash (and any other
double-quote-like and backslash-like characters)?

Just do a NFKC normalisation at the very beginning and then additionally
map U+3002 to U+002E. This will handle all of these special cases.

Doing normalization before mapping goes against stringprep and results
in different behaviour (see the "self reverting" test vectors on
<http://www.gnu.org/software/libidn/draft-josefsson-idn-test-vectors.html>).
I'm not saying your idea is a bad one, I think it is another
indication that IMAA cannot be a simple stringprep profile.

Claus Färber

2003-02-14 17:43:00 UTC

Post by Simon Josefsson

Post by Claus Färber
Just do a NFKC normalisation at the very beginning and then additionally
map U+3002 to U+002E. This will handle all of these special cases.

Doing normalization before mapping goes against stringprep and results
in different behaviour (see the "self reverting" test vectors on
<http://www.gnu.org/software/libidn/draft-josefsson-idn-test-vectors.html>).

It does not if you do the normalisation twice (at the very beginning and
after mapping).

For IMAA, it suffices to specify that implementations MUST accept all
characters as delimiters that decompose to one of our delimiters during
NFKC-with-U+3002-to-U+002E normalisation and that the delimiters MUST be
normalised.

The easiest way to implement this is an additional normalisation at the
very beginning.
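
As a rough illustration of which characters such a rule would cover,
the sketch below scans every code point and lists those that turn into
a delimiter under NFKC plus the U+3002 mapping. The delimiter set and
the function names are assumptions made for this example, and only
single characters are checked, so it says nothing about subtler
multi-character normalisation effects. The exact result depends on the
Unicode version shipped with the Python interpreter.

```python
import sys
import unicodedata

# Assumed delimiter set for illustration: dot, at-sign, double quote, backslash.
DELIMITERS = frozenset({".", "@", '"', "\\"})


def normalise(text: str) -> str:
    """NFKC followed by the U+3002 -> U+002E mapping."""
    return unicodedata.normalize("NFKC", text).replace("\u3002", ".")


def delimiter_variants(delimiters=DELIMITERS):
    """Map each delimiter to the other single characters that normalise to it."""
    variants = {d: [] for d in delimiters}
    for cp in range(sys.maxunicode + 1):
        if 0xD800 <= cp <= 0xDFFF:
            continue  # skip surrogate code points
        ch = chr(cp)
        mapped = normalise(ch)
        if mapped in variants and mapped != ch:
            variants[mapped].append(ch)
    return variants


if __name__ == "__main__":
    for delim, chars in sorted(delimiter_variants().items()):
        print(repr(delim), [f"U+{ord(c):04X}" for c in chars])
```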

IDNA can get away without such a normalisation because they have a
single delimiter (U+002E) in their output. The IDNA processing maps all
dot variants (including U+3002 and the width variants) to whatever
delimiter is used (usually a dot U+002E or, in DNS packets, no delimiter
at all).

Claus

--
http://www.faerber.muc.de/

Martin Duerst

2003-02-14 15:59:27 UTC

My (limited) understanding is that quotes and backslashes are not
printed on business cards, and not entered by the user. It therefore
seems completely unnecessary to consider full-width variants.
While the average user might not get the '@' right, we should
be able to rely on programmers getting the quotes and backslashes
right.

Regards, Martin.

Post by Roy Badami
When dequoting/requoting localparts, should we consider recognizing
fullwidth double quotes and fullwidth backslash (and any other
double-quote-like and backslash-like characters)?
It seems to me that the arguments for this are similar to those for
fullwidth dot and fullwidth at, and once we decide to recognize
metacharacters in fullwidth form, we should apply this consistently to
*all* metacharacters.
-roy

Roy Badami

2003-02-14 20:24:34 UTC

On the contrary, quotes can appear on business cards.

Consider the following (invented) address, obtained by mapping an
X.400 address using RFC1148 or successors:

"/PN=Roy.Badami/OU=Systems/O=Microsoft Inc/C=US/ADMD=ATT/"@x-400-relay.att.com

I really have seen addresses like this (though not recently, I'll admit).

If the LHS contains unusual characters, quoting had better appear on
the business card.

-roy

Martin Duerst

2003-02-14 20:38:18 UTC

Hello Roy,

Post by Roy Badami
On the contrary, quotes can appear on business cards.

Ok, thanks. So they actually can, and do, in odd cases.
Paper is patient. (German saying)

But are we really required, or do we see it as our goal,
to help people avoid some potential typing mistakes in
addresses that are, by their length and complexity, not
at all user-friendly in the first place?

My position is that we don't have any reason to go there.

Regards, Martin.

Post by Roy Badami
Consider the following (invented) address, obtained by mapping an
I really have seen addresses like this (though not recently, I'll admit).
If the LHS contains unusual characters, quoting had better appear on
the business card.
-roy

Roy Badami

2003-02-15 01:18:48 UTC

But are we really required, or do we see it as our goal,
to help people avoid some potential typing mistakes in
addresses that are, by their length and complexity, not
at all user-friendly in the first place?

If we're going to support quoting in IMAs then I think my original
question is still a valid one. We know that Japanese users (at least
those who are not intimately familiar with character set issues)
often consider full width and half width characters as equivalent and
interchangeable. For this reason, the IDN group chose to accept full
width dot as equivalent to half width dot.

The IMAA base document suggests doing the same for at-sign, presumably
for the same reason.

*If* we are going to allow quoting in IMAs (that aren't plain 822
addresses) then it is a reasonable question to pose to the group as
to whether the same approach should be taken with the relevant
metacharacters.

If we're going to construct a syntax for IMAs that involves double
quotes and backslash, then I think making an effort to ensure that
these are interpreted correctly by the software is sensible.

So I'm not sure I really understand your objection...

-roy

J-F C. (Jefsey) Morfin

2003-02-14 23:02:13 UTC

Post by Martin Duerst
My position is that we don't have any reason to go there.

What 95% of the users could accept today will not tell you much about what
10% will demand once IMAA has changed the conception of 80% of the
worldwide users, and service providers and designers, about the mail
address, ie a key element of a service representing 80% of the internet
traffic. IMHO the question is not "what should we do?", but "what cannot we
really do?".

I have used the WSIS lists and asked around. People cannot commit on
something they never saw. But the interest, and the subsequent demands are
here. I suggest you carry out the same test. Also, remember that people mostly
use Windows, and that Windows uses file names with spaces and writes file
names in upper case on displays, etc., and some other funny things
people see every day and they understand as an improvement over the current
proposition (or a liberation from limitations they do not understand: "why
would it be so complex? it is all over my IE screen today").

Roy Badami

2003-02-14 20:31:27 UTC

It does not if you do the normalisation twice (at the very beginning and
after mapping).

For IMAA, it suffices to specify that implementations MUST accept all
characters as delimiters that decompose to one of our delimiters during
NFKC-with-U+3002-to-U+002E normalisation and that the delimiters MUST be
normalised.

The easiest way to implement this is an additional normalisation at the
very beginning.

Are you saying we can do a normalization of the entire e-mail address
without violating IDNA (which specifies that the domain be split on
dot-like characters before normalization)?

Because we have to parse the quoting in order to identify the
local-part (the LHS may contain quoted at-signs).
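
A simplified sketch of why quoting has to be parsed first: the
separator is the first at-sign that is neither inside a quoted string
nor escaped by a backslash. The function name is hypothetical, and the
parser only understands ASCII double quotes and backslashes; it is not
a full RFC 2822 addr-spec parser.

```python
from typing import Tuple


def split_addr_spec(addr: str) -> Tuple[str, str]:
    """Split an address at the first unquoted, unescaped at-sign."""
    in_quotes = False
    i = 0
    while i < len(addr):
        ch = addr[i]
        if ch == "\\":
            i += 2  # skip the backslash and the character it escapes
            continue
        if ch == '"':
            in_quotes = not in_quotes
        elif ch == "@" and not in_quotes:
            return addr[:i], addr[i + 1:]
        i += 1
    raise ValueError("no unquoted at-sign found")


local, domain = split_addr_spec('"a@b"@example.com')
print(local, domain)
```

A naive split on the first at-sign would cut the quoted local part in
half, which is the problem raised above.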

-roy

Adam M. Costello

2003-02-15 05:25:34 UTC

This message responds to messages by Roy Badami and Claus Färber.

Post by Roy Badami
When dequoting/requoting localparts, should we consider recognizing
fullwidth double quotes and fullwidth backslash (and any other
double-quote-like and backslash-like characters)?
It seems to me that the arguments for this are similar to those for
fullwidth dot and fullwidth at, and once we decide to recognize
metacharacters in fullwidth form, we should apply this consistently to
*all* metacharacters.

I don't think the arguments are sufficiently similar.

For one thing, the dots and at-signs that delimit a mail address are
not metacharacters. They are part of the address, and they serve a
standard function in all mail addresses in all contexts. Metacharacters
are characters that are not actually part of the string they appear
in. Examples are quote characters, wildcard characters, macro-expansion
characters, etc.

The motivating example for requiring the recognition of various dots
and at-signs as separators in IDNs and IMAs is this: If I can type an
address into my IMAA-aware application and it works, then I expect to be
able to type the address into a message body, mail it to you, and have
you paste it into your IMAA-aware application, and have it work.

We cannot guarantee success, but standardizing the most common dots and
at-signs gets us 99% of the way there.

But local parts that require quotation are fundamentally more difficult,
even with today's ASCII local parts. Although there is a standard
quotation mechanism for local parts in message headers and SMTP
commands, there is no standard quotation mechanism for user interfaces.
Some user agents might copy the user input directly into the header
(relying on the user to supply any needed quotation), others might
assume the user input is literal and add more quotation if needed,
and others might allow users to use some other quotation mechanism
altogether, which the agent undoes before applying the 822-style
quotation. There's no standard, so we can't expect local parts
requiring quotation to be mailable and paste-able, even in today's ASCII
world. It would be a wasted effort to try to standardize the Unicode
variants of non-standard ASCII metacharacters.

Post by Roy Badami
quotes can appear on business cards.

They can, but anyone who puts such an address on a business card must
not be very concerned about being reachable (for the reasons above).

Aside from the futility argument, it would probably be overstepping our
authority to try to standardize Unicode variants of metacharacters.
It's not hard to imagine that local parts might be found in contexts
where dequoting them involves undoing %hex escapes or &ent; escapes.
Should we try to insist that fullwidth % and fullwidth & should be
recognized as introducing those escape sequences? Of course not, that
would almost surely contradict the relevant standards.

Post by Claus Färber
Just do a NFKC normalisation at the very beginning

Not before dequoting, for the reason given in the preceding paragraph.
Metacharacters are context-dependent and out of our jurisdiction, and
need to be removed before we even have a string to work with.

Applying NFKC after dequoting, but before subdividing the local part, is
okay.
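
The ordering can be sketched as follows, assuming 822-style quoting
with ASCII double quotes and backslashes only. Both function names are
hypothetical, and the dequoting is deliberately simplified.

```python
import unicodedata


def dequote_local_part(local: str) -> str:
    """Remove 822-style quoting; only ASCII '"' and '\\' are metacharacters."""
    if not (len(local) >= 2 and local[0] == '"' and local[-1] == '"'):
        return local  # not a quoted string, nothing to undo
    out = []
    i = 1
    end = len(local) - 1
    while i < end:
        ch = local[i]
        if ch == "\\" and i + 1 < end:
            i += 1  # drop the backslash, keep the escaped character
            ch = local[i]
        out.append(ch)
        i += 1
    return "".join(out)


def prepare_local_part(local: str) -> str:
    """Dequote first, then apply NFKC to the resulting string."""
    return unicodedata.normalize("NFKC", dequote_local_part(local))


# A quoted local part that contains a literal fullwidth quotation mark.
print(prepare_local_part('"a\uff02b"'))
```

Running NFKC on the still-quoted input instead would turn the inner
U+FF02 into an ASCII double quote and leave the quoting unbalanced,
which is one way to see why the order matters.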

Post by Claus Färber
For IMAA, it suffices to specify that implementations MUST accept
all characters as delimiters that decompose to one of our delimiters
during NFKC-with-U+3002-to-U+002E normalisation and that the
delimiters MUST be normalised.
The easiest way to implement this is an additional normalisation at
the very beginning.

I'm not confident that the first paragraph is exactly equivalent to
the second. Normalization is very subtle. If the latter is what you
have in mind, it might be best to specify that, and leave it up to the
optimizers to prove the existence of a shortcut if there is one.

By the way, I'm not sure the CJK community would want ideographic full
stop mapped to full stop inside the local part. They might prefer the
ability to have genuine ideographic full stops in there.

Post by Roy Badami
Are you saying we can do a normalization of the entire e-mail address
without violating IDNA (which specifies that the domain be split on
dot-like characters before normalization)?

IDNA requires that normalization happen as part of the processing of
each individual label, but it doesn't say there must never have been a
previous normalization step. IDNA does not specify exactly how a domain
name is split into labels (because it depends on context). In some
situations normalization could be a part of, or a precursor to, that
splitting operation.

AMC

Roy Badami

2003-02-15 13:03:35 UTC

But local parts that require quotation are fundamentally more difficult,
even with today's ASCII local parts. Although there is a standard
quotation mechanism for local parts in message headers and SMTP
commands, there is no standard quotation mechanism for user interfaces.
Some user agents might copy the user input directly into the header
(relying on the user to supply any needed quotation), others might
assume the user input is literal and add more quotation if needed,
and others might allow users to use some other quotation mechanism
altogether, which the agent undoes before applying the 822-style
quotation. There's no standard, so we can't expect local parts
requiring quotation to be mailable and paste-able, even in today's ASCII
world. It would be a wasted effort to try to standardize the Unicode
variants of non-standard ASCII metacharacters.

I'm not sure I follow where the user interface issues come in.

As I see it, an RFC-822 address (in the form in which it appears in
headers) is regarded by most users as an opaque string of characters
which they must copy verbatim in order to reach the recipient. Any
quoting that is needed in an address will already be present when the
address is given to the end user (e.g. out of band) and will be typed in
literally by the user into the MUA. Most users probably won't even
explicitly know that quoting is going on, they'll just notice that
it's a slightly unusual address that they've been given.

The only thing a user (who is not familiar with the RFCs) will do
with this opaque string is copy it or transcribe it. The string they
are given to transcribe may contain a number of classes of characters:
ascii characters, international characters, dots, at-sign,
double-quote and backslash.

IDNA already ensures that if the user accidentally mis-transcribes
normal alphanumeric characters on the RHS as full-width characters
this won't break the address, and I imagine that IMAA will involve a
similar normalization on the LHS.

IDNA already ensures that if the user accidentally mis-transcribes dot
on the RHS as full width dot, this will work.

IMAA proposes ensuring that if the user accidentally mis-transcribes
the at-sign as a full-width at-sign, the address will still work.

So if IMAA chooses not to allow for users mis-transcribing backslash
and double-quote as full-width characters, these will end up being the
*only* ASCII characters in an address (as presented to the users) that
are sensitive to full-width/half-width transcription issues.

I feel that the group should consider attempting to either solve or
avoid the transcription problem by doing one of the following:

either (1) modify the dequoting mechanism to recognize full-width-backslash,
full-width double-quote, and any other similar characters that
are considered appropriate,

or (2) declare that an IMA that contains non-ASCII characters SHOULD NOT
use quoting.

I don't have a strong preference one way or the other between the
above options.

-roy

Adam M. Costello

2003-07-03 21:02:14 UTC

Hmmm, the motivation for requiring recognition of fullwidth
metacharacters was consistency with the requirement to recognize
fullwidth at-sign, but now that I think about it, IDNA doesn't have
that sort of consistency. For example, an extended DNS master-file
format might allow non-ASCII domain names, and IDNA would require that
fullwidth dots be recognized as dots in such names, but it does not
require that fullwidth backslash be recognized as beginning an escape
sequence. That's a private interface issue for the designer of the
extended master-file format.
Ordinary users never encounter zone files, so it is reasonable to
leave that as a private issue for the name server implementor.
Ordinary users deal with RFC 822/2822 addresses (or at least
addr-specs) every day, and such constructs contain a variety of
metacharacters including (sometimes) quoting.

Do ordinary users deal with quoting in local parts? The RFC
2821/2822 local part syntaxes allow metacharacters (double-quotes and
backslashes), but do local parts seen by regular users ever contain
those characters? I can't remember ever encountering a real local part
that needed a double-quote or backslash.

Including support for fullwidth metacharacters in IMAA adds complexity
to the spec (and therefore to implementations), but how much benefit
would there be to end users? In practice, I think there would be zero
benefit.

if IMAA chooses not to allow for users mis-transcribing backslash and
double-quote as full-width characters, these will end up being the
*only* ASCII characters in an address (as presented to the users) that
are sensitive to full-width/half-width transcription issues.
I feel that the group should consider attempting to either solve or
avoid the transcription problem by doing one of the following:
either (1) modify the dequoting mechanism to recognize
full-width-backslash, full-width double-quote, and any other similar
characters that are considered appropriate,
or (2) declare that an IMA that contains non-ASCII characters SHOULD
NOT use quoting.
I don't have a strong preference one way or the other between the
above options.

The current draft tries to do (1), but I am now leaning toward (2), in
order to simplify the spec. I could live with SHOULD NOT, but a less
strict warning might suffice:

Local parts that need quoting can be difficult for humans to use.
This is already true for ASCII local parts, and is even more true
for IMA local parts. It is inadvisable to create such local parts
if they are to be used by humans.

The quoting facilities would still exist for IMAs, but the correct ASCII
metacharacters would have to be used, not the fullwidth forms. This
would not be a problem for software such as mail gateways and robots,
which are the only likely "users" of the quoting facilities.

(I think the quoting facilities were originally included in the syntax
to allow for the needs of gateways between internet mail and other mail
systems. I doubt the quoting facilities would have been included if
internet mail had been designed to be the only mail system, which is
what it is becoming.)

AMC

Roy Badami

2003-07-03 21:38:48 UTC

Post by Adam M. Costello
Do ordinary users deal with quoting in local parts?

I've only seen them a couple of times, and that was over ten years
ago, with addresses that gatewayed into X.400

Post by Adam M. Costello
The current draft tries to do (1), but I am now leaning toward (2), in
order to simplify the spec. I could live with SHOULD NOT, but a less
strict warning might suffice:

Yes, you're right, that would probably be enough.

Post by Adam M. Costello
(I think the quoting facilities were originally included in the syntax
to allow for the needs of gateways between internet mail and other mail
systems. I doubt the quoting facilities would have been included if
internet mail had been designed to be the only mail system, which is
what it is becoming.)

Gatewaying is the only context in which I can recall seeing them. So
it may well be a non-issue these days. If and when someone needs to
gateway some other internationalized e-mail system into Internet mail,
they can consider these issues when they decide how to map addresses
into the Internet world.

-roy

John C Klensin

2003-07-03 22:00:28 UTC

--On Thursday, 03 July, 2003 21:02 +0000 "Adam M. Costello"

Post by Adam M. Costello
...
Do ordinary users deal with quoting in local parts? The RFC
2821/2822 local part syntaxes allow metacharacters
(double-quotes and backslashes), but do local parts seen by
regular users ever contain those characters? I can't remember
ever encountering a real local part that needed a double-quote
or backslash.

Adam, one instance of this used to be fairly common, and I still
see it with some (although small) frequency. That involves
receiving SMTP servers that interface to local systems that

(i) Use the local user name, or workstation name, as an
email local part

(ii) Use spaces in the local user name

e.g., you would end up with a local user name, or workstation
name, of "Adam Costello" and an email local-part of
"Adam Costello" or
Adam\ Costello
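
A small sketch of the quoting step such a system performs when a local
name moves into an SMTP environment, assuming the RFC 2822 dot-atom
rules decide whether quoting is needed. The function name is
hypothetical, and non-ASCII names are simply treated as needing quotes
here.

```python
import re

# Characters allowed unquoted in an RFC 2822 atom ("atext").
_ATEXT = r"[A-Za-z0-9!#$%&'*+/=?^_`{|}~-]"
_DOT_ATOM = re.compile(rf"{_ATEXT}+(?:\.{_ATEXT}+)*\Z")


def quote_local_part(name: str) -> str:
    """Return name unchanged if it is a dot-atom, else as a quoted string."""
    if _DOT_ATOM.match(name):
        return name
    # Escape backslashes first so the escapes added for quotes stay intact.
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


print(quote_local_part("Adam Costello"))
print(quote_local_part("adam.costello"))
```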

Note that these systems may not be gateways, since the interface
is to a local mailstore on the receiving system rather than to a
different mail transport environment. The MUAs on those
systems typically handle these addresses unquoted, quoting them
only when moving into SMTP environments, but a sender from
another system usually has to deal with the quotes.

These types of local address formats are becoming less common.
But that is less, IMO, a direct consequence of SMTP becoming
ubiquitous than the result of years of experience that such
addresses, on the Internet, become a pain in sensitive parts of
the anatomy, cause excessive support costs, etc. Counting on
none of them being out there would represent very bad protocol
design.

That said, if i18n addresses can be distinguished from
historical ones in some 100% reliable way, I see no reason to
avoid making rules about them that provide good functionality
and place conversion responsibility onto the gateways and/or
receiving systems with odd ideas about addresses. Put
differently, giving those systems a choice between oddities with
ASCII addresses and i18n addresses seems to me to be a rational
tradeoff.

john

Adam M. Costello

2003-08-07 02:44:07 UTC

Post by Adam M. Costello
Local parts that need quoting can be difficult for humans to use.
This is already true for ASCII local parts, and is even more true
for IMA local parts. It is inadvisable to create such local parts
if they are to be used by humans.

Doh, I forgot to insert that into the draft, because I had previously
forgotten to add it to my todo list. It's now on the todo list.

AMC
