Discussion:
FYI: BOF on Internationalized Email Addresses (IEA)
Patrik Fältström
2003-10-26 19:55:08 UTC
Permalink
At the IETF in Minneapolis, there will be a BOF on Internationalized
Email Addresses (IEA).

It is *preliminarily* on the agenda for Monday, November 10, 2003, at
1530-1730.

Chairs: Pete Resnick, Patrik Fältström
Mailing list: ***@imc.org (other salient lists include ***@w3.org)
Agenda:

Agenda Bashing (Chairs) 5 min.
Topic Introduction (Chairs) 10 min.
Proposals
IDNA-Based (Paul Hoffman) 15 min.
Infrastructure-Based (John Klensin) 15 min.
IRI-Based (Michel Suignard) 15 min.

Discussion 60 min.

Topics for discussion:

Are there other solutions which have been specified?

The solutions present the problem at different scopes;

Where should the IETF tackle it?

Are some short-term, and other long-term?

Can the solutions be staged or co-exist?

If staged, how to migrate from one to another?

What are the next steps for the IETF?


NB: This BoF is exploratory in nature, and it is not intended that the
IETF will finalize a decision in this venue. It was proposed to foster
a community discussion, not charter a working group or pick a winner.
If further work is required, step one would be identifying individuals
willing to carry that work forward.

Reading material:
draft-hoffman-imaa-03.txt
draft-klensin-emailaddr-i18n-01.txt
draft-duerst-iri-04.txt

Pete and I hope people will come with a lot of constructive
comments and ideas.

Patrik, co-chair of the bof
Dave Crocker
2003-10-27 15:41:11 UTC
Permalink
Patrik,

Thanks for putting this BOF together.


PF> Where should the IETF tackle it?

I am not sure I understand this question. Please clarify.


PF> What are the next steps for the IETF?

Would it help to have a draft charter for the meeting? (I realize that the
presence of such different specifications makes a charter at least a bit
challenging, but it seems to help to have a draft, to make things concrete.)



d/
--
Dave Crocker <dcrocker-at-brandenburg-dot-com>
Brandenburg InternetWorking <www.brandenburg.com>
Sunnyvale, CA USA <tel:+1.408.246.8253>
Keith Moore
2003-10-27 15:52:22 UTC
Permalink
> PF> What are the next steps for the IETF?
>
> Would it help to have a draft charter for the meeting?

let's back up a step further.

what problem are we trying to solve here?

Keith
Keith Moore
2003-10-27 18:44:03 UTC
Permalink
Mark,

Thanks for taking a stab at a problem statement. I'd like to drill down
on this just a bit.

What is the source of the "growing need"? Is it:

a. for users of many languages (particularly those not using Latin alphabets)
email addresses are difficult to remember
b. for users of many languages (particularly those not using Latin alphabets)
email addresses are difficult to transcribe or type
c. users want to use their names in email addresses
d. users are confused by apparently arbitrary restrictions on use of
characters in email addresses, and this leads to mistakes
e. on computer systems employing non-ASCII names for other purposes (e.g.
login or account names) these do not map well to ASCII email addresses

or something else that I don't see?

> As presently constituted, email addresses are limited to the 26 Latin
> alphabetics, 10 digits, and a limited number of special characters in
> the ASCII character set. There is a growing need to use additional
> characters, specifically Latin characters with diacriticals and
> non-Latin characters, in email addresses to better serve the needs of
> the multi-national Internet community. However, the restrictions of
> ASCII email addresses have served as a "lingua franca" since everybody
> can enter ASCII email addresses, and there is an ongoing need for this
> as well. The problem to be solved is the resolution of these two
> needs.
Mark Crispin
2003-10-27 19:10:55 UTC
Permalink
On Mon, 27 Oct 2003, Keith Moore wrote:
> Thanks for taking a stab at a problem statement. I'd like to drill down
> on this just a bit.
> What is the source of the "growing need"? Is it:
> [snip]

I agree that this needs to be stated, but someone other than me will have
to do it.

I believe that the primary push for this functionality comes from regions
which use Latin alphabetics with diacriticals; and that most individuals
in regions which do not use Latin script accept the use of Latin
script for multinational interchange. In many regions where Latin
diacriticals are used, there is no acceptable transform of a surname to a
form that does not use diacriticals. Simply omitting the diacritical
causes (at least to the inhabitants of those regions) a misspelling.

This set of beliefs naturally biases how I approach the problem. The
problem statement must be free of bias, including mine.

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.
Keith Moore
2003-10-27 19:17:54 UTC
Permalink
[recipient list trimmed to what I hope is a reasonable subset]

> On Mon, 27 Oct 2003, Keith Moore wrote:
> > Thanks for taking a stab at a problem statement. I'd like to drill
> > down on this just a bit.
> > What is the source of the "growing need"? Is it:
> > [snip]
>
> I agree that this needs to be stated, but someone other than me will
> have to do it.

Sorry, I should have made that clearer. I intended the question for
everyone, not just you.

I do suspect that the shape of the solution looks very different
depending on how a person defines the problem, and in particular,
on his beliefs about the "growing need".

Keith
Zefram
2003-10-28 15:13:06 UTC
Permalink
Mark Crispin wrote:
> In many regions where Latin
>diacriticals are used, there is no acceptable transform of a surname to a
>form that does not use diacriticals. Simply omitting the diacritical
>causes (at least to the inhabitants of those regions) a misspelling.

Ah, this one's easy. Local parts aren't limited to Latin letters,
they can use all the ASCII printables. Diacriticals are available
there, albeit in characters that are shared with other uses.
<dej`***@foo.example> is a perfectly valid email address already.
It doesn't start to get tricky until we get into the eastern European
languages -- ASCII only intentionally provides western European
diacriticals.
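
A minimal sketch of the point about ASCII printables, assuming the unquoted "dot-atom" form of a local part and the atext character set from RFC 2822 section 3.2.4. The function name is_dot_atom and the sample local parts are hypothetical, chosen only for illustration; quoted-string local parts are not handled here.

```python
import string

# atext per RFC 2822 section 3.2.4: ASCII letters, digits, and these specials.
ATEXT = set(string.ascii_letters + string.digits + "!#$%&'*+-/=?^_`{|}~")


def is_dot_atom(local_part: str) -> bool:
    """Return True if local_part is an unquoted RFC 2822 dot-atom.

    A dot-atom is one or more non-empty runs of atext separated by
    single dots. Quoted-string local parts are out of scope.
    """
    atoms = local_part.split(".")
    return all(atom and set(atom) <= ATEXT for atom in atoms)


if __name__ == "__main__":
    # Apostrophe and grave accent are atext, so ASCII "pseudo-diacritics" pass.
    print(is_dot_atom("jos'e"))       # apostrophe standing in for an acute
    print(is_dot_atom("fran`cois"))   # grave accent character
    # A real non-ASCII letter is rejected by the ASCII-only syntax.
    print(is_dot_atom("jos\u00e9"))
```

The check only says what the existing syntax permits; it says nothing about whether such spellings are acceptable to the people whose names they approximate.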

Cue the debate about whether the diacritic should go before or after
the base letter.

-zefram
V***@vt.edu
2003-10-29 07:51:55 UTC
Permalink
On Tue, 28 Oct 2003 15:13:06 GMT, Zefram said:

> It doesn't start to get tricky until we get into the eastern European
> languages -- ASCII only intentionally provides western European
> diacriticals.

"Macrons and carons and cedillas, oh my..." :)

Actually, ASCII doesn't intentionally provide any diacriticals. Western
European diacriticals are in the Unicode Latin-1 Supplement, and (as you
correctly note) some of us of the eastern European persuasion need characters
from Latin Extended-A and/or Latin Extended-B to actually do things...

> Cue the debate about whether the diacritic should go before or after
> the base letter.

So tell me, does the dot on an 'i' go before or after the base letter?

(OK, so I'm just touchy because nobody on this side of the big puddle
wants to deal with the fact that the 3rd letter of the preferred spelling of
my last name is Unicode codepoint 0113 (Latin small e with macron)).
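
A small sketch of the code point facts mentioned above, using Python's standard unicodedata module. The helper name describe and the simplified block boundaries are assumptions made for illustration; only the four blocks named in this exchange are distinguished.

```python
import unicodedata


def describe(ch: str) -> dict:
    """Report the Unicode name, a coarse block label, and the NFD form of ch."""
    cp = ord(ch)
    if cp < 0x80:
        block = "ASCII"
    elif cp <= 0xFF:
        block = "Latin-1 Supplement"
    elif cp <= 0x17F:
        block = "Latin Extended-A"
    elif cp <= 0x24F:
        block = "Latin Extended-B"
    else:
        block = "other"
    # NFD splits a precomposed letter into base letter + combining mark(s).
    decomposed = unicodedata.normalize("NFD", ch)
    return {
        "name": unicodedata.name(ch),
        "block": block,
        "nfd": [f"U+{ord(c):04X}" for c in decomposed],
    }


if __name__ == "__main__":
    print(describe("\u0113"))  # the letter discussed above
    print(describe("\u00e9"))  # a western European example
```

In Unicode's decomposed form the combining mark follows the base letter, which is one existing answer to the "before or after" question, though not one that ASCII spellings are bound by.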
Pete Resnick
2003-10-27 16:37:21 UTC
Permalink
On 10/27/03 at 10:52 AM -0500, Keith Moore wrote:

>>DC: Would it help to have a draft charter for the meeting?

As was mentioned in the draft agenda at
<http://www.ietf.org/ietf/03nov/iea.txt>, we want to simply start the
discussion, not immediately attempt to charter a working group.

>let's back up a step further.
>
>what problem are we trying to solve here?

Please have a look at draft-hoffman-imaa,
draft-klensin-emailaddr-i18n, and draft-duerst-iri. Certainly the
first two have very explicit descriptions of the problem.

pr
--
Pete Resnick <http://wwww.qualcomm.com/~presnick/>
QUALCOMM Incorporated - Direct phone: (858)651-4478, Fax: (858)651-1102
Marc Blanchet
2003-10-27 17:00:01 UTC
Permalink
-- Monday, October 27, 2003 10:52:22 -0500 Keith Moore <***@cs.utk.edu>
wrote/a ecrit:

>> PF> What are the next steps for the IETF?
>>
>> Would it help to have a draft charter for the meeting?
>
> let's back up a step further.
>
> what problem are we trying to solve here?

to me, that (problem we are trying to solve) would be part of the
introduction in the charter...

so I guess some initial proposal for:
- what are we trying to solve
- what would be the way to solve it

would be a good starting point together with the "state-of-the-art"
presentations.

Marc.


>
> Keith



------------------------------------------
Marc Blanchet
Hexago
tel: +1-418-266-5533x225
------------------------------------------
http://www.freenet6.net: IPv6 connectivity
------------------------------------------
Mark Crispin
2003-10-27 17:13:29 UTC
Permalink
On Mon, 27 Oct 2003, Keith Moore wrote:
> what problem are we trying to solve here?

I agree with Keith. This isn't to say that I dispute that there is a
problem to be solved -- indeed, I think that the problem is apparent to
all -- but we must have a problem statement that we all agree upon before
we even think about solutions.

I don't think that references to drafts of proposed solutions suffice as a
problem statement. Leaving aside questions of possible bias (= present a
problem in such a way that this is the obvious best solution), having the
problem statement in a draft (which by its nature is an ephemeral
document) muddies the issues.

The problem statement should consist of a single paragraph (preferably
one or two sentences), separate from any proposed solution, and stated
in a charter which is approved by everyone.

Here's a start at such a statement:

As presently constituted, email addresses are limited to the 26 Latin
alphabetics, 10 digits, and a limited number of special characters in the
ASCII character set. There is a growing need to use additional
characters, specifically Latin characters with diacriticals and non-Latin
characters, in email addresses to better serve the needs of the
multi-national Internet community. However, the restrictions of ASCII
email addresses have served as a "lingua franca" since everybody can enter
ASCII email addresses, and there is an ongoing need for this as well. The
problem to be solved is the resolution of these two needs.

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.
WJCarpenter
2003-10-27 17:30:46 UTC
Permalink
mc> As presently constituted, email addresses are limited to the 26
mc> Latin alphabetics, 10 digits, and a limited number of special
mc> characters in the ASCII character set. There is a growing need to

upper and lower case alphabetics
--
bill-***@carpenter.ORG (WJCarpenter) PGP 0x91865119
38 95 1B 69 C9 C6 3D 25 73 46 32 04 69 D6 ED F3
Dave Aronson
2003-10-27 19:43:09 UTC
Permalink
On Mon October 27 2003 12:30, WJCarpenter wrote:

> mc> As presently constituted, email addresses are limited to the 26
> mc> Latin alphabetics, 10 digits, and a limited number of special
> mc> characters in the ASCII character set. There is a growing need
> to
>
> upper and lower case alphabetics

Yes, but with either of those two sets (generally) considered equivalent
to the other, boiling down to effectively 26 choices.

--
Dave Aronson, Senior Software Engineer, Secure Software Inc.
Email me at: work (D0T) 2004 (@T) dja (D0T) mailme (D0T) org
Web: http://destined.to/program http://listen.to/davearonson
Adam M. Costello
2003-10-27 21:22:30 UTC
Permalink
The extremely broad To/Cc list was appropriate for the initial
announcement of the BOF, but for this ensuing discussion I'm guessing it
would be good to trim it down, so I did.

Mark Crispin <***@CAC.Washington.EDU> wrote:

> As presently constituted, email addresses are limited to the 26 Latin
> alphabetics, 10 digits, and a limited number of special characters in
> the ASCII character set.

Not so limited. According to RFCs 821 & 822, all ASCII characters are
allowed. According to RFCs 2821 & 2822, NUL is "obsolete", as are CR
and LF except as the pair CRLF. (Obsolete means must be accepted and
must not be generated.)

Keith Moore <***@cs.utk.edu> wrote:

> What is the source of the "growing need"? Is it:
>
> a. for users of many languages (particularly those not using Latin
> alphabets) email addresses are difficult to remember
> b. for users of many languages (particularly those not using Latin
> alphabets) email addresses are difficult to transcribe or type
> c. users want to use their names in email addresses
> d. users are confused by apparently arbitrary restrictions on use of
> characters in email addresses, and this leads to mistakes
> e. on computer systems employing non-ASCII names for other purposes
> (e.g. login or account names) these do not map well to ASCII email
> addresses
>
> or something else that I don't see?

Regarding (a), there are at least two kinds of remembering: one is
recognition (is this address the same one I saw yesterday? is it a font
variation or a different character?); the other, more challenging, is
recall (mentally retrieve the address I saw yesterday). Even harder
than remembering is reproducing (draw the characters or find them on a
keyboard) which is (b).

I've heard claims of all of those sources, except (d). But I think (d)
will become true if internationalized mail addresses are not introduced.
I think users will be astonished that non-ASCII characters are allowed
after the at-sign but not before it.

I guess a problem statement should include both the motivation and the
challenge. The challenge is the same as for internationalized domain
names: Given a huge installed infrastructure of protocols, end-user
software, and intermediate software, all built on the assumption that
identifiers are ASCII, how can you relax that assumption without causing
so much breakage and non-interoperability that people would rather stick
with the existing ASCII system than endure the transition?

There are presumably several challenges, but that is the one that I see
as the main challenge. I suppose that the people advocating approaches
very different from IMAA might think I'm overestimating the height
of this hurdle, and therefore might see something else as the main
challenge.

AMC
Dave Crocker
2003-10-28 00:19:25 UTC
Permalink
Folks,

On the theory that discussions go better when they have a concrete
deliverable, here is a proposed charter for a proposed working group.

The following started with Mark Crispin's text, although it might not look it.
Besides the usual goals for a charter, the following text attempts to specify
the problem domain in the narrowest feasible form that is valid. If anyone
thinks the scope is too narrow, they need to explain why.



DRAFT CHARTER

Mail Internationalised Local-Part (MILP)
---------------------------------

The <local-part> portion of RFC2822 and <Local-part> portion of RFC2821 mail
addresses are restricted to a subset of ASCII. This poses a fundamental
barrier for users needing mail addresses to be expressed in a richer set of
characters, such as Latin characters with diacriticals and the many Asian
characters. The goal of the current work is to add local-part support for
these additional characters, while preserving the large, installed base of
ASCII usage.

The group will take:

draft-hoffman-imaa-03.txt
draft-klensin-emailaddr-i18n-01.txt
draft-duerst-iri-04.txt

as input to discussions.

The group will pay particular attention to barriers to adoption and utility,
as well as any impact the new scheme might have on the existing base of
Internet mail usage.


Milestones
----------

Nov, 03: BOF

Dec, 03: WG chartered

Feb, 04: Initial draft of working group specifications.

Jun, 04: Specifications submitted for IETF approval


d/
--
Dave Crocker <dcrocker-at-brandenburg-dot-com>
Brandenburg InternetWorking <www.brandenburg.com>
Sunnyvale, CA USA <tel:+1.408.246.8253>
Roy Badami
2003-10-28 00:50:53 UTC
Permalink
> Mail Internationalised Local-Part (MILP)

Even though, given IDNA now exists as a proposed standard, the main
issues relate to the local part, the issue under discussion is that of
internationalized mail addresses, not just internationalized
local-parts.

Restricting the discussion to local-parts runs the risk of excluding
other potentially relevant issues. For instance, one of the issues
that has been discussed on the IMAA list is whether full-width at
should be recognized in an internationalized mail address. IMHO, the
charter shouldn't be framed in a way that is sufficiently narrow as to
render such questions out of scope.

-roy
Dave Crocker
2003-10-28 02:42:26 UTC
Permalink
Roy,

>> Mail Internationalised Local-Part (MILP)

RB> Even though, given IDNA now exists as a proposed standard, the main
RB> issues relate to the local part, the issue under discussion is that of
RB> internationalized mail addresses, not just internationalized
RB> local-parts.

Really? What work needs to be done, except for local part? IDNA takes care
of the right-hand side.

So what is there to do about "internationalized mail addresses" other than the
local part?

RB> Restricting the discussion to local-parts runs the risk of excluding
RB> other potentially relevant issues. For instance, one of the issues
RB> that has been discussed on the IMAA list is whether full-width at
RB> should be recognized in an internationalized mail address.

full-width _where_? somewhere other than local part?

if yes, then how can that be practical? if no, then the charter does not
preclude their use. (if you think otherwise, please explain.)

d/
--
Dave Crocker <dcrocker-at-brandenburg-dot-com>
Brandenburg InternetWorking <www.brandenburg.com>
Sunnyvale, CA USA <tel:+1.408.246.8253>
Adam M. Costello
2003-10-28 03:44:18 UTC
Permalink
I've trimmed the To/Cc fields to the most relevant mailing lists. After
another couple days, we should probably trim it down to just ietf-imaa.
That should be enough time for people interested in this thread to
subscribe to that list.

Dave Crocker <***@dcrocker.net> wrote:

> RB> For instance, one of the issues that has been discussed on the
> RB> IMAA list is whether full-width at should be recognized in an
> RB> internationalized mail address.
>
> full-width _where_?

:)

Roy was using the word "at" not as a preposition, but as the name of a
character.

The current IMAA draft requires that full-width "@" be recognized
as a separator between the local part and the domain part of an
internationalized mail address, for consistency with the IDNA
requirement that fullwidth "." be recognized as a separator between
labels in an internationalized domain name.

IMAA takes the position that its scope is the entire mail address. For
the domain part, it simply reuses IDNA (by reference). That leaves the
local part and the at-sign where it has more to say.

AMC
Keith Moore
2003-10-28 05:50:42 UTC
Permalink
> I've trimmed the To/Cc fields to the most relevant mailing lists. After
> another couple days, we should probably trim it down to just ietf-imaa.

I suggest we keep it on both ietf-imaa and ietf-822. I think the
discussion will be more balanced if we include "mainstream" email folks.
Dave Crocker
2003-10-28 06:08:45 UTC
Permalink
Adam,


>> RB> For instance, one of the issues that has been discussed on the
>> RB> IMAA list is whether full-width at should be recognized in an
>> RB> internationalized mail address.
>>
>> full-width _where_?
AMC> Roy was using the word "at" not as a preposition, but as the name of a
AMC> character.
AMC> The current IMAA draft requires that full-width "@" be recognized
AMC> as a separator between the local part and the domain part of an

Oh.

In spite of reading the imaa spec a number of times, I entirely missed that
bit of subtlety.

1. Herein lies a useful meta-lesson about the benefits of starting with the
narrowest scope possible: It helps uncover different expectations.

2. As to the particulars of the idea of allowing a second separator, let me
suggest that breaking existing Internet mail will not be conducive to global
adoption. And altering the parsing rules for addresses is a good way to break
Internet mail.

This disparity also suggests a confusion between user presentation and
interchange format. It is occasionally noted that the problem with having data
be so "directly" readable by users is that people think they are looking at a
user presentation format. Alas, RFC2822 and RFC2821 are about
computer-to-computer interchange, not human presentation. (There is quite a
bit of human-to-human interchange, particularly in 2822, but that is different
from human presentation.)

For those with any interest in history, I'll note that we used to allow a
second separator (" at ") between local-part and domain, but simplified it 20
years ago, to keep the parsing simpler.

Simplicity facilitates adoption and interoperability. Multiple forms for the
same syntactic constructs do not make for simplicity.


d/
--
Dave Crocker <dcrocker-at-brandenburg-dot-com>
Brandenburg InternetWorking <www.brandenburg.com>
Sunnyvale, CA USA <tel:+1.408.246.8253>
Adam M. Costello
2003-10-28 06:47:06 UTC
Permalink
Dave Crocker <***@dcrocker.net> wrote:

> AMC> The current IMAA draft requires that full-width "@" be recognized
> AMC> as a separator between the local part and the domain part
>
> In spite of reading the imaa spec a number of times, I entirely missed
> that bit of subtlety.

It's the first requirement listed in section 3 "Requirements and
applicability".

3. Requirements and applicability

3.1 Requirements

IMAA conformance means adherence to the following four
requirements:

1) In an internationalized mail address, the following
characters MUST be recognized as at-signs for separating the
local part from the domain name: U+0040 (commercial at),
U+FF20 (fullwidth commercial at).

> 2. As to the particulars of the idea of allowing a second separator, let
> me suggest that breaking existing Internet mail will not be conducive
> to global adoption. And altering the parsing rules for addresses is a
> good way to break Internet mail.

The current parsing rules allow only ASCII characters, so the rules
are going to have to change. But they can be changed without breaking
things. IMAA is based on the principle of not breaking things. That's
where requirement 2 comes in:

2) Whenever a mail address (or part of a mail address) is
put into an IMA-unaware mail address slot (see section
2), it MUST contain only ASCII characters. Given an
internationalized mail address, an equivalent mail address
satisfying this requirement can be obtained by applying
ToASCII to the local part as specified in section 4,
changing the at-sign to U+0040, and processing the domain
name as specified in [IDNA].

The mail addresses in message headers are in IMA-unaware slots, and
therefore they remain ASCII-only, using the exact same syntax they've
always used. IMAA deliberately makes no change to any existing
protocol.

Requirement 1 has an effect only in places where non-ASCII mail
addresses are allowed, like user interfaces and new protocols.
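
A minimal sketch of how requirements 1 and 2 might fit together in a user-facing program, assuming Python's built-in "idna" codec (which implements the RFC 3490 ToASCII operation) for the domain part. The function names split_address and to_ascii_address are hypothetical. The local-part ToASCII operation from section 4 of the draft is not reproduced here; it is represented by an optional caller-supplied function, local_to_ascii, and without it a non-ASCII local part is simply rejected.

```python
# Requirement 1: both of these are recognized as the separator.
AT_SIGNS = "\u0040\uFF20"  # commercial at, fullwidth commercial at


def split_address(address: str) -> tuple[str, str]:
    """Split at the last at-sign of either kind into (local, domain)."""
    cut = max(address.rfind(ch) for ch in AT_SIGNS)
    if cut <= 0 or cut == len(address) - 1:
        raise ValueError("no usable at-sign in address")
    return address[:cut], address[cut + 1:]


def to_ascii_address(address: str, local_to_ascii=None) -> str:
    """Produce an ASCII-only form suitable for an IMA-unaware slot.

    local_to_ascii is a placeholder for the draft's local-part ToASCII
    operation, which this sketch does not implement.
    """
    local, domain = split_address(address)
    if local_to_ascii is not None:
        local = local_to_ascii(local)
    if not local.isascii():
        raise ValueError("local part is not ASCII and no converter was given")
    # Domain part: IDNA ToASCII via the standard library codec.
    ascii_domain = domain.encode("idna").decode("ascii")
    # Requirement 2: the separator always becomes U+0040.
    return f"{local}@{ascii_domain}"


if __name__ == "__main__":
    print(to_ascii_address("user\uFF20b\u00fccher.example"))
```

Only the program that accepts the non-ASCII input runs this conversion; anything downstream sees an ordinary ASCII address.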

> Simplicity facilitates adoption and interoperability. Multiple forms
> for the same syntactic constructs do not make for simplicity.

True, but sometimes it makes sense for computers and computer
programmers to work harder in order to make things easier for the users.
In IDNA, it was decided that it would be too annoying for CJK users to
have to switch input modes back and forth for every dot in a CJK domain
name. The same rationale applies to the at-sign in email addresses.

AMC
Dave Crocker
2003-10-28 07:08:57 UTC
Permalink
Adam,


>> AMC> The current IMAA draft requires that full-width "@" be recognized
>> AMC> as a separator between the local part and the domain part
>> In spite of reading the imaa spec a number of times, I entirely missed
>> that bit of subtlety.

AMC> It's the first requirement listed in section 3 "Requirements and
AMC> applicability".

Sorry for expecting to garner all the rules from the normative syntax and
algorithm section. I'm not used to having to look into a Requirements
section to understand basic parsing rules.


>> 2. As to the particulars of the idea of allowing a second separator, let
>> me suggest that breaking existing Internet mail will not be conducive
>> to global adoption. And altering the parsing rules for addresses is a
>> good way to break Internet mail.
AMC> The current parsing rules allow only ASCII characters, so the rules
AMC> are going to have to change. But they can be changed without breaking
AMC> things. IMAA is based on the principle of not breaking things. That's
AMC> where requirement 2 comes in:
AMC> 2) Whenever a mail address (or part of a mail address) is
AMC> put into an IMA-unaware mail address slot (see section
AMC> 2), it MUST contain only ASCII characters.

You put a great deal of effort into designing features that work around the
delivering MTA's not knowing that it is dealing with IMAA. Yet now you are
requiring that knowledge.


AMC> Given an
AMC> internationalized mail address, an equivalent mail address
AMC> satisfying this requirement can be obtained by applying
AMC> ToASCII to the local part as specified in section 4,
AMC> changing the at-sign to U+0040, and processing the domain
AMC> name as specified in [IDNA].
AMC> The mail addresses in message headers are in IMA-unaware slots, and
AMC> therefore they remain ASCII-only, using the exact same syntax they've
AMC> always used. IMAA deliberately makes no change to any existing
AMC> protocol.

It looks to me like you very much _do_ make a change. The fact that you then
hope some component will do an equivalence to regular at-sign does not alter
this.


AMC> Requirement 1 has an effect only in places where non-ASCII mail
AMC> addresses are allowed, like user interfaces and new protocols.

Requirements are not protocol specifications.


>> Simplicity facilitates adoption and interoperability. Multiple forms
>> for the same syntactic constructs do not make for simplicity.

AMC> True, but sometimes it makes sense for computers and computer
AMC> programmers to work harder in order to make things easier for the users.

If RFC2822 and RFC2821 were user interface specifications, I might agree with
you.

d/
--
Dave Crocker <dcrocker-at-brandenburg-dot-com>
Brandenburg InternetWorking <www.brandenburg.com>
Sunnyvale, CA USA <tel:+1.408.246.8253>
Adam M. Costello
2003-10-28 08:55:19 UTC
Permalink
Keith Moore <***@cs.utk.edu> wrote:

> > The current parsing rules allow only ASCII characters, so the rules
> > are going to have to change.
>
> uh, no. putting anything other than ASCII characters in email headers
> is completely unacceptable.

IMAA does *not* put anything other than ASCII characters in email
headers.

Did you stop reading my message right there? What immediately followed
should have allayed your concern:

But they can be changed without breaking things. IMAA is based on
the principle of not breaking things. That's where requirement 2
comes in:

2) Whenever a mail address (or part of a mail address) is put
into an IMA-unaware mail address slot (see section 2), it
MUST contain only ASCII characters....

The mail addresses in message headers are in IMA-unaware slots,
and therefore they remain ASCII-only, using the exact same syntax
they've always used. IMAA deliberately makes no change to any
existing protocol.

If internationalized mail addresses are going to be internationalized
in any sense, then they need to contain non-ASCII characters in some
circumstances. Not in message headers, but at least in user interfaces.
Wherever non-ASCII addresses are used (not in message headers), that's
where the new parsing rules will be needed.

Dave Crocker <***@brandenburg.com> wrote:

> AMC> 2) Whenever a mail address (or part of a mail address)
> AMC> is put into an IMA-unaware mail address slot (see
> AMC> section 2), it MUST contain only ASCII characters.
>
> You put a great deal of effort into designing features that work
> around the delivering MTA's not knowing that it is dealing with IMAA.
> Yet now you are requiring that knowledge.

IMAA does not expect MTAs to conform to IMAA, or even be aware of IMAA.

IMAA is designed so that software that conforms to IMAA and software
that does not conform to IMAA can interoperate. Software that does not
conform to IMAA is assumed to see the world in pre-IMAA terms: mail
addresses are ASCII, as specified in all prior standards.

Every piece of mail-related software has the option to continue to abide
by the old rules only, or it can adopt IMAA and abide by its rules too.
There is no meta-requirement that any piece of software must adopt IMAA,
there is merely an incentive: if you don't adopt IMAA, you don't get to
have non-ASCII characters in mail addresses. For MTAs, which don't come
into contact with regular users, that's not a great incentive. Regular
users would be perfectly content in a world where MUAs adopt IMAA and
MTAs don't.

This high-level view is not presented in the IMAA draft, but it is
presented in the first two paragraphs of the IDNA spec (RFC 3490), which
the IMAA draft incorporates by reference:

This document relies heavily on IDNA for both its concepts and
its justification. This document omits a great deal of the
justification and design information that might otherwise be found
here because it is identical to that in IDNA. Anyone reading this
document needs to have first read [IDNA], [PUNYCODE], [NAMEPREP],
and [STRINGPREP].

Perhaps a future revision of the IMAA draft should copy more of this
material from IDNA.

> AMC> Given an internationalized mail address, an
> AMC> equivalent mail address satisfying this requirement
> AMC> can be obtained by applying ToASCII to the local part
> AMC> as specified in section 4, changing the at-sign to
> AMC> U+0040, and processing the domain name as specified
> AMC> in [IDNA].
> AMC> The mail addresses in message headers are in IMA-unaware slots,
> AMC> and therefore they remain ASCII-only, using the exact same syntax
> AMC> they've always used. IMAA deliberately makes no change to any
> AMC> existing protocol.
>
> It looks to me like you very much _do_ make a change. The fact that
> you then hope some component will do an equivalence to regular at-sign
> does not alter this.

The message header protocol is not changed. It has the exact same
syntax and semantics as it always did.

There are two mail address protocols: ASCII-only mail addresses as
defined by RFCs 821, 822, 2821, 2822; and internationalized mail
addresses as defined by IMAA. The new address protocol may be used
only in new places (like new user interfaces and new protocols), while
the old address protocol must continue to be used in all the existing
places (like message headers and SMTP commands). Any software that
deals with both address protocols acts as a gateway, and implements the
gateway functions specified by IMAA. (I don't mean a mail gateway in
the RFC 1123 sense, just a protocol-to-protocol gateway in the general
sense.)
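
The other direction of such a gateway, going from an existing ASCII slot to a display form, could look roughly like the sketch below. It assumes Python's built-in "idna" codec for the domain's ToUnicode step; the function name for_display is hypothetical, and the local part is passed through unchanged because the draft's local-part ToUnicode operation is not reproduced here.

```python
def for_display(ascii_address: str) -> str:
    """Convert the domain of an ASCII address to Unicode for a user interface.

    The stored address is never modified; this only builds a display string.
    """
    local, sep, domain = ascii_address.rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("not a local@domain address")
    # IDNA ToUnicode on each label; labels without the ACE prefix pass through.
    unicode_domain = domain.encode("ascii").decode("idna")
    return f"{local}@{unicode_domain}"


if __name__ == "__main__":
    print(for_display("user@xn--bcher-kva.example"))
```

Message headers and SMTP commands keep carrying the ASCII string; only the presentation layer calls something like this.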

> AMC> Requirement 1 has an effect only in places where non-ASCII mail
> AMC> addresses are allowed, like user interfaces and new protocols.
>
> Requirements are not protocol specifications.

In this case they are. Section 3.1 "Requirements" *is* the IMAA
protocol. Sections 2 "Terminology" and 4 "Conversion operations" are
supporting sections that define the terms used in section 3.1. In
particular, section 4 is just a detailed definition of the terms
"ToASCII" and "ToUnicode".

AMC
Keith Moore
2003-10-28 07:26:42 UTC
Permalink
> The current parsing rules allow only ASCII characters, so the rules
> are going to have to change.

uh, no. putting anything other than ASCII characters in email headers is
completely unacceptable.
Pete Resnick
2003-10-28 07:38:33 UTC
On 10/28/03 at 2:26 AM -0500, Keith Moore wrote:

>putting anything other than ASCII characters in email headers is
>completely unacceptable.

Let me try this one more time: Please read John Klensin's draft
before making these kinds of statements. Either you and Dave have
dismissed this draft without giving technical reasons for why you
have, or you simply haven't read the draft. Please let's all get on
the same page before arguing.

pr
--
Pete Resnick <http://wwww.qualcomm.com/~presnick/>
QUALCOMM Incorporated - Direct phone: (858)651-4478, Fax: (858)651-1102
Dave Crocker
2003-10-29 04:15:02 UTC
Pete,


>>putting anything other than ASCII characters in email headers is
>>completely unacceptable.
PR> Let me try this one more time: Please read John Klensin's draft
PR> before making these kinds of statements. Either you and Dave have
PR> dismissed this draft without giving technical reasons for why you
PR> have, or you simply haven't read the draft. Please let's all get on
PR> the same page before arguing.


Sorry. Thought I was being pretty clear, but now that you mention it, I
didn't connect all the dots -- perhaps my latest note did, but just to be
clear:

Yes, I've dismissed John's proposal, in terms of any near-term benefit.

I see the issue as exactly the same as we had for MIME/ESMTP, with one
approach being solely adopted in the user agent software that adopts the
enhancement, and the other approach requiring infrastructure modification.

I see one as continuing the model started by IDNA, and the other as
essentially tossing it out and starting over.

Given that we have gone through this paradigm debate several times over the
last decade, I am hard-pressed to understand why anyone would think that it
will be productive to do it again.


d/
--
Dave Crocker <dcrocker-at-brandenburg-dot-com>
Brandenburg InternetWorking <www.brandenburg.com>
Sunnyvale, CA USA <tel:+1.408.246.8253>
Keith Moore
2003-10-29 04:47:40 UTC
> Given that we have gone through this paradigm debate several times over the
> last decade, I am hard-pressed to understand why anyone would think that it
> will be productive to do it again.

because
- email addresses are much more heavily used by humans than URLs are,
and the impact of an IDNA-like scheme on email addresses is greater
than for other apps using DNS names
- email is more disruption-sensitive than most apps (every MTA that a
message passes through is another opportunity for failure)
- people expect to use their names in email local-parts more often
than they expect to use enterprise names as DNS domain names
- matching/uniqueness issues for personal names (thus email local-parts)
are somewhat different than for enterprise names (especially given
influence of trademark laws on the latter)
- having IDNA for the domain portion of an address provides a useful hook
for looking up mappings from I18Ned local-parts to unique portable
identifiers

because failure to actually do analysis on the problem is irresponsible and
inconsistent with the engineering that we're supposed to be doing

(note this is about why it's appropriate to analyze the specific problem
at hand rather than blindly pursue a particular approach - not about the
merits of John's proposal - which I haven't finished reading yet)

Keith
Pete Resnick
2003-10-28 07:20:58 UTC
On 10/27/03 at 6:42 PM -0800, Dave Crocker wrote:

>RB> Even though, given IDNA now exists as a proposed standard, the main
>RB> issues relate to the local part, the issue under discussion is that of
>RB> internationalized mail addresses, not just internationalized
>RB> local-parts.
>
>Really? What work needs to be done, except for local part? IDNA takes care
>of the right-hand side.

Please review John Klensin's draft before making these kinds of assumptions.

>RB> Restricting the discussion to local-parts runs the risk of excluding
>RB> other potentially relevant issues.

I agree. Limiting discussion at this point to local-part does not
take into account some of the possibilities.

pr
--
Pete Resnick <http://wwww.qualcomm.com/~presnick/>
QUALCOMM Incorporated - Direct phone: (858)651-4478, Fax: (858)651-1102
Dave Crocker
2003-10-28 09:13:50 UTC
Pete,

>>RB> Restricting the discussion to local-parts runs the risk of excluding
>>RB> other potentially relevant issues.
PR> I agree. Limiting discussion at this point to local-part does not
PR> take into account some of the possibilities.


That was exactly the intent of the text.

We have already seen how nicely the text served to bring into pretty stark
relief one bit of expectation from one of the proposals. It is only fitting to
have it serve the same purpose for another one.

IETF BOF time is pretty lousy for an open-ended chat. Having specifications
to chat about is only marginally better than not having them.

What makes the real difference is having serious focus to the meeting. If we
go into this meeting without even having a clear sense of the scope of the
problem to be tackled, then the chance of having a productive meeting is
pretty small.

At the moment, it appears that the focus of the meeting is likely to be:
Shall we break existing Internet mail or shall we lay an enhancement on top of
it that preserves the installed base. (I'm sure that everyone else who was
present at the pre-MIME/ESMTP discussions is really looking forward to
repeating the experience.)

d/
--
Dave Crocker <dcrocker-at-brandenburg-dot-com>
Brandenburg InternetWorking <www.brandenburg.com>
Sunnyvale, CA USA <tel:+1.408.246.8253>
Roy Badami
2003-10-28 12:07:08 UTC
[Trimmed distribution]

> Please review John Klensin's draft before making these kinds of assumptions.

Sorry, I realized that as soon as I had posted. But, AIUI, John
Klensin's proposal also discusses whether domains should be allowed to
be represented in UTF-8 rather than punycode. So it, too, does not
confine itself to considering only localpart issues.

> >RB> Restricting the discussion to local-parts runs the risk of excluding
> >RB> other potentially relevant issues.
>
> I agree. Limiting discussion at this point to local-part does not
> take into account some of the possibilities.

That was the only point I was trying to make. The existing proposals
address broader issues than just the localpart.

-roy
Marc Blanchet
2003-10-28 01:08:25 UTC
- good start!
- timeline seems pretty aggressive... we'll see.
- it would probably be good to have a requirements document upfront. It might
not end up the same way the idn requirements did, but a narrow, implementable
requirement would help to reach a consensus (hopefully) on what needs to be
done.
- while the idn requirements effort did not go that well, now that we have
experience, I think we should try to do better and have one.

I know I might start some debate with this, but I still think it is the best
way to go...

- would be useful to have some reference to idn (idna) in the charter, as
background work. the developers and users will have to take care of "both"
(ie. idn and imail) in the email infrastructure.

Marc.

-- Monday, October 27, 2003 16:19:25 -0800 Dave Crocker <***@dcrocker.net>
wrote:

> Folks,
>
> On the theory that discussions go better when they have a concrete
> deliverable, here is a proposed charter for a proposed working group.
>
> The following started with Mark Crispin's text, although it might not
> look it. Besides the usual goals for a charter, the following text
> attempts to specify the problem domain in the narrowest feasible form
> that is valid. If anyone thinks the scope is too narrow, they need to
> explain why.
>
>
>
> DRAFT CHARTER
>
> Mail Internationalised Local-Part (MILP)
> ---------------------------------
>
> The <local-part> portion of RFC2822 and <Local-part> portion of RFC2821
> mail addresses are restricted to a subset of ASCII. This poses a
> fundamental barrier for users needing mail addresses to be expressed in a
> richer set of characters, such as Latin characters with diacriticals and
> the many Asian characters. The goal of the current work is to add
> local-part support for these additional characters, while preserving the
> large, installed base of ASCII usage.
>
> The group will take:
>
> draft-hoffman-imaa-03.txt
> draft-klensin-emailaddr-i18n-01.txt
> draft-duerst-iri-04.txt
>
> as input to discussions.
>
> The group will pay particular attention to barriers to adoption and
> utility, as well as any impact the new scheme might have on the existing
> base of Internet mail usage.
>
>
> Milestones
> ----------
>
> Nov, 03: BOF
>
> Dec, 03: WG chartered
>
> Feb, 04: Initial draft of working group specifications.
>
> Jun, 04: Specifications submitted for IETF approval
>
>
> d/
> --
> Dave Crocker <dcrocker-at-brandenburg-dot-com>
> Brandenburg InternetWorking <www.brandenburg.com>
> Sunnyvale, CA USA <tel:+1.408.246.8253>
>



------------------------------------------
Marc Blanchet
Hexago
tel: +1-418-266-5533x225
------------------------------------------
http://www.freenet6.net: IPv6 connectivity
------------------------------------------
Zefram
2003-10-28 09:31:15 UTC
Dave Crocker wrote:
> This poses a fundamental
>barrier for users needing mail addresses to be expressed in a richer set of
>characters,

I have yet to see this "need" established. Everyone who has supported
internationalised mail addresses has axiomatically assumed such a need,
and has conspicuously failed to provide any more detail, such as any of
Keith Moore's suggestions.

I think the first task in this area should be to investigate the nature
and degree of desire for non-ASCII local parts. This desire needs to
be weighed against the benefits we derive from writing all local parts
in a small, fixed alphabet (ASCII printables).

-zefram
J-F C. (Jefsey) Morfin
2003-10-29 15:50:39 UTC
On 10:31 28/10/03, Zefram said:
>I think the first task in this area should be to investigate the nature
>and degree of desire for non-ASCII local parts. This desire needs to be
>weighed against the benefits we derive from writing all local parts in a
>small, fixed alphabet (ASCII printables).

May I ask which part of the world you come from?

This being said only Americans want/are satisfied with "internationalized"
(sic) names (the artificial extension of the American character set with
most of the American foreign scripting, within an ascii frame). No one
really wants multilingual names (a totally internationalized (sic) frame
supporting languages and therefore some language oriented rules - at least
in management and user support). The users need vernacular support, that is
to be able to freely do in the mail what they are used to doing elsewhere.

I note that for non-American writers "international names" means names that
everyone from every nation will understand. It happens to be the ascii
character set limited to the DNS used names (they were selected for that
reason).
jfc
Keith Moore
2003-10-29 17:53:52 UTC
I suspect it would be much more useful if people tried to characterize
what they and their cohorts want, rather than trying to characterize what some
other group wants.

> This being said only Americans want/are satisfied with "internationalized"
> (sic) names (the artificial extension of the American character set with
> most of the American foreign scripting, within an ascii frame). No one
> really wants multilingual names (a totally internationalized (sic) frame
> supporting languages and therefore some language oriented rules - at least
> in management and user support). The users need vernacular support, that is
> to be able to freely do in the mail what they are used to doing elsewhere.
>
> I note that for non-American writers "international names" means names that
> everyone from every nation will understand. It happens to be the ascii
> character set limited to the DNS used names (they were selected for that
> reason).
> jfc
John C Klensin
2003-10-28 23:03:15 UTC
Dave,

(distro trimmed to IMAA and IETF lists; I hope we can soon get
rid of the latter too).

One large problem with the charter draft (I'm too tired to know
if there are small ones)...

If one is going to consider internationalization of email
addresses in a way that permits them to move through the mail
protocol in some traditional Unicode encoding (e.g., UTF-8),
then I believe that we at least need to entertain the notion
that what we are going after is "mailbox", rather than "local
part". Yes, the hard work lies in the local part. But, to me,
the goal is to have an I18N presentation form that is also
carried over the protocol.

That implies that one should be able to have
local-part-***@FQDN-i18n
(UTF-8 on both sides or, if you prefer, throughout the string,
since the coding of "@" in UTF-8 is the same as it is in ASCII)

rather than, e.g.,
local-part-***@FQDN-IDNA
(UTF-8 on the LHS, but punycode for any domain labels that use
non-ASCII).
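
A small sketch of the two forms, assuming Python and its built-in idna codec for the punycode case. The sample mailbox is hypothetical, and nothing here is taken from any of the drafts; it only shows what the bytes on the wire would look like in each case.

```python
# Hypothetical sample mailbox; any non-ASCII address would do.
local_part = "jörg"
domain = "müller.example"

# "@" is U+0040, and UTF-8 encodes U+0000..U+007F as the same single
# byte that ASCII uses, so the separator is identical in both forms.
assert "@".encode("utf-8") == "@".encode("ascii") == b"\x40"

# Form 1: UTF-8 throughout the string.
utf8_mailbox = f"{local_part}@{domain}".encode("utf-8")

# Form 2: UTF-8 on the LHS, punycode (via IDNA) for non-ASCII domain labels.
ace_domain = domain.encode("idna")
mixed_mailbox = local_part.encode("utf-8") + b"@" + ace_domain

print(utf8_mailbox)
print(mixed_mailbox)
```

In the second form the right-hand side is pure ASCII, which is what keeps existing DNS-facing software working but also what the user ends up looking at if nothing converts it back.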

Again, the goal is that this should be natural for the user,
using the user's script (or the script of the recipient), both
in protocol transactions and in the case of "My email address is
***@yy." in the body of a message... where the only parts of that
sentence (appropriately translated) which are ASCII characters
are the @-sign and _maybe_ the TLD (whether the TLD can be
non-ASCII is presumably an ICANN problem unless the user
interface does something akin to draft-klensin-idn-tld-01.txt).

If the email address the user sees _looks_ like local script in
the local part, but is forced into ASCII/punycode in the domain
part, I think the users will assume that we have been smoking
something. And, arguably, they will be right -- punycode is a
way of transporting internationalized data so it doesn't foul up
other systems or cause DNS damage. But, from the user
standpoint, however much users hate, e.g., a string representing
a name transliterated into Roman characters, they will hate
looking at punycode -- which has no mnemonic value at all for
non-Roman scripts -- even more.

So please don't prejudge the question of what happens to the
domain part (right hand side) of an email address in the
charter: this set of issues should at least be considered very
carefully.

john



--On Monday, October 27, 2003 16:19 -0800 Dave Crocker
<***@dcrocker.net> wrote:

> Folks,
>
> On the theory that discussions go better when they have a
> concrete deliverable, here is a proposed charter for a
> proposed working group.
>
> The following started with Mark Crispin's text, although it
> might not look it. Besides the usual goals for a charter, the
> following text attempts to specify the problem domain in the
> narrowest feasible form that is valid. If anyone thinks the
> scope is too narrow, they need to explain why.
>
>
>
> DRAFT CHARTER
>
> Mail Internationalised Local-Part (MILP)
> ---------------------------------
>
> The <local-part> portion of RFC2822 and <Local-part> portion
> of RFC2821 mail addresses are restricted to a subset of ASCII.
> This poses a fundamental barrier for users needing mail
> addresses to be expressed in a richer set of characters, such
> as Latin characters with diacriticals and the many Asian
> characters. The goal of the current work is to add local-part
> support for these additional characters, while preserving the
> large, installed base of ASCII usage.
>
> The group will take:
>
> draft-hoffman-imaa-03.txt
> draft-klensin-emailaddr-i18n-01.txt
> draft-duerst-iri-04.txt
>
> as input to discussions.
>
> The group will pay particular attention to barriers to
> adoption and utility, as well as any impact the new scheme
> might have on the existing base of Internet mail usage.
>
>
> Milestones
> ----------
>
> Nov, 03: BOF
>
> Dec, 03: WG chartered
>
> Feb, 04: Initial draft of working group specifications.
>
> Jun, 04: Specifications submitted for IETF approval
>
>
> d/
> --
> Dave Crocker <dcrocker-at-brandenburg-dot-com>
> Brandenburg InternetWorking <www.brandenburg.com>
> Sunnyvale, CA USA <tel:+1.408.246.8253>
>
>
>
Dave Crocker
2003-10-29 02:16:46 UTC
John,

JCK> If one is going to consider internationalization of email
JCK> addresses in a way that permits them to move through the mail
JCK> protocol in some traditional Unicode encoding (e.g., UTF-8),
JCK> then

...then we get to repeat the mime/esmtp debates all over again. After all,
why should we even try to learn anything from 10 years of experience. (And
no, John, I'm not directing my comment at you.)

To be specific: I am not suggesting that pure utf-8 is a bad goal -- although
the fact that utf-8 is, itself, a condensed representation of unicode should
strike folks as just a tad ironic, with respect to these discussions.

Rather, I suggest that it be a _separate_ goal from near-term support of an
edge-only enhancement for Unicode support, the same as we did for mime and
IDN.

We already have that support for domain names. That only leaves local-part.

It's fine to pursue a separate path for long-term 8-bit purity. I'm sure we
will achieve it much sooner for addresses than we have for content.


JCK> Again, the goal is that this should be natural for the user,
JCK> using the user's script (or the script of the recipient), both
JCK> in protocol transactions and in the case of "My email address is
JCK> ***@yy."

The business card representation of an email address is the classic example of
IETF work that very much _does_ directly involve the user interface. However
we already dealt with this issue for non-ascii domain names. We do not need
to rehash this issue yet again.

As for the protocol, I could have sworn that users do not type protocol data
units directly, or at least that they haven't for roughly 25 years. (Another
jibe, citing the fact that utf-8 is, itself, a modification to "raw" unicode
is probably worth repeating, here.)



JCK> in the body of a message... where the only parts of that
JCK> sentence (appropriately translated) which are ASCII characters
JCK> are the @-sign and _maybe_ the TLD (whether the TLD can be
JCK> non-ASCII is presumably an ICANN problem unless the user
JCK> interface does something akin to draft-klensin-idn-tld-01.txt).

Indeed, representation of non-ascii addressing information within a text
segment is an interesting problem. I'd guess it's identical to the business
card requirement.

And the current issue is no different than we have for IDN.

So perhaps the right thing to do is forget about IDN. Pretend it never
happened. Let's start all over.

Or, perhaps we could complete the design approach started by IDN, while
_separately_ pursuing the purist approach of end-to-end 8-bit.


JCK> So please don't prejudge the question of what happens to the
JCK> domain part (right hand side) of an email address in the
JCK> charter: this set of issues should at least be considered very
JCK> carefully.

Indeed it should, including tidbits like adoption barriers, and the last
several years of IDN work.

d/
--
Dave Crocker <dcrocker-at-brandenburg-dot-com>
Brandenburg InternetWorking <www.brandenburg.com>
Sunnyvale, CA USA <tel:+1.408.246.8253>
Mark Davis
2003-10-29 15:02:22 UTC
> As for the protocol, I could have sworn that users do not type protocol data
> units directly, or at least that they haven't for roughly 25 years. (Another
> jibe, citing the fact that utf-8 is, itself, a modification to "raw" unicode
> is probably worth repeating, here.)

While it doesn't really have a bearing on the rest of your message, this is a
common misperception that I'd like to take a moment to correct.

When Unicode is expressed as a series of bytes, there are a number of equally
valid encoding schemes (aka serializations). UTF-8 is one of those schemes, and
is no more or less a "modification", and no more or less "Unicode" than any
other of these schemes. Different encoding schemes may be better for different
domains, but the conversion between any of those schemes is fast and lossless.
See http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf, Sections 2.4-2.6.

(When Unicode started out 15 years ago, the architecture was different; but it
has long been structured this way.)
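
The "fast and lossless" claim is easy to check. The following is a minimal sketch in Python, using the standard codecs and an arbitrary sample string chosen for the example:

```python
# Arbitrary sample text mixing Latin, Devanagari and Han characters.
text = "Fältström शिष्य 中文"

schemes = ["utf-8", "utf-16", "utf-32"]
encoded = {name: text.encode(name) for name in schemes}

for name, data in encoded.items():
    # Decoding any of the serializations gives back the same code points.
    assert data.decode(name) == text
    print(name, len(data), "bytes")

# Converting from one scheme to another goes through the code points
# and loses nothing.
utf16_from_utf8 = encoded["utf-8"].decode("utf-8").encode("utf-16")
assert utf16_from_utf8 == encoded["utf-16"]
```

The byte counts differ from scheme to scheme, but every scheme carries exactly the same sequence of characters.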

Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄
Dave Crocker
2003-10-29 15:11:30 UTC
Mark,

>> (Another
>> jibe, citing the fact that utf-8 is, itself, a modification to "raw" unicode
>> is probably worth repeating, here.)
MD> When Unicode is expressed as a series of bytes, there are a number of equally
MD> valid encoding schemes (aka serializations). UTF-8 is one of those schemes, and
MD> is no more or less a "modification", and no more or less "Unicode" than any
MD> other of these schemes.


That's right. It is an "encoding". Raw Unicode takes more than 8-bits. Lots
more. UTF-8 is a method of encoding those raw bits into a non-raw form.

So is the ACE approach.

My point was that folks tend to talk about UTF-8 as if it were the raw
representation, rather than a derivative encoding. In fact, UTF-8 is exactly
parallel to the ACE approach.

It might be a more efficient encoding, but it is no more "native" or "direct"
or "raw" than ACE.


d/
--
Dave Crocker <dcrocker-at-brandenburg-dot-com>
Brandenburg InternetWorking <www.brandenburg.com>
Sunnyvale, CA USA <tel:+1.408.246.8253>
John Cowan
2003-10-29 16:27:39 UTC
Dave Crocker scripsit:

> That's right. It is an "encoding". Raw Unicode takes more than 8-bits. Lots
> more. UTF-8 is a method of encoding those raw bits into a non-raw form.

[snip]

> It might be a more efficient encoding, but it is no more "native" or "direct"
> or "raw" than ACE.

That's only true if you take the position that there are no native/direct/raw
encodings of Unicode.

--
John Cowan <***@reutershealth.com> http://www.ccil.org/~cowan
In politics, obedience and support are the same thing. --Hannah Arendt
Dave Crocker
2003-10-29 17:14:04 UTC
John,

JC> That's only true if you take the position that there are no native/direct/raw
JC> encodings of Unicode.

Oh? You mean that Unicode does not fit directly -- ie, with no special
encoding rules -- into 32 bits, or 24 bits, or somesuch.

You mean that Unicode does not need special rules to stuff it into 8 bits, and
another set of rules to stuff it into 16 bits?

Because if the answer is that yes it does -- and the answer _is_ yes it does
-- then my point stands.

That's the difference between native representation, versus "encoding".

d/
--
Dave Crocker <dcrocker-at-brandenburg-dot-com>
Brandenburg InternetWorking <www.brandenburg.com>
Sunnyvale, CA USA <tel:+1.408.246.8253>
John Cowan
2003-10-29 19:01:33 UTC
Dave Crocker scripsit:

> Oh? You mean that Unicode does not fit directly -- ie, with no special
> encoding rules -- into 32 bits, or 24 bits, or somesuch.

Nope. The Unicode character set maps characters to integers. How the
integers are mapped to bytes is defined by the encoding rules, of which
there are seven standard ones: UTF-8, UTF-16, UTF-16BE, UTF-16LE,
UTF-32, UTF-32BE, UTF-32LE. All have equal status.

> That's the difference between native representation, versus "encoding".

There is no native representation in the sense you mean. All
representations are equal.
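
A short illustration of the two layers, sketched in Python with its standard codec names for the seven schemes. The sample character is arbitrary; the point is only that the integer is fixed by the character set while the bytes depend on the encoding scheme chosen.

```python
# Python codec names for the seven standard encoding schemes.
SCHEMES = [
    "utf-8",
    "utf-16", "utf-16-be", "utf-16-le",
    "utf-32", "utf-32-be", "utf-32-le",
]

ch = "é"                # arbitrary sample character
code_point = ord(ch)    # the integer the character set assigns to it

# One integer, seven different byte sequences.
byte_forms = {name: ch.encode(name) for name in SCHEMES}

for name, data in byte_forms.items():
    print(f"U+{code_point:04X}  {name:<10} {data.hex(' ')}")
```

The schemes without a BE/LE suffix prepend a byte-order mark when encoding, which is why their output is longer than that of the explicit big-endian and little-endian variants.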

--
John Cowan ***@reutershealth.com http://www.ccil.org/~cowan
De plichten van een docent zijn divers, die van het gehoor ook.
--Edsger Dijkstra
Adam M. Costello
2003-10-29 21:23:26 UTC
Dave Crocker <***@dcrocker.net> wrote:

> [UTF-8] might be a more efficient encoding, but it is no more "native"
> or "direct" or "raw" than ACE.

I know this is beside the point, but...

UTF-8 is more compact than Punycode only for strings with a lot of
ASCII characters, which is typical of Latin-based scripts. For small
non-Latin scripts (like Cyrillic and Arabic), Punycode is significantly
more compact than UTF-8 (and for some of them, including all the Indian
scripts, the difference is quite great). For large scripts (like Han
and Hangul) Punycode and UTF-8 are comparable, and UTF-16 beats them
both by a wide margin.
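
For anyone who wants to try this on their own strings, here is a minimal sketch assuming Python's built-in punycode codec. The sample words are arbitrary and very short, so they illustrate the trend rather than prove it; the `sizes` helper is just a name picked for the example.

```python
# Arbitrary short samples, one per kind of script.
SAMPLES = {
    "Latin": "internationalisé",
    "Cyrillic": "привет",
    "Han": "中文字符",
}


def sizes(s: str) -> dict:
    """Encoded length of s: bytes for UTF-8/UTF-16, ASCII chars for Punycode."""
    return {
        "utf-8": len(s.encode("utf-8")),
        "utf-16": len(s.encode("utf-16-be")),  # BE variant: no byte-order mark
        "punycode": len(s.encode("punycode")),
    }


for script, sample in SAMPLES.items():
    print(script, len(sample), "characters:", sizes(sample))
```

Note that the Punycode figure counts 7-bit ASCII characters while the other two count 8-bit bytes, so the comparison is slightly kinder to UTF-8 and UTF-16 than a bit-for-bit one would be.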

AMC
Mark Davis
2003-10-29 23:24:48 UTC
And this is, of course, only for short strings. A 10K file of Cyrillic converted
to Punycode would blow out completely. (Of course, short strings were part of
the design constraints.)

Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

Adam M. Costello
2003-10-30 02:20:48 UTC
Mark Davis <***@jtcsv.com> wrote:

> And this is, of course, only for short strings. A 10K file of
> Cyrillic converted to Punycode would blow out completely.

Think so? Let's find out...

The easiest way to find a bunch of Russian text was to search Google
for "bible russian", so I grabbed the first five chapters of Genesis
(represented in koi8-r, an 8-bit charset) and piped them through "tr -d
'[\000-\177]'" to remove all ASCII characters, leaving 10226 Cyrillic
characters.

I then used "recode koi8-r..ucs-4" and a homemade script to convert
those characters to U+ notation.

I then edited the Punycode sample implementation to increase the static
array sizes of the input and output to 11000 and 22000 respectively.

The UTF-8 encoding of the text is 20452 bytes. The Punycode encoding is
11970 caseless ASCII letters and digits (base 36).

But for a string that large, conventional compression techniques work
better than Punycode's technique. Compressing the UTF-8 encoding with
gzip, then converting to base-32, results in 8332 ASCII letters and
digits.

Summary:

input: 10226 Cyrillic characters
UTF-8: 20452 bytes (high bit always set)
Punycode: 11970 base-36 ASCII characters
UTF-8.gz.base32: 8332 base-32 ASCII characters

I think with strings beyond about 4000 characters, Punycode starts
to run the risk of bumping up against the limit of 32-bit integers.
Although strings that use only plane 0 should be fine up to around
60,000 characters.
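
Both thresholds are consistent with a simple bound. The sketch below assumes the risk comes from the delta computation in the Punycode spec, where delta grows by roughly (gap between code points) x (number of code points handled + 1) and is held in a 32-bit unsigned integer. That reading, and the names used, are assumptions made for the illustration.

```python
UINT32_LIMIT = 2**32   # assumed size of the integer type holding delta
INITIAL_N = 0x80       # Punycode starts counting from the first non-ASCII code point


def rough_safe_length(max_code_point: int) -> int:
    """String length below which gap * length cannot exceed 32 bits."""
    worst_gap = max_code_point - INITIAL_N
    return UINT32_LIMIT // worst_gap


any_plane = rough_safe_length(0x10FFFF)   # highest Unicode code point
plane0_only = rough_safe_length(0xFFFF)   # highest code point in plane 0

print("any code point:", any_plane)
print("plane 0 only:  ", plane0_only)
```

These are worst-case figures; real text rarely jumps from U+0080 straight to the top of the code space, so typical strings would overflow much later, if at all.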

The O(n^2) algorithm given in the Punycode spec is costly for long
strings (my 864 MHz Pentium III could encode the 10k test string only 44
times per second). I'm pretty sure I can write an O(n log n) algorithm
for Punycode, but I haven't actually done it yet.

Of course you wouldn't want to use Punycode for long strings anyway,
you'd want to use conventional compression that exploits repeated
substrings, like deflate. But maybe the O(n log n) algorithm will
be faster than the O(n^2) algorithm for medium-length strings, where
Punycode still compresses better than deflate.
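
The measurement above can be repeated along these lines. This is a sketch with stand-in inputs: it uses seeded pseudo-random Cyrillic letters instead of the Genesis text, and Python's built-in punycode codec, gzip, and base64 modules instead of recode and the sample C implementation, so the absolute numbers will differ from those in the summary.

```python
import base64
import gzip
import random

# Stand-in input: 2000 pseudo-random lowercase Cyrillic letters (U+0430..U+044F).
rng = random.Random(0)
alphabet = [chr(cp) for cp in range(0x0430, 0x0450)]
text = "".join(rng.choice(alphabet) for _ in range(2000))

utf8 = text.encode("utf-8")
puny = text.encode("punycode")
gz_b32 = base64.b32encode(gzip.compress(utf8))

print("input:          ", len(text), "Cyrillic characters")
print("UTF-8:          ", len(utf8), "bytes")
print("Punycode:       ", len(puny), "base-36 ASCII characters")
print("UTF-8.gz.base32:", len(gz_b32), "base-32 ASCII characters")
```

Random letters have no repeated substrings for deflate to exploit, so real prose should favour the gzip line more than this synthetic input does.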

AMC
Mark Davis
2003-10-30 02:30:11 UTC
I stand corrected (and sit surprised).

Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

James Seng
2003-10-27 11:07:09 UTC
I have seen John's and Paul's proposals but I have not seen Michel's. Is there
a draft that I can read up on?

ps: I won't be able to join the meeting but I am interested in the subject.

-James Seng

WJCarpenter
2003-10-27 23:43:14 UTC
mc> As presently constituted, email addresses are limited to the 26
mc> Latin alphabetics, 10 digits, and a limited number of special
mc> characters in the ASCII character set. There is a growing need to

upper and lower case alphabetics
-- bill-***@carpenter.ORG (WJCarpenter) PGP 0x91865119 38 95 1B 69 C9 C6
3D 25 73 46 32 04 69 D6 ED F3
Dave Aronson
2003-10-27 19:43:09 UTC
Permalink
On Mon October 27 2003 12:30, WJCarpenter wrote:

> mc> As presently constituted, email addresses are limited to the 26
> mc> Latin alphabetics, 10 digits, and a limited number of special
> mc> characters in the ASCII character set. There is a growing need to
>
> upper and lower case alphabetics

Yes, but with either of those two sets (generally) considered equivalent
to the other, boiling down to effectively 26 choices.

-- Dave Aronson, Senior Software Engineer, Secure Software Inc. Email me
at: work (D0T) 2004 (@T) dja (D0T) mailme (D0T) org Web:
http://destined.to/program http://listen.to/davearonson
Mark Crispin
2003-10-27 19:10:55 UTC
Permalink
On Mon, 27 Oct 2003, Keith Moore wrote:

>> Thanks for taking a stab at a problem statement. I'd like to drill down
>> on this just a bit.
>> What is the source of the "growing need"? Is it:
>> [snip]


I agree that this needs to be stated, but someone other than me will have
to do it.

I believe that the primary push for this functionality comes from regions
which use Latin alphabetics with diacriticals; and that most individuals
in regions which do not use Latin script accept the use of Latin
script for multinational interchange. In many regions where Latin
diacriticals are used, there is no acceptable transform of a surname to a
form that does not use diacriticals. Simply omitting the diacritical
causes (at least to the inhabitants of those regions) a misspelling.

This set of beliefs naturally biases how I approach the problem. The
problem statement must be free of bias, including mine.

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.
Mark Davis
2003-10-28 00:16:20 UTC
Permalink
I'm curious: why do you think that everyone would be satisfied with Latin
characters only, and no non-Latin characters?

Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

Keith Moore
2003-10-28 00:18:40 UTC
Permalink
[recipient list trimmed]

> I'm curious: why do you think that everyone would be satisfied with Latin
> characters only, and no non-Latin characters?

why do you think that everyone would be satisfied with ten digits on their
telephones? or for that matter, ten fingers?
Mark Crispin
2003-10-28 01:15:00 UTC
Permalink
On Mon, 27 Oct 2003, Mark Davis wrote:
> I'm curious: why do you think that everyone would be satisfied with Latin
> characters only, and no non-Latin characters?

I didn't say that. I stated my belief that, for reasons of practicality,
most individuals in regions which do not use Latin script accept the use
of Latin script for multinational exchange.

It does not work well for an individual in Japan with surname Tanaka to
expect the overwhelming majority of non-Japanese individuals worldwide to
know his surname is written with the Han characters for "rice paddy" and
"middle", or what those characters look like, or how to enter those
characters on the computer.

It does, however, work for him to expect the overwhelming majority of
individuals worldwide to know how to deal with the 6 Latin letters that
form the romanization "Tanaka".

Nor is it very likely that this situation will change in the future. I
doubt that many individuals in the world are literate in all the world's
active scripts. Literacy in one's native script and basic Latin script is
something that most computer users possess today.

For domestic exchange only, that pair of Han characters is probably
alright. Within Western Europe, it's probably alright to use Latin
characters with diacriticals.

Perhaps the main problem that needs to be decided in any IEA effort is if
it is alright to have email addresses that are only usable in limited
areas of the world; or if not, how to represent internationalized email
addresses in a usable fashion when (not if) the email address needs to be
represented for a person and/or computer that is illiterate in that script.

A likely side issue is whether it is "good enough" to promote Latin
characters with diacriticals to the same status of "everybody must know
how to do these" that is required for ASCII.

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.
Mark Davis
2003-10-28 01:39:32 UTC
Permalink
Ok, I understand more about the context.

Based on what I've seen, I think it quite likely that people will want email
addresses in their native script, even if that means that outsiders can't
(easily) use those email addresses. After all, it is quite easy to have multiple
email addresses. Mr. Tanaka can have one with Latin letters and one with
Japanese (e.g. ムルク@カク.ワシングトン.エデゥ).

We should remember that for a great many people in the world, Latin letters are
quite unnatural; it'd be a bit like if we had to use Greek letters in all email
addresses. And there are many projects underway in less-developed countries to
bring computers to masses of people that will have even less familiarity with Latin
letters.

Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

Mark Crispin
2003-10-28 03:37:37 UTC
Permalink
On Mon, 27 Oct 2003, Mark Davis wrote:
> Based on what I've seen, I think it quite likely that people will want email
> addresses in their native script, even if that means that outsiders can't
> (easily) use those email addresses.

That may well be the case.

> We should remember that for a great many people in the world, Latin
> letters are quite unnatural; it'd be a bit like if we had to use Greek
> letters in all email addresses. And there are many projects underway in
> less-developed countries to bring computers to masses of people that
> will have even less familiarity with Latin letters.

I am not convinced that it is possible to use a computer on the Internet
anywhere in the world without at least a basic acquaintance with Latin
script.

I do not believe many individuals (other than primary school children) are
literate in their native language but are completely illiterate in Latin
script. This does not mean "being able to read or write the English
language"; rather, this simply means knowing the Latin script alphabet.

Put another way, individuals who are completely illiterate in Latin script
are also likely to be illiterate in their native language script as well.

No other script on the planet has such international recognition.

There is undoubtedly a *preference* for one's native script; and that
preference should be respected as much as possible.

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.
Abhijit Menon-Sen
2003-10-28 04:22:43 UTC
Permalink
At 2003-10-27 19:37:37 -0800, ***@CAC.Washington.EDU wrote:
>
> I do not believe many individuals (other than primary school children) are
> literate in their native language but are completely illiterate in Latin
> script. This does not mean "being able to read or write the English
> language"; rather, this simply means knowing the Latin script alphabet.

Mark,

The number of people in India who can read and write only their native
language, but have no usable knowledge of Latin script, is much larger
than the tiny number who are familiar with both. I'm told that this is
true for many native speakers of Chinese and Arabic as well.

The use of local scripts is much more than just a "preference" for the
numerous localisation efforts in India which focus on making computing
more accessible to poor farmers and people in villages.

(I agree that it's currently nearly impossible to use computers if one
isn't familiar with the Latin script, of course.)

-- ams
Mark Crispin
2003-10-28 04:42:19 UTC
Permalink
On Tue, 28 Oct 2003, Abhijit Menon-Sen wrote:
> The number of people in India who can read and write only their native
> language, but have no usable knowledge of Latin script, is much larger
> than the tiny number who are familiar with both. I'm told that this is
> true for many native speakers of Chinese and Arabic as well.

I defer to your superior knowledge about India.

I do not believe that this is true for Chinese. AFAIK, Chinese primary
school kids use Latin script with hanyu-pinyin as a stopgap prior to their
mastery of Han script (which takes many years).

> The use of local scripts is much more than just a "preference" for the
> numerous localisation efforts in India which focus on making computing
> more accessible to poor farmers and people in villages.

A poor farmer or villager in China is more likely to be totally illiterate
than to be literate in Han script but unable to recognize Latin script.

Note that when I say "recognize Latin script", I mean the ability to
determine that "dog" is a three-letter word that has the letters "d", "o",
and "g", each of which the individual recognizes and can name. This does
not include the ability to recognize that this refers to a domesticated
canine.

> (I agree that it's currently nearly impossible to use computers if one
> isn't familiar with the Latin script, of course.)

Which probably makes the rest of this discussion academic, unless we're
going to undertake solving *that* problem for Microsoft and the various
UNIX/Linux vendors...

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.
Arnt Gulbrandsen
2003-10-28 10:15:17 UTC
Permalink
Mark Crispin writes:
> On Tue, 28 Oct 2003, Abhijit Menon-Sen wrote:
>> (I agree that it's currently nearly impossible to use computers if
>> one isn't familiar with the Latin script, of course.)
>
> Which probably makes the rest of this discussion academic, unless
> we're going to undertake solving *that* problem for Microsoft and the
> various UNIX/Linux vendors...

You're not appreciating the full complexity of the problem. ;)

Not only should the email standards permit MUAs and MTAs of the year
2020 to solve the problem Abhijit mentions, but they should even permit
such future programs to interoperate with latinate ones of the present
and near future. And if it's too hard for latinate MUAs to implement
the IEA standard, that won't happen.

--Arnt
Mark Davis
2003-10-28 15:10:59 UTC
Permalink
> > (I agree that it's currently nearly impossible to use computers if one
> > isn't familiar with the Latin script, of course.)
>
> Which probably makes the rest of this discussion academic, unless we're
> going to undertake solving *that* problem for Microsoft and the various
> UNIX/Linux vendors...

It is currently impossible to use the Internet without knowing the Latin script.
However, the goal of most well-designed client software and operating systems is
to permit the user to work entirely within their native language, with a fully
localized system. This is reaching to India and other countries; Microsoft has
introduced fully localized versions of Indic Windows just recently, and Linux
vendors are hard at work to produce fully localized versions of their software.

Email and Web addresses are the big remaining holdouts for most people. People
should not be forced to use a script that they are unfamiliar with, just to use
email addresses and sites in their own countries. Even if they are familiar with
the Latin script, it is very often a very bad match for their languages, making
it very difficult to figure out how native words would be spelled in it.

Mark
Keith Moore
2003-10-28 15:27:10 UTC
Permalink
> It is currently impossible to use the Internet without knowing the Latin script.
> However, the goal of most well-designed client software and operating systems is
> to permit the user to work entirely within their native language, with a fully
> localized system. This is reaching to India and other countries; Microsoft has
> introduced fully localized versions of Indic Windows just recently, and Linux
> vendors are hard at work to produce fully localized versions of their software.
>
> Email and Web addresses are the big remaining holdouts for most people. People
> should not be forced to use a script that they are unfamiliar with, just to use
> email addresses and sites in their own countries. Even if they are familiar with
> the Latin script, it is very often a very bad match for their languages, making
> it very difficult to figure out how native words would be spelled in it.

and yet, as we've seen time and time again, local use of nonportable addresses
can cause major problems for the net as a whole. we saw this in earlier days
of email with the admixture of bitnet/rscs/nje, uucp, decnet, x.400, and Internet
addresses. we've seen it in the IP space with RFC 1918 addresses.

in some ways internationalizing email addresses is a much harder problem than
internationalizing IDNs, because no other application is as dependent on having
human beings actually use addresses as email. (yes, people do sometimes type in
URLs, but not nearly as often as they click on links. and there are apps which
require humans to type in domain names, but for most of them this only happens
at configuration time.)

one way to approach this problem might be to make email less dependent on having
addresses typed in.

Keith
Mark Davis
2003-10-28 16:02:23 UTC
Permalink
An international email address would be portable as far as computers are
concerned; the issue would be for people: how to view it and how to key it in
(as you note). In both cases, that depends more on the client software than the
protocols.

I wouldn't be able to type in an email address in Tamil (though I could
certainly copy and paste it). I would wager that anyone who cares to have his
email used by foreigners would have dual email addresses, or perhaps even more;
e.g. Tamil, Latin, and Chinese.

Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

Keith Moore
2003-10-28 16:29:14 UTC
Permalink
> An international email address would be portable as far as computers
> are concerned; the issue would be for people: how to view it and how
> to key it in (as you note). In both cases, that depends more on the
> client software than the protocols.

it depends on client software, keyboards, and the ability of the users
to use them. but I suspect that the choice of protocol can change the
degree of dependence on these.

> I wouldn't be able to type in an email address in Tamil (though I
> could certainly copy and paste it). I would wager that anyone who
> cares to have his email used by foreigners would have dual email
> addresses, or perhaps even more; e.g. Tamil, Latin, and Chinese.

it's a bit more difficult than that - because email often involves more
than two parties. and it's not as if you can expect those who speak only
(e.g.) Tamil to correspond exclusively with others in that same set.

Keith
Mark Crispin
2003-10-28 16:35:09 UTC
Permalink
On Tue, 28 Oct 2003, Mark Davis wrote:
> I would wager that anyone who cares to have his
> email used by foreigners would have dual email addresses, or perhaps even more;
> e.g. Tamil, Latin, and Chinese.

I wonder if the full implications of this statement are understood by
those who blithely advocate its position. The Law of Unintended
Consequences severely punishes those who disregard it.

Just as a few off-the-cuff examples:

Is ASCII formally abolished as the lingua franca of email addresses,
meaning that some people will be unable to use ASCII addresses?

If so, how many email addresses will a diplomat (or any other individual
engaged in multinational business) need?

If not, doesn't that create a two tier world of persons who have usable
international addresses and those who have domestic-only addresses? It is
not difficult to envision that in many countries, an ASCII address would
be a privilege available only to select individuals (much as a phone line
with international dialing).

Authors of spam-blocking software (including me!) will be delighted.
Here will be yet another weapon in our arsenal.


I am not arguing any particular course of action. I am tossing a few
paper airplanes at the clay feet of those who seem to disregard some
serious problems. Those problems need to be addressed; but every time
these problems are mentioned, we keep hearing about the email needs of
illiterate villagers in the third world who will have high-speed Internet
before they have safe drinking water.

Nobody has disputed the desirability of enabling computer users to use
their native scripts to the maximum extent possible. What we are
discussing is what is "possible", and how to make the "impossible" become
"possible".

Many decades ago, one of my professors told me something that I have
always kept in mind since: "you have to learn to walk before you learn to
run."

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.
John Cowan
2003-10-28 16:58:49 UTC
Permalink
Mark Crispin scripsit:

> Is ASCII formally abolished as the lingua franca of email addresses,
> meaning that some people will be unable to use ASCII addresses?

All email addresses will continue to have an ASCII *representation*, which
will not, for most people, have mnemonic value. It may therefore be useful
to have two email addresses, one which can be decoded into a non-ASCII
representation and is good for fellow-countrymen, and another which cannot
be decoded, but has mnemonic value for those who use the Latin script.
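
As a rough illustration of what an ASCII *representation* looks like, here is
a minimal Python sketch. It relies on Python's built-in "idna" codec, which
implements the IDNA encoding for domain names only. The function names
(ascii_form, display_form) and the sample address are assumptions made for
this example, and the local part is passed through unchanged because how to
encode it is exactly what the drafts under discussion disagree on.

```python
# Sketch only: convert the domain part of an address to and from its
# ASCII-compatible encoding using Python's built-in "idna" codec.
# The local part is deliberately left alone (an assumption of this
# example, not a rule taken from any of the drafts).

def ascii_form(address: str) -> str:
    """Return the address with its domain in ASCII-compatible form."""
    local, _, domain = address.rpartition("@")
    return local + "@" + domain.encode("idna").decode("ascii")


def display_form(address: str) -> str:
    """Decode an ASCII-compatible domain back to its native-script form."""
    local, _, domain = address.rpartition("@")
    return local + "@" + domain.encode("ascii").decode("idna")


if __name__ == "__main__":
    wire = ascii_form("info@bücher.example")
    print(wire)                # info@xn--bcher-kva.example
    print(display_form(wire))  # info@bücher.example
```

The encoded form is usable by ASCII-only software but, as noted above, it
carries little mnemonic value for a person reading it.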

> If so, how many email addresses will a diplomat (or any other individual
> engaged in multinational business) need?

That's like asking how many languages should appear on your business card.
As many as the different kinds of people with whom you habitually interact.

> If not, doesn't that create a two tier world of persons who have usable
> international addresses and those who have domestic-only addresses? It is
> not difficult to envision that in many countries, an ASCII address would
> be a privilege available only to select individuals (much as a phone line
> with international dialing).

No address is quite domestic-only, though it may be a meaningless jumble
to people who are interacting with the ASCII representation. Still,
a good friend of mine has the already very annoying email address of
<Onederful111s@[censored].net>, yes lower-case ell followed by three
digit ones. Go figure.

> Many decades ago, one of my professors told me something that I have
> always kept in mind since: "you have to learn to walk before you learn to
> run."

Actually, running before walking is quite possible in child development. :-)

--
I marvel at the creature: so secret and John Cowan
so sly as he is, to come sporting in the pool ***@reutershealth.com
before our very window. Does he think that http://www.reutershealth.com
Men sleep without watch all night? --Faramir http://www.ccil.org/~cowan
Mark Crispin
2003-10-28 18:51:16 UTC
Permalink
On Tue, 28 Oct 2003, John Cowan wrote:
> All email addresses will continue to have an ASCII *representation*, which
> will not, for most people, have mnemonic value.

Is that a given as one of the requirements? If so, that greatly
constrains the solution set. I happen to think that that is a desirable
constraint; but I have certainly not heard any definite hum of consensus
that this shall be a requirement.

Talk about using alternate forms of "@" (and the like) doesn't sound like
a hum to me.

> > If so, how many email addresses will a diplomat (or any other individual
> > engaged in multinational business) need?
> That's like asking how many languages should appear on your business card.
> As many as the different kinds of people with whom you habitually interact.

That is not a satisfactory answer to the question.

I am engaged in software support. I habitually interact with people from
around the planet. Any proposal that states that someone like me must
maintain hundreds (if not thousands) of email addresses is a non-starter.

It is obvious (at least to me) that there is a constraint that prevents
such an outcome. What is that constraint?

> No address is quite domestic-only, though it may be a meaningless jumble
> to people who are interacting with the ASCII representation.

Again, that assumes a requirement that (as far as I can tell) has not been
made.

If that truly is the requirement, we can have a very short meeting; just
sufficient to say "just use UTF-7" (or any other politically correct
rendering of the day), get the hum, and go home.

Something tells me that isn't going to happen. :-)

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.
John Cowan
2003-10-28 19:13:35 UTC
Permalink
Mark Crispin scripsit:

> > That's like asking how many languages should appear on your business card.
> > As many as the different kinds of people with whom you habitually interact.
>
> That is not a satisfactory answer to the question.
>
> I am engaged in software support. I habitually interact with people from
> around the planet. Any proposal that states that someone like me must
> maintain hundreds (if not thousands) of email addresses is a non-starter.

First of all, there aren't hundreds of scripts in live use. Second,
you probably support people in at most five languages, using at most
two or three scripts, unless you are a prodigy. (You may have more
nationalized versions of some products than that, but you probably
can't handle more than that many languages *in email*.) An address
like "***@VeryBigCo.com" might need a hundred variants, but you
certainly won't.

For myself, I support people in English, and I need to be able to
recognize support requests in French and Spanish so that I can route them
to the right places. A single Latin-script email address is sufficient
for this. Adding a Cyrillic-script email address would be as silly
as adding Russian contact information to my business card: I can't do
anything for people who want to communicate with me in Russian anyway.

> > No address is quite domestic-only, though it may be a meaningless jumble
> > to people who are interacting with the ASCII representation.
>
> Again, that assumes a requirement that (as far as I can tell) has not been
> made.

Well, it's certainly what John's draft prescribes. As for the fullwidth @,
the only significance of that is that it's folded into an ASCII @ before
doing anything else, so that people who type a fullwidth @ instead of an
ASCII one are not inconvenienced too much.
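
The folding step described above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the helper names
(fold_at, split_address) are hypothetical, and the sketch folds only the
fullwidth commercial at sign (U+FF20), not any other character.

```python
import unicodedata

FULLWIDTH_AT = "\uff20"  # FULLWIDTH COMMERCIAL AT


def fold_at(address: str) -> str:
    """Replace a fullwidth at sign with the ASCII one before anything else."""
    return address.replace(FULLWIDTH_AT, "@")


def split_address(address: str):
    """Split a folded address into (local part, domain)."""
    local, sep, domain = fold_at(address).rpartition("@")
    if not sep or not local or not domain:
        raise ValueError("not a usable address: %r" % address)
    return local, domain


if __name__ == "__main__":
    # Unicode compatibility normalization (NFKC) performs the same mapping.
    print(unicodedata.normalize("NFKC", FULLWIDTH_AT) == "@")  # True
    print(split_address("tanaka\uff20example.jp"))  # ('tanaka', 'example.jp')
```

Doing the fold first means every later processing step only ever sees the
ASCII separator.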

> If that truly is the requirement, we can have a very short meeting; just
> sufficient to say "just use UTF-7" (or any other politically correct
> rendering of the day), get the hum, and go home.

That doesn't address appropriate folding, nor does it avoid stepping on
email addresses that are likely to be already in use.

--
I suggest you call for help, John Cowan
or learn the difficult art of mud-breathing. ***@reutershealth.com
--Great-Souled Sam http://www.ccil.org/~cowan
Mark Crispin
2003-10-28 19:37:19 UTC
Permalink
On Tue, 28 Oct 2003, John Cowan wrote:
> First of all, there aren't hundreds of scripts in live use.

Well, then, how many are there, and is there a list? This will allow us
to eliminate all the non-live scripts from future consideration.

> Second,
> you probably support people in at most five languages, using at most
> two or three scripts, unless you are a prodigy.

I think that you have misunderstood the problem. You seem to be thinking
about the needs of a user interface developer, who only supports people
who use the languages that are supported by that user interface.

Those of us who develop system tools build software that is used
everywhere in the world. Because system tools are not a user interface,
they are not restricted to the set of languages or scripts which are
understood by the author.

Furthermore, we are not talking about the language used by any support
messages to/from me. We are talking about whether someone in Lower
Slobbovia can enter the developer's email address, and whether the
developer can enter the Lower Slobbovian address.

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.
John Cowan
2003-10-28 21:30:02 UTC
Permalink
Mark Crispin scripsit:

> Well, then, how many are there, and is there a list? This will allow us
> to eliminate all the non-live scripts from future consideration.

There are a little more than 50 scripts in current use.
http://www.evertype.com/standards/iso15924/document/dis15924.pdf gives
a list, and http://www.unicode.org/Public/4.0-Update/Scripts-4.0.0.txt
can be massaged into a list of scripts currently in Unicode with

awk '$4 == "#" {print $3}' Scripts-4.0.0.txt | sort -u
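
For anyone without awk to hand, the same extraction can be sketched in
Python: keep the lines whose fourth whitespace-separated field is "#" and
collect the third field. The sample lines below are illustrative, written in
the layout of the Scripts data file rather than copied from it, and the
function name script_names is an assumption of this example.

```python
# Sketch: list the distinct script names found in text laid out like the
# Unicode Scripts data file ("range ; Script # category ...").

SAMPLE = """\
# Illustrative lines in the layout of the Unicode Scripts data file
0041..005A    ; Latin # L&  [26] LATIN CAPITAL LETTER A..LATIN CAPITAL LETTER Z
0061..007A    ; Latin # L&  [26] LATIN SMALL LETTER A..LATIN SMALL LETTER Z
0391..03A1    ; Greek # L&  [17] GREEK CAPITAL LETTER ALPHA..GREEK CAPITAL LETTER RHO

# Total code points: (elided)
"""


def script_names(text: str) -> list:
    """Return the sorted, de-duplicated script names in the given text."""
    names = set()
    for line in text.splitlines():
        fields = line.split()
        # Data lines have "#" as their fourth field; comment and blank
        # lines do not, so they are skipped.
        if len(fields) >= 4 and fields[3] == "#":
            names.add(fields[2])
    return sorted(names)


if __name__ == "__main__":
    print(script_names(SAMPLE))  # ['Greek', 'Latin']
```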

> Those of us who develop system tools build software that is used
> everywhere in the world. Because system tools are not a user interface,
> they are not restricted to the set of languages or scripts which are
> understood by the author.

True. But if they write you for support, they have to use a language you
understand; people who can only handle Hindi can't get support from you
and don't need to be able to type your email address (assuming wlg that
you have no Hindi). This is quite independent of the product type; it
applies just as much to buggy whips as to system software.

> Furthermore, we are not talking about the language used by any support
> messages to/from me. We are talking about whether someone in Lower
> Slobbovia can enter the developer's email address, and whether the
> developer can enter the Lower Slobbovian address.

What's the point of their entering your email address if they can only
write to you in Slobbovian?

--
Eric Raymond is the Margaret Mead John Cowan
of the Open Source movement. ***@reutershealth.com
--Lloyd A. Conway, http://www.ccil.org/~cowan
amazon.com review http://www.reutershealth.com
Mark Crispin
2003-10-29 00:20:00 UTC
Permalink
On Tue, 28 Oct 2003, John Cowan wrote:
> But if they write you for support, they have to use a language you
> understand; people who can only handle Hindi can't get support from you
> and don't need to be able to type your email address (assuming wlg that
> you have no Hindi).

This is confusing apples and oranges again. We are not talking about the
language of the email text. We are talking about the email address, and
whether or not the individuals at each end can read and enter the other's
email address.

We are, in effect, talking about the equivalent of extending the 10 digits
of telephone numbers to have letters. And I'm not talking about the
strange American convention of saying "my phone number is 555-COOL-GUY"
when in fact the number is 555-266-5489. I'm talking about adding
entirely new discrete things that can be dialed. And phones which only
have digits can't call these numbers. Or can they?

If so, how? That, more than anything else, is what any IEA solution will
have to address. We can safely assume that if Hindi IEA addresses are
possible, Hindi environment software will quite capably provide a means
for Hindi users to use these addresses. We can even safely assume that as
long as there is a way (any way), it doesn't really matter what format the
bits are in; the Hindi user interface software will smooth over any rough
edges.

The messy part is how software that doesn't know Hindi from Martian is to
cope with such an address. It is in this case that the format of the bits
becomes important.

> > Furthermore, we are not talking about the language used by any support
> > messages to/from me. We are talking about whether someone in Lower
> > Slobbovia can enter the developer's email address, and whether the
> > developer can enter the Lower Slobbovian address.
> What's the point of their entering your email address if they can only
> write to you in Slobbovian?

Maybe he only knows Slobbovian, but the village schoolteacher may know
enough English to cobble together something that I can comprehend, and
decipher my response.

Like most other developers with an international audience, I've become
fairly skilled in determining the meaning from broken, ungrammatical
English, and in framing my response so that the village schoolteacher (who
dimly remembers learning English in grade school decades ago) can do
likewise.

Alternatively, I may know of a Slobbovian restaurant downtown where I can
either have his letter, or my response, translated. Or perhaps find a
Slobbovian student. Or there may even be a Department of Slobbovian
Studies at the nearby university.

This sort of thing is quite common. It is fallacious to assume that two
people do not communicate just because neither speaks the other's
language. This is not a major barrier.

An incautious implementation of IEA, on the other hand, would be a major
barrier.

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.
Markus Stumpf
2003-10-28 20:38:48 UTC
Permalink
On Tue, Oct 28, 2003 at 02:13:35PM -0500, John Cowan wrote:
> For myself, I support people in English, and I need to be able to
> recognize support requests in French and Spanish so that I can route them
> to the right places. A single Latin-script email address is sufficient
> for this. Adding a Cyrillic-script email address would be as silly
> as adding Russian contact information to my business card: I can't do
> anything for people who want to communicate with me in Russian anyway.

I have friends all over the world. My name doesn't have any German
umlauts but a lot of theirs do.
How does
günther.größ***@hübnerbräu.de
look to you? (This is not an example crafted to be a pain.)
Does your mail client render this correctly with all the umlauts?
Will a Chinese one? Or a Russian one? Or ...?
Aren't we all on the same mailing lists? We can all talk to each
other using English as the (current) lingua franca on the Internet.
And we can all type our email addresses, as every keyboard I know of
is able to produce ASCII characters. Does yours have a mapping for
iso-8859-1 (probably, as you have to use French)? Does it have a Greek
mapping (iso-8859-7)? Wouldn't it be a pain if you couldn't send email
to the hotel in Athens where you want to go for the Olympic Games, only
because they have an email address in iso-8859-7 only and you don't have
the characters on your keyboard? ;-)

If I have Latin-1 addresses on my business card and go to some international
fair, do I also need a business card with an ASCII-coded address on it?
Do I need a bag just for all the different business cards with my email
address in different codings??
And if I need an ASCII-coded address anyway, to avoid the bag, why would
I want to have another one that cannot be used by probably 98% of the
people on this planet at all?

\Maex

--
SpaceNet AG | Joseph-Dollinger-Bogen 14 | Fon: +49 (89) 32356-0
Research & Development | D-80807 Muenchen | Fax: +49 (89) 32356-299
"The security, stability and reliability of a computer system is reciprocally
proportional to the amount of vacuity between the ears of the admin"
Mark Crispin
2003-10-28 21:28:43 UTC
Permalink
On Tue, 28 Oct 2003, Markus Stumpf wrote:
> How does
>[snip]
> look to you? (This is not an example crafted to be a pain.)

Thanks for a great example!!

It appears to me as:
"g" <center-dot> "ther.gr" <center-dot> "***@h" <center-dot> "nerbr"
<Han character for "moor" (as in a boat)> ".de".

Now, if I set my environment so that I'm interpreting things in Latin-1,
then I see the German string which is Anglicized as:
***@huebnerbraeu.de

Of course, if my environment had been UTF-8 instead of Japanese, it
would have been rendered correctly without change. But, I still would
have had to change my Japanese input environment to enter that string. I
would have had to know how to enter umlauts and eszett.
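
The effect described above can be reproduced with a small sketch. It assumes the address travelled as Latin-1 bytes and that the reading side decodes them with a Japanese codec; shift_jis is a guess at that environment, not something stated in the message. Only the domain from the earlier example is used, since the local part is elided in the archive.

```python
# Domain from the example earlier in the thread. The codec choices are
# assumptions: latin-1 for the bytes as sent, shift_jis for the reader.
domain = "hübnerbräu.de"
wire_bytes = domain.encode("latin-1")

# Reader configured for Latin-1: the bytes map back to the original text.
as_latin1 = wire_bytes.decode("latin-1")

# Reader configured for a Japanese codec: the umlaut bytes are taken as
# lead bytes of double-byte characters, so neighbouring letters are
# swallowed or replaced.
as_japanese = wire_bytes.decode("shift_jis", errors="replace")

print(wire_bytes)
print(as_latin1)
print(as_japanese)
```

The bytes are identical in both cases; only the reader's assumption about the charset changes, which is why the format of the bits matters to software that knows nothing about the script.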

If I were a native Japanese speaker, I would have learned the Latin alphabet
without any diacritics. I would know how to write my name in the Latin
alphabet, albeit with numerous spelling variants; "Fujishima",
"Huzisima", "Fujisima", "Fuzishima", etc. are all the same name. Umlauts
(and especially eszett) are a different matter.


-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.
Arnt Gulbrandsen
2003-10-28 17:55:58 UTC
Permalink
Mark Crispin writes:
> Just as a few off-the-cuff examples:
>
> Is ASCII formally abolished as the lingua franca of email addresses,
> meaning that some people will be unable to use ASCII addresses?

I think the latter will some day be the case. Personally I'm not unhappy
about it - anyone who doesn't know my alphabet is unlikely to know any
language I also know.

> If so, how many email addresses will a diplomat (or any other
> individual engaged in multinational business) need?

1<=n<=m, where m is the number of alphabets that individual uses.

A Chinese diplomat in Saudi Arabia might have three - one for each of
the alphabets in which he communicates. His Indian counterpart might
take the position that his Saudi hosts had better know the Latin
alphabet, and have just one email address. Matter of judgment. (I'm
misquoting an Indian newspaper, btw. Badly so.)

> If not, doesn't that create a two tier world of persons who have
> usable international addresses and those who have domestic-only
> addresses?

Yep. Just like some of us speak foreign languages and others don't.

As long as the email addresses are fully sufficient for communication
with everyone with whom the user shares a language, it's okay.

--Arnt
Mark Crispin
2003-10-28 19:00:48 UTC
Permalink
On Tue, 28 Oct 2003, Arnt Gulbrandsen wrote:
> > Is ASCII formally abolished as the lingua franca of email addresses,
> > meaning that some people will be unable to use ASCII addresses?
> I think the latter will some day be the case. Personally I'm not unhappy
> about it - anyone who doesn't know my alphabet is unlikely to know any
> language I also know.

Such a prospect appalls me.

> > If not, doesn't that create a two tier world of persons who have
> > usable international addresses and those who have domestic-only
> > addresses?
> Yep. Just like some of us speak foreign languages and others don't.
> As long as the email addresses are fully sufficient for communication
> with everyone with whom the user shares a language, it's okay.

This mixes apples and oranges; that is, the ability to have an email
address in your native script with language ability.

-- Mark --

http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.
Arnt Gulbrandsen
2003-10-29 10:46:33 UTC
Permalink
Mark Crispin writes:
> On Tue, 28 Oct 2003, Arnt Gulbrandsen wrote:
>> As long as the email addresses are fully sufficient for communication
>> with everyone with whom the user shares a language, it's okay.
>
> This mixes apples and oranges; that is, the ability to have an email
> address in your native script with language ability.

Oh? They seem related: The ability to have an email address in script x
is closely tied to knowledge of script x. Knowledge of script x is also
closely tied to a language that is customarily written in script x.

But it's a distraction. Does anyone really think email addresses should
be restricted _by_the_standard_ to some limited set of scripts?

--Arnt
John Wagner
2003-10-28 16:40:07 UTC
Permalink
Mark Davis wrote:
>
> An international email address would be portable as far as computers are
> concerned; the issue would be for people: how to view it and how to key it in
> (as you note). In both cases, that depends more on the client software than the
> protocols.
>
> I wouldn't be able to type in an email address in Tamil (though I could
> certainly copy and paste it). I would wager that anyone who cares to have his
> email used by foreigners would have dual email addresses, or perhaps even more;
> e.g. Tamil, Latin, and Chinese.

I seem to remember this coming up in another discussion and the immediately following point was how hard it is to do a cut and paste from a business card with the "native language" email address into a user interface that has no clue.

The problem is there is more than one human/machine interface for email addresses. It is a technically simple thing to have multiple addresses, but it is not technically simple to be certain that the address I give you works when you try to communicate with me.

--
John Wagner
Marc Blanchet
2003-10-28 16:12:54 UTC
Permalink
it appears to me that this thread is not very different from the idn
considerations on usage of idn in the world. So what is really new in this
discussion?

Marc.

-- Tuesday, October 28, 2003 07:10:59 -0800 Mark Davis
<***@jtcsv.com> wrote/a ecrit:

>> > (I agree that it's currently nearly impossible to use computers if one
>> > isn't familiar with the Latin script, of course.)
>>
>> Which probably makes the rest of this discussion academic, unless we're
>> going to undertake solving *that* problem for Microsoft and the various
>> UNIX/Linux vendors...
>
> It is currently impossible to use the Internet without knowing the Latin
> script. However, the goal of most well-designed client software and
> operating systems is to permit the user to work entirely within their
> native language, with a fully localized system. This is reaching to India
> and other countries; Microsoft has introduced fully localized versions of
> Indic Windows just recently, and Linux vendors are hard at work to
> produce fully localized versions of their software.
>
> Email and Web addresses are the big remaining holdouts for most people.
> People should not be forced to use a script that they are unfamiliar
> with, just to use email addresses and sites in their own countries. Even
> if they are familiar with the Latin script, it is very often a very bad
> match for their languages, making it very difficult to figure out how
> native words would be spelled in it.
>
> Mark
>



------------------------------------------
Marc Blanchet
Hexago
tel: +1-418-266-5533x225
------------------------------------------
http://www.freenet6.net: IPv6 connectivity
------------------------------------------
Mark Davis
2003-10-28 16:22:37 UTC
Permalink
I wouldn't have thought so either, but at least some people questioned the need
for non-Latin characters, and went so far as to exclude them from a proposed
problem statement:

> ... There is a growing need to use additional
> characters, specifically Latin characters with diacriticals and non-Latin
> characters, in email addresses to better serve the needs of the
> multi-national Internet community...

Mark
__________________________________
http://www.macchiato.com
► शिष्यादिच्छेत्पराजयम् ◄

Keith Moore
2003-10-28 16:50:56 UTC
Permalink
> it appears to me that this thread is not very different from the idn
> considerations on usage of idn in the world. So what is really new in this
> discussion?

many of the considerations are the same. however, because email addresses
are used differently than domain names, the relative importance of the
considerations may be different.
John C Klensin
2003-10-28 17:36:07 UTC
Permalink
--On Tuesday, October 28, 2003 11:12 -0500 Marc Blanchet
<***@viagenie.qc.ca> wrote:

> it appears to me that this thread is not very different from
> the idn considerations on usage of idn in the world. So what
> is really new in this discussion?

See the draft.

Quick answer: DNS interfaces really exist at the protocol level,
and a large part of the hypothesis behind IDNA was that it would
be possible, after we had enough implementations, to prevent an
end-user from ever seeing an encoded domain name. That story
just doesn't hold for a special encoding-based (aka MUA-only)
email local part implementation, and maybe not for email
generally. For example, under current rules, an MTA is required
to stuff the name it actually sees in HELO/EHLO and the MAIL
command into various headers. If it is expected to notice that
they have special encodings and decode them, we've gone rather
far into "the infrastructure is involved", even if the actual
on-the-wire transport is not impacted. If it doesn't do that,
the encodings --both the IDNA domain parts and the special mail
encoding-- are going to be in the user's face, in the most
literal sense of that term.
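
To make "in the user's face" concrete, here is a minimal sketch using Python's built-in "idna" codec, which implements the 2003 IDNA conversion for domain names. The domain is the one from the example elsewhere in this thread. Nothing here covers the local part, for which no encoding has been agreed on.

```python
# Domain part only; the local-part encodings under discussion are not shown.
domain = "hübnerbräu.de"

# ToASCII: the form that existing infrastructure carries around.
ace = domain.encode("idna")

# ToUnicode: what a display layer has to do so the user never sees the above.
shown = ace.decode("idna")

print(ace)
print(shown)
```

Any component that copies the first form into a header without running the second step leaves the encoded string in front of the reader.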

The similarity of that situation to the early IDN discussions is
the importance of the "what problem are you solving" question.
And it is very clear to me that, for email addresses, the answer
has got to be "users see their own characters in their email
addresses and in the email addresses of those whose languages they
speak/recognize". Users typically don't actually see envelopes.
But seeing, e.g., different forms/codings of an address in the
header "From:" field than appears in "Return-path:" or than
appears in a signature line in the message body, is going to
create real unhappiness. Similarly, as has been pointed out in
another context, seeing an address in a different form when it
appears in a header than when it (and that header) are
encapsulated in a message/?? body part is just not going to
amuse any user to whom we've said "ok, now you have i18n
strings, enjoy your newfound local language capabilities". And
I guess that tells you what I think the problem is that we need
to solve. YMMV.

john
James Seng
2003-10-29 23:32:46 UTC
Permalink
> I do not believe that this is true for Chinese. AFAIK, Chinese primary
> school kids use Latin script with hanyu-pinyin as a stopgap prior to their
> mastery of Han script (which takes many years).

Nope. Hanyu Pinyin was designed to replace the Han ideograph but it
never did.

> Note that when I say "recognize Latin script", I mean the ability to
> determine that "dog" is a three-letter word that has the letters "d", "o",
> and "g", each of which the individual recognizes and can name. This does
> not include the ability to recognize that this refers to a domesticated
> canine.

Based on your reasoning, I think we should all revert to
numbers, because that is the only universally recognizable set of symbols
we have.

We had this argument in IDN before and we certainly don't need it
again. If you feel we should all stick to Latin, yes, you are entitled
to your opinion, but please voice it elsewhere, and not here. The group
is supposed to work on Internationalization of Email addresses
(identifiers), not debate whether we need it or not.

-James Seng
Keith Moore
2003-10-30 00:07:46 UTC
Permalink
[recipient list trimmed]

> The group
> is supposed to work on Internationalization of Email addresses
> (identifiers), not debate whether we need it or not.

actually I believe the question at hand (for the BOF) _is_
at least partially whether we need it or not. few people would
doubt that there is *some* need, but perhaps not a need for
new mailbox identifiers.

(am trying to keep an open mind myself, having started out
favoring an IDNA-like approach and now find myself questioning
it)

Keith
V***@vt.edu
2003-10-30 04:44:56 UTC
Permalink
On Thu, 30 Oct 2003 07:32:46 +0800, James Seng said:
> to your opinion, but please voice it elsewhere, and not here. The group
> is supposed to work on Internationalization of Email addresses
> (identifiers), not debate whether we need it or not.

Any group that addresses "how" and "for which contexts" without having
a good grasp on "why" is inventing solutions in search of problems.

Mark actually *does* have a *very* valid point - on today's internet, if you
cannot recognize and enter the glyphs for at least c, h, m, o, p, t, w, ':',
'@', '.', and '/' you are effectively unable to use the internet. It may not
make any sense to you, but you can at least recognize and enter them (note that
this same issue was one of the biggest arguments against the .biz domain).

So.. having established that if they're currently using the internet, they can at least
recognize and enter the Latin glyphs, this raises a number of *very* important questions:

1) Is there reason to *not* expect said knowledge of Latin glyphs in the future?
If so, what user group(s) will be literate but not know the Latin charset?

2) Is a "community" approach acceptable? Is usage of Han OK as long as
you're interacting with other Han users, or are the issues of leakage too high?

3) What *are* the issues of leakage? What am I expected to see if I get some Han,
and how am I to interact with it? Equally important, what does the Han user do
with my leaked Latin-A characters?

4) Here's a somewhat related issue - looking at the U0100.pdf from www.unicode.org,
I had to enlarge page 2 quite a bit before I could see the difference between the glyphs
at 0114/0115 (capital/small e with breve) and 011A/011B (capital/small e with caron).
And I know my way around most of the Latin characters - our hypothetical Han
user is going to be swinging in the breeze if he gets a business card with e-caron on it.

And if you can't safely put e-caron on a business card, why are we bothering?
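
The four code points mentioned in item 4 can be told apart by name and by decomposition even when the glyphs cannot. A minimal sketch using Python's standard unicodedata module follows; the helper name describe is made up for this illustration.

```python
import unicodedata


def describe(code_point):
    """Return (character name, NFD decomposition as a list of U+XXXX strings)."""
    ch = chr(code_point)
    decomposed = unicodedata.normalize("NFD", ch)
    return unicodedata.name(ch), [f"U+{ord(c):04X}" for c in decomposed]


# Capital/small e with breve, then capital/small e with caron.
for cp in (0x0114, 0x0115, 0x011A, 0x011B):
    name, parts = describe(cp)
    print(f"U+{cp:04X} {chr(cp)} {name} {parts}")
```

The breve forms decompose to a base letter plus U+0306 and the caron forms to a base letter plus U+030C, so software can distinguish them reliably. That does not help the person reading the business card.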
Tan Tin Wee
2003-10-30 05:33:31 UTC
Permalink
***@vt.edu wrote:
> On Thu, 30 Oct 2003 07:32:46 +0800, James Seng said:
>
>>to your opinion, but please voice it elsewhere, and not here. The group
>>is supposed to work on Internationalization of Email addresses
>>(identifiers), not debate whether we need it or not.
>
>
> Any group that addresses "how" and "for which contexts" without having
> a good grasp on "why" is inventing solutions in search of problems.
[snipped]

>
> And if you can't safely put e-caron on a business card, why are we bothering?

We're bothering because it has occurred to some of us that
some folks somewhere in the world may wish to send email only
among themselves within an intranet or a large national intranet, or
wish to launch their own internal e-government or
e-education system that involves interpersonal
and interdepartmental communications amongst folks that
speak the same language, which does not happen to be Latin-based.
And if they need to send email to outsiders, then they would
send using an ASCII email address, as routinely as they would
flip between the reverse and obverse of their
namecards, one side in the local language (including their
local IEA address) and the other in the global lingua franca,
Latin.

Right now, the Mongolian (or whatever) government cannot launch, say,
their e-Government intranet email system seamlessly with
the Internet without much pain in getting everyone up to
speed on the Latin character set. IEA support will definitely
be a boon to such folks.
V***@vt.edu
2003-10-30 06:24:33 UTC
Permalink
On Thu, 30 Oct 2003 13:33:31 +0800, Tan Tin Wee said:
> And if they need to send email to outsiders, then they would
> send using an ASCII email address, as routinely as they would

OK.. I get that part. Now for the big question: You're there in this
Mongolian intranet, and find you need to ask me a technical question,
so my address gets entered in ascii. OK so far. You now decide you
need to cc: somebody on the intranet so they know I've been asked.

1) What does that person do with my ascii-fied address?
2) How do I do a 'reply all' to both of you?

3) How is your From: address encoded so it's usable *BOTH* from where
I am and from where your co-worker is?

3a) Can you achieve goal (3) while using the same From: as you would use
if you were mailing ONLY to the intranet (so people don't have to maintain
2 differently encoded values for your address for filtering purposes, etc.)?

If whatever Mongolia was doing was guaranteed to stay in Mongolia, it wouldn't
be an issue. However, people inside the enclave *will* want to communicate
with outsiders as well - and the instant you allow an e-mail to cross the
border, you have to get all these types of issues sorted out.

Mark Crispin's point was that currently, knowledge of Latin glyphs *is*
assumed, and as far as anybody has evidenced, this hypothetical Mongolian
intranet with many non-Latin-aware users is still hypothetical - and with no
evidence saying there actually IS one in the works someplace. So Mark quite
reasonably pointed out that it may very well make more *engineering* sense to
simply train the very small number of users who don't know Latin glyphs than to
come up with some very convoluted scheme that annoys everybody else.

The tail has to be a certain size before it's able to wag the dog.
Steve Dyer
2003-10-30 07:58:27 UTC
Permalink
At 01:24 30/10/2003 -0500, ***@vt.edu wrote:
>On Thu, 30 Oct 2003 13:33:31 +0800, Tan Tin Wee said:
> >snip
>If whatever Mongolia was doing was guaranteed to stay in Mongolia, it wouldn't
>be an issue. However, people inside the enclave *will* want to communicate
>with outsiders as well - and the instant you allow an e-mail to cross the
>border, you have to get all these types of issues sorted out.
>
>Mark Crispin's point was that currently, knowledge of Latin glyphs *is*
>assumed, and as far as anybody has evidenced, this hypothetical Mongolian
>intranet with many non-Latin-aware users is still hypothetical - and with no
>evidence saying there actually IS one in the works someplace. So Mark quite
>reasonably pointed out that it may very well make more *engineering* sense to
>simply train the very small number of users who don't know Latin glyphs
>than to
>come up with some very convoluted scheme that annoys everybody else.

Hi

The above is a self-fulfilling prophecy. Of course you have to have a
knowledge of Latin glyphs because the only Internet around is based on them.
The millions of people (including Mongolians) who don't know Latin glyphs
(and may not yet have computers, and may be illiterate) are still the
future users, possibly the majority of future users, of the Internet and
this effort is to address their future needs.

My observation is that we are maybe worrying too much about the characters
being represented and how they are used. We should merely define a
protocol. In my book that protocol should allow email addresses to carry
the widest possible eight-bit payload (i.e. everything except control
characters), so that existing non-Unicode special characters can be carried,
along with IDN-style Unicode-encoded characters.

Sure, there will be problems; there will be anyway. However, we need to
enable local users to communicate in local scripts in the local language.

In this regard Europeans and Americans are maybe too steeped in the
existing Latin/ASCII to address this issue. We should listen carefully to
the non-Latin world rather than trying to find reasons not to do it.

I often get completely incomprehensible emails which originated in
non-Latin scripts - I just delete them. I know they're not important for me
because no author who tries to write to me in Chinese is going to get very
far no matter what communication medium he uses. Why? Because I don't speak
Chinese.

>The tail has to be a certain size before it's able to wag the dog.

This dog is almost all tail!

Regards

Steve Dyer
Dave Crocker
2003-10-30 07:09:24 UTC
Permalink
Valdis,

VKve> Mark actually *does* have a *very* valid point - on today's internet, if you



1. The goal is to go beyond today's internet. (But then, that is always the
goal of a new standard.)

2. Although the primary focus of IETF work is to make standards for global
interoperability, there are fine and valid needs for making global conventions
to support local interoperability. Permitting local-part to be Unicode is one
of those. (And, no, I do not believe this requires changing the global parsing
rules.)

d/
--
Dave Crocker <dcrocker-at-brandenburg-dot-com>
Brandenburg InternetWorking <www.brandenburg.com>
Sunnyvale, CA USA <tel:+1.408.246.8253>
Spencer Dawkins
2003-10-30 12:49:13 UTC
Permalink
Ummm, I'm not a Genius of E-mail, but I have sent a few. :-}

The very-helpful scenario Valdis included a couple of notes back ("if
we punt on common ability to use Latin glyphs") has happened in my
life, at the presentation level - I've been swapping e-mail back and
forth with some very talented people in Korea. So, they have
ascii-fied e-mail addresses that aren't THAT obvious (I think spencer,
or dcrocker, or even vinton.c.cerf, is pretty obvious, but my three
kids are daddys_little_hurl (April), buddha20oz(Daniel), and
gypsycameo(Amy), so you're not going to figure out who's been copied
just by looking at e-mail addresses in the general case), and
presentation names that are localized, so I can't read them. If a
collaborator copies three people who I can't decode from the e-mail
address, and can't read from the presentation name, I can't figure out
if another researcher has been copied or not, without asking.

If I can't read the characters in the local part of the e-mail
address, there's even less chance I'll be able to figure out who's
been copied (and usually, figuring out who hasn't been copied is more
interesting, in my limited experience) - I know I'm going to regret
not reading Korean when I'm at IETF 59, but today, I don't read
Korean. Game over..

I agree with Dave in the general case ("the goal is to go beyond
today's Internet"), but am wondering if that also requires us to go
beyond today's language capability when we start leaking these
addresses between enclaves. I am sensitive to the comments expressed
in this thread, that a heckuva lot of people have to learn two sets of
glyphs, and I'm not one of them - I just don't see how we do it any
other way.

Oh, yeah - the other thing was, discussion about leaking out of a
Mongolian enclave is interesting, but leaking between two non-Latin
enclaves is where the rubber meets the road, and I've worked with too
many smart people from Asia/Pacific and from the Middle East to
believe that we wouldn't have two non-Latin enclaves who would be
collaborating about fifteen minutes after the second enclave starts
up...

Spencer

p.s. Because I'm not a Genius of Internationalization, I apologize in
advance for using country names as if they were character sets - hope
my comment is still somewhat clear.

p.p.s. If we DO discover life on Mars, I'm willing to change my mind.
I know my stepson would love to have an excuse to learn
Martian/Klingon/etc....
James Seng
2003-10-29 23:26:28 UTC
Permalink
Crispin,

You need to get out of the US (or Washington) more often.

-James Seng

> I am not convinced that it is possible to use a computer on the Internet
> anywhere in the world without at least a basic acquaintance with Latin
> script.
>
> I do not believe many individuals (other than primary school children) are
> literate in their native language but are completely illiterate in Latin
> script. This does not mean "being able to read or write the English
> language"; rather, this simply means knowing the Latin script alphabet.
>
> Put another way, individuals who are completely illiterate in Latin script
> are also likely to be illiterate in their native language script as well.
>
> No other script on the planet has such international recognition.
>
> There is undoubtedly a *preference* for one's native script; and that
> preference should be respected as much as possible.
>
> -- Mark --
>
> http://staff.washington.edu/mrc
> Science does not emerge from voting, party politics, or public debate.
> Si vis pacem, para bellum.
Stephane Bortzmeyer
2003-10-28 08:50:23 UTC
Permalink
On Mon, Oct 27, 2003 at 05:39:32PM -0800,
Mark Davis <***@jtcsv.com> wrote
a message of 76 lines which said:

> We should remember that for a great many people in the world, Latin letters are
> quite unnatural; it'd be a bit like if we had to use Greek letters in all email
> addresses.

It would be a bit like if we had to use Greek letters in mathematics
:-)
Mark Davis
2003-10-28 14:54:45 UTC
Permalink
Sigh. I used "Greek" in an analogy because I was hoping that some of the Latin-only folks out there would at least recognize the name of one other script. But how comfortable would you really be, if يُُ وِرِ فُرسِد تُ رَِد َن ِنتِرِلي دِففِرِنت سكرِٟت، سِمٟلي تُ ُندِرستَند ِمَِل َددرِسسِس، َند َ سكرِٟت تهَت كُُلد ُنلي َٟرتَِللي رِفلِكت تهِ رَِل سِٟللِنگ ُف يُُر وُردس?

Mark
__________________________________


V***@vt.edu
2003-10-28 20:07:19 UTC
Permalink
On Mon, 27 Oct 2003 17:39:32 PST, Mark Davis said:

> email addresses. Mr. Tanaka can have one with Latin letters and one with
> Japanese (e.g. ムルク@カク.ワシングトン.゚デゥ).

This gets interesting in the context of a "reply all".

Apologies for breaking the UTF-8 in the quote, but it's illustrative - if the
breakage had been in the To/Cc lines, things would have broken even worse
unless whatever scheme we end up using is ASCII-transparent.
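
To make the "ASCII-transparent" point concrete, here is a rough sketch,
not anything taken from the drafts. It assumes a hypothetical legacy hop
that clears the high bit of every octet, and it uses Python's built-in
"punycode" codec purely as a stand-in for whatever ASCII-compatible
encoding might be chosen. The local part is a made-up example.

```python
# Hypothetical local part, used only for illustration.
local_part = "ムルク"


def through_7bit_hop(data: bytes) -> bytes:
    """Simulate an assumed legacy hop that clears the high bit of each octet."""
    return bytes(b & 0x7F for b in data)


# Raw UTF-8 on the wire: every octet of these characters has the high bit set.
utf8_wire = local_part.encode("utf-8")

# An ASCII-compatible encoding (Punycode here, as a stand-in): ASCII octets only.
ace_wire = local_part.encode("punycode")

# Compare what arrives after the 7-bit hop with what was sent.
utf8_survives = through_7bit_hop(utf8_wire) == utf8_wire
ace_survives = through_7bit_hop(ace_wire) == ace_wire

# The ACE form can be decoded back to the original characters at the far end.
ace_round_trip = ace_wire.decode("punycode")

print("UTF-8 survives 7-bit hop:", utf8_survives)
print("ACE survives 7-bit hop:  ", ace_survives)
print("ACE round trip matches:  ", ace_round_trip == local_part)
```

The raw UTF-8 form is damaged by such a hop while the ASCII-only form
passes through unchanged, which is the property the To/Cc lines would
depend on.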
Madan Ganesh Velayudham
2003-10-28 14:43:42 UTC
Permalink
> I'm curious: why do you think that everyone would be
> satisfied with Latin characters only, and no non-Latin characters?
>
> Mark
> __________________________________
> http://www.macchiato.com
> ► शिष्यादिच्छेत्पराजयम् ◄

Yes, I also agree. Especially in India, we have more than 10 Languages ( Hindi, Tamil, Telugu,
John C Klensin
2003-10-28 23:40:03 UTC
Permalink
--On Monday, October 27, 2003 11:10 -0800 Mark Crispin
<***@CAC.Washington.EDU> wrote:

> On Mon, 27 Oct 2003, Keith Moore wrote:
>
> >> Thanks for taking a stab at a problem statement. I'd like to drill down
> >> on this just a bit.
> >> What is the source of the "growing need"? Is it:
> >> [snip]
>
>
> I agree that this needs to be stated, but someone other than
> me will have to do it.
>
> I believe that the primary push for this functionality comes
> from regions which use Latin alphabetics with diacriticals;
> and that most individuals in regions which do not use Latin
> script accept the use of Latin script for multinational
> interchange. In many regions where Latin diacriticals are
> used, there is no acceptable transform of a surname to a form
> that does not use diacriticals. Simply omitting the
> diacritical causes (at least to the inhabitants of those
> regions) a misspelling.
>...

Actually, unlike the original push for internationalization of
email message bodies, and some of the push for IDNs, most of the
push I'm seeing for this is coming from folks with distinctly
non-Latin (i.e., not Cyrillic or Greek either) scripts... e.g.,
east Asia, middle east, etc.

I can't speak for the motivations of the others who have thought
and written about the problem.

So these are real "different characters" issues, not the
complexities of dealing with diacriticals on Latin letters and
what their omission might mean.

john
Arnt Gulbrandsen
2003-10-29 14:14:00 UTC
Permalink
(I stripped the recipient list, as requested.)

John C Klensin writes:
> Actually, unlike the original push for internationalization of email
> message bodies, and some of the push for IDNs, most of the push I'm
> seeing for this is coming from folks with distinctly non-Latin
> (i.e., not Cyrillic or Greek either) scripts... e.g., east Asia,
> middle east, etc.

Do you have any idea why so few are here? It's been bothering me (more
about ietf-822 than imaa).

--Arnt
John C Klensin
2003-10-28 16:33:28 UTC
Permalink
Everyone,

Either for general efficiency or just to do me a favor, can we
please pick one list --I'd recommend IMAA unless Paul objects--
and move these discussions to it only. I'd like to participate
(after all, it is my draft and BOF request that set off these
two threads), but am in an environment that is hostile to my
being able to read email in a leisurely way -- the cross-postings are
making the volume look larger than it is, and I don't have time
to sift through and organize it.

Now, an observation or two.

Keith, please read draft-klensin-emailaddr-i18n-01.txt -- it contains
a fairly extensive treatment of the issue you identify below.
It also explicitly discusses the tradeoffs along the spectrum
from easy global interoperability (at both the protocol and the
user interface/perception level) to full, culturally-appropriate
and optimized, localization. Short answer is "can't have it
both ways", but that is a no-brainer. I don't know that my
analysis is any better than yours, or where you would eventually
end up, but, if we can start from a common base and terminology
and then, as needed, argue about it, we will, I think, save a
lot of time.

It also explores the case beyond the one you and Mark are
discussing -- what happens if one decides to start tampering
with the "@" in mail or those nasty ASCII slashes, etc., in
URIs: if one is to go all the way to significantly non-Roman
scripts, those need to go too... or, at least, we need to
explore whether that is sensible and plausible.

There are two additional issues that I should have written about
in the draft and didn't.

(1) Another advantage of "just" using UTF-8 in an appropriately
negotiated, controlled, and constrained environment is that any
idiosyncrasies and coding difficulties are Unicode
idiosyncrasies and coding difficulties. If we decide to use a
specialized coding designed for email local-parts (and, fwiw, I
think Adam's coding solution is brilliant... I'm ultimately just
unhappy with the problem definition to which it responds), then
we have to deal with both its idiosyncrasies _and_ those of
Unicode. Strikes me as a bad idea -- better to just blame
"them" :-)

(2) One might imagine using the machinery outlined in that
draft to transport mail across the network, and then, if needed,
use IMAA encoding to push the message into the mail store, make
it available for IMAP and POP, etc. Not an ideal situation, but
that would clearly put that coding into the category of a
transition strategy that we could incrementally retire. By
contrast, once we start moving tricky encodings across the
network as an alternative to a transport-based solution, every
realistic scenario I can think of says that we are stuck with
them forever. That is, I think, more or less one of Mark's
arguments, but with a slightly different twist.

One final observation for now...

Our success record in not requiring email addresses to be typed
in, usually associated with some version of "The Directory", has
been abysmal. Similarly, if users never had to look at URLs
(which was the intent) we would almost certainly not be having
these arguments about domain names and their formats -- the
"protocol element" argument would fly, and we'd all be working
on internationalization at a less constrained level of
abstraction. But the pigs don't seem to be circling at
altitude, at least here in Carthage.

john


--On Tuesday, October 28, 2003 10:27 -0500 Keith Moore
<***@cs.utk.edu> wrote:

>> It is currently impossible to use the Internet without
>> knowing the Latin script. However, the goal of most
>> well-designed client software and operating systems is to
>> permit the user to work entirely within their native
>> language, with a fully localized system. This is reaching to
>> India and other countries; Microsoft has introduced fully
>> localized versions of Indic Windows just recently, and Linux
>> vendors are hard at work to produce fully localized versions
>> of their software.
>>
>> Email and Web addresses are the big remaining holdouts for
>> most people. People should not be forced to use a script that
>> they are unfamiliar with, just to use email addresses and
>> sites in their own countries. Even if they are familiar with
>> the Latin script, it is very often a very bad match for their
>> languages, making it very difficult to figure out how native
>> words would be spelled in it.
>
> and yet, as we've seen time and time again, local use of
> nonportable addresses can cause major problems for the net as
> a whole. we saw this in earlier days of email with the
> admixture of bitnet/rscs/nje, uucp, decnet, x.400, and Internet
> addresses. we've seen it in the IP space with RFC 1918
> addresses.
>
> in some ways internationalizing email addresses is a much
> harder problem than internationalizing IDNs, because no other
> application is as dependent on having human beings actually
> use addresses as email. (yes, people do sometimes type in
> URLs, but not nearly as often as they click on links. and
> there are apps which require humans to type in domain names,
> but for most of them this only happens at configuration time.)
>
> one way to approach this problem might be to make email less
> dependent on having addresses typed in.
>
> Keith
M***@nokia.com
2003-10-28 16:54:21 UTC
Permalink
Excuse me, but could you please constrain this
conversation to fewer than 9 (nine!) e-mail lists?

The BOF description lists ***@imc.org as the
discussion list, but this discussion is being
cc:ed to ietf-***@imc.org. I'd suggest that you
move this discussion to whichever of those lists
is actually correct.

Margaret
Paul Hoffman / IMC
2003-10-29 01:41:55 UTC
Permalink
At 11:54 AM -0500 10/28/03, ***@nokia.com wrote:
>The BOF description lists ***@imc.org as the
>discussion list, but this discussion is being
>cc:ed to ietf-***@imc.org. I'd suggest that you
>move this discussion to whichever of those lists
>is actually correct.

It is ietf-***@imc.org, although because Patrik sent out the wrong
address, I have made sure that both addresses work. An archive of the
list, and links to the current versions of the drafts, can be found
at <http://www.imc.org/ietf-imaa/>.

No more messages to all lists: that's what the IMAA list is for.

--Paul Hoffman, Director
--Internet Mail Consortium
Michel Suignard
2003-10-29 18:48:20 UTC
Permalink
Could you all read the Unicode spec as pointed out by Mark instead of trying
to recreate it (http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf,
sections 2.4 to 2.6)? There is no such term as a Unicode
native representation. Abstract characters use a code point that is part of a
set called the codespace and can be referred to as encoded characters within
that context (paraphrasing text in 2.4).

It just happens that one of the encoding forms defined by Unicode
(UTF-32) is a simpler mapping to and from the original Unicode
codespace.

Note also that Unicode favors three encoding forms: UTF-8, UTF-16 and
UTF-32. While ACE is another encoding as well, it does not have the same
software library support as any of the three above.
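
As a small illustration of the distinction between a code point and its
encoding forms (a sketch only; the sample character is an arbitrary
choice and the variable names are made up for this example), Python's
standard codecs show the same code point serialized three ways, with
UTF-32 being the one that maps directly to the code point value:

```python
# Arbitrary sample character: KATAKANA LETTER KA.
ch = "カ"

# The code point is an integer in the Unicode codespace (0 .. 0x10FFFF).
code_point = ord(ch)

# The three encoding forms, serialized big-endian without a byte order mark.
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
utf32 = ch.encode("utf-32-be")

# UTF-32 is simply the code point written as a 32-bit integer.
utf32_is_direct = int.from_bytes(utf32, "big") == code_point

# The largest code point needs 21 bits, so it fits in 24 or 32 bits.
max_code_point_bits = (0x10FFFF).bit_length()

print(f"U+{code_point:04X}")
print("UTF-8 :", utf8.hex(), len(utf8), "octets")
print("UTF-16:", utf16.hex(), len(utf16), "octets")
print("UTF-32:", utf32.hex(), len(utf32), "octets")
print("UTF-32 equals the code point:", utf32_is_direct)
print("Bits needed for U+10FFFF:", max_code_point_bits)
```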

And now if we could get back to the subject instead of debating Unicode
principles and terminology, which belong on another list.

Michel Suignard

-----Original Message-----
From: owner-ietf-***@mail.imc.org [mailto:owner-ietf-***@mail.imc.org]
On Behalf Of Dave Crocker

John,

JC> That's only true if you take the position that there are no
JC> native/direct/raw encodings of Unicode.

Oh? You mean that Unicode does not fit directly -- ie, with no special
encoding rules -- into 32 bits, or 24 bits, or somesuch.

You mean that Unicode does not need special rules to stuff it into 8
bits, and another set of rules to stuff it into 16 bits?

Because if the answer is that yes it does -- and the answer _is_ yes it
does -- then my point stands.

That's the difference between native representation, versus "encoding".

d/
--
Dave Crocker <dcrocker-at-brandenburg-dot-com>
Brandenburg InternetWorking <www.brandenburg.com>
Sunnyvale, CA USA <tel:+1.408.246.8253>
Dave Crocker
2003-10-29 20:18:36 UTC
Permalink
Michel,


MS> Could you all read the Unicode spec as pointed out by Mark instead of trying
MS> to recreate it (http://www.unicode.org/versions/Unicode4.0.0/ch02.pdf,
MS> sections 2.4 to 2.6)? There is no such term as a Unicode
MS> native representation.

Unicode documentation is not the only place that defines and discusses
computer science constructs involving data representation and encoding. For
example, since the work is being done in the IETF I suggest you look at RFC
1521. It has a nice review of the difference between native form and encoded
form. Obviously, that text is tailored to MIME, but it communicates the
general concepts adequately.

So, I apologize for trying to keep the discussion generic. Silly me. I thought
we were having an architectural discussion, rather than debating precise
Unicode terminology. In fact I was trying to be careful to express the issues
using generic computer science terminology, just to avoid a linguistic pissing
contest.

So I'm sorry that it has proven difficult for some folks to translate the
typical term "native representation" into the Unicode term "abstract
character" that is used at the beginning of section 2.4 that you cite.


MS> Abstract characters use a code point that is part of a
MS> set called the codespace and can be referred to as encoded characters within
MS> that context (paraphrasing text in 2.4).

The third paragraph of 2.4 cites the range of characters, in base 16. Note
that that range consumes 24 bits. That's the native representation I was
referring to. (When I wrote my original note, I was not sure that 24-bits was
the right number and it was not important that I get it exactly right.)


MS> Note also that Unicode favors three encoding forms: UTF-8, UTF-16 and
MS> UTF-32. While ACE is another encoding as well, it does not have the same
MS> software library support as any of the three above.

I used the word "efficient" to focus on the relevant difference between ACE
and UTF-8. My point was bit-encoding efficiency. You want to focus on
software development ease. That's fine too. However neither of these has to
do with inherent goodness or purity. They are all encodings, so that debating
one versus another is only debating trade-offs.
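
For anyone who wants to look at the trade-off in numbers, a rough sketch
follows. It assumes Python's built-in "punycode" codec as a stand-in for
an ACE and uses a made-up sample string; the relative sizes depend
entirely on the input, so this shows how to measure rather than which
encoding wins.

```python
# Made-up sample local part, for illustration only.
sample = "ムルク"

# Encode the same characters three ways.
utf8_bytes = sample.encode("utf-8")
utf32_bytes = sample.encode("utf-32-be")
ace_bytes = sample.encode("punycode")  # stand-in for an ASCII-compatible encoding

# Octet counts for each encoding of the same three characters.
sizes = {
    "utf-8": len(utf8_bytes),
    "utf-32": len(utf32_bytes),
    "punycode": len(ace_bytes),
}

# Unlike the UTF forms, the ACE output stays within 7-bit ASCII.
ace_is_ascii = all(b < 0x80 for b in ace_bytes)

for name, size in sizes.items():
    print(f"{name:9s} {size} octets")
print("ACE output is pure ASCII:", ace_is_ascii)
```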


MS> And now if we could get back to the subject instead of debating Unicode
MS> principles and terminology, which belong on another list.

Ahh. I see that you entirely missed the point I was raising.

Here's a reminder:

JCK>> If one is going to consider internationalization of email
JCK>> addresses in a way that permits them to move through the mail
JCK>> protocol in some traditional Unicode encoding (e.g., UTF-8),
JCK>> then

DC> ...then we get to repeat the mime/esmtp debates all over again. After all,
DC> why should we even try to learn anything from 10 years of experience. (And
DC> no, John, I'm not directing my comment at you.)
DC>
DC> To be specific: I am not suggesting that pure utf-8 is a bad goal -- although
DC> the fact that utf-8 is, itself, a condensed representation of unicode should
DC> strike folks as just a tad ironic, with respect to these discussions.
DC>
DC> Rather, I suggest that it be a _separate_ goal from near-term support of an
DC> edge-only enhancement for Unicode support, the same as we did for mime and
DC> IDN.

Please note that John was careful to say "traditional" and that I was careful
to respect that goal.

So I heartily agree with your suggestion that we get back to the subject.

The subject was about pursuing an infrastructure-based solution separately
from an end-point solution.

d/
--
Dave Crocker <dcrocker-at-brandenburg-dot-com>
Brandenburg InternetWorking <www.brandenburg.com>
Sunnyvale, CA USA <tel:+1.408.246.8253>
Michel Suignard
2003-10-29 23:18:45 UTC
Permalink
JS>I have seen John's and Paul's proposals but I have not seen Michel's. Is there a draft that I can read
JS>up on?

I don't have a proposal. I am listed there as a co-editor of the IRI spec which has a peripheral impact on the IEA (such as extending the URI mailto scheme). Like many I am listening with interest to the points made for or against the two proposals.

Michel Suignard
vinton g. cerf
2003-10-30 07:08:07 UTC
Permalink
Valdis,

I think your example underscores the difference between localization
of an interface to make use of local language/script and globalization
that permits interworking among all parties, independent of their local
language and script.

the confusion between these two (familiar user interfaces vs ability
to communicate with everyone) makes for a good deal of debate.

I hope we can keep in mind both of these desirable aspects but most
especially our ability to preserve the global communication needed.

The dialing of telephone numbers relies on the ability of every
party to enter digits while the system does not care much about
what language we speak. One might think of Latin-A as the Internet
equivalent of digits - however, I don't know whether it is a valid
analogy.

vint

At 11:44 PM 10/29/2003 -0500, ***@vt.edu wrote:


>On Thu, 30 Oct 2003 07:32:46 +0800, James Seng said:
>> to your opinion but please do so in other place, and not here. The group
>> is supposed to work on Internationalization of Email address
>> (identifiers), not debate whether we need it or not.
>
>Any group that addresses "how" and "for which contexts" without having
>a good grasp on "why" is inventing solutions in search of problems.
>
>Mark actually *does* have a *very* valid point - on today's internet, if you
>cannot recognize and enter the glyphs for at least c, h, m, o, p, t, w, ':',
>'@', '.', and '/' you are effectively unable to use the internet. It may not
>make any sense to you, but you can at least recognize and enter them (note that
>this same issue was one of the biggest arguments against the .biz domain).
>
>So.. having established that if they're currently using the internet, they can at least
>recognize and enter the Latin glyphs, this raises a number of *very* important questions:
>
>1) Is there reason to *not* expect said knowledge of Latin glyphs in the future?
>If not, what user group(s) will be literate but not know the Latin charset?
>
>2) Is a "community" approach acceptable? Is usage of Han OK as long as
>you're interacting with other Han users, or are the issues of leakage too high?
>
>3) What *are* the issues of leakage? What am I expected to see if I get some Han,
>and how am I to interact with it? Equally important, what does the Han user do
>with my leaked Latin-A characters?
>
>4) Here's a somewhat related issue - looking at the U0100.pdf from www.unicode.org,
>I had to enlarge page 2 quite a bit before I could see the difference between the glyphs
>at 0114/0115 (capital/small e with breve) and 011A/011B (capital/small e with caron).
>And I know my way around most of the Latin characters - our hypothetical Han
>user is going to be swinging in the breeze if he gets a business card with e-caron on it.
>
>And if you can't safely put e-caron on a business card, why are we bothering?

Vint Cerf
SVP Technology Strategy
MCI
22001 Loudoun County Parkway, F2-4115
Ashburn, VA 20147
703 886 1690 (v806 1690)
703 886 0047 fax
***@mci.com
www.mci.com/cerfsup