RFC 3490, start the discussion

Manfred Stienstra

Mar 15, 2007, 1:09:35 PM
to rfc-club
Good day people,

You've had three weeks to gather your notes and drawings, so please start
discussing (: Although Evan did secretly start without us.

Manfred

Manfred Stienstra

Mar 29, 2007, 2:57:07 PM
to rfc-club
Hi people,

The RFC-club all started when some people on the Freenode IRC network
in #camping (camping programming, not camping camping) were discussing
whether non-ASCII characters were allowed in URLs. After some frantic
Google searching it became clear we didn't know enough about standards
and something had to be done. Here at the RFC-club we read RFCs to get
to know standards better so these discussions don't take forever. Of
course I have to note that RFC 3490 does _not_ define a way to encode
Unicode codepoints in a URL, but only in a domain name. Anyway, let's
get it on, metaphorically speaking.
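
To make the domain-name-only point concrete, here is a minimal sketch using
Python's built-in "idna" codec, which implements the RFC 3490 ToASCII and
ToUnicode operations. The host name "bücher.example" is made up for
illustration, and the percent-encoding of the path at the end is a separate
mechanism from other URI specs, not something RFC 3490 defines.

```python
from urllib.parse import quote

# Hypothetical internationalized host name ("bücher.example").
host = "b\u00fccher.example"

# ToASCII: each non-ASCII label becomes an ACE label starting with "xn--".
ace = host.encode("idna")
print(ace)                  # b'xn--bcher-kva.example'

# ToUnicode: the ACE form decodes back to the original label.
print(ace.decode("idna"))   # bücher.example

# The rest of a URL is not covered by IDNA. A non-ASCII path is usually
# percent-encoded as UTF-8 bytes instead (shown only for contrast).
path = quote("/b\u00fccher")
print(path)                 # /b%C3%BCcher
```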

I really like how the third paragraph of section 1 of the RFC states: "A
great deal of the discussion of IDN solutions has focused on
transition issues and how IDN will work in a world where not all of
the components have been updated." This struck me because I read last
month that ICANN did a test run with IDN and promised that "The
work to introduce these character sets should be finished by
2008." [1] The actual tests were carried out in October 2006, but for
some reason they waited some time to tell the world. I wonder how much
'real world testing' the standard has had in Asian countries...

One of the biggest known problems with IDN is 'phishing' [2], and I was a
little bit surprised that the RFC only touches on this topic very
lightly: "The introduction of the larger repertoire of characters
potentially makes the set of misspellings larger [...]". Not only is
this a big opportunity for criminals, it's also very annoying. Even
though the NAMEPREP step case-folds and normalizes labels and prohibits
some characters, it's still easy to confuse characters. Take for
instance the CYRILLIC CAPITAL LETTER A U+0410 and the LATIN CAPITAL
LETTER A U+0041. Depending on the font used these may even be presented
to the user with the same glyph. Because of this it becomes virtually
impossible to copy a URL you saw on a billboard to your browser without
trial and error or an internet search.
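
A small sketch of the look-alike problem, again with Python's built-in
"idna" codec. The names "apple.example" and its Cyrillic twin are
hypothetical; the point is only that two labels that can render identically
are different code points and end up as different ACE labels.

```python
import unicodedata

latin = "apple.example"
mixed = "\u0430pple.example"   # first letter is Cyrillic, the rest is Latin

print(unicodedata.name(latin[0]))   # LATIN SMALL LETTER A
print(unicodedata.name(mixed[0]))   # CYRILLIC SMALL LETTER A

# The all-ASCII name passes through unchanged; the mixed one gets an
# "xn--" ACE label, so the two are distinct names in the DNS.
latin_ace = latin.encode("idna")
mixed_ace = mixed.encode("idna")
print(latin_ace)
print(mixed_ace)

# Nameprep case-folds non-ASCII labels, so CYRILLIC CAPITAL LETTER A
# (U+0410) ends up as the same ACE label as the small letter (U+0430).
upper_ace = "\u0410pple.example".encode("idna")
print(upper_ace == mixed_ace)       # True
```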

Another large problem is Han Unification [3]. When you decode an ACE
label and don't know which language it was intended for, you never know
which glyphs to use to represent it: it could be hanzi, kanji or
hanja. Depending on the language you choose to represent the domain
name the meaning of the words can be very different, possibly
resulting in unwanted profanity or insults to the user.
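
As a rough illustration (Python's built-in "idna" codec, with an arbitrarily
chosen character): decoding an ACE label only gives you code points. U+76F4
is one of the ideographs that is commonly drawn differently in Chinese and
Japanese fonts, and nothing in the decoded label says which one was meant.

```python
import unicodedata

label = "\u76f4"                 # a single unified Han character
ace = label.encode("idna")       # ACE form, starts with "xn--"
decoded = ace.decode("idna")     # back to Unicode code points

# All we get back is the code point and its language-neutral name.
for ch in decoded:
    print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")
# U+76F4 CJK UNIFIED IDEOGRAPH-76F4
```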

Manfred

[1] http://news.bbc.co.uk/1/hi/technology/6441093.stm
[2] http://en.wikipedia.org/wiki/Phishing
[3] http://en.wikipedia.org/wiki/Han_unification

Thijs van der Vossen

Mar 29, 2007, 3:53:01 PM
to rfc-...@googlegroups.com
On Mar 29, 2007, at 20:57, Manfred Stienstra wrote:
> [...] For instance the CYRILLIC CAPITAL LETTER A U+0410 and the
> LATIN CAPITAL LETTER A U+0041. Depending on the font used these may
> even be presented to the user with the same glyph. Because of this
> it becomes virtually impossible to copy a URL you saw on a
> billboard to your browser without trial and error or internet search.

I disagree. If you're planning to put your URL on a billboard, you
can easily make sure you register both the URL with the CYRILLIC
CAPITAL LETTER A U+0410 and the URL with the LATIN CAPITAL LETTER A
U+0041 in it and redirect.

> Another large problem is Han Unification [3]. When you decode an
> ACE label and don't know in what language it was intended you never
> know which glyphs to use to represent it, it could be hanzi, kanji
> or hanja. Depending on the language you choose to represent the
> domain name the meaning of the words can be very different, possibly
> resulting in unwanted profanity or insults to the user.

This is really a language issue. You can get the same thing with pure
ASCII, for example http://www.fucking.at/

Kind regards,
Thijs

--
Fingertips - http://www.fngtps.com


Manfred Stienstra

Mar 29, 2007, 4:54:48 PM
to rfc-club
On Mar 29, 9:53 pm, Thijs van der Vossen <t.vandervos...@gmail.com>
wrote:

> This is really a language issue. You can get the same thing with pure
> ascii, for example http://www.fucking.at/

The difference is that a Chinese URL might be presented in a Japanese
font to a Japanese person, thus hinting that this is a Japanese
URL. For someone not 'in the know' this could be strange. You can
easily circumvent this by choosing a domain name carefully; I just
wanted to point it out.

Thijs van der Vossen

Mar 30, 2007, 3:17:42 AM
to rfc-...@googlegroups.com

Good point, hadn't thought about the font issue. A nice little thing
you could do is choose the correct font based on the country code
top-level domain, just like when you use the lang attribute in HTML [1].
This won't work for .com domains though.
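
A minimal sketch of that idea, assuming a hand-made table from ccTLD to a
language tag that a renderer could use to pick a font. The table
CCTLD_TO_LANG and the function lang_hint are hypothetical names, and the
mapping is deliberately incomplete.

```python
from typing import Optional

# Hypothetical mapping from country code TLD to a language tag,
# comparable to what you would put in an HTML lang attribute.
CCTLD_TO_LANG = {
    "cn": "zh-Hans",
    "tw": "zh-Hant",
    "jp": "ja",
    "kr": "ko",
}

def lang_hint(hostname: str) -> Optional[str]:
    """Return a language tag for the host's TLD, or None if unknown."""
    tld = hostname.rstrip(".").rsplit(".", 1)[-1].lower()
    return CCTLD_TO_LANG.get(tld)

print(lang_hint("example.jp"))    # ja
print(lang_hint("example.com"))   # None
```

The None for a generic TLD is exactly the .com case: there is no country to
derive a hint from.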

Kind regards,
Thijs

[1] http://en.wikipedia.org/wiki/Han_unification#Check_your_browser

Manfred Stienstra

Mar 30, 2007, 3:55:56 AM
to rfc-club
On Mar 29, 9:53 pm, Thijs van der Vossen <t.vandervos...@gmail.com>
wrote:
> I disagree, If you're planning to put your url on a billboard, you
> can easily make sure you register both the url with the CYRILLIC
> CAPITAL LETTER A U+0410 and the url with the LATIN CAPITAL LETTER A
> U+0041 in it and redirect.

Except that you need to register 2^n domain names where n is the
number of disputable characters, since each of them can independently
be either variant. On the other hand using the character set of the
country you're in would probably work most of the time.
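
A rough sketch of how fast the variants add up, assuming every disputable
character has exactly two look-alike forms (Latin and Cyrillic), which gives
2^n spellings for n such characters. The LOOKALIKES table and the variants
function are made up for illustration and cover only a few letters.

```python
from itertools import product

# Hypothetical table: Latin letter -> its possible forms (Latin, Cyrillic).
LOOKALIKES = {
    "a": "a\u0430",
    "e": "e\u0435",
    "o": "o\u043e",
    "p": "p\u0440",
}

def variants(label):
    """Return every spelling of label with look-alikes swapped in or out."""
    choices = [LOOKALIKES.get(ch, ch) for ch in label]
    return ["".join(combo) for combo in product(*choices)]

names = variants("apple")   # a, p, p and e are disputable; l is not
print(len(names))           # 16, i.e. 2^4
```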
