Issues surrounding IDN validation and URLs in general

Fraser Nevett

Feb 27, 2010, 3:02:33 PM
to Django developers
Validation of IDN (Internationalized Domain Names) was added in
[12474], but I noticed that the verify_exists option doesn't work when
you use an IDN. This is caused by urllib2 not supporting IDN and the
validation code using the original unicode version of the URL when
testing for existence.

The problem is within URLValidator and can be fixed relatively easily
by using the IDNA-encoded version of the domain when testing that the
URL exists. I have a patch worked up for this and will raise a ticket
shortly.

However, I would like to open a discussion on Django's handling of non-
ASCII URLs...

Should the clean method of forms.URLField return the unicode value as
entered, or an IDNA-encoded URL?

What happens if a non-ASCII character is used in another part of the
URL? For example, opening http://en.wikipedia.org/wiki/Café in a
browser will work, but Django will not validate this as a legal URL
(because it isn't). In reality, the browser is converting the URL to
http://en.wikipedia.org/wiki/Caf%C3%A9 and requesting that.

Perhaps Django should permit such URLs and perform the same encoding
as a browser. In that case, should the clean method of forms.URLField
return the unicode value as entered, or the percent-encoded UTF-8
version of the URL?
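
As a rough illustration of the browser behaviour described above, here is a small Python 3 sketch (the thread itself predates Python 3, so the use of `urllib.parse` is an assumption made for the sake of a runnable example). It also shows one wrinkle any such encoding step would probably have to handle: quoting a path that is already percent-encoded escapes the `%` signs again unless `%` is treated as safe.

```python
from urllib.parse import quote, unquote

raw_path = "/wiki/Café"

# quote() encodes the text as UTF-8 and percent-escapes the non-ASCII bytes.
encoded = quote(raw_path)
print(encoded)  # /wiki/Caf%C3%A9

# Decoding restores the original text.
assert unquote(encoded) == raw_path

# Quoting an already-encoded path escapes the "%" signs a second time...
double_encoded = quote(encoded)
print(double_encoded)  # /wiki/Caf%25C3%25A9

# ...unless "%" is declared safe, which leaves existing escapes alone.
idempotent = quote(encoded, safe="/%")
print(idempotent)  # /wiki/Caf%C3%A9
```

Treating `%` as safe is only one possible policy; it cannot tell a literal percent sign apart from an existing escape sequence.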

I guess an approach to these URL "complexities" would be to introduce
a utility function within django.utils.http such as:

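    import urlparse
    from django.utils.http import urlquote

    def safe_url(url):
        # Split the URL so each component can be encoded on its own.
        scheme, netloc, path, query, fragment = urlparse.urlsplit(url)
        netloc = netloc.encode('idna')  # IDNA-encode the domain
        path = urlquote(path)  # percent-encode the path as UTF-8
        # TODO -- should query and fragment be escaped?
        return urlparse.urlunsplit((scheme, netloc, path, query, fragment))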

This could then be used by URLValidator, but also anyone who needs to
deal with non-ASCII URLs.
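
For readers who want to try the idea outside Django, the following is a minimal standalone sketch of the same approach using only the Python 3 standard library. It is an illustration under assumptions, not the patch attached to any ticket: the handling of userinfo and port, and the choice to keep `%` safe in the path, are additions that the proposal above does not specify, and bracketed IPv6 hosts are not handled.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def safe_url(url):
    """Return an ASCII-only equivalent of a Unicode URL (illustrative only)."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    # Split off optional userinfo and port so only the host is IDNA-encoded.
    userinfo, at, hostport = netloc.rpartition("@")
    host, colon, port = hostport.partition(":")
    host = host.encode("idna").decode("ascii")
    netloc = userinfo + at + host + colon + port
    # Percent-encode the path as UTF-8; "%" stays safe to avoid double-encoding.
    path = quote(path, safe="/%")
    # Query and fragment pass through untouched, mirroring the open TODO.
    return urlunsplit((scheme, netloc, path, query, fragment))


print(safe_url("http://bücher.example/wiki/Café"))
# http://xn--bcher-kva.example/wiki/Caf%C3%A9
```

The query and fragment are deliberately left alone here because the thread leaves that question open.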

It is probably overkill and overcomplicating things, but I had also
thought about suggesting a "URL" object that would be returned by
URLFields (both forms and models). This could handle unicode URLs and
be responsible for encoding/decoding depending where they were used.

What do people think?

Fraser

P.S. I noticed that #12988 has just been opened, which also relates to
IDN validation.

Ulrich Petri

Feb 27, 2010, 4:26:53 PM
to django-d...@googlegroups.com

On 27.02.2010 at 21:02, Fraser Nevett wrote:

> Validation of IDN (Internationalized Domain Names) was added in
> [12474], but I noticed that the verify_exists option doesn't work when
> you use an IDN. This is caused by urllib2 not supporting IDN and the
> validation code using the original unicode version of the URL when
> testing for existence.
>

This seems to be true only for Python 2.4. In 2.5 and above, urlopen
will happily accept IDNs.

[...]

> It is probably overkill and overcomplicating things, but I had also
> thought about suggesting a "URL" object that would be returned by
> URLFields (both forms and models). This could handle unicode URLs and
> be responsible for encoding/decoding depending where they were used.
>
> What do people think?

Whichever solution is chosen, I think it's important to note RFC 3490
section 6.1, which states that:

"""Applications MAY allow input and display of ACE labels, but are not
encouraged to do so except as an interface for special purposes, [...]
ACE encoding is opaque and ugly, and should thus only be exposed to
users who absolutely need it."""
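
One way to read that recommendation is to keep the ACE form for the wire and convert back to Unicode only when showing a host name to users. A small Python 3 sketch of that split follows; the two helper names are hypothetical and exist only for this illustration.

```python
def to_wire_host(unicode_host):
    """Encode a Unicode host name to its ACE (xn--) form for network use."""
    return unicode_host.encode("idna").decode("ascii")


def to_display_host(ace_host):
    """Decode an ACE host name back to Unicode for display to users."""
    return ace_host.encode("ascii").decode("idna")


wire = to_wire_host("bücher.example")
print(wire)                   # xn--bcher-kva.example
print(to_display_host(wire))  # bücher.example
```

Both helpers rely on Python's built-in `idna` codec, which implements the older IDNA 2003 rules.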

Ulrich

Fraser Nevett

Feb 27, 2010, 7:22:59 PM
to Django developers
On Feb 27, 8:02 pm, Fraser Nevett <fras...@gmail.com> wrote:
[...]

> I have a patch worked up for this and will raise a ticket shortly.

Ticket created and patch uploaded...

http://code.djangoproject.com/ticket/12989

Fraser

Nikolay Panov

Feb 28, 2010, 4:26:22 AM
to django-d...@googlegroups.com
>> Validation of IDN (Internationalized Domain Names) was added in
>> [12474], but I noticed that the verify_exists option doesn't work when
>> you use an IDN. This is caused by urllib2 not supporting IDN and the
>> validation code using the original unicode version of the URL when
>> testing for existence.
> This seems only to be true for python 2.4. In 2.5 and above urlopen will
> happily accept IDNs.

Are you sure?
I have just tested the following:
$ python2.6
Python 2.6.4+ (r264:75706, Feb 16 2010, 00:09:58)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2
>>> urllib2.urlopen(u'http://пример.испытание/')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib/python2.6/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.6/urllib2.py", line 1161, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.6/urllib2.py", line 1133, in do_open
    h.request(req.get_method(), req.get_selector(), req.data, headers)
  File "/usr/lib/python2.6/httplib.py", line 910, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python2.6/httplib.py", line 947, in _send_request
    self.endheaders()
  File "/usr/lib/python2.6/httplib.py", line 904, in endheaders
    self._send_output()
  File "/usr/lib/python2.6/httplib.py", line 776, in _send_output
    self.send(msg)
  File "/usr/lib/python2.6/httplib.py", line 755, in send
    self.sock.sendall(str)
  File "<string>", line 1, in sendall
UnicodeEncodeError: 'ascii' codec can't encode characters in position 49-54: ordinal not in range(128)
>>> urllib2.urlopen('http://пример.испытание/')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib/python2.6/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.6/urllib2.py", line 1161, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.6/urllib2.py", line 1136, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno -2] Name or service not known>

Same for python2.5.

PS:
I had to do a little googling on this subject and found urlnorm.py
by Sam Ruby. At http://code.google.com/p/url-normalize/ I have
created a fork of this module (PEP 8-ified and with IDN support
added).
It would be wonderful if django.utils.http contained something like
this for URI normalization.

Ulrich Petri

Feb 28, 2010, 7:40:23 AM
to django-d...@googlegroups.com

>> This seems only to be true for python 2.4. In 2.5 and above urlopen
>> will happily accept IDNs.
>
> Are you sure?
Yes:

~/ python2.5
Python 2.5.1 (r251:54863, Feb 6 2009, 19:02:12)
[GCC 4.0.1 (Apple Inc. build 5465)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib2 import urlopen
>>> urlopen('http://пример.испытание/')
<addinfourl at 5667496 whose fp = <socket._fileobject object at 0x559eb0>>

~/ python2.6
Python 2.6.4 (r264:75821M, Oct 27 2009, 19:48:32)
[GCC 4.0.1 (Apple Inc. build 5493)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib2 import urlopen
>>> urlopen('http://пример.испытание/')
<addinfourl at 7383360 whose fp = <socket._fileobject object at 0x70b330>>


> I have just tested the following:
> $ python2.6
> Python 2.6.4+ (r264:75706, Feb 16 2010, 00:09:58)
> [GCC 4.4.3] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> >>> import urllib2
> >>> urllib2.urlopen(u'http://пример.испытание/')
> [...]
> UnicodeEncodeError: 'ascii' codec can't encode characters in position
> 49-54: ordinal not in range(128)

This isn't an IDN problem; it's a "normal" Python unicode string
handling problem. What is the locale on your system?

Ulrich

Nikolay Panov

Mar 2, 2010, 1:03:49 AM
to django-d...@googlegroups.com
2010/2/28 Ulrich Petri <u...@ulo.pe>:

> ~/ python2.6
> Python 2.6.4 (r264:75821M, Oct 27 2009, 19:48:32)
> [GCC 4.0.1 (Apple Inc. build 5493)] on darwin
> Type "help", "copyright", "credits" or "license" for more information.
> >>> from urllib2 import urlopen
> >>> urlopen('http://пример.испытание/')
> <addinfourl at 7383360 whose fp = <socket._fileobject object at 0x70b330>>

Yeah, it seems to be working on Mac OS, but why is it not working on my
Linux system?

$ python2.6
Python 2.6.4+ (r264:75706, Feb 16 2010, 00:09:58)
[GCC 4.4.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.

>>> from urllib2 import urlopen
>>> urlopen('http://пример.испытание/')

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.6/urllib2.py", line 126, in urlopen
    return _opener.open(url, data, timeout)
  File "/usr/lib/python2.6/urllib2.py", line 391, in open
    response = self._open(req, data)
  File "/usr/lib/python2.6/urllib2.py", line 409, in _open
    '_open', req)
  File "/usr/lib/python2.6/urllib2.py", line 369, in _call_chain
    result = func(*args)
  File "/usr/lib/python2.6/urllib2.py", line 1161, in http_open
    return self.do_open(httplib.HTTPConnection, req)
  File "/usr/lib/python2.6/urllib2.py", line 1136, in do_open
    raise URLError(err)
urllib2.URLError: <urlopen error [Errno -2] Name or service not known>

> This isn't a IDN problem it's a "normal" python unicode string handling
> problem. What is the locale on your System?

$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Have a nice day,
Nikolay.
