#36013: Inconsistent handling of IDNs in urlize and AdminURLFieldWidget
----------------------------------------+------------------------------
Reporter: Mike Edmunds | Owner: Mike Edmunds
Type: Bug | Status: assigned
Component: Utilities | Version: 5.1
Severity: Normal | Keywords: idna
Triage Stage: Unreviewed | Has patch: 1
Needs documentation: 0 | Needs tests: 0
Patch needs improvement: 0 | Easy pickings: 0
UI/UX: 0 |
----------------------------------------+------------------------------
django.utils.html.smart_urlquote() and Urlizer use obsolete IDNA 2003
encoding on some—but not all—international domain names (IDNs), leading to
inconsistent URLs, failure to urlize email addresses for some IDNs, and
other problems.
That code is used ''only'' by the urlize/urlizetrunc template filters and
the AdminURLFieldWidget. Django does not provide IDNA encoding for any
other non-ASCII URLs. This ticket proposes dropping those IDNA 2003
special cases, so that the browser can handle IDNs consistently for all
URLs coming from Django.
[IDNA 2003 was superseded by IDNA 2008 starting in ~2010. Browsers follow
the WHATWG URL Standard and implement UTS !#46, which builds on IDNA
2008.]
== Examples ==
Urlizer and smart_urlquote() apply IDNA encoding to some URLs, but use
percent-encoded UTF-8 for others:
{{{#!python
from django.template.defaultfilters import urlize
urlize("https://עִתוֹן.example.il")
# '<a href="https://xn--cdbk7fubl3c.example.il"
#  rel="nofollow">https://עִתוֹן.example.il</a>'
urlize("https://މިހާރު.example.mv")
# '<a href="https://%DE%89%DE%A8%DE%80%DE%A7%DE%83%DE%AA.example.mv"
#  rel="nofollow">https://މިހާރު.example.mv</a>'
from django.utils.html import smart_urlquote
smart_urlquote("https://עִתוֹן.example.il")
# 'https://xn--cdbk7fubl3c.example.il'
smart_urlquote("https://މިހާރު.example.mv")
# 'https://%DE%89%DE%A8%DE%80%DE%A7%DE%83%DE%AA.example.mv'
}}}
Urlizer linkifies email addresses in some IDNs, but rejects email
addresses in others:
{{{#!python
from django.template.defaultfilters import urlize
urlize("editor@עִתוֹן.example.il")
# '<a href="mailto:editor@xn--cdbk7fubl3c.example.il">editor@עִתוֹן.example.il</a>'
urlize("editor@މިހާރު.example.mv")
# 'editor@މިހާރު.example.mv'
}}}
Examples were run with Django 5.1.4. މިހާރު is the local name of a
Maldivian newspaper. It can be encoded in IDNA 2008, but was invalid under
IDNA 2003. עִתוֹן is Hebrew, and can be encoded with all IDNA versions.
Both use RTL scripts, but RTL support was improved in IDNA 2008.
Using obsolete IDNA 2003 encoding can cause other problems. For example,
it strips Unicode characters necessary for accurate rendering of some
scripts: ශ්‍රී (the ''Sri'' part of ''Sri Lanka'' in Sinhalese) is
corrupted to ශ්රී after passing through IDNA 2003, because the zero-width
joiner (U+200D) is dropped. (It's maybe still readable, but is sort of like
writing "mañana" as "man~ana".) IDNA 2008 addresses this.
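The ''Sri'' case can be checked with nothing but the standard library. This is a minimal sketch, assuming Python's built-in "idna" codec (the IDNA 2003 implementation that punycode() relies on); the variable names are only for illustration:

```python
# Sinhala "Sri" as code points:
# SHA, virama, ZERO WIDTH JOINER, RA, vowel sign II
with_zwj = "\u0dc1\u0dca\u200d\u0dbb\u0dd3"
without_zwj = with_zwj.replace("\u200d", "")

# Python's built-in "idna" codec implements IDNA 2003 (RFC 3490).
encoded_with = with_zwj.encode("idna")
encoded_without = without_zwj.encode("idna")

# Both spellings collapse to the same ASCII label, so the joiner is lost.
same_encoding = encoded_with == encoded_without

# Decoding the label cannot bring the joiner back.
round_trip = encoded_with.decode("idna")
joiner_lost = round_trip == without_zwj

print(same_encoding, joiner_lost)
```

If both flags come out true, the encoded label has no way to tell the two spellings apart, which is the corruption described above.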
== Proposed change ==
The easiest fix seems to be simply removing the IDNA 2003 encoding (calls
to punycode()) from
[https://github.com/django/django/blob/54059125956789ad4c19b77eb7f5cde76eec0643/django/utils/html.py#L254-L257 URL generation]
in django.utils.html. Instead, run unquote_quote() on the
netloc to render IDNs as percent-encoded UTF-8, like other non-ASCII
characters in the URL. That leaves IDNA encoding details to the browser,
ensuring consistent handling of all international URLs.
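To make the idea concrete, here is a rough standard-library sketch of percent-encoding the netloc instead of IDNA encoding it. The helper name quote_netloc() and the NETLOC_SAFE set are assumptions made for this illustration; they are not Django's unquote_quote() and not the actual patch.

```python
from urllib.parse import quote, unquote, urlsplit, urlunsplit

# Characters left unescaped in the authority: RFC 3986 sub-delims plus ":@[]~".
# (An assumption for this sketch; Django's unquote_quote() has its own safe set.)
NETLOC_SAFE = "!$&'()*+,;=:@[]~"


def quote_netloc(url):
    """Hypothetical helper: render a non-ASCII host as %-encoded UTF-8."""
    parts = urlsplit(url)
    # unquote() first, so an already-encoded host is not double-encoded.
    netloc = quote(unquote(parts.netloc), safe=NETLOC_SAFE)
    return urlunsplit(parts._replace(netloc=netloc))


# The Thaana example host from above, spelled as code points.
url = "https://\u0789\u07a8\u0780\u07a7\u0783\u07aa.example.mv"
print(quote_netloc(url))
# https://%DE%89%DE%A8%DE%80%DE%A7%DE%83%DE%AA.example.mv
```

Unquoting before quoting keeps the helper idempotent, so a URL that already went through it is left unchanged.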
There doesn't seem to be any need for Django to IDNA encode domains in
URLs. In fact, apart from the urlize template filters and the
AdminURLFieldWidget, ''nothing else'' in Django applies IDNA encoding to
URL hosts. (The iriencode and urlencode template filters generate
%-encoded UTF-8, not IDNA. And it seems like many projects just render IDN
URLs as raw UTF-8. Modern browsers support both.)
This approach complies with relevant standards:
* WHATWG's [https://html.spec.whatwg.org/multipage/urls-and-fetching.html#urls HTML Standard]
(working your way down from section 2.4, ''URLs'') and
[https://url.spec.whatwg.org/#host-parsing URL Standard]
(arriving at section 3.5, ''Host parsing'', step 4) allow both raw and
%-encoded UTF-8 in URL hosts. This is what all modern browsers support.
* [https://datatracker.ietf.org/doc/html/rfc3986#section-3.2.2:~:text=The%20reg%2Dname,legacy%20URI%20resolvers RFC 3986],
which specifies URIs, permits %-encoded UTF-8 in URL hosts
(last paragraph of section 3.2.2; a "registered name" is a host).
[https://datatracker.ietf.org/doc/html/rfc6068#section-2:~:text=Percent%2Dencoding%20can,mailto%27%0A%20%20%20%20%20%20%20URI%20interpreters. RFC 6068],
for mailto URIs, includes similar language (section 2 item 4).
(Although the RFCs suggest that apps should apply "IDNA encoding, rather
than a percent-encoding, ''if they wish to maximize interoperability with
legacy URI resolvers''" [emphasis added], this is a ''should''
recommendation, not a ''must'' requirement.)
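For mailto links, the percent-encoding that RFC 6068 permits might look like the sketch below. mailto_href() is a hypothetical helper written for this illustration, and it is not necessarily what a patch to Urlizer would do:

```python
from urllib.parse import quote


def mailto_href(address):
    """Hypothetical helper: build a mailto URI with a %-encoded UTF-8 domain."""
    # Split on the last "@" so the domain is everything after it.
    local, _, domain = address.rpartition("@")
    # quote() encodes str input as UTF-8 before percent-encoding it;
    # ASCII letters, digits and "_.-~" are always left as-is.
    return "mailto:" + quote(local, safe="") + "@" + quote(domain, safe="")


# "editor@" + the Thaana example domain, spelled as code points.
address = "editor@\u0789\u07a8\u0780\u07a7\u0783\u07aa.example.mv"
print(mailto_href(address))
# mailto:editor@%DE%89%DE%A8%DE%80%DE%A7%DE%83%DE%AA.example.mv
```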
Past discussions have proposed updating django.utils.encoding.punycode()
from IDNA 2003 to IDNA 2008. But that would not be a workable solution.
IDNA 2008 alone does not perform case folding and other transformations
needed to match user expectations around IDN resolution. And IDNA 2008
disallows some domains that WHATWG's URL Specification
[https://url.spec.whatwg.org/#idna:~:text=This%20document%20and%20the%20web%20platform%20at%20large%20use%20Unicode%20IDNA%20Compatibility%20Processing%20and%20not%20IDNA2008.%20For%20instance%2C%20%E2%98%95.example%20becomes%20xn%2D%2D53h.example%20and%20not%20failure. specifically permits]
(such as emoji domains).
To match browser IDN handling, the correct spec to follow (per WHATWG)
would be Unicode [https://unicode.org/reports/tr46/ UTS #46]
"non-transitional." Currently, there doesn't seem to be any complete Python
implementation of that standard. (The idna package's uts46 option
implements only UTS !#46 preprocessing, section 4.4.)
Considering all that, deferring IDN encoding to the browser seems like the
cleanest and most reliable approach. (Indeed, that is what already happens
for URLs that aren't generated by urlize or displayed in an
AdminURLFieldWidget.)
== Compatibility ==
Django's admin app fully supports only "recent versions of modern, web
standards compliant browsers"
([https://docs.djangoproject.com/en/5.1/faq/admin/#what-browsers-are-supported-for-using-the-admin admin faq],
language has been there since
Django 3.1). Modern browsers all follow the WHATWG standards cited above,
so there should be no compatibility concerns with the AdminURLFieldWidget.
In theory, the Urlizer changes could impact an existing Django app which
both (1) uses the urlize or urlizetrunc template filter on text containing
IDN URLs, and (2) needs the resulting links to work with a "legacy URI
resolver" user agent that either doesn't understand %-encoded UTF-8 or
doesn't perform IDNA encoding. If both of those are true, any urlized IDN
links would probably be broken (not navigable) in that user agent after
this change. (Of course, even before this change, any IDN links the app
renders directly in its HTML—not by urlizing plaintext—are already broken
for that legacy user agent.)
== More info ==
IDNA encoding was added to smart_urlquote() in #13704 (2010), because
"urlquote … incorrectly handles domain names with unicode characters in
them." Unfortunately, that ticket doesn't include examples of the
incorrect results, and I haven't tried to get Django 1.2 working to test
it myself. My best guess is that %-encoded UTF-8 wasn't considered
"correct" back then. (It's just fine now, per the standards cited above.)
For reference, here's how Django's punycode() (IDNA 2003) and the third-
party idna package (IDNA 2008) handle the IDNs from the earlier examples
(Django 5.1.4; Python 3.12.4):
{{{#!python
# IDNA 2003:
from django.utils.encoding import punycode
punycode("עִתוֹן.example.il")
# 'xn--cdbk7fubl3c.example.il'
punycode("މިހާރު.example.mv")
# UnicodeError: Violation of BIDI requirement 3
# encoding with 'idna' codec failed
# IDNA 2008:
import idna
idna.encode("עִתוֹן.example.il")
# b'xn--cdbk7fubl3c.example.il'
idna.encode("މިހާރު.example.mv")
# b'xn--hqbgq5jdp.example.mv'
# Unicode code points:
assert "עִתוֹן" == '\u05e2\u05b4\u05ea\u05d5\u05b9\u05df'
assert "މިހާރު" == '\u0789\u07a8\u0780\u07a7\u0783\u07aa'
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/36013>