Add support for IDNA 2008


Julien Bernard

Sep 1, 2022, 11:19:17 AM
to Django developers (Contributions to Django itself)
Hi,
I'm starting this discussion as recommended in a comment on ticket #33968.
Currently, the punycode method converts domain names from Unicode to ASCII using the deprecated IDNA 2003 standard.
As suggested in the ticket, the method should use the idna package, which is fully compliant with the latest IDNA 2008 standard.
The ticket comment points out that adding a new dependency is problematic. However, in most cases, someone using Django will end up with this dependency anyway, as it is widely used. Django's GeoIP2 contrib uses it, for instance.
Best regards,
Julien

Adam Johnson

Sep 1, 2022, 1:33:29 PM
to Django developers (Contributions to Django itself)
Some data...

The idna package has ~9M downloads a day ( https://pypistats.org/packages/idna ) compared with Django's ~350k ( https://pypistats.org/packages/django ).

However it has 191 GitHub stars ( https://github.com/kjd/idna ) compared to Django's 66k ( https://github.com/django/django ).

I imagine most idna installs come from popular packages depending on it: at least requests and twisted do. Most people probably don't care about complete domain validation (until they do). To me, it seems like a reasonable dependency for Django to adopt, given so much of the ecosystem uses it.

But I don't think there's much evidence of demand. I found one other ticket mentioning idna, and this was with an extended regex, not the package ( https://code.djangoproject.com/ticket/18119 , seems stale, maybe closeable? ).

I'd suggest at this point: implement idna validators in a third-party package, do some advocacy for why projects would need it, and show some adoption. This would make the case for including it in Django itself stronger.


Julien Bernard

Sep 2, 2022, 11:32:42 AM
to Django developers (Contributions to Django itself)
Thanks Adam.
The ticket was targeting EmailValidator, but the punycode method is used in more places in Django core.
If you search the tickets for Unicode character support in email addresses, you will find more than by searching for IDN, but indeed the demand may not be that high.
This is a vicious circle: IDNs, and especially EAIs, are not widely used because of poor support in software and the lack of compliance with the latest version.
Regards,
Julien

Carlton Gibson

Sep 4, 2022, 5:16:19 AM
to django-d...@googlegroups.com
Hi Julien.

We've had several tickets and discussions surrounding how far e.g. URLValidator needs to match all valid URLs. The conclusion we've come to (which is always provisional) is that actually we **don't** want to match all valid (according to the relevant RFC and such) input. Rather, we'd prefer a simpler implementation, one that's realistically maintainable for us, that captures the 95% case, and then ask users to implement a custom validator if they need more. This seems like a happy compromise.

I hope that makes sense. 

I agree with Adam that a third-party package is the immediate way forward here. If it shows a lot of demand then it's always open to revisit whether that should be included in Django itself. (Taking on an extra dependency for a small subset of users is always going to be up for debate; it's not a blocker per se, but it does need weighing...)

Kind Regards,

Carlton

Julien Bernard

Sep 6, 2022, 2:23:27 PM
to Django developers (Contributions to Django itself)
Thanks Carlton.

It makes total sense to keep things simple and avoid bringing in another dependency in the context of validation, provided that you don't prevent valid URLs from being accepted. That's where it can be tricky, but from what I saw, it seems reasonable to think that the current domain validation is too permissive.
But the punycode method is also used in other places where it is more "critical" than a validator:
  • Urlizer
  • CachedDnsName
  • email sending.
The need for the idna package should be evaluated with regard to those usages rather than validation.

Best regards,
Julien

Carlton Gibson

Sep 6, 2022, 2:39:49 PM
to django-d...@googlegroups.com
Hey Julien.

What's maybe missing is some concrete cases. "This conversion should be made IDNA 2008 compliant." — OK, but what does that buy us? 

Maybe the idna package is OK... It's widely depended on already — I got it for free yesterday installing httpx in a project — and packaging isn't what it was... — but **if** we take on an extra dependency it needs to be for a clear gain. 

Likely (still) a proof-of-concept at least showing what's added (as a separate package? 🤔)  is the easiest way forward? 

Others may yet agree that this is something needed. 

Kind Regards,

Carlton

Julien Bernard

Sep 6, 2022, 5:03:29 PM
to Django developers (Contributions to Django itself)
Hi Carlton,

IDNA 2008 changed which IDNs are valid or invalid and, for multiple reasons, it transforms some characters into Punycode differently from IDNA 2003.
A difference that is often used as an example is the German 'ß' character. In IDNA 2003 it is mapped to 'ss', while in IDNA 2008 it is kept and encoded into Punycode.
This means that, depending on which standard is implemented, the same IDN may lead to totally different domains, which may cause security issues.
For example, the URL https://fuß.standcore.com/ would be https://fuss.standcore.com/ with IDNA 2003 and https://xn--fu-hia.standcore.com/ with IDNA 2008.
This is only a very brief overview; for further quick reading, https://www.unicode.org/faq/idn.html is quite informative too.
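
To make the 'ß' difference concrete, here is a minimal sketch comparing Python's built-in "idna" codec (IDNA 2003) with the third-party idna package (IDNA 2008). The variable names are only illustrative, and the second half assumes the idna package may or may not be installed.

```python
# Python's built-in "idna" codec implements IDNA 2003, which maps 'ß' to 'ss'.
domain = "fuß.standcore.com"
idna2003 = domain.encode("idna")
print(idna2003)  # b'fuss.standcore.com'

# The third-party idna package implements IDNA 2008 and keeps 'ß'.
# It is treated as optional here, so the sketch still runs without it.
try:
    import idna
except ImportError:
    idna2008 = None
else:
    idna2008 = idna.encode(domain)
print(idna2008)  # b'xn--fu-hia.standcore.com' when idna is installed, else None
```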

Best regards,
Julien

Carlton Gibson

Sep 7, 2022, 2:18:13 AM
to django-d...@googlegroups.com
Hey Julien. 

Thanks, OK... 📖


If you need the IDNA 2008 standard from RFC 5891 and RFC 5895, use the third-party idna module.

So the question is do we **need** the newer standard?

I will have a read of the various resources here, and I'll also ask the Django Security Team if they have any thoughts. 

Kind Regards,

Carlton


Carlton Gibson

Sep 13, 2022, 7:17:31 AM
to Django developers (Contributions to Django itself)
Hi Julien. 

I didn't get a canonical answer from the security team yet, but it may be that we can make idna an optional dependency quite easily. I already have it installed in my dev environment, for instance, coming from selenium and requests.

From the package docs: https://pypi.org/project/idna/

   You may use the codec encoding and decoding methods using the idna.codec module:

   >>> import idna.codec 
   >>> print('домен.испытание'.encode('idna')) 
   b'xn--d1acufc.xn--80akhbyknj4f'

So "use if installed" (catching the ImportError if not) would look feasible. (The usage in the punycode helper is just `domain.encode("idna")` which matches this example already.)
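
A rough sketch of what "use if installed" could look like. The function name `punycode_sketch` is hypothetical and this is not Django's actual helper; it also assumes calling `idna.encode()` directly rather than relying on the package's codec registration.

```python
try:
    import idna  # third-party package implementing IDNA 2008 (optional)
except ImportError:
    idna = None


def punycode_sketch(domain):
    """Hypothetical helper: return the ASCII form of a domain name."""
    if idna is not None:
        # IDNA 2008 via the third-party package.
        return idna.encode(domain).decode("ascii")
    # Fall back to the standard library codec (IDNA 2003).
    return domain.encode("idna").decode("ascii")


print(punycode_sketch("домен.испытание"))
```

With this shape, a test suite could exercise both branches by running once with idna installed and once without.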

Would you fancy looking at a PR around that?

We'd need *some* tests for both the installed and not-installed cases, ideally showing the difference. I didn't immediately have success with your https://fuss.standcore.com/ example: 

    % python
    Python 3.10.6 (v3.10.6:9c7b4bd164, Aug  1 2022, 17:13:48) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    >>> print('https://fuß.standcore.com/'.encode('idna'))
    b'https://fuss.standcore.com/'
    >>> import idna.codec
    >>> print('https://fuß.standcore.com/'.encode('idna'))
    b'https://fuss.standcore.com/'  # Was expecting https://xn--fu-hia.standcore.com/ from discussion 🤔
    >>> import idna
    >>> idna.encode('https://fuß.standcore.com/')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/carlton/Envs/django/lib/python3.10/site-packages/idna/core.py", line 357, in encode
        s = alabel(label)
      File "/Users/carlton/Envs/django/lib/python3.10/site-packages/idna/core.py", line 269, in alabel
        check_label(label)
      File "/Users/carlton/Envs/django/lib/python3.10/site-packages/idna/core.py", line 250, in check_label
        raise InvalidCodepoint('Codepoint {} at position {} of {} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
    idna.core.InvalidCodepoint: Codepoint U+003A at position 6 of 'https://fuß' not allowed

Possibly there's some objection to such a change, but I'm struggling to imagine it short of concrete cases... 

Thanks! 

Kind Regards,

Carlton

Julien Bernard

Sep 14, 2022, 9:54:08 AM
to Django developers (Contributions to Django itself)
Hi Carlton,

On Tuesday, September 13, 2022 at 07:17:31 UTC-4, carlton...@gmail.com wrote:
> Hi Julien.
>
> I didn't get a canonical answer from the security team yet, but it may be that we can make idna an optional dependency quite easily. I already have it installed in my dev environment, for instance, coming from selenium and requests.
>
> From the package docs: https://pypi.org/project/idna/
>
>    You may use the codec encoding and decoding methods using the idna.codec module:
>
>    >>> import idna.codec
>    >>> print('домен.испытание'.encode('idna'))
>    b'xn--d1acufc.xn--80akhbyknj4f'
>
> So "use if installed" (catching the ImportError if not) would look feasible. (The usage in the punycode helper is just `domain.encode("idna")` which matches this example already.)

That's great news! Thanks.
 

> Would you fancy looking at a PR around that?

Yes, no problem.
 

> We'd need *some* tests for both the installed and not-installed cases, ideally showing the difference. I didn't immediately have success with your https://fuss.standcore.com/ example:
>
>     % python
>     Python 3.10.6 (v3.10.6:9c7b4bd164, Aug  1 2022, 17:13:48) [Clang 13.0.0 (clang-1300.0.29.30)] on darwin
>     Type "help", "copyright", "credits" or "license" for more information.
>     >>> print('https://fuß.standcore.com/'.encode('idna'))
>     b'https://fuss.standcore.com/'
>     >>> import idna.codec
>     >>> print('https://fuß.standcore.com/'.encode('idna'))
>     b'https://fuss.standcore.com/'  # Was expecting https://xn--fu-hia.standcore.com/ from discussion 🤔
>     >>> import idna
>     >>> idna.encode('https://fuß.standcore.com/')
>     Traceback (most recent call last):
>       File "<stdin>", line 1, in <module>
>       File "/Users/carlton/Envs/django/lib/python3.10/site-packages/idna/core.py", line 357, in encode
>         s = alabel(label)
>       File "/Users/carlton/Envs/django/lib/python3.10/site-packages/idna/core.py", line 269, in alabel
>         check_label(label)
>       File "/Users/carlton/Envs/django/lib/python3.10/site-packages/idna/core.py", line 250, in check_label
>         raise InvalidCodepoint('Codepoint {} at position {} of {} not allowed'.format(_unot(cp_value), pos+1, repr(label)))
>     idna.core.InvalidCodepoint: Codepoint U+003A at position 6 of 'https://fuß' not allowed

I was not able to get .encode('idna') to work either. I reported it at https://github.com/kjd/idna/issues/128 to find out why this is not working as expected.

For the last part, idna works with labels or domains, so you have to pass only the domain to the encode method:

    % python
    Python 3.10.7 (main, Sep  6 2022, 21:22:27) [GCC 12.2.0] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import idna
    >>> idna.encode('fuß.standcore.com')
    b'xn--fu-hia.standcore.com'
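
When starting from a full URL, the host has to be extracted first. A minimal sketch, assuming the standard library's `urllib.parse.urlsplit` for the splitting; the helper name `host_to_ascii` is hypothetical, and idna is again treated as optional:

```python
from urllib.parse import urlsplit


def host_to_ascii(url):
    """Hypothetical helper: encode only the host part of a URL to ASCII bytes."""
    # The scheme, path and ':' or '/' separators are not valid in IDNA labels,
    # so only the hostname is passed to the encoder.
    host = urlsplit(url).hostname
    try:
        import idna  # third-party, IDNA 2008
    except ImportError:
        return host.encode("idna")  # standard library codec, IDNA 2003
    return idna.encode(host)


print(host_to_ascii("https://fuß.standcore.com/"))
```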

Best regards,
Julien

Carlton Gibson

Sep 14, 2022, 9:58:24 AM
to django-d...@googlegroups.com
OK, great, thanks.

I'll await your PR. Let's continue on GitHub for the moment then.
Good hustle 👍
