Unicode normalization for username field

Rick Leir

Apr 21, 2016, 11:22:54 AM
to Django developers (Contributions to Django itself)
Hi all,
We have discussed the possibility of username spoofing in the users list. 


"It's not important until this happens:
https://labs.spotify.com/2013/06/18/creative-usernames/"

But my searches did not turn up anything in this list. Would you point me at any relevant discussions here please?
Thanks -- Rick

Tim Graham

Apr 21, 2016, 11:43:56 AM
to Django developers (Contributions to Django itself)
Here is one: https://groups.google.com/d/topic/django-developers/6aAHgP5g0lA/discussion
(all I did was search "unicode username")

Here's a relevant Trac ticket: https://code.djangoproject.com/ticket/21379

Rick Leir

Apr 21, 2016, 2:16:14 PM
to Django developers (Contributions to Django itself)
Thanks. To summarize quickly, (corrections please)

2008 - Usernames in django.contrib.auth are restricted to ASCII  
alphanumerics. Allowing Unicode seems fairly simple: compile the  
validator's regular expression with the re.UNICODE flag.

but:

2014 - Trac ticket still open, with no tests and no patches, and with problems caused by differences between py2 and py3 (py2 is supported until 2017)

Normalization could be done with
unicodedata.normalize('NFKD', input)

Aymeric Augustin

Apr 21, 2016, 3:23:16 PM
to django-d...@googlegroups.com
Hello,

Judging from the (rather confused) discussion on the users list, it looks like we’re discussing in the abstract. No one has tested whether the problem can happen with Django.

Since the ticket quoted below says Django (unexpectedly) accepts non-ascii usernames on Python 3, it’s just a matter of trying to create a user called rené and one called rené.

Here’s how to create these strings if copy-pasting them doesn’t suffice:

>>> composed = 'rené'
>>> composed.encode('utf-8')
b'ren\xc3\xa9'
>>> import unicodedata
>>> decomposed = unicodedata.normalize('NFKD', composed)
>>> decomposed.encode('utf-8')
b'rene\xcc\x81'

I suspect the problem may not exist on a full-featured database (i.e. not SQLite), depending on the database’s collation settings, which will cause these two strings to compare identical on most reasonable setups.

Who wants to try on various databases?
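
As a starting point for that experiment, here is a minimal sketch using only Python's standard sqlite3 module rather than Django's ORM. The table name `users` and its single column are assumptions made for illustration. It only shows how SQLite's default (BINARY) collation treats the two forms and says nothing about other databases.

```python
import sqlite3
import unicodedata

composed = unicodedata.normalize("NFC", "ren\u00e9")  # 'rené' using U+00E9
decomposed = unicodedata.normalize("NFD", composed)   # 'rene' followed by U+0301

# Hypothetical single-column table standing in for a user table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (username TEXT UNIQUE)")
conn.execute("INSERT INTO users (username) VALUES (?)", (composed,))
# With the default BINARY collation the two strings differ byte-wise,
# so this second insert does not violate the UNIQUE constraint.
conn.execute("INSERT INTO users (username) VALUES (?)", (decomposed,))

row_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
matches = conn.execute(
    "SELECT COUNT(*) FROM users WHERE username = ?", (composed,)
).fetchone()[0]
print(row_count, matches)
```

Repeating the same two inserts against other backends, with their usual collations, would show whether they compare the two forms as equal.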

If it turns out that the problem does exist and that Django should normalize things, it should normalize to NFKC. Proposing normalization to NFKD suggests a lack of familiarity with the topic.
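
To make the difference between the two forms concrete, here is a small sketch using only the standard unicodedata module. Both forms fold compatibility characters such as the 'ﬁ' ligature. The difference is that NFKC recomposes the result while NFKD leaves it decomposed. The sample strings are just illustrations.

```python
import unicodedata

decomposed = "rene\u0301"  # 'e' followed by COMBINING ACUTE ACCENT

nfkc = unicodedata.normalize("NFKC", decomposed)
nfkd = unicodedata.normalize("NFKD", decomposed)

# NFKC yields the composed form (4 code points, ending in U+00E9);
# NFKD keeps the base letter and the combining mark apart (5 code points).
print(len(nfkc), len(nfkd))

# Both forms apply compatibility folding, e.g. the U+FB01 ligature becomes 'fi'.
ligature = "\ufb01le"
folded_c = unicodedata.normalize("NFKC", ligature)
folded_d = unicodedata.normalize("NFKD", ligature)
print(folded_c, folded_d)
```

Storing the composed result keeps usernames in the shape that typical keyboard input already has, which is one practical argument for NFKC here.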

For what it’s worth, I’m in favor of restoring the intended behavior of restricting usernames to ASCII on Python 3 and letting developers who want something more elaborate implement their own requirements.

One last anecdote: I live in a country where many people have non-ASCII names and no one would ask for a non-ASCII username because everyone knows it would cause problems at some point, even if IT thinks otherwise! :-)

-- 
Aymeric.

Claude Paroz

Apr 22, 2016, 8:25:59 AM
to Django developers (Contributions to Django itself)
On Thursday, April 21, 2016 at 21:23:16 UTC+2, Aymeric Augustin wrote:
For what it’s worth, I’m in favor of restoring the intended behavior of restricting usernames to ASCII on Python 3 and letting developers who want something more elaborate implement their own requirements.

I'm sorry to disagree, you know that I'm a Unicode nerd :-) We probably should have done that when adding Python 3 support, but it might be a bit late now. I'll see if I can find the time to work on something acceptable, allowing people to choose either policy without too much hassle and backwards incompatibility. Of course, anyone else could try it, too.
 
One last anecdote: I live in a country where many people have non-ASCII names and no one would ask for a non-ASCII username because everyone knows it would cause problems at some point, even if IT thinks otherwise! :-)

As for me, I think that's a behavior inherited from the past, where pure ASCII was king. It feels to me a bit ethnocentric (even if I know there are/were technical reasons for that).

Claude

Claude Paroz

Apr 22, 2016, 6:04:04 PM
to Django developers (Contributions to Django itself)
On Friday, April 22, 2016 at 14:25:59 UTC+2, Claude Paroz wrote:
 I'll see if I can find the time to work on something acceptable, allowing people to choose either policy without too much hassle and backwards incompatibility. Of course, anyone else could try it, too.

Here's some code, unpolished, but a base for discussion.
https://github.com/django/django/pull/6494

Claude

Aymeric Augustin

Apr 23, 2016, 8:33:56 AM
to django-d...@googlegroups.com
Hi Claude,

> On Apr 23, 2016, at 00:04, Claude Paroz <cla...@2xlibre.net> wrote:
>
> Here's some code, unpolished, but a base for discussion.
> https://github.com/django/django/pull/6494

This patch looks pretty good. I have a few questions, not necessarily because I disagree with your proposal, but to make sure we have considered alternatives. Actually I don't think there's exactly one correct solution here; it's more a matter of tradeoffs.

You added a username_validator attribute instead of documenting how to override the whole username field. Can you elaborate on this decision? It simplifies the use case targeted by the patch by introducing a one-off API. As a matter of principle I'm a bit skeptical of such special cases. But I understand the convenience.

Normalization happens at the form layer. I'm wondering whether it would be safer to do it at the model layer. That would extend the security hardening to cases where users aren't created with a form — for example if they're created through an API or programmatically.

I would keep ASCII usernames as the default because:

- this has always been the intent;
- allowing non ASCII usernames may result in interoperability problems with other software e.g. if a Django project is used as SSO server;
- these interoperability issues might escalate into security vulnerabilities — there isn't a straightforward connection but (1) non ASCII data can be used for breaking out of parsing routines (2) I'm paranoid with anything that manipulates authentication credentials;
- I'm afraid this change may result in boilerplate as most custom user models will revert to Django's historical (and in my opinion sensible) username validation rules.

Finally, I would add a test to check that a username containing a zero-width space is rejected, just to make sure we never accidentally make it trivial to create usernames that render identically, which this PR aims at preventing. It will be rejected because it won't match \w.
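
A quick way to convince oneself of that last claim, sketched with the standard re module only. The character class below is an assumption modelled on the kind of pattern a username validator might use, not a quote from the patch. U+200B is ZERO WIDTH SPACE.

```python
import re

# Assumed username pattern; on Python 3, \w in a str pattern is Unicode-aware.
USERNAME_RE = re.compile(r"[\w.@+-]+")

plain = "ren\u00e9"         # 'rené', composed form
sneaky = "ren\u200b\u00e9"  # renders like 'rené' but hides a zero-width space

plain_ok = USERNAME_RE.fullmatch(plain) is not None
sneaky_ok = USERNAME_RE.fullmatch(sneaky) is not None
print(plain_ok, sneaky_ok)
```

The zero-width space is a format character, not a letter or digit, so it falls outside \w and the second username fails to match.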

Best regards,

--
Aymeric.

Claude Paroz

Apr 24, 2016, 2:58:55 PM
to Django developers (Contributions to Django itself)
Hi Aymeric,


On Saturday, April 23, 2016 at 14:33:56 UTC+2, Aymeric Augustin wrote:
> https://github.com/django/django/pull/6494

This patch looks pretty good. I have a few questions, not necessarily because I disagree with your proposal, but to make sure we have considered alternatives. Actually I don't think there's exactly one correct solution here; it's more a matter of tradeoffs.

You added a username_validator attribute instead of documenting how to override the whole username field. Can you elaborate on this decision? It simplifies the use case targeted by the patch by introducing a one-off API. As a matter of principle I'm a bit skeptical of such special cases. But I understand the convenience.

My concern here was not to force users to create a custom user model just to change the username validation, especially as the migration system doesn't yet seem to support upgrading from the standard auth User to a custom user. I thought that creating a proxy custom user is easier migration-wise, as no new table is required. But I may be wrong.
 
Normalization happens at the form layer. I'm wondering whether it would be safer to do it at the model layer. That would extend the security hardening to cases where users aren't created with a form — for example if they're created through an API or programmatically.

Normalization happens both at the form layer and at the model layer in _create_user. You may have missed the _create_user change.
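
To illustrate why normalizing in the creation path matters, here is a rough standalone sketch. `InMemoryUserStore` and `normalize_username` are hypothetical stand-ins written for this message, not the code from the PR and not Django's API. The point is only that every caller goes through the same normalization, whether or not a form is involved.

```python
import unicodedata


def normalize_username(username):
    # Hypothetical helper: NFKC, as suggested earlier in the thread.
    return unicodedata.normalize("NFKC", username)


class InMemoryUserStore:
    """Stand-in for a model manager; not Django's actual API."""

    def __init__(self):
        self._users = {}

    def create_user(self, username):
        # Normalize here so forms, APIs and shell sessions are treated alike.
        key = normalize_username(username)
        if key in self._users:
            raise ValueError("username already taken: %r" % key)
        self._users[key] = {"username": key}
        return self._users[key]


store = InMemoryUserStore()
store.create_user("ren\u00e9")       # composed form
try:
    store.create_user("rene\u0301")  # decomposed form, renders identically
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
print(duplicate_rejected)
```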
 
I would keep ASCII usernames as the default because:

- this has always been the intent;

Until now! Things are evolving, we see that for example with internationalized domain names. I think that most if not all technical reasons requiring pure ASCII usernames have vanished nowadays.
 
- allowing non ASCII usernames may result in interoperability problems with other software e.g. if a Django project is used as SSO server;

These are still not the majority of Django use cases. And even then, I think that LDAPv3 for example should support unicode in attributes. Those projects could still configure the ASCIIUsernameValidator if desired.
 
- these interoperability issues might escalate into security vulnerabilities — there isn't a straightforward connection but (1) non ASCII data can be used for breaking out of parsing routines (2) I'm paranoid with anything that manipulates authentication credentials;

Sure, the more characters, the more attack surface. As you said before, it's a tradeoff. My thinking is that sooner or later, we'll have to cope with unicode in usernames. So let's do our utmost not to open security holes, based on some past issues (BTW I think you forgot your references).
 
- I'm afraid this change may result in boilerplate as most custom user models will revert to Django's historical (and in my opinion sensible) username validation rules.

That's a tough question to estimate. This might be true for most English monolingual web sites, but not necessarily for the majority of Django sites. Hopefully we'll get some more user inputs in this thread.
 
Finally, I would add a test to check that a username containing a zero-width space is rejected, just to make sure we never accidentally make it trivial to create usernames that render identically, which this PR aims at preventing. It will be rejected because it won't match \w.

Sure, good idea.
 
Overall, I totally understand your opinion, and I agree there is no "right" or "wrong" solution. Eventually, this might be a decision to be brought to the technical board.

Claude

Erik Cederstrand

Apr 25, 2016, 3:29:16 AM
to django-d...@googlegroups.com

> On Apr 24, 2016, at 20:58, Claude Paroz <cla...@2xlibre.net> wrote:
>
> - I'm afraid this change may result in boilerplate as most custom user models will revert to Django's historical (and in my opinion sensible) username validation rules.
>
> That's a tough question to estimate. This might be true for most English monolingual web sites, but not necessarily for the majority of Django sites. Hopefully we'll get some more user inputs in this thread.

From a security point of view, I understand the reasoning here. Everyone expects ASCII-only usernames. There were the same discussions when IDN was introduced. But even with ASCII, people are attempting to spoof (googel.com, gogle.com etc). We need a way to normalize unicode - preferably an external library or method, since this is not a Django-specific problem.

Being from .dk, we're used to translating "Åge Æbelø" to "aage_aebeloe" when creating usernames. But just like 8.3 filenames were the norm in DOS and we got around that, we now expect to be able to create long filenames with spaces and unicode characters. There was a 15-year transition period where unicode filenames would maybe work and often break things in unexpected ways (I *still* find issues from time to time). Unicode was for the brave and those with lots of free time to waste. But I think that in 2016, software that cannot handle unicode is simply broken and must be fixed. Sure, unicode can be a hassle. I still need to hexdump strings once in a while to find out what is going on. But there is no way we can continue to not support the billions of people in the world that use a language that doesn't fit into ASCII.

For usernames, most people may still want the old behavior, and they can do that. It could even be the default (remember POLA). But being able to create unicode usernames should be possible and supported.

Erik

Aymeric Augustin

Apr 25, 2016, 2:11:39 PM
to django-d...@googlegroups.com
Hi Claude,

On 24 Apr 2016, at 20:58, Claude Paroz <cla...@2xlibre.net> wrote:

On Saturday, April 23, 2016 at 14:33:56 UTC+2, Aymeric Augustin wrote:

You added a username_validator attribute instead of documenting how to override the whole username field. Can you elaborate on this decision? It simplifies the use case targeted by the patch by introducing a one-off API. As a matter of principle I'm a bit skeptical of such special cases. But I understand the convenience.

My concern here was not to force users to create a custom user model just to change the username validation, especially as the migration system doesn't yet seem to support upgrading from the standard auth User to a custom user. I thought that creating a proxy custom user is easier migration-wise, as no new table is required. But I may be wrong.

I believe that you can switch to a custom user model that has the same fields as auth.User just by declaring db_table = 'auth_user'. You may still have to throw away your migration history and recreate a fresh set of migrations. I ran these tests some time ago and I’m not sure of the results. Indeed, a proxy model is easier.

On a side note, we should recommend to always start with a custom user model… I don’t know if we added that to the docs.

Overall, I totally understand your opinion, and I agree there is no "right" or "wrong" solution. Eventually, this might be a decision to be brought to the technical board.

It’s a -0 from me, not a -1, and it may turn into a +0 as time passes... More arguments or opinions, especially backed by data or experience, would certainly be useful.

-- 
Aymeric.

Shai Berger

Apr 25, 2016, 2:32:20 PM
to django-d...@googlegroups.com
On Monday 25 April 2016 21:11:51 Aymeric Augustin wrote:
>
> It’s a -0 from me, not a -1, and it may turn into a +0 as time passes...
> More arguments or opinions, especially backed by data or experience, would
> certainly be useful.

As far as I can see, the force of the push to use non-ASCII usernames is
inversely proportional to the size of your native alphabet's intersection with
ASCII (for me, this intersection is empty).

Shai.

Aymeric Augustin

Apr 25, 2016, 5:45:24 PM
to django-d...@googlegroups.com
Based on further clarifications by Shai on IRC, I’m changing my -0 to +1.

Rather stupidly, I didn’t realize countries with non-latin alphabets are
already using non-ASCII usernames and mostly getting away with it.

--
Aymeric.

Rick Leir

May 3, 2016, 9:29:06 AM
to Django developers (Contributions to Django itself)
Hi all
Could there be a consensus on the following?
- default to ASCII
- optionally, UTF-8 with normalization
- based on Claude's code
- Python 3 required so we are not distracted by compatibility issues
Cheers -- Rick

Tim Graham

May 9, 2016, 8:48:06 AM
to Django developers (Contributions to Django itself)
Rather than change the behavior of Python 2 near its last supported version of Django, I would make the default validator ASCII on Python 2 and Unicode on Python 3.
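
Roughly, the idea could look like the sketch below, written against the standard re module only. The pattern and the variable names are assumptions for illustration, not the validators from the PR.

```python
import re
import sys

PY3 = sys.version_info[0] >= 3

# re.ASCII exists only on Python 3. On Python 2, \w is ASCII-only
# unless re.UNICODE is passed, so the flags are swapped per version.
ASCII_FLAGS = re.ASCII if PY3 else 0
UNICODE_FLAGS = 0 if PY3 else re.UNICODE

ascii_username_re = re.compile(r"^[\w.@+-]+\Z", ASCII_FLAGS)
unicode_username_re = re.compile(r"^[\w.@+-]+\Z", UNICODE_FLAGS)

# Keep each Python version on the behavior it already has in practice.
default_username_re = unicode_username_re if PY3 else ascii_username_re

ascii_accepts = ascii_username_re.match(u"ren\u00e9") is not None
unicode_accepts = unicode_username_re.match(u"ren\u00e9") is not None
print(ascii_accepts, unicode_accepts)
```

A project that wants the other policy would then only need to point its user model at the other validator.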

Claude Paroz

May 9, 2016, 4:00:27 PM
to Django developers (Contributions to Django itself)
On Monday, May 9, 2016 at 14:48:06 UTC+2, Tim Graham wrote:
Rather than change the behavior of Python 2 near its last supported version of Django, I would make the default validator ASCII on Python 2 and Unicode on Python 3.

I can buy this, providing we don't face migration issues.

Claude

Tim Graham

May 12, 2016, 12:45:15 PM
to Django developers (Contributions to Django itself)
Just to be sure, do you mean django.db.migrations (referencing the appropriate validator in the migration file, I guess?) or some problem a project would face when migrating from Python 2 to 3?

Claude Paroz

May 14, 2016, 9:11:07 AM
to Django developers (Contributions to Django itself)
On Thursday, May 12, 2016 at 18:45:15 UTC+2, Tim Graham wrote:
Just to be sure, do you mean django.db.migrations (referencing the appropriate validator in the migration file, I guess?) or some problem a project would face when migrating from Python 2 to 3?

Both things, hopefully not an issue, but who knows?

I have attached the new PR to the ticket.

Claude

David Tan

May 19, 2016, 2:48:39 PM
to Django developers (Contributions to Django itself)
- I'm afraid this change may result in boilerplate as most custom user models will revert to Django's historical (and in my opinion sensible) username validation rules. 

That's a tough question to estimate. This might be true for most English monolingual web sites, but not necessarily for the majority of Django sites. Hopefully we'll get some more user inputs in this thread.
 
Hi, just wanted to give my input on this point, I agree with Aymeric Augustin here and my vote is to keep usernames as ASCII by default.

I created a Django ticket for this, I will copy my reasoning here:

A Django user who is trying to save time and get a product out the door isn't going to focus on finer details such as Unicode usernames, and will be in for a shock when he finds out a bunch of his users have registered themselves with Egyptian hieroglyphics. He may be very frustrated, eventually figuring out that he must subclass the User model and set username_validator = ASCIIUsernameValidator() to get the functionality he expected. And what is he to do with the existing Unicode users, delete all their accounts?

Whereas a technologically forward user might be friendlier towards Unicode usernames, and would be well-informed on these capabilities within Django. Furthermore, the technologically forward user will be more likely to already have a custom user model, and won't find it cumbersome to explicitly enable Unicode usernames. Enabling Unicode usernames isn't destructive like disabling it would be (no need to figure out what to do with the existing users offending the validation), so users can simply start using it immediately.

charettes

May 19, 2016, 3:09:03 PM
to Django developers (Contributions to Django itself)
Hi David,

I agree with your reasoning but I think you're missing an important detail about
unicode username support: they have been mistakenly enabled on Python 3 since
Django added support for it (1.5-1.6).

If we were to disallow non-ASCII characters silently from Django 1.10, Python 3
developers would be left with the same problem you mentioned about existing
users with usernames containing unicode characters.

Cheers,
Simon