[Django] #30892: Slugify and SlugField with Diacritics

Django

unread,

Oct 18, 2019, 10:52:50 AM10/18/19

to django-...@googlegroups.com

#30892: Slugify and SlugField with Diacritics
--------------------------------------+------------------------
Reporter: originell | Owner: nobody
Type: Bug | Status: new
Component: Utilities | Version: 2.2
Severity: Normal | Keywords:
Triage Stage: Unreviewed | Has patch: 0
Needs documentation: 0 | Needs tests: 0
Patch needs improvement: 0 | Easy pickings: 0
UI/UX: 0 |
--------------------------------------+------------------------
While working on an international project, we discovered that the
turkish/azerbaijani letter `İ` can not be properly processed when
`SlugField` and `slugify` are run with `allow_unicode=True`.

The project itself runs with Django 2.2.6 and Wagtail 2.6.2. I first
talked about this in the Wagtail Support Channel and while researching
further, discovered that this is a Django/Python related issue. This was
tested on Python 3.6 and on Python 3.7.

(quick shoutout to Matt Wescott @gasmanic of Wagtail Fame for being a
sparing partner in this)

There is a rather detailed analysis (README) in a git repo I created
https://github.com/originell/django-wagtail-turkish-i - it was also the
basis for my initial call for help in wagtail's support channel. Meanwhile
I have extended it with a Django-only project, as to be a 100% sure this
has nothing to do with Wagtail.

I was not able to find anything similar in trac. While I encourage whoever
is reading this to actually read the README in the git repo, I want to
honor your time and will try to provide a more concise version of the bug
here.

=== Explanation

`models.py`

{{{#!python
from django.db import models
from django.utils.text import slugify

class Page(models.Model):
title = models.CharField(max_length=255)
slug = models.SlugField(allow_unicode=True)

def __str__(self):
return self.title
}}}

Using this in a shell/test like a (Model)Form might:

{{{#!python
from django.utils.text import slugify

page = Page(title="Hello İstanbul")
page.slug = slugify(page.title, allow_unicode=True)
page.full_clean()
}}}

`full_clean()` then raises

django.core.exceptions.ValidationError: {'slug': ["Enter a valid
'slug' consisting of Unicode letters, numbers, underscores, or hyphens."]}

Why is that?

`slugify` does the following internally:

{{{#!python
re.sub(r'[^\w\s-]', '', value).strip().lower()
}}}

Thing is, Python's `.lower()` of the `İ` in `İstanbul` looks like this:

{{{#!python
>>> [unicodedata.name(character) for character in 'İ'.lower()]
['LATIN SMALL LETTER I', 'COMBINING DOT ABOVE']
}}}

So, while `slugify()` finishes, the result is then passed into `SlugField`
which uses the `slug_unicode_re` (`^[-\w]+\Z`). However, the ''Combining
Dot Above'' is not a valid `\w`:

{{{#!python
>>> [(character, unicodedata.name(character),
slug_unicode_re.match(character)) for character in 'İ'.lower()]
[
('i', 'LATIN SMALL LETTER I', <re.Match object; span=(0, 1), match='i'>),
# EUREKA!!
('̇', 'COMBINING DOT ABOVE', None)
]
}}}

So that's why the `ValidationError` is raised.

=== Proposed Solution

The culprit seems to be the order in which `lower()` is called in slugify.
The assumption that the lowercase version of a `re.sub`-ed string is still
a valid `slug_unicode_re`, does not seem to hold true.

Hence, instead of doing this in `slugify()`

{{{#!python
re.sub(r'[^\w\s-]', '', value).strip().lower()
}}}

It might be better to do it like this

{{{#!python
re.sub(r'[^\w\s-]', '', value.lower()).strip()
}}}

=== Is Python the actual culprit?

Yeah it might be. Matt (@gasmanic) urged me to also take a look if Python
might be doing this wrong.

- The `İ` is the ''Latin Capital Letter I with Dot Above''. It's
codepoint is `U+0130` According to the chart for the Latin Extended-A set
(https://www.unicode.org/charts/PDF/U0100.pdf), it's lowercase version is
`U+0069`.
- `U+0069` lives in the ''C0 Controls and Basic Latin set''
(https://www.unicode.org/charts/PDF/U0000.pdf). Lo and behold: it is the
''Latin small letter I''. So a latin lowercase `i`.

Does this really mean that Python is doing something weird here by adding
the ''Combining dot above''? Honestly, I can't imagine that and I am
probably missing an important thing here because my view is too naive.

---

I hope this shorter explanation makes sense. If it does not, please try to
read through the detailed analysis in the repo
(https://github.com/originell/django-wagtail-
turkish-i/blob/master/README.md). If that also does not make a ton of
sense, let me know.

In any case, thank you for taking the time to read this bug report.
Looking forward to feedback and thoughts. I am happy to oblige in any
capacity necessary.

--
Ticket URL: <https://code.djangoproject.com/ticket/30892>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

Django

unread,

Oct 21, 2019, 8:57:19 AM10/21/19

to django-...@googlegroups.com

#30892: slugify() doesn't return a valid slug for "İ".
---------------------------+------------------------------------
Reporter: Luis Nell | Owner: nobody
Type: Bug | Status: new
Component: Utilities | Version: master
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

---------------------------+------------------------------------
Changes (by felixxm):

* version: 2.2 => master
* stage: Unreviewed => Accepted

Comment:

Thanks for this ticket. I'm not sure if your solution is correct because
I'm afraid that we may hit another edge case, nevertheless `slugify()`
should return a valid slug.

--
Ticket URL: <https://code.djangoproject.com/ticket/30892#comment:1>

Django

unread,

Oct 21, 2019, 9:30:55 AM10/21/19

to django-...@googlegroups.com

#30892: slugify() doesn't return a valid slug for "İ".
---------------------------+------------------------------------

Reporter: Luis Nell | Owner: nobody
Type: Bug | Status: new

Component: Utilities | Version: master
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

---------------------------+------------------------------------

Comment (by Luis Nell):

True that.

I also thought about looping over the characters and simply removing
everything that does not match the `\w`. However, I threw that away
because I thought that this might as well hit some edge case where it
could modify the meaning of a word.

In any case, if we can find a (more solid) process, I'd be very happy to
produce a patch and tests for that.

--
Ticket URL: <https://code.djangoproject.com/ticket/30892#comment:2>

Django

unread,

Dec 10, 2019, 9:20:25 AM12/10/19

to django-...@googlegroups.com

#30892: slugify() doesn't return a valid slug for "İ".

-------------------------------------+-------------------------------------
Reporter: Luis Nell | Owner:
| Christoffer Sjöbergsson
Type: Bug | Status: assigned

Component: Utilities | Version: master
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------------+-------------------------------------
Changes (by Christoffer Sjöbergsson):

* owner: nobody => Christoffer Sjöbergsson
* status: new => assigned

Comment:

I did some digging into this and found some interesting things that I
thought that I would share.

In short my conclusion is that python does as it should.

Firstly because if you perform a lowercase operation on `İ` in JavaScript
the result becomes the same and secondly because in
[https://unicode.org/Public/UNIDATA/SpecialCasing.txt] which describes
casing rules in some special occasions we can see that the lower case
mapping of `İ` is indeed `['LATIN SMALL LETTER I', 'COMBINING DOT
ABOVE']`.

Things are however a little bit more complicated than that, as it turns
out that the casing operation is performed differently depending on which
locale is used. Since locale settings should not break code I will not go
in to much on it here but for further reading take a look att JavaSript's
toLocalLowerCase or at this stack overflow answer
[https://stackoverflow.com/a/19031612]. If the locale setting 'TR' is used
in these examples then the lowercase version of `İ` is only `LATIN SMALL
LETTER I`.

Now to the possible solution:

Replying to [comment:2 Luis Nell]:
>...I also thought about looping over the characters and simply removing
everything that does not match the builtin slug validation regular
expressions...

As far as I understand it this is mostly what happens by changing the
placement of `lower()` the way you suggests.
`re.sub(r'[^\w\s-]', '', value)` is removing all symbols that are not
standard Unicode characters or spaces which is almost the same regexp as
the slug is then validated against. As previously discovered the problem
is when the `lower()` operation then add new symbols that are not allowed
by the regexp. I would therefore argue that moving `lower()` is a decent
solution because it will make the generated slug validate as long as the
validation regexp is the same as now. I would however make the case for
moving the `lower()` operation to a different place since Unicode
documentation
[https://www.unicode.org/versions/Unicode12.0.0/UnicodeStandard-12.0.pdf]
states that normalization is not kept during casing operations.

Casing operations as defined in Section 3.13, Default Case Algorithms
are not guaranteed to
preserve Normalization Forms. That is, some strings in a particular
Normalization Form
(for example, NFC) will no longer be in that form after the casing
operation is performed.

Therefore I would argue that it would be better to place the lower
operation over the normalization as follows:
{{{#!python
value = str(value).lower()
if allow_unicode:
value = unicodedata.normalize('NFKC', value)
else:
value = unicodedata.normalize('NFKD', value).encode('ascii',
'ignore').decode('ascii')
value = re.sub(r'[^\w\s-]', '', value).strip()
return re.sub(r'[-\s]+', '-', value)
}}}

This way the string is lower cased then normalized to keep as much of the
special characters as possible and then the remaining symbols are removed
with the regexp substitution.

I guess this could in theory lead to unintended different meaning of words
but I don't know if it would be feasible to do this kind of string
manipulation with guaranteed preserved meaning.

I have started to prepare a patch with my proposed change as well as tests
so I have assigned this issue to me for now. My intention is to submit a
patch later this week that could be tested by a few others maybe to check
for further issues. Luis Nell, if you feel that you would rather write the
patch yourself and that I overstepped by assigning myself, just claim the
issue for yourself no hard feelings on my part :)

--
Ticket URL: <https://code.djangoproject.com/ticket/30892#comment:3>

Django

unread,

Dec 10, 2019, 9:31:48 AM12/10/19

to django-...@googlegroups.com

#30892: slugify() doesn't return a valid slug for "İ".
-------------------------------------+-------------------------------------
Reporter: Luis Nell | Owner:
| Christoffer Sjöbergsson
Type: Bug | Status: assigned
Component: Utilities | Version: master
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------------+-------------------------------------

Comment (by Luis Nell):

Replying to [comment:3 Christoffer Sjöbergsson]:

Thanks for looking further into this. The part about the special casing is
very enlightening and confirms my view being a tad too naive haha

> I have started to prepare a patch with my proposed change as well as
tests so I have assigned this issue to me for now. My intention is to
submit a patch later this week that could be tested by a few others maybe
to check for further issues. Luis Nell, if you feel that you would rather
write the patch yourself and that I overstepped by assigning myself, just
claim the issue for yourself no hard feelings on my part :)

By no means are you overstepping. Please, if you can find the time, go
ahead. I was just offering my time because I know how stressful
maintaining can be and how nice it is to not have to fix "yet another
bug". In case you can't find the time, just let me know.

Again, thanks for looking into this!

--
Ticket URL: <https://code.djangoproject.com/ticket/30892#comment:4>

Django

unread,

Dec 21, 2019, 7:24:57 AM12/21/19

to django-...@googlegroups.com

#30892: slugify() doesn't return a valid slug for "İ".
-------------------------------------+-------------------------------------
Reporter: Luis Nell | Owner:
| Christoffer Sjöbergsson
Type: Bug | Status: assigned
Component: Utilities | Version: master
Severity: Normal | Resolution:
Keywords: | Triage Stage: Accepted

Has patch: 1 | Needs documentation: 0

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------------+-------------------------------------
Changes (by Christoffer Sjöbergsson):

* has_patch: 0 => 1

Comment:

This far there are two different but similar fixes. Fix 1 proposed by Luis
Nell to move the `lower()` operation to before the re.sub like so

{{{#!python
re.sub(r'[^\w\s-]', '', value.lower()).strip()
}}}

The second idea that I proposed were to move the `lower()` operation
before the how if statement and as such place it before the normalization
operation.

I have now done some further testing and my conclusion is that the first
alternative seems to be bests. Both solutions does fix the bug in question
but the second
solution seems to change the output of strings quite a bit when they
contain certain characters. I would assume that it would be better to fix
the bug without altering the behavior too much from the original behavior.
As far as I can tell the first solution does not alter the output for any
other characters but the intended one.

I have created a pull request. It can be found here
[https://github.com/django/django/pull/12237]
That one have a quite small set of tests that should cover most of the
issues I have found.
I have also created a separate branch in my fork of Django with an
extended set of tests that I have made to test a large number of
characters in a systematic way. They are however both slow and quite ugly
so I will not include them in the pull request. You can check them out
here
[https://github.com/Sjbrgsn/django/blob/ticket_30892_extended_tests/tests/model_fields/test_slugfield.py]

So right now I have not been able to find any unwanted behavior with the
solution in the patch. But there might very well be some so feel free to
test and report any issues.

--
Ticket URL: <https://code.djangoproject.com/ticket/30892#comment:5>

Django

unread,

Dec 31, 2019, 1:56:58 AM12/31/19

to django-...@googlegroups.com

#30892: slugify() doesn't return a valid slug for "İ".
-------------------------------------+-------------------------------------
Reporter: Luis Nell | Owner:
| Christoffer Sjöbergsson

Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0

-------------------------------------+-------------------------------------
Changes (by Mariusz Felisiak <felisiak.mariusz@…>):

* status: assigned => closed
* resolution: => fixed

Comment:

In [changeset:"b2bd08bb7a912a1504f5fb5018f5317e6b5423cd" b2bd08bb]:
{{{
#!CommitTicketReference repository=""
revision="b2bd08bb7a912a1504f5fb5018f5317e6b5423cd"
Fixed #30892 -- Fixed slugify() and admin's URLify.js for "İ".

Thanks Luis Nell for the implementation idea and very detailed report.

Co-Authored-By: Mariusz Felisiak <felisiak...@gmail.com>
}}}

--
Ticket URL: <https://code.djangoproject.com/ticket/30892#comment:6>

Reply all

Reply to author

Forward