[Django] #33218: slugify() can't handle Turkish İ while allow_unicode = True


Django

Oct 21, 2021, 12:54:34 PM
to django-...@googlegroups.com
#33218: slugify() can't handle Turkish İ while allow_unicode = True
--------------------------------------+-------------------------
Reporter: sowinski | Owner: nobody
Type: Bug | Status: new
Component: CSRF | Version: dev
Severity: Normal | Keywords: slugify
Triage Stage: Unreviewed | Has patch: 0
Needs documentation: 0 | Needs tests: 0
Patch needs improvement: 0 | Easy pickings: 0
UI/UX: 0 |
--------------------------------------+-------------------------
Please see the following example.
The first character of **test_str = "i̇zmit"** is not a normal i. It is the
**İ** from the Turkish alphabet.

Using allow_unicode=True should keep the Turkish **İ** instead of
replacing it with a normal i.

{{{
import re
import unicodedata


def slugify(value, allow_unicode=False):
    """
    Convert to ASCII if 'allow_unicode' is False. Convert spaces or repeated
    dashes to single dashes. Remove characters that aren't alphanumerics,
    underscores, or hyphens. Convert to lowercase. Also strip leading and
    trailing whitespace, dashes, and underscores.
    """
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize('NFKC', value)
    else:
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value.lower())
    return re.sub(r'[-\s]+', '-', value).strip('-_')


test_str = "i̇zmit"

output = slugify(test_str, allow_unicode=True)

print(test_str)
print(output)
print(test_str == output)
}}}

--
Ticket URL: <https://code.djangoproject.com/ticket/33218>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.

Django

Oct 21, 2021, 1:33:12 PM
to django-...@googlegroups.com
#33218: slugify() can't handle Turkish İ while allow_unicode = True
---------------------------+--------------------------------------
Reporter: sowinski | Owner: nobody
Type: Bug | Status: closed
Component: Utilities | Version: dev
Severity: Normal | Resolution: invalid
Keywords: slugify | Triage Stage: Unreviewed
Has patch: 0 | Needs documentation: 0
Needs tests: 0 | Patch needs improvement: 0
Easy pickings: 0 | UI/UX: 0
---------------------------+--------------------------------------
Changes (by Mariusz Felisiak):

* status: new => closed
* resolution: => invalid
* component: CSRF => Utilities


Comment:

It's not about 'İ' but about '̇' which is the second character. IMO,
`slugify()` properly removes '̇', see:
{{{
>>> test_str = "i̇zmit"
>>> output = slugify(test_str, allow_unicode=True)
>>> for x, y in enumerate(test_str):
...     print(y, output[x], y == output[x])
...
i i True
̇ z False
z m False
m i False
i t False
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
IndexError: string index out of range
}}}
See also related ticket #30892 about "İ".
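
As a quick check of the explanation above, the following sketch (not part of the ticket) inspects the code points involved using only the standard library. It assumes the reporter's test string came from lowercasing 'İ' (U+0130), which the ticket does not state explicitly; the variable names are illustrative.

```python
import re
import unicodedata

# Lowercasing U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE) yields two
# code points: a plain 'i' plus a combining mark.
lowered = "\u0130".lower()
described = [
    (f"U+{ord(ch):04X}", unicodedata.category(ch), unicodedata.name(ch))
    for ch in lowered
]
for row in described:
    print(*row)
# U+0069 Ll LATIN SMALL LETTER I
# U+0307 Mn COMBINING DOT ABOVE

# \w matches letters, digits, and the underscore, but not combining marks
# (category Mn), so [^\w\s-] strips U+0307 and leaves the plain 'i' behind.
mark_is_word_char = re.fullmatch(r"\w", "\u0307") is not None
print(mark_is_word_char)  # False
```

This is consistent with the session above: the input has six code points, the slug has five, and the comparison goes out of step from the second character onward.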

--
Ticket URL: <https://code.djangoproject.com/ticket/33218#comment:1>

Django

Oct 22, 2021, 3:42:09 AM
to django-...@googlegroups.com
#33218: slugify() can't handle Turkish İ while allow_unicode = True

Comment (by sowinski):

Thank you for the fast response.

I do not agree. Because of this behavior, it would be impossible to create
an article for İstanbul, the largest city in Turkey, while
allow_unicode=True.
https://tr.wikipedia.org/wiki/%C4%B0stanbul

Maybe someone else has an international website and will hit this problem.

I solved the problem by adding the i̇ (an i followed by the combining dot)
to the character class in the regular expression.

{{{
value = re.sub(r'[^\w\si̇-]', '', value.lower())
}}}

I tested the implementation with all cities in the world, with all the
different language variants of each city name, and it worked for me.
http://www.geonames.org/

It is interesting to see that this is the only edge case. I am not sure
whether this will work in all situations, so I only run my modification if
the strange i is in the string. Otherwise it falls back to the Django
implementation.

See: https://github.com/wagtail/wagtail/issues/7637#issuecomment-949366560
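
The fallback approach described in this comment could be sketched roughly as follows. This is not the reporter's actual code (that lives in the linked Wagtail issue) and it is not part of Django: `slugify_keep_dotted_i` and `DOTTED_I` are hypothetical names, and `slugify` here is a copy of the function from the ticket description so the example runs on its own.

```python
import re
import unicodedata

# 'i' followed by U+0307 COMBINING DOT ABOVE, i.e. what "İ".lower() produces.
DOTTED_I = "i\u0307"


def slugify(value, allow_unicode=False):
    """Copy of the slugify() shown in the ticket description."""
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize("NFKC", value)
    else:
        value = (
            unicodedata.normalize("NFKD", value)
            .encode("ascii", "ignore")
            .decode("ascii")
        )
    value = re.sub(r"[^\w\s-]", "", value.lower())
    return re.sub(r"[-\s]+", "-", value).strip("-_")


def slugify_keep_dotted_i(value):
    """Hypothetical wrapper: keep U+0307 only when a dotted i is present."""
    lowered = unicodedata.normalize("NFKC", str(value)).lower()
    if DOTTED_I not in lowered:
        # No dotted i: defer to the unmodified implementation.
        return slugify(value, allow_unicode=True)
    # Same pattern as the ticket's, with U+0307 added to the allowed set.
    kept = re.sub(r"[^\w\s\u0307-]", "", lowered)
    return re.sub(r"[-\s]+", "-", kept).strip("-_")


print(slugify("İzmit", allow_unicode=True) == "izmit")      # True
print(slugify_keep_dotted_i("İzmit") == "i\u0307zmit")      # True
print(slugify_keep_dotted_i("Hello World"))                 # hello-world
```

Note that in this sketch the modified pattern lets U+0307 through anywhere in the string, not only directly after an i, which is one reason to restrict it to inputs that actually contain the dotted i.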

--
Ticket URL: <https://code.djangoproject.com/ticket/33218#comment:2>
