With allow_unicode=True, slugify() should keep the Turkish **İ** instead of
replacing it with a plain i.
{{{
import unicodedata
import re


def slugify(value, allow_unicode=False):
    """
    Convert to ASCII if 'allow_unicode' is False. Convert spaces or repeated
    dashes to single dashes. Remove characters that aren't alphanumerics,
    underscores, or hyphens. Convert to lowercase. Also strip leading and
    trailing whitespace, dashes, and underscores.
    """
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize('NFKC', value)
    else:
        value = unicodedata.normalize('NFKD', value).encode('ascii', 'ignore').decode('ascii')
    value = re.sub(r'[^\w\s-]', '', value.lower())
    return re.sub(r'[-\s]+', '-', value).strip('-_')

test_str = "i̇zmit"
output = slugify(test_str, allow_unicode = True)
print(test_str)
print(output)
print(test_str == output)
}}}
--
Ticket URL: <https://code.djangoproject.com/ticket/33218>
Django <https://code.djangoproject.com/>
The Web framework for perfectionists with deadlines.
* status: new => closed
* resolution: => invalid
* component: CSRF => Utilities
Comment:
It's not about 'İ' but about '̇' (U+0307 COMBINING DOT ABOVE), which is the
second character. IMO, `slugify()` properly removes '̇', see:
{{{
>>> test_str = "i̇zmit"
>>> output = slugify(test_str, allow_unicode = True)
>>> for x, y in enumerate(test_str):
...     print(y, output[x], y == output[x])
i i True
̇ z False
z m False
m i False
i t False
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
IndexError: string index out of range
}}}
See also related ticket #30892 about "İ".
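
To make the otherwise invisible second character easier to see, here is a
small sketch that lists the code points of the test string. The helper name
`describe` is made up for this illustration and is not part of Django, and
the string is written with a `\u0307` escape on the assumption that it
matches the one pasted above.

```python
import re
import unicodedata

# "i" followed by U+0307 COMBINING DOT ABOVE, spelled with an escape so the
# combining mark is visible in the source (assumed to match the ticket's string).
test_str = "i\u0307zmit"


def describe(text):
    """Return (code point, Unicode name, general category) per character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
        for ch in text
    ]


for row in describe(test_str):
    print(row)

# Lower-casing the capital "İ" (U+0130) produces the same two-character sequence.
assert "\u0130".lower() == "i\u0307"

# U+0307 has category Mn (nonspacing mark); \w does not match it, so the
# character class used by slugify() strips it.
assert re.sub(r"[^\w\s-]", "", test_str) == "izmit"
```

The string has six characters but only five survive, which is why the loop
above runs past the end of `output`.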
--
Ticket URL: <https://code.djangoproject.com/ticket/33218#comment:1>
Comment (by sowinski):
Thank you for the fast response.
I do not agree. Because of this behavior, it would be impossible to create
an article for İstanbul, the largest city in Turkey, with allow_unicode=True.
https://tr.wikipedia.org/wiki/%C4%B0stanbul
Maybe someone else has an international website and will hit this problem.
I solved the problem by adding the dotted i (i plus the combining dot above)
to the regular expression.
{{{
value = re.sub(r'[^\w\si̇-]', '', value.lower())
}}}
I tested the implementation with all cities in the world, using all the
different language variants of each city name, and it worked for me.
http://www.geonames.org/
It is interesting to see that this is the only edge case. I am not sure this
will work in all situations, so I run my modification only if the dotted
i is in the string. Otherwise I fall back to the Django implementation.
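
A minimal sketch of the conditional workaround described above, assuming the
`slugify()` body quoted earlier in this ticket. The names `slugify_reference`
and `slugify_keep_dotted_i` are hypothetical, and the character class uses a
`\u0307` escape instead of the literal dotted i. This is one possible reading
of the approach, not the code from the linked Wagtail comment.

```python
import re
import unicodedata

COMBINING_DOT_ABOVE = "\u0307"


def slugify_reference(value, allow_unicode=False):
    """Copy of the slugify() body quoted in this ticket."""
    value = str(value)
    if allow_unicode:
        value = unicodedata.normalize("NFKC", value)
    else:
        value = (
            unicodedata.normalize("NFKD", value)
            .encode("ascii", "ignore")
            .decode("ascii")
        )
    value = re.sub(r"[^\w\s-]", "", value.lower())
    return re.sub(r"[-\s]+", "-", value).strip("-_")


def slugify_keep_dotted_i(value):
    """Hypothetical wrapper: keep U+0307 only when the input contains it."""
    normalized = unicodedata.normalize("NFKC", str(value)).lower()
    if COMBINING_DOT_ABOVE not in normalized:
        # No dotted i involved: use the unmodified implementation.
        return slugify_reference(value, allow_unicode=True)
    # Same steps as above, but U+0307 is added to the allowed characters.
    cleaned = re.sub(r"[^\w\s\u0307-]", "", normalized)
    return re.sub(r"[-\s]+", "-", cleaned).strip("-_")


print(slugify_reference("\u0130stanbul", allow_unicode=True))
print(slugify_keep_dotted_i("\u0130stanbul"))
```

Note that this variant keeps U+0307 wherever it survives normalization, not
only after an i, so it is broader than the Turkish case alone.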
See: https://github.com/wagtail/wagtail/issues/7637#issuecomment-949366560
--
Ticket URL: <https://code.djangoproject.com/ticket/33218#comment:2>