SlugField utf-8 support

Viktor

Apr 13, 2006, 4:59:49 PM
to django-d...@googlegroups.com
The story of non-ASCII characters in slugs started in #1602: urlify.js
blocks out non-English chars.

We tested utf-8 urls against ie, opera and konqueror and they work
great. Firefox by default shows the urls urlencoded, but you can change:
network.standard-url.escape-utf8 to false
and it will show the url in original encoding.


So I changed the validator for the SlugField so it can accept utf-8 strings.

Here is the patch.

slug.patch

gabor

Apr 13, 2006, 5:31:25 PM
to django-d...@googlegroups.com

could you please describe some use-cases for this feature?

because for me it seems a little...strange..
the whole point of the Slug is to only contain alphanumeric (as in
ascii-alphanumeric) characters + underscore + hyphen.

with the utf-8 change it can contain underscore+hyphen + all unicode
symbols that are considered as alphanumeric...

but if you allow so many symbols, why use a SlugField at all? why
not allow all the unicode characters?

i'm pretty sure i'm missing something there... :-(


gabor

Viktor

Apr 13, 2006, 5:58:34 PM
to django-d...@googlegroups.com
The problem begins when you want to have a non-English slug (a slug in a
language that uses non-ASCII characters).
For example, if I want to have a Serbian Cyrillic slug: наслов_на_српском,
it is a perfectly valid slug if you read it in Serbian, just alphanumeric
characters + underscore + hyphen, nothing more... But the current isSlug
validator does not accept it.
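
A minimal sketch of the difference, written for Python 3 rather than the Django code of the time. It is not the attached patch: the function names are hypothetical, and it assumes the existing check is an ASCII-only character class. It only shows that a Unicode-aware `\w` accepts the Cyrillic example where the ASCII class rejects it.

```python
import re

# ASCII-only slug pattern: letters, digits, underscore, hyphen.
ASCII_SLUG_RE = re.compile(r"[-a-zA-Z0-9_]+")

# Unicode-aware pattern: in Python 3, \w on a str matches letters and
# digits from any script, plus underscore.
UNICODE_SLUG_RE = re.compile(r"[-\w]+")


def is_ascii_slug(value: str) -> bool:
    # fullmatch() requires the whole string to fit the pattern.
    return ASCII_SLUG_RE.fullmatch(value) is not None


def is_unicode_slug(value: str) -> bool:
    return UNICODE_SLUG_RE.fullmatch(value) is not None


print(is_ascii_slug("наслов_на_српском"))    # False
print(is_unicode_slug("наслов_на_српском"))  # True
print(is_unicode_slug("not a slug!"))        # False
```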

gabor

Apr 13, 2006, 6:08:19 PM
to django-d...@googlegroups.com
Viktor wrote:
> The problem begins when you want to have a non-English slug (a slug in a
> language that uses non-ASCII characters).
> For example, if I want to have a Serbian Cyrillic slug: наслов_на_српском,
> it is a perfectly valid slug if you read it in Serbian, just alphanumeric
> characters + underscore + hyphen, nothing more... But the current isSlug
> validator does not accept it.

i see...

seems that it boils down to what you call 'valid'...
i had the feeling that slugs were meant to be used as parts of URLs...
and i'm sure you know that those cyrillic slugs, if part of an url will
get submitted in URLencoded form. even if it is hidden by the browser.


anyway...maybe it would be a good idea to add a boolean setting that
limits the slug to ascii (the old behaviour) ... like

slug = SlugField(only_ascii=True),

or something like that, because i can imagine that some people want to
restrict to that.
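
a rough sketch of how such a flag could behave, in plain Python 3. `only_ascii` is just the hypothetical name from this message, not an existing option, and defaulting it to True (the old behaviour) is an assumption.

```python
import re

ASCII_SLUG = re.compile(r"[-a-zA-Z0-9_]+")
UNICODE_SLUG = re.compile(r"[-\w]+")


def validate_slug(value: str, only_ascii: bool = True) -> str:
    """Return value unchanged if it is an acceptable slug, else raise.

    only_ascii is the hypothetical flag proposed in this thread;
    True keeps the old ASCII-only behaviour.
    """
    pattern = ASCII_SLUG if only_ascii else UNICODE_SLUG
    if pattern.fullmatch(value) is None:
        raise ValueError(f"not a valid slug: {value!r}")
    return value


# Old behaviour: ASCII slugs pass, anything else is rejected.
validate_slug("my-first-post")

# Opting out of the restriction lets the Cyrillic slug through.
validate_slug("наслов_на_српском", only_ascii=False)
```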


p.s: to make sure we understand each other. i understand the need for
being able to create non-english links/urls. i'm just not sure if a Slug
is the best place for it.

p.s.2: btw. are non-english URLs common/widespread nowadays?

gabor

Viktor

Apr 13, 2006, 7:52:31 PM
to django-d...@googlegroups.com
gabor wrote:
> and i'm sure you know that those cyrillic slugs, if part of an url will
> get submitted in URLencoded form. even if it is hidden by the browser.

Of course, every requested URL must be URL-encoded, but that is the browser's
job, and they all do it quite well.


> slug = SlugField(only_ascii=True),

+1 from me...


> p.s.2: btw. are non-english URLs common/widespread nowadays?

Yes, they are (more and more every day)... (for example Wikipedia uses
them: http://sr.wikipedia.org/wiki/Главна_страна).

James Bennett

Apr 13, 2006, 10:34:06 PM
to django-d...@googlegroups.com
On 4/13/06, Viktor <alef...@gmail.com> wrote:
> Of course, every requested URL must be URL-encoded, but that is the browser's
> job, and they all do it quite well.

Except that that runs contrary to the purpose of the slug. See below.

> Yes, they are (more and more every day)... (for example Wikipedia uses
> them: http://sr.wikipedia.org/wiki/Главна_страна).

Even when I copy/paste that URL (which I had to do, because my
web-based email client didn't recognize that the Cyrillic portion of
it was part of the URL) into Firefox, the URL encoding turns
"Главна_страна" into
"%D0%93%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B0".

Which is a problem, because really the whole idea behind a URL slug is
to provide some useful information about the page which lives at that
URL. But a long string of URL-encoded characters doesn't convey any
more useful information than the sorts of
"index.php?ar=Q76&pqb=r33857326&bdz=efweofgh" gibberish that was and
still is so common on the Web. Given that, what is the advantage of
UTF-8 URL slugs?

--
"May the forces of evil become confused on the way to your house."
-- George Carlin

Julio Nobrega

Apr 13, 2006, 11:39:02 PM
to django-d...@googlegroups.com
On 4/13/06, James Bennett <ubern...@gmail.com> wrote:
> Even when I copy/paste that URL (which I had to do, because my
> web-based email client didn't recognize that the Cyrillic portion of
> it was part of the URL) into Firefox, the URL encoding turns
> "Главна_страна" into
> "%D0%93%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B0".

Firefox also did this to me, but not Opera. IE turned it into this:
http://sr.wikipedia.org/wiki/??????_?????? (yes, question marks) and
Wikipedia loaded its index page. But I suppose a language pack
resolves this.

>Given that, what is the advantage of
> UTF-8 URL slugs?

Because that's what people around the world use :)

Maybe not everyone today, maybe even not next year, but it doesn't
hurt to have the support there.

The only place where you can register a domain in Brazil, registro.br,
has changed its system to accept more characters, using something
called ACE and IDNA.

RFC here: ftp://ftp.registro.br/rfc/rfc3490.txt

registro.br announcement here: http://registro.br/anuncios/20050504.html

I have no idea if IDNA covers UTF, I didn't even read the RFC, just
showing that there's interest in URLs that use other characters.

--
Julio Nobrega - http://www.inerciasensorial.com.br

Ivan Sagalaev

Apr 14, 2006, 3:06:11 AM
to django-d...@googlegroups.com
James Bennett wrote:

>Even when I copy/paste that URL (which I had to do, because my
>web-based email client didn't recognize that the Cyrillic portion of
>it was part of the URL) into Firefox, the URL encoding turns
>"Главна_страна" into
>"%D0%93%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B0".
>
>Which is a problem,
>

I agree with James.

This is exactly THE problem, I'd say. Non-ASCII URLs break all
the time all over the web because there are too many applications that
don't understand them. The browser's location bar is an exception. And in
case of Firefox am I supposed to stick a big red banner on start page
saying "To see this page properly in your browser go to
about:config..."? :-)

Viktor

Apr 14, 2006, 4:35:48 AM
to django-d...@googlegroups.com
James Bennett wrote:
> Even when I copy/paste that URL (which I had to do, because my
> web-based email client didn't recognize that the Cyrillic portion of
> it was part of the URL) into Firefox, the URL encoding turns
> "Главна_страна" into
> "%D0%93%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B0".

That is only in firefox (they said that urlencoding=false would be
default in FF 1.5, but it still isn't :-/). Every other browser supports
it well.

But like Julio said:
> Maybe not everyone today, maybe even not next year, but it doesn't
> hurt to have the support there.

It will be common in the future, and I suppose you are not making Django just
for today... ;)

Gabor's suggestion of only_ascii, set to True by default, can solve
the problem of today and leave the door open for tomorrow.

gabor

Apr 14, 2006, 4:35:16 AM
to django-d...@googlegroups.com
Ivan Sagalaev wrote:
> And in
> case of Firefox am I supposed to stick a big red banner on start page
> saying "To see this page properly in your browser go to
> about:config..."? :-)
>
as i understand, that's not required. firefox will correctly go to the
required url. it just will display it url-encoded...

example

http://www.example.com/gábor.

every browser should-submit/submits the request to the webserver as:

http://www.example.com/g%C3%A1bor

but the difference is that the location-bar in safari will show
http://www.example.com/gábor

and in firefox will show

http://www.example.com/g%C3%A1bor

(unless you switch that setting)
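
the same round trip can be reproduced with Python 3's standard library. this is only an illustration of the encoding step, not of what any particular browser does internally:

```python
from urllib.parse import quote, unquote

path = "/gábor"

# quote() encodes the text as UTF-8 by default and percent-escapes
# every byte outside the unreserved set; "/" is left alone.
encoded = quote(path)
print(encoded)           # /g%C3%A1bor

# unquote() reverses it, which is roughly what a location bar does
# when it shows the readable form instead of the escaped one.
print(unquote(encoded))  # /gábor
```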


gabor

Todd O'Bryan

Apr 14, 2006, 6:36:11 AM
to django-d...@googlegroups.com
On Apr 13, 2006, at 10:34 PM, James Bennett wrote:

> Which is a problem, because really the whole idea behind a URL slug is
> to provide some useful information about the page which lives at that
> URL.

And if your users aren't speakers of languages that are normally
represented in ASCII text, limiting your URLs to ASCII makes that
impossible.

Todd
(who's hoping to spend a lot of time with Django this summer, after
school lets out and I can stop grading stuff -- 28 more school days
-- woohoo!)

Jeroen Ruigrok van der Werven

Apr 14, 2006, 11:43:43 AM
to django-d...@googlegroups.com
On 4/13/06, Viktor <alef...@gmail.com> wrote:
> We tested utf-8 urls against ie, opera and konqueror and they work
> great. Firefox by default shows the urls urlencoded, but you can change:
> network.standard-url.escape-utf8 to false
> and it will show the url in original encoding.

Using this with http://sr.wikipedia.org/wiki/Главна_страна didn't work
for me with Firefox 1.5.0.1.
Opera 9p2 works perfectly for this though.

> So I changed the validator for the SlugField so it can accept utf-8 strings.

I am all for this.

At the moment, languages written in Latin-derived scripts do not get proper
slugs, aside from English. In my native language we use tremas (¨) for some
words and constructs as well as various other accents, just like
German and French, to name two huge examples.

Let alone if we start talking about other scripts entirely.

--
Jeroen Ruigrok van der Werven

Jeroen Ruigrok van der Werven

Apr 14, 2006, 11:59:42 AM
to django-d...@googlegroups.com
On 4/14/06, James Bennett <ubern...@gmail.com> wrote:
> Which is a problem, because really the whole idea behind a URL slug is
> to provide some useful information about the page which lives at that
> URL. But a long string of URL-encoded characters doesn't convey any
> more useful information than the sorts of
> "index.php?ar=Q76&pqb=r33857326&bdz=efweofgh" gibberish that was and
> still is so common on the Web. Given that, what is the advantage of
> UTF-8 URL slugs?

I would sooner blame your setup or software for not properly
supporting such links.

http://ja.wikipedia.org/wiki/京都 is a perfectly valid URL nowadays, and
anyone a bit versed in Japanese would immediately recognise
Kyouto. And that's why UTF-8 URLs are useful: you can keep your slugs
in the language you want. For me, I use multiple languages on my
weblog and would love these to be reflected in UTF-8 for every slug I
use.

With all due respect, the world is much larger than English. Who are
we to dictate their slugs are to be encoded in ASCII only?

(Heck, even my text-based browsers w3m and links can open such URLs. :))

James Bennett

Apr 14, 2006, 12:55:50 PM
to django-d...@googlegroups.com
On 4/14/06, Jeroen Ruigrok van der Werven <ashe...@gmail.com> wrote:
> I would sooner blame your setup or software for not properly
> supporting such links.

The thing is, this *is* "proper support". The URL is still accessed
correctly, and the page is displayed correctly. But the URL string
itself is displayed encoded in some browsers, and this is arguably a
useful thing -- as I'd hope we're all aware, there have been serious
security issues with displaying unencoded non-ASCII URLs in the past.

> With all due respect, the world is much larger than English. Who are
> we to dictate their slugs are to be encoded in ASCII only?

There are really two different issues going on here, and you're
arguing about the other one. While I personally think there are
usability problems with UTF-8 URLs, if people want them then Django
should support them. I'm not trying to argue that everyone should be
forced to use ASCII.

However, the other issue (which was mentioned in the original post to
this thread) is that the JavaScript "URLify" function which
automatically generates slugs based on the 'prepopulate_from'
attribute doesn't handle UTF-8. And I think it ought to stay that way,
for one simple reason: it'd be impossible to make it truly support
UTF-8.

To see why, remember that URLify doesn't just lowercase all the words,
kill the punctuation and replace spaces with hyphens. It also
strategically drops out common English words that don't have any place
in the slug: "the", "an", "this", etc. Admittedly we're already in
trouble because we only do that with English; we don't drop "le",
"un", "cela", etc. in French, for example. Opening up to anything in
UTF-8 would only exacerbate the problem, because then we'd have an
even bigger can of worms. Should a slug allow Japanese "particles"?
Which ones? Should pronouns be dropped from Greek slugs, since they
can be deduced from the case, gender and number of the verbs? Who
would decide this?

We'd need a whole new i18n system just to make the URLify function
behave properly for the languages we support (and the existing
gettext-like jsi18n stuff wouldn't work, because the set of excluded
words would not be the same for each language -- rather than a single
set of words with translations for each language, we'd need a separate
set of words for every single language).
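
To make the shape of that problem concrete, here is a rough Python 3 sketch of the behaviour described above with the excluded words keyed by language. It is not the actual urlify.js: the `STOP_WORDS` table and the `lang` parameter are hypothetical, and the word lists contain only the examples mentioned in this message.

```python
import re

# Hypothetical per-language stop-word sets. Each language needs its own
# independent list; these are not translations of one another.
STOP_WORDS = {
    "en": {"the", "an", "this"},
    "fr": {"le", "un", "cela"},
}


def urlify(title: str, lang: str = "en") -> str:
    """Lowercase, drop punctuation and stop words, join with hyphens."""
    # Keep word characters, whitespace and hyphens; strip everything else.
    cleaned = re.sub(r"[^\w\s-]", "", title.lower())
    stop = STOP_WORDS.get(lang, set())
    kept = [word for word in cleaned.split() if word not in stop]
    return "-".join(kept)


print(urlify("This is the Answer!"))              # is-answer
print(urlify("Le chat et un chien", lang="fr"))   # chat-et-chien
print(urlify("Le chat et un chien", lang="en"))   # le-chat-et-un-chien
```

The last call shows the failure mode: with the wrong word list, none of the French articles are dropped.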

And what about scripts where there's no concept of "lowercase"? So far
as I know, JavaScript's regular-expression system doesn't support
matching a generic "Unicode uppercase character" (and I'm not sure
about Python; I know you can do it in Perl, though), so that's another
language-specific switch to implement.
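
For what it's worth, a small check in Python 3 (an assumption; this says nothing about JavaScript regular expressions) suggests the standard library can at least classify case without a per-script switch, via the Unicode general category:

```python
import unicodedata


def is_uppercase_letter(ch: str) -> bool:
    # "Lu" is the Unicode general category for uppercase letters.
    return unicodedata.category(ch) == "Lu"


print(is_uppercase_letter("Г"))   # Cyrillic capital letter: True
print(is_uppercase_letter("г"))   # Cyrillic small letter: False
print(is_uppercase_letter("京"))  # script without case: False

# lower() converts cased scripts and leaves caseless ones unchanged.
print("Главна_страна".lower())    # главна_страна
print("京都".lower())             # 京都
```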

So I don't see any advantages to making the URLify function support
UTF-8, and see only headaches if we actually try to do it. If you're
using a non-ASCII script, that means prepopulate_from probably isn't
going to help you any and you'll have to fill in slug fields yourself.
But I think, given the disadvantages of trying to do auto-populated
UTF-8 slugs, that it's the better solution.

Max Battcher

Apr 14, 2006, 2:24:08 PM
to django-d...@googlegroups.com
Viktor wrote:
> James Bennett wrote:
>> Even when I copy/paste that URL (which I had to do, because my
>> web-based email client didn't recognize that the Cyrillic portion of
>> it was part of the URL) into Firefox, the URL encoding turns
>> "Главна_страна" into
>> "%D0%93%D0%BB%D0%B0%D0%B2%D0%BD%D0%B0_%D1%81%D1%82%D1%80%D0%B0%D0%BD%D0%B0".
>
> That is only in firefox (they said that urlencoding=false would be
> default in FF 1.5, but it still isn't :-/). Every other browser supports
> it well.
>

IIRC, it _was_ the default in Firefox 1.5, but the default was quickly
switched back in a security patch for English-language builds, as they
were given too much flak about the phishing sites that made evil use of
it: English speakers were too easily confused into thinking
things like www.pāýpàļ.com (random name, I don't know if it actually was
one of the ones involved) were www.paypal.com when clicking links in
their email.
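
As a rough illustration of why such names are confusable, the Python 3 sketch below strips accents and compares the result with a trusted name. The helper names are hypothetical, and this is an assumption-laden toy: it only catches accent-based lookalikes like the made-up one above, not cross-script ones (a Cyrillic "а" has no decomposition to a Latin "a").

```python
import unicodedata


def strip_marks(text: str) -> str:
    """Decompose characters and drop combining marks (accents)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def looks_like(host: str, trusted: str) -> bool:
    """True if host differs from trusted only by accents or other marks."""
    return host != trusted and strip_marks(host) == trusted


print(strip_marks("www.pāýpàļ.com"))                   # www.paypal.com
print(looks_like("www.pāýpàļ.com", "www.paypal.com"))  # True
print(looks_like("www.paypal.com", "www.paypal.com"))  # False
```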


--
--Max Battcher--
http://www.worldmaker.net/
"I'm gonna win, trust in me / I have come to save this world / and in
the end I'll get the grrrl!" --Machinae Supremacy, Hero (Promo Track)
