#1602: urlify.js blocks out non-English chars


Petar Marić

Apr 8, 2006, 2:03:42 PM
to django-d...@googlegroups.com
Greetings to all djangonauts out there,

I suggest we change urlify.js in a way that will not block out
non-English chars. Instead it should translate them into their English
feel-a-like, as suggested in #1602.

However we've got 2 problems:
1. Should we translate chars to their sound-a-likes or look-a-likes?
Some examples:
Š --> sh or s?
Č --> ch or c?
ö --> oe or o?
ä --> ae or a?
ü --> ue or u?
Honestly, I'm more fond of look-a-likes (for those of you who haven't
had any coffee yet, they're the ones on the right ;) ).

2. What about languages that have way too many symbols in them, ie
Chinese, Japanese, Thai? AFAIK, in them one symbol represents a whole
word.
Should we include them also or leave them out? Or does someone out
there have a better solution?
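
To make the two options in point 1 concrete, here is a minimal Python
sketch (urlify.js itself is JavaScript; the names LOOK_ALIKE, SOUND_ALIKE
and slugify are hypothetical, and the tables cover only the five
characters listed above):

```python
import re

# Hypothetical tables: only the five characters from the list above.
LOOK_ALIKE = {"š": "s", "č": "c", "ö": "o", "ä": "a", "ü": "u"}
SOUND_ALIKE = {"š": "sh", "č": "ch", "ö": "oe", "ä": "ae", "ü": "ue"}


def slugify(title, table):
    """Lowercase, transliterate with `table`, then keep only [a-z0-9-]."""
    text = title.lower()
    # Replace each character that has an entry in the table; keep the rest.
    text = "".join(table.get(ch, ch) for ch in text)
    # Collapse every run of other characters into a single hyphen.
    text = re.sub(r"[^a-z0-9]+", "-", text)
    return text.strip("-")


print(slugify("Mačka", LOOK_ALIKE))   # macka
print(slugify("Mačka", SOUND_ALIKE))  # machka
```

Any character missing from the chosen table is still dropped by the
final regular expression, which is the behaviour #1602 complains about.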

Cheers,
--
Petar Marić
*e-mail: petar...@gmail.com
*mobile: +381 (64) 6122467

*icq: 224720322
*skype: petar_maric
*web: http://www.petarmaric.com/

Max Battcher

Apr 8, 2006, 2:48:58 PM
to django-d...@googlegroups.com
Petar Marić wrote:
> I suggest we change urlify.js in a way that will not block out
> non-English chars. Instead it should translate them into their English
> feel-a-like, as suggested in #1602.

Browsers now support Unicode URLs (albeit turned off by default in
English editions lately due to security issues), might it make more
sense just to allow non-English characters as is, if the said
nationality might reasonably be willing/able to type them into the
address bar?
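
As a rough illustration of what keeping the characters "as is" means on
the wire: a non-ASCII slug is sent as percent-encoded UTF-8, even if the
address bar displays it decoded. A small sketch using Python's standard
library (the slug value is just an example):

```python
from urllib.parse import quote, unquote

slug = "mačka"

# Percent-encode the slug as UTF-8, the way it would appear in a raw URL.
encoded = quote(slug)
print(encoded)           # ma%C4%8Dka

# Decoding gives back the original text.
print(unquote(encoded))  # mačka
```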

I don't have any needs along these lines (I'm just a stupid American
dealing solely with English applications), so the debate is merely
academic to me at this point.

--
--Max Battcher--
http://www.worldmaker.net/
"I'm gonna win, trust in me / I have come to save this world / and in
the end I'll get the grrrl!" --Machinae Supremacy, Hero (Promo Track)

Petar Marić

Apr 8, 2006, 3:59:33 PM
to django-d...@googlegroups.com
> Browsers now support Unicode URLs (albeit turned off by default in
> English editions lately due to security issues), might it make more
> sense just to allow non-English characters as is, if the said
> nationality might reasonably be willing/able to type them into the
> address bar?

That's true, but what about the filenames? I use slug as a filename
for ImageField like so:
[snip]
def save(self):
    # Based on nesh's fileutils: http://djangoutils.python-hosting.com/
    from contrib.utils.file import rename_by_field
    self.image = rename_by_field(self.image, self.slug)
    super(Product, self).save()
[/snip]
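
For readers without that helper at hand, the idea of naming a file after
the slug can be sketched as below. This is not nesh's rename_by_field;
filename_from_slug is a hypothetical stand-in that only computes the new
name and does not touch the file system:

```python
import os.path


def filename_from_slug(original_name, slug):
    """Return `slug` plus the original file's extension, lowercased.

    Hypothetical helper: any directory part of `original_name` is dropped.
    """
    _, ext = os.path.splitext(original_name)
    return slug + ext.lower()


print(filename_from_slug("IMG_0042.JPG", "crna-macka"))  # crna-macka.jpg
```

With an ASCII-only slug the resulting file name is safe on file systems
and web servers that handle non-ASCII names poorly.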

For this to work, I use a patched local copy of urlify.js:
[snip]
class Admin:
    js = (
        # Change the behaviour of urlify so it accepts Serbian letters
        '../media/admin/js/urlify.js',
    )
[/snip]

James Bennett

Apr 8, 2006, 11:10:58 PM
to django-d...@googlegroups.com
On 4/8/06, Petar Marić <petar...@gmail.com> wrote:
> I suggest we change urlify.js in a way that will not block out
> non-English chars. Instead it should translate them into their English
> feel-a-like, as suggested in #1602.

If we decide to go the sound-alike route, a good resource to start
from might be the Textpattern CMS (which is in the process of
transitioning to a BSD license, so we could eventually base something
on their work once that happens), which includes a file used for
transliterating URL slugs:

http://dev.textpattern.com/browser/releases/4.0.3/source/textpattern/lib/i18n-ascii.txt

Though as you point out that doesn't solve the problem of languages
for which there simply is no "plain ASCII" transliteration possible.
I'm inclined to say that in those cases we shouldn't try to
auto-generate a slug and should instead recommend that developers stay
away from prepopulate_from; it's convenient to have, to be sure, but
we'll never be able to produce a manageable system that can handle all
the languages people might be authoring content in.
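
One way to picture "don't auto-generate in those cases" is a helper that
gives up instead of emitting a mangled slug. A minimal Python sketch
follows; auto_slug is a hypothetical name, and stripping combining marks
after Unicode decomposition is an assumed stand-in for a real
transliteration table:

```python
import re
import unicodedata


def auto_slug(title):
    """Return an ASCII slug, or None if the title cannot be reduced to ASCII."""
    # Decompose accented letters into base letter + combining mark,
    # then drop the combining marks.
    text = unicodedata.normalize("NFKD", title.lower())
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Anything still outside ASCII has no automatic mapping here.
    if any(ord(ch) > 127 for ch in text):
        return None
    slug = re.sub(r"[^a-z0-9]+", "-", text).strip("-")
    return slug or None


print(auto_slug("Crna mačka"))  # crna-macka
print(auto_slug("二十日"))       # None
```

A None result would be the cue to leave the slug field empty and let the
author type one by hand.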

--
"May the forces of evil become confused on the way to your house."
-- George Carlin

Arthur

Apr 9, 2006, 3:34:55 AM
to django-d...@googlegroups.com
> > I suggest we change urlify.js in a way that will not block out
> > non-English chars. Instead it should translate them into their English
> > feel-a-like, as suggested in #1602.
>
> If we decide to go the sound-alike route, a good resource to start
> from might be the Textpattern CMS [...]
> http://dev.textpattern.com/browser/releases/4.0.3/source/textpattern/lib/i18n-ascii.txt

That list looks very nice to me, at least for those languages I know.
The problem with the look-alike route is that, at least for German,
people are used to what you call the sound-alike pattern. In the case
of German it's not so much a phonetic pattern as a transcription
convention. ("schwül/schwuel" <-> sweltering and "schwul"
<-> homosexual(♂) don't have the same meaning.)
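
The collision can be shown in a few lines of Python. This is only a
sketch: strip_diacritics stands in for a look-alike table, and the
German table lists just the three umlauts (both names are hypothetical):

```python
import unicodedata

# Hypothetical table for the German transcription convention (umlauts only).
GERMAN = str.maketrans({"ä": "ae", "ö": "oe", "ü": "ue"})


def strip_diacritics(text):
    """Look-alike route: decompose, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def german_transcription(text):
    """Transcription convention: ü -> ue and so on."""
    return text.translate(GERMAN)


print(strip_diacritics("schwül"))      # schwul  (same as the other word)
print(german_transcription("schwül"))  # schwuel (stays distinct)
```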

Unfortunately I don't know much about standards in this area. If there
are any proven ones, I'd vote to use them. Using Unicode in URLs would
be fine with me, but I don't know how well browsers, proxies, servers
and search engines handle them.

Arthur

Petar Marić

Apr 9, 2006, 5:16:42 AM
to django-d...@googlegroups.com
> If we decide to go the sound-alike route, a good resource to start
> from might be the Textpattern CMS (which is in the process of
> transitioning to a BSD license, so we could eventually base something
> on their work once that happens), which includes a file used for
> transliterating URL slugs:
> http://dev.textpattern.com/browser/releases/4.0.3/source/textpattern/lib/i18n-ascii.txt
This is interesting. Although I've got to correct you: IMHCO this is a
look-alike transliteration.

> That list looks very nice to me, at least for those languages I know.
> The problem with the look-alike route is, that at least for German,
> people are used to the what you call sound-alike pattern.

AFAIK Google gets along better with the look-alike pattern, i.e. when I
search for macka I'll get mačka in the results (that's cat in Serbian).
OTOH when I search for machka, I won't get mačka. And you'd be
surprised how many people are going to search for ma*c*ka.
In other words, this is a little SEO "cheat": both the less and the more
i18n-aware users are going to get my site as a search result. Now
shush, don't let the big bad Google hear you ;)

For me this is more of an SEO and compatibility (file system/web
server/client) thing than a usability matter.

@Arthur: You know what? Why don't we just do whatever feels more
natural to our language? Then again, I can sense some transliteration
collisions on the way.

Arthur

Apr 9, 2006, 5:26:48 AM
to django-d...@googlegroups.com
> @Arthur: You know what? Why don't we just do whatever feels more
> natural to our language? Then again, I can sense some transliteration
> collisions on the way.

Absolutely, this should be language specific. If you look at the
proposed list, you see that there are different transliteration
patterns for, for example, fi-fi than for others. And yes, Google does
quite a bit of that stuff: if you search for "aengstlich" on
google.com it won't find "ängstlich", but if you search for
"aengstlich" under google.de/.ch/.at it'll automagically include all
"ängstlich" results.
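
A language-specific scheme could look roughly like this in Python. It is
a sketch under assumptions: OVERRIDES and slug_for are hypothetical
names, only a German table is filled in, and every other language falls
back to the look-alike behaviour of dropping diacritics:

```python
import re
import unicodedata

# Hypothetical per-language overrides, keyed by language code.
OVERRIDES = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue"},
}


def slug_for(title, lang):
    """Build a slug using the override table for `lang`, if there is one."""
    table = OVERRIDES.get(lang, {})
    # NFC first, so "a" + combining diaeresis is seen as a single "ä".
    text = unicodedata.normalize("NFC", title.lower())
    text = "".join(table.get(ch, ch) for ch in text)
    # Fallback for everything else: drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"[^a-z0-9]+", "-", text).strip("-")


print(slug_for("Ängstlich", "de"))  # aengstlich
print(slug_for("Mačka", "sr"))      # macka
```

The same title can still produce different slugs depending on the
language setting, so the collisions Petar mentions are not ruled out.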

Arthur

Jeroen Ruigrok van der Werven

Apr 9, 2006, 7:56:00 AM
to django-d...@googlegroups.com
On 4/8/06, Petar Marić <petar...@gmail.com> wrote:
> 2. What about languages that have way too many symbols in them, ie
> Chinese, Japanese, Thai? AFAIK, in them one symbol represents a
> whole word.
> Should we include them also or leave them out? Or does someone out
> there have a better solution?

Please, don't call them symbols. They're as much symbols as the
Latin-derived script is. Also, you can have more than one character
describe a word or meaning, and the transliteration changes depending
on which word or meaning it is describing.

二十日 can be both hatsuka as well as nijunichi, and they differ in
meaning a bit.

The problem with Thai, Korean, Japanese, Chinese, Hindi, Kannada, et
cetera is the fact that there are many transliteration schemes to
choose from. And what might seem straightforward in transliteration
from Japanese to English makes little sense when read in, say, Dutch
due to different pronunciation.

--
Jeroen Ruigrok van der Werven
