I suggest we change urlify.js in a way that will not block out non-English chars. Instead it should translate them into their English feel-a-like, as suggested in #1602.
However, we've got 2 problems:
1. Should we translate chars to their sound-a-likes or look-a-likes? Some examples:
Š --> sh or s?
Č --> ch or c?
ö --> oe or o?
ä --> ae or a?
ü --> ue or u?
Honestly, I'm more fond of look-a-likes (for those of you who didn't have any coffee yet, they're the ones to the right ;) ).
2. What about languages that have way too many symbols in them, ie Chinese, Japanese, Thai? AFAIK, in them one symbol represents a whole word. Should we include them also or leave them out? Or does someone out there have a better solution?
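To make the choice in point 1 concrete, here is a minimal sketch in Python (not urlify.js itself, which is JavaScript). The names LOOK_ALIKE, SOUND_ALIKE and transliterate are made up for illustration, and the tables cover only the example characters listed above:

```python
# Hypothetical mapping tables for illustration only; they hold just the
# example characters from this message, not a complete alphabet.
LOOK_ALIKE = {"Š": "S", "š": "s", "Č": "C", "č": "c",
              "ö": "o", "ä": "a", "ü": "u"}
SOUND_ALIKE = {"Š": "Sh", "š": "sh", "Č": "Ch", "č": "ch",
               "ö": "oe", "ä": "ae", "ü": "ue"}


def transliterate(text, table):
    """Replace every character found in `table`; leave all others untouched."""
    return "".join(table.get(ch, ch) for ch in text)


# Čačak is a Serbian city name; über is a German word.
print(transliterate("Čačak", LOOK_ALIKE))   # Cacak
print(transliterate("Čačak", SOUND_ALIKE))  # Chachak
print(transliterate("über", LOOK_ALIKE))    # uber
print(transliterate("über", SOUND_ALIKE))   # ueber
```

Either way the mechanism is the same; only the table changes, so the real question is which table to ship.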
Cheers, -- Petar Marić *e-mail: petar.ma...@gmail.com *mobile: +381 (64) 6122467
Petar Marić wrote:
> I suggest we change urlify.js in a way that will not block out
> non-English chars. Instead it should translate them into their English
> feel-a-like, as suggested in #1602.
Browsers now support Unicode URLs (albeit turned off by default in English editions lately due to security issues). Might it make more sense just to allow non-English characters as is, if the said nationality might reasonably be willing/able to type them into the address bar?
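For reference, a non-ASCII character in a URL path is normally transmitted percent-encoded as UTF-8, even when the address bar displays it unescaped. A small sketch using Python's standard `urllib.parse` module (the slug value is just an example, not something from an actual site):

```python
from urllib.parse import quote, unquote

slug = "mačka"  # example slug containing a non-ASCII letter

encoded = quote(slug)    # UTF-8 bytes, percent-encoded
print(encoded)           # ma%C4%8Dka
print(unquote(encoded))  # mačka
```

So allowing such characters as is does not require changing what servers receive; it mostly affects what users see and type.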
I don't have any needs along these lines (I'm just a stupid American dealing solely with English applications), so the debate is merely academic to me at this point.
-- --Max Battcher-- http://www.worldmaker.net/ "I'm gonna win, trust in me / I have come to save this world / and in the end I'll get the grrrl!" --Machinae Supremacy, Hero (Promo Track)
> Browsers now support Unicode URLs (albeit turned off by default in
> English editions lately due to security issues), might it make more
> sense just to allow non-English characters as is, if the said
> nationality might reasonably be willing/able to type them into the
> address bar?
That's true, but what about the filenames? I use slug as a filename for ImageField like so:
[snip]
    def save(self):
        # Based on nesh's fileutils: http://djangoutils.python-hosting.com/
        from contrib.utils.file import rename_by_field
        self.image = rename_by_field(self.image, self.slug)
        super(Product, self).save()
[/snip]
For this to work, I use a patched local copy of urlify.js:
[snip]
    class Admin:
        js = (
            # Change the behaviour of urlify so it accepts Serbian letters
            '../media/admin/js/urlify.js',
        )
[/snip]
-- Petar Marić *e-mail: petar.ma...@gmail.com *mobile: +381 (64) 6122467
On 4/8/06, Petar Marić <petar.ma...@gmail.com> wrote:
> I suggest we change urlify.js in a way that will not block out
> non-English chars. Instead it should translate them into their English
> feel-a-like, as suggested in #1602.
If we decide to go the sound-alike route, a good resource to start from might be the Textpattern CMS (which is in the process of transitioning to a BSD license, so we could eventually base something on their work once that happens), which includes a file used for transliterating URL slugs:
http://dev.textpattern.com/browser/releases/4.0.3/source/textpattern/...
Though, as you point out, that doesn't solve the problem of languages for which there simply is no "plain ASCII" transliteration possible. I'm inclined to say that in those cases we shouldn't try to auto-generate a slug and should instead recommend that developers stay away from prepopulate_from; it's convenient to have, to be sure, but we'll never be able to produce a manageable system that can handle all the languages people might be authoring content in.
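One way to picture "don't auto-generate in those cases" is a helper that gives up as soon as a character has no ASCII mapping. This is only a sketch in Python under that assumption; auto_slug and the one-entry table are hypothetical names and data, not anything in Django or urlify.js:

```python
import re


def auto_slug(title, table):
    """Return an ASCII slug, or None when some character has no mapping."""
    mapped = "".join(table.get(ch, ch) for ch in title.lower())
    if not mapped.isascii():
        # Something could not be transliterated: do not guess, let the
        # author type the slug by hand instead.
        return None
    # Collapse every run of non-alphanumerics into a single hyphen.
    return re.sub(r"[^a-z0-9]+", "-", mapped).strip("-")


table = {"ć": "c"}  # hypothetical, deliberately tiny table

print(auto_slug("Petar Marić", table))  # petar-maric
print(auto_slug("二十日", table))        # None
```

Returning None rather than a partially stripped string keeps the failure visible instead of silently producing an empty or misleading slug.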
-- "May the forces of evil become confused on the way to your house." -- George Carlin
> > I suggest we change urlify.js in a way that will not block out
> > non-English chars. Instead it should translate them into their English
> > feel-a-like, as suggested in #1602.
That list looks very nice to me, at least for those languages I know. The problem with the look-alike route is that, at least for German, people are used to what you call the sound-alike pattern. In the case of German it's not so much a phonetic pattern as a transcription convention. ("schwül/schwuel" <-> sweltering and "schwul" <-> homosexual(♂) don't have the same meaning.)
Unfortunately I don't know much about standards in this area. If there are any proven ones, I'd vote to use them. Using Unicode in URLs would be fine with me, though I don't know how well browsers/proxies/servers/search engines handle them.
> If we decide to go the sound-alike route, a good resource to start
> from might be the Textpattern CMS (which is in the process of
> transitioning to a BSD license, so we could eventually base something
> on their work once that happens), which includes a file used for
> transliterating URL slugs:
> http://dev.textpattern.com/browser/releases/4.0.3/source/textpattern/...
This is interesting. Although I've got to correct you: IMHCO this is a look-alike transliteration.
> That list looks very nice to me, at least for those languages I know.
> The problem with the look-alike route is, that at least for German,
> people are used to the what you call sound-alike pattern.
AFAIK Google gets along better with the look-alike pattern, ie when I search for macka I'll get mačka in results (that's cat in Serbian) - OTOH when I do a search for machka, I won't get mačka. And you'd be surprised how many are going to search for ma*c*ka. In other words, this is a little SEO "cheat" - both the less and more i18n aware users are going to get my site as a search result. Now shush, don't let the big bad Google hear you ;)
For me this is more of an SEO and compatibility (file system/web server/client) thing than a usability matter.
@Arthur: You know what? Why don't we just do whatever feels more natural to our language? Then again, I can sense some transliteration collisions on the way.
-- Petar Marić *e-mail: petar.ma...@gmail.com *mobile: +381 (64) 6122467
> @Arthur: You know what? Why don't we just do whatever feels more
> natural to our language? Then again, I can sense some transliteration
> collisions on the way.
Absolutely, this should be language specific. If you look at the proposed list, you see that some locales, for example fi-fi, have different transliteration patterns than the others. And yes, Google does quite a bit of that stuff: if you search for "aengstlich" on google.com it won't find "ängstlich", but if you search for "aengstlich" on google.de/.ch/.at it'll automagically include all "ängstlich" results.
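A rough sketch of what "language specific" could mean in practice, written in Python for brevity: a default look-alike table plus per-language overrides. All names here (DEFAULT_TABLE, LANGUAGE_TABLES, transliterate_for) and the table contents are assumptions for illustration, not the proposed list itself:

```python
# Hypothetical default: look-alike replacements for a few characters.
DEFAULT_TABLE = {"ä": "a", "ö": "o", "ü": "u", "š": "s", "č": "c"}

# Hypothetical per-language overrides; German uses the ae/oe/ue convention.
LANGUAGE_TABLES = {
    "de": {"ä": "ae", "ö": "oe", "ü": "ue"},
}


def transliterate_for(text, language):
    """Apply the default table, overridden by the language's own entries."""
    table = dict(DEFAULT_TABLE)
    table.update(LANGUAGE_TABLES.get(language, {}))
    return "".join(table.get(ch, ch) for ch in text.lower())


print(transliterate_for("ängstlich", "de"))  # aengstlich
print(transliterate_for("mačka", "sr"))      # macka
# With the default table, "schwül" and "schwul" collide:
print(transliterate_for("schwül", "en"))     # schwul
print(transliterate_for("schwül", "de"))     # schwuel
```

The last two lines show the collision concern: without the German override, two different words end up with the same slug.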
On 4/8/06, Petar Marić <petar.ma...@gmail.com> wrote:
> 2. What about languages that have way too many symbols in them, ie
> Chinese, Japanese, Thai? AFAIK, in them one symbol represents a
> whole word.
> Should we include them also or leave them out? Or does someone out
> there have a better solution?
Please, don't call them symbols. They're as much symbols as the Latin-derived script is. Also, you can have more than one character describe a word or meaning, and the transliteration changes depending on which word or meaning it is describing.
二十日 can be both hatsuka as well as nijunichi, and they differ in meaning a bit.
The problem with Thai, Korean, Japanese, Chinese, Hindi, Kannada, et cetera is the fact that there are many transliteration schemes to choose from. And what might seem straightforward in transliteration from Japanese to English makes little sense when read in, say, Dutch due to different pronunciation.