[patch] Generating slug for words with accents


Michal

Aug 26, 2006, 4:05:50 AM
to django...@googlegroups.com
Hello,
I have a problem submitting a ticket with my patch in Trac (details
below), so I decided to post it here.

-----------------------------------------

Short summary: [patch] Generating slug for words with accents

Full description: In my language (Czech) there are a lot of characters
with accents. When I type titles in admin forms, the autogenerated slug
values are incorrect (for example: title="sršeň",
autogenerated slug="sre"; correct is "srsen"). So I wrote a little patch
to the urlify.js code, which first converts all accented characters to
their ASCII equivalents. For now, my code handles only Czech accents.
I will be glad if some of you add your own national characters.

Priority: normal
Component: Admin interface
Severity: normal
Version: SVN
Keywords: slug urlify

-----------------------------------------
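
A minimal sketch of where the "sre" in the description above can come from, assuming the slug code strips every character outside the ASCII word class before building the slug. The function name stripNonWordChars is hypothetical and is not taken from urlify.js:

```javascript
// Hypothetical stand-in for a character-stripping step in a slug function.
function stripNonWordChars(s) {
    // In JavaScript regular expressions, \w matches only [A-Za-z0-9_],
    // so accented letters are deleted instead of being mapped to ASCII.
    return s.replace(/[^-\w\s]/g, '');
}

// "sršeň" written with escapes: š is U+0161, ň is U+0148.
var title = 'sr\u0161e\u0148';
console.log(stripNonWordChars(title)); // "sre"
```

Under that assumption, the accented letters have to be converted before the stripping step, which is what the patch sets out to do.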

Trac error

Trac detected an internal error:
Traceback (most recent call last):
  File "/usr/lib/python2.3/site-packages/trac/web/main.py", line 299, in dispatch_request
    dispatcher.dispatch(req)
  File "/usr/lib/python2.3/site-packages/trac/web/main.py", line 189, in dispatch
    resp = chosen_handler.process_request(req)
  File "/usr/lib/python2.3/site-packages/trac/ticket/web_ui.py", line 104, in process_request
    self._do_create(req, db)
  File "/usr/lib/python2.3/site-packages/trac/ticket/web_ui.py", line 163, in _do_create
    self._validate_ticket(req, ticket)
  File "/usr/lib/python2.3/site-packages/trac/ticket/web_ui.py", line 47, in _validate_ticket
    for field, message in manipulator.validate_ticket(req, ticket):
  File "build/bdist.linux-i686/egg/tracspamfilter/adapters.py", line 40, in validate_ticket
  File "build/bdist.linux-i686/egg/tracspamfilter/api.py", line 74, in test
herror: (1, 'Unknown host')

(I was trying to submit the ticket from a FreeBSD 5.4 system with Firefox 1.5.0.1.)


Regards
Michal

urlify_patch.diff

Maciej Bliziński

Aug 26, 2006, 7:44:45 AM
to django...@googlegroups.com
On Sat, 2006-08-26 at 10:05 +0200, Michal wrote:
> Full description: In my language (Czech) there are a lot of characters
> with accents. When I type titles in admin forms, the autogenerated slug
> values are incorrect (for example: title="sršeň",

Is it a hornet?

> autogenerated slug="sre"; correct is "srsen").

That's right. I've been experiencing the same thing.

> I will be glad if some of you add your own national characters.

I'm attaching a modified patch with Polish characters added.

--
Maciej Bliziński
http://automatthias.wordpress.com

urlify_patch_cz_and_pl.diff

Michal

Aug 26, 2006, 10:48:15 AM
to django...@googlegroups.com
Maciej Bliziński wrote:
> On Sat, 2006-08-26 at 10:05 +0200, Michal wrote:
>> Full description: In my language (Czech) there are a lot of characters
>> with accents. When I type titles in admin forms, the autogenerated slug
>> values are incorrect (for example: title="sršeň",
>
> Is it a hornet?

Yes it is, my Slavic brother :)

>
>> autogenerated slug="sre"; correct is "srsen").
>
> That's right. I've been experiencing the same thing.
>
>> I will be glad if some of you add your own national characters.
>
> I'm attaching a modified patch with Polish characters added.
>

Thank you. I also added a few Slovak characters (Czechs and Slovaks are
brothers too, and they have similar alphabets).

>
>

urlify_patch_cz_pl_sk.diff

Maciej Bliziński

Aug 27, 2006, 5:20:03 AM
to django...@googlegroups.com
On Sat, 2006-08-26 at 16:48 +0200, Michal wrote:
> I also added a few Slovak characters (Czechs and Slovaks are
> brothers too, and they have similar alphabets).

I looked at the Latin Unicode article in Wikipedia:
http://en.wikipedia.org/wiki/Latin_Unicode

There are accented characters I have never seen before... The Vietnamese
alphabet, for instance, has glyphs which are Latin characters with
unusual accents, for example: ã, or even with two accents: ặ

For most of the characters, it's pretty easy to remove the accents.
However, some characters are mysterious: should Ƨ be translated to S?
I don't know. So I just deleted them from the accent removal list.
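
For the easy cases, one alternative to a hand-written list is Unicode decomposition. This is a sketch only and is not what the attached patch does: it assumes a JavaScript engine that provides String.prototype.normalize (ECMAScript 2015), which the browsers of 2006 did not have, and stripAccents is a hypothetical name.

```javascript
// Sketch: remove accents by decomposing each character (NFD) and then
// deleting the combining marks that the decomposition produces.
function stripAccents(s) {
    // NFD turns "š" (U+0161) into "s" followed by U+030C (combining caron).
    // The regex removes the Combining Diacritical Marks block U+0300..U+036F.
    return s.normalize('NFD').replace(/[\u0300-\u036f]/g, '');
}

console.log(stripAccents('sr\u0161e\u0148')); // "srsen"
console.log(stripAccents('\u1eb7'));          // "a" (ặ carries two marks)
console.log(stripAccents('\u0142'));          // "ł" is unchanged
```

Characters without a canonical decomposition, such as ł, ß or Ƨ, pass through untouched, so the "mysterious" cases would still need an explicit table.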

I'm including a patch with "from" and "to" constants extended with all
the characters I found on Wikipedia that seemed to be of any use. This
should cover all the Slavic countries except those which use the
Cyrillic alphabet.

One thing... some characters want to be translated into _two_ ASCII
characters, for example Æ to AE. This would require a different data
structure. In its present form, I just entered E. The same goes for ß,
which I replaced with a single S.
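
One possible shape for such a data structure is a lookup table whose values may be longer than one character. This is a sketch under assumptions, not part of the attached patch: the names ACCENT_MAP and replAccentsWithMap are hypothetical, the table holds only a few sample entries, and mapping ß to "ss" is an assumption.

```javascript
// Hypothetical table: keys are single characters, values can be any length.
var ACCENT_MAP = {
    '\u00c6': 'AE', // Æ
    '\u00e6': 'ae', // æ
    '\u00df': 'ss', // ß (assumed mapping)
    '\u0161': 's',  // š
    '\u0148': 'n',  // ň
    '\u0142': 'l'   // ł
};

function replAccentsWithMap(s) {
    var out = '';
    for (var i = 0; i < s.length; i++) {
        var ch = s.charAt(i);
        // Own-property check so that only entries listed above are used.
        if (Object.prototype.hasOwnProperty.call(ACCENT_MAP, ch)) {
            out += ACCENT_MAP[ch];
        } else {
            out += ch;
        }
    }
    return out;
}

console.log(replAccentsWithMap('\u00c6on'));     // "AEon"
console.log(replAccentsWithMap('stra\u00dfe'));  // "strasse"
```

Building a new string instead of replacing in place means the output may grow longer than the input without disturbing the loop index.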

Regards,
Maciej

urlify-i18n-patch.diff

Michal

Aug 27, 2006, 5:37:31 AM
to django...@googlegroups.com
Maciej Bliziński wrote:
> On Sat, 2006-08-26 at 16:48 +0200, Michal wrote:
>> I also added a few Slovak characters (Czechs and Slovaks are
>> brothers too, and they have similar alphabets).
>
> I looked at the Latin Unicode article in Wikipedia:
> http://en.wikipedia.org/wiki/Latin_Unicode
>
> There are accented characters I have never seen before... The Vietnamese
> alphabet, for instance, has glyphs which are Latin characters with
> unusual accents, for example: ã, or even with two accents: ặ
>
> For most of the characters, it's pretty easy to remove the accents.
> However, some characters are mysterious: should Ƨ be translated to S?
> I don't know. So I just deleted them from the accent removal list.
>

Nice work Maciej :)

When I wrote my first post, I typed: "I will be glad if some of you add
your own national characters."

Each nationality has its own specific characters and rules for them, so
I think that somebody from each of these countries should check your
version of the patch.


> I'm including a patch with "from" and "to" constants extended with all
> the characters I found on Wikipedia that seemed to be of any use. This
> should cover all the Slavic countries except those which use the
> Cyrillic alphabet.
>
> One thing... some characters want to be translated into _two_ ASCII
> characters, for example Æ to AE. This would require a different data
> structure. In its present form, I just entered E. The same goes for ß,
> which I replaced with a single S.

Maybe we could try writing one new function that translates a single
Unicode character into the appropriate two ASCII characters? (Translating
accented characters would then be done in two steps: 1. replAccents,
2. the new function.)
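
A rough sketch of that two-step idea, under stated assumptions: expandDigraphs and toAscii are hypothetical names, and replAccents here is a shortened stand-in with four sample characters, not the function from the patch. Because replAccents as patched already turns Æ into E, the sketch runs the two-character step first so that it still sees the original characters.

```javascript
// Shortened stand-in for replAccents: one-to-one replacements only.
function replAccents(s) {
    var from = '\u00c6\u00df\u0161\u0148'; // Æ ß š ň
    var to = 'ESsn';
    var out = '';
    for (var i = 0; i < s.length; i++) {
        var x = from.indexOf(s.charAt(i));
        out += (x !== -1) ? to.charAt(x) : s.charAt(i);
    }
    return out;
}

// Hypothetical second function: one character becomes two ASCII characters.
function expandDigraphs(s) {
    return s.replace(/\u00c6/g, 'AE')
            .replace(/\u00e6/g, 'ae')
            .replace(/\u00df/g, 'ss');
}

// The expansion runs first; otherwise replAccents would consume Æ and ß.
function toAscii(s) {
    return replAccents(expandDigraphs(s));
}

console.log(replAccents('\u00c6sir')); // "Esir"
console.log(toAscii('\u00c6sir'));     // "AEsir"
```

The other way to keep the order "replAccents first" would be to remove the two-character cases from the "from" and "to" strings.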

>
> Regards,
> Maciej
>
>
>
> ------------------------------------------------------------------------
>
> Index: django/contrib/admin/media/js/urlify.js
> ===================================================================
> --- django/contrib/admin/media/js/urlify.js (revision 3618)
> +++ django/contrib/admin/media/js/urlify.js (working copy)
> @@ -1,4 +1,43 @@
> +function replAccents(s)
> +{
> + // Replacement lists based on article in Wikipedia,
> + // http://en.wikipedia.org/wiki/Latin_Unicode
> + // from and to strings must have same number of characters
> + var from = 'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîï';
> + var to = 'AAAAAAECEEEEIIIIDNOOOOOOUUUUYSaaaaaaaceeeeiiii';
> + from += 'ñòóôõöøùúûüýÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģ';
> + to += 'noooooouuuuyyaaaaaaccccccccddddeeeeeeeeeegggggggg';
> + from += 'ĤĥĦħĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘř';
> + to += 'hhhhiiiiiiiiiijjkkkllllllllllnnnnnnnnnoooooooorrrrrr';
> + from += 'ŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſƀƂƃƄƅƇƈƉƊƐƑƒƓƔ';
> + to += 'ssssssssttttttuuuuuuuuuuuuwwyyyzzzzzzfbbbbbccddeffgv';
> + from += 'ƖƗƘƙƚƝƞƟƠƤƦƫƬƭƮƯưƱƲƳƴƵƶǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǝǞǟǠǡǢǣǤǥǦǧǨǩ';
> + to += 'likklnnoopettttuuuuyyzzaaiioouuuuuuuuuueaaaaeeggggkk';
> + from += 'ǪǫǬǭǰǴǵǷǸǹǺǻǼǽǾǿȀȁȂȃȄȅȆȇȈȉȊȋȌȍȎȏȐȑȒȓȔȕȖȗȘșȚțȞȟȤȥȦȧȨȩ';
> + to += 'oooojggpnnaaeeooaaaaeeeeiiiioooorrrruuuusstthhzzaaee';
> + from += 'ȪȫȬȭȮȯȰȱȲȳḀḁḂḃḄḅḆḇḈḉḊḋḌḍḎḏḐḑḒḓḔḕḖḗḘḙḚḛḜḝḞḟḠḡḢḣḤḥḦḧḨḩḪḫ';
> + to += 'ooooooooyyaabbbbbbccddddddddddeeeeeeeeeeffgghhhhhhhhhh';
> + from += 'ḬḭḮḯḰḱḲḳḴḵḶḷḸḹḺḻḼḽḾḿṀṁṂṃṄṅṆṇṈṉṊṋṌṍṎṏṐṑṒṓṔṕṖṗṘṙṚṛṜṝṞṟ';
> + to += 'iiiikkkkkkllllllllmmmmmmnnnnnnnnoooooooopppprrrrrrrr';
> + from += 'ṠṡṢṣṤṥṦṧṨṩṪṫṬṭṮṯṰṱṲṳṴṵṶṷṸṹṺṻṼṽṾṿẀẁẂẃẄẅẆẇẈẉẊẋẌẍẎẏẐẑẒẓẔẕ';
> + to += 'ssssssssssttttttttuuuuuuuuuuvvvvwwwwwwwwwwxxxxxyzzzzzz';
> + from += 'ẖẗẘẙẚẛẠạẢảẤấẦầẨẩẪẫẬậẮắẰằẲẳẴẵẶặẸẹẺẻẼẽẾếỀềỂểỄễỆệỈỉỊị';
> + to += 'htwyafaaaaaaaaaaaaaaaaaaaaaaaaeeeeeeeeeeeeeeeeiiii';
> + from += 'ỌọỎỏỐốỒồỔổỖỗỘộỚớỜờỞởỠỡỢợỤụỦủỨứỪừỬửỮữỰựỲỳỴỵỶỷỸỹ';
> + to += 'oooooooooooooooooooooooouuuuuuuuuuuuuuyyyyyyyy';
> +
> + for (var i = 0; i != s.length; i++) {
> + var x = from.indexOf(s.charAt(i));
> + if (x != -1) {
> + var r = new RegExp(from.charAt(x), 'g');
> + s = s.replace(r, to.charAt(x));
> + }
> + }
> + return s;
> +}
> +
> function URLify(s, num_chars) {
> + s = replAccents(s);
> // changes, e.g., "Petty theft" to "petty_theft"
> // remove all these words from the string before urlifying
> removelist = ["a", "an", "as", "at", "before", "but", "by", "for", "from",
>

Nicolas Steinmetz

Nov 16, 2006, 3:49:05 AM
to django...@googlegroups.com
Maciej Bliziński wrote:

> I'm including a patch with "from" and "to" constants extended with all
> the characters I found on Wikipedia that seemed to be of any use. This
> should cover all the Slavic countries except those which use the
> Cyrillic alphabet.

Was this patch committed to the SVN version of Django? In 0.95 I was
facing this issue with French accents.

Nicolas

Aidas Bendoraitis

Nov 16, 2006, 4:53:04 AM
to django...@googlegroups.com
German ß should be translated to ss
ä to ae
ö to oe
ü to ue

Regards,
Aidas Bendoraitis [aka Archatas]

Karsten W. Rohrbach

Nov 16, 2006, 4:54:04 AM
to Django users
Would this make sense to integrate on the server side (instead of JS),
say next to django.utils.text.get_valid_filename()?

John Lenton

Nov 16, 2006, 7:42:35 AM
to django...@googlegroups.com
On 11/16/06, Aidas Bendoraitis <aidas.be...@gmail.com> wrote:
> German ß should be translated to ss
> ä to ae
> ö to oe
> ü to ue

but «ü» in Spanish should be just «u» (as in pingüino -> pinguino).

--
John Lenton (jle...@gmail.com) -- Random fortune:
The trouble with a lot of self-made men is that they worship their creator.

zenx

Nov 16, 2006, 6:38:55 PM
to Django users
Spanish info:
á é í ó ú should be a e i o u
ü should be u
ñ should be n

I think that's everything in Spanish ;)

Kamil Wdowicz

Nov 17, 2006, 1:25:50 AM
to django...@googlegroups.com
Polish:
ą = a
ć = c
ź or ż = z
ę = e
ó = o
ł = l
ś = s
ń = n

2006/11/17, zenx <antoni...@gmail.com>:

Aidas Bendoraitis

Nov 17, 2006, 3:19:44 AM
to django...@googlegroups.com
Similarly Lithuanian would be:
ą = a
č = c
ę = e
ė = e
į = i
š = s
ų = u
ū = u
ž = z

I am wondering whether the slugify function should depend on the chosen
language or not. It seems that there are not many differences among
stripped accented letters in different languages, so maybe it should
stay the same for all of them. Whatever we decide, ß should still be
translated to ss, not S. What is the opinion of the others?
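
If the function did depend on the language, one way to keep the differences small is a shared table plus per-language overrides. This is only a sketch under assumptions: BASE_MAP, LANGUAGE_OVERRIDES and transliterate are hypothetical names, and the tables contain just the German and Spanish examples mentioned in this thread.

```javascript
// Shared defaults (assumed): strip the accent, ß becomes "ss".
var BASE_MAP = {
    '\u00e4': 'a',  // ä
    '\u00f6': 'o',  // ö
    '\u00fc': 'u',  // ü
    '\u00df': 'ss', // ß
    '\u00f1': 'n'   // ñ
};

// Per-language exceptions: German umlauts expand to two letters.
var LANGUAGE_OVERRIDES = {
    de: { '\u00e4': 'ae', '\u00f6': 'oe', '\u00fc': 'ue' }
};

function transliterate(s, lang) {
    var has = Object.prototype.hasOwnProperty;
    var overrides = has.call(LANGUAGE_OVERRIDES, lang) ? LANGUAGE_OVERRIDES[lang] : {};
    var out = '';
    for (var i = 0; i < s.length; i++) {
        var ch = s.charAt(i);
        if (has.call(overrides, ch)) {
            out += overrides[ch];      // language-specific rule wins
        } else if (has.call(BASE_MAP, ch)) {
            out += BASE_MAP[ch];       // otherwise use the shared default
        } else {
            out += ch;                 // leave everything else alone
        }
    }
    return out;
}

console.log(transliterate('ping\u00fcino', 'es')); // "pinguino"
console.log(transliterate('M\u00fcller', 'de'));   // "Mueller"
```

A language without an entry in LANGUAGE_OVERRIDES simply falls back to the shared table.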

And also, if we are already adding localizations to the slugify
function, shouldn't Greek, Russian, and other non-Latin alphabets also
be transliterated to the Latin character set?

Regards,
Aidas Bendoraitis [aka Archatas]

orestis

Nov 17, 2006, 4:53:54 AM
to Django users
Can you discuss this on the relevant ticket:

http://code.djangoproject.com/ticket/2282

Thanks,
Orestis
