Hello, I have a problem submitting a ticket with my patch in Trac (details below), so I decided to post it here.
-----------------------------------------
Short summary: [patch] Generating slug for words with accents
Full description: In my language (Czech) there are a lot of characters with accents. When I type titles in admin forms, the autogenerated slug values are incorrect (for example: title="sršeň", autogenerated slug="sre"; the correct slug is "srsen"). So I wrote a small patch to the urlify.js code, which first converts all accented characters to their ASCII equivalents. For now, my code handles only Czech accents. I will be glad if some of you add your own national characters.
Priority: normal Component: Admin interface Severity: normal Version: SVN Keywords: slug urlify
-----------------------------------------
Trac error
Trac detected an internal error:
Traceback (most recent call last):
  File "/usr/lib/python2.3/site-packages/trac/web/main.py", line 299, in dispatch_request
    dispatcher.dispatch(req)
  File "/usr/lib/python2.3/site-packages/trac/web/main.py", line 189, in dispatch
    resp = chosen_handler.process_request(req)
  File "/usr/lib/python2.3/site-packages/trac/ticket/web_ui.py", line 104, in process_request
    self._do_create(req, db)
  File "/usr/lib/python2.3/site-packages/trac/ticket/web_ui.py", line 163, in _do_create
    self._validate_ticket(req, ticket)
  File "/usr/lib/python2.3/site-packages/trac/ticket/web_ui.py", line 47, in _validate_ticket
    for field, message in manipulator.validate_ticket(req, ticket):
  File "build/bdist.linux-i686/egg/tracspamfilter/adapters.py", line 40, in validate_ticket
  File "build/bdist.linux-i686/egg/tracspamfilter/api.py", line 74, in test
herror: (1, 'Unknown host')
(I was trying to submit the ticket from a FreeBSD 5.4 system with Firefox 1.5.0.1.)
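The slug problem described in the ticket text can be reproduced with a few lines of JavaScript. This is only a minimal sketch: `naiveSlug` is a hypothetical stand-in, and it assumes the slug code simply drops every character that is not a hyphen, an ASCII word character, or whitespace. In JavaScript `\w` matches ASCII only, which is why the accented letters vanish instead of being converted.

```javascript
// Hypothetical stand-in for the slug step, not the actual urlify.js code.
// Assumption: anything outside [-\w\s] is removed before lowercasing.
function naiveSlug(s) {
    return s.replace(/[^-\w\s]/g, '').toLowerCase();
}

// "sršeň" loses both accented letters instead of becoming "srsen".
naiveSlug('sr\u0161e\u0148'); // → 'sre'
```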
On Sat, 2006-08-26 at 10:05 +0200, Michal wrote:
> Full description: In my language (Czech) there are a lot of characters
> with accents. When I type titles in admin forms, the slug field
> autogenerated values are incorrect (for example: title="sršeň",
Is it a hornet?
> autogenerated slug="sre"; correct is "srsen").
That's right. I've been experiencing the same thing.
> I will be glad if some of you add your own national characters.
I'm attaching a modified patch with Polish characters added.
Maciej Bliziński wrote:
> On Sat, 2006-08-26 at 10:05 +0200, Michal wrote:
>> Full description: In my language (Czech) there are a lot of characters
>> with accents. When I type titles in admin forms, the slug field
>> autogenerated values are incorrect (for example: title="sršeň",
> Is it a hornet?
Yes it is, my Slavic brother :)
>> autogenerated slug="sre"; correct is "srsen").
> That's right. I've been experiencing the same thing.
>> I will be glad if some of you add your own national characters.
> I'm attaching a modified patch with Polish characters added.
Thank you. I also added a few Slovak characters (Czechs and Slovaks were brothers too, and they have similar alphabets).
On Sat, 2006-08-26 at 16:48 +0200, Michal wrote:
> I also added a few Slovak characters (Czechs and Slovaks were
> brothers too, and they have similar alphabets).
There are accented characters I have never seen before. The Vietnamese alphabet, for instance, has glyphs that are Latin characters with unusual accents, for example ã, or even with two accents: ặ
For most of the characters, it's pretty easy to remove the accents. However, some characters are mysterious: should Ƨ be translated to S? I don't know. So I just deleted them from the accent removal list.
I'm including a patch with the "from" and "to" constants extended with all the characters I found on Wikipedia that seemed to be of any use. This should cover all the Slavic countries except those which use the Cyrillic alphabet.
One thing: some characters should be translated into _two_ ASCII characters, for example Æ to AE. This would require a different data structure. In the present form, I just entered E. The same goes for ß, which I replaced with a single S.
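One possible shape for the "different data structure" mentioned above is a lookup object whose values may be longer than one character. The following is a minimal sketch, not part of the attached patch: `ACCENT_MAP` and `replAccentsMap` are hypothetical names, the table holds only a handful of entries for illustration, and the choice of "ss" for ß is an assumption.

```javascript
// Hypothetical lookup table: each key is one accented character,
// each value is an ASCII replacement of any length.
var ACCENT_MAP = {
    'Æ': 'AE', 'æ': 'ae', 'ß': 'ss',
    'š': 's', 'ň': 'n', 'ł': 'l', 'ą': 'a'
};

function replAccentsMap(s) {
    var out = '';
    for (var i = 0; i < s.length; i++) {
        var ch = s.charAt(i);
        // Use the mapped value when one exists, otherwise keep the character.
        out += ACCENT_MAP.hasOwnProperty(ch) ? ACCENT_MAP[ch] : ch;
    }
    return out;
}
```

Building a new string character by character also avoids creating a RegExp for every match, and the table no longer depends on two strings staying the same length.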
Maciej Bliziński wrote:
> On Sat, 2006-08-26 at 16:48 +0200, Michal wrote:
>> I also added a few Slovak characters (Czechs and Slovaks were
>> brothers too, and they have similar alphabets).
> There are accented characters I have never seen before. The Vietnamese
> alphabet, for instance, has glyphs that are Latin characters with
> unusual accents, for example ã, or even with two accents: ặ
> For most of the characters, it's pretty easy to remove the accents.
> However, some characters are mysterious: should Ƨ be translated to S?
> I don't know. So I just deleted them from the accent removal list.
Nice work Maciej :)
When I wrote my first post, I typed: "I will be glad if some of you add your own national characters." Each nationality has its own specific characters and rules for them, so I think that somebody from each of these countries should check your version of the patch.
> I'm including a patch with the "from" and "to" constants extended with all
> the characters I found on Wikipedia that seemed to be of any use. This
> should cover all the Slavic countries except those which use the Cyrillic
> alphabet.
> One thing: some characters should be translated into _two_ ASCII
> characters, for example Æ to AE. This would require a different data
> structure. In the present form, I just entered E. The same goes for ß,
> which I replaced with a single S.
Maybe we could write one new function that translates a single Unicode character into the appropriate two ASCII characters? (Translating accented characters would then be done in two steps: 1. replAccents, 2. the new function.)
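The two-step idea could look roughly like the sketch below. Everything here is hypothetical: `replAccentsShort` is a tiny stand-in for replAccents that uses the same from/to style, and `expandDigraphs` plays the role of the proposed new function. It assumes the characters needing two letters are left out of the from/to strings, since otherwise step 1 would already have turned them into a single letter.

```javascript
// Step 1 stand-in: one-to-one replacement in the from/to style.
// Assumption: characters that need two letters are NOT listed here.
function replAccentsShort(s) {
    var from = 'šňéç';
    var to = 'snec';
    var out = '';
    for (var i = 0; i < s.length; i++) {
        var x = from.indexOf(s.charAt(i));
        out += (x !== -1) ? to.charAt(x) : s.charAt(i);
    }
    return out;
}

// Step 2: the proposed extra function, expanding one character into two.
var DIGRAPHS = [['Æ', 'AE'], ['æ', 'ae'], ['Œ', 'OE'], ['œ', 'oe'], ['ß', 'ss']];

function expandDigraphs(s) {
    for (var i = 0; i < DIGRAPHS.length; i++) {
        // split/join replaces every occurrence without building a RegExp.
        s = s.split(DIGRAPHS[i][0]).join(DIGRAPHS[i][1]);
    }
    return s;
}

function toAscii(s) {
    return expandDigraphs(replAccentsShort(s));
}
```

This keeps the existing from/to strings untouched and isolates the special cases in one short list, at the cost of a second pass over the string.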
> Index: django/contrib/admin/media/js/urlify.js
> ===================================================================
> --- django/contrib/admin/media/js/urlify.js (revision 3618)
> +++ django/contrib/admin/media/js/urlify.js (working copy)
> @@ -1,4 +1,43 @@
> +function replAccents(s)
> +{
> +    // Replacement lists based on article in Wikipedia,
> +    // http://en.wikipedia.org/wiki/Latin_Unicode
> +    // from and to strings must have same number of characters
> +    var from = 'ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝßàáâãäåæçèéêëìíîï';
> +    var to = 'AAAAAAECEEEEIIIIDNOOOOOOUUUUYSaaaaaaaceeeeiiii';
> +    from += 'ñòóôõöøùúûüýÿĀāĂăĄąĆćĈĉĊċČčĎďĐđĒēĔĕĖėĘęĚěĜĝĞğĠġĢģ';
> +    to += 'noooooouuuuyyaaaaaaccccccccddddeeeeeeeeeegggggggg';
> +    from += 'ĤĥĦħĨĩĪīĬĭĮįİıĴĵĶķĸĹĺĻļĽľĿŀŁłŃńŅņŇňŉŊŋŌōŎŏŐőŒœŔŕŖŗŘř';
> +    to += 'hhhhiiiiiiiiiijjkkkllllllllllnnnnnnnnnoooooooorrrrrr';
> +    from += 'ŚśŜŝŞşŠšŢţŤťŦŧŨũŪūŬŭŮůŰűŲųŴŵŶŷŸŹźŻżŽžſƀƂƃƄƅƇƈƉƊƐƑƒƓƔ';
> +    to += 'ssssssssttttttuuuuuuuuuuuuwwyyyzzzzzzfbbbbbccddeffgv';
> +    from += 'ƖƗƘƙƚƝƞƟƠƤƦƫƬƭƮƯưƱƲƳƴƵƶǍǎǏǐǑǒǓǔǕǖǗǘǙǚǛǜǝǞǟǠǡǢǣǤǥǦǧǨǩ';
> +    to += 'likklnnoopettttuuuuyyzzaaiioouuuuuuuuuueaaaaeeggggkk';
> +    from += 'ǪǫǬǭǰǴǵǷǸǹǺǻǼǽǾǿȀȁȂȃȄȅȆȇȈȉȊȋȌȍȎȏȐȑȒȓȔȕȖȗȘșȚțȞȟȤȥȦȧȨȩ';
> +    to += 'oooojggpnnaaeeooaaaaeeeeiiiioooorrrruuuusstthhzzaaee';
> +    from += 'ȪȫȬȭȮȯȰȱȲȳḀḁḂḃḄḅḆḇḈḉḊḋḌḍḎḏḐḑḒḓḔḕḖḗḘḙḚḛḜḝḞḟḠḡḢḣḤḥḦḧḨḩḪḫ';
> +    to += 'ooooooooyyaabbbbbbccddddddddddeeeeeeeeeeffgghhhhhhhhhh';
> +    from += 'ḬḭḮḯḰḱḲḳḴḵḶḷḸḹḺḻḼḽḾḿṀṁṂṃṄṅṆṇṈṉṊṋṌṍṎṏṐṑṒṓṔṕṖṗṘṙṚṛṜṝṞṟ';
> +    to += 'iiiikkkkkkllllllllmmmmmmnnnnnnnnoooooooopppprrrrrrrr';
> +    from += 'ṠṡṢṣṤṥṦṧṨṩṪṫṬṭṮṯṰṱṲṳṴṵṶṷṸṹṺṻṼṽṾṿẀẁẂẃẄẅẆẇẈẉẊẋẌẍẎẏẐẑẒẓẔẕ';
> +    to += 'ssssssssssttttttttuuuuuuuuuuvvvvwwwwwwwwwwxxxxxyzzzzzz';
> +    from += 'ẖẗẘẙẚẛẠạẢảẤấẦầẨẩẪẫẬậẮắẰằẲẳẴẵẶặẸẹẺẻẼẽẾếỀềỂểỄễỆệỈỉỊị';
> +    to += 'htwyafaaaaaaaaaaaaaaaaaaaaaaaaeeeeeeeeeeeeeeeeiiii';
> +    from += 'ỌọỎỏỐốỒồỔổỖỗỘộỚớỜờỞởỠỡỢợỤụỦủỨứỪừỬửỮữỰựỲỳỴỵỶỷỸỹ';
> +    to += 'oooooooooooooooooooooooouuuuuuuuuuuuuuyyyyyyyy';
> +
> +    for (var i = 0; i != s.length; i++) {
> +        var x = from.indexOf(s[i]);
> +        if (x != -1) {
> +            r = new RegExp(from[x], 'g');
> +            s = s.replace(r, to[x]);
> +        }
> +    }
> +    return s;
> +}
> +
>  function URLify(s, num_chars) {
> +    s = replAccents(s);
>      // changes, e.g., "Petty theft" to "petty_theft"
>      // remove all these words from the string before urlifying
>      removelist = ["a", "an", "as", "at", "before", "but", "by", "for", "from",
> I'm including a patch with the "from" and "to" constants extended with all
> the characters I found on Wikipedia that seemed to be of any use. This
> should cover all the Slavic countries except those which use the Cyrillic
> alphabet.
Was this patch committed to the SVN version of Django? In 0.95 I was facing this issue with French accents.
> > I'm including a patch with the "from" and "to" constants extended with all
> > the characters I found on Wikipedia that seemed to be of any use. This
> > should cover all the Slavic countries except those which use the Cyrillic
> > alphabet.
> Was this patch committed to the SVN version of Django? In 0.95 I was facing
> this issue with French accents.
Similarly Lithuanian would be:
ą = a
č = c
ę = e
ė = e
į = i
š = s
ų = u
ū = u
ž = z
I am just wondering whether the slugify function should depend on the chosen language or not. It seems that there are not many differences among stripped accented letters in different languages, so maybe it should stay the same for all of them. Whatever we decide, ß should still be translated to ss, not S. What is the opinion of the others?
And also, if we are already adding localizations to the slugify function, shouldn't Greek, Russian, and other non-Latin alphabets also be transliterated to the Latin character set?
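For non-Latin alphabets the same table-driven idea would apply, with most values being one or two Latin letters. Below is a minimal sketch for a few Russian letters only. `CYRILLIC_TO_LATIN` and `transliterate` are hypothetical names, the table is deliberately partial, and the romanization chosen here is just one common convention rather than any agreed standard.

```javascript
// Partial, illustrative table: lowercase Cyrillic letter -> Latin letters.
// The soft sign maps to an empty string, so it is simply dropped.
var CYRILLIC_TO_LATIN = {
    'а': 'a', 'б': 'b', 'в': 'v', 'г': 'g', 'д': 'd', 'е': 'e',
    'ж': 'zh', 'з': 'z', 'и': 'i', 'к': 'k', 'л': 'l', 'м': 'm',
    'н': 'n', 'о': 'o', 'п': 'p', 'р': 'r', 'с': 's', 'т': 't',
    'у': 'u', 'ш': 'sh', 'ь': ''
};

function transliterate(s) {
    // Slugs end up lowercase anyway, so lowercase first to halve the table.
    var lower = s.toLowerCase();
    var out = '';
    for (var i = 0; i < lower.length; i++) {
        var ch = lower.charAt(i);
        out += CYRILLIC_TO_LATIN.hasOwnProperty(ch) ? CYRILLIC_TO_LATIN[ch] : ch;
    }
    return out;
}

transliterate('шершень'); // → 'shershen' (the Russian hornet)
```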
Regards, Aidas Bendoraitis [aka Archatas]
On 11/17/06, Kamil Wdowicz <kwdow...@zenstudio.pl> wrote:
Aidas Bendoraitis wrote:
> Similarly Lithuanian would be:
> ą = a
> č = c
> ę = e
> ė = e
> į = i
> š = s
> ų = u
> ū = u
> ž = z
> I am just wondering whether the slugify function should depend on the
> chosen language or not. It seems that there are not many differences
> among stripped accented letters in different languages, so maybe it
> should stay the same for all of them. Whatever we decide, ß should still
> be translated to ss, not S. What is the opinion of the others?
> And also, if we are already adding localizations to the slugify
> function, shouldn't Greek, Russian, and other non-Latin alphabets also
> be transliterated to the Latin character set?
> Regards,
> Aidas Bendoraitis [aka Archatas]
> On 11/17/06, Kamil Wdowicz <kwdow...@zenstudio.pl> wrote:
> > Polish:
> > ą = a
> > ć = c
> > ź or ż = z
> > ę = e
> > ó = o
> > ł = l
> > ś = s
> > ń = n