urlify.js blocks out non-English chars - 2nd try?
From: "David Larlet" <lar...@gmail.com>
Date: Thu, 6 Jul 2006 10:57:10 +0200
Local: Thurs, Jul 6 2006 4:57 am
Subject: urlify.js blocks out non-English chars - 2nd try?
Hi all,

I've recently added an enhancement (ticket #2282) about urlify without
checking for duplicate and there is already a proposal (my mistake)
and a discussion on this mailing-list which were unfortunately closed
now: http://groups.google.com/group/django-developers/browse_thread/thread...

I'd like to know if it's possible to do something about it? What are
previous conclusions and facts since the last discussion? I'm new in
Django and I may help in Python but not in js so I need your help ;).

My current problem is with French accents, so it's not really difficult
(I've pasted a JS snippet from a French blog app on my ticket), but I'm
aware there are more problems with other languages. As for
UTF-8 URLs, I don't know whether they are really a good idea, because they
are currently associated with phishing...

Cheers,
David Larlet


From: Malcolm Tredinnick <malc...@pointy-stick.com>
Date: Thu, 06 Jul 2006 19:05:12 +1000
Local: Thurs, Jul 6 2006 5:05 am
Subject: Re: urlify.js blocks out non-English chars - 2nd try?

On Thu, 2006-07-06 at 10:57 +0200, David Larlet wrote:
> Hi all,

> I've recently added an enhancement (ticket #2282) about urlify without
> checking for duplicate and there is already a proposal (my mistake)
> and a discussion on this mailing-list which were unfortunately closed
> now: http://groups.google.com/group/django-developers/browse_thread/thread...

> I'd like to know if it's possible to do something about it? What are
> previous conclusions and facts since the last discussion? I'm new in
> Django and I may help in Python but not in js so I need your help ;).

There was reasonable consensus in one of the threads about doing
something similar (but a bit smaller) than what Wordpress does. Now it's
a case of "patches gratefully accepted". A lot of people say this is a
big issue for them, so it's something that will be fixed one day, but
nobody has put in a reasonable patch yet. When that happens, we can
progress.

Malcolm


From: Bill de hÓra <b...@dehora.net>
Date: Fri, 07 Jul 2006 10:06:56 +0100
Local: Fri, Jul 7 2006 5:06 am
Subject: Re: urlify.js blocks out non-English chars - 2nd try?

Malcolm Tredinnick wrote:
> There was reasonable consensus in one of the threads about doing
> something similar (but a bit smaller) than what Wordpress does. Now it's
> a case of "patches gratefully accepted". A lot of people say this is a
> big issue for them, so it's something that will be fixed one day, but
> nobody has put in a reasonable patch yet. When that happens, we can
> progress.

What's the expected scope of the downcoding? Would it be throwing a few
dicts together in the admin js, or a callback to unicodedata.normalize?

cheers
Bill


From: Malcolm Tredinnick <malc...@pointy-stick.com>
Date: Fri, 07 Jul 2006 19:59:43 +1000
Local: Fri, Jul 7 2006 5:59 am
Subject: Re: urlify.js blocks out non-English chars - 2nd try?
Hi Bill,

On Fri, 2006-07-07 at 10:06 +0100, Bill de hÓra wrote:
> Malcolm Tredinnick wrote:

> > There was reasonable consensus in one of the threads about doing
> > something similar (but a bit smaller) than what Wordpress does. Now it's
> > a case of "patches gratefully accepted". A lot of people say this is a
> > big issue for them, so it's something that will be fixed one day, but
> > nobody has put in a reasonable patch yet. When that happens, we can
> > progress.

> What's the expected scope of the downcoding? Would it be throwing a few
> dicts together in the admin js, or a callback to unicodedata.normalize?

I thought there was some sort of consensus; I didn't claim all the
details had been settled. Personally, I was kind of hoping whoever wrote
the patch might think this sort of thing through and give us a concrete
target to throw ideas at. :-)

My own misguided thoughts (I *really* don't want to have to write this
patch): I thought the original design wish was "something that read
sensibly" here, since slugifying is already a lossy process. If I had to
write it today, I would do the "dictionary mapping on the client side"
version. But you're more of an expert here: what does normalization gain
us without having to move to fully internationalised URLs, which still
seem to be a phishing vector: if we allow fully international URLs, then
doing everything properly would make sense. However, is it universally
supported as "not a security risk" in all common browsers yet?

Regards,
Malcolm


From: Antonio Cavedoni <anto...@cavedoni.org>
Date: Fri, 7 Jul 2006 12:37:51 +0200
Local: Fri, Jul 7 2006 6:37 am
Subject: Re: urlify.js blocks out non-English chars - 2nd try?
On 7 Jul 2006, at 11:06, Bill de hÓra wrote:

> What's the expected scope of the downcoding? Would it be throwing a few
> dicts together in the admin js, or a callback to unicodedata.normalize?

I’m not sure unicodedata.normalize is enough. It kind of works, if  
you do something like:

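import unicodedata

def slugify_utf8_slug(slug):
     # slug is a UTF-8 byte string; keep only the base character of each
     # NFD-decomposed code point, dropping accents and diacritics
     normalized = []
     for c in slug.decode('utf-8'):
         normalized.append(unicodedata.normalize('NFD', c)[0])
     return ''.join(normalized)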

Then it works for simple slugs:

 >>> slugify_utf8_slug("müller")
u'muller'
 >>> slugify_utf8_slug('perché')
u'perche'

But this is because “ü” and “é” can be decomposed as “u”  
and “e” plus accent or diacritic. But then you couldn’t have  
language-specific decompositions like the “Ä = Ae” mentioned here:

  http://dev.textpattern.com/browser/releases/4.0.3/source/textpattern/lib/i18n-ascii.txt

Also:

 >>> print slugify_utf8_slug("Δ")
Δ

So this would be no good.

Perhaps I’m missing something but unicodedata won’t cut it. If  
we’re going the asciify-route, we need a lookup table.

Cheers.
--
Antonio
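
The combination Antonio arrives at (a lookup table for language-specific cases, with Unicode decomposition as a fallback) might look roughly like this in Python 3. This is only a sketch: `CUSTOM_MAP` and `asciify` are hypothetical names, and the table entries are illustrative, not a vetted mapping.

```python
import unicodedata

# Hypothetical, deliberately tiny lookup table; a real one would be far larger
# and its contents would be a per-locale decision.
CUSTOM_MAP = {"Ä": "Ae", "ä": "ae", "ß": "ss", "Δ": "D"}

def asciify(text, table=CUSTOM_MAP):
    # Compose first so that table keys (precomposed characters) can match.
    text = unicodedata.normalize("NFC", text)
    out = []
    for ch in text:
        if ch in table:
            # Language-specific replacement takes priority.
            out.append(table[ch])
            continue
        # NFD splits e.g. "é" into "e" plus a combining accent;
        # keep only the parts that are not combining marks.
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)

print(asciify("müller"), asciify("perché"), asciify("Ärger"), asciify("Δ"))
```

Characters that have neither a table entry nor a decomposition pass through unchanged, so a later step would still have to strip or map them.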


From: Bill de hÓra <b...@dehora.net>
Date: Fri, 07 Jul 2006 16:50:18 +0100
Local: Fri, Jul 7 2006 11:50 am
Subject: Re: urlify.js blocks out non-English chars - 2nd try?

Antonio Cavedoni wrote:
> So this would be no good.

> Perhaps I’m missing something but unicodedata won’t cut it.

This is my point. Cut what exactly?  "No good" for what exactly? We
could file patches to see what sticks, but it might be better to figure
what's wanted first, instead of playing fetch me a rock.

A slug function can range from a regex replace to a complete text
normalization/decomposition/lookup service that will never be enough
because even unicode+mappings aren't a complete solution.

If it's the full unicode+mappings case, I'm doubtful that processing
should be done on the client, not only because the unicode database is
large, but also because the server will have a well tested setup via
unicodedata.

If there's a need to keep the slug's current behaviour (fill out as you
write, as opposed to fill out on the server), that suggests an ajax
callback to the server to get at unicodedata.

If a latin1 hack is enough, that can be sent down to the client in the
admin js. RT editors like fck do this all the time with entity
replacements. No need to use Python if we're dealing with a small subset.

Mappings: yes, ord/text mappings are grand (Greek, Russian, Turkish off
the top of my head would be good inbuilts, as would latin if a unicode
db isn't used).

If there's a need for mapping extension, there needs to be a place for
people to put dictionaries.

If the code falls into an else because it has no lookup, does it insert
a stringified hexcode or blank the character out?

Etc.

cheers
Bill
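
Bill's last question, what to do when a character has no mapping, comes down to a policy choice. A small Python 3 sketch of the two options he names follows; the function name `downcode` and the `on_missing` parameter are made up for illustration and are not part of any proposed patch.

```python
def downcode(text, table, on_missing="strip"):
    """Replace non-ASCII characters via `table`; handle unmapped ones per policy."""
    if on_missing not in ("strip", "hex"):
        raise ValueError("on_missing must be 'strip' or 'hex'")
    out = []
    for ch in text:
        if ord(ch) < 128:
            out.append(ch)                   # plain ASCII passes through
        elif ch in table:
            out.append(table[ch])            # known mapping
        elif on_missing == "hex":
            out.append("%04x" % ord(ch))     # stringified code point
        # on_missing == "strip": the character is dropped
    return "".join(out)

print(downcode("aΔb", {}, "strip"))
print(downcode("aΔb", {}, "hex"))
```

Stripping keeps slugs short but loses information; the hex form keeps every character distinguishable at the cost of readability.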


From: Bill de hÓra <b...@dehora.net>
Date: Fri, 07 Jul 2006 18:20:11 +0100
Local: Fri, Jul 7 2006 1:20 pm
Subject: Re: urlify.js blocks out non-English chars - 2nd try?

Normalisation/decomposition gains you greater assuredness you'll throw
away what you think you're throwing away, before you try a mapping.
Unicode provides mappings down to ascii but it's not complete; mapping
decisions tend to be localized/controversial.

The phishing problem with Internationalised URLs (IRIs) is in the
internationalized domain name (IDN) where you can get redirected, and
not so much the path segment where the slug lives. I work on atom
protocol and IRIs are official IETF/W3C goodness these days (funny, we
just went through slugging on the protocol list yesterday). IRIs are
designed to be treated as encoded Unicode (utf8 most likely) so they
pass through systems without losing information. Slugging as I tend to
understand it is really about dropping down to ascii and throwing
character information away. I'm thinking that for slugs people want to
have a character replaced with an ascii equivalent and not /preserve/
character data via encoding.

It really does depend on what people want from this feature. A full full
full downcoding solution needs to go back to the server I think, do the
whole unicode bit, and use whatever custom mappings onto ascii. Whereas
a good enough approach would be set of js dicts sent to the client; that
  keeps the nice js autofill feature in the admin, and will probably
cover 95% of use cases.

cheers
Bill
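
To illustrate the distinction between the domain name and the path segment: in Python 3 the standard library encodes the two parts differently. The host and path below are placeholder examples, not taken from the thread.

```python
from urllib.parse import quote, unquote

# Host names go through IDNA, which produces the "xn--" form used in DNS.
host = "bücher.example".encode("idna").decode("ascii")

# Path segments are percent-encoded UTF-8; the encoding is reversible,
# so no character information is thrown away.
path = quote("/blog/perché")

print(host)
print(path)
print(unquote(path))
```

A slug, by contrast, deliberately throws the accent away: it would be `perche`, with no way back to `perché`.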


From: Bill de hÓra <b...@dehora.net>
Date: Wed, 12 Jul 2006 02:54:03 +0100
Local: Tues, Jul 11 2006 9:54 pm
Subject: Suggestion for #1602, urlify.js blocks out non-English chars [was urlify.js blocks out...]

Malcolm Tredinnick wrote:
> Personally, I was kind of hoping whoever wrote
> the patch might think this sort of thing through and give us a concrete
> target to throw ideas at. :-)

Hi Malcolm,

Here we go:

[[[urlify.js:

     var LATIN_MAP =
     {
        'À':'A',
        'Á':'A',
        'à':'a',
        'á':'a',
        '©':'c'
     } ;
     var LATIN_SYMBOLS_MAP =
     {
     }
     var GREEK_MAP =
     {
     }
     var TURKISH_MAP =
     {
     }
     var RUSSIAN_MAP =
     {
     }

var ALL_DOWNCODE_MAPS=new Array()
ALL_DOWNCODE_MAPS[0]=LATIN_MAP
ALL_DOWNCODE_MAPS[1]=LATIN_SYMBOLS_MAP
ALL_DOWNCODE_MAPS[2]=GREEK_MAP
ALL_DOWNCODE_MAPS[3]=TURKISH_MAP
ALL_DOWNCODE_MAPS[4]=RUSSIAN_MAP

var Downcoder = new Object() ;

Downcoder.Initialize = function()
{
     if (Downcoder.map) // already made
         return ;
     Downcoder.map ={}
     Downcoder.chars = '' ;
     for(var i in ALL_DOWNCODE_MAPS)
     {
         var lookup = ALL_DOWNCODE_MAPS[i]
         for (var c in lookup)
         {
             Downcoder.map[c] = lookup[c] ;
             Downcoder.chars += c ;
         }
      }
     Downcoder.regex = new RegExp('[' + Downcoder.chars + ']|[^' +
Downcoder.chars + ']+','g') ;

}

downcode = function( slug )
{
     Downcoder.Initialize() ;
     var downcoded = "" ;
     // split the input into single mapped chars and runs of unmapped chars
     var pieces = slug.match(Downcoder.regex);
     if(pieces)
     {
        for (var i = 0 ; i < pieces.length ; i++)
        {
             var mapped = null ;
             if (pieces[i].length == 1)
             {
                 mapped = Downcoder.map[pieces[i]] ;
             }
             // mapped chars are replaced; anything else passes through
             // unchanged so that URLify can strip it later
             downcoded += (mapped != null) ? mapped : pieces[i] ;
        }
     }
     else
     {
         downcoded = slug;
     }
     return downcoded;

}

function URLify(s, num_chars) {
     s = downcode(s);
     removelist = ["a", "an", "as", "at", "before", "but", "by", "for",
"from",
                   "is", "in", "into", "like", "of", "off", "on",
"onto", "per",
                   "since", "than", "the", "this", "that", "to", "up",
"via",
                   "with"];
     r = new RegExp('\\b(' + removelist.join('|') + ')\\b', 'gi');
     s = s.replace(r, '');
     // if downcode fails, the char will be stripped here
     s = s.replace(/[^-A-Z0-9\s]/gi, '');
     s = s.replace(/^\s+|\s+$/g, ''); // trim leading/trailing spaces
     s = s.replace(/[-\s]+/g, '-');   // convert spaces to hyphens
     s = s.toLowerCase();             // convert to lowercase
     return s.substring(0, num_chars);// trim to first num_chars chars
     return s.substring(0, num_chars);// trim to first num_chars chars
}

]]]

I need to test this properly and fill in the mappings, but the gist of
the approach should be clear. When that's done, unless someone has an
objection, I'll file a patch against

   http://code.djangoproject.com/ticket/1602
   (also 2282)
.

cheers
Bill


From: Malcolm Tredinnick <malc...@pointy-stick.com>
Date: Wed, 12 Jul 2006 12:09:15 +1000
Local: Tues, Jul 11 2006 10:09 pm
Subject: Re: Suggestion for #1602, urlify.js blocks out non-English chars [was urlify.js blocks out...]

On Wed, 2006-07-12 at 02:54 +0100, Bill de hÓra wrote:
> Malcolm Tredinnick wrote:

> > Personally, I was kind of hoping whoever wrote
> > the patch might think this sort of thing through and give us a concrete
> > target to throw ideas at. :-)

> Hi Malcolm,

> Here we go:

aah ... batter up! :-)

Probably only one of the last two lines is necessary. :-)

I am about the worst guy in the world to review Javascript code, but
I agree it seems to behave logically. I'm just not familiar enough with
the ins and outs enough to spot any sneaky problems that might creep
into code like this. Fortunately, the audience is full of people with
real clues.

> I need to test this properly and fill in the mappings, but the gist of
> the approach should be clear. When that's done, unless someone has an
> objection, I'll file a patch against

>    http://code.djangoproject.com/ticket/1602
>    (also 2282)
> .

In the interests of economy I just closed #1602 as a dupe of #2282 (the
latter has more links to mailing list threads). It's too hard to track
all the dupes and semi-related bugs.

Anyway, thanks for doing the work, Bill. This looks like something that
is a great start and that we can tweak as people find problems in their
own locales.

Thanks,
Malcolm


From: Andrey Golovizin <golovi...@gmail.com>
Date: Wed, 12 Jul 2006 12:31:51 +0700
Local: Wed, Jul 12 2006 1:31 am
Subject: Re: Suggestion for #1602, urlify.js blocks out non-English chars

Bill de hÓra wrote:
> I need to test this properly and fill in the mappings

The official Cyrillic-Latin mapping could be found here:
http://en.wikipedia.org/wiki/Translit

It could be like this:
var RUSSIAN_MAP =
{
"а": "a",    "к": "k",    "х": "kh",
"б": "b",    "л": "l",    "ц": "ts",
"в": "v",    "м": "m",    "ч": "ch",
"г": "g",    "н": "n",    "ш": "sh",
"д": "d",    "о": "o",    "щ": "shch",
"е": "e",    "п": "p",    "ъ": "''",
"ё": "jo",   "р": "r",    "ы": "y",
"ж": "zh",   "с": "s",    "ь": "'",
"з": "z",    "т": "t",    "э": "eh",
"и": "i",    "у": "u",    "ю": "ju",
"й": "j",    "ф": "f",    "я": "ja"

}

Andrey

From: "James Bennett" <ubernost...@gmail.com>
Date: Wed, 12 Jul 2006 01:58:38 -0500
Local: Wed, Jul 12 2006 2:58 am
Subject: Re: Suggestion for #1602, urlify.js blocks out non-English chars [was urlify.js blocks out...]
On 7/11/06, Bill de hÓra <b...@dehora.net> wrote:

> I need to test this properly and fill in the mappings, but the gist of
> the approach should be clear. When that's done, unless someone has an
> objection, I'll file a patch against

The structure and logic look good; my only worry is that this will
eventually get pretty unwieldy; just looking over the languages we
already support through the i18n system, we'd be carrying around a
pretty huge mapping from the get-go, and it would only grow over time.
And God help us if we seriously decide to support slugifying CJK ;)

--
"May the forces of evil become confused on the way to your house."
  -- George Carlin


From: Bill de hÓra <b...@dehora.net>
Date: Wed, 12 Jul 2006 12:02:18 +0100
Local: Wed, Jul 12 2006 7:02 am
Subject: Re: Suggestion for #1602, urlify.js blocks out non-English chars [was urlify.js blocks out...]

Malcolm Tredinnick wrote:
>>      return s.substring(0, num_chars);// trim to first num_chars chars
>>      return s.substring(0, num_chars);// trim to first num_chars chars
>> }
>> ]]]

> Probably only one of the last two lines is necessary. :-)

DRY!

> I am about the worst guy in the world to review Javascript code, but
> I agree it seems to behave logically. I'm just not familiar enough with
> the ins and outs enough to spot any sneaky problems that might creep
> into code like this. Fortunately, the audience is full of people with
> real clues.

Fair enough; it's out there so people can push back on the approach,
before I crack on.

> Anyway, thanks for doing the work, Bill. This looks like something that
> is a great start and that we can tweak as people find problems in their
> own locales.

Ok, I'll start testing this and filling in some mappings. I'll send in a
patch as soon as I can.

cheers
Bill


From: Bill de hÓra <b...@dehora.net>
Date: Wed, 12 Jul 2006 12:08:58 +0100
Local: Wed, Jul 12 2006 7:08 am
Subject: Re: Suggestion for #1602, urlify.js blocks out non-English chars [was urlify.js blocks out...]

James Bennett wrote:
> On 7/11/06, Bill de hÓra <b...@dehora.net> wrote:
>> I need to test this properly and fill in the mappings, but the gist of
>> the approach should be clear. When that's done, unless someone has an
>> objection, I'll file a patch against

> The structure and logic look good; my only worry is that this will
> eventually get pretty unwieldy; just looking over the languages we
> already support through the i18n system, we'd be carrying around a
> pretty huge mapping from the get-go, and it would only grow over time.
> And God help us if we seriously decide to support slugifying CJK ;)

Thanks James. I think if it got to that stage, it's in the class of good
problems to have (Django having achieved world domination). I guess we'd
need to look at going back to python to run against unicodedata and/or
server sided mappings as Antonio pointed out. I think using a Downcoder
object gives room to callback to the server to do the work if needed
(the logic would be refactored into python). I'm not sure how people
would fold in their locale/custom mappings at that stage, but it might
be a nice to have feature.

cheers
Bill
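
Bill's remark that the logic could be refactored into Python can be sketched as follows (Python 3). This is a rough port for illustration, not the patch itself: the maps are small subsets, and `build_table`, `DOWNCODE_TABLE` and the `remove_list` parameter are names assumed here. It uses `str.translate` instead of the regex-based piece matching in the JavaScript.

```python
import re

# Illustrative subsets only; the maps posted in the thread are longer.
LATIN_MAP = {"À": "A", "Á": "A", "à": "a", "á": "a", "©": "c"}
RUSSIAN_MAP = {"в": "v", "е": "e", "и": "i", "м": "m",
               "п": "p", "р": "r", "т": "t"}
ALL_DOWNCODE_MAPS = [LATIN_MAP, RUSSIAN_MAP]

REMOVE_LIST = ["a", "an", "as", "at", "before", "but", "by", "for", "from",
               "is", "in", "into", "like", "of", "off", "on", "onto", "per",
               "since", "than", "the", "this", "that", "to", "up", "via",
               "with"]

def build_table(maps):
    merged = {}
    for mapping in maps:
        merged.update(mapping)      # later maps win on conflicting keys
    return str.maketrans(merged)    # single-char keys, string values

DOWNCODE_TABLE = build_table(ALL_DOWNCODE_MAPS)

def downcode(s):
    # Unmapped characters are left alone; urlify strips them below.
    return s.translate(DOWNCODE_TABLE)

def urlify(s, num_chars, remove_list=REMOVE_LIST):
    s = downcode(s)
    pattern = r"\b(" + "|".join(map(re.escape, remove_list)) + r")\b"
    s = re.sub(pattern, "", s, flags=re.IGNORECASE)
    s = re.sub(r"[^-A-Za-z0-9\s]", "", s)   # drop anything still non-ASCII
    s = s.strip()                           # trim leading/trailing spaces
    s = re.sub(r"[-\s]+", "-", s)           # spaces to hyphens
    return s.lower()[:num_chars]

print(urlify("привет мир", 50))
print(urlify("À la carte", 50))
```

Note that, as in the JavaScript, lowercasing happens after downcoding, so maps that only list lowercase letters will lose uppercase input unless the maps (or the order of steps) are extended.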


From: Bill de hÓra <b...@dehora.net>
Date: Wed, 12 Jul 2006 12:10:02 +0100
Local: Wed, Jul 12 2006 7:10 am
Subject: Re: Suggestion for #1602, urlify.js blocks out non-English chars

Andrey Golovizin wrote:
> Bill de hÓra wrote:
>> I need to test this properly and fill in the mappings

> The official Cyrillic-Latin mapping could be found here:
> http://en.wikipedia.org/wiki/Translit

Great resource Andrey, thanks. I'll fold that mapping in.

cheers
Bill


From: "David Larlet" <lar...@gmail.com>
Date: Wed, 12 Jul 2006 13:31:57 +0200
Local: Wed, Jul 12 2006 7:31 am
Subject: Re: Suggestion for #1602, urlify.js blocks out non-English chars [was urlify.js blocks out...]
2006/7/12, Bill de hÓra <b...@dehora.net>:

Hi,

Thanks for your work. It looks good to me, but what about the size of the
final urlify.js? I'm afraid it will grow a lot, and the current one may
be good enough for most cases. Would it be possible to switch easily
between the two for English writers?

Another question is about customisation of removelist. What's your
opinion about that?

Cheers,
David


From: "Petar Marić" <petar.ma...@gmail.com>
Date: Wed, 12 Jul 2006 14:53:30 +0200
Local: Wed, Jul 12 2006 8:53 am
Subject: Re: Suggestion for #1602, urlify.js blocks out non-English chars
Here're Serbian mappings (default Serbian language in Django is Serbian Latin):

var SERBIAN_LATIN_MAP =
{
"š": "s",  "đ": "dj",  "ž": "z",
"č": "c",  "ć": "c"

}

var SERBIAN_CYRILLIC_MAP =
{
"а": "a",  "б":  "b",  "в":  "v",
"г": "g",  "д":  "d",  "ђ": "dj",
"е": "e",  "ж":  "z",  "з":  "z",
"и": "i",  "ј":  "j",  "к":  "k",
"л": "l",  "љ": "lj",  "м":  "m",
"н": "n",  "њ": "nj",  "о":  "o",
"п": "p",  "р":  "r",  "с":  "s",
"т": "t",  "ћ":  "c",  "у":  "u",
"ф": "f",  "х":  "h",  "ц":  "c",
"ч": "c",  "џ": "dz",  "ш":  "s"

}

Oh, and a big +1 on the proposed fix of slugify :)
--
Petar Marić
*e-mail: petar.ma...@gmail.com
*mobile: +381 (64) 6122467

*icq: 224720322
*skype: petar_maric
*web: http://www.petarmaric.com/


From: "everes" <mtsuy...@gmail.com>
Date: Wed, 12 Jul 2006 17:04:58 -0000
Local: Wed, Jul 12 2006 1:04 pm
Subject: Re: Suggestion for #1602, urlify.js blocks out non-English chars [was urlify.js blocks out...]
Hi.

If we Japanese tried to build character mappings, there would be millions
of them and we still couldn't cover every character.
A mapping approach doesn't seem realistic for Chinese either.

So I agree with Antonio and Bill.
But I think people who use Latin characters prefer mappings to encoding.

How about pluggable Python logic, selected by each client's language or by
the LANGUAGE_CODE setting?
I hope applications made with Django can run in any locale.

-----------------------
makoto tsuyuki
http://www.djangoproject.jp/
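
One way to read the "pluggable logic per LANGUAGE_CODE" suggestion is a small registry of slug functions keyed by language code. The sketch below (Python 3) is purely hypothetical: `register`, `slugify` and the German rules are invented for illustration and do not correspond to any existing Django API.

```python
import re

_REGISTRY = {}

def register(language_code):
    """Decorator that plugs a slug function in for one language code."""
    def decorator(fn):
        _REGISTRY[language_code] = fn
        return fn
    return decorator

def default_slugify(text):
    # ASCII-only behaviour: non-ASCII characters are simply dropped.
    text = re.sub(r"[^\w\s-]", "", text, flags=re.ASCII).strip().lower()
    return re.sub(r"[-\s]+", "-", text)

@register("de")
def slugify_de(text):
    text = text.lower()
    for src, dst in (("ä", "ae"), ("ö", "oe"), ("ü", "ue"), ("ß", "ss")):
        text = text.replace(src, dst)
    return default_slugify(text)

def slugify(text, language_code):
    # Try "de-at", then "de", then the default.
    fn = (_REGISTRY.get(language_code)
          or _REGISTRY.get(language_code.split("-")[0])
          or default_slugify)
    return fn(text)

print(slugify("Müller Straße", "de"))
print(slugify("Über uns", "de-at"))
```

A locale with no usable ASCII mapping could register an encoding-based function instead of a mapping-based one.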


From: Julian 'Julik' Tarkhanov <julian.tarkha...@gmail.com>
Date: Wed, 12 Jul 2006 16:27:34 +0200
Local: Wed, Jul 12 2006 10:27 am
Subject: Re: urlify.js blocks out non-English chars - 2nd try?

On 7-jul-2006, at 17:50, Bill de hÓra wrote:

> This is my point. Cut what exactly?  "No good" for what exactly? We
> could file patches to see what sticks, but it might be better to figure
> what's wanted first, instead of playing fetch me a rock.

This is handled by Unicode standard and is called transliteration.  
The problem is that it's locale dependent.
AFAIK Python's codecs don't implement it (but ICU4R does). If you go  
for tables it's going to be _many_.

URLs can be Unicode-aware, just encoded - so why not replacing  
whitespace with dashes and doing a Unicode downcase,
and be done with it? Some browsers (Safari) even show you the request  
string verbatim, so it's very readable.
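
The approach Julik describes (dashes for whitespace, a Unicode-aware lowercase, and nothing else) is short enough to sketch in Python 3. The function names here are assumptions made for the example.

```python
from urllib.parse import quote, unquote

def unicode_slug(title):
    # str.lower() is Unicode-aware; split() collapses runs of whitespace.
    return "-".join(title.lower().split())

def slug_to_path(slug):
    # Percent-encode the UTF-8 bytes so the slug is safe on the wire.
    return quote(slug)

slug = unicode_slug("Привет  Мир")
path = slug_to_path(slug)
print(slug)
print(path)
```

The encoded form is pure ASCII and decodes back to the same slug with `unquote`, so readability then depends only on whether the browser displays the decoded form.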


From: "Jeroen Ruigrok van der Werven" <asheme...@gmail.com>
Date: Sat, 15 Jul 2006 09:58:49 +0200
Local: Sat, Jul 15 2006 3:58 am
Subject: Re: urlify.js blocks out non-English chars - 2nd try?
On 7/12/06, Julian 'Julik' Tarkhanov <julian.tarkha...@gmail.com> wrote:

> This is handled by Unicode standard and is called transliteration.

And also not quite true. Arabic, for example, but also Hindi, have no
real standardized transliteration schemes.
Also, for Japanese, are you going to follow kunrei-shiki or rather the
more widely used hepburn transliteration? Or perhaps even nippon-shiki
if you feel like sticking to strictness.
Chinese, want to settle for hanyu pinyin or rather wade-giles?

Be careful because you are treading on very unstable ground now.

--
Jeroen Ruigrok van der Werven


From: gabor <ga...@nekomancer.net>
Date: Sun, 16 Jul 2006 21:01:45 +0200
Local: Sun, Jul 16 2006 3:01 pm
Subject: Re: urlify.js blocks out non-English chars - 2nd try?
Jeroen Ruigrok van der Werven wrote:

> On 7/12/06, Julian 'Julik' Tarkhanov <julian.tarkha...@gmail.com> wrote:
>> This is handled by Unicode standard and is called transliteration.

> Also, for Japanese, are you going to follow kunrei-shiki or rather the
> more widely used hepburn transliteration? Or perhaps even nippon-shiki
> if you feel like sticking to strictness.

i think we do not need to discuss japanese at all. after all, there's no
transliteration for kanji. so it's imho pointless to argue about
kana-transliteration, when you cannot transliterate kanji.

gabor


From: "tsuyuki makoto" <mtsuy...@gmail.com>
Date: Mon, 17 Jul 2006 15:25:01 +0900
Local: Mon, Jul 17 2006 2:25 am
Subject: Re: Re: urlify.js blocks out non-English chars - 2nd try?
2006/7/17, gabor <ga...@nekomancer.net>:

> Jeroen Ruigrok van der Werven wrote:
> > On 7/12/06, Julian 'Julik' Tarkhanov <julian.tarkha...@gmail.com> wrote:
> >> This is handled by Unicode standard and is called transliteration.

> > Also, for Japanese, are you going to follow kunrei-shiki or rather the
> > more widely used hepburn transliteration? Or perhaps even nippon-shiki
> > if you feel like sticking to strictness.

> i think we do not need to discuss japanese at all. after all, there's no
> transliteration for kanji. so it's imho pointless to argue about
> kana-transliteration, when you cannot transliterate kanji.

We Japanese know that we can't transliterate Japanese to ASCII.
So I would at least like to do the following, so that no letter
disappears and the original text can be restored.
(FileField and ImageField have the same problem of letters disappearing.)

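import re
from encodings import idna

def slug_ja(word):
    """Return a slug for a UTF-8 encoded byte string (Python 2)."""
    try:
        # Raises UnicodeDecodeError if word contains any non-ASCII byte.
        unicode(word, 'ascii')
    except UnicodeDecodeError:
        # Non-ASCII input: keep every letter in a reversible IDNA form.
        # Decode before lower() so that lowercasing works on characters,
        # not on raw UTF-8 bytes.
        return word.decode('utf-8').strip().lower().encode('idna')
    # Plain ASCII input: drop punctuation, then collapse spaces and hyphens.
    slug = re.sub(r'[^\w\s-]', '', word).strip().lower()
    return re.sub(r'[-\s]+', '-', slug)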


Antonio Cavedoni  
From: Antonio Cavedoni <anto...@cavedoni.org>
Date: Wed, 19 Jul 2006 11:16:11 +0200
Local: Wed, Jul 19 2006 5:16 am
Subject: Re: urlify.js blocks out non-English chars - 2nd try?
On 17 Jul 2006, at 8:25, tsuyuki makoto wrote:

I’m not convinced by this approach, but I would suggest using the  
“punycode” instead of the “idna” encoder anyway. The results don’t  
include the initial “xn--” marks which are only useful in a domain  
name, not in a URI path. Also, the “from encodings […]” line appears  
to be unnecessary on my Python 2.3.5 and 2.4.1 on OSX.

[[[
 >>> p = u"perché"
 >>> from encodings import idna
 >>> p.encode('idna')
'xn--perch-fsa'
 >>> p.encode('punycode')
'perch-fsa'
 >>> puny = 'perch-fsa'
 >>> puny.decode('punycode')
u'perch\xe9'
 >>> print puny.decode('punycode')
perché
 >>> pu = puny.decode('punycode') # it's reversible
 >>> print pu
perché
]]]

More on Punycode: http://en.wikipedia.org/wiki/Punycode
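
For readers on Python 3, the same round trip might be sketched as below. This is an illustration under assumptions, not part of the original post: the helper names to_path_segment and from_path_segment are made up here, and because Python 3 codecs return bytes, an extra ASCII decode is added to get a str back.

```python
def to_path_segment(text):
    """Encode text with the built-in 'punycode' codec and return an ASCII str."""
    # str.encode returns bytes in Python 3; the result is pure ASCII.
    return text.encode("punycode").decode("ascii")


def from_path_segment(segment):
    """Reverse to_path_segment and recover the original text."""
    return segment.encode("ascii").decode("punycode")


print(to_path_segment("perché"))        # perch-fsa
print(from_path_segment("perch-fsa"))   # perché
print("perché".encode("idna"))          # b'xn--perch-fsa'
```

As in the session above, only the 'idna' codec adds the "xn--" prefix.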

Cheers.
--
Antonio


Gábor Farkas  
From: Gábor Farkas <ga...@nekomancer.net>
Date: Wed, 19 Jul 2006 11:29:51 +0200
Local: Wed, Jul 19 2006 5:29 am
Subject: Re: urlify.js blocks out non-English chars - 2nd try?

i somehow have the feeling that we lost the original idea here a little.

(as far as i understand, by urlify.js we are talking about slug
auto-generation, please correct me if i'm wrong).

we are auto-generating slugs when it "makes sense". for example, for
english it makes sense to remove all the non-word stuff, because what
remains can still be read, be understood, and generally looks fine when
being a part of the URL.

for many other languages (hungarian or the slavic ones), it also "makes
sense" to simply drop all the diacritical marks, because the rest can
still be read, be understood, and looks fine as part of a URL.
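
A minimal Python 3 sketch of "dropping the diacritical marks", assuming Unicode decomposition is an acceptable way to do it. The function names strip_diacritics and ascii_slug are hypothetical, and this is not what urlify.js itself does:

```python
import re
import unicodedata


def strip_diacritics(text):
    """Remove combining marks after decomposing each character (NFKD)."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def ascii_slug(text):
    """Strip diacritics, then apply the usual slug rules on ASCII only."""
    text = strip_diacritics(text).lower()
    # With re.ASCII, \w only matches [a-zA-Z0-9_], so anything else is dropped.
    text = re.sub(r"[^\w\s-]", "", text, flags=re.ASCII).strip()
    return re.sub(r"[-\s]+", "-", text)


print(ascii_slug("Árvíztűrő tükörfúrógép"))  # arvizturo-tukorfurogep
```

Letters that have no Unicode decomposition (the polish "ł", for instance) are not covered by this and would still need a mapping table.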

but with punycode or whatever-code encoding japanese, what's the point?
what you get will be completely unreadable.. if you only need to
preserve the submitted data, you don't need to do anything. simply take
your unicode text, encode it to utf8, url-escape it and use it as a part
of the url. it will be ok. and on the other side you can url-unescape
and utf8-decode it and you're back. you will even be able to have ascii
stuff readably-preserved.
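
The encode-and-escape round trip described here can be sketched in Python 3 with the standard urllib.parse module. The wrapper names to_url_segment and from_url_segment are only illustrative:

```python
from urllib.parse import quote, unquote


def to_url_segment(text):
    """UTF-8 encode the text and percent-escape it for use in a URL path."""
    # safe="" also escapes "/", so the result is always a single segment.
    return quote(text, safe="", encoding="utf-8")


def from_url_segment(segment):
    """Undo the percent-escaping and decode the bytes as UTF-8."""
    return unquote(segment, encoding="utf-8")


print(to_url_segment("perché"))        # perch%C3%A9
print(to_url_segment("月"))            # %E6%9C%88
print(from_url_segment("%E6%9C%88"))   # 月
print(to_url_segment("hello-world"))   # hello-world
```

The ASCII letters come through unchanged, while everything else is escaped and can be restored exactly.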

from my point of view, with the current slug-approach, you either can
convert your text into ascii that "makes sense" or you cannot. if you
can, then enhancing urlify.js makes sense. if you cannot, then it makes
no sense. imho.

gabor


Bill de hÓra  
From: Bill de hÓra <b...@dehora.net>
Date: Wed, 19 Jul 2006 11:18:20 +0100
Local: Wed, Jul 19 2006 6:18 am
Subject: Re: urlify.js blocks out non-English chars - 2nd try?

I agree;  this has gone *way* past the original idea.

Transcribing characters into ASCII (aka "slugging") is not the same
problem as passing encoded Unicode IRI segments between clients and
servers. There's a standards-track IETF document for the latter
purpose: RFC 3987. If you want to do this, do it to spec.

I think the js mapping approach is good enough for the admin interface.
Once I can get the greek table to be picked up (argh), a patch will land...
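
To make the idea of a mapping table concrete, here is a rough Python sketch of the approach. It is not the pending patch and not the contents of urlify.js: the table GREEK_TO_ASCII is a hypothetical, deliberately incomplete example, and slug_with_table is a made-up name.

```python
import re

# Hypothetical and incomplete table, for illustration only. Which ASCII
# letter a character should map to is a per-language decision.
GREEK_TO_ASCII = {
    "α": "a", "β": "b", "γ": "g", "δ": "d",
    "ε": "e", "λ": "l", "μ": "m", "τ": "t",
}


def slug_with_table(title, table=GREEK_TO_ASCII):
    """Replace mapped characters first, then apply ASCII-only slug rules."""
    mapped = "".join(table.get(ch, ch) for ch in title.lower())
    # Characters missing from the table are not ASCII word characters,
    # so they are dropped here.
    mapped = re.sub(r"[^\w\s-]", "", mapped, flags=re.ASCII).strip()
    return re.sub(r"[-\s]+", "-", mapped)


print(slug_with_table("Δελτα Γαμμα"))  # delta-gamma
```

Anything the table does not list, accented vowels included, simply disappears, which is why each language needs a reasonably complete table.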

cheers
Bill

* http://www.ietf.org/rfc/rfc3987.txt


Jeroen Ruigrok van der Werven  
From: "Jeroen Ruigrok van der Werven" <asheme...@gmail.com>
Date: Thu, 20 Jul 2006 11:56:29 +0200
Local: Thurs, Jul 20 2006 5:56 am
Subject: Re: urlify.js blocks out non-English chars - 2nd try?
On 7/16/06, gabor <ga...@nekomancer.net> wrote:

> i think we do not need to discuss japanese at all. after all, there's no
> transliteration for kanji. so it's imho pointless to argue about
> kana-transliteration, when you cannot transliterate kanji.

If you mean that you cannot easily deduce whether the kanji for moon 月
should be transliterated according to the reading 'tsuki' or 'getsu',
then yes, you are correct. But you *can* transliterate them according
to their on or kun reading.

--
Jeroen Ruigrok van der Werven

