Help with regex for URL with Spanish characters


mamcxyz

Jun 30, 2006, 5:53:25 PM
to Django users
I'm building a site for restaurants, like a yellow pages.

I want to provide listings by state / city / zone. Some cities in
Colombia are "Medellín", "Santa Marta" and so on...

So the URLs are transformed to Medell%C3%ADn and Santa%20Marta.

Easy, I think... in the urls:

(r'^(?P<depto>[a-zA-Z0-9%\-]+)/(?P<city>[a-zA-Z0-9%\-]+)/$',
'restaurant.views.byCity' ),

to match depto and city.

I tested this in interactive mode:

re.match(r'^(?P<depto>[a-zA-Z0-9%\-]+)/(?P<city>[a-zA-Z0-9%\-]+)/$',r'a/Bogot%C3%A1/').groups()
>>('a', 'Bogot%C3%A1')

However, the Django site can't find it...

Page not found (404)

I can't find another regular expression that works...

Any idea?

Jeremy Dunck

Jun 30, 2006, 6:25:34 PM
to django...@googlegroups.com
The %C3%AD is just escaping for the URL, and isn't actually a character you'll see from the web server.
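
A minimal sketch of that point, assuming Python 3 and assuming that the standard library's `urllib.parse.unquote` stands in for the percent-decoding done before the URL resolver sees the path (the department name and variable names are only illustrative):

```python
import re
from urllib.parse import unquote

# What the browser sends on the wire.
escaped_path = "Antioquia/Medell%C3%ADn/"

# What the pattern is matched against, assuming the path has already
# been percent-decoded as UTF-8.
decoded_path = unquote(escaped_path)

# The pattern from the original post: ASCII letters, digits, '%' and '-'.
pattern = re.compile(r'^(?P<depto>[a-zA-Z0-9%\-]+)/(?P<city>[a-zA-Z0-9%\-]+)/$')

escaped_match = pattern.match(escaped_path)  # matches: only ASCII and '%'
decoded_match = pattern.match(decoded_path)  # None: 'í' is not in the class

print(decoded_path)
print(escaped_match is not None, decoded_match is not None)
```

Under that assumption the interactive test succeeds because it feeds the escaped string to the regex, while the site fails because the resolver never sees the `%` sequences.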

The unicode character í is U+00ED.  A matching regex would be:
Medell\xEDn

How did I get from C3AD to 00ED?
I guessed that the URL is escaped UTF-8.

Medellín

Medell\xEDn

unicode
1110 1101
E    D    

url (escaped utf-8)
range pattern:
110x xxxx 10xx  xxxx

The unicode bits:
       11   10  1101

left zero-pad the bits:
1100 0011 1010  1101
C    3    A     D

Yep, it's escaped UTF-8 since the escaped chars match the bit pattern for UTF-8-encoded U+00ED.
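
The same bit-shuffling can be checked in a few lines. This is only a sketch in Python 3 (the variable names are mine), building the two UTF-8 bytes for U+00ED by hand and comparing them with what the built-in encoder produces:

```python
code_point = ord("í")  # 0xED, binary 1110 1101

# Two-byte UTF-8 layout: 110xxxxx 10xxxxxx (11 payload bits in total).
lead_byte = 0b11000000 | (code_point >> 6)           # upper payload bits -> 0xC3
trail_byte = 0b10000000 | (code_point & 0b00111111)  # lower six bits     -> 0xAD

manual = bytes([lead_byte, trail_byte])
encoded = "í".encode("utf-8")

# Percent-escape each byte the way a browser does in a URL.
escaped = "".join("%{:02X}".format(b) for b in manual)

print(manual.hex(), encoded.hex(), escaped)
```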

So, unless I'm a moron (in the Pilgrim sense[1]), in general, you can expect to get UTF-8 URLs.

See UTF-8 on Wikipedia if you're totally confused.  :)

[1]
http://diveintomark.org/archives/2004/08/16/specs

Jeremy Dunck

Jun 30, 2006, 6:33:45 PM
to django...@googlegroups.com


On 6/30/06, Jeremy Dunck <jdu...@gmail.com> wrote:
> So, unless I'm a moron (in the Pilgrim sense[1]), in general, you can expect to get UTF-8 URLs.

Sorry, let me clarify:
In general, you should expect to get -unicode- URLs from the server, and this explanation will hopefully reduce confusion over URL-escaping and over how to match non-ASCII URLs in URL dispatch.

... And goddamn this unicode stuff makes my brain hurt.
It's been like a perfect storm for me lately.  I've gone years without caring about unicode, but reading about it just cuz I like to know stuff.

In the past week, I've dealt with emails getting cut off, files getting munged, form submissions failing, and regexes not matching, all due to encoding issues (and almost all in conjunction w/ django).  So yeah, I'm +1 on unicodification, because yow, it's already an issue and the least we could do is be explicit about it.  :)

mamcxyz

Jun 30, 2006, 7:19:13 PM
to Django users
I understand the encoding issue.

I modified the regex to:

^(?P<depto>[a-zA-Z0-9%\\\-]+)/(?P<city>[a-zA-Z0-9%\\\-]+)/$

With this test string:

re.match(r'^(?P<depto>[a-zA-Z0-9\%\\\-]+)/(?P<city>[a-zA-Z0-9\%\\\-]+)/$',r'Medell%C3%ADn/Medell\xEDn/').groups()

It works outside Django but not inside it...

Where do I need to look to see exactly what is evaluated?

Jeremy Dunck

Jun 30, 2006, 11:26:46 PM
to django...@googlegroups.com
On 6/30/06, mamcxyz <mam...@gmail.com> wrote:
> Where do I need to look to see exactly what is evaluated?


django.core.urlresolvers.RegexURLPattern.resolve

...and from the look of that test, no, you don't understand the encoding issue.

it'd be more like:

re.match(r'[a-zA-Z0-9\%\\\xED-\xEF]', 'Medell\xEDn')
...In other words, your character class should include every character
you'll accept.
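
As a rough sketch of that idea in Python 3 (not what Django itself does): the `SEGMENT` whitelist below is hypothetical, and it assumes the path has already been percent-decoded into Unicode text, so `%20` arrives as a space.

```python
import re

# Hypothetical whitelist for one path segment: ASCII letters and digits,
# hyphen, space, and the accented letters used in Spanish.
SEGMENT = r"[a-zA-Z0-9\- áéíóúüñÁÉÍÓÚÜÑ]+"
pattern = re.compile(r"^(?P<depto>%s)/(?P<city>%s)/$" % (SEGMENT, SEGMENT))

# Paths as they look after percent-decoding.
for path in ("Antioquia/Medellín/", "Magdalena/Santa Marta/", "a/b?c/"):
    m = pattern.match(path)
    print(path, "->", m.groupdict() if m else None)
```

The first two paths match and the third does not, because `?` is not in the class.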

Here's an excellent tutorial:
http://www.regular-expressions.info/unicode.html

Unfortunately, googling for "unicode regex url" turned up nothing
useful. I think a django-provided character class for "any char other
than URL-specials like ?#&/" would be good.

On that tack, perhaps [^?#&=/] (or similar) is what you want. ;-)
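
A sketch of that negated-class approach in Python 3, again assuming the path is already percent-decoded; `city_url` and `parse` are illustrative names, not Django API:

```python
import re

# Anything except the URL-special characters ?, #, &, = and /.
SEGMENT = r"[^?#&=/]+"
city_url = re.compile(
    r"^(?P<depto>" + SEGMENT + r")/(?P<city>" + SEGMENT + r")/$"
)

def parse(path):
    """Return the captured groups as a dict, or None if the path does not match."""
    m = city_url.match(path)
    return m.groupdict() if m else None

print(parse("a/Bogotá/"))
print(parse("Magdalena/Santa Marta/"))
print(parse("a/b/c/"))  # three segments, so no match
```

The trade-off is that a negated class accepts every other character too, so the view still has to check that the captured names correspond to real records.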
