I'm using the default locale (whatever US ASCII is), but I'd like to match some words that have accented characters using the re module. For instance, I'd like the Americanized "Montreal" to match the French "Montréal". I thought the way to do this changed with 1.5 and the re module, but a search of the Python locator and the re module documentation didn't turn up anything useful. How do I do this?
> I'm using the default locale (whatever US ASCII is), but I'd like to match some words that have accented characters using the re module. For instance, I'd like the Americanized "Montreal" to match the French "Montréal". I thought the way to do this changed with 1.5 and the re module, but a search of the Python locator and the re module documentation didn't turn up anything useful. How do I do this?
If you don't want to change the locale, you'll have to make an explicit translation table (so you'll have to decide exactly which accented characters you want to map to which other characters) and translate the string using string.translate().
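For readers on a present-day interpreter, here is a minimal sketch of the translation-table idea. It assumes Python 3, where the equivalent calls are str.maketrans() and str.translate() rather than the string module functions named above. The names ACCENTED, PLAIN, ACCENT_TABLE and fold_accents are made up for illustration, and the list of characters is only a sample.

```python
import re

# Illustrative pairs only; decide for yourself which accented characters
# should map to which plain ones.
ACCENTED = "áàâãéêíóôõúç"
PLAIN = "aaaaeeiooouc"

# str.maketrans maps the characters of the two strings one-by-one.
ACCENT_TABLE = str.maketrans(ACCENTED + ACCENTED.upper(),
                             PLAIN + PLAIN.upper())


def fold_accents(text):
    """Return a copy of text with the listed accented characters replaced."""
    return text.translate(ACCENT_TABLE)


# Match the unaccented spelling against the folded copy of the input.
match = re.search("Montreal", fold_accents("Bienvenue à Montréal"))
print(match is not None)
```

The pattern stays plain ASCII; only the text being searched is folded, so the table is the single place where the accent decisions live.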
In article <199801270355.WAA12...@eric.CNRI.Reston.Va.US>, Guido van Rossum <gu...@CNRI.Reston.Va.US> wrote:
> > I'm using the default locale (whatever US ASCII is), but I'd like to match some words that have accented characters using the re module. For instance [...]
> If you don't want to change the locale, you'll have to make an explicit translation table (so you'll have to decide exactly which accented characters you want to map to which other characters) and translate the string using string.translate().
What about ignoring case in accented languages? I'd like to match '\351' and '\311', the same accented letter just that one is capitalized.
Is the only way using string.translate()? Isn't it too slow?
If it is so, has anyone already created this translation table?
> > If you don't want to change the locale, you'll have to make an explicit translation table (so you'll have to decide exactly which accented characters you want to map to which other characters) and translate the string using string.translate().
[Paulo]
> What about ignoring case in accented languages? I'd like to match '\351' and '\311', the same accented letter just that one is capitalized.
> Is the only way using string.translate()? Isn't it too slow?
It's implemented in C so it should be FAST!
> If it is so, has anyone already created this translation table?
You can easily build a table yourself using string.maketrans(): e.g. string.maketrans("\351", "\311") returns a translation table that maps \351 to \311 (the arguments are strings that are to be mapped one-by-one).
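A small sketch of that one-by-one mapping, assuming Python 3 and Latin-1 encoded bytes (in Python 3 the function lives on the bytes type as bytes.maketrans(); string.maketrans() is the 1.5-era spelling). The variable names are illustrative.

```python
import re

# In Latin-1, byte \351 (0xE9) is é and byte \311 (0xC9) is É.
table = bytes.maketrans(b"\351", b"\311")

data = b"\351\351\311" * 3

# Every \351 becomes \311; all other bytes pass through unchanged.
upper = data.translate(table)

# After translation a pattern for the capital letter alone covers the string.
match = re.search(b"(\311+)", upper)
print(match.group(1) == upper)
```

Mapping in the other direction (capital to lowercase) works the same way with the two arguments swapped.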
But you're posting from Brazil, I presume -- if you set up your locale correctly, shouldn't you be able to use \w and re.IGNORECASE to get the right effect?
ne...@inf.puc-rio.br wrote:
> What about ignoring case in accented languages? I'd like to match '\351' and '\311', the same accented letter just that one is capitalized.
>Is the only way using string.translate()? Isn't it too slow?
I'm afraid string.translate or string.lower are your only options; at the moment, a fixed table is used to map between upper and lower-case, because making it fully dynamic based upon the re.LOCALE flag was really messy.
akuchl...@acm.org http://starship.skyport.net/crew/amk/ Modern disillusion is unlikely to last forever, and nothing rings so hollow as the angst of yesterday. -- Robertson Davies, "Reading"
Guido van Rossum wrote:
> [Paulo]
> > What about ignoring case in accented languages? I'd like to match '\351' and '\311', the same accented letter just that one is capitalized.
> > Is the only way using string.translate()? Isn't it too slow?
> It's implemented in C so it should be FAST!
Sure. But calling string.translate() and then doing the pattern matching is probably slower than working with a pattern that ignores case.

The other problem is that I'd lose my original text.
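On the point about losing the original text: a one-to-one translation keeps the string length, so match offsets found in the translated copy also index the untouched original. A minimal sketch, assuming Python 3; the sample text and the names original, folded and hits are made up for illustration.

```python
import re

original = "Visite Montréal et MONTRÉAL"

# One-to-one mapping, so the folded copy has the same length as the original.
table = str.maketrans("éÉ", "eE")
folded = original.translate(table)

# Search the folded copy, then slice the original with the same offsets.
hits = [original[m.start():m.end()]
        for m in re.finditer("montreal", folded, re.IGNORECASE)]
print(hits)
```

This only holds while each character maps to exactly one character; a table that deletes or expands characters would shift the offsets.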
> > If it is so, has anyone already created this translation table?
> You can easily build a table yourself using string.maketrans(): e.g. string.maketrans("\351", "\311") returns a translation table that maps \351 to \311 (the arguments are strings that are to be mapped one-by-one).
No problem, I just thought someone else might have done that already.
> But you're posting from Brazil, I presume --
Yes.
> if you set up your locale correctly, shouldn't you be able to use \w and re.IGNORECASE to get the right effect?
I've tried it on Windows and it doesn't work. It is an English version of Windows, but the language is set to Brazilian Portuguese.
I've just tried it on Unix and it didn't work there either. See my test; the characters 'é' and 'É' probably won't display correctly in your email. They are the characters '\351' and '\311', respectively:
> I've just tried it on Unix and it didn't work there either. See my test; the characters 'é' and 'É' probably won't display correctly in your email. They are the characters '\351' and '\311', respectively:
> /home/neves> locale
> LANG=pt_BR
> LC_COLLATE="pt_BR"
> LC_CTYPE="pt_BR"
> LC_MONETARY="pt_BR"
> LC_NUMERIC="pt_BR"
> LC_TIME="pt_BR"
> LC_MESSAGES="pt_BR"
> LC_ALL=
> [nazareth]/home/neves> python
> Python 1.5 (#12, Jan 22 1998, 22:21:32) [C] on aix4
> Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
> >>> a = '\351\351\311' * 3
> >>> print a
> ééÉééÉééÉ
> >>> from re import *
> >>> p = compile(r'(é+)', I)
> >>> m = p.search(a)
> >>> print m.group(1)
> éé
> It should have matched the whole string, right?
You would need to have called setlocale() at the start of your session, and passed I+L (or IGNORECASE+LOCALE) as the flags to compile().
Unfortunately I just heard from Andrew Kuchling that the re module doesn't do this right yet. Nevertheless, if it *did* do it right, you would still have to do what I said here (i.e. the locale is not used automatically; you must call setlocale() *and* pass the L flag to compile).
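For what it's worth, the two steps (call setlocale(), then pass the L flag) can be sketched as follows on a current interpreter. This assumes Python 3, where re.LOCALE is accepted only with bytes patterns. The helper name search_in_locale and the locale name "pt_BR.ISO8859-1" are assumptions for illustration; that locale may not be installed on a given machine, in which case the helper returns None.

```python
import locale
import re


def search_in_locale(pattern, data, locale_name):
    """Search bytes data with IGNORECASE+LOCALE under the named locale.

    Returns the match object, or None if the locale is unavailable
    or nothing matches.
    """
    saved = locale.setlocale(locale.LC_CTYPE)  # query the current setting
    try:
        locale.setlocale(locale.LC_CTYPE, locale_name)
    except locale.Error:
        return None  # locale not installed on this system
    try:
        # The locale is not used automatically: the L flag is required too.
        return re.search(pattern, data, re.IGNORECASE | re.LOCALE)
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # restore the old setting


# Hypothetical locale name; adjust to whatever your system provides.
m = search_in_locale(b"(\351+)", b"\351\351\311" * 3, "pt_BR.ISO8859-1")
print(m.group(1) if m else "locale unavailable or no match")
```

Whether the accented capital is actually folded depends on the platform's locale data, so the result is worth checking on the target system rather than assuming.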
Neves[SMTP:ne...@inf.puc-rio.br] wrote:
> I've tried it on Windows and it doesn't work. It is an English version of Windows, but the language is set to Brazilian Portuguese.
> I've just tried it on Unix and it didn't work there either. See my test; the characters 'é' and 'É' probably won't display correctly in your email. They are the characters '\351' and '\311', respectively:
> /home/neves> locale
> LANG=pt_BR
> LC_COLLATE="pt_BR"
> LC_CTYPE="pt_BR"
> LC_MONETARY="pt_BR"
> LC_NUMERIC="pt_BR"
> LC_TIME="pt_BR"
> LC_MESSAGES="pt_BR"
> LC_ALL=
> [nazareth]/home/neves> python
> Python 1.5 (#12, Jan 22 1998, 22:21:32) [C] on aix4
> Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
> >>> a = '\351\351\311' * 3
> >>> print a
> ééÉééÉééÉ
> >>> from re import *
> >>> p = compile(r'(é+)', I)
> >>> m = p.search(a)
> >>> print m.group(1)
> éé
> It should have matched the whole string, right?
> --
> Paulo Eduardo Neves
> mailto:ne...@inf.puc-rio.br Rio de Janeiro - Brasil
> Pager-> Central:(021)532-4499 Cod.:213 99 64
I have an application in Visual C++ (Win95/NT) where I call 'setlocale(LC_TIME, "portuguese")' to make sure that the result of strftime is always a Portuguese string, regardless of the Windows version it runs on. It works with both the US version of Win95 and the Portuguese one. Perhaps it will work if you set the locale explicitly.