I'm using the default locale (whatever US ASCII is), but I'd like to match
some words that have accented characters using the re module. For instance,
I'd like the Americanized "Montreal" to match the French "Montréal". I
thought the way to do this changed with 1.5 and the re module, but a search
of the Python locator and the re module documentation didn't turn up
anything useful. How do I do this?
Thx,
Skip Montanaro | Musi-Cal: http://concerts.calendar.com/
sk...@calendar.com | Python Support: http://www.pythonpros.com/
(518)372-5583 | XEmacs: http://www.automatrix.com/~skip/xemacs/tip.html
If you don't want to change the locale, you'll have to make an
explicit translation table (so you'll have to decide exactly which
accented characters you want to map to which other characters) and
translate the string using string.translate().
--Guido van Rossum (home page: http://www.python.org/~guido/)
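A minimal sketch of the translation-table approach suggested above, written for modern Python 3, where `str.maketrans()` and `str.translate()` stand in for the `string` module functions. The table contents and the helper name `strip_accents` are illustrative assumptions; extend the table with whichever characters you decide to map.

```python
import re

# Hypothetical table: a handful of accented Latin-1 letters and the
# unaccented letters they should be treated as. Both strings must have
# the same length, since characters are mapped position by position.
LOWER_ACCENTED = "àáâãäçèéêëìíîïñòóôõöùúûü"
LOWER_PLAIN = "aaaaaceeeeiiiinooooouuuu"

ACCENT_TABLE = str.maketrans(
    LOWER_ACCENTED + LOWER_ACCENTED.upper(),
    LOWER_PLAIN + LOWER_PLAIN.upper(),
)


def strip_accents(text):
    """Return a copy of text with the mapped accented letters replaced."""
    return text.translate(ACCENT_TABLE)


# The Americanized pattern now matches the French spelling.
pattern = re.compile(r"\bMontreal\b")
found = pattern.search(strip_accents("Concerts in Montréal this week"))
```

Here `found` is a match object because the text is translated before the search; the pattern itself stays plain ASCII.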
> If you don't want to change the locale, you'll have to make an
> explicit translation table (so you'll have to decide exactly which
> accented characters you want to map to which other characters) and
> translate the string using string.translate().
What about ignoring case in accented languages?
I'd like to match '\351' and '\311', the same accented letter except that one is
capitalized.
Is the only way using string.translate()? Isn't it too slow?
If it is so, has anyone already created this translation table?
regards,
Paulo
[Paulo]
> What about ignoring case in accented languages?
> I'd like to match '\351' and '\311', the same accented letter except
> that one is capitalized.
>
> Is the only way using string.translate()? Isn't it too slow?
It's implemented in C so it should be FAST!
> If it is so, has anyone already created this translation table?
You can easily build a table yourself using string.maketrans():
e.g. string.maketrans("\351", "\311") returns a translation table that
maps \351 to \311 (the arguments are strings that are to be mapped
one-by-one).
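As a rough illustration of building such a table, here is a sketch in modern Python 3, using `bytes.maketrans()` as the counterpart of `string.maketrans()`. It assumes the data is ISO-8859-1 (Latin-1), where each accented capital in the range 0xC0 to 0xDE (except 0xD7, the multiplication sign) has its lower-case form 0x20 higher. The names `FOLD_TABLE` and `fold_case` are made up for this example.

```python
import re

# Latin-1 accented capitals: 0xC0..0xDE, skipping 0xD7 (multiplication sign).
UPPER = bytes(c for c in range(0xC0, 0xDF) if c != 0xD7)
# Their lower-case forms sit exactly 0x20 higher in Latin-1.
LOWER = bytes(c + 0x20 for c in UPPER)

# 256-entry table mapping each accented capital to its lower-case form.
FOLD_TABLE = bytes.maketrans(UPPER, LOWER)


def fold_case(data):
    """Lower-case ASCII letters with lower(), accented ones with the table."""
    return data.lower().translate(FOLD_TABLE)


# '\351' is 0xE9 (e-acute) and '\311' is 0xC9 (capital E-acute).
text = b"\xe9\xe9\xc9" * 3
match = re.search(rb"(\xe9+)", fold_case(text))
```

After folding, the pattern for 0xE9 alone covers both the lower-case and the capital letter.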
But you're posting from Brazil, I presume -- if you set up your locale
correctly, shouldn't you be able to use \w and re.IGNORECASE to get
the right effect?
I'm afraid string.translate or string.lower are your only
options; at the moment, a fixed table is used to map between upper and
lower-case, because making it fully dynamic based upon the re.LOCALE
flag was really messy.
akuc...@acm.org http://starship.skyport.net/crew/amk/
Modern disillusion is unlikely to last forever, and nothing rings so
hollow as the angst of yesterday.
-- Robertson Davies, "Reading"
Sure. But it is probably slower to do string.translate() followed by a
pattern match than to use a pattern match that ignores case.
The other problem is that I'd lose my original text.
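On the concern about losing the original text: a character-for-character translation keeps the string length unchanged, so the offsets of a match found in the translated copy can be used to slice the untouched original. The sketch below is modern Python 3; the two-character table and the helper name `find_in_original` are assumptions made for illustration.

```python
import re

# Hypothetical one-to-one table; translation never changes the length.
TABLE = str.maketrans("éÉ", "eE")


def find_in_original(pattern, original):
    """Search a translated copy, but return the slice of the original."""
    folded = original.translate(TABLE)
    m = re.search(pattern, folded)
    if m is None:
        return None
    # Offsets are valid in both strings because lengths are identical.
    return original[m.start():m.end()]


hit = find_in_original("Montreal", "Live in Montréal tonight")
miss = find_in_original("Toronto", "Live in Montréal tonight")
```

The original string is never modified; only a temporary copy is translated for the search.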
>
> > If it is so, has anyone already created this translation table?
>
> You can easily build a table yourself using string.maketrans():
> e.g. string.maketrans("\351", "\311") returns a translation table that
> maps \351 to \311 (the arguments are strings that are to be mapped
> one-by-one).
No problem, I just thought someone else would have already done that.
>
> But you're posting from Brazil, I presume --
Yes.
> if you set up your locale
> correctly, shouldn't you be able to use \w and re.IGNORECASE to get
> the right effect?
I've tried it on Windows and it doesn't work. It is an English version of
Windows, but the language is set to Brazilian Portuguese.
I've just tried it on Unix and it didn't work there either.
See my test; the chars 'é' and 'É' probably won't look good in your
email. They are the chars '\351' and '\311', respectively:
/home/neves> locale
LANG=pt_BR
LC_COLLATE="pt_BR"
LC_CTYPE="pt_BR"
LC_MONETARY="pt_BR"
LC_NUMERIC="pt_BR"
LC_TIME="pt_BR"
LC_MESSAGES="pt_BR"
LC_ALL=
[nazareth]/home/neves> python
Python 1.5 (#12, Jan 22 1998, 22:21:32) [C] on aix4
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> a = '\351\351\311' * 3
>>> print a
ééÉééÉééÉ
>>> from re import *
>>> p = compile(r'(é+)', I)
>>> m = p.search(a)
>>> print m.group(1)
éé
It should have matched the whole string, right?
--
Paulo Eduardo Neves
mailto:ne...@inf.puc-rio.br Rio de Janeiro - Brasil
Pager-> Central:(021)532-4499 Cod.:213 99 64
No, you should have added
>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")
at the start of your session, and passed I+L (or IGNORECASE+LOCALE) as
the flags to compile().
Unfortunately I just heard from Andrew Kuchling that the re module
doesn't do this right yet. Nevertheless, if it *did* do it right, you
would still have to do what I said here (i.e. the locale is not used
automatically; you must call setlocale() *and* pass the L flag to
compile).
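The two steps described above (call setlocale(), then pass both flags) might look like the following in modern Python 3. This is only a sketch under assumptions: in Python 3 the LOCALE flag is accepted for bytes patterns only, and whether 0xC9 and 0xE9 are treated as a case pair depends entirely on the locale the platform provides (for instance a single-byte ISO-8859-1 locale). The helper name `search_with_locale` is made up for this example.

```python
import locale
import re


def search_with_locale(pattern, data):
    """Adopt the user's locale, then search with IGNORECASE + LOCALE."""
    try:
        # "" means: take the locale from the environment (LANG, LC_ALL, ...).
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The environment names a locale this system lacks; keep the default.
        pass
    compiled = re.compile(pattern, re.IGNORECASE | re.LOCALE)
    return compiled.search(data)


# '\351' is 0xE9 and '\311' is 0xC9, as in the session above.
m = search_with_locale(rb"(\xe9+)", b"\xe9\xe9\xc9" * 3)
```

How much of the data `m.group(1)` covers will vary with the active locale, so no particular result is shown here.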
I have an application in Visual C++ (Win95/NT) where I use
'setlocale(LC_TIME, "portuguese")' to make sure that the result of
strftime is always a Portuguese string regardless of the Windows version
it runs on. It works with both the US version of Win95 and the Portuguese
one. Perhaps if you explicitly set the locale it will work.
Best Regards,
Paulo Soares
pso...@consiste.pt