Making Montreal match Montréal using re

Skip Montanaro

Jan 27, 1998, 3:00:00 AM

I'm using the default locale (whatever US ASCII is), but I'd like to match
some words that have accented characters using the re module. For instance,
I'd like the Americanized "Montreal" to match the French "Montréal". I
thought the way to do this changed with 1.5 and the re module, but a search
of the Python locator and the re module documentation didn't turn up
anything useful. How do I do this?

Thx,

Skip Montanaro | Musi-Cal: http://concerts.calendar.com/
sk...@calendar.com | Python Support: http://www.pythonpros.com/
(518)372-5583 | XEmacs: http://www.automatrix.com/~skip/xemacs/tip.html

Guido van Rossum

Jan 27, 1998, 3:00:00 AM

> I'm using the default locale (whatever US ASCII is), but I'd like to match
> some words that have accented characters using the re module. For instance,
> I'd like the Americanized "Montreal" to match the French "Montréal". I
> thought the way to do this changed with 1.5 and the re module, but a search
> of the Python locator and the re module documentation didn't turn up
> anything useful. How do I do this?

If you don't want to change the locale, you'll have to make an
explicit translation table (so you'll have to decide exactly which
accented characters you want to map to which other characters) and
translate the string using string.translate().
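
A minimal sketch of that approach, written for current Python 3, where
str.maketrans() and str.translate() play the role of string.maketrans()
and string.translate(). The character lists and the fold_accents name are
illustrative assumptions; which characters to fold is up to you.

```python
import re

# Illustrative table only: extend it with whatever characters your data needs.
ACCENTED_LOWER = "àáâãçéêíóôõúü"
PLAIN_LOWER = "aaaaceeiooouu"

# Both arguments must have the same length; characters map one-by-one.
FOLD = str.maketrans(
    ACCENTED_LOWER + ACCENTED_LOWER.upper(),
    PLAIN_LOWER + PLAIN_LOWER.upper(),
)


def fold_accents(text):
    """Return text with the listed accented characters replaced by plain ones."""
    return text.translate(FOLD)


pattern = re.compile(r"Montreal")
match = pattern.search(fold_accents("Bienvenue à Montréal"))
print(match.group(0))  # prints "Montreal"
```

The pattern itself stays plain ASCII; only the text being searched is
folded before matching.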

--Guido van Rossum (home page: http://www.python.org/~guido/)

ne...@inf.puc-rio.br

Jan 28, 1998, 3:00:00 AM

In article <1998012703...@eric.CNRI.Reston.Va.US>,
Guido van Rossum <gu...@CNRI.Reston.Va.US> wrote:
>
> > I'm using the default locale (whatever US ASCII is), but I'd like to match
> > some words that have accented characters using the re module. For
> > instance
>
> If you don't want to change the locale, you'll have to make an
> explicit translation table (so you'll have to decide exactly which
> accented characters you want to map to which other characters) and
> translate the string using string.translate().

What about ignoring case in accented languages?
I'd like to match '\351' and '\311', the same accented letter just that one is
capitalized.

Is the only way using string.translate()? Isn't it too slow?

If it is so, has anyone already created this translation table?

regards,
Paulo

Guido van Rossum

Jan 28, 1998, 3:00:00 AM

[me]

> > If you don't want to change the locale, you'll have to make an
> > explicit translation table (so you'll have to decide exactly which
> > accented characters you want to map to which other characters) and
> > translate the string using string.translate().

[Paulo]

> What about ignoring case in accented languages?
> I'd like to match '\351' and '\311', the same accented letter just
> that one is capitalized.
>
> Is the only way using string.translate()? Isn't it too slow?

It's implemented in C so it should be FAST!

> If it is so, has anyone already created this translation table?

You can easily build a table yourself using string.maketrans():
e.g. string.maketrans("\351", "\311") returns a translation table that
maps \351 to \311 (the arguments are strings that are to be mapped
one-by-one).
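
As a rough sketch in current Python 3 (str.maketrans() in place of
string.maketrans()), a table that maps upper-case accented letters to
their lower-case forms lets a plain pattern match both. The letter lists
here are an illustrative assumption, not a complete table.

```python
import re

# Illustrative lists; the two strings must line up character by character.
UPPER = "ÀÁÂÃÇÉÊÍÓÔÕÚÜ"
LOWER = "àáâãçéêíóôõúü"
TO_LOWER = str.maketrans(UPPER, LOWER)

text = "\351\351\311" * 3            # é é É, repeated three times
lowered = text.translate(TO_LOWER)   # every É becomes é

match = re.search("(\351+)", lowered)
print(len(match.group(1)))  # prints 9
```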

But you're posting from Brazil, I presume -- if you set up your locale
correctly, shouldn't you be able to use \w and re.IGNORECASE to get
the right effect?

Andrew Kuchling

Jan 28, 1998, 3:00:00 AM

ne...@inf.puc-rio.br wrote:
>What about ignoring case in accented languages?
>I'd like to match '\351' and '\311', the same accented letter just that one is
>capitalized.
>
>Is the only way using string.translate()? Isn't it too slow?

I'm afraid string.translate or string.lower are your only
options; at the moment, a fixed table is used to map between upper and
lower-case, because making it fully dynamic based upon the re.LOCALE
flag was really messy.

akuc...@acm.org http://starship.skyport.net/crew/amk/
Modern disillusion is unlikely to last forever, and nothing rings so
hollow as the angst of yesterday.
-- Robertson Davies, "Reading"

Paulo Eduardo Neves

Jan 28, 1998, 3:00:00 AM

Guido van Rossum wrote:

> [Paulo]
> > What about ignoring case in accented languages?
> > I'd like to match '\351' and '\311', the same accented letter just
> > that one is capitalized.
> >
> > Is the only way using string.translate()? Isn't it too slow?
>
> It's implemented in C so it should be FAST!

Sure. But doing string.translate() and then a pattern match is probably
slower than a single pattern match that ignores case.

Another problem is that I'd lose my original text.
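
One way around losing the original text, sketched for current Python 3:
because a translation table maps characters one-by-one, the translated
copy has the same length as the original, so match offsets found in the
copy can be used to slice the original. The table and sample string below
are illustrative assumptions.

```python
import re

# Tiny illustrative table: one character in, one character out.
FOLD = str.maketrans("éÉ", "eE")

original = "Concerts in Montréal this week"
folded = original.translate(FOLD)      # same length as original

# Search the folded copy, then slice the untouched original.
match = re.search("Montreal", folded)
found = original[match.start():match.end()]
print(found)  # prints "Montréal"
```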

>
> > If it is so, has anyone already created this translation table?
>
> You can easily build a table yourself using string.maketrans():
> e.g. string.maketrans("\351", "\311") returns a translation table that
> maps \351 to \311 (the arguments are strings that are to be mapped
> one-by-one).

No problem, I just thought someone else would have already done that.

>
> But you're posting from Brazil, I presume --

Yes.

>if you set up your locale
> correctly, shouldn't you be able to use \w and re.IGNORECASE to get
> the right effect?

I've tried it on Windows and it doesn't work. It is an English version of
Windows, but the language is set to Brazilian Portuguese.

I've just tried it in unix and it also didn't work.
See my test, probably the chars 'é' and 'É' won't look good in your
email, they are the chars '\351' and '\311', respectively:

/home/neves> locale
LANG=pt_BR
LC_COLLATE="pt_BR"
LC_CTYPE="pt_BR"
LC_MONETARY="pt_BR"
LC_NUMERIC="pt_BR"
LC_TIME="pt_BR"
LC_MESSAGES="pt_BR"
LC_ALL=
[nazareth]/home/neves> python
Python 1.5 (#12, Jan 22 1998, 22:21:32) [C] on aix4
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> a = '\351\351\311' * 3
>>> print a
ééÉééÉééÉ
>>> from re import *
>>> p = compile(r'(é+)', I)
>>> m = p.search(a)
>>> print m.group(1)
éé

It should have matched the whole string, right?

--
Paulo Eduardo Neves
mailto:ne...@inf.puc-rio.br Rio de Janeiro - Brasil
Pager-> Central:(021)532-4499 Cod.:213 99 64

Guido van Rossum

Jan 28, 1998, 3:00:00 AM

> I've just tried it in unix and it also didn't work.
> See my test, probably the chars 'é' and 'É' won't look good in your
> email, they are the chars '\351' and '\311', respectively:
>
> /home/neves> locale
> LANG=pt_BR
> LC_COLLATE="pt_BR"
> LC_CTYPE="pt_BR"
> LC_MONETARY="pt_BR"
> LC_NUMERIC="pt_BR"
> LC_TIME="pt_BR"
> LC_MESSAGES="pt_BR"
> LC_ALL=
> [nazareth]/home/neves> python
> Python 1.5 (#12, Jan 22 1998, 22:21:32) [C] on aix4
> Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
> >>> a = '\351\351\311' * 3
> >>> print a
> ééÉééÉééÉ
> >>> from re import *
> >>> p = compile(r'(é+)', I)
> >>> m = p.search(a)
> >>> print m.group(1)
> éé
> It should have matched the whole string, right?

No, you should have added

>>> import locale
>>> locale.setlocale(locale.LC_ALL, "")

at the start of your session, and passed I+L (or IGNORECASE+LOCALE) as
the flags to compile().

Unfortunately I just heard from Andrew Kuchling that the re module
doesn't do this right yet. Nevertheless, if it *did* do it right, you
would still have to do what I said here (i.e. the locale is not used
automatically; you must call setlocale() *and* pass the L flag to
compile).
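
For readers on current Python 3, a hedged aside: when the pattern and the
text are both str objects, re matches with Unicode rules by default, so
IGNORECASE alone treats '\351' and '\311' as the same letter, and neither
setlocale() nor the L flag is involved. Paulo's test then looks roughly
like this:

```python
import re

a = "\351\351\311" * 3                    # é é É, repeated three times
p = re.compile("(\351+)", re.IGNORECASE)  # str pattern, Unicode matching
m = p.search(a)
print(m.group(1) == a)  # prints True
```

This does not apply to the 1.5 session quoted above, where strings are
byte strings and the locale advice still holds.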

Paulo Soares

Jan 28, 1998, 3:00:00 AM

On Wednesday, January 28, 1998 20:00, Paulo Eduardo
Neves[SMTP:ne...@inf.puc-rio.br] wrote:
> Guido van Rossum wrote:
>
> > [Paulo]
> > > What about ignoring case in accented languages?
> > > I'd like to match '\351' and '\311', the same accented letter just
> > > that one is capitalized.
> > >
> > > Is the only way using string.translate()? Isn't it too slow?
> >
> > It's implemented in C so it should be FAST!
>
> Sure. But doing string.translate() and then a pattern match is probably
> slower than a single pattern match that ignores case.
>
> Another problem is that I'd lose my original text.
>
> >
> > > If it is so, has anyone already created this translation table?
> >
> > You can easily build a table yourself using string.maketrans():
> > e.g. string.maketrans("\351", "\311") returns a translation table that
> > maps \351 to \311 (the arguments are strings that are to be mapped
> > one-by-one).
>
> No problem, I just thought someone else would have already done that.
>
> >
> > But you're posting from Brazil, I presume --
>
> Yes.
>
> >if you set up your locale
> > correctly, shouldn't you be able to use \w and re.IGNORECASE to get
> > the right effect?
>
> I've tried it on Windows and it doesn't work. It is an English version of
> Windows, but the language is set to Brazilian Portuguese.
>
> I've just tried it in unix and it also didn't work.
> See my test, probably the chars 'é' and 'É' won't look good in your
> email, they are the chars '\351' and '\311', respectively:
>
> /home/neves> locale
> LANG=pt_BR
> LC_COLLATE="pt_BR"
> LC_CTYPE="pt_BR"
> LC_MONETARY="pt_BR"
> LC_NUMERIC="pt_BR"
> LC_TIME="pt_BR"
> LC_MESSAGES="pt_BR"
> LC_ALL=
> [nazareth]/home/neves> python
> Python 1.5 (#12, Jan 22 1998, 22:21:32) [C] on aix4
> Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
> >>> a = '\351\351\311' * 3
> >>> print a
> ééÉééÉééÉ
> >>> from re import *
> >>> p = compile(r'(é+)', I)
> >>> m = p.search(a)
> >>> print m.group(1)
> éé
>
> It should have matched the whole string, right?
>
> --
> Paulo Eduardo Neves
> mailto:ne...@inf.puc-rio.br Rio de Janeiro - Brasil
> Pager-> Central:(021)532-4499 Cod.:213 99 64

I have a Visual C++ application (Win95/NT) where I call
'setlocale(LC_TIME, "portuguese")' to make sure that the result of
strftime is always a Portuguese string, regardless of the Windows version
it runs on. It works with both the US version of Win95 and the Portuguese
one. Perhaps it will work if you explicitly set the locale.

Best Regards,
Paulo Soares
pso...@consiste.pt