
regexps with unicode-aware characterclasses?


Stefan Rank

Aug 30, 2005, 9:54:26 AM
to pytho...@python.org
Hi all,

In a Python re pattern, how do I match all Unicode uppercase characters
(in a unicode string or in a UTF-8 encoded byte string)?

I know that there are string.uppercase/.lowercase, which are
'locale-aware', but I don't think there is an "all locales" locale.

I know that there is a re.U switch that makes \w match all unicode word
characters, but there are no subclasses of that ([[:upper:]] or
preferably \u).
Or is there a module/extension to get that?

There is the module unicodedata, but it has no unicodedata.uppercase
that would correspond to string.uppercase.

<wishful thinking>

    re.compile('|'.join([x.encode('utf8') for x in unicode.uppercase]))

or::

    re.compile('(?u)[[:upper:]]')

or::

    re.compile('(?u)\u')

for the latter two, to work on utf-8 strings, would I have to set the
defaultencoding to utf-8?

</wishful thinking>

Fredrik Lundh

Aug 30, 2005, 10:33:07 AM
to pytho...@python.org
Stefan Rank wrote:

> I know that there is a re.U switch that makes \w match all unicode word
> characters, but there are no subclasses of that ([[:upper:]] or preferably \u).

unicode character classes are not supported by the current RE engine.

it's usually possible to work around this by matching all characters ("\w") in Unicode
mode ("(?u)"), and postprocessing the result to get rid of invalid matches.
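
A minimal sketch of that workaround, assuming Python 3 (where str patterns are Unicode-aware by default, so "(?u)" is implied). The helper name find_uppercase is made up for illustration: it over-matches with \w and then filters the matches with str.isupper().

```python
import re

# Python 3 assumption: \w on a str pattern already matches Unicode word characters.
WORD_CHAR = re.compile(r"\w")

def find_uppercase(text):
    """Return the uppercase characters in `text` (hypothetical helper)."""
    # Step 1: over-match every word character.
    # Step 2: postprocess, keeping only matches that str.isupper() accepts.
    return [m.group() for m in WORD_CHAR.finditer(text) if m.group().isupper()]

# "\u00c4" is A-umlaut, "\u00d6" is O-umlaut, "\u0391\u0392\u0393" are Greek capitals.
sample = "\u00c4pfel und \u00d6l, \u0391\u0392\u0393 abc_1"
print(find_uppercase(sample))
```

Digits and the underscore also match \w, which is why the filtering step is needed.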

</F>

"Martin v. Löwis"

Sep 14, 2005, 2:09:03 AM
Stefan Rank wrote:
> <wishful thinking>
>
> re.compile('|'.join([x.encode('utf8') for x in unicode.uppercase]))

This would (almost) work, but it would be terribly inefficient (matching
time is linear in the number of alternatives). You can realistically do

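    import re
    import sys

    # Collect every uppercase code point into one character class.
    uppers = [u'[']
    for i in range(sys.maxunicode + 1):  # + 1 so the last code point is included
        c = unichr(i)
        if c.isupper():
            uppers.append(c)
    uppers.append(u']')
    uppers = u"".join(uppers)
    uppers_re = re.compile(uppers)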

Compiling this expression is quite expensive; matching it is fairly
efficient (time independent of the number of characters in the class).
To save startup cost, consider pickling the compiled expression.

(syntax note: this only works because none of the characters special
to a RE class (]-^\) is an uppercase letter; otherwise, escaping might
be needed)
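
For readers on Python 3, where unichr() no longer exists and every str is Unicode, the same idea might be sketched as follows. The names build_upper_class and UPPERS_RE are made up for illustration, and the re.escape() call is a defensive addition covering the escaping caveat above, not part of the original recipe.

```python
import re
import sys

def build_upper_class():
    """Build a regex character class holding every uppercase code point (sketch)."""
    chars = [chr(i) for i in range(sys.maxunicode + 1) if chr(i).isupper()]
    # re.escape() protects against class-special characters such as ] - ^ \
    # in case one of them ever ends up in the list.
    return "[" + "".join(re.escape(c) for c in chars) + "]"

# Building and compiling the class is the expensive part; do it once at import time.
UPPERS_RE = re.compile(build_upper_class())

print(UPPERS_RE.findall("a\u00c4b\u0391c"))
```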

> for the latter two, to work on utf-8 strings, would I have to set the
> defaultencoding to utf-8?

For Unicode things, you should avoid using byte strings - especially
when it comes to regular expressions. Use Unicode strings instead.
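
A small illustration of why, assuming Python 3 and UTF-8 input (the variable names are made up for the example): a bytes pattern only treats ASCII letters, digits and the underscore as word characters, so the encoded form of a non-ASCII letter is skipped, while decoding first lets the engine see whole characters.

```python
import re

# "\u00c4pfel" starts with A-umlaut; encode it as it might arrive from a file or socket.
raw = "\u00c4pfel".encode("utf-8")

# Bytes pattern: \w is ASCII-only, so the two UTF-8 bytes of the umlaut do not match.
byte_matches = re.findall(rb"\w+", raw)

# Decode first, then match on the Unicode string.
text = raw.decode("utf-8")
text_matches = re.findall(r"\w+", text)

print(byte_matches)
print(text_matches)
```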

Regards,
Martin
