In a Python re pattern, how do I match all Unicode uppercase characters
(in a unicode string or in a UTF-8 string)?
I know that there is string.uppercase/.lowercase which are
'locale-aware', but I don't think there is an "all locales" locale.
I know that there is a re.U switch that makes \w match all unicode word
characters, but there are no subclasses of that ([[:upper:]] or
preferably \u).
Or is there a module/extension to get that?
There is the module unicodedata, but it has no unicodedata.uppercase
that would correspond to string.uppercase.
<wishful thinking>
re.compile('|'.join([x.encode('utf8') for x in unicode.uppercase]))
or::
re.compile('(?u)[[:upper:]]')
or::
re.compile('(?u)\u')
for the latter two, to work on utf-8 strings, would I have to set the
defaultencoding to utf-8?
</wishful thinking>
> I know that there is a re.U switch that makes \w match all unicode word
> characters, but there are no subclasses of that ([[:upper:]] or preferably \u).
unicode character classes are not supported by the current RE engine.
it's usually possible to work around this by matching all characters ("\w") in Unicode
mode ("(?u)"), and postprocessing the result to get rid of invalid matches.
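
A minimal sketch of that workaround, assuming Python 3 (where str patterns are Unicode-aware by default, so "(?u)" is implied) and assuming str.isupper() is an acceptable postprocessing filter. The helper name find_upper is made up for illustration:

```python
import re

def find_upper(text):
    """Return the uppercase characters in text, in order of appearance."""
    # Step 1: over-match. \w matches every Unicode word character.
    candidates = re.findall(r"\w", text)
    # Step 2: postprocess. Drop every candidate that is not uppercase.
    return [c for c in candidates if c.isupper()]

print(find_upper("Ärger mit Ω und x1"))
```

The filtering runs in Python rather than inside the regex engine, so this fits best when the matches are few or the text is short.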
</F>
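The alternation approach (joining every uppercase character with '|') would
(almost) work, but it would be terribly inefficient (matching time linear in
the number of alternatives). A more realistic approach is to build a single
character class: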
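import re
import sys

# Collect every uppercase code point into one character class.
uppers = [u'[']
for i in range(sys.maxunicode + 1):  # + 1 so the last code point is included
    c = unichr(i)
    if c.isupper():
        uppers.append(c)
uppers.append(u']')
uppers = u"".join(uppers)
uppers_re = re.compile(uppers)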
Compiling this expression is quite expensive; matching it is fairly
efficient (time independent of the number of characters in the class).
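To save startup cost, consider caching the generated pattern string
(pickling the compiled expression stores only its pattern and flags, so
it is recompiled when loaded).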
(syntax note: this only works because none of the characters special
to a RE class (]-^\) is an uppercase letter; otherwise, escaping might
be needed)
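
For readers on Python 3, the same idea might look roughly like the sketch below. It assumes chr() in place of unichr() and plain str literals, and it applies re.escape() to each character so the syntax note above stops being a precondition. The function name build_upper_class is hypothetical:

```python
import re
import sys

def build_upper_class():
    """Build a character-class pattern covering every uppercase code point."""
    parts = []
    # sys.maxunicode is itself a valid code point, hence the + 1.
    for i in range(sys.maxunicode + 1):
        c = chr(i)
        if c.isupper():
            # re.escape guards against class-special characters (] - ^ \).
            parts.append(re.escape(c))
    return "[" + "".join(parts) + "]"

uppers_re = re.compile(build_upper_class())

print(uppers_re.findall("Hello Wörld ÉCOLE ω Ω"))
```

The scan over all code points runs once, at import time. The string returned by build_upper_class() could be written to a file and reused if that scan turns out to be noticeable.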
> for the latter two, to work on utf-8 strings, would I have to set the
> defaultencoding to utf-8?
For Unicode things, you should avoid using byte strings - especially
when it comes to regular expressions. Use Unicode strings instead.
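
A small illustration of why this matters, assuming Python 3 and UTF-8 input (the sample text is made up): a byte pattern sees the individual bytes of each encoded character, while decoding once at the boundary lets the pattern see whole characters.

```python
import re

# Bytes as they might arrive from a file or a socket.
raw = "Ärger Über Öl".encode("utf-8")

# Byte pattern: \w only knows ASCII here, so each two-byte UTF-8
# sequence for Ä, Ü, Ö is treated as a non-word gap.
byte_words = re.findall(rb"\w+", raw)

# Decode once, then match on text: \w now covers Unicode letters.
text = raw.decode("utf-8")
text_words = re.findall(r"\w+", text)

print(byte_words)
print(text_words)
```

No change to the default encoding is involved: the bytes are decoded explicitly and the regular expression only ever sees text.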
Regards,
Martin