regexps with unicode-aware characterclasses?

Stefan Rank

Aug 30, 2005, 09:54:26

to pytho...@python.org

Hi all,

in a python re pattern, how do I match all unicode uppercase characters
(in a unicode string/in a utf-8 string)?

I know that there is string.uppercase/.lowercase which are
'locale-aware', but I don't think there is an "all locales" locale.

I know that there is a re.U switch that makes \w match all unicode word
characters, but there are no subclasses of that ([[:upper:]] or
preferably \u).
Or is there a module/extension to get that?

There is the module unicodedata, but it has no unicodedata.uppercase
that would correspond to string.uppercase.

<wishful thinking>

    re.compile('|'.join([x.encode('utf8') for x in unicode.uppercase]))

or::

    re.compile('(?u)[[:upper:]]')

or::

    re.compile('(?u)\u')

for the latter two, to work on utf-8 strings, would I have to set the
defaultencoding to utf-8?

</wishful thinking>

Fredrik Lundh

Aug 30, 2005, 10:33:07

to pytho...@python.org

Stefan Rank wrote:

> I know that there is a re.U switch that makes \w match all unicode word
> characters, but there are no subclasses of that ([[:upper:]] or preferably \u).

unicode character classes are not supported by the current RE engine.

it's usually possible to work around this by matching all characters ("\w") in Unicode
mode ("(?u)"), and postprocessing the result to get rid of invalid matches.
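
The workaround described above might look roughly like the following. This is only a sketch, and it assumes a modern Python 3, where str patterns are Unicode-aware by default and the "(?u)" flag is implied; the function name find_uppercase is made up for illustration.

```python
import re

# In Python 3, \w in a str pattern already matches word characters from
# every script, so no explicit Unicode flag is needed.
WORD_CHAR = re.compile(r"\w")

def find_uppercase(text):
    """Return the uppercase characters in text, in order of appearance."""
    # Step 1: over-match with \w.
    # Step 2: post-process, keeping only the matches that are uppercase.
    return [m.group() for m in WORD_CHAR.finditer(text) if m.group().isupper()]

print(find_uppercase("Привет, Мир and Ωmega_1"))
```

The filtering step is ordinary Python, so any other predicate (for example a check on unicodedata.category) could be substituted for isupper().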

</F>

"Martin v. Löwis"

Sep 14, 2005, 02:09:03

Stefan Rank wrote:
> <wishful thinking>
>
> re.compile('|'.join([x.encode('utf8') for x in unicode.uppercase]))

This would (almost) work, but it would be terribly inefficient (time
linear in the number of alternatives). You can realistically do

    import re, sys

    # Python 2: collect every uppercase code point into one character class.
    uppers = [u'[']
    for i in range(sys.maxunicode + 1):
        c = unichr(i)
        if c.isupper():
            uppers.append(c)
    uppers.append(u']')
    uppers = u"".join(uppers)
    uppers_re = re.compile(uppers)

Compiling this expression is quite expensive; matching it is fairly
efficient (time independent of the number of characters in the class).
To save startup cost, consider pickling the compiled expression.

(syntax note: this only works because none of the characters special
to a RE class (]-^\) is an uppercase letter; otherwise, escaping might
be needed)
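
For readers on Python 3, the same idea can be sketched as below. This is not the code from the post: it assumes Python 3 (chr instead of unichr, str is already Unicode), and it adds re.escape as a precaution for the escaping concern in the syntax note. The names build_upper_class and UPPERS_RE are made up for illustration.

```python
import re
import sys

def build_upper_class():
    """Build a regex character class of every character whose isupper() is true."""
    chars = []
    for codepoint in range(sys.maxunicode + 1):
        c = chr(codepoint)
        if c.isupper():
            # re.escape leaves letters unchanged; it only guards against the
            # case from the syntax note, should a class-special character
            # ever qualify as uppercase.
            chars.append(re.escape(c))
    return "[" + "".join(chars) + "]"

UPPERS_RE = re.compile(build_upper_class())

print(UPPERS_RE.findall("Привет, Мир and Ωmega"))
```

One caveat on the pickling suggestion: as far as I know, CPython pickles a compiled pattern by storing its pattern string and flags and compiles it again on load, so caching the string returned by build_upper_class() would save the scan over all code points but not the compile step.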

> for the latter two, to work on utf-8 strings, would I have to set the
> defaultencoding to utf-8?

For Unicode things, you should avoid using byte strings - especially
when it comes to regular expressions. Use Unicode strings instead.
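
A small illustration of why decoding first matters, again assuming Python 3 (where bytes and str are distinct types and a bytes pattern treats \w as ASCII-only); the sample text is made up.

```python
import re

# UTF-8 bytes as they might arrive from a file or a socket.
raw = "Привет, Мир".encode("utf-8")

# A bytes pattern sees individual bytes, and \w is ASCII-only for bytes,
# so the multi-byte sequences of the Cyrillic letters never match.
print(re.findall(rb"\w+", raw))

# Decoding first lets the same pattern operate on whole characters.
text = raw.decode("utf-8")
print(re.findall(r"\w+", text))
```

Decoding at the boundary also removes the need to touch the default encoding, which is what the original question was asking about.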

Regards,
Martin
