Python and Cyrillic characters in regular expression

phasma

Sep 4, 2008, 10:42:47 AM

Hi, I'm trying to extract all alphabetic characters from a string.

reg = re.compile('(?u)([\w\s]+)', re.UNICODE)
buf = reg.match(string)

But it doesn't work. If the string starts with a Cyrillic character,
everything works fine. But if the string starts with a Latin character,
match returns only the Latin characters.

Please, help.

MRAB

Sep 4, 2008, 1:46:39 PM

On Sep 4, 3:42 pm, phasma <xpa...@gmail.com> wrote:
> Hi, I'm trying to extract all alphabetic characters from a string.
>
> reg = re.compile('(?u)([\w\s]+)', re.UNICODE)

You don't need both (?u) and re.UNICODE: they mean the same thing.

This will actually match word characters (letters, digits and
underscore) and whitespace, not just letters.
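
A minimal sketch of that equivalence, written in Python 3 syntax (an assumption; the thread itself uses Python 2, where the u"" prefix and the print statement apply). The sample string and the variable names are made up for illustration:

```python
import re

# Same pattern, flag given inline vs. as an argument.
# (In Python 3, str patterns are Unicode-aware even without the flag.)
inline = re.compile(r'(?u)([\w\s]+)')
by_arg = re.compile(r'([\w\s]+)', re.UNICODE)

sample = "ya \u044f_42 !"   # "ya я_42 !": Latin, Cyrillic, underscore, digits

inline_result = inline.match(sample).group(1)
by_arg_result = by_arg.match(sample).group(1)

# Both stop at "!", which is neither a word character nor whitespace.
print(inline_result == by_arg_result)
print(ascii(inline_result))   # ascii() keeps the output printable on any terminal
```

Note that the underscore and the digits are matched too, which is why \w is broader than "alphabetic".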

> buf = reg.match(string)
>
> But it doesn't work. If the string starts with a Cyrillic character,
> everything works fine. But if the string starts with a Latin character,
> match returns only the Latin characters.
>

I'm encoding the Unicode results as UTF-8 in order to print them, but
I'm not having a problem with it otherwise:

Program
=======
# -*- coding: utf-8 -*-
import re
reg = re.compile('(?u)([\w\s]+)')

found = reg.match(u"ya я")
print found.group(1).encode("utf-8")

found = reg.match(u"я ya")
print found.group(1).encode("utf-8")

Output
======
ya я
я ya

Fredrik Lundh

Sep 4, 2008, 1:53:32 PM

to pytho...@python.org

phasma wrote:

can you provide a few sample strings that show this behaviour?

</F>

phasma

Sep 5, 2008, 7:28:14 AM

string = u"Привет"
(u'\u041f\u0440\u0438\u0432\u0435\u0442',)

string = u"Hi.Привет"
(u'Hi',)

MRAB

Sep 5, 2008, 10:28:12 AM

On Sep 5, 12:28 pm, phasma <xpa...@gmail.com> wrote:
> string = u"Привет"

All the characters are letters.

> (u'\u041f\u0440\u0438\u0432\u0435\u0442',)
>

> string = u"Hi.Привет"

The third character isn't a letter and isn't whitespace.

> (u'Hi',)
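
To see where the match gives up, one can look at the end position of the match object. This is only a sketch in Python 3 syntax (an assumption; the thread uses Python 2), and the names found and stop are illustrative:

```python
import re

reg = re.compile(r'([\w\s]+)')
string = "Hi.\u041f\u0440\u0438\u0432\u0435\u0442"   # "Hi.Привет"

found = reg.match(string)
stop = found.end()          # index just past the matched text

print(found.group(1))       # the part that matched
print(repr(string[stop]))   # the character that ended the match
```

match only ever tries at the start of the string, so everything after the first rejected character is never examined.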

Fredrik Lundh

Sep 5, 2008, 1:43:14 PM

to pytho...@python.org

phasma wrote:

> string = u"Привет"
> (u'\u041f\u0440\u0438\u0432\u0435\u0442',)
>
> string = u"Hi.Привет"
> (u'Hi',)

the [\w\s] pattern you used matches letters, numbers, underscore, and
whitespace. "." doesn't fall into that category, so the "match" method
stops when it gets to that character.

maybe you could use re.sub or re.findall?

>>> # replace all non-alphanumerics with the empty string
>>> re.sub("(?u)\W+", "", string)
u'Hi\u041f\u0440\u0438\u0432\u0435\u0442'

>>> # find runs of alphanumeric characters
>>> re.findall("(?u)\w+", string)
[u'Hi', u'\u041f\u0440\u0438\u0432\u0435\u0442']
>>> "".join(re.findall("(?u)\w+", string))
u'Hi\u041f\u0440\u0438\u0432\u0435\u0442'

(the "sub" example expects you to specify what characters you want to
skip, while "findall" expects you to specify what you want to keep.)

</F>
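
Since the original question asked for alphabetic characters only, and \w also accepts digits and the underscore, one possible variation on the findall approach is the character class [^\W\d_]. This is a sketch in Python 3 syntax, not something proposed in the thread; the sample string and the names letters_only, runs and joined are made up for illustration:

```python
import re

# "Hi.Привет 2008_x": letters mixed with punctuation, digits and underscore
string = "Hi.\u041f\u0440\u0438\u0432\u0435\u0442 2008_x"

# [^\W\d_] reads as: a word character that is neither a digit nor "_",
# which leaves letters only (Latin, Cyrillic, and so on).
letters_only = re.compile(r'[^\W\d_]+')

runs = letters_only.findall(string)   # runs of letters, in order
joined = "".join(runs)                # all letters as one string

print([ascii(r) for r in runs])
print(ascii(joined))
```

As with the findall example above, this states what to keep rather than what to skip.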
