reg = re.compile('(?u)([\w\s]+)', re.UNICODE)
buf = reg.match(string)
But it doesn't work. If the string starts with a Cyrillic character,
everything works fine. But if the string starts with a Latin character,
match returns only the Latin characters.
Please help.
You don't need both (?u) and re.UNICODE: they mean the same thing.
This will actually match letters, digits, underscores and whitespace.
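A quick way to convince yourself that the two spellings are equivalent is to compile the pattern three ways and compare the results. This is only a sketch: the names inline_only, flag_only and both are made up for illustration, and the sample string is written with \u escapes so it does not depend on the source file's encoding.

```python
import re

# Same pattern, with the Unicode flag given inline, as an argument, or both.
inline_only = re.compile(r"(?u)[\w\s]+")
flag_only = re.compile(r"[\w\s]+", re.UNICODE)
both = re.compile(r"(?u)[\w\s]+", re.UNICODE)

sample = u"\u044f ya"  # Cyrillic "ya", a space, then Latin "ya"

# Each variant should consume the whole sample string.
results = [p.match(sample).group(0) for p in (inline_only, flag_only, both)]
all_same = results[0] == results[1] == results[2] == sample
```

If all_same comes out true, dropping either the (?u) prefix or the re.UNICODE argument changes nothing.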
> buf = reg.match(string)
>
> But it doesn't work. If the string starts with a Cyrillic character,
> everything works fine. But if the string starts with a Latin character,
> match returns only the Latin characters.
>
I'm encoding the Unicode results as UTF-8 in order to print them, but
I'm not having a problem with it otherwise:
Program
=======
# -*- coding: utf-8 -*-
import re
reg = re.compile('(?u)([\w\s]+)')
found = reg.match(u"ya я")
print found.group(1).encode("utf-8")
found = reg.match(u"я ya")
print found.group(1).encode("utf-8")
Output
======
ya я
я ya
can you provide a few sample strings that show this behaviour?
</F>
string = u"Hi.Привет"
(u'Hi',)
All the characters are letters.
> (u'\u041f\u0440\u0438\u0432\u0435\u0442',)
>
> string = u"Hi.Привет"
The third character isn't a letter and isn't whitespace.
> (u'Hi',)
> string = u"Привет"
> (u'\u041f\u0440\u0438\u0432\u0435\u0442',)
>
> string = u"Hi.Привет"
> (u'Hi',)
the [\w\s] pattern you used matches letters, numbers, underscore, and
whitespace. "." doesn't fall into that category, so the "match" method
stops when it gets to that character.
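To see exactly where the match gives up, you can ask the match object for its end position and look at the character sitting there. A small sketch, assuming the same pattern as in the original post; the names stopped_at and blocking_char are just illustrative, and the string is spelled with \u escapes to stay encoding-independent.

```python
import re

reg = re.compile(r"(?u)([\w\s]+)")

# Same text as u"Hi.Привет", written with escapes.
string = u"Hi.\u041f\u0440\u0438\u0432\u0435\u0442"

found = reg.match(string)

matched = found.group(1)            # the part that [\w\s]+ accepted
stopped_at = found.end()            # index of the first character not consumed
blocking_char = string[stopped_at]  # the character that ended the match
```

Here matched is u'Hi', stopped_at is 2 and blocking_char is u'.', so the Cyrillic text is never reached; it has nothing to do with Latin versus Cyrillic letters.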
maybe you could use re.sub or re.findall?
>>> # replace all non-alphanumerics with the empty string
>>> re.sub("(?u)\W+", "", string)
u'Hi\u041f\u0440\u0438\u0432\u0435\u0442'
>>> # find runs of alphanumeric characters
>>> re.findall("(?u)\w+", string)
[u'Hi', u'\u041f\u0440\u0438\u0432\u0435\u0442']
>>> "".join(re.findall("(?u)\w+", string))
u'Hi\u041f\u0440\u0438\u0432\u0435\u0442'
(the "sub" example expects you to specify what characters you want to
skip, while "findall" expects you to specify what you want to keep.)
</F>