Hebrew regex word boundaries


Yuval Adam

May 6, 2013, 5:15:04 AM
to pywe...@googlegroups.com
I'm trying to get the following regex to match full words in Hebrew unicode strings:

    word = re.compile(r'\bמילה', re.U)  # note the unicode flag

… but the pattern never actually matches my words.

What am I doing wrong? Shouldn't it be possible to find word boundaries (\b) in unicode strings?

Yuval Adam

May 6, 2013, 5:28:38 AM
to pywe...@googlegroups.com
The problem also shows up in this pathological test case:

    >>> re.compile(r'\W', re.U).split('לא עובד השיט הזה')
    ['', '', '', … '']

Meir Kriheli

May 6, 2013, 5:29:07 AM
to pyweb-il
For a start, you should use re.search instead:

http://docs.python.org/2/library/re.html#search-vs-match

Plus, unless you're using Python 3, the literals above are byte strings, not unicode strings.

Cheers
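
To make the two points above concrete, here is a minimal sketch, assuming Python 3 (where every str literal is already Unicode, so neither re.U nor a u prefix is needed). The sample sentence and the variable names are illustrative only and do not come from the original post.

```python
import re

# Assumption: Python 3, where str is Unicode and \b is Unicode-aware by default.
word = re.compile(r'\bמילה\b')

text = 'זו מילה אחת'  # illustrative sentence: the word sits mid-string

# re.match only tries the pattern at position 0, so it finds nothing here.
at_start = word.match(text)

# re.search scans the whole string and finds the word.
anywhere = word.search(text)

# \b also refuses to match inside a longer word (here with a prefixed letter).
inside_longer_word = word.search('המילה')

print(at_start)            # None
print(anywhere.span())     # (3, 7)
print(inside_longer_word)  # None
```

On Python 2 the same idea should apply once the pattern and the text are both unicode objects and re.U is passed, which is what the linked search-vs-match section and the remark about byte strings point at.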


--
You received this message because you are subscribed to the Google Groups "PyWeb-IL" group.
To unsubscribe from this group and stop receiving emails from it, send an email to pyweb-il+u...@googlegroups.com.
To post to this group, send email to pywe...@googlegroups.com.
Visit this group at http://groups.google.com/group/pyweb-il?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.
 
 



Meir Kriheli

May 6, 2013, 5:30:09 AM
to pyweb-il
Hi,

Decode the byte string to unicode before splitting:

    >>> re.compile(r'\W', re.U).split('לא עובד השיט הזה'.decode('utf-8'))
    [u'\u05dc\u05d0', u'\u05e2\u05d5\u05d1\u05d3', u'\u05d4\u05e9\u05d9\u05d8', u'\u05d4\u05d6\u05d4']
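
The same contrast can be reproduced on Python 3, where bytes plays roughly the role that a plain str played in Python 2. This is a sketch under that assumption, not the exact Python 2 behaviour, and the variable names are mine.

```python
import re

sentence = 'לא עובד השיט הזה'

# On UTF-8 bytes, \W is ASCII-only: every byte of a Hebrew letter is >= 0x80,
# so each one counts as a separator and only empty pieces are left.
byte_parts = re.compile(rb'\W').split(sentence.encode('utf-8'))

# On str, \W is Unicode-aware, so Hebrew letters are word characters
# and the split happens only at the spaces.
text_parts = re.compile(r'\W').split(sentence)

print(set(byte_parts))  # {b''}
print(text_parts)       # the four Hebrew words
```

The only difference between the two calls is the type of the pattern and the input, which is why decoding the input fixes the split.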




Yuval Adam

May 6, 2013, 5:41:21 AM
to pywe...@googlegroups.com
Fuckin' unicode. Thanks.