
Identifying unicode punctuation characters with Python regex


Shiao

Nov 14, 2008, 5:23:08 AM
Hello,
I'm trying to build a regex in python to identify punctuation
characters in all the languages. Some regex implementations support an
extended syntax \p{P} that does just that. As far as I know, python re
doesn't. Any idea of a possible alternative?

Apart from manually including the punctuation character range for each
and every language, I don't see how this can be done.

Thanks in advance for any suggestions.

John

"Martin v. Löwis"

Nov 14, 2008, 5:27:46 AM
> I'm trying to build a regex in python to identify punctuation
> characters in all the languages. Some regex implementations support an
> extended syntax \p{P} that does just that. As far as I know, python re
> doesn't. Any idea of a possible alternative?

You should use character classes. You can generate them automatically
from the unicodedata module: check whether unicodedata.category(c)
starts with "P".
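
A minimal sketch of that category check, assuming Python 3. The helper name is_punctuation and the sample string are made up for illustration; only unicodedata.category() comes from the suggestion above.

```python
import unicodedata


def is_punctuation(ch):
    """Return True if ch is in one of the Unicode punctuation categories.

    The general categories Pc, Pd, Ps, Pe, Pi, Pf and Po all start with "P".
    """
    return unicodedata.category(ch).startswith('P')


# Hypothetical sample text mixing ASCII and CJK punctuation.
text = 'Hello, world! 我是美國人。'
found = [ch for ch in text if is_punctuation(ch)]
print(found)
```

Note that symbols such as $ or + are in the "S" categories, not "P", so this test does not treat them as punctuation.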

Regards,
Martin

Shiao

Nov 14, 2008, 5:31:09 AM

Thanks Martin. I'll do this.

Mark Tolonen

Nov 14, 2008, 5:43:07 AM

"Shiao" <mult...@gmail.com> wrote in message
news:3a95a51c-cc4f-45ff...@l33g2000pri.googlegroups.com...

You can always build your own pattern. Something like (Python 3.0rc2):

>>> import unicodedata
>>> Po=''.join(chr(x) for x in range(65536)
...            if unicodedata.category(chr(x)) == 'Po')
>>> import re
>>> r=re.compile('['+Po+']')
>>> x='我是美國人。'
>>> x
'我是美國人。'
>>> r.findall(x)
['。']

-Mark

Mark Tolonen

Nov 14, 2008, 6:30:39 AM

"Mark Tolonen" <M8R-y...@mailinator.com> wrote in message
news:xsydnXWBAriky4DU...@comcast.com...

This was an interesting problem. Both \ and ] need to be escaped to find all
the punctuation correctly. It turns out those characters are adjacent in the
Unicode character set (U+005C and U+005D), so in my first attempt the
unescaped \ coincidentally escaped the ] that follows it, and only \ itself
went unmatched.

IDLE 3.0rc2
>>> import re
>>> import unicodedata as u
>>> A=''.join(chr(i) for i in range(65536))
>>> P=''.join(chr(i) for i in range(65536) if u.category(chr(i))[0]=='P')
>>> len(A)
65536
>>> len(P)
491
>>> len(re.findall('['+P+']',A))   # ] was naturally escaped
490
>>> set(P)-set(re.findall('['+P+']',A)) # so only missing \
{'\\'}
>>> P=P.replace('\\','\\\\').replace(']','\\]') # escape both of them.
>>> len(re.findall('['+P+']',A))
491

-Mark

Shiao

Nov 14, 2008, 9:08:23 AM
On Nov 14, 12:30 pm, "Mark Tolonen" <M8R-yft...@mailinator.com> wrote:

Mark,
Many thanks. I feel almost ashamed I got away with it so easily :-)

jhermann

Nov 19, 2008, 6:36:26 AM
> >>> P=P.replace('\\','\\\\').replace(']','\\]')   # escape both of them.

re.escape() does this w/o any assumptions by your code about the regex
implementation.
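
For illustration, here is a sketch of the same character class built with re.escape() instead of the manual replace() calls, assuming Python 3. The names punct and punct_re are hypothetical, and scanning up to sys.maxunicode rather than 65536 is an extension of the code earlier in the thread so that punctuation outside the Basic Multilingual Plane is included as well.

```python
import re
import sys
import unicodedata

# Every code point whose general category starts with "P".
# Assumption: the full Unicode range is wanted, not only the first 65536.
punct = ''.join(
    chr(cp)
    for cp in range(sys.maxunicode + 1)
    if unicodedata.category(chr(cp)).startswith('P')
)

# re.escape() takes care of \, ], - and anything else that is special
# inside [...], so no hand-written replace() calls are needed.
punct_re = re.compile('[' + re.escape(punct) + ']')

# Sanity check in the spirit of the session above: every punctuation
# character should be matched by the class built from it.
missing = set(punct) - set(punct_re.findall(punct))
print(len(punct), missing)
```

The exact character count depends on the Unicode version bundled with the interpreter, so it will generally differ from the 491 shown for 3.0rc2.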
