
Identifying unicode punctuation characters with Python regex

Shiao

14 Nov 2008, 5:23:08

Hello,
I'm trying to build a regex in python to identify punctuation
characters in all the languages. Some regex implementations support an
extended syntax \p{P} that does just that. As far as I know, python re
doesn't. Any idea of a possible alternative?

Apart from manually including the punctuation character range for each
and every language, I don't see how this can be done.

Thanks in advance for any suggestions.

John

"Martin v. Löwis"

14 Nov 2008, 5:27:46

> I'm trying to build a regex in python to identify punctuation
> characters in all the languages. Some regex implementations support an
> extended syntax \p{P} that does just that. As far as I know, python re
> doesn't. Any idea of a possible alternative?

You should use character classes. You can generate them automatically
from the unicodedata module: check whether unicodedata.category(c)
starts with "P".

Regards,
Martin
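
A minimal sketch of the approach Martin describes, assuming Python 3 and only the Basic Multilingual Plane (code points below 65536). The helper name build_punct_class and its limit parameter are hypothetical, introduced here for illustration. Each character is passed through re.escape so that characters such as \ and ] are safe inside the class.

```python
import re
import unicodedata


def build_punct_class(limit=0x10000):
    """Return a regex character class matching every code point below
    `limit` whose Unicode general category starts with "P".

    Hypothetical helper: the name and the `limit` default are assumptions.
    """
    chars = [
        chr(cp)
        for cp in range(limit)
        if unicodedata.category(chr(cp)).startswith("P")
    ]
    # re.escape guards characters that are special inside [...],
    # such as backslash, "]" and "-".
    return "[" + "".join(re.escape(c) for c in chars) + "]"


PUNCT_RE = re.compile(build_punct_class())

print(PUNCT_RE.findall("Hello, world!"))  # → [',', '!']
print(PUNCT_RE.findall("我是美國人。"))     # → ['。']
```

Passing limit=sys.maxunicode + 1 (after importing sys) should extend the class to the supplementary planes, at the cost of a longer build time and a larger pattern.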

Shiao

14 Nov 2008, 5:31:09

Thanks Martin. I'll do this.

Mark Tolonen

14 Nov 2008, 5:43:07

"Shiao" <mult...@gmail.com> wrote in message
news:3a95a51c-cc4f-45ff...@l33g2000pri.googlegroups.com...

You can always build your own pattern. Something like (Python 3.0rc2):

>>> import unicodedata
>>> Po=''.join(chr(x) for x in range(65536) if unicodedata.category(chr(x)) == 'Po')
>>> import re
>>> r=re.compile('['+Po+']')
>>> x='我是美國人。'
>>> x
'我是美國人。'
>>> r.findall(x)
['。']

-Mark

Mark Tolonen

14 Nov 2008, 6:30:39

"Mark Tolonen" <M8R-y...@mailinator.com> wrote in message
news:xsydnXWBAriky4DU...@comcast.com...

This was an interesting problem. Both \ and ] need to be escaped to match
all the punctuation correctly. It turns out those two characters are adjacent
in the Unicode character set, so in my first attempt the \ coincidentally
escaped the ] that followed it.

IDLE 3.0rc2
>>> import re
>>> import unicodedata as u
>>> A=''.join(chr(i) for i in range(65536))
>>> P=''.join(chr(i) for i in range(65536) if u.category(chr(i))[0]=='P')
>>> len(A)
65536
>>> len(P)
491
>>> len(re.findall('['+P+']',A)) # ] was naturally escaped
490
>>> set(P)-set(re.findall('['+P+']',A)) # so only missing \
{'\\'}
>>> P=P.replace('\\','\\\\').replace(']','\\]') # escape both of them.
>>> len(re.findall('['+P+']',A))
491

-Mark

Shiao

14 Nov 2008, 9:08:23

On Nov 14, 12:30 pm, "Mark Tolonen" <M8R-yft...@mailinator.com> wrote:
> "Mark Tolonen" <M8R-yft...@mailinator.com> wrote in message
> news:xsydnXWBAriky4DU...@comcast.com...
>
> > "Shiao" <multis...@gmail.com> wrote in message

Mark,
Many thanks. I feel almost ashamed I got away with it so easily :-)

jhermann

19 Nov 2008, 6:36:26

> >>> P=P.replace('\\','\\\\').replace(']','\\]') # escape both of them.

re.escape() does this without your code having to make any assumptions
about the regex implementation.
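
A short sketch of that suggestion, assuming Python 3 and the same BMP range (code points below 65536) used earlier in the thread. The variable names are illustrative only. The count of punctuation characters depends on the Unicode version bundled with the interpreter, so the sketch compares lengths rather than hard-coding a number.

```python
import re
import unicodedata

# Every BMP code point, and the subset whose category starts with "P".
all_chars = "".join(chr(i) for i in range(65536))
punct = "".join(c for c in all_chars if unicodedata.category(c).startswith("P"))

# re.escape handles backslash, "]" and any other metacharacter,
# so no manual .replace() calls are needed.
pattern = re.compile("[" + re.escape(punct) + "]")

matched = pattern.findall(all_chars)
print(len(matched) == len(punct))          # → True
print("\\" in matched and "]" in matched)  # → True
```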