Identifying unicode punctuation characters with Python regex

Shiao

Nov 14, 2008, 5:23:08 AM
Hello,
I'm trying to build a regex in python to identify punctuation
characters in all the languages. Some regex implementations support an
extended syntax \p{P} that does just that. As far as I know, python re
doesn't. Any idea of a possible alternative?

Apart from manually including the punctuation character range for each
and every language, I don't see how this can be done.

Thanks in advance for any suggestions.

John

"Martin v. Löwis"

Nov 14, 2008, 5:27:46 AM
> I'm trying to build a regex in python to identify punctuation
> characters in all the languages. Some regex implementations support an
> extended syntax \p{P} that does just that. As far as I know, python re
> doesn't. Any idea of a possible alternative?

You should use character classes. You can generate them automatically
from the unicodedata module: check whether unicodedata.category(c)
starts with "P".
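
A minimal sketch of that suggestion, assuming Python 3 and covering every code point up to sys.maxunicode. The names punctuation_class and PUNCT_RE are made up for illustration, and re.escape is used here only so that characters with a special meaning inside [...] do not break the class:

```python
import re
import sys
import unicodedata


def punctuation_class():
    """Return a regex character class matching every code point in a P* category."""
    chars = (chr(cp) for cp in range(sys.maxunicode + 1))
    punct = [c for c in chars if unicodedata.category(c).startswith("P")]
    # Escape each character so that '\', ']' and '-' are taken literally
    # inside the brackets.
    return "[" + "".join(re.escape(c) for c in punct) + "]"


# Hypothetical name for the compiled pattern.
PUNCT_RE = re.compile(punctuation_class())

print(PUNCT_RE.findall("Hello, world! 我是美國人。"))  # [',', '!', '。']
```

Building the class once and reusing the compiled pattern avoids rescanning the Unicode database on every match.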

Regards,
Martin

Shiao

Nov 14, 2008, 5:31:09 AM

Thanks Martin. I'll do this.

Mark Tolonen

Nov 14, 2008, 5:43:07 AM

"Shiao" <mult...@gmail.com> wrote in message
news:3a95a51c-cc4f-45ff...@l33g2000pri.googlegroups.com...

You can always build your own pattern. Something like (Python 3.0rc2):

>>> import unicodedata
>>> Po=''.join(chr(x) for x in range(65536) if unicodedata.category(chr(x)) == 'Po')
>>> import re
>>> r=re.compile('['+Po+']')
>>> x='我是美國人。'
>>> x
'我是美國人。'
>>> r.findall(x)
['。']

-Mark

Mark Tolonen

Nov 14, 2008, 6:30:39 AM

"Mark Tolonen" <M8R-y...@mailinator.com> wrote in message
news:xsydnXWBAriky4DU...@comcast.com...

This was an interesting problem. Need to escape \ and ] to find all the
punctuation correctly, and it turns out those characters are sequential in
the Unicode character set, so ] was coincidentally escaped in my first
attempt.
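
A small sketch of why that happens, assuming Python 3. The variable name naive is made up for illustration. When the punctuation characters are joined in code point order, the backslash lands directly in front of ], so the regex parser reads the pair as an escaped ]:

```python
import re
import unicodedata

# '[', '\' and ']' occupy consecutive code points U+005B..U+005D,
# and each one belongs to a punctuation category (Ps, Po, Pe).
for ch in "[\\]":
    print(hex(ord(ch)), unicodedata.category(ch))

# Unescaped, the backslash followed by ']' is parsed as an escaped ']'.
# The regex text here is [\]] and it matches ']' only, never the backslash.
naive = re.compile("[" + "\\]" + "]")
print(naive.findall("\\]"))  # [']']
```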

IDLE 3.0rc2
>>> import re
>>> import unicodedata as u
>>> A=''.join(chr(i) for i in range(65536))
>>> P=''.join(chr(i) for i in range(65536) if u.category(chr(i))[0]=='P')
>>> len(A)
65536
>>> len(P)
491
>>> len(re.findall('['+P+']',A)) # ] was naturally escaped
490
>>> set(P)-set(re.findall('['+P+']',A)) # so only missing \
{'\\'}
>>> P=P.replace('\\','\\\\').replace(']','\\]') # escape both of them.
>>> len(re.findall('['+P+']',A))
491

-Mark

Shiao

Nov 14, 2008, 9:08:23 AM
On Nov 14, 12:30 pm, "Mark Tolonen" <M8R-yft...@mailinator.com> wrote:

Mark,
Many thanks. I feel almost ashamed I got away with it so easily :-)

jhermann

Nov 19, 2008, 6:36:26 AM
> >>> P=P.replace('\\','\\\\').replace(']','\\]')   # escape both of them.

re.escape() does this without your code making any assumptions about the
regex implementation.
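
A sketch of the same check with re.escape() in place of the two replace() calls, assuming Python 3 and the same 0..65535 range as in the session quoted above (the name pattern is made up for illustration):

```python
import re
import unicodedata

# All BMP code points, and those in any punctuation (P*) category.
A = "".join(chr(i) for i in range(65536))
P = "".join(c for c in A if unicodedata.category(c).startswith("P"))

# re.escape handles '\', ']' and '-' without hand-written replace() calls.
pattern = re.compile("[" + re.escape(P) + "]")

# Every punctuation character should now be matched exactly once.
found = pattern.findall(A)
print(len(found) == len(P))  # True
```

The exact count depends on the Unicode database bundled with the interpreter, so the sketch compares lengths instead of hard-coding a number.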
