Apart from manually including the punctuation character range for each
and every language, I don't see how this can be done.
Thanks in advance for any suggestions.
John
You should use character classes. You can generate them automatically
from the unicodedata module: check whether unicodedata.category(c)
starts with "P".
Regards,
Martin
Thanks Martin. I'll do this.
You can always build your own pattern. Something like (Python 3.0rc2):
>>> import unicodedata
>>> Po=''.join(chr(x) for x in range(65536) if unicodedata.category(chr(x)) == 'Po')
>>> import re
>>> r=re.compile('['+Po+']')
>>> x='我是美國人。'
>>> x
'我是美國人。'
>>> r.findall(x)
['。']
-Mark
This was an interesting problem. Both \ and ] need to be escaped to find all
the punctuation correctly. It turns out those characters are adjacent in the
Unicode character set (U+005C and U+005D), so the \ coincidentally escaped
the ] in my first attempt.
IDLE 3.0rc2
>>> import re
>>> import unicodedata as u
>>> A=''.join(chr(i) for i in range(65536))
>>> P=''.join(chr(i) for i in range(65536) if u.category(chr(i))[0]=='P')
>>> len(A)
65536
>>> len(P)
491
>>> len(re.findall('['+P+']',A)) # ] was naturally escaped
490
>>> set(P)-set(re.findall('['+P+']',A)) # so only missing \
{'\\'}
>>> P=P.replace('\\','\\\\').replace(']','\\]') # escape both of them.
>>> len(re.findall('['+P+']',A))
491
-Mark
Mark,
Many thanks. I feel almost ashamed I got away with it so easily :-)
re.escape() does this without your code making any assumptions about the
regex implementation.
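
For what it's worth, here is a minimal sketch of the re.escape() suggestion.
The names punctuation and punct_re are just illustrative, not from the posts
above. It also assumes you want every code point up to sys.maxunicode rather
than stopping at 65536 as the earlier examples do.

```python
import re
import sys
import unicodedata

# Collect every code point whose Unicode general category starts with "P".
# Assumption: scan the full range instead of only the first 65536 code points.
punctuation = ''.join(
    chr(i) for i in range(sys.maxunicode + 1)
    if unicodedata.category(chr(i)).startswith('P')
)

# re.escape() backslash-escapes whatever is special in a pattern, including
# \ and ], so no hand-written replace() calls are needed.
punct_re = re.compile('[' + re.escape(punctuation) + ']')

text = '我是美國人。'
print(punct_re.findall(text))  # ['。']
```

This should also pick up \ and ] themselves, since both end up escaped inside
the character class.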