So I was wondering what is the pattern to extract (or to match) _true_
words ? Of course, I don't restrict myself to the ascii universe so that
the pattern [a-zA-Z]+ doesn't meet my needs.
"Word" presumably/intuitively; hence the non-standard "[:word:]"
POSIX-like character class alias for \w in some environments.
> But this feature doesn't permit to extract
> true words; by "true" I mean word composed only of _alphabetic_ letters (not
> digit nor underscore).
Are you intentionally excluding CJK ideographs (as not "letters"/alphabetic)?
And what of hyphenated terms (e.g. "re-lock")?
> So I was wondering what is the pattern to extract (or to match) _true_ words
> ? Of course, I don't restrict myself to the ascii universe so that the
> pattern [a-zA-Z]+ doesn't meet my needs.
AFAICT, there doesn't appear to be a nice way to do this in Python
using the std lib `re` module, but I'm not a regex guru.
POSIX character classes are unsupported, which rules out "[:alpha:]".
\w can be made Unicode/locale-sensitive, but includes digits and the
underscore, as you've already pointed out.
\p (Unicode property/block testing), which would allow for
"\p{Alphabetic}" or similar, is likewise unsupported.
Cheers,
Chris
--
http://blog.rebertia.com
letter_class = u"[" + u"".join(unichr(c) for c in range(0x10000) if
unichr(c).isalpha()) + u"]"
Alternatively, you could try the new regex implementation here:
http://pypi.python.org/pypi/regex
which adds support for Unicode properties, and do something like this:
words = regex.findall(ur"\p{Letter}+", unicode_text)
It's an interesting parsing problem to find word breaks in mixed
language text. It's quite common to find English and Japanese text
mixed. (See "http://www.dokidoki6.com/00_index1.html". Caution,
excessively cute.) Each ideograph is a "word", of course.
Parse this into words:
★12/25/2009★
6%DOKIDOKI VISUAL FILE vol.4を公開しました。
アルバムの上部で再生操作、下部でサムネイルがご覧いただけます。
John Nagle
> "Word" presumably/intuitively; hence the non-standard "[:word:]"
> POSIX-like character class alias for \w in some environments.
OK
> Are you intentionally excluding CJK ideographs (as not "letters"/alphabetic)?
Yes, CJK ideographs don't belong to the locale I'm working with ;)
> And what of hyphenated terms (e.g. "re-lock")?
I'm interested only with ascii letters and ascii letters with diacritics
Thanks for your response.