
regular expression unicode character class trouble


Diez B. Roggisch

Sep 4, 2005, 12:53:32 PM
Hi,

In a Unicode environment, I need the character class

set("\w") - set("[0-9]")

that is, alpha without digits. Any ideas how to create that? And what
performance implications should I expect? I assume character classes
aren't implemented as sets, but as comparison functions that check a
value against certain well-defined ranges.

Regards,

Diez

Steven Bethard

Sep 4, 2005, 3:08:36 PM
Diez B. Roggisch wrote:
> Hi,
>
> In a Unicode environment, I need the character class
>
> set("\w") - set("[0-9]")
>
> that is, alpha without digits. Any ideas how to create that?

I'd use something like r"[^_\d\W]", that is, everything that is neither an
underscore, a digit, nor a non-word character. In action:

py> import re
py> re.findall(r'[^_\d\W]+', '42badger100x__xxA1BC')
['badger', 'x', 'xxA', 'BC']
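
Since the question is about a Unicode environment, here is a minimal sketch of the same pattern on non-ASCII input. It assumes Python 3, where str patterns are Unicode-aware by default; in the Python of 2005 the re.UNICODE flag would have been needed for \w and \d to follow the Unicode database. The names LETTERS, ASCII_LETTERS and the sample strings are made up for illustration. Compiling the pattern once avoids repeating the pattern lookup on every call, which is probably the cheapest performance precaution.

```python
import re

# Letters only: not an underscore, not a digit, not a non-word character.
# Compiled once so the pattern is reused across calls.
LETTERS = re.compile(r'[^_\d\W]+')

# The ASCII example from the post above.
words = LETTERS.findall('42badger100x__xxA1BC')
print(words)  # ['badger', 'x', 'xxA', 'BC']

# Non-ASCII letters (u-umlaut, sharp s, i-diaeresis) count as word characters.
unicode_words = LETTERS.findall('42Gr\u00fc\u00dfe_100na\u00efve')
print(unicode_words)

# For contrast: with re.ASCII, \w only covers [a-zA-Z0-9_],
# so the non-ASCII letters split the match.
ASCII_LETTERS = re.compile(r'[^_\d\W]+', re.ASCII)
ascii_words = ASCII_LETTERS.findall('Gr\u00fc\u00dfe')
print(ascii_words)  # ['Gr', 'e']
```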

HTH,

STeVe

Diez B. Roggisch

Sep 5, 2005, 5:42:00 AM
Steven Bethard wrote:
> I'd use something like r"[^_\d\W]", that is, everything that is neither an
> underscore, a digit, nor a non-word character. In action:
>
> py> re.findall(r'[^_\d\W]+', '42badger100x__xxA1BC')
> ['badger', 'x', 'xxA', 'BC']
>
> HTH,

Seems so, great!

Diez
