regex help: splitting string gets weird groups

gry

Apr 8, 2010, 2:49:01 PM

[ python3.1.1, re.__version__='2.2.1' ]
I'm trying to use re to split a string into (any number of) pieces of
these kinds:
1) contiguous runs of letters
2) contiguous runs of digits
3) single other characters

e.g. 555tHe-rain.in#=1234 should give: [555, 'tHe', '-', 'rain',
'.', 'in', '#', '=', 1234]
I tried:
>>> re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$', '555tHe-rain.in#=1234').groups()
('1234', 'in', '1234', '=')

Why is 1234 repeated in two groups? and why doesn't "tHe" appear as a
group? Is my regexp illegal somehow and confusing the engine?

I *would* like to understand what's wrong with this regex, though if
someone has a neat other way to do the above task, I'm also interested
in suggestions.

MRAB

Apr 8, 2010, 3:40:17 PM

to pytho...@python.org

If the regex were illegal, compiling it would raise an exception
(re.error). It's doing exactly what you're asking it to do!

First of all, there are 4 groups, with group 1 containing groups 2..4 as
alternatives, so group 1 will match whatever groups 2..4 match:

Group 1: (([A-Za-z]+)|([0-9]+)|([-.#=]))
Group 2: ([A-Za-z]+)
Group 3: ([0-9]+)
Group 4: ([-.#=])

It matches like this:

Group 1 and group 3 match '555'.
Group 1 and group 2 match 'tHe'.
Group 1 and group 4 match '-'.
Group 1 and group 2 match 'rain'.
Group 1 and group 4 match '.'.
Group 1 and group 2 match 'in'.
Group 1 and group 4 match '#'.
Group 1 and group 4 match '='.
Group 1 and group 3 match '1234'.

If a group matches then any earlier match of that group is discarded,
so:

Group 1 finishes with '1234'.
Group 2 finishes with 'in'.
Group 3 finishes with '1234'.
Group 4 finishes with '='.
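
The walk-through above can be reproduced step by step. This is only a
sketch: it uses finditer() on a single copy of group 1 (without the
^...+$ wrapper) to imitate each repetition, and the names step, last
and final are made up for illustration.

```python
import re

s = '555tHe-rain.in#=1234'

# One repetition of the original outer group, without the ^( ... )+$ wrapper.
step = re.compile(r'(([A-Za-z]+)|([0-9]+)|([-.#=]))')

last = {}  # group number -> most recent capture
for m in step.finditer(s):
    # Exactly one of the inner groups 2..4 takes part in each repetition.
    inner = next(i for i in (2, 3, 4) if m.group(i) is not None)
    last[1] = m.group(1)          # group 1 is overwritten every time
    last[inner] = m.group(inner)  # the inner group that matched is overwritten too
    print('Group 1 and group', inner, 'match', repr(m.group()))

final = tuple(last[i] for i in (1, 2, 3, 4))
print(final)  # ('1234', 'in', '1234', '=')

# The same tuple that the original re.match(...).groups() call reported.
original = re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$', s).groups()
print(final == original)
```

Keeping only the latest capture per group is why 'tHe' and 'rain' never
show up: group 2 was later overwritten by 'in'.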

A solution is:

>>> re.findall('[A-Za-z]+|[0-9]+|[-.#=]', '555tHe-rain.in#=1234')
['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']

Note: re.findall() returns a list with one entry per match. If the
regex contains no groups, each entry is the whole matched substring; if
it contains a group, each entry is what that group captured. Compare:

>>> re.findall("a(.)", "ax ay")
['x', 'y']
>>> re.findall("a.", "ax ay")
['ax', 'ay']
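
With more than one group, findall() returns a tuple per match, which is
why the groups were dropped in the solution above. A small sketch using
the thread's sample string (the variable names plain and grouped are
only for illustration):

```python
import re

s = '555tHe-rain.in#=1234'

# No groups: each entry is the whole match.
plain = re.findall(r'[A-Za-z]+|[0-9]+|[-.#=]', s)

# Four groups, nested as in the original pattern: each entry is a 4-tuple,
# with '' for the groups that did not take part in that match.
grouped = re.findall(r'(([A-Za-z]+)|([0-9]+)|([-.#=]))', s)

print(plain[:3])    # ['555', 'tHe', '-']
print(grouped[:3])  # [('555', '', '555', ''), ('tHe', 'tHe', '', ''), ('-', '', '', '-')]
```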

Jon Clements

Apr 8, 2010, 3:44:33 PM

I would avoid .match and use .findall (if you walk through them both
together, it'll make sense what's happening with your match string).

>>> s = """555tHe-rain.in#=1234"""
>>> re.findall('[A-Za-z]+|[0-9]+|[-.#=]', s)
['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']

hth,

Jon.

Patrick Maupin

Apr 8, 2010, 3:46:01 PM

IMO, for most purposes, for people who don't want to become re
experts, the easiest, fastest, best, most predictable way to use re is
re.split. You can either call re.split directly, or, if you are going
to be splitting on the same pattern over and over, compile the pattern
and grab its split method. Use a *single* capture group in the
pattern, that covers the *whole* pattern. In the case of your example
data:

>>> import re
>>> splitter=re.compile('([A-Za-z]+|[0-9]+|[-.#=])').split
>>> s='555tHe-rain.in#=1234'
>>> [x for x in splitter(s) if x]
['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']

The reason for the list comprehension is that re.split will always
return a non-matching string between matches. Sometimes this is
useful even when it is a null string (see recent discussion in the
group about splitting digits out of a string), but if you don't care
to see null (empty) strings, this comprehension will remove them.
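
To make the empty strings visible, here is a minimal sketch of what the
split returns before filtering (raw and tokens are illustrative names):

```python
import re

s = '555tHe-rain.in#=1234'

# A single capture group around the whole pattern, as described above.
raw = re.split(r'([A-Za-z]+|[0-9]+|[-.#=])', s)
print(raw)
# ['', '555', '', 'tHe', '', '-', '', 'rain', '', '.', '', 'in', '', '#', '', '=', '', '1234', '']

# Even positions hold the text between matches (all empty here),
# odd positions hold the matches themselves.
tokens = [x for x in raw if x]
print(tokens)  # ['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']
```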

The reason for a single capture group that covers the whole pattern is
that it is much easier to reason about the output. The split will
give you all your data, in order, e.g.

>>> ''.join(splitter(s)) == s
True

HTH,
Pat

Tim Chase

Apr 8, 2010, 4:02:22 PM

to gry, pytho...@python.org

gry wrote:
> [ python3.1.1, re.__version__='2.2.1' ]
> I'm trying to use re to split a string into (any number of) pieces of
> these kinds:
> 1) contiguous runs of letters
> 2) contiguous runs of digits
> 3) single other characters
>
> e.g. 555tHe-rain.in#=1234 should give: [555, 'tHe', '-', 'rain',
> '.', 'in', '#', '=', 1234]
> I tried:
> >>> re.match('^(([A-Za-z]+)|([0-9]+)|([-.#=]))+$', '555tHe-rain.in#=1234').groups()
> ('1234', 'in', '1234', '=')
>
> Why is 1234 repeated in two groups? and why doesn't "tHe" appear as a
> group? Is my regexp illegal somehow and confusing the engine?

Well, I'm not sure what it thinks it's finding, but nested capture groups
always produce somewhat weird results for me (I suspect that's what's
triggering the duplication). Additionally, you're only searching for
one match: .match() returns a single match object or None, not one
match per repetition of the super-group.

> I *would* like to understand what's wrong with this regex, though if
> someone has a neat other way to do the above task, I'm also interested
> in suggestions.

Tweaking your original, I used

>>> s='555tHe-rain.in#=1234'
>>> import re
>>> r=re.compile(r'([a-zA-Z]+|\d+|.)')
>>> r.findall(s)
['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']

The only difference between my results and your results is that the 555
and 1234 come back as strings, not ints.
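
If ints are wanted for the digit runs, one possible follow-up step is
sketched below. It assumes ASCII digits only (so [0-9] is used in place
of \d), and the names tokens and converted are made up for the example.

```python
import re
import string

s = '555tHe-rain.in#=1234'
tokens = re.findall(r'[a-zA-Z]+|[0-9]+|.', s)

# A token is either all letters, all ASCII digits, or one other character,
# so looking at its first character is enough to spot a digit run.
converted = [int(t) if t[0] in string.digits else t for t in tokens]
print(converted)  # [555, 'tHe', '-', 'rain', '.', 'in', '#', '=', 1234]
```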

-tkc

gry

Apr 8, 2010, 4:36:28 PM

On Apr 8, 3:40 pm, MRAB <pyt...@mrabarnett.plus.com> wrote:

...

> Group 1 and group 4 match '='.
> Group 1 and group 3 match '1234'.
>
> If a group matches then any earlier match of that group is discarded,

Wow, that makes this much clearer! I wonder if this behaviour
shouldn't be mentioned in some form in the python docs?
Thanks much!

Jon Clements

Apr 8, 2010, 4:37:14 PM

On 8 Apr, 19:49, gry <georgeryo...@gmail.com> wrote:

Avoiding re's (for a bit of fun):
(no good for unicode obviously)

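import string
from itertools import groupby, chain, repeat, count

s = """555tHe-rain.in#=1234"""

# Letters share the key 'L' and digits share the key 'D'. Every punctuation
# character gets its own integer key, so two different punctuation
# characters never end up in the same group.
unique_group = count()
lookup = dict(
    chain(
        zip(string.ascii_letters, repeat('L')),
        zip(string.digits, repeat('D')),
        zip(string.punctuation, unique_group)
    )
)
# Per-key conversion: digit runs become ints, letter runs are capitalized.
parse = dict(D=int, L=str.capitalize)

print([parse.get(key, lambda L: L)(''.join(items))
       for key, items in groupby(s, lambda L: lookup[L])])
[555, 'The', '-', 'Rain', '.', 'In', '#', '=', 1234]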

Jon.

gry

Apr 8, 2010, 4:40:16 PM

> >>> s='555tHe-rain.in#=1234'
> >>> import re
> >>> r=re.compile(r'([a-zA-Z]+|\d+|.)')
> >>> r.findall(s)
> ['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']

This is nice and simple and has the invertible property that Patrick
mentioned above. Thanks much!

Patrick Maupin

Apr 8, 2010, 5:06:05 PM

Yes, like using split(), this is invertible. But you will see a
difference (and for a given task, you might prefer one way or the
other) if, for example, you put a few consecutive spaces in the middle
of your string, where this pattern and findall() will return each
space individually, and split() will return them all together.
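
A small sketch of that difference, using a made-up input with three
consecutive spaces (found and pieces are illustrative names):

```python
import re

s = 'ab   12'  # three consecutive spaces in the middle

# findall() with the catch-all '.' alternative: one entry per space.
found = re.findall(r'[a-zA-Z]+|\d+|.', s)

# split() with a single capture group: the unmatched run stays together.
pieces = [x for x in re.split(r'([A-Za-z]+|[0-9]+|[-.#=])', s) if x]

print(found)   # ['ab', ' ', ' ', ' ', '12']
print(pieces)  # ['ab', '   ', '12']
```

Both results still join back to the original string.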

You *can* fix up the pattern for findall() so that it has the same
properties as the split(), but it will almost always be a more
complicated pattern than for the equivalent split().
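
As a rough illustration of such a fixed-up pattern (this is one
possible choice, not something taken from the thread), a fourth
alternative can swallow a whole run of characters that none of the
other alternatives accept:

```python
import re

# Hypothetical pattern: the three original alternatives plus a negated
# class that grabs a run of anything else in one piece.
pattern = re.compile(r'[A-Za-z]+|[0-9]+|[-.#=]|[^A-Za-z0-9.#=-]+')
splitter = re.compile(r'([A-Za-z]+|[0-9]+|[-.#=])').split

s = 'ab   12'
print(pattern.findall(s))              # ['ab', '   ', '12']
print([x for x in splitter(s) if x])   # ['ab', '   ', '12']
```

The negated class has to repeat every character the other alternatives
accept, which is the extra complexity mentioned above.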

Another thing you can do with split(): if you *think* you have a
pattern that fully covers every string you expect to throw at it, but
would like to verify this, you can make use of the fact that split()
returns a string between each match (and before the first match and
after the last match). So if you expect that every character in your
entire string should be a part of a match, you can do something like:

strings = splitter(s)
tokens = strings[1::2]
assert not ''.join(strings[::2])
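
The same check can be wrapped up as a self-contained function. This is
a sketch only; the name tokenize and the choice of ValueError are
assumptions made for the example, not part of the suggestion above.

```python
import re

splitter = re.compile(r'([A-Za-z]+|[0-9]+|[-.#=])').split

def tokenize(s):
    """Return the tokens of s, or raise ValueError if any text is left over."""
    strings = splitter(s)
    tokens = strings[1::2]             # odd positions: the matches
    leftover = ''.join(strings[::2])   # even positions: text between matches
    if leftover:
        raise ValueError('characters not covered by the pattern: %r' % leftover)
    return tokens

print(tokenize('555tHe-rain.in#=1234'))
# ['555', 'tHe', '-', 'rain', '.', 'in', '#', '=', '1234']
```

A string such as 'ab 12' would raise here, because the space is not
covered by any alternative in the pattern.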

Regards,
Pat