
Splitting a string


Thomas Heller

Apr 2, 2010, 6:12:22 AM
to pytho...@python.org
Maybe I'm just lazy, but what is the fastest way to convert a string
into a tuple containing character sequences and integer numbers, like this:


'si_pos_99_rep_1_0.ita' -> ('si_pos_', 99, '_rep_', 1, '_', 0, '.ita')

Thanks for ideas,
Thomas

Alex Willmer

Apr 2, 2010, 6:44:01 AM

This is very probably not the fastest execution-wise, but it was the
fastest development-time-wise:

import re

def maybe_int(x):
    try:
        return int(x)
    except ValueError:
        return x

def strings_n_ints(s):
    return tuple(maybe_int(x) for x in re.findall(r'(\d+|\D+)', s))

>>> strings_n_ints('si_pos_99_rep_1_0.ita')
('si_pos_', 99, '_rep_', 1, '_', 0, '.ita')

Alex

Peter Otten

Apr 2, 2010, 7:24:32 AM
Thomas Heller wrote:

>>> parts = re.compile("([+-]?\d+)").split('si_pos_99_rep_1_0.ita')
>>> parts[1::2] = map(int, parts[1::2])
>>> parts
['si_pos_', 99, '_rep_', 1, '_', 0, '.ita']
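An aside on why the extended-slice assignment is safe here (a sketch in Python 3 syntax; the behaviour is the same in 2.x): re.split with a capturing group returns an odd-length list alternating non-matches and captured matches, so the matches always sit at the odd indices.

```python
import re

# re.split with a capturing group alternates non-matches (even
# indices) and captured matches (odd indices)
parts = re.compile(r"([+-]?\d+)").split('si_pos_99_rep_1_0.ita')
assert parts[0::2] == ['si_pos_', '_rep_', '_', '.ita']  # non-matches
assert parts[1::2] == ['99', '1', '0']                   # matches

# extended-slice assignment converts exactly the matched slots
parts[1::2] = map(int, parts[1::2])
assert parts == ['si_pos_', 99, '_rep_', 1, '_', 0, '.ita']
```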

Peter

Patrick Maupin

Apr 2, 2010, 9:48:47 AM

You beat me to it. re.split() seems underappreciated for some
reason. When I first started using it (even though it was faster
than the alternatives for the tasks I was using it for), I was really
annoyed by the empty strings it provided between matches. Only
within the past couple of years have I come to appreciate the
elegant solutions those empty strings allow. In short, re.split()
is by far the fastest and most elegant way to use the re module for
a large class of problems.

So, the only thing I have to add to this solution is that, for this
particular regular expression, if the source string starts or ends
with digits, you will get empty strings at the beginning or end of
the resulting list; if that is a problem, you will want to check for
and discard them.
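To make the boundary behaviour concrete, here is a small sketch (my own illustration, not code from the thread):

```python
import re

_split = re.compile(r"(\d+)").split

# a string that starts and ends with digits yields empty strings
# at both boundaries of the result
print(_split('99abc7'))  # ['', '99', 'abc', '7', '']

# one simple way to discard them
parts = [p for p in _split('99abc7') if p != '']
print(parts)  # ['99', 'abc', '7']
```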

Regards,
Pat

Terry Reedy

Apr 2, 2010, 11:05:50 AM
to pytho...@python.org
On 4/2/2010 6:21 AM, Shashwat Anand wrote:
> >>> s = 'si_pos_99_rep_1_0.ita'
> >>> res = tuple(re.split(r'(\d+)', s))
> >>> res
> ('si_pos_', '99', '_rep_', '1', '_', '0', '.ita')

This solves the core of the problem, but is not quite there ;-).
Thomas requested conversion of int literals to ints, which is easy:

import re
s = 'si_pos_99_rep_1_0.ita'
res = re.split(r'(\d+)', s)
for i, part in enumerate(res):
    try:
        res[i] = int(part)
    except ValueError:
        pass
res = tuple(res)
print(res)

('si_pos_', 99, '_rep_', 1, '_', 0, '.ita')

Terry Jan Reedy

Bearophile

Apr 2, 2010, 11:41:28 AM
I don't know how fast this is (Python 2.x):

>>> from itertools import groupby
>>> t = 'si_pos_99_rep_1_0.ita'
>>> tuple(int("".join(g)) if h else "".join(g) for h,g in groupby(t, str.isdigit))
('si_pos_', 99, '_rep_', 1, '_', 0, '.ita')

It doesn't work with unicode strings.
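For the record, a sketch of a variant that sidesteps that limitation by keying on each character's own isdigit() method rather than the unbound str.isdigit (split_mixed is a hypothetical name, shown in Python 3 syntax):

```python
from itertools import groupby

def split_mixed(s):
    # calling .isdigit() on each character works for any string
    # type, unlike the unbound str.isdigit used above
    return tuple(int(''.join(g)) if is_digit else ''.join(g)
                 for is_digit, g in groupby(s, key=lambda c: c.isdigit()))

print(split_mixed('si_pos_99_rep_1_0.ita'))
# ('si_pos_', 99, '_rep_', 1, '_', 0, '.ita')
```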

Bye,
bearophile

Thomas Heller

Apr 2, 2010, 4:06:52 PM
to pytho...@python.org
Patrick Maupin schrieb:

Thanks to all for these code snippets. Peter's solution is the winner -
most elegant and also the fastest. With an additional list comprehension
to remove the possible empty strings at the start and at the end I get
16 us. Interestingly, Xavier's solution (which is similar to
some code that I wrote myself) isn't much slower; it gets timings of
around 22 us.
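Those microsecond figures are machine-dependent; a hypothetical timeit harness along these lines would reproduce the measurement (the filtering comprehension is my guess at the one meant above):

```python
import re
import timeit

_split = re.compile(r"(\d+)").split

def split_filtered(s):
    parts = _split(s)
    parts[1::2] = map(int, parts[1::2])
    # drop the possible empty strings at the start and end
    return tuple(p for p in parts if p != '')

print(split_filtered('si_pos_99_rep_1_0.ita'))
# ('si_pos_', 99, '_rep_', 1, '_', 0, '.ita')

# per-call time in seconds / 10000 calls
print(timeit.timeit("split_filtered('si_pos_99_rep_1_0.ita')",
                    globals=globals(), number=10000))
```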

--
Thanks,
Thomas

Peter Otten

Apr 2, 2010, 5:32:58 PM
Thomas Heller wrote:

> Thanks to all for these code snippets. Peter's solution is the winner -
> most elegant and also the fastest. With an additional list comprehension
> to remove the possible empty strings at the start and at the end I get
> 16 us. Interestingly, Xavier's solution (which is similar to
> some code that I wrote myself) isn't much slower; it gets timings of
> around 22 us.

Deleting the first or last item is probably faster than looping over the
whole list. If there aren't any empty strings the overhead is constant.

import re

_split = re.compile(r"(\d+)").split

def split(s):
    if not s:
        return ()
    parts = _split(s)
    parts[1::2] = map(int, parts[1::2])
    if parts[-1] == "":
        del parts[-1]
    if parts[0] == "":
        del parts[0]
    return tuple(parts)

Peter

Patrick Maupin

Apr 2, 2010, 6:53:07 PM
On Apr 2, 4:32 pm, Peter Otten <__pete...@web.de> wrote:

> _split = re.compile(r"(\d+)").split
> def split(s):
>     if not s:
>         return ()
>     parts = _split(s)
>     parts[1::2] = map(int, parts[1::2])
>     if parts[-1] == "":
>         del parts[-1]
>     if parts[0] == "":
>         del parts[0]
>     return tuple(parts)
>

That's certainly faster than a list comprehension (at least on long
lists), but it might be a little obscure why the "if not s:" is
needed, so unless Thomas has a really long result list, he might want
to just keep the list comprehension, which is (IMO) very readable.

Alternatively, this is halfway between the previous example and the
list comprehension:

import re

_split = re.compile(r"(\d+)").split

def split(s):
    parts = _split(s)
    parts[1::2] = map(int, parts[1::2])
    for index in (-1, 0):
        if parts and parts[index] == "":
            del parts[index]
    return tuple(parts)

BTW, I just remembered that, although I have often used the fact that
split returns alternating non-match/match/.../match/non-match in the
past, the last time I did this particular task (of splitting out
digits from a string), I didn't make use of that fact. But I wasn't
expecting very many splits for this case. FWIW, here's a class I
wrote that does this to a string for the express purpose of making
sorts work better:

http://code.google.com/p/pyeda/source/browse/trunk/kipy/kipy/utility/istring.py
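The natural-sort idea behind that class can be sketched with the same re.split technique (natural_key is a hypothetical helper for illustration, not the code at the link):

```python
import re

_split = re.compile(r"(\d+)").split

def natural_key(s):
    # even positions are always text and odd positions always ints,
    # so aligned comparisons during sorting never mix types
    parts = _split(s)
    parts[1::2] = [int(p) for p in parts[1::2]]
    return parts

names = ['rep_10', 'rep_2', 'rep_1']
print(sorted(names, key=natural_key))  # ['rep_1', 'rep_2', 'rep_10']
```

Plain lexicographic sorting would put 'rep_10' before 'rep_2'; keying on the split list compares the embedded numbers numerically.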

Regards,
Pat

Peter Otten

Apr 3, 2010, 5:17:36 AM
Patrick Maupin wrote:

> On Apr 2, 4:32 pm, Peter Otten <__pete...@web.de> wrote:
>
>> _split = re.compile(r"(\d+)").split
>> def split(s):
>>     if not s:
>>         return ()
>>     parts = _split(s)
>>     parts[1::2] = map(int, parts[1::2])

# because s is non-empty, parts contains at least one
# item != "", and parts[x] below cannot fail with an
# IndexError

>>     if parts[-1] == "":
>>         del parts[-1]
>>     if parts[0] == "":
>>         del parts[0]
>>     return tuple(parts)
>>
>
> That's certainly faster than a list comprehension (at least on long
> lists), but it might be a little obscure why the "if not s:" is
> needed,

The function is small; with a test suite covering the corner cases and
perhaps a comment* nothing should go wrong.

(*) you can certainly improve on my attempt

> so unless Thomas has a really long result list, he might want
> to just keep the list comprehension, which is (IMO) very readable.

Generally speaking, performing tests that you know can't fail can
confuse the reader just as much as tests with unobvious interdependencies.

> Alternatively, this is halfway between the previous example and the
> list comprehension:
>
> _split = re.compile(r"(\d+)").split
> def split(s):
> parts = _split(s)
> parts[1::2] = map(int, parts[1::2])
> for index in (-1, 0):
> if parts and parts[index] == "":
> del parts[index]
> return tuple(parts)

I don't think that this is similar to the list comprehension approach
because it only tests the first and the last item instead of the whole list.
Both variants should therefore perform equally well for all but the empty
string argument. If that is a theoretical case you are free to choose the
more readable variant.

Peter

Patrick Maupin

Apr 3, 2010, 12:01:20 PM

Yes, I see your point. The list comprehension will only treat the
ends differently, and will always pass the middle, so someone could be
confused about why the comprehension is there in the first place. I
guess I'm used to doing this same thing on lists that *could* have
empty strings in the middle (with more complicated regular expressions
with multiple match cases), so I didn't consider that.

> > Alternatively, this is halfway between the previous example and the
> > list comprehension:
>
> > _split = re.compile(r"(\d+)").split
> > def split(s):
> >     parts = _split(s)
> >     parts[1::2] = map(int, parts[1::2])
> >     for index in (-1, 0):
> >         if parts and parts[index] == "":
> >             del parts[index]
> >     return tuple(parts)
>
> I don't think that this is similar to the list comprehension approach
> because it only tests the first and the last item instead of the whole list.
> Both variants should therefore perform equally well for all but the empty
> string argument. If that is a theoretical case you are free to choose the
> more readable variant.

I agree that "halfway" was not a very precise way of describing the
differences. Like your solution, this only tests the outer elements.
Like the list comprehension, no short-circuit test before doing the
re.split is required. Also like the list comprehension, the act of
doing the same operation on multiple elements is refactored such that
operation is only coded once.

BUT...

All of this is just a user preference, and is extremely minor compared
to the observation that re.split() and extended string slicing can be
combined to give a very elegant solution to the problem!

Regards,
Pat

Steven D'Aprano

Apr 3, 2010, 11:00:22 PM
On Sat, 03 Apr 2010 11:17:36 +0200, Peter Otten wrote:

>> That's certainly faster than a list comprehension (at least on long
>> lists), but it might be a little obscure why the "if not s:" is needed,
>
> The function is small; with a test suite covering the corner cases and
> perhaps a comment* nothing should go wrong.
>
> (*) you can certainly improve on my attempt
>
>> so unless Thomas has a really long result list, he might want to just
>> keep the list comprehension, which is (IMO) very readable.
>
> Generally speaking, performing tests that you know can't fail
> can confuse the reader just as much as tests with unobvious
> interdependencies.


I'm not sure I agree with you.

Tests which you know can't fail are called assertions, pre-conditions and
post-conditions. We test them because if we don't, they will fail :)

--
Steven

Alf P. Steinbach

Apr 3, 2010, 11:01:48 PM
* Steven D'Aprano:

>
> Tests which you know can't fail are called assertions, pre-conditions and
> post-conditions. We test them because if we don't, they will fail :)

:-)

It's the umbrella law.


Cheers,

- Alf

Patrick Maupin

Apr 3, 2010, 11:10:20 PM
On Apr 3, 10:00 pm, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:
> Tests which you know can't fail are called assertions, pre-conditions and
> post-conditions. We test them because if we don't, they will fail :)

Well, yes, but that can get rather tedious at times:

a = 1
assert 0 < a < 2
b = a + 3
assert 2 < b - a < 4
c = b * 5
assert not c % 5

At least, I usually ameliorate the pain a little bit by not bothering
to print any debugging information at the assertions until one of them
actually fails; otherwise I'd *never* get any real coding done :-)

Steven D'Aprano

Apr 4, 2010, 3:37:01 AM
On Sat, 03 Apr 2010 20:10:20 -0700, Patrick Maupin wrote:

> On Apr 3, 10:00 pm, Steven D'Aprano <st...@REMOVE-THIS-
> cybersource.com.au> wrote:
>> Tests which you know can't fail are called assertions, pre-conditions
>> and post-conditions. We test them because if we don't, they will fail
>> :)
>
> Well, yes, but that can get rather tedious at times:
>
> a = 1
> assert 0 < a < 2

Please tell me that's just an exaggerated example for illustration
purposes, and that you don't *actually* do that!

In any case, the *right* test would be:

a = 1
assert a == 1 and a*5==5 and str(a)=='1' and [None,a,None][a] is a

*wink*

> At least, I usually ameliorate the pain a little bit by not bothering to
> print any debugging information at the assertions until one of them
> actually fails; otherwise I'd *never* get any real coding done :-)


Ditto.

--
Steven

Peter Otten

Apr 4, 2010, 5:58:25 AM
Steven D'Aprano wrote:

> On Sat, 03 Apr 2010 11:17:36 +0200, Peter Otten wrote:
>
>>> That's certainly faster than a list comprehension (at least on long
>>> lists), but it might be a little obscure why the "if not s:" is needed,
>>
>> The function is small; with a test suite covering the corner cases and
>> perhaps a comment* nothing should go wrong.
>>
>> (*) you can certainly improve on my attempt
>>
>>> so unless Thomas has a really long result list, he might want to just
>>> keep the list comprehension, which is (IMO) very readable.
>>
>> Generally speaking, performing tests that you know can't fail
>> can confuse the reader just as much as tests with unobvious
>> interdependencies.
>
>
> I'm not sure I agree with you.

I'm going to help you make up your mind ;)



> Tests which you know can't fail are called assertions, pre-conditions and
> post-conditions. We test them because if we don't, they will fail :)

Note that I said /can/ not /do/ confuse. Consider the actual context,
removing a special value from the start/end of a list:

(1)

if parts[0] == "":
    del parts[0]

if not parts:
    return parts

if parts[-1] == "":
    del parts[-1]

return parts

(2)

return [item for item in parts if item != ""]

Now assume you have to familiarize yourself with the above code. Variant (1)
clearly expresses that the code is meant to touch only the first and last
item. (2) could be just removing all empty strings from a list.

Another way to look at it: it is much easier to refactor from (1) to (2)
than from (2) to (1).

As to assertions etc.: you don't perform arbitrary tests like assert 42 ==
42; you make estimates about what can go wrong if you get unexpected input
or make an error in an algorithm that you cannot safely grasp in its
entirety. As an example you could add

assert "" not in parts[1:-1]

as a precondition to the above snippets. The superfluous tests in the list
comprehension are the opposite of that assertion: they could eat items == ""
in the list that either are intended to pass through or that indicate an
error in code above.

Personally, though, I prefer unit tests over assertions.
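A sketch of the kind of corner-case suite meant here, exercising the split() defined earlier in the thread (run with python -m unittest):

```python
import re
import unittest

_split = re.compile(r"(\d+)").split

def split(s):
    if not s:
        return ()
    parts = _split(s)
    parts[1::2] = map(int, parts[1::2])
    if parts[-1] == "":
        del parts[-1]
    if parts[0] == "":
        del parts[0]
    return tuple(parts)

class SplitCornerCases(unittest.TestCase):
    def test_empty(self):
        self.assertEqual(split(''), ())
    def test_all_digits(self):
        self.assertEqual(split('123'), (123,))
    def test_leading_and_trailing_digits(self):
        self.assertEqual(split('9a9'), (9, 'a', 9))
    def test_original_example(self):
        self.assertEqual(split('si_pos_99_rep_1_0.ita'),
                         ('si_pos_', 99, '_rep_', 1, '_', 0, '.ita'))
```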

Peter

Patrick Maupin

Apr 4, 2010, 11:01:36 AM
On Apr 4, 4:58 am, Peter Otten <__pete...@web.de> wrote:

> Personally, though, I prefer unit tests over assertions.

IMO, the primary use cases for assertions and unit tests are not the
same.

When I have a well-defined, clearly understood specification that I am
coding to and fully implementing, with the aim of producing a fully
general-purpose solution that I am offering to others, unit tests are
appropriate to make sure that the code matches the spec, and even to
provide usage examples.

When I am using Python in what I would call a more personal way, for
example in an exploratory fashion to extract value from some large
mass of data, or even to do something like parsing on a subset of
another language, where I have no intention, desire, or time to
implement the entire spec, assertions are perfectly appropriate to
make sure that my code complains at a convenient point about input
data that it would not handle properly.

It is often the case that a more personal project will grow up and
become more general purpose and widely used. At this point, it is
appropriate to remove assertions, add unit tests, etc., but to
complain about not having unit tests in the first place is to avoid
acknowledging that there may be a higher cost associated with unit
tests than with assertions, especially with immature code that may be
severely refactored several times before settling on the best
solution.

Regards,
Pat

Patrick Maupin

Apr 4, 2010, 11:03:49 AM
On Apr 4, 2:37 am, Steven D'Aprano <st...@REMOVE-THIS-cybersource.com.au> wrote:

> In any case, the *right* test would be:
>
> a = 1
> assert a == 1 and a*5==5 and str(a)=='1' and [None,a,None][a] is a

You're right. I was very tired when I wrote that, and forgot the last
3 assertions...
