Behavior of re.split on empty strings is unexpected

John Nagle

Aug 2, 2010, 1:34:25 PM

The regular expression "split" behaves slightly differently than string
split:

>>> import re
>>> kresplit2 = re.compile(r'[^\w\&]+', re.UNICODE)

>>> kresplit2.split(" HELLO THERE ")
['', 'HELLO', 'THERE', '']

>>> kresplit2.split("VERISIGN INC.")
['VERISIGN', 'INC', '']

I'd thought that "split" would never produce an empty string, but
it will.

The regular string split operation doesn't yield empty strings:

>>> " HELLO THERE ".split()
['HELLO', 'THERE']

If I try to get the functionality of string split with re:

>>> s2 = " HELLO THERE "
>>> kresplit4 = re.compile(r'\W+', re.UNICODE)
>>> kresplit4.split(s2)
['', 'HELLO', 'THERE', '']

I still get empty strings.

The documentation just describes re.split as "Split string by the
occurrences of pattern", which is not too helpful.

John Nagle

Peter Otten

Aug 2, 2010, 2:01:21 PM

John Nagle wrote:

> The regular string split operation doesn't yield empty strings:
>
> >>> " HELLO THERE ".split()
> ['HELLO', 'THERE']

Note that invocation without separator argument (or None as the separator)
is special in that respect:

>>> " hello there ".split(" ")
['', 'hello', 'there', '']

Peter

MRAB

Aug 2, 2010, 2:02:47 PM

to pytho...@python.org

John Nagle wrote:
> The regular expression "split" behaves slightly differently than string
> split:
>
> >>> import re
> >>> kresplit2 = re.compile(r'[^\w\&]+', re.UNICODE)
>
> >>> kresplit2.split(" HELLO THERE ")
> ['', 'HELLO', 'THERE', '']
>
> >>> kresplit2.split("VERISIGN INC.")
> ['VERISIGN', 'INC', '']
>
> I'd thought that "split" would never produce an empty string, but
> it will.
>
> The regular string split operation doesn't yield empty strings:
>
> >>> " HELLO THERE ".split()
> ['HELLO', 'THERE']
>

Yes it does.

>>> "   HELLO    THERE   ".split(" ")
['', '', '', 'HELLO', '', '', '', 'THERE', '', '', '']

> If I try to get the functionality of string split with re:
>
> >>> s2 = " HELLO THERE "
> >>> kresplit4 = re.compile(r'\W+', re.UNICODE)
> >>> kresplit4.split(s2)
> ['', 'HELLO', 'THERE', '']
>
> I still get empty strings.
>
> The documentation just describes re.split as "Split string by the
> occurrences of pattern", which is not too helpful.
>

It's the plain str.split() which is unusual in that:

1. it splits on sequences of whitespace instead of one per occurrence;

2. it discards leading and trailing sequences of whitespace.

Compare:

>>> "  A  B  ".split(" ")
['', '', 'A', '', 'B', '', '']

with:

>>> "  A  B  ".split()
['A', 'B']

It just happens that the unusual one is the most commonly used one, if
you see what I mean! :-)

John Nagle

Aug 2, 2010, 3:41:13 PM

On 8/2/2010 11:02 AM, MRAB wrote:
> John Nagle wrote:
>> The regular expression "split" behaves slightly differently than
>> string split:

>> occurrences of pattern", which is not too helpful.
>>
> It's the plain str.split() which is unusual in that:
>
> 1. it splits on sequences of whitespace instead of one per occurrence;

That can be emulated with the obvious regular expression:

re.compile(r'\W+')

> 2. it discards leading and trailing sequences of whitespace.

But that can't, or at least I can't figure out how to do it.

> It just happens that the unusual one is the most commonly used one, if
> you see what I mean! :-)

The no-argument form of "split" shouldn't be that much of a special
case.
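
A side note on the pattern above, as a minimal sketch: `\W+` matches runs of any non-word characters, so it also splits on punctuation, whereas `\s+` matches only whitespace and is the closer analogue of the no-argument split. The sample string is made up for illustration.

```python
import re

sample = "don't stop"  # hypothetical input, chosen to contain punctuation

# \W+ treats the apostrophe as a separator, just like the space.
by_non_word = re.split(r'\W+', sample)

# \s+ splits only on runs of whitespace.
by_whitespace = re.split(r'\s+', sample)

print(by_non_word)     # ['don', 't', 'stop']
print(by_whitespace)   # ["don't", 'stop']
print(sample.split())  # ["don't", 'stop']
```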

John Nagle

Thomas Jollans

Aug 2, 2010, 3:52:09 PM

to pytho...@python.org

On 08/02/2010 09:41 PM, John Nagle wrote:
> On 8/2/2010 11:02 AM, MRAB wrote:
>> John Nagle wrote:
>>> The regular expression "split" behaves slightly differently than
>>> string split:
>>> occurrences of pattern", which is not too helpful.
>>>
>> It's the plain str.split() which is unusual in that:
>>
>> 1. it splits on sequences of whitespace instead of one per occurrence;
>
> That can be emulated with the obvious regular expression:
>
> re.compile(r'\W+')
>
>> 2. it discards leading and trailing sequences of whitespace.
>
> But that can't, or at least I can't figure out how to do it.

[ s in rexp.split(long_s) if s ]

John Nagle

Aug 2, 2010, 5:22:25 PM

On 8/2/2010 12:52 PM, Thomas Jollans wrote:
> On 08/02/2010 09:41 PM, John Nagle wrote:
>> On 8/2/2010 11:02 AM, MRAB wrote:
>>> John Nagle wrote:
>>>> The regular expression "split" behaves slightly differently than
>>>> string split:
>>>> occurrences of pattern", which is not too helpful.
>>>>
>>> It's the plain str.split() which is unusual in that:
>>>
>>> 1. it splits on sequences of whitespace instead of one per occurrence;
>>
>> That can be emulated with the obvious regular expression:
>>
>> re.compile(r'\W+')
>>
>>> 2. it discards leading and trailing sequences of whitespace.
>>
>> But that can't, or at least I can't figure out how to do it.
>
> [ s in rexp.split(long_s) if s ]

Of course I can discard the blank strings afterward, but
is there some way to do it in the "split" operation? If
not, then the default case for "split()" is too non-standard.

(Also, "if s" won't work; if s != '' might)

John Nagle

Thomas Jollans

Aug 2, 2010, 6:07:58 PM

to pytho...@python.org

On 08/02/2010 11:22 PM, John Nagle wrote:
>> [ s in rexp.split(long_s) if s ]
>
> Of course I can discard the blank strings afterward, but
> is there some way to do it in the "split" operation? If
> not, then the default case for "split()" is too non-standard.
>
> (Also, "if s" won't work; if s != '' might)

Of course it will work. Empty sequences are considered false in Python.

Python 3.1.2 (release31-maint, Jul 8 2010, 09:18:08)
[GCC 4.4.4] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> sprexp = re.compile(r'\s+')
>>> [s for s in sprexp.split(' spaces every where ! ') if s]
['spaces', 'every', 'where', '!']
>>> list(filter(bool, sprexp.split(' more spaces \r\n\t\t ')))
['more', 'spaces']
>>>

(of course, the list comprehension I posted earlier was missing a couple
of words, which was very careless of me)

samwyse

Aug 2, 2010, 8:53:12 PM

On Aug 2, 12:34 pm, John Nagle <na...@animats.com> wrote:
> The regular expression "split" behaves slightly differently than string
> split:

I'm going to argue that it's the string split that's behaving oddly.
To see why, let's first look at some simple CSV values:
cat,dog
,missing,,values,

How many fields are on each line and what are they? Here's what
re.split(',') says:

>>> re.split(',', 'cat,dog')
['cat', 'dog']
>>> re.split(',', ',missing,,values,')
['', 'missing', '', 'values', '']

Note that the presence of missing values is clearly flagged via the
presence of empty strings in the results. Now let's look at string
split:

>>> 'cat,dog'.split(',')
['cat', 'dog']
>>> ',missing,,values,'.split(',')
['', 'missing', '', 'values', '']

The results are the same. Let's try it again, but replacing the commas
with spaces.

>>> re.split(' ', 'cat dog')
['cat', 'dog']
>>> re.split(' ', ' missing  values ')
['', 'missing', '', 'values', '']
>>> 'cat dog'.split(' ')
['cat', 'dog']
>>> ' missing  values '.split(' ')
['', 'missing', '', 'values', '']

The results are the same again; however, many people don't like these results
because they feel that whitespace occupies a privileged role. People
generally agree that a string of consecutive commas means missing
values, but a string of consecutive spaces just means someone held the
space-bar down too long. To accommodate this viewpoint, the string
split is special-cased to behave differently when None is passed as a
separator. First, it splits on any number of whitespace characters,
like this:

>>> re.split(r'\s+', ' missing  values ')
['', 'missing', 'values', '']
>>> re.split(r'\s+', 'cat dog')
['cat', 'dog']

But it also eliminates any empty strings from the head and tail of the
list, because that's what people generally expect when splitting on
whitespace:

>>> 'cat dog'.split(None)
['cat', 'dog']
>>> ' missing  values '.split(None)
['missing', 'values']
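
The two steps described above can be sketched with `re`. This only illustrates the described behavior; it is not how `str.split` is implemented, and the name `split_on_whitespace` is made up for this example.

```python
import re

def split_on_whitespace(text):
    """Hypothetical helper mimicking text.split(None) in two steps."""
    # Step 1: split on runs of whitespace, which can leave '' at either end.
    parts = re.split(r'\s+', text)
    # Step 2: drop the empty string at the head and at the tail, if present.
    if parts and parts[0] == '':
        parts = parts[1:]
    if parts and parts[-1] == '':
        parts = parts[:-1]
    return parts

print(split_on_whitespace(' missing  values '))  # ['missing', 'values']
print(split_on_whitespace('cat dog'))            # ['cat', 'dog']
```

Only the ends need trimming, because `\s+` already swallows each interior run of whitespace as a single separator.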

John Nagle

Aug 3, 2010, 1:41:57 PM

On 8/2/2010 5:53 PM, samwyse wrote:
> On Aug 2, 12:34 pm, John Nagle<na...@animats.com> wrote:
>> The regular expression "split" behaves slightly differently than string
>> split:
>
> I'm going to argue that it's the string split that's behaving oddly.

I tend to agree.

It doesn't seem to be possible to get the same semantics with
any regular expression split. The default "split" has a special
case for head and tail whitespace, and there's no way to express
that with a regular expression split. Applying "strip" first
will work, of course. The documentation should reflect
that.
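
A minimal sketch of the strip-first approach, assuming ordinary whitespace as the separator. The helper name `split_like_str` is hypothetical. The empty-input check is needed because `re.split` returns `['']` for an empty string, while `''.split()` returns `[]`.

```python
import re

def split_like_str(text):
    """Hypothetical helper: approximate text.split() using re.split."""
    stripped = text.strip()
    # An empty string would otherwise come back as [''] from re.split.
    if not stripped:
        return []
    return re.split(r'\s+', stripped)

print(split_like_str(" HELLO THERE "))  # ['HELLO', 'THERE']
print(split_like_str("   "))            # []
```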

John Nagle

jhermann

Aug 5, 2010, 6:07:55 AM

On Aug 2, 7:34 pm, John Nagle <na...@animats.com> wrote:
> >>> s2 = " HELLO THERE "
> >>> kresplit4 = re.compile(r'\W+', re.UNICODE)
> >>> kresplit4.split(s2)
> ['', 'HELLO', 'THERE', '']
>
> I still get empty strings.

>>> re.findall(r"\w+", " a b c ")
['a', 'b', 'c']
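
`findall` avoids the empty strings because it collects the matches themselves rather than the gaps between separators. As a small sketch: `\w+` also drops punctuation, while `\S+` keeps every run of non-whitespace and so gives the same result as the no-argument `split()` on this input. The sample string is adapted from the first post.

```python
import re

text = " VERISIGN INC. "

words = re.findall(r'\w+', text)   # runs of word characters only
tokens = re.findall(r'\S+', text)  # runs of anything that is not whitespace

print(words)         # ['VERISIGN', 'INC']
print(tokens)        # ['VERISIGN', 'INC.']
print(text.split())  # ['VERISIGN', 'INC.']
```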