str.split() with empty separator

Ulrich Eckhardt

Sep 15, 2009, 8:31:26 AM

Hi!

"'abc'.split('')" gives me a "ValueError: empty separator".
However, "''.join(['a', 'b', 'c'])" gives me "'abc'".

Why this asymmetry? I was under the impression that the two would be
complementary.

Uli

--
Sator Laser GmbH
Geschäftsführer: Thorsten Föcking, Amtsgericht Hamburg HR B62 932

Dave Angel

Sep 15, 2009, 9:35:58 AM

to Ulrich Eckhardt, pytho...@python.org

Ulrich Eckhardt wrote:
> Hi!
>
> "'abc'.split('')" gives me a "ValueError: empty separator".
> However, "''.join(['a', 'b', 'c'])" gives me "'abc'".
>
> Why this asymmetry? I was under the impression that the two would be
> complementary.
>
> Uli
>
>

I think the problem is that join() is lossy; if you try "".join(['a',
'bcd', 'e']) then there's no way to reconstruct the original list with
split(). Now that can be true even with actual separators, but perhaps
this was the reasoning.

Anyway, if you want to turn a string into a list of single-character
strings, then use
list("abcde")
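
To make the lossiness concrete, here is a small sketch; the sample lists are only illustrative and not from any real program:

```python
# join() with an empty separator discards the boundaries between items.
original = ['a', 'bcd', 'e']
joined = ''.join(original)

# list() recovers single characters, not the original three items.
chars = list(joined)

# The same loss can occur with a real separator when an item contains it.
tricky = ['a,b', 'c']
round_trip = ','.join(tricky).split(',')

print(joined)      # the joined string
print(chars)       # one entry per character
print(round_trip)  # three items, although only two went in
```

So list() undoes "".join() only when every joined item was a single character.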

DaveA

Vlastimil Brom

Sep 15, 2009, 9:33:51 AM

to pytho...@python.org

2009/9/15 Ulrich Eckhardt <eckh...@satorlaser.com>:

> Hi!
>
> "'abc'.split('')" gives me a "ValueError: empty separator".
> However, "''.join(['a', 'b', 'c'])" gives me "'abc'".
>
> Why this asymmetry? I was under the impression that the two would be
> complementary.
>
> Uli
>

Maybe it isn't quite obvious what the behaviour in this case should be;
re.split also works with an empty delimiter (and returns the original string):
>>> re.split("", "abcde")
['abcde']

If you need to split the string into a list of single characters,
as in your example, list() is one way to do it:
>>> list("abcde")
['a', 'b', 'c', 'd', 'e']
>>>
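
A note for readers on current interpreters: the re.split result quoted above is what Python printed in 2009. To my knowledge, since Python 3.7 re.split() does split on empty matches, so the same call behaves differently. A minimal sketch, assuming Python 3.7 or later:

```python
import re

text = "abcde"

# On Python 3.7+, an empty pattern matches between every character,
# and also at the very start and the very end of the string.
parts = re.split("", text)

# Dropping the empty strings at both ends leaves the single characters,
# i.e. the same result as list(text).
chars = [p for p in parts if p]

print(parts)
print(chars)
```

list() remains the simpler spelling when single characters are all that is wanted.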

vbr

jeffunit

Sep 15, 2009, 10:41:16 AM

to pytho...@python.org

I wrote a program that diffs files and prints out matching file names.
I will be executing the output with sh, to delete select files.

Most of the file names are plain ASCII, but about 10% of them have Unicode
characters in them. When I try to print the string containing the name, I get
an exception:

'ascii' codec can't encode character '\udce9'
in position 37: ordinal not in range(128)

The string is:

'./Julio_Iglesias-Un_Hombre_Solo-05-Qu\udce9_no_se_rompa_la_noche.mp3'

This is on a Windows XP system, using Python 3.1, which I compiled
with the Cygwin Linux compatibility layer.

Can you tell me what encoding I need to print \udce9 and how to set python to
that encoding mode?

thanks,
jeff

MRAB

Sep 15, 2009, 12:30:30 PM

to pytho...@python.org

Vlastimil Brom wrote:
> 2009/9/15 Ulrich Eckhardt <eckh...@satorlaser.com>:

>> Hi!
>>
>> "'abc'.split('')" gives me a "ValueError: empty separator".
>> However, "''.join(['a', 'b', 'c'])" gives me "'abc'".
>>
>> Why this asymmetry? I was under the impression that the two would be
>> complementary.
>>
>> Uli
>>
>

> maybe it isn't quite obvious, what the behaviour in this case should be;
> re.split also works with empty delimiter (and returns the original string)
> >>> re.split("", "abcde")
> ['abcde']
>
> If you need to split the string into the list of single characters
> like in your example, list() is the possible way:
> >>> list("abcde")
> ['a', 'b', 'c', 'd', 'e']
>
I'd prefer it to split into characters. As for re.split, there are times
when it would be nice to be able to split on a zero-width match such as
r"\b" (word boundary).
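
For what it's worth, here is a sketch of the zero-width split wished for here. It assumes Python 3.7 or later, where (as far as I know) re.split() accepts patterns that can match an empty string; the sample sentence is made up:

```python
import re

sentence = "split on words"

# r"\b" matches the empty string at every word boundary, so the result
# alternates between words and the gaps between them. The boundaries at
# the start and end of the string produce empty strings at both ends.
pieces = re.split(r"\b", sentence)

# Keep only the pieces that contain something other than whitespace.
words = [p for p in pieces if p.strip()]

print(pieces)
print(words)
```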

Mark Tolonen

Sep 16, 2009, 12:25:23 AM

to pytho...@python.org

"jeffunit" <je...@jeffunit.com> wrote in message
news:2009091514412...@cdptpa-omta01.mail.rr.com...

That looks like a "surrogate escape" (See PEP 383)
http://www.python.org/dev/peps/pep-0383/. It indicates the wrong encoding
was used to decode the filename.
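
A minimal sketch of how such an escape arises and how it can be undone. The byte string and the guess that the real encoding is Latin-1 are assumptions for illustration; nothing in the thread establishes the actual encoding:

```python
# Hypothetical raw file name as stored on disk, encoded as Latin-1
# (an assumption made only for this example).
raw = b'Qu\xe9_no_se_rompa_la_noche.mp3'

# Decoding with a codec that does not match, plus the PEP 383 error
# handler, keeps the undecodable byte 0xE9 as the lone surrogate U+DCE9.
name = raw.decode('utf-8', errors='surrogateescape')

# The escape is reversible: encoding the same way restores the bytes.
restored = name.encode('utf-8', errors='surrogateescape')

# Decoding those bytes with the codec that actually matches yields the
# intended character.
fixed = restored.decode('latin-1')

print(ascii(name))
print(ascii(fixed))
```

The point of the round trip is that the original bytes are never lost, so the name can still be passed back to the operating system even when it cannot be printed.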

-Mark

jeffunit

Sep 16, 2009, 12:48:10 AM

to pytho...@python.org

That seems likely. How do I set the encoding to something correct to
decode the filename?

Clearly Windows knows how to display it.
I suspect that since I compiled Python with Cygwin, it is using a
POSIX standard rather than a Windows-specific one. Of course, ideally
I would like my code to work on Linux as well as Windows, as I back up
all of my data to a Linux machine with Samba.

thanks,
jeff

Chris Rebert

Sep 16, 2009, 1:07:40 AM

to jeffunit, pytho...@python.org

Have you perhaps tried using the native Windows version of Python?

Cheers,
Chris
--
http://blog.rebertia.com

Hendrik van Rooyen

Sep 16, 2009, 3:29:00 AM

to pytho...@python.org

On Tuesday 15 September 2009 14:50:11 Xavier Ho wrote:
> On Tue, Sep 15, 2009 at 10:31 PM, Ulrich Eckhardt
> <eckh...@satorlaser.com> wrote:

> > "'abc'.split('')" gives me a "ValueError: empty separator".
> > However, "''.join(['a', 'b', 'c'])" gives me "'abc'".
> >
> > Why this asymmetry? I was under the impression that the two would be
> > complementary.
>

> I'm not sure about asymmetry, but how would you implement a split method
> with an empty delimiter to begin with? It doesn't make much sense anyway.

I fell into this trap some time ago too.
There is no such string method.

The opposite of "".join(aListOfChars) is
list(aString)

- Hendrik

Duncan Booth

Sep 16, 2009, 1:21:54 PM

jeffunit <je...@jeffunit.com> wrote:

>>That looks like a "surrogate escape" (See PEP 383)
>>http://www.python.org/dev/peps/pep-0383/. It indicates the wrong
>>encoding was used to decode the filename.
>
> That seems likely. How do I set the encoding to something correct to
> decode the filename?
>
> Clearly Windows knows how to display it.
> I suspect that since I compiled Python with Cygwin, it is using a
> POSIX standard rather than a Windows-specific one. Of course, ideally
> I would like my code to work on Linux as well as Windows, as I back up
> all of my data to a Linux machine with Samba.
>
>

If you are running on a Linux system then the filenames are stored encoded
as bytes but the system does not store the encoding. In fact different
files in the same directory could use different encodings. That's why
Python 3.1 uses the surrogate escapes so that you can at least work with
the files even if you can't display the filenames.

If you are running on Windows and using the native Python to access an NTFS
formatted partition then there shouldn't be a problem: the filenames are
stored as unicode and Python uses the unicode apis. Of course you may still
not be able to display the filenames if they contain characters not
available in your output codepage.
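
If the aim is just to get such names onto the screen or into a log without an exception, one option is to escape whatever the output encoding cannot represent. A rough sketch; the helper name printable() and the shortened file name are made up for illustration:

```python
def printable(name, encoding='ascii'):
    """Return name with unencodable characters shown as backslash escapes."""
    # backslashreplace turns each character the codec cannot encode into
    # an escape such as \xe9 or \udce9 instead of raising an error.
    return name.encode(encoding, errors='backslashreplace').decode(encoding)

# Hypothetical name containing a PEP 383 surrogate escape.
escaped_name = './Qu\udce9_no_se_rompa.mp3'
safe = printable(escaped_name)

# A correctly decoded name is escaped too when the target is plain ASCII.
accented = printable('Qu\xe9')

print(safe)
print(accented)
```

The escaped text is for display only. It is not a usable file name, so anything that will be executed later (a shell script, for instance) still needs the real bytes.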

If you use Cygwin: a quick search on Google turned up some old discussions
implying that it uses the 8-bit APIs, which convert characters using the
current codepage and replace characters they cannot handle with '?', but I
have no idea if that still applies.

David C Ullrich

Sep 17, 2009, 2:42:56 PM

On Tue, 15 Sep 2009 14:31:26 +0200, Ulrich Eckhardt wrote:

> Hi!
>
> "'abc'.split('')" gives me a "ValueError: empty separator". However,
> "''.join(['a', 'b', 'c'])" gives me "'abc'".
>
> Why this asymmetry?

The docs say

"If sep is given, consecutive delimiters are not grouped together and are
deemed to delimit empty strings (for example, '1,,2'.split(',') returns
['1', '', '2']). "

Now suppose sep = ''. That means split() should return an infinitely
long list of empty strings! Because if sep = '' then the
string 'hello' starts with an empty string followed by sep
followed by an empty string followed by sep followed by an
empty string followed by sep... that's all before we get to
the 'h'.
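
The argument can be sketched in code. naive_split() below is a hypothetical, simplified splitter written for this illustration, not how str.split() is actually implemented; the max_items guard exists only so the empty-separator case terminates:

```python
def naive_split(text, sep, max_items=10):
    """Cut text at each occurrence of sep, giving up after max_items parts."""
    parts = []
    start = 0
    while len(parts) < max_items:
        idx = text.find(sep, start)
        if idx == -1:
            # No further separator: the rest of the text is the last part.
            parts.append(text[start:])
            return parts
        parts.append(text[start:idx])
        # With sep == '' this adds zero, so the search never advances.
        start = idx + len(sep)
    # Guard tripped: without it the loop would run forever.
    return parts

normal = naive_split('1,,2', ',')
runaway = naive_split('hello', '')

print(normal)
print(runaway)
```

With an empty separator, find() keeps matching at position 0, so the splitter emits empty strings until the guard stops it and never reaches the 'h'. Raising ValueError sidesteps having to pick a meaning for that case.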