Problem splitting a string

Anthony Liu

Oct 15, 2005, 12:52:07 AM
to pytho...@python.org
I have this simple string:

mystr = 'this_NP is_VL funny_JJ'

I want to split it and give me a list as

['this', 'NP', 'is', 'VL', 'funny', 'JJ']

1. I tried mystr.split('_| '), but this gave me:

['this_NP is_VL funny_JJ']

It is not split at all.

2. I tried mystr.split('_'), and this gave me:

['this', 'NP is', 'VL funny', 'JJ']

in which the space is not used as a delimiter.

3. I tried mystr.split(' '), and this gave me:

['this_NP', 'is_VL', 'funny_JJ']

in which '_' is not used as a delimiter.

I think the documentation does say that the
separator/delimiter can be a string representing all
delimiters we want to use.

How do I split the string using both ' ' and '_' as
delimiters at once?

Thanks.





Erik Max Francis

Oct 15, 2005, 12:59:51 AM
Anthony Liu wrote:

> I have this simple string:
>
> mystr = 'this_NP is_VL funny_JJ'
>
> I want to split it and give me a list as
>
> ['this', 'NP', 'is', 'VL', 'funny', 'JJ']
>
> 1. I tried mystr.split('_| '), but this gave me:
>
> ['this_NP is_VL funny_JJ']
>
> It is not split at all.

Use re.split:

>>> import re
>>> mystr = 'this_NP is_VL funny_JJ'
>>> re.split('_| ', mystr)
['this', 'NP', 'is', 'VL', 'funny', 'JJ']
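
A slightly more general sketch of the same idea, assuming the delimiters are single characters: a character class such as [_ ] does the same job as the alternation '_| ', and adding + collapses runs of delimiters. The variable names tokens and messy are made up for illustration.

```python
import re

mystr = 'this_NP is_VL funny_JJ'

# A character class matches any one of the listed characters,
# so this is equivalent to the alternation '_| '.
tokens = re.split(r'[_ ]', mystr)
print(tokens)

# With +, a run of delimiters counts as a single separator,
# which avoids empty strings when the input has repeated spaces.
messy = 'this_NP  is_VL   funny_JJ'
print(re.split(r'[_ ]+', messy))
```

Whether to collapse runs depends on the data; for tagged text like the example, empty tokens are probably not wanted.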

--
Erik Max Francis && m...@alcyone.com && http://www.alcyone.com/max/
San Jose, CA, USA && 37 20 N 121 53 W && AIM erikmaxfrancis
To love without criticism is to be betrayed.
-- Djuna Barnes

Robert Kern

Oct 15, 2005, 1:03:08 AM
to pytho...@python.org
Anthony Liu wrote:
> I have this simple string:
>
> mystr = 'this_NP is_VL funny_JJ'
>
> I want to split it and give me a list as
>
> ['this', 'NP', 'is', 'VL', 'funny', 'JJ']

> I think the documentation does say that the
> separator/delimiter can be a string representing all
> delimiters we want to use.

No, it doesn't.

In [1]: str.split?
Type: method_descriptor
Base Class: <type 'method_descriptor'>
String Form: <method 'split' of 'str' objects>
Namespace: Python builtin
Docstring:
S.split([sep [,maxsplit]]) -> list of strings

Return a list of the words in the string S, using sep as the
delimiter string. If maxsplit is given, at most maxsplit
splits are done. If sep is not specified or is None, any
whitespace string is a separator.

> How do I split the string using both ' ' and '_' as
> delimiters at once?

You could use regular expressions as Jason Stitt mentions, or you could
replace '_' with ' ' and then split.

In [2]: mystr = 'this_NP is_VL funny_JJ'

In [3]: mystr.replace('_', ' ').split()
Out[3]: ['this', 'NP', 'is', 'VL', 'funny', 'JJ']
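
The replace-then-split trick extends to several delimiter characters. The following is only a sketch in current Python 3 syntax (str.maketrans is not how it would have been written in 2005), and split_on_any is a hypothetical helper name, not something from the standard library.

```python
def split_on_any(text, delimiters):
    """Split text on whitespace and on any single character in delimiters.

    Hypothetical helper for illustration only.
    """
    # Map every delimiter character to a space ...
    table = str.maketrans({d: ' ' for d in delimiters})
    # ... then let the no-argument split() handle all whitespace at once.
    return text.translate(table).split()


print(split_on_any('this_NP is_VL funny_JJ', '_'))
print(split_on_any('a:b+c d', ':+'))
```

Note that this always splits on whitespace as well, so it is not suitable when spaces inside tokens must be preserved.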

--
Robert Kern
rk...@ucsd.edu

"In the fields of hell where the grass grows high
Are the graves of dreams allowed to die."
-- Richard Harter

Mike Meyer

Oct 15, 2005, 2:07:51 AM
Robert Kern <rober...@gmail.com> writes:
> Anthony Liu wrote:
>> I have this simple string:
>>
>> mystr = 'this_NP is_VL funny_JJ'
>>
>> I want to split it and give me a list as
>>
>> ['this', 'NP', 'is', 'VL', 'funny', 'JJ']
> You could use regular expressions as Jason Stitt mentions, or you could
> replace '_' with ' ' and then split.
>
> In [2]: mystr = 'this_NP is_VL funny_JJ'
>
> In [3]: mystr.replace('_', ' ').split()
> Out[3]: ['this', 'NP', 'is', 'VL', 'funny', 'JJ']

A third alternative is to split once, then split the substrings a
second time and stitch the results back together:

>>> sum([x.split('_') for x in mystr.split()], [])
['this', 'NP', 'is', 'VL', 'funny', 'JJ']

Which is probably slow. Too bad extend doesn't take multiple arguments.

<mike
--
Mike Meyer <m...@mired.org> http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.

Steven D'Aprano

Oct 15, 2005, 3:47:24 AM
On Fri, 14 Oct 2005 21:52:07 -0700, Anthony Liu wrote:

> I have this simple string:
>
> mystr = 'this_NP is_VL funny_JJ'
>
> I want to split it and give me a list as
>
> ['this', 'NP', 'is', 'VL', 'funny', 'JJ']

> I think the documentation does say that the
> separator/delimiter can be a string representing all
> delimiters we want to use.

No, the delimiter is the delimiter, not a list of delimiters.

The only exception is delimiter=None, which splits on any whitespace.

[Aside: I think a split-on-any-delimiter function would be useful.]
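
To make the point about the separator concrete, here is a small sketch of how str.split treats its argument. The sample strings are invented for illustration.

```python
text = 'a  b_c'  # note the two spaces between 'a' and 'b'

# No separator (or None): split on runs of whitespace, no empty strings.
print(text.split())

# An explicit ' ' splits on every single space, so the double space
# produces an empty string in the result.
print(text.split(' '))

# A multi-character separator is matched literally as a whole string,
# never as a set of alternatives, so nothing is split here.
print(text.split('_| '))

# It only splits where that exact three-character sequence occurs.
print('x_| y'.split('_| '))
```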

> How do I split the string using both ' ' and '_' as
> delimiters at once?

Something like this:

mystr = 'this_NP is_VL funny_JJ'

L1 = mystr.split()  # splits on whitespace
L2 = []
for item in L1:
    L2.extend(item.split('_'))

You can *almost* do that as a one-liner:

L2 = [item.split('_') for item in mystr.split()]

except that gives a list like this:

[['this', 'NP'], ['is', 'VL'], ['funny', 'JJ']]

which needs flattening.
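
One way to do that flattening, sketched with itertools.chain.from_iterable. That function exists in current Python but was added after this thread was written, so treat it as a modern alternative rather than what was available at the time. The names nested and flat are illustrative.

```python
from itertools import chain

mystr = 'this_NP is_VL funny_JJ'

# First pass: split on whitespace, then split each piece on '_'.
nested = [item.split('_') for item in mystr.split()]

# chain.from_iterable walks each inner list in turn, yielding its items,
# without building intermediate concatenated lists.
flat = list(chain.from_iterable(nested))
print(flat)
```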

--
Steven.

Paul Rubin

Oct 15, 2005, 3:40:51 AM
Anthony Liu <antony...@yahoo.com> writes:
> How do I split the string using both ' ' and '_' as
> delimiters at once?

Use re.split.

Alex Martelli

Oct 15, 2005, 4:51:41 AM
Steven D'Aprano <st...@REMOVETHIScyber.com.au> wrote:
...

> You can *almost* do that as a one-liner:

No 'almost' about it...

> L2 = [item.split('_') for item in mystr.split()]
>
> except that gives a list like this:
>
> [['this', 'NP'], ['is', 'VL'], ['funny', 'JJ']]
>
> which needs flattening.

....because the flattening is easy:

[ x for x in y.split('_') for y in z.split(' ') ]


Alex

Alex Martelli

Oct 15, 2005, 4:54:56 AM
Mike Meyer <m...@mired.org> wrote:
...

> A third alternative is to split once, then split the substrings a
> second time and stitch the results back together:
>
> >>> sum([x.split('_') for x in mystr.split()], [])
> ['this', 'NP', 'is', 'VL', 'funny', 'JJ']
>
> Which is probably slow. Too bad extend doesn't take multiple arguments.

Using sum on lists is DEFINITELY slow -- avoid it like the plague.

If you have a list of lists LOL, DON'T use sum(LOL, []), but rather

[x for x in y for y in LOL]


Alex

Steven D'Aprano

Oct 15, 2005, 7:01:42 AM


py> mystr = 'this_NP is_VL funny_JJ'
py> [x for x in y.split('_') for y in mystr.split(' ')]
Traceback (most recent call last):
File "<stdin>", line 1, in ?
NameError: name 'y' is not defined


This works, but isn't flattened:

py> [x for x in [y.split('_') for y in mystr.split(' ')]]
[['this', 'NP'], ['is', 'VL'], ['funny', 'JJ']]

--
Steven.

SPE - Stani's Python Editor

Oct 15, 2005, 6:47:21 AM
Use re.split, as this is the fastest and cleanest way.
However, iff you have to split a lot of strings, the best is:

import re
delimiters = re.compile('_| ')

def split(x):
    return delimiters.split(x)

>>> split('this_NP is_VL funny_JJ')
['this', 'NP', 'is', 'VL', 'funny', 'JJ']

Stani
--
SPE - Stani's Python Editor http://pythonide.stani.be

Fredrik Lundh

Oct 15, 2005, 7:16:05 AM
to pytho...@python.org
"SPE - Stani's Python Editor" wrote:

> Use re.split, as this is the fastest and cleanest way.
> However, iff you have to split a lot of strings, the best is:
>
> import re
> delimiters = re.compile('_| ')
>
> def split(x):
>     return delimiters.split(x)

or, shorter:

import re
split = re.compile('_| ').split

to quickly build a splitter for an arbitrary set of separator characters, use

separators = "_ :+"

split = re.compile("[" + re.escape(separators) + "]").split

to deal with arbitrary separators, you need to be a little bit more careful
when you prepare the pattern:

separators = sep1, sep2, sep3, sep4, ...

pattern = "|".join(re.escape(p) for p in reversed(sorted(separators)))
split = re.compile(pattern).split
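
A small sketch of why the ordering matters, assuming one separator is a prefix of another ("=" and "==" are invented examples). Regex alternation tries the branches left to right, so the longer separator has to come first. make_splitter is a hypothetical wrapper around the recipe above, and sorted(..., reverse=True) is used here in place of reversed(sorted(...)).

```python
import re


def make_splitter(separators):
    """Return a split function for arbitrary separator strings.

    Hypothetical helper for illustration only.
    """
    # Reverse lexicographic order puts a longer separator ahead of any
    # shorter separator that is a prefix of it.
    ordered = sorted(separators, reverse=True)
    pattern = "|".join(re.escape(s) for s in ordered)
    return re.compile(pattern).split


# Short alternative first: "==" is consumed as two separate "=" matches.
naive = re.compile("=|==").split
print(naive("a==b"))

# Longer alternative first: "==" is matched as a single separator.
careful = make_splitter(["=", "=="])
print(careful("a==b"))
```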

</F>

Kent Johnson

Oct 15, 2005, 9:22:21 AM
Steven D'Aprano wrote:
> On Sat, 15 Oct 2005 10:51:41 +0200, Alex Martelli wrote:
>>[ x for x in y.split('_') for y in z.split(' ') ]
>
> py> mystr = 'this_NP is_VL funny_JJ'
> py> [x for x in y.split('_') for y in mystr.split(' ')]
> Traceback (most recent call last):
> File "<stdin>", line 1, in ?
> NameError: name 'y' is not defined

The order of the 'for' clauses is backwards:
>>> [x for y in mystr.split(' ') for x in y.split('_')]
['this', 'NP', 'is', 'VL', 'funny', 'JJ']

Kent

Kent Johnson

Oct 15, 2005, 9:28:03 AM
Alex Martelli wrote:
> Using sum on lists is DEFINITELY slow -- avoid it like the plague.
>
> If you have a list of lists LOL, DON'T use sum(LOL, []), but rather
>
> [x for x in y for y in LOL]

Should be
>>> lol = [[1,2],[3,4]]
>>> [x for y in lol for x in y]
[1, 2, 3, 4]

The outer loop comes first.
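
One way to remember the order: the 'for' clauses in a comprehension read in the same order as the equivalent nested loops. A minimal sketch (flat_comp and flat_loop are just illustrative names):

```python
lol = [[1, 2], [3, 4]]

# Comprehension: the outer loop is written first, the inner loop second.
flat_comp = [x for y in lol for x in y]

# The same thing as explicit nested loops, in the same clause order.
flat_loop = []
for y in lol:        # outer loop
    for x in y:      # inner loop
        flat_loop.append(x)

print(flat_comp == flat_loop)
```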

Kent
