Splitting strings on spaces, unless inside quotes

Richard Livsey

Jan 6, 2006, 7:08:39 PM
I want to split a string into words, but group quoted words together
such that...

some words "some quoted text" some more words

would get split up into:

["some", "words", "some quoted text", "some", "more", "words"]

So far I'm drawing a blank on the 'Ruby way' to do this and the only
solutions I can think of are turning out to be fairly ugly.

Any advice would be great. Thanks in advance.

--
R.Livsey
http://livsey.org

Eero Saynatkari

Jan 6, 2006, 8:00:06 PM

Naively, you can try something like this:

s = 'foo bar "baz quux" roo'
s.scan(/(?:"")|(?:"(.*[^\\])")|(\w+)/).flatten.compact

Elaborate as necessary (add support for single quotes or something).
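
As a rough sketch of that elaboration (not the code above): the pattern
below adds a single-quoted alternative and uses \S+ for bare words, on
the assumption that quotes are balanced and never escaped. The method
name split_words is made up for illustration.

```ruby
# Hypothetical helper: split on whitespace, keeping "double" or 'single'
# quoted runs together. Assumes balanced, unescaped quotes.
def split_words(s)
  # Each match yields [double_quoted, single_quoted, bare];
  # exactly one of the three is non-nil.
  s.scan(/"([^"]*)"|'([^']*)'|(\S+)/).map { |groups| groups.compact.first }
end

p split_words(%q{foo bar "baz quux" 'single quoted' roo})
```

Because the quoted alternatives use * rather than +, an empty pair of
quotes comes back as an empty string instead of being dropped.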

> R.Livsey


E


Tim Heaney

Jan 6, 2006, 8:07:52 PM
Richard Livsey <ric...@livsey.org> writes:

> I want to split a string into words, but group quoted words together
> such that...
>
> some words "some quoted text" some more words
>
> would get split up into:
>
> ["some", "words", "some quoted text", "some", "more", "words"]

How about the csv module? Despite the name, you don't have to use
commas.

require 'csv'
CSV::parse_line('some words "some quoted text" some more words', ' ')
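
The positional separator argument above is the API of the time. On
current Ruby versions the csv library takes the separator as the
col_sep keyword option instead. A minimal sketch, assuming a recent csv
library is installed:

```ruby
require 'csv'

line = 'some words "some quoted text" some more words'

# col_sep replaces the old positional separator argument.
fields = CSV.parse_line(line, col_sep: ' ')
p fields
```

One thing to watch for: with a single space as the separator, a run of
several spaces is read as empty fields, so you may want to squeeze the
input or compact the result.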

I hope this helps,

Tim

James Edward Gray II

Jan 6, 2006, 8:52:50 PM

I agree that CSV is the way to go, but here's a direct attempt:

>> example = %Q{some words "some quoted text" some more words}
=> "some words \"some quoted text\" some more words"
>> example.scan(/\s+|\w+|"[^"]*"/).
?> reject { |token| token =~ /^\s+$/ }.
?> map { |token| token.sub(/^"/, "").sub(/"$/, "") }
=> ["some", "words", "some quoted text", "some", "more", "words"]

Hope that gives you some fresh ideas.

James Edward Gray II


Matthew Moss

Jan 6, 2006, 8:54:40 PM
> some words "some quoted text" some more words
>
> would get split up into:
>
> ["some", "words", "some quoted text", "some", "more", "words"]

s = 'some words "some quoted text" some more words'

sa = s.split(/"/).collect { |x| x.strip }
(0...sa.size).to_a.zip(sa).collect { |i,x| (i&1).zero? ? x.split : x }.flatten


Michael 'entropie' Trommer

Jan 6, 2006, 8:55:45 PM
* James Edward Gray II (ja...@grayproductions.net) wrote:
> >> example = %Q{some words "some quoted text" some more words}
> => "some words \"some quoted text\" some more words"
> >> example.scan(/\s+|\w+|"[^"]*"/).
> ?> reject { |token| token =~ /^\s+$/ }.
> ?> map { |token| token.sub(/^"/, "").sub(/"$/, "") }
> => ["some", "words", "some quoted text", "some", "more", "words"]

impressive


So long
--
Michael 'entropie' Trommer; http://ackro.org

ruby -e "0.upto((a='njduspAhnbjm/dpn').size-1){|x| a[x]-=1}; p 'mailto:'+a"

Matthew Moss

Jan 6, 2006, 9:00:33 PM
> (0...sa.size).to_a.zip(sa).collect { |i,x| (i&1).zero? ? x.split : x }.flatten

Just realized that Range responds to zip, so the to_a is unnecessary.

This looks slightly cleaner to me:

(1..sa.size).zip(sa).collect { |i,x| (i&1).zero? ? x : x.split }.flatten
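
The same idea (after splitting on the quote character, every second
chunk was inside quotes) can also be written with each_with_index. This
is only a sketch under the same assumption of balanced quotes; the
method name is hypothetical, and unlike the version above it does not
strip whitespace from the quoted chunks.

```ruby
# Hypothetical helper: chunks at even indexes were outside quotes and are
# split on whitespace; chunks at odd indexes were inside quotes and are
# kept whole.
def split_outside_quotes(s)
  s.split('"').each_with_index.flat_map do |chunk, i|
    i.even? ? chunk.split : [chunk]
  end
end

p split_outside_quotes('some words "some quoted text" some more words')
```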


Xavier Noria

Jan 6, 2006, 9:17:41 PM
On Jan 7, 2006, at 1:08, Richard Livsey wrote:

> I want to split a string into words, but group quoted words
> together such that...
>
> some words "some quoted text" some more words
>
> would get split up into:
>
> ["some", "words", "some quoted text", "some", "more", "words"]

Curiously, someone asked exactly that on freenode#perl tonight.

If the input is that simple and is assumed to be well-formed this is
enough:

irb(main):005:0> %q{some words "some quoted text" some "" more words}.scan(/"[^"]*"|\S+/)
=> ["some", "words", "\"some quoted text\"", "some", "\"\"", "more", "words"]

Since nothing was said about this, it does not handle escaped quotes,
and it assumes quotes are always balanced, so a field cannot be
%q{"foo}, for example.

-- fxn


dbl...@wobblini.net

Jan 6, 2006, 9:33:34 PM
Hi --

On Sat, 7 Jan 2006, James Edward Gray II wrote:

> On Jan 6, 2006, at 6:08 PM, Richard Livsey wrote:
>
>> I want to split a string into words, but group quoted words together such
>> that...
>>
>> some words "some quoted text" some more words
>>
>> would get split up into:
>>
>> ["some", "words", "some quoted text", "some", "more", "words"]
>>
>> So far I'm drawing a blank on the 'Ruby way' to do this and the only
>> solutions I can think of are turning out to be fairly ugly.
>>
>> Any advice would be great. Thanks in advance.
>
> I agree that CSV is the way to go, but here's a direct attempt:

Me too (end of disclaimer :-)


> >> example = %Q{some words "some quoted text" some more words}
> => "some words \"some quoted text\" some more words"
> >> example.scan(/\s+|\w+|"[^"]*"/).
> ?> reject { |token| token =~ /^\s+$/ }.
> ?> map { |token| token.sub(/^"/, "").sub(/"$/, "") }
> => ["some", "words", "some quoted text", "some", "more", "words"]

I think you could do less work:

example.scan(/"[^"]+"|\S+/).map { |word| word.delete('"') }

(Or am I overlooking some reason you'd want to capture sequences of
spaces?)

I changed the \w+ to \S+ (and moved it after the | to avoid having it
sponge up too much) in case the words included non-\w characters.

I guess with zero-width positive lookbehind/ahead one could do it
without the map operation.


David

--
David A. Black
dbl...@wobblini.net

"Ruby for Rails", from Manning Publications, coming April 2006!
http://www.manning.com/books/black


James Edward Gray II

Jan 6, 2006, 9:45:01 PM
On Jan 6, 2006, at 8:33 PM, dbl...@wobblini.net wrote:

>> >> example = %Q{some words "some quoted text" some more words}
>> => "some words \"some quoted text\" some more words"
>> >> example.scan(/\s+|\w+|"[^"]*"/).
>> ?> reject { |token| token =~ /^\s+$/ }.
>> ?> map { |token| token.sub(/^"/, "").sub(/"$/, "") }
>> => ["some", "words", "some quoted text", "some", "more", "words"]
>
> I think you could do less work:
>
> example.scan(/"[^"]+"|\S+/).map { |word| word.delete('"') }
>
> (Or am I overlooking some reason you'd want to capture sequences of
> spaces?)
>
> I changed the \w+ to \S+ (and moved it after the | to avoid having it
> sponge up too much) in case the words included non-\w characters.

You're right, that's better all around.

> I guess with zero-width positive lookbehind/ahead one could do it
> without the map operation.

You can drop the map(), if you're willing to replace it with two
other calls:

>> example = %Q{some words "some quoted text" some more words}
=> "some words \"some quoted text\" some more words"
>> example.scan(/"([^"]+)"|(\S+)/).flatten.compact
=> ["some", "words", "some quoted text", "some", "more", "words"]

James Edward Gray II

Florian Groß

Jan 6, 2006, 11:00:34 PM
Richard Livsey wrote:

> I want to split a string into words, but group quoted words together
> such that...
>
> some words "some quoted text" some more words
>
> would get split up into:
>
> ["some", "words", "some quoted text", "some", "more", "words"]

Try this:

> irb(main):001:0> require 'shellwords'; Shellwords.shellwords 'some words "some quoted text" some more words'
> => ["some", "words", "some quoted text", "some", "more", "words"]
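
For anyone trying this on a current Ruby: the same standard library
also offers Shellwords.split and String#shellsplit. A short sketch,
assuming POSIX-shell-style quoting rules are what you actually want:

```ruby
require 'shellwords'

line = 'some words "some quoted text" some more words'

# Shellwords.split is the module-level form; requiring the library
# also adds String#shellsplit.
words = Shellwords.split(line)
same  = line.shellsplit

p words
p same == words
```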

--
http://flgr.0x42.net/

ara.t....@noaa.gov

Jan 6, 2006, 11:01:05 PM

brilliant!

-a
--
===============================================================================
| ara [dot] t [dot] howard [at] noaa [dot] gov
| all happiness comes from the desire for others to be happy. all misery
| comes from the desire for oneself to be happy.
| -- bodhicaryavatara
===============================================================================

William James

Jan 8, 2006, 9:44:50 AM
Richard Livsey wrote:
> I want to split a string into words, but group quoted words together
> such that...
>
> some words "some quoted text" some more words
>
> would get split up into:
>
> ["some", "words", "some quoted text", "some", "more", "words"]

s = 'some words "some quoted text" some more words'
p s.split( / *"(.*?)" *| / )
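
This works because String#split includes the text of any capture group
in its result, so the quoted part survives while the separators are
thrown away. A small sketch of that behaviour; the wrapper name is made
up, and dropping empty strings is an added assumption to cope with
input that begins with a quote.

```ruby
# A capture group in the pattern makes String#split return the captured
# text as well.
p "1+2".split(/\+/)    # separator discarded
p "1+2".split(/(\+)/)  # separator kept as its own element

# Hypothetical wrapper around the one-liner above. A string that starts
# with a quote yields a leading "", so empty strings are dropped here
# (which also drops a deliberately empty quoted string).
def split_keeping_quotes(s)
  s.split(/ *"(.*?)" *| /).reject(&:empty?)
end

p split_keeping_quotes('"some quoted text" some more words')
```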

Geoff Jacobsen

Jan 9, 2006, 5:31:35 AM

Which along with the CSV solution can't handle complex cases:

s='one two" "\'with quotes\' "three "'

s.split( / *"(.*?)" *| / )

=> ["one", "two", " ", "'with", "quotes'", "three "]

require 'csv'
CSV::parse_line(s)
=> []

but Shellwords can:

require 'shellwords'
Shellwords.shellwords(s)
=> ["one", "two with quotes", "three "]

Robert Klemme

Jan 9, 2006, 6:48:28 AM

Another option is to use scan instead of split:

>> 'some words "some quoted text" some more words'.scan(%r{"(?:(?:[^"]|\\.)*)"|\S+})
=> ["some", "words", "\"some quoted text\"", "some", "more", "words"]

With some additional effort even the quotes can be removed (using grouping
for example).

>> r=[];'some words "some quoted text" some more words'.scan(%r{"((?:[^"]|\\.)*)"|(\S+)}) {|m| r << m.detect {|x|x}};r
=> ["some", "words", "some quoted text", "some", "more", "words"]
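
One caveat if backslash-escaped quotes matter: in the patterns above,
[^"] is tried before \\. and so it appears to consume the backslash as
an ordinary character, after which the next quote ends the match. A
sketch of a variant that tries the escape first and excludes the
backslash from the ordinary-character class; the constant and method
names are hypothetical.

```ruby
# Escape pair first, then any character that is neither a quote nor a
# backslash.
QUOTED_OR_WORD = /"((?:\\.|[^"\\])*)"|(\S+)/

# Hypothetical helper: returns the words, with quotes removed and
# backslash escapes inside quoted parts resolved.
def split_with_escapes(s)
  s.scan(QUOTED_OR_WORD).map do |quoted, bare|
    # quoted is nil for bare words; an empty quoted string is still truthy.
    quoted ? quoted.gsub(/\\(.)/, '\1') : bare
  end
end

p split_with_escapes('say "a \"b\" c" end')
```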

Kind regards

robert

William James

Jan 9, 2006, 2:22:14 PM

This is not a "more complex case"; it is an invalid case.
The original poster simply wanted to avoid splitting on spaces
within double quotes, not within single quotes.

The shellwords "solution" is a solution to a different problem, not
to this one. It can't even handle a simple case:

require 'shellwords'
s = "why can't you think?"
Shellwords.shellwords(s)

ArgumentError: Unmatched single quote: 't you think?
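
If Shellwords is used on free-form text anyway, the error can at least
be caught. A minimal sketch; the method name is hypothetical, and
falling back to a plain whitespace split is just one possible policy,
not something anyone in this thread proposed.

```ruby
require 'shellwords'

# Hypothetical helper: use shell-style splitting when the quotes balance,
# otherwise fall back to splitting on whitespace.
def safe_shellsplit(s)
  Shellwords.split(s)
rescue ArgumentError
  # Raised by Shellwords for an unmatched quote such as the apostrophe
  # in "can't".
  s.split
end

p safe_shellsplit("why can't you think?")
p safe_shellsplit('some words "some quoted text" some more words')
```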

Geoff Jacobsen

Jan 9, 2006, 8:50:07 PM
On Tue, 2006-01-10 at 04:23 +0900, William James wrote:
> Geoff Jacobsen wrote:
> > On Mon, 2006-01-09 at 18:13 +0900, William James wrote:
> > > Richard Livsey wrote:
> > > > I want to split a string into words, but group quoted words together
> > > > such that...
> > > >
> > > > some words "some quoted text" some more words
> > > >
> > > > would get split up into:
> > > >
> > > > ["some", "words", "some quoted text", "some", "more", "words"]
> > >
> > > s = 'some words "some quoted text" some more words'
> > > p s.split( / *"(.*?)" *| / )
> >
> > Which along with the CSV solution can't handle complex cases:
> >
> > s='one two" "\'with quotes\' "three "'
> >
> > s.split( / *"(.*?)" *| / )
> > => ["one", "two", " ", "'with", "quotes'", "three "]
..

> > but Shellwords can:
> >
> > require 'shellwords'
> > Shellwords.shellwords(s)
> > => ["one", "two with quotes", "three "]
>
> This is not a "more complex case"; it is an invalid case.
> The original poster simply wanted to avoid splitting on spaces
> within double quotes, not within single quotes.
>
> The shellwords "solution" is a solution to a different problem, not
> to this one. It can't even handle a simple case:
>
> require 'shellwords'
> s = "why can't you think?"
> Shellwords.shellwords(s)
>
> ArgumentError: Unmatched single quote: 't you think?
>

I agree my example doesn't match the originator's request, but *I think*
the post is ambiguous enough to postulate that they may want to handle
more real-world cases such as:

s=%q{symbol "William said: \"why can't you think?\"" 123 "<xml>foo</xml>"}
Shellwords.shellwords(s)

=> ["symbol", "William said: \"why can't you think?\"", "123",
"<xml>foo</xml>"]

So Shellwords may indeed be a solution to this problem but the problem
is not stated precisely enough to know.

