split string at commas respecting quotes when string not in csv format

R. David Murray

Mar 26, 2009, 3:51:34 PM

to pytho...@python.org

OK, I've got a little problem that I'd like to ask the assembled minds
for help with. I can write code to parse this, but I'm thinking it may
be possible to do it with regexes. My regex foo isn't that good, so if
anyone is willing to help (or offer an alternate parsing suggestion)
I would be grateful. (This has to be stdlib only, by the way, I
can't introduce any new modules into the application so pyparsing is
not an option.)

The challenge is to turn a string like this:

a=1,b="0234,)#($)@", k="7"

into this:

[("a", "1"), ("b", "0234,)#($)#"), ("k", "7")]

--
R. David Murray http://www.bitdance.com

Tim Chase

Mar 26, 2009, 4:30:18 PM

to R. David Murray, pytho...@python.org

> The challenge is to turn a string like this:
>
> a=1,b="0234,)#($)@", k="7"
>
> into this:
>
> [("a", "1"), ("b", "0234,)#($)#"), ("k", "7")]

A couple solutions "work" for various pathological cases of input
data:

import re
s = 'a=1,b="0234,)#($)@", k="7"'
r = re.compile(r"""
    (?P<varname>\w+)
    \s*=\s*(?:
        "(?P<quoted>[^"]*)"
        |
        (?P<unquoted>[^,]+)
    )
    """, re.VERBOSE)
results = [
    (m.group('varname'),
     m.group('quoted') or
     m.group('unquoted')
    )
    for m in r.finditer(s)
    ]

############### or ##############################

r = re.compile(r"""
    (\w+)
    \s*=\s*(
        "(?:[^"]*)"
        |
        [^,]+
    )
    """, re.VERBOSE)
results = [
    (m.group(1), m.group(2).strip('"'))
    for m in r.finditer(s)
    ]

Things like internal quoting ('b="123\"456", c="123""456"') would
require a slightly smarter parser.
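
As a rough sketch of what a "slightly smarter" pattern might look like, the following assumes that a quote inside a quoted value is written either as \" or as "" (the two forms mentioned above). The names PAIR, _unescape and parse_pairs are made up for illustration, and nothing here has been checked against the real data or its spec.

```python
import re

# Hypothetical pattern: like the second regex above, but the quoted branch
# also accepts backslash-escaped characters and doubled quotes.
PAIR = re.compile(r'''
    (\w+)                        # variable name
    \s*=\s*                      # equals sign with optional whitespace
    (?:
        "((?:[^"\\]|\\.|"")*)"   # quoted value, quotes left out of the group
      |
        ([^,"]+)                 # unquoted value: anything up to a comma
    )
''', re.VERBOSE)


def _unescape(quoted):
    # Turn \x into x and "" into " in a single pass.
    return re.sub(r'\\(.)|""', lambda m: m.group(1) or '"', quoted)


def parse_pairs(text):
    pairs = []
    for m in PAIR.finditer(text):
        name, quoted, bare = m.groups()
        if quoted is not None:
            pairs.append((name, _unescape(quoted)))
        else:
            pairs.append((name, bare.strip()))
    return pairs


print(parse_pairs('a=1,b="0234,)#($)@", k="7"'))
print(parse_pairs('b="123\\"456", c="123""456"'))
```

Testing `quoted is not None` rather than relying on truthiness keeps an empty quoted value (`a=""`) as an empty string.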

-tkc

John Machin

Mar 26, 2009, 4:46:08 PM

The challenge is for you to explain unambiguously what you want.

1. a=1 => "1" and k="7" => "7" ... is this a mistake or are the quotes
optional in the original string when not required to protect a comma?

2. What is the rule that explains the transmogrification of @ to # in
your example?

3. Is the input guaranteed to be syntactically correct?

The following should do close enough to what you want; adjust as
appropriate.

>>> import re
>>> s = """a=1,b="0234,)#($)@", k="7" """
>>> rx = re.compile(r'[ ]*(\w+)=([^",]+|"[^"]*")[ ]*(?:,|$)')
>>> rx.findall(s)
[('a', '1'), ('b', '"0234,)#($)@"'), ('k', '"7"')]
>>> rx.findall('a=1, *DODGY*SYNTAX* b=2')
[('a', '1'), ('b', '2')]
>>>

HTH,
John

R. David Murray

Mar 26, 2009, 5:11:23 PM

to pytho...@python.org

Tim Chase <pytho...@tim.thechases.com> wrote:
> r = re.compile(r"""
> (\w+)
> \s*=\s*(
> "(?:[^"]*)"
> |
> [^,]+
> )
> """, re.VERBOSE)
> results = [
> (m.group(1), m.group(2).strip('"'))
> for m in r.finditer(s)
> ]
>
> Things like internal quoting ('b="123\"456", c="123""456"') would
> require a slightly smarter parser.

Thank you thank you. I owe you a dinner if we are ever in the
same town (are you at Pycon?).

I'm not going to worry about the internal quotes unless it shows up in
the real data. I'm pretty sure it's not allowed by the spec.

Grant Edwards

Mar 26, 2009, 5:19:23 PM

On 2009-03-26, R. David Murray <rdmu...@bitdance.com> wrote:
> Tim Chase <pytho...@tim.thechases.com> wrote:
>> r = re.compile(r"""
>> (\w+)
>> \s*=\s*(
>> "(?:[^"]*)"
>> |
>> [^,]+
>> )
>> """, re.VERBOSE)
>> results = [
>> (m.group(1), m.group(2).strip('"'))
>> for m in r.finditer(s)
>> ]
>>
>> Things like internal quoting ('b="123\"456", c="123""456"') would
>> require a slightly smarter parser.
>
> Thank you thank you.

We'll wait until you need to modify that and then ask if you're
still grateful. ;)

"There once was a programmer who had a problem..."

--
Grant Edwards (grante at visi.com)
Yow! I'm a fuschia bowling ball somewhere in Brittany

Paul McGuire

Mar 26, 2009, 5:22:39 PM

If you must cram all your code into a single source file, then
pyparsing would be problematic. But pyparsing's installation
footprint is really quite small, just a single Python source file. So
if your program spans more than one file, just add pyparsing.py into
the local directory along with everything else.

Then you could write this little parser and be done (note the
differentiation between 1 and "7"):

test = 'a=1,b="0234,)#($)@", k="7"'

from pyparsing import Suppress, Word, alphas, alphanums, \
    nums, quotedString, removeQuotes, Group, delimitedList

EQ = Suppress('=')
varname = Word(alphas,alphanums)
integer = Word(nums).setParseAction(lambda t:int(t[0]))
varvalue = integer | quotedString.setParseAction(removeQuotes)
var_assignment = varname("name") + EQ + varvalue("rhs")
expr = delimitedList(Group(var_assignment))

results = expr.parseString(test)
print results.asList()
for assignment in results:
    print assignment.name, '<-', repr(assignment.rhs)

Prints:

[['a', 1], ['b', '0234,)#($)@'], ['k', '7']]
a <- 1
b <- '0234,)#($)@'
k <- '7'

-- Paul

Tim Chase

Mar 26, 2009, 5:32:45 PM

to R. David Murray, pytho...@python.org

R. David Murray wrote:
> Tim Chase <pytho...@tim.thechases.com> wrote:
>> r = re.compile(r"""
>> (\w+)
>> \s*=\s*(
>> "(?:[^"]*)"
>> |
>> [^,]+
>> )
>> """, re.VERBOSE)
>> results = [
>> (m.group(1), m.group(2).strip('"'))
>> for m in r.finditer(s)
>> ]
>

> Thank you thank you. I owe you a dinner if we are ever in the
> same town (are you at Pycon?).

Attended PyCon '07 here in Dallas (my back yard), but didn't make
'08 or '09

Grant Edwards wrote:
> We'll wait until you need to modify that and then ask if you're
> still grateful. ;)
>
> "There once was a programmer who had a problem..."

Indeed...I should have commented it a little :)

r = re.compile(r"""
    (\w+)            # the variable name
    \s*=\s*          # the equals with optional ws around it
    (                # grab a group of either
        "(?:[^"]*)"  # double-quotes around non-quoted stuff
        |            # or
        [^,]+        # stuff that's not a comma
    )                # end of the value-grab
    """, re.VERBOSE)

One of the benefits of re.VERBOSE is that it lets you make regexes
a little less opaque. Unfortunately this version (the one without
named groups) includes the surrounding quotes in the second
capture group, so the list comprehension has to strip them off.
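
To see the difference side by side, here is a small sketch that runs both versions on the sample string. The names `named`, `numbered` and `value_of` are invented for this example. `value_of` tests against None instead of using `or`, on the assumption that an empty quoted value such as `a=""` should come back as an empty string and not as None.

```python
import re

s = 'a=1,b="0234,)#($)@", k="7"'

named = re.compile(r'''
    (?P<varname>\w+)
    \s*=\s*
    (?:
        "(?P<quoted>[^"]*)"   # quotes are matched but stay outside the group
      |
        (?P<unquoted>[^,]+)
    )
''', re.VERBOSE)

numbered = re.compile(r'''
    (\w+)
    \s*=\s*
    ( "[^"]*" | [^,]+ )       # quotes are part of group 2
''', re.VERBOSE)


def value_of(m):
    # Prefer the quoted group whenever it took part in the match,
    # even if what it captured is the empty string.
    quoted = m.group('quoted')
    return quoted if quoted is not None else m.group('unquoted')


from_named = [(m.group('varname'), value_of(m)) for m in named.finditer(s)]
from_numbered = [(m.group(1), m.group(2).strip('"'))
                 for m in numbered.finditer(s)]

print(from_named)
print(from_numbered)
```

On the sample string both lists come out the same, so the choice is mostly about whether the quote-stripping lives in the pattern or in the list comprehension.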

-tkc

Terry Reedy

Mar 26, 2009, 5:43:34 PM

to pytho...@python.org

R. David Murray wrote:
> OK, I've got a little problem that I'd like to ask the assembled minds
> for help with. I can write code to parse this, but I'm thinking it may
> be possible to do it with regexes. My regex foo isn't that good, so if
> anyone is willing to help (or offer an alternate parsing suggestion)
> I would be grateful. (This has to be stdlib only, by the way, I
> can't introduce any new modules into the application so pyparsing is
> not an option.)
>
> The challenge is to turn a string like this:
>
> a=1,b="0234,)#($)@", k="7"
>
> into this:
>
> [("a", "1"), ("b", "0234,)#($)#"), ("k", "7")]

But the starting string IS in csv format, where the values are strings
with the format name=string.

>>> import csv
>>> myDialect = csv.excel
>>> myDialect.skipinitialspace = True # needed for space before 'k'
>>> a=list(csv.reader(['''a=1,b="0234,)#($)@", k="7"'''], myDialect))[0]
>>> a
['a=1', 'b="0234', ')#($)@"', 'k="7"']
>>> b=[tuple(s.split('=',1)) for s in a]
>>> b
[('a', '1'), ('b', '"0234'), (')#($)@"',), ('k', '"7"')]

Terry Jan Reedy

John Machin

Mar 26, 2009, 6:20:57 PM

It's in the csv format that Excel accepts on input but this is
irrelevant. The output does not meet the OP's requirements; it has
taken the should-have-been-protected comma as a delimiter, and
produced FOUR elements instead of THREE ... also note '"0234' has a
leading " and ')#($)@"' has a trailing "
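
A short sketch of why the csv module splits it that way: with the default excel dialect, a quote character only opens a quoted field when it is the first character of the field. In b="0234,... the quote comes after b=, so it is treated as ordinary data and the comma still acts as a delimiter. The second input line below is a hypothetical variant (not the OP's format) in which the whole field is quoted, shown only for contrast.

```python
import csv

line = 'a=1,b="0234,)#($)@", k="7"'

# Quote appears mid-field: csv keeps it as a literal character and
# splits on the comma inside the intended value.
mid_field = next(csv.reader([line], skipinitialspace=True))

# Hypothetical variant where the quote opens the field, as csv expects.
variant = 'a=1,"b=0234,)#($)@", k=7'
whole_field = next(csv.reader([variant], skipinitialspace=True))

print(len(mid_field), mid_field)
print(len(whole_field), whole_field)
```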

R. David Murray

Mar 26, 2009, 10:45:52 PM

to pytho...@python.org

John Machin <sjma...@lexicon.net> wrote:
> On Mar 27, 6:51 am, "R. David Murray" <rdmur...@bitdance.com> wrote:
> > OK, I've got a little problem that I'd like to ask the assembled minds
> > for help with. I can write code to parse this, but I'm thinking it may
> > be possible to do it with regexes. My regex foo isn't that good, so if
> > anyone is willing to help (or offer an alternate parsing suggestion)
> > I would be grateful. (This has to be stdlib only, by the way, I
> > can't introduce any new modules into the application so pyparsing is
> > not an option.)
> >
> > The challenge is to turn a string like this:
> >
> > a=1,b="0234,)#($)@", k="7"
> >
> > into this:
> >
> > [("a", "1"), ("b", "0234,)#($)#"), ("k", "7")]
>
> The challenge is for you to explain unambiguously what you want.
>
> 1. a=1 => "1" and k="7" => "7" ... is this a mistake or are the quotes
> optional in the original string when not required to protect a comma?

optional.

> 2. What is the rule that explains the transmogrification of @ to # in
> your example?

Now that's a mistake :)

> 3. Is the input guaranteed to be syntactically correct?

If it's not, it's the customer that gets to deal with the error.

> The following should do close enough to what you want; adjust as
> appropriate.
>
> >>> import re
> >>> s = """a=1,b="0234,)#($)@", k="7" """
> >>> rx = re.compile(r'[ ]*(\w+)=([^",]+|"[^"]*")[ ]*(?:,|$)')
> >>> rx.findall(s)
> [('a', '1'), ('b', '"0234,)#($)@"'), ('k', '"7"')]
> >>> rx.findall('a=1, *DODGY*SYNTAX* b=2')
> [('a', '1'), ('b', '2')]
> >>>

I'm going to save this one and study it, too. I'd like to learn
to use regexes better, even if I do try to avoid them when possible :)

R. David Murray

Mar 26, 2009, 10:50:54 PM

to pytho...@python.org

Paul McGuire <pt...@austin.rr.com> wrote:
> On Mar 26, 2:51 pm, "R. David Murray" <rdmur...@bitdance.com> wrote:
> > OK, I've got a little problem that I'd like to ask the assembled minds
> > for help with. I can write code to parse this, but I'm thinking it may
> > be possible to do it with regexes. My regex foo isn't that good, so if
> > anyone is willing to help (or offer an alternate parsing suggestion)
> > I would be grateful. (This has to be stdlib only, by the way, I
> > can't introduce any new modules into the application so pyparsing is
> > not an option.)
>

> If you must cram all your code into a single source file, then
> pyparsing would be problematic. But pyparsing's installation
> footprint is really quite small, just a single Python source file. So
> if your program spans more than one file, just add pyparsing.py into
> the local directory along with everything else.

It isn't a matter of wanting to cram the code into a single source file.
I'm fixing a bug in a vendor-installed application. A ten line locally
maintained patch is bad enough, installing a whole new external dependency
is just Not An Option :)

Tim Chase

Mar 27, 2009, 6:19:01 AM

to R. David Murray, pytho...@python.org

>> >>> import re
>> >>> s = """a=1,b="0234,)#($)@", k="7" """
>> >>> rx = re.compile(r'[ ]*(\w+)=([^",]+|"[^"]*")[ ]*(?:,|$)')
>> >>> rx.findall(s)
>> [('a', '1'), ('b', '"0234,)#($)@"'), ('k', '"7"')]
>> >>> rx.findall('a=1, *DODGY*SYNTAX* b=2')
>> [('a', '1'), ('b', '2')]
>> >>>
>

> I'm going to save this one and study it, too. I'd like to learn
> to use regexes better, even if I do try to avoid them when possible :)

This regexp is fairly close to the one I used, but I employed the
re.VERBOSE flag to split it out for readability. The above
breaks down as

[ ]*         # optional whitespace, traditionally "\s*"
(\w+)        # tag the variable name as one or more "word" chars
=            # the literal equals sign
(            # tag the value
  [^",]+     # one or more non-[quote/comma] chars
  |          # or
  "[^"]*"    # quotes around a bunch of non-quote chars
)            # end of the value being tagged
[ ]*         # same as previously, optional whitespace ("\s*")
(?:          # a non-capturing group (why?)
  ,          # a literal comma
  |          # or
  $          # the end-of-line/string
)            # end of the non-capturing group

Hope this helps,

-tkc

John Machin

Mar 27, 2009, 9:49:08 AM

On Mar 27, 9:19 pm, Tim Chase <python.l...@tim.thechases.com> wrote:
> >> >>> import re
> >> >>> s = """a=1,b="0234,)#($)@", k="7" """
> >> >>> rx = re.compile(r'[ ]*(\w+)=([^",]+|"[^"]*")[ ]*(?:,|$)')
> >> >>> rx.findall(s)
> >> [('a', '1'), ('b', '"0234,)#($)@"'), ('k', '"7"')]
> >> >>> rx.findall('a=1, *DODGY*SYNTAX* b=2')
> >> [('a', '1'), ('b', '2')]
>
> > I'm going to save this one and study it, too. I'd like to learn
> > to use regexes better, even if I do try to avoid them when possible :)
>
> This regexp is fairly close to the one I used, but I employed the
> re.VERBOSE flag to split it out for readability. The above
> breaks down as
>
> [ ]* # optional whitespace, traditionally "\s*"

No, it's optional space characters -- I'd regard any other type of
whitespace there as a stuff-up.
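
A tiny sketch of that distinction, using fullmatch so that the whole gap has to be consumed. The names `space_only` and `any_whitespace` are made up for the example.

```python
import re

space_only = re.compile(r'[ ]*')      # only the space character
any_whitespace = re.compile(r'\s*')   # spaces, tabs, newlines, ...

for gap in ('   ', ' \t', '\n'):
    print(repr(gap),
          bool(space_only.fullmatch(gap)),
          bool(any_whitespace.fullmatch(gap)))
```

Only the first gap is accepted by both patterns; the tab and the newline get through \s* but not [ ]*.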

> (\w+) # tag the variable name as one or more "word" chars
> = # the literal equals sign
> ( # tag the value
> [^",]+ # one or more non-[quote/comma] chars
> | # or
> "[^"]*" # quotes around a bunch of non-quote chars
> ) # end of the value being tagged
> [ ]* # same as previously, optional whitespace ("\s*")

same correction as previously

> (?: # a non-capturing group (why?)

a group because I couldn't be bothered thinking too hard about the
precedence of the | operator, and non-capturing because the OP didn't
want it captured.
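
For what it's worth, a sketch of what the group buys you. Without it, | has the lowest precedence, so the pattern reads as "everything up to the comma" OR "end of string". The two patterns below are simplified stand-ins (values are just \w+), not the regex from the post.

```python
import re

grouped = re.compile(r'(\w+)=(\w+)[ ]*(?:,|$)')
ungrouped = re.compile(r'(\w+)=(\w+)[ ]*,|$')   # parsed as (... ,) | ($)

text = 'a=1,b=2'
print(grouped.findall(text))
print(ungrouped.findall(text))
```

The grouped form picks up both pairs. The ungrouped form loses b=2, because its first branch insists on a trailing comma and its second branch is nothing but the end-of-string anchor.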

> , # a literal comma
> | # or
> $ # the end-of-line/string
> ) # end of the non-capturing group
>
> Hope this helps,

Me too :-)

Cheers,
John

Paul McGuire

Mar 27, 2009, 9:54:48 AM

Mightn't there be whitespace on either side of the '=' sign? And if
you are using findall, why is the bit with the delimiting commas or
end of line/string necessary? I should think findall would just skip
over this stuff, like it skips over *DODGY*SYNTAX* in your example.

-- Paul

Tim Chase

Mar 27, 2009, 10:19:29 AM

to Paul McGuire, pytho...@python.org

> Mightn't there be whitespace on either side of the '=' sign? And if
> you are using findall, why is the bit with the delimiting commas or
> end of line/string necessary? I should think findall would just skip
> over this stuff, like it skips over *DODGY*SYNTAX* in your example.

Which would leave you with the solution(s) fairly close to what I
originally posited ;-)

(my comment about the "non-capturing group (why?)" was in
relation to not needing to find the EOL/comma because findall()
doesn't need it, as Paul points out, not the precedence of the
"|" operator.)
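
A quick sketch comparing the regex with and without the trailing comma/end-of-string part. On the sample string findall() gives the same pairs either way. The one place the tail seems to matter is junk glued directly onto a value: the input `bad` below is invented for illustration, and the names `with_tail` and `without_tail` are hypothetical.

```python
import re

with_tail = re.compile(r'[ ]*(\w+)=([^",]+|"[^"]*")[ ]*(?:,|$)')
without_tail = re.compile(r'[ ]*(\w+)=([^",]+|"[^"]*")')

sample = 'a=1,b="0234,)#($)@", k="7"'
bad = 'a=1"oops",b=2'   # made-up malformed input

print(with_tail.findall(sample) == without_tail.findall(sample))
print(with_tail.findall(bad))
print(without_tail.findall(bad))
```

With the tail, a=1 is rejected because the value is not followed by a comma or the end of the string; without it, a=1 is accepted and the stray "oops" is silently skipped. Which behaviour is preferable depends on how much validation the pattern is expected to do.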

-tkc