regexp for sequence of quoted strings


g...@ll.mit.edu

May 25, 2005, 3:40:58 PM
I have a string like:
{'the','dog\'s','bite'}
or maybe:
{'the'}
or sometimes:
{}

[FYI: this is postgresql database "array" field output format]

which I'm trying to parse with the re module.
A single quoted string would, I think, be:
r"\{'([^']|\\')*'\}"

but how do I represent a *sequence* of these separated
by commas? I guess I can artificially tack a comma on the
end of the input string and do:

r"\{('([^']|\\')*',)\}"

but that seems like an ugly hack...

I want to end up with a python array of strings like:

['the', "dog's", 'bite']

Any simple clear way of parsing this in python would be
great; I just assume that "re" is the appropriate technique.
Performance is not an issue.

-- George

Steven Bethard

May 25, 2005, 4:11:42 PM
g...@ll.mit.edu wrote:
> I have a string like:
> {'the','dog\'s','bite'}
> or maybe:
> {'the'}
> or sometimes:
> {}
>
[snip]

>
> I want to end up with a python array of strings like:
>
> ['the', "dog's", 'bite']
>
> Any simple clear way of parsing this in python would be
> great; I just assume that "re" is the appropriate technique.
> Performance is not an issue.


py> s = "{'the','dog\'s','bite'}"
py> s
"{'the','dog's','bite'}"
py> s[1:-1]
"'the','dog's','bite'"

py> s[1:-1].split(',')
["'the'", "'dog's'", "'bite'"]
py> [item[1:-1] for item in s[1:-1].split(',')]
['the', "dog's", 'bite']

py> s = "{'the'}"
py> [item[1:-1] for item in s[1:-1].split(',')]
['the']

py> s = "{}"
py> [item[1:-1] for item in s[1:-1].split(',')]
['']

Not sure what you want in the last case, but if you want an empty list,
you can probably add a simple if-statement to check if s[1:-1] is non-empty.
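
A minimal sketch of that check, assuming no quoted item ever contains a comma (split(',') would cut such an item in two). The name parse_simple is made up for illustration, and the replace() call is an added assumption for input where the backslash before the quote is really present in the string:

```python
def parse_simple(s):
    """Strip the braces, split on commas, then strip each item's quotes."""
    inner = s[1:-1]
    if not inner:
        # "{}" has nothing between the braces, so return an empty list
        # instead of [''].
        return []
    # Assumption: no item contains a comma. replace() turns \' into '.
    return [item[1:-1].replace("\\'", "'") for item in inner.split(",")]
```

With this, parse_simple("{}") gives [] and the three-item example gives ['the', "dog's", 'bite'].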

HTH,

STeVe

Alexander Schmolck

May 25, 2005, 4:55:28 PM
g...@ll.mit.edu writes:

> I have a string like:
> {'the','dog\'s','bite'}
> or maybe:
> {'the'}
> or sometimes:
> {}
>
> [FYI: this is postgresql database "array" field output format]
>
> which I'm trying to parse with the re module.
> A single quoted string would, I think, be:
> r"\{'([^']|\\')*'\}"

what about {'dog \\', ...} ?

If you don't need to validate anything you can just forget about the commas
etc and extract all the 'strings' with findall,

The regexp below is a bit too complicated (adapted from something else) but I
think will work:

In [90]:rex = re.compile(r"'(?:[^\n]|(?<!\\)(?:\\)(?:\\\\)*\n)*?(?<!\\)(?:\\\\)*?'")

In [91]:rex.findall(r"{'the','dog\'s','bite'}")
Out[91]:["'the'", "'dog\\'s'", "'bite'"]

Otherwise just add something like ",|}$" to deal with the final } instead of a
comma.
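
For comparison, here is a shorter pattern that follows the same findall idea and also validates the commas and braces. It is only a sketch, not the regexp above: it assumes Python 3 (re.fullmatch), assumes a backslash always escapes the character after it, and the names ITEM, FIELD, ITEM_BODY and parse_quoted_array are made up for illustration.

```python
import re

# One quoted item: a quote, then any run of ordinary characters or
# backslash-escaped characters, then the closing quote.
ITEM = r"'(?:[^'\\]|\\.)*'"

# Whole field: braces around zero or more comma-separated items.
FIELD = re.compile(r"\{(?:%s(?:,%s)*)?\}" % (ITEM, ITEM))

# Same item pattern, but capturing the text between the quotes.
ITEM_BODY = re.compile(r"'((?:[^'\\]|\\.)*)'")

def parse_quoted_array(s):
    """Validate the whole string, then extract and unescape each item."""
    if FIELD.fullmatch(s) is None:
        raise ValueError("not a brace-delimited list of quoted strings: %r" % s)
    # Drop the backslash from every escaped character (\' -> ', \\ -> \).
    return [re.sub(r"\\(.)", r"\1", body) for body in ITEM_BODY.findall(s)]
```

Because the item pattern treats a backslash plus the next character as one unit, an item that ends in an escaped backslash, such as 'dog \\', closes at the right quote.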

Alternatively, you could also write a regexp to split on the "','" bit and trim
the first and the last split.

'as


Paul McGuire

May 25, 2005, 5:59:13 PM
Pyparsing includes some built-in quoted string support that might
simplify this problem. Of course, if you prefer regexp's, I'm by no
means offended!

Check out my Python console session below. (You may need to expand the
unquote method to do more handling of backslash escapes.)

-- Paul
(Download pyparsing at http://pyparsing.sourceforge.net.)

>>> from pyparsing import delimitedList, sglQuotedString
>>> text = r"'the','dog\'s','bite'"
>>> def unquote(s,l,t):
...     t2 = t[0][1:-1]
...     return t2.replace("\\'","'")
...
>>> sglQuotedString.setParseAction(unquote)
>>> g = delimitedList( sglQuotedString )
>>> g.parseString(text).asList()
['the', "dog's", 'bite']

Steven Bethard

May 25, 2005, 7:18:07 PM
Paul McGuire wrote:
> >>> text = r"'the','dog\'s','bite'"
> >>> def unquote(s,l,t):
> ...     t2 = t[0][1:-1]
> ...     return t2.replace("\\'","'")
> ...

Note also, that the codec 'string-escape' can be used to do what's done
with str.replace in this example:

py> s
"'the','dog\\'s','bite'"
py> s.replace("\\'", "'")
"'the','dog's','bite'"
py> s.decode('string-escape')
"'the','dog's','bite'"

Using str.decode() is a little more general as it will also decode other
escaped characters. This may be good or bad depending on your needs.
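
The session above is Python 2. In Python 3, str has no decode() method and the 'string-escape' codec is gone, so a rough equivalent has to go through bytes and the 'unicode_escape' codec. This is a sketch only; the helper name unescape is hypothetical, and it assumes the input contains nothing outside Latin-1:

```python
def unescape(raw):
    """Decode backslash escapes in a str, roughly like 'string-escape'."""
    # Assumption: raw holds only Latin-1 characters; anything else would
    # raise UnicodeEncodeError on the encode step.
    return raw.encode("latin-1").decode("unicode_escape")
```

As with the Python 2 codec, this decodes every escape it recognizes (\t, \n and so on), not just \'.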

STeVe

Paul McGuire

May 26, 2005, 2:56:53 AM
Ah, this is much better than my crude replace technique. I forgot
about str.decode().

Thanks!
-- Paul

g...@ll.mit.edu

May 27, 2005, 3:21:36 PM
PyParsing rocks! Here's what I ended up with:

def unpack_sql_array(s):
    import pyparsing as pp
    withquotes = pp.dblQuotedString.setParseAction(pp.removeQuotes)
    withoutquotes = pp.CharsNotIn('",')
    parser = pp.StringStart() + \
             pp.Word('{').suppress() + \
             pp.delimitedList(withquotes ^ withoutquotes) + \
             pp.Word('}').suppress() + \
             pp.StringEnd()
    return parser.parseString(s).asList()

unpack_sql_array('{the,dog\'s,"foo,"}')
['the', "dog's", 'foo,']

[[Yes, this input is not what I stated originally. Someday, when I
reach a higher plane of existence, I will post a *complete* and
*correct* query to usenet...]]

Does the above seem fragile or questionable in any way?
Thanks all for your comments!

-- George

Paul McGuire

May 28, 2005, 1:29:17 AM
George -

Thanks for your enthusiastic endorsement!

Here are some quibbles about your pyparsing grammar (but really, not
bad for a first timer):
1. The Word class is used to define "words" or collective groups of
characters, by specifying what sets of characters are valid as leading
and/or body chars, as in:
    integer = Word(digitsFrom0to9)
    firstName = Word(upcaseAlphas, lowcaseAlphas)
In your parser, I think you want the Literal class instead, to match
the literal string '{'.

2. I don't think there is any chance to confuse a withQuotes with a
withoutQuotes, so you can try using the "match first" operator '|',
rather than the greedy matching "match longest" operator '^'.

3. Lastly, don't be too quick to use asList() to convert parse results
into lists - parse results already have most of the list accessors
people would need to access the returned matched tokens. asList() just
cleans up the output a bit.

Good luck, and thanks for trying pyparsing!
-- Paul

Magnus Lycka

May 30, 2005, 5:23:07 AM
g...@ll.mit.edu wrote:
> I have a string like:
> {'the','dog\'s','bite'}
> or maybe:
> {'the'}
> or sometimes:
> {}
...
> I want to end up with a python array of strings like:
>
> ['the', "dog's", 'bite']

Assuming that you trust the input, you could always use eval,
but since it seems fairly easy to solve anyway, that might
not be the best (at least not safest) solution.

>>> strings = [r'''{'the','dog\'s','bite'}''', '''{'the'}''', '''{}''']
>>> for s in strings:
...     print eval('['+s[1:-1]+']')
...
['the', "dog's", 'bite']
['the']
[]
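
If the input cannot be fully trusted, ast.literal_eval from the standard library is a commonly used stand-in for eval: it accepts only Python literals and raises an exception on anything else. A sketch of the same trick in Python 3 syntax (the function name is made up, and it works only because the field's quoting happens to match Python's string literal syntax):

```python
import ast

def parse_with_literal_eval(s):
    """Rewrite "{'a','b'}" as "['a','b']" and evaluate it as a literal."""
    # Swap the braces for brackets so the text reads as a Python list.
    return ast.literal_eval("[" + s[1:-1] + "]")
```

A string such as "{__import__('os')}" is rejected with ValueError instead of being executed.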
