grammar supporting ambiguity


raghavan

Apr 10, 2014, 1:24:52 PM
to modgr...@googlegroups.com
I have a simple script to detect first and last names (see below). It works if the input string exactly matches the defined grammar (e.g. "John Smith"). However, if I input "John W. Smith", I get: modgrammar.ParseError: [line 1, column 7] Expected end of input: Found '. Smith'


My question is: how can I specify the grammar so that it parses strings that don't exactly match it? Ideally, with the grammar defined below, "John W. Smith" would give me FirstName = 'John' and LastName = 'Smith', ignoring 'W.', without my having to specify a grammar for middle names/initials.

Thanks.

---------------------------------

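from modgrammar import *

# Module-level setting: whitespace between grammar terms is allowed but not required.
grammar_whitespace_mode = 'optional'

class FirstName (Grammar):
    # One uppercase letter followed by zero or more lowercase letters
    grammar = (WORD("A-Z", "a-z"))

class LastName (Grammar):
    grammar = (WORD("A-Z", "a-z"))

class MyGrammar (Grammar):
    grammar = (FirstName, OPTIONAL(LastName))

myparser = MyGrammar.parser()

# Raises modgrammar.ParseError: '. Smith' is left over after 'John W'
result = myparser.parse_string("John W. Smith")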

----------------------------------------------------------------------

Alex Stewart

Apr 10, 2014, 6:07:18 PM
to modgr...@googlegroups.com
There are a few different ways to do that. You want to be careful, though, because you probably don't want to be quite as flexible as you think you do.

If you want to match anything at all between a FirstName and a LastName, then you could do something like:

grammar = (FirstName, ZERO_OR_MORE(ANY), LastName)

Be careful with this, though: you currently have grammar_whitespace_mode = 'optional', which means spaces aren't required between the terms. ZERO_OR_MORE(ANY) could therefore potentially match everything up to the last character, leaving LastName to match only that last character, and that would be considered a valid match.  For this sort of thing you'd probably want to set grammar_whitespace_mode = 'required' instead.

Unfortunately, even that won't work if you want the LastName to be optional, because if you do:

grammar = (FirstName, ZERO_OR_MORE(ANY), OPTIONAL(LastName)) # Don't do this

The problem is that ZERO_OR_MORE(ANY) will always match everything after the FirstName to the end of the text, which leaves nothing for LastName to match. Since LastName is declared optional, that's OK as far as the parser is concerned, so it considers that a valid result and returns it as the match.
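
The same trap can be reproduced with Python's standard re module. This is only a rough analogy, not modgrammar itself: the pattern names and the regex stand-ins for WORD, ZERO_OR_MORE(ANY), and OPTIONAL are assumptions made for illustration.

```python
import re

# Regex stand-ins, assumed for illustration only (this is not modgrammar):
#   NAME  ~ WORD("A-Z", "a-z")
#   .*    ~ ZERO_OR_MORE(ANY), greedy
#   (..)? ~ OPTIONAL(...)
NAME = r"[A-Z][a-z]*"
GREEDY_OPTIONAL = re.compile(
    rf"(?P<first>{NAME})\s*(?P<rest>.*)(?P<last>{NAME})?$"
)

match = GREEDY_OPTIONAL.match("John W. Smith")
first, rest, last = match.group("first", "rest", "last")
print(first)  # John
print(rest)   # W. Smith
print(last)   # None
```

The greedy catch-all swallows "W. Smith", and because the trailing name is optional the overall match still succeeds with no last name captured.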

If you want to do that sort of thing, what you probably need to do instead is split up the expression into its different possible forms, and then make sure the most explicit match possibility (the one that includes both FirstName and LastName) is always tried first, like so:

grammar = (G(FirstName, ZERO_OR_MORE(ANY), LastName) | FirstName)

Frankly, you probably don't want to match absolutely any string of any characters anyway, though.  I mean, if somehow you ended up with an input string of "John said, 'Hello world!' (while holding a banana) <-- check this", should that come back with a successful match of FirstName = "John", LastName = "this", or should it more accurately indicate a parse error instead?

So I'd suggest doing something more like:

grammar = (G(FirstName, OPTIONAL(WORD('A-Za-z.')), LastName) | FirstName)

(Or, if you want to be a bit more flexible, maybe even ZERO_OR_MORE instead of OPTIONAL.  Of course, you could also just define MiddleName as WORD('A-Za-z.') and do ZERO_OR_MORE(MiddleName), which would make the intent clearer, and if for some reason you did want to extract that info at a later point, it would already be there ready to be pulled out.  I know you said you didn't want to do that, but if you've come this far already...)
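
As a rough illustration of why the stricter form is safer, here is a sketch using Python's standard re module rather than modgrammar. The names parse_name, FULL, and FIRST_ONLY are hypothetical, and the patterns are only approximate stand-ins for WORD("A-Z", "a-z") and WORD('A-Za-z.'); the "most specific form first, then fall back" ordering mirrors the G(...) | FirstName alternation.

```python
import re

NAME = r"[A-Z][a-z]*"    # assumed stand-in for WORD("A-Z", "a-z")
MIDDLE = r"[A-Za-z.]+"   # assumed stand-in for WORD('A-Za-z.')

# Most specific form: first name, zero or more middle tokens, last name.
FULL = re.compile(rf"(?P<first>{NAME})\s+(?:{MIDDLE}\s+)*(?P<last>{NAME})$")
# Fallback form: a first name on its own.
FIRST_ONLY = re.compile(rf"(?P<first>{NAME})$")


def parse_name(text):
    """Return (first, last) or None; the fuller form is always tried first."""
    m = FULL.match(text) or FIRST_ONLY.match(text)
    if m is None:
        return None
    groups = m.groupdict()
    return groups["first"], groups.get("last")


print(parse_name("John W. Smith"))  # ('John', 'Smith')
print(parse_name("John"))           # ('John', None)
```

With the middle part restricted to letters and periods, an input such as "John said, 'Hello world!'" is rejected instead of being forced into a first/last pair.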

Hope this helps,
--Alex



raghavan

Apr 10, 2014, 6:20:52 PM
to modgr...@googlegroups.com
Thanks Alex! Yes, agreed. I'll probably be better off defining a grammar for middle names. What's the syntax for matching a single letter as opposed to a word? For example, for a middle name, I'd like to match either a full word or a single letter followed by a period ('.').

Alex Stewart

Apr 10, 2014, 6:58:39 PM
to modgr...@googlegroups.com
What you probably want for that is WORD("A-Z", count=1) (or alternatively WORD("A-Z", min=1, max=1), which means the same thing).

--Alex

raghavan

Apr 11, 2014, 2:29:44 PM
to modgr...@googlegroups.com
Yes, and what's the syntax for something like 'W.', where a period follows a single character?

Thanks.

Alex Stewart

Apr 11, 2014, 3:18:54 PM
to modgr...@googlegroups.com
That's the same as any syntax for one thing followed by another:

(WORD("A-Z", count=1), ".")

So, for example:

class MiddleInitial (Grammar):
    grammar = (WORD("A-Z", count=1), ".")

class MiddleName (Grammar):
    grammar = (WORD("A-Z", "a-z"))

class MyGrammar (Grammar):
    grammar = (G(FirstName, ZERO_OR_MORE(MiddleName | MiddleInitial), LastName) | FirstName)
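
To see what this shape accepts without running modgrammar, here is a plain-Python sketch of the same structure: a first name, zero or more middle names or initials, then a last name, with a bare first name as the fallback. It is a token-based approximation, not the library's parser; split_name, WORD_TOKEN, and INITIAL_TOKEN are hypothetical names, and splitting on whitespace is an assumption.

```python
import re

WORD_TOKEN = re.compile(r"[A-Z][a-z]*")  # assumed stand-in for WORD("A-Z", "a-z")
INITIAL_TOKEN = re.compile(r"[A-Z]\.")   # assumed stand-in for MiddleInitial


def split_name(text):
    """Return (first, middles, last), or None if a token is not name-like."""
    tokens = text.split()
    if not tokens or not WORD_TOKEN.fullmatch(tokens[0]):
        return None
    if len(tokens) == 1:
        # Fallback alternative: a first name on its own.
        return tokens[0], [], None
    *middles, last = tokens[1:]
    if not WORD_TOKEN.fullmatch(last):
        return None
    for token in middles:
        # Each middle token must be a full word or a single initial plus period.
        if not (WORD_TOKEN.fullmatch(token) or INITIAL_TOKEN.fullmatch(token)):
            return None
    return tokens[0], middles, last


print(split_name("John W. Smith"))         # ('John', ['W.'], 'Smith')
print(split_name("John Quincy A. Smith"))  # ('John', ['Quincy', 'A.'], 'Smith')
```

Keeping the middle tokens in their own list reflects the earlier point that the information is there to pull out later if it is ever needed.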

--Alex

raghavan

Apr 11, 2014, 3:31:14 PM
to modgr...@googlegroups.com
Thanks Alex. You've been very helpful.