Replacing chronic's brittle regex with a treetop grammar

James Cox

Oct 27, 2011, 8:16:35 AM
to treet...@googlegroups.com
Hey,

I'm trying to come up with a PEG for natural language time parsing. Chronic's matcher is old, doesn't work well, and is very hard to extend or fix. So I'm experimenting with a Treetop grammar, hoping I can identify the parts of a date/time string, attach semantic value to them in a postprocessing step, and then calculate a Time/Date object.

so, given a string like:

"the sunday before last", "4 fridays hence" or "jan 1st last year"

I'd like to parse these and convert to meaningful data.

'the sunday before last' translates to sunday, past, 2
'4 fridays hence' translates to friday, future, 4
'jan 1st last year' translates to jan, 1, past, year-1

(or something like that :))
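For the final "calculate into a Time/Date object" step, a minimal sketch might look like the following. It assumes the (weekday, direction, count) reading of the examples above; `resolve_weekday` and `WEEKDAYS` are hypothetical names for illustration, not part of Chronic or Treetop.

```ruby
require 'date'

# Weekday symbols in the order used by Date#wday (0 = Sunday).
WEEKDAYS = %i[sunday monday tuesday wednesday thursday friday saturday].freeze

# Hypothetical helper: returns the count-th occurrence of `weekday`
# strictly before (:past) or strictly after (:future) the date `from`.
def resolve_weekday(weekday, direction, count, from:)
  target = WEEKDAYS.index(weekday)
  raise ArgumentError, "unknown weekday: #{weekday}" if target.nil?

  step = { past: -1, future: 1 }.fetch(direction)
  date = from
  count.times do
    date += step                          # always move at least one day
    date += step until date.wday == target
  end
  date
end

today = Date.new(2011, 10, 27)            # a Thursday
puts resolve_weekday(:sunday, :past, 2, from: today)    # "the sunday before last"
puts resolve_weekday(:friday, :future, 4, from: today)  # "4 fridays hence"
```

Whether "the sunday before last" should really count two Sundays back is exactly the kind of decision the postprocessing layer has to make; the grammar only needs to hand over the pieces.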

I've started building my grammar, but I'm struggling to get my head
around how to approach it, so I'd appreciate any feedback on the best
kind of structure.

E.g. I'm not sure that PEG is completely right, and since there are no
tokens separating content (other than a space) it's been tricky to
figure out how to approach it.

So... if anyone is willing to share some pointers and discuss an
approach for this, I'd very much appreciate it!

Thanks,

James Cox

markus

Oct 27, 2011, 2:34:51 PM
to treet...@googlegroups.com
If I were doing something like this I would probably proceed on two
fronts:

1. Start building a corpus of examples / test cases

Tuesday
Tuesday after next
last Tuesday
etc.

2. Start building primitives and base patterns (pseudo grammar here):

weekday -> "Monday" / "Tuesday" / ...
next -> "next" / ("this" sp "coming") / ...
rel_seq -> (before / after)? (next / prior)

Then I'd tie them together with a forgiving top rule that tries to match
things and skips ahead (probably to the next non-alphanumeric) if it
can't. After that I'd start running the examples through the grammar,
fixing and adding as I went.

And I'd keep going until the work of extending coverage to the
additional cases wasn't worth the effort.
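A rough sketch of that loop in Ruby, to make the idea concrete. The `[weekday, direction, count]` tuple layout, the expected value for "last tuesday", and the names `CORPUS`, `failing_phrases`, and `toy_parser` are all assumptions for illustration; the lambda is only a stand-in for whatever the grammar plus postprocessing eventually returns.

```ruby
# Corpus of phrases and the interpretation each one should produce.
CORPUS = {
  'the sunday before last' => [:sunday, :past, 2],
  '4 fridays hence'        => [:friday, :future, 4],
  'last tuesday'           => [:tuesday, :past, 1]
}.freeze

# Runs every corpus entry through `parser` (anything responding to #call)
# and returns the phrases whose result differs from the expectation.
def failing_phrases(corpus, parser)
  corpus.reject { |phrase, expected| parser.call(phrase) == expected }.keys
end

# Stand-in parser (hypothetical): only understands "last <weekday>" so far.
toy_parser = lambda do |phrase|
  match = phrase.match(/\Alast (\w+day)\z/)
  match && [match[1].to_sym, :past, 1]
end

failures = failing_phrases(CORPUS, toy_parser)
puts "#{CORPUS.size - failures.size}/#{CORPUS.size} phrases parsed"
failures.each { |phrase| puts "  still failing: #{phrase}" }
```

Each pass through the list shows which phrase to tackle next, and the passing count gives a rough sense of when further coverage stops being worth the effort.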

There's no right or final answer in cases like this, and if you try to
design it all up front you'll never finish. Better to dig in and
iterate. IMHO.

-- M


James Cox

Oct 27, 2011, 2:41:15 PM
to treet...@googlegroups.com
That's really helpful, and in fact it's the way I'm doing it. I'm
starting with chronic's test corpus, and have been working on defining
the primitives and patterns.

I guess my main concern is trying to figure out how to build the super
forgiving top rule, as it's not clear how to define the boundaries,
and so I either end up with stack recursion exceptions or it won't
parse at all.

:(


markus

Oct 27, 2011, 6:11:25 PM
to treet...@googlegroups.com
J --

> I guess my main concern is trying to figure out how to build the super
> forgiving top rule, as it's not clear how to define the boundaries,
> and so I either end up with stack recursion exceptions or it won't
> parse at all.

You might try something like:

rule sloppy_match
  skip_punctuation (
    picky_match sloppy_match /
    skip_text sloppy_match /
    ''
  )
end

rule skip_text
  [a-zA-Z0-9]+ skip_punctuation
end

rule skip_punctuation
  [,; ]*
end

where picky_match is your prototype for the eventual top rule. In other
words, try to match at the start, but if that fails, move ahead a bit
and try again; in either case, try applying picky_match at every place
it might work (possibly matching multiple times in each test string).
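To make the control flow easier to see, here is a hand-written Ruby analog using StringScanner from the standard library. It is not Treetop and not generated from the grammar above; the `PICKY` pattern is a hypothetical stand-in for picky_match that only knows bare weekday names, and `sloppy_scan` is an illustrative name.

```ruby
require 'strscan'

# Stand-in for picky_match (hypothetical): recognizes a bare weekday name.
PICKY = /(?:mon|tues|wednes|thurs|fri|satur|sun)day\b/i
PUNCTUATION = /[,; ]*/   # mirrors skip_punctuation
WORD = /[a-zA-Z0-9]+/    # mirrors the text part of skip_text

# Walks the input the way sloppy_match does: skip punctuation, try the
# picky pattern, otherwise skip one word and try again.
def sloppy_scan(input)
  scanner = StringScanner.new(input)
  matches = []
  until scanner.eos?
    scanner.skip(PUNCTUATION)
    if (hit = scanner.scan(PICKY))
      matches << hit
    elsif !scanner.skip(WORD)
      break # nothing matchable or skippable here, much like the '' branch
    end
  end
  matches
end

p sloppy_scan('see you tuesday, or maybe friday')
```

Because every iteration either consumes input or stops, the loop cannot recurse forever, which is the property the top rule needs as well: picky_match and skip_text must each consume at least one character when they succeed.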

-- M

