Deepest match metadata

malthe

May 27, 2010, 8:55:18 AM
to lepl
Consider the following matcher:

Literal("A") & Literal("1")

I'd parse, say, the following two strings, in two separate runs:

"A" and "A2"

In either case, I'd like to annotate the matchers above such that I'd
get the error feedback:

Expected <some annotated name that pertains to the second matcher>,
got nothing.
Expected <some annotated name that pertains to the second matcher>,
got "2".

Right now I seem to get some stream object back with values -1, -1,
None (or similar). Not very informative in my case.

I do realize that it gets more complicated when we take the OR
operator into account (i.e. it would expect one of a set of matchers),
but that might be solvable.
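
The feedback described above can be sketched in plain Python. This is a toy illustration, not Lepl's API: `literal`, `seq`, `alt` and `explain` are made-up names. Each matcher carries a label, and a shared `state` dict remembers the deepest position at which any matcher failed, together with the labels that were expected there. Alternation simply collects more than one label at the same position.

```python
# Toy matchers for illustration only; none of these names come from Lepl.

def literal(text, name):
    """Matcher that expects `text` at `pos`; `name` is the annotation."""
    def match(stream, pos, state):
        if stream.startswith(text, pos):
            return pos + len(text)
        # Remember the expectation if this is the deepest failure so far.
        if pos > state["pos"]:
            state["pos"], state["expected"] = pos, [name]
        elif pos == state["pos"]:
            state["expected"].append(name)
        return None
    return match

def seq(*matchers):
    """Run matchers one after another; fail as soon as one fails."""
    def match(stream, pos, state):
        for m in matchers:
            pos = m(stream, pos, state)
            if pos is None:
                return None
        return pos
    return match

def alt(*matchers):
    """Return the first alternative that succeeds (no backtracking here)."""
    def match(stream, pos, state):
        for m in matchers:
            end = m(stream, pos, state)
            if end is not None:
                return end
        return None
    return match

def explain(matcher, stream):
    """Parse `stream` and describe the deepest failure, if any."""
    state = {"pos": -1, "expected": []}
    end = matcher(stream, 0, state)
    if end == len(stream):
        return "ok"
    if end is not None:
        # Only a prefix matched: the leftover input is the problem.
        return 'Expected end of input, got "%s".' % stream[end]
    pos = state["pos"]
    got = '"%s"' % stream[pos] if pos < len(stream) else "nothing"
    return "Expected %s, got %s." % (" or ".join(state["expected"]), got)

grammar = seq(literal("A", "letter A"), literal("1", "digit one"))
either = seq(literal("A", "letter A"),
             alt(literal("1", "digit one"), literal("x", "letter x")))

print(explain(grammar, "A"))   # Expected digit one, got nothing.
print(explain(grammar, "A2"))  # Expected digit one, got "2".
print(explain(either, "A2"))   # Expected digit one or letter x, got "2".
```

A real parser with backtracking would need to decide which failures count, but the deepest position is usually a reasonable first guess.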

I'm also not sure exactly how the use of tokens would fit in here, if
at all.

Any help greatly appreciated; I'm quite happy with what I've seen in
Lepl so far and it'd be great if it can support this use-case.

Thanks,

\malthe

andrew cooke

May 27, 2010, 9:24:45 AM
to lepl

I think you are looking for something like:

Literal('A') & (('1' > MyResult) | (Any() ** MyErrorClass))

where MyErrorClass takes the keyword arguments described at
http://www.acooke.org/lepl/matchers.html#index-79

I'm not sure that exactly does what you want, but it might help you
find the appropriate solution.

My experience of this approach is that it's rather tricky to get
right, because the error case might be accidentally triggered by
backtracking. In theory First (or "%") should help with this, but I
haven't used that for a long time (at some point I will check / add
the test for that).

In other words, this may be more reliable (but is so obscure I am not
100% sure it will work!):

Literal('A') & (('1' > MyResult) % (Any() ** MyErrorClass))
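
The difference between the two operators can be illustrated with a small generator-based sketch. It is written in plain Python and does not use Lepl; `literal`, `error_any`, `seq`, `or_`, `first` and `parses` are hypothetical names, and `first` only approximates what Lepl's First is described as doing (commit to the first alternative that succeeds). With `or_`, asking for further parses of "A1" reaches the error branch, because the catch-all also matches "1"; with `first` it does not.

```python
# Toy backtracking matchers for illustration; not Lepl's implementation.
# A matcher is a function (stream, pos) -> iterator of (result, new_pos).

def literal(text):
    def match(stream, pos):
        if stream.startswith(text, pos):
            yield text, pos + len(text)
    return match

def error_any():
    """Catch-all: accept any single character and tag it as an error."""
    def match(stream, pos):
        if pos < len(stream):
            yield ("error", stream[pos]), pos + 1
    return match

def seq(*matchers):
    """Sequence with full backtracking over every matcher's results."""
    def match(stream, pos):
        if not matchers:
            yield [], pos
            return
        head, rest = matchers[0], seq(*matchers[1:])
        for r1, p1 in head(stream, pos):
            for r2, p2 in rest(stream, p1):
                yield [r1] + r2, p2
    return match

def or_(*matchers):
    """Try every alternative; later ones are reached on backtracking."""
    def match(stream, pos):
        for m in matchers:
            yield from m(stream, pos)
    return match

def first(*matchers):
    """Commit to the first alternative that produces any result."""
    def match(stream, pos):
        for m in matchers:
            results = list(m(stream, pos))
            if results:
                yield from results
                return
    return match

def parses(matcher, stream):
    """All complete parses of `stream`, in the order they are found."""
    return [r for r, p in matcher(stream, 0) if p == len(stream)]

with_or = seq(literal("A"), or_(literal("1"), error_any()))
with_first = seq(literal("A"), first(literal("1"), error_any()))

print(parses(with_or, "A1"))     # [['A', '1'], ['A', ('error', '1')]]
print(parses(with_first, "A1"))  # [['A', '1']]
print(parses(with_first, "A2"))  # [['A', ('error', '2')]]
```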

Andrew

Malthe Borch

May 27, 2010, 9:41:29 AM
to le...@googlegroups.com
On 27 May 2010 16:24, andrew cooke <and...@acooke.org> wrote:
>
> I think you are looking for something like:
>
>  Literal('A') & (('1' > MyResult) | (Any() ** MyErrorClass))

After I'd posted I took an even closer look at the calculator example
and started experimenting along the same lines.

I find that I don't actually need to pipe preceding tokens to a
``MyResult`` function; if I use the built-in ``make_error`` then it's
just part of the result as an ``Error`` object.

So I think this is workable.

Would it make sense to first run a lexer on the input stream and then
create matchers for those tokens? How does that work in practice:
something like ``parse_list``? The docs are sort of unclear about what
you do after you've done the lexing part.

Meanwhile I'll look into your % operator.

\malthe

andrew cooke

May 27, 2010, 4:54:58 PM
to lepl

Did the other email answer the lexing question here?

I find it better not to think of lexing as being a separate phase.
Instead, think of Token() as being just like any other matcher, but
with the restriction that (1) it must be reduced internally to a
regexp and (2) matching is greedy so backtracking is limited.

In other words, lexing is just an optimisation (that happens to
simplify handling of spaces). It's largely invisible in the grammar
and you don't really need to know how it's implemented. I hope.

Andrew

Malthe Borch

May 28, 2010, 1:10:02 AM
to le...@googlegroups.com
I find that things work quite predictably with the regular matchers,
but the lexer is a challenge to me:

Both of these fail:

from lepl import *

matcher1 = Token(Word())[1:] | Token(Any())[1:]
matcher2 = Token(Word())[1:] % Token(Any())[1:]
print(matcher1.parse("John Smith"))
print(matcher2.parse("John Smith"))

While this succeeds:

matcher3 = Token(Word())[1:]
print(matcher3.parse("John Smith"))

I don't understand why that is.

\malthe

--
Malthe Borch
Technical Advisor
UNICEF Uganda
+256 (0) 703 945 965

andrew cooke

May 28, 2010, 9:15:34 AM
to le...@googlegroups.com

Any() will match " ". So "John Smith" is tokenized as "John" " " "Smith" and
neither matcher1 nor matcher2 will match that.

In the case that works, no token matches " ", but it is a space, so it is
discarded.

Instead of Any() you might want Any(string.ascii_letters), for example.
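
The token streams involved can be imitated with the standard `re` module. This is only an emulation of the behaviour described above, not Lepl's lexer: `tokenize` is a hypothetical helper, the patterns assume that Word() behaves roughly like `\S+` and Any() like a single arbitrary character, and a character that no token matches is dropped only if it is whitespace.

```python
import re
import string

def tokenize(text, token_patterns, discard=r"\s"):
    """Greedy tokenizer sketch: at each position take the longest match
    among the token patterns; if none matches, drop one discardable
    character (whitespace by default)."""
    compiled = [(name, re.compile(p)) for name, p in token_patterns]
    skip = re.compile(discard)
    pos, out = 0, []
    while pos < len(text):
        best = None
        for name, rx in compiled:
            m = rx.match(text, pos)
            if m and m.end() > pos and (best is None or m.end() > best[1].end()):
                best = (name, m)
        if best:
            out.append((best[0], best[1].group()))
            pos = best[1].end()
        elif skip.match(text, pos):
            pos += 1  # no token wants this character, and it is a space
        else:
            raise ValueError("cannot tokenize at position %d" % pos)
    return out

WORD = ("word", r"\S+")                             # assumed stand-in for Word()
ANY = ("any", r".")                                 # assumed stand-in for Any()
LETTER = ("letter", "[%s]" % string.ascii_letters)  # Any(string.ascii_letters)

# With Any() in the grammar the space becomes a token of its own.
print(tokenize("John Smith", [WORD, ANY]))
# [('word', 'John'), ('any', ' '), ('word', 'Smith')]

# With only Word(), nothing matches the space, so it is discarded.
print(tokenize("John Smith", [WORD]))
# [('word', 'John'), ('word', 'Smith')]

# Restricting Any() to letters leaves the space unmatched again.
print(tokenize("John Smith", [WORD, LETTER]))
# [('word', 'John'), ('word', 'Smith')]
```

In the first stream the three tokens are not all of one kind, which is consistent with both alternatives of matcher1 and matcher2 failing.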

Andrew
