[LEPL] Lookahead and tokens


jazg

May 15, 2010, 11:42:00 PM
to lepl
I'd like to do this:

~Lookahead(x) & m

where "m" is a matcher made up of tokens. This isn't allowed though.
Is there an easy way to get the same effect?

The only thing I can think of is having special versions of each token
like (token(~Lookahead(x) & Any()[:])) and a modified "m" that uses
the specialized versions of the tokens, but that sounds tedious and
confusing.

--
You received this message because you are subscribed to the Google Groups "lepl" group.
To post to this group, send email to le...@googlegroups.com.
To unsubscribe from this group, send email to lepl+uns...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/lepl?hl=en.

andrew cooke

May 16, 2010, 7:55:17 AM
to le...@googlegroups.com

Nope, not with tokens. The tokenizer is currently way too simple to support
lookahead.

I am working on better regexp support, and that will eventually allow this (I
think), but it's a long, long way from being ready.

If you give more details of why you want to do this I may be able to suggest a
workaround, but without more details I can't think of anything apart from
using the lookahead inside the parser.

Also, remember that backtracking doesn't work for tokens, so putting Lookahead
inside the token won't work - you will make it fail, but then have no
alternative.

In short, the tokenizer can only be used where the grammar is simple enough for
it to work. That's why it's optional. If it can't be used then don't use it
- use something like DroppedSpace() instead (it won't make much difference to
the complexity of the grammar).

Sorry,
Andrew

jazg

May 16, 2010, 2:22:26 PM
to lepl
Well, here is one reason I wanted to use tokens instead of separators.

For example, "while x" in python is a loop, but "whilex" is not parsed
as a loop because it's a valid variable name. Whereas "while(x)" can
only be interpreted as a loop because "(" isn't allowed in the middle
of a name and "(x)" is an expression.

I can implement this with tokens:

name = Token("[a-z]+")
while_ = Token("while")
lparen = Token("\\(")
rparen = Token("\\)")
expression = Delayed()
loop = (~while_ & expression) > "loop"
expression += (name | (~lparen & expression & ~rparen)) > "exp"

In this case everything works: "whilex" is parsed as a single
expression, and both "while x" and "while(x)" are parsed as loops.
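
The longest-match behaviour described here can be sketched with the standard library alone. This is not LEPL's lexer: the TOKEN_PATTERNS table, the tokenize() helper, and the rule that an earlier entry wins a tie are all assumptions made for illustration.

```python
import re

# Hypothetical token table for illustration; not LEPL's lexer.
TOKEN_PATTERNS = [
    ("while", re.compile(r"while")),
    ("name", re.compile(r"[a-z]+")),
    ("lparen", re.compile(r"\(")),
    ("rparen", re.compile(r"\)")),
]


def tokenize(text):
    """Split text into (kind, value) pairs, always taking the longest match."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1  # whitespace only separates tokens
            continue
        best = None
        for kind, pattern in TOKEN_PATTERNS:
            m = pattern.match(text, pos)
            # A strictly longer match wins; on a tie the earlier entry is kept
            # (an assumption of this sketch).
            if m and (best is None or m.end() > best[1].end()):
                best = (kind, m)
        if best is None:
            raise ValueError(f"no token matches at position {pos}")
        tokens.append((best[0], best[1].group()))
        pos = best[1].end()
    return tokens
```

With these rules "whilex" comes out as a single name token, while "while x" and "while(x)" both begin with a while token.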

If I try the same thing with regular matchers and DroppedSpace, I end
up with "whilex" being considered a loop instead of a name. I can't
figure out a nice way to get the same results as I do with tokens. Is
there something simple I'm overlooking?

andrew cooke

May 16, 2010, 5:22:03 PM
to le...@googlegroups.com

1 - I really meant, an example of why you needed lookahead in tokens.

2 - I don't think you will have the problem you described, though, with not
using tokens.

For example (this is untested, so please try to ignore stupid mistakes):

from lepl import *

name = Word()
with DroppedSpace():
    # "while" is a reserved word in Python, so the matcher is named while_
    while_ = "while" & name & ":"
    assignment = name & "=" & name
    statement = while_ | assignment

Now consider statement.parse("whilex = foo")

First, "while" & name will match. But then ":" will fail. So then the parser
will try assignment instead.

3 - If you want to force spaces, use:

with Separator(Drop(Space()[1:])):

(there's a couple of errors in the docs - one is that DroppedSpace isn't in
the index; the other is that it says that it matches one or more spaces when,
as you have seen, it matches zero or more)

4 - This (requiring spaces) gets complicated when you have optional values
separated by spaces (because if the optional thing is missing, you can
still end up requiring a space on either side of "nothing"). SmartSeparator1()
and SmartSeparator2() try to address this - see
http://www.acooke.org/lepl/operators.html#index-97

5 - However, getting offside parsing to work (if you are trying to parse
Python) without the lexer is going to be "interesting". If you want offside
parsing, finding a solution with tokens is important.
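
Points 3 and 4 can be illustrated with plain regular expressions. This is only a rough analogy, assuming `\s+` stands in for a mandatory separator; it is not how Separator() or SmartSeparator1()/SmartSeparator2() are implemented, and the names NAIVE and SMART are invented for the sketch.

```python
import re

# "a", an optional "b", then "c", with one or more spaces demanded at every join.
# When "b" is missing, both separators are still required.
NAIVE = re.compile(r"a\s+(?:b)?\s+c")

# Attaching the separator to the optional item lets it vanish with that item.
SMART = re.compile(r"a\s+(?:b\s+)?c")


def accepts(pattern, text):
    """Return True if the whole text matches the pattern."""
    return pattern.fullmatch(text) is not None
```

Here the naive pattern rejects "a c" (a single space) but accepts "a  c" (two spaces), which is the "space on either side of nothing" effect; the second pattern accepts "a c" and "a b c".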

Hope that helps - handling spaces is complex, and there are many different
options. Personally I would recommend (2) - simply going with zero or more
spaces and relying on the parser backtracking.
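
Option (2) can be mimicked without LEPL by ordered choice over two alternatives. The sketch below is a stand-in under stated assumptions: the parse_* helpers are hypothetical, names are restricted to identifier characters, and re.fullmatch does the backtracking that the LEPL parser would do.

```python
import re

IDENT = r"[A-Za-z_]\w*"  # assumed shape of a name, for illustration only


def parse_while(text):
    # "while", a name and ":" with zero or more spaces between the parts
    m = re.fullmatch(rf"while\s*({IDENT})\s*:", text)
    return ("while", m.group(1)) if m else None


def parse_assignment(text):
    # name "=" name, again with zero or more spaces between the parts
    m = re.fullmatch(rf"({IDENT})\s*=\s*({IDENT})", text)
    return ("assignment", m.group(1), m.group(2)) if m else None


def parse_statement(text):
    # Ordered choice: try the loop first, fall back to assignment if it fails.
    for alternative in (parse_while, parse_assignment):
        result = alternative(text)
        if result is not None:
            return result
    return None
```

For "whilex = foo" the loop alternative gets as far as "while" plus "x", fails on the missing ":", and the assignment alternative then succeeds. In this sketch "whilex:" is still accepted as a loop over "x", which is the price of allowing zero spaces.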

Andrew

andrew cooke

May 16, 2010, 5:49:37 PM
to le...@googlegroups.com

I just updated the docs on acooke.org to fix the errors mentioned below - if
you're reading http://www.acooke.org/lepl/operators.html#index-97 please hit
reload. Thanks. Andrew

jazg

May 16, 2010, 9:51:14 PM
to lepl
1 - On second thought I don't only want ~Lookahead(x) because it fails
for anything that begins with x. I want to allow that, and only fail
if it matches x alone. So here's a basic example,

token1 = Token(Lower())
token2 = Token(Lower())
m = token1 & token2[1:]

Now I also want a specialized version of m that fails specifically on
"ab" but allows any other combination of letters (including things
like "abc" or "cab").

2 - This is my example translated to not use tokens:

name = Lower()[:,...]
while_ = Literal("while")
lparen = Literal("(")
rparen = Literal(")")
with DroppedSpace():
    expression = Delayed()
    loop = (~while_ & expression) > "loop"
    expression += (name | (~lparen & expression & ~rparen)) > "exp"
    test = loop | expression

When I tried test.parse("whilex") the result is "loop", but it should
be "exp".
I changed the order of "test" to (expression | loop) and it properly
parsed "whilex" as "exp", but "while x" fails at " x". I don't
understand why.

5 - I was afraid of that... I'm not parsing python but I do want
offside parsing.




andrew cooke

May 17, 2010, 8:40:21 AM
to le...@googlegroups.com
On Sun, May 16, 2010 at 06:51:14PM -0700, jazg wrote:
> 1 - On second thought I don't only want ~Lookahead(x) because it fails
> for anything that begins with x. I want to allow that, and only fail
> if it matches x alone. So here's a basic example,
>
> token1 = Token(Lower())
> token2 = Token(Lower())
> m = token1 & token2[1:]
>
> Now I also want a specialized version of m that fails specifically on
> "ab" but allows any other combination of letters (including things
> like "abc" or "cab").

OK, so what I think you are saying, from above and the rest of your email, is
that you have some words, like "while" which are *keywords* and which cannot
be used as variable names in your language.

In that case I would do something like this:

k_while = "while"
k_if = "if"
keywords = [k_while, k_if]

t_while = Token(k_while)
t_if = Token(k_if)
t_variable = Token(Lower()[1:,...])(~Lookahead(Or(*keywords)) & Lower()[1:,...])

In this case it doesn't matter that the t_variable Token cannot backtrack,
because if someone has "while" as a word in their source, it can only be a
keyword.


> 2 - This is my example translated to not use tokens:
>
> name = Lower()[:,...]
> while_ = Literal("while")
> lparen = Literal("(")
> rparen = Literal(")")
> with DroppedSpace():
>     expression = Delayed()
>     loop = (~while_ & expression) > "loop"
>     expression += (name | (~lparen & expression & ~rparen)) > "exp"
>     test = loop | expression
>
> When I tried test.parse("whilex") the result is "loop", but it should
> be "exp".
> I changed the order of "test" to (expression | loop) and it properly
> parsed "whilex" as "exp", but "while x" fails at " x". I don't
> understand why.

My guess, from just quickly looking at that, is that you have problems because
you have

name = Lower()[:,...]

instead of

name = Lower()[1:,...]


Andrew

andrew cooke

May 17, 2010, 8:56:14 AM
to le...@googlegroups.com

Sorry, that should be:

t_variable = Token(Lower()[1:,...])(~Lookahead(Or(*keywords) & Eos())
& Lower()[1:,...])

to avoid rejecting things like "whilex"

Andrew

jazg

May 17, 2010, 5:35:25 PM
to lepl
But Eos() means the absolute end of input, right? Consider something
like this...

var = ~Lookahead("while" & Eos()) & Lower()[1:,...]
assign = var + "=1"

var.parse("while") will fail, but assign.parse("while=1") will match
because var looks ahead and sees "=".
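
The same effect shows up with an ordinary regular expression, if `\Z` is taken as a stand-in for Eos() applied to the whole input. That equivalence is an assumption of this sketch, and the pattern names are made up.

```python
import re

# Rejects "while" only when it is the entire remaining input.
VAR_EOS = re.compile(r"(?!while\Z)[a-z]+")

# Rejects "while" whenever it is not followed by another letter.
VAR_BOUNDED = re.compile(r"(?!while(?![a-z]))[a-z]+")


def leading_name(pattern, text):
    """Return the name matched at the start of text, or None."""
    m = pattern.match(text)
    return m.group() if m else None
```

With the first pattern "while" alone is rejected, but "while=1" still yields "while" as a name, because the lookahead sees "=" rather than the end of input. The second pattern rejects both and still accepts "whilex".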

jazg

May 17, 2010, 5:45:17 PM
to lepl
The problem is I can only use that with a single token. What if I want
to apply the lookahead to an entire matcher like (token1 & (token2 |
token3)[0:])?

andrew cooke

May 17, 2010, 5:47:13 PM
to le...@googlegroups.com

I had

t_variable = Token(Lower()[1:,...])(~Lookahead(Or(*keywords) & Eos())
& Lower()[1:,...])

Split that in two:

BaseToken = Token(Lower()[1:,...])
t_variable = BaseToken(~Lookahead(Or(*keywords) & Eos())
& Lower()[1:,...])

there are two different stages here.

In first stage, the input is split into tokens. Only the matchers passed to
Token() are used.

In the second stage, matchers *within* tokens are matched. You can think of
the argument passed to BaseToken() as matching just the text of the token. So
there, Eos() matches the end of the token (see the manual for the lexer - this
is what I call "specialised tokens" there).

For example:

t = Token(Lower()[1:,...])(Any()[2,...] & Eos())
t[:].parse("ab cd")
['ab', 'cd']
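
The two stages can be imitated with the re module. This is a sketch of the idea only, not LEPL's lexer: BASE_TOKEN, VARIABLE, split_tokens() and is_variable() are hypothetical names, and `\Z` plays the role of Eos() at the end of a single token's text.

```python
import re

KEYWORDS = ("while", "if")

# Stage 1: only this pattern decides where tokens begin and end.
BASE_TOKEN = re.compile(r"[a-z]+")

# Stage 2: applied to the text of one token, so \Z means "end of that token".
VARIABLE = re.compile(r"(?!(?:%s)\Z)[a-z]+\Z" % "|".join(KEYWORDS))


def split_tokens(text):
    """Stage 1: cut the input into lowercase words, skipping everything else."""
    return BASE_TOKEN.findall(text)


def is_variable(token):
    """Stage 2: accept a token as a variable unless it is exactly a keyword."""
    return VARIABLE.match(token) is not None


def variables(text):
    """Run both stages and keep only the tokens that qualify as variables."""
    return [tok for tok in split_tokens(text) if is_variable(tok)]
```

Because the second stage only ever sees one token at a time, "whilex" passes while "while" and "if" are rejected, whatever follows them in the input.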

Andrew

andrew cooke

May 17, 2010, 5:49:57 PM
to le...@googlegroups.com

You want
(token1 & (token2 | token3)[0:])
to not match "while"?

Tokens are longest match, so if you have a token that matches all of "while",
it won't be possible for smaller tokens to match part of it.

You can't prevent "wh ile" from matching the above (token1 as "wh" and token2
as "ile", for example), but why would you want to?

Maybe I am not understanding?

Andrew

jazg

May 17, 2010, 7:06:13 PM
to lepl
It's more like this:

a = Token(Any("abcd"))
b = Token(Any("efgh"))
both = a + b

Now I want "both" to normally match each combination, but also have a
special case where "ae" is disallowed.

But now that you reminded me that this would also match a and b with a
space in between, I think I'm doing this completely wrong. I have
probably made a lot of stupid mistakes because I'm converting code
that originally didn't use tokens, instead of designing it with tokens
from the start.

Looking at everything you posted, I think I have a good solution now:

a = Any("abcd")
b = Any("efgh")
t_a = Token(a)
t_b = Token(b)
t_both = Token(a + b)
both_no_ae = t_both(~Lookahead("ae" & Eos()) & Any()[:,...])

I will try to apply this to my real code and if I still have problems
I will show a more specific example of what I'm doing.

andrew cooke

May 17, 2010, 7:31:32 PM
to le...@googlegroups.com

Great - sounds like a breakthrough! :o)

I have found a bug in Lepl related to Token(Word()) so if you see an error
about \U.... and ascii codecs, don't try to fix it yourself! I'm working on a
fix, but also having a new bathroom installed, and tonight I need to go
buy bricks and sand and cement...

Cheers,
Andrew