using external lexer?


Ben

Oct 31, 2010, 7:00:58 AM
to lepl
Is there an easy way to plug in an external lexer or in some other way
pass a stream of already annotated tokens to the parsing process?

I've got an application where I'm trying to mix free-form and
structured content -- what I'd like to do is something like this:

raw = Token(...)
structure = Token(...)

grammar = structure(some_grammar_start_symbol)

grammar.parse(token_stream)

where token_stream is already split up into raw and structure tokens
-- for my application it seems to be much easier to tokenize with code
than with a lexer ... trying to parse the language directly (without
tokenizing) also seems hard. But I do want to use a grammar to
describe the structured content and the rules governing the mixing of
the structured and unstructured content ...

I'm hoping maybe there's some func token_by_id I could write and put
here ...?

raw = Token(token_by_id('raw'))
structure = Token(token_by_id('structure'))

Thanks for a great tool -- appreciate any advice!

Ben

Oct 31, 2010, 9:43:42 PM
to lepl
It was perhaps wrong of me to assume I needed to use the Token class
for this.

I found another similar topic in the mailing list archives:

http://groups.google.com/group/lepl/browse_thread/thread/09a1052ef592211c#

I'm trying to accomplish something similar and am trying a similar
approach -- but I can't seem to mix normal strings with opaque
objects ...

class _rawtext(object):
    def __init__(self, value):
        self.value = value

@function_matcher
def rawtext(support, stream):
    if stream and isinstance(stream[0], _rawtext):
        return ([stream[0]], stream[1:])

raw_then_literal = rawtext() & Literal('a')

raw_then_literal.parse([_rawtext('some unstructured data'), 'a'])

Gives

FullFirstMatchException: The match failed at '['a']'.

Does what I'm trying to do make sense? Is there some way to
accomplish what I want?

andrew cooke

Oct 31, 2010, 9:59:48 PM
to le...@googlegroups.com

Hi,

Sorry for not replying today - been busy. Was going to reply tomorrow
(and still will).

I had a quick look and thought what you were doing might be possible, but
it's going to take some thought - the stream types are more complex than
you might expect (because they carry location information).

Looking at what you have below, I am a little surprised it does not work.
That approach should be fine. Maybe try "parse_list" rather than just
"parse"?

But I don't really have time today, sorry. Will look at this tomorrow
(unless Comcast is even suckier than usual and absorbs the whole day just
trying to pay the bill...).

Andrew

andrew cooke

Nov 1, 2010, 6:40:29 AM
to le...@googlegroups.com

Hi,

I realised (while asleep?!) that your types are wrong here.

Usually a stream is (effectively) a string. So stream[a:b] is a subset of
a string, which is also a string.

However, in this case, your stream is a list and stream[a:b] is a sublist
of a list. So Literal() needs to match a list, not a string. So instead
of Literal('a') you need Literal(['a']).

>>> from lepl import *
>>> class _rawtext(object):
...     def __init__(self, value):
...         self.value = value
...
>>> @function_matcher
... def rawtext(support, stream):
...     if stream and isinstance(stream[0], _rawtext):
...         return ([stream[0]], stream[1:])
...
>>> raw_then_literal = rawtext() & Literal(['a'])
>>> raw_then_literal.parse([_rawtext('some unstructured data'), 'a'])
[<__main__._rawtext object at 0x7fe511864b90>, ['a']]
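
The type mismatch can be seen in plain Python, without Lepl. This is only an illustration of the slicing behaviour described above: starts_with is a hypothetical helper, not part of Lepl, and it merely assumes that a literal matcher compares a slice of the stream against the expected value.

```python
def starts_with(stream, expected):
    """Compare the leading slice of stream with expected.

    Hypothetical helper for illustration only; not a Lepl function.
    """
    return stream[:len(expected)] == expected


text_stream = 'ab'
list_stream = ['a', 'b']

# Slicing a string gives a string, so a string literal can match it.
string_slice = text_stream[0:1]

# Slicing a list gives a list, never a bare string.
list_slice = list_stream[0:1]

# A string literal against a string stream: the slice equals the literal.
string_vs_string = starts_with(text_stream, 'a')

# A string literal against a list stream: ['a'] is not equal to 'a'.
list_vs_string = starts_with(list_stream, 'a')

# A list literal against a list stream: the slice equals the literal.
list_vs_list = starts_with(list_stream, ['a'])
```

Only the first and last comparisons succeed, which matches the behaviour seen with Literal('a') and Literal(['a']) on a list input.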

Is that sufficient for what you want? Or would it be better to look again
at your original approach?

Andrew


andrew cooke

Nov 1, 2010, 10:22:15 AM
to le...@googlegroups.com

OK, this is horrendously ugly, but it will give you some idea of how to
subvert the token and stream processing.

In the example below I create a token, but disable the lexer (I need to
set compiled = True on the token because otherwise a sanity check in the
parser flags an error). Then I short-circuit the stream generation by
providing my own dummy stream. The stream is in the internal format
expected for tokens - each entry is a list of possible token IDs and then
the value.

I specialise the token to test for even, and supply an even and an odd
number. I get a partial match error because only the even value matched.

I'm sorry this isn't more elegant - at some point I should add a simple
interface that lets you do this kind of thing.


from lepl import *

if __name__ == '__main__':

    @function_matcher
    def isEven(support, stream):
        if stream[0] % 2 == 0:
            return stream[0], stream[1:]

    special = Token('Specialised')
    special.compiled = True
    even = special(isEven)

    class DummyStreamFactory(object):

        def auto(self, x):
            return x

    special.config.stream_factory(DummyStreamFactory()).no_lexer()
    print special.parse([([special.id_], 2), ([special.id_], 3)])


Looking at that I am now starting to wonder why you need tokens/lexer at
all, so you may well be right in your other approach...!

Andrew

PS I'm travelling tomorrow through Thursday, probably without my laptop,
but will reply to email when I return.

andrew cooke

Nov 1, 2010, 10:31:00 AM
to le...@googlegroups.com

Sorry, no - this will not work.

The code in my previous message has an error, in that I used "special" rather than "even"
as the parser. If I correct that then I hit a pile of errors due to
inconsistencies in stream types.

I really don't think the lexer code can do this. Hopefully you can just
use ordinary parsing on a list.

Sorry again,
Andrew



Ben

Nov 2, 2010, 12:54:15 AM
to lepl
Hi Andrew --

Thanks for your responses! That will work I think. It feels a bit
strange to represent the 'non-opaque' strings as 'lists of length 1
strings' ... I wonder if there might be performance implications doing
that? But it shouldn't matter for my use.

I spent some time looking at the Source and Stream classes -- I was
trying to figure out if it would be easy to write a custom version of
one of those which would let me pass a sequence of opaque objects and
'strings' and write my matchers like this:

opaque_text = rawtext()
balanced_text = Literal('a') & opaque_text & Literal('b')

instead of

opaque_text = rawtext()
structured_text = Literal(['a']) & opaque_text & Literal(['b'])

Thanks for all your help!


andrew cooke

Nov 2, 2010, 6:57:59 AM
to le...@googlegroups.com
On Mon, 1 Nov 2010 21:54:15 -0700 (PDT), Ben <cohe...@gmail.com> wrote:
> Hi Andrew --
>
> Thanks for your responses! That will work I think. It feels a bit
> strange to represent the 'non-opaque' strings as 'lists of length 1
> strings' ... I wonder if there might be performance implications doing
> that? But it shouldn't matter for my use.

Unfortunately, if efficiency is paramount, Lepl isn't really the right
solution.

> I spent some time looking at the Source and Stream classes -- I was
> trying to figure out if it would be easy to write a custom version of
> one of those which would let me pass a sequence of opaque objects and
> 'strings' and write my matchers like this:
>
> opaque_text = rawtext()
> balanced_text = Literal('a') & opaque_text & Literal('b')
>
> instead of
>
> opaque_text = rawtext()
> structured_text = Literal(['a']) & opaque_text & Literal(['b'])

I was wondering about that too (although the way it works now is
consistent, I think - Literal matches a sequence of values from the input
stream; normally that's a sequence of characters in a string, but here it's
a sequence of entries in a list). A simple solution would be to
automatically adapt things. I haven't tried this, but something like:

def Lift(matcher):
    def MyMatcher(text):
        return matcher([text])
    return MyMatcher

which you would use like:

MyLiteral = Lift(Literal)
structured_text = MyLiteral('a') & opaque_text & MyLiteral('b')

although this only works for matchers that take a single argument, etc
etc.
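
The wrapping idea can be tried out in plain Python before involving Lepl. This is a minimal sketch under stated assumptions: ListLiteral is a hypothetical stand-in written for this example (it is not Lepl's Literal and does not use Lepl's stream types), and a match is represented simply as a (matched, remaining) pair, with None for failure.

```python
def Lift(matcher_factory):
    """Wrap a one-argument matcher factory so that it accepts a bare
    value and passes it on as a single-item list."""
    def lifted(value):
        return matcher_factory([value])
    return lifted


def ListLiteral(expected):
    """Hypothetical stand-in for a literal matcher over list input.

    Returns a function that matches the leading entries of a list
    against expected. This is NOT Lepl's Literal.
    """
    expected = list(expected)

    def match(stream):
        n = len(expected)
        if stream[:n] == expected:
            # (matched entries, remaining stream)
            return stream[:n], stream[n:]
        return None  # no match

    return match


MyLiteral = Lift(ListLiteral)
matcher = MyLiteral('a')              # same as ListLiteral(['a'])
matched = matcher(['a', 'b'])         # leading 'a' is consumed
unmatched = matcher(['x'])            # first entry differs, so no match
```

As in the suggestion above, the wrapper only covers factories that take a single positional argument; anything with extra arguments would need its own adapter.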

Another approach would be something similar that gives a matcher which
receives a modified stream (so the returned matcher modifies the stream to
remove the extra list), but to get that right you would need to use
trampoline_matcher_factory, which is pretty much undocumented.

Andrew

Ben

Nov 4, 2010, 2:52:55 AM
to lepl

>
> Another approach would be something similar that gives a matcher which
> receives a modified stream (so the returned matcher modifies the stream to
> remove the extra list), but to get that right you would need to use
> trampoline_matcher_factory, which is pretty much undocumented.
>
> Andrew

That sounds nice -- might that be easy to do? I think I might vaguely
understand how the trampoline_matcher_factory works (from looking at
the examples in the code) -- but I have no idea how to invoke a
submatcher on input that's not already been correctly wrapped up in a
'Stream' ...