Nesting and +

wires

Nov 10, 2011, 4:19:35 PM
to lepl
Hello,

I want to parse nested comments like this:

(* aba
(* boo *)
baz
(*bar*)
foo
*)

So I made the following parser:

# nested comments
start = Literal('(*')
end = Literal('*)')
both = Or(start, end)

nested_comment = Delayed()
contents = (nested_comment | ~Lookahead(both) & Any())[:]
comment = start & contents & end > Comment
nested_comment += comment

Which parses, but...

Node
`- Comment
+- u'(*'
+- u' '
+- u'a'
+- u'b'
+- u'a'
+- u'\n'
+- u' '
+- u' '
+- u' '
+- Comment
| +- u'(*'
| +- u' '
| +- u'b'
| +- u'o'
| +- u'o'
| +- u' '
| `- u'*)'
+- u'\n'
+- u' '
+- u' '
+- u' '
+- u'b'
+- u'a'
+- u'z'
+- u'\n'
+- u' '
+- u' '
+- u' '
+- Comment
| +- u'(*'
| +- u'b'
| +- u'a'
| +- u'r'
| `- u'*)'
+- u'\n'
+- u' '
+- u' '
+- u' '
+- u'f'
+- u'o'
+- u'o'
+- u'\n'
`- u'*)'

...that is not what I want. What is the easiest way to join the text
together into a single string? Like this:

Node
`- Comment
+- u'(*'
+- u' aba\n '
+- Comment
| +- u'(*'
| +- u' boo '
| `- u'*)'
+- u'\n baz\n '
+- Comment
| +- u'(*'
| +- u'bar'
| `- u'*)'
+- u'\n foo\n'
`- u'*)'

Thanks a lot!

andrew cooke

Nov 10, 2011, 7:47:13 PM
to le...@googlegroups.com
hi,

you need to use [...] to join together the fragments, but you only want to join together the text, not the comments, so you need to break things open a little.

instead of 

    nested_comment = Delayed()
    contents = (nested_comment | ~Lookahead(both) & Any())[:]
    comment = start & contents & end > Comment
    nested_comment += comment

try something more like: 

  nested_comment = Delayed()
  word = (~Lookahead(both) & Any())[:,...]
  contents = (nested_comment | word)[:]
  comment = start & contents & end > Comment 
  nested_comment += comment 

(i haven't tried it out, but that should give you the right idea).
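
(by the way, the ',...' in word[:,...] is what actually does the joining - it tells the repetition to add its fragments together into a single string. so, roughly, something like

  Any()[:,...]

run against 'abc' should return [u'abc'] rather than [u'a', u'b', u'c'].)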

andrew

Jelle Herold

Nov 12, 2011, 5:19:11 PM
to le...@googlegroups.com
Hi Andrew,

Thanks for answering!
This is what I tried initially, but the parser then becomes very inefficient... to the point where I kill it before it's done parsing a file.

In case it is helpful, the file can be found here:

There is a test case for the nested comments: python coqproc.py tests/minimal.v

Do you see a way to do this?

Thanks,
Jelle.

andrew cooke

Nov 12, 2011, 5:41:31 PM
to le...@googlegroups.com

is it slow parsing the tiny test case, or slow parsing some larger file? if
the latter, can you show me the file that it is having problems with?

more generally, it could be that it contains an error (in which case lepl is
backtracking like crazy because it is stuck) or some problem with the grammar
that i can't see, but which might be fixed by appropriate use of First().
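
(untested, and from memory, but i mean something along these lines:

  contents = First(nested_comment, word)[:]

First commits to the first alternative that succeeds, so lepl won't keep backtracking into the other branches when it gets stuck.)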

andrew


Jelle Herold

Feb 7, 2012, 9:16:02 PM
to le...@googlegroups.com
On 11/12/2011 04:41 PM, andrew cooke wrote:
> > > you need to use [...] to join together the fragments, but you
> > > only want to join together the text, not the comments, so you
> > > need to break things open a little.
> > >
> > > instead of
> > >
> > >     nested_comment = Delayed()
> > >     contents = (nested_comment | ~Lookahead(both) & Any())[:]
> > >     comment = start & contents & end > Comment
> > >     nested_comment += comment
> > >
> > > try something more like:
> > >
> > >     nested_comment = Delayed()
> > >     word = (~Lookahead(both) & Any())[:,...]
> > >     contents = (nested_comment | word)[:]
> > >     comment = start & contents & end > Comment
> > >     nested_comment += comment
> > >
> > > (i haven't tried it out, but that should give you the right
> > > idea).
> >
> > This is what I tried initially, but the parser then becomes very
> > inefficient... to the point where I kill it before it's done
> > parsing a file.
>
> is it slow parsing the tiny test case, or slow parsing some larger
> file? if the latter, can you show me the file that it is having
> problems with?
>
> more generally, it could be that it contains an error (in which case
> lepl is backtracking like crazy because it is stuck) or some problem
> with the grammar that i can't see, but which might be fixed by
> appropriate use of First().

Well, it had problems with all the test files. Here is a simplified version of the parser that uses lookahead and shows the same problem.

----
#! /usr/bin/env python

from lepl import *


start = Literal('(*')
end = Literal('*)')
 
# collect (as many as possible) symbols other than '(*' or '*)'
word = (~Lookahead(start|end) & Any()) #[:,...]

parser = (start|end|word)[:].get_parse()

print parser("foo(*bar(*baz*)meh*)teh")
----

If you uncomment the #[:,...] bit, the parser hangs at 100% cpu.

I'm probably doing something stupid here, but I don't see what...

Any ideas?

Thanks again,
Jelle.

andrew cooke

Feb 7, 2012, 3:36:59 PM
to le...@googlegroups.com

ok, i don't know for sure, but just looking at that raises the following huge
red flag:

[:] means repeat any number of times from 0 upwards.

that means that x[:] will successfully match an empty string.

that means that x[:][:] will sit for ever, repeatedly matching empty strings,
happy to be making progress.

now obviously you don't have anything that obviously wrong, but if
uncommenting a [:] leads to 100% cpu then it suggests that somewhere else that
thing (which can be empty) is being repeated, and the system is repeatedly
matching the empty string.

the usual fix is to change [:] to [1:] (and perhaps that should be the default
when i next release a major release).
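
(as a minimal sketch of what i mean - untested, but it should show the shape of the problem; the names are just for illustration:

    from lepl import *

    empty_word = Any()[:,...]  # can succeed while consuming nothing
    bad = empty_word[:]        # parsing with this can loop on the empty match for ever
    word = Any()[1:,...]       # [1:] forces each fragment to consume at least one character
    good = word[:]             # every repeat now makes progress, so it terminates
)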

andrew

Jelle Herold

Feb 7, 2012, 9:48:47 PM
to le...@googlegroups.com
On 02/07/2012 02:36 PM, andrew cooke wrote:
> ok, i don't know for sure, but just looking at that raises the following huge
> red flag:
>
> [:] means repeat any number of times from 0 upwards.
>
> that means that x[:] will successfully match an empty string.
>
> that means that x[:][:] will sit for ever, repeatedly matching empty strings,
> happy to be making progress.
>
> now obviously you don't have anything that obviously wrong, but if
> uncommenting a [:] leads to 100% cpu then it suggests that somewhere else that
> thing (which can be empty) is being repeated, and the system is repeatedly
> matching the empty string.
>
> the usual fix is to change [:] to [1:] (and perhaps that should be the default
> when i next release a major release).

Ahh! *slap* that's it... so, the following works,
(and from here I should be able to fix the rest)

----
#! /usr/bin/env python

from lepl import *

start = Literal('(*')
end = Literal('*)')

# collect (as many as possible) symbols other than '(*' or '*)'

word = (~Lookahead(start|end) & Any()) #[1:,...]

parser = (start|end|word)[1:].get_parse()

print parser("foo(*bar(*baz*)meh*)teh")
----

Thanks Andrew!

Jelle.

Jelle Herold

Feb 7, 2012, 11:05:53 PM
to le...@googlegroups.com
On 02/07/2012 02:36 PM, andrew cooke wrote:
> ok, i don't know for sure, but just looking at that raises the following huge
> red flag:
> [...]
>

So just for the record, should anyone be wondering, here is a parser
that deals with nested comments, including empty ones, and drops the
delimiters.

Cheers!

#! /usr/bin/env python

from lepl import *

class Comment(Node): pass

start = Literal('(*')
end = Literal('*)')

# a word is one or more symbols other than '(*' or '*)'
word = (~Lookahead(start|end) & Any())[1:,...]

nested_comment = Delayed()
contents = (nested_comment | word)[1:]
comment = ~start & (contents|Empty()) & ~end > Comment
nested_comment += comment

parser = nested_comment[1:].get_parse()

print parser("(*bar(*foo(*toz*)meh*)*)")[0]

andrew cooke

Feb 7, 2012, 6:09:31 PM
to le...@googlegroups.com

thanks + glad it works, andrew
