matching balanced-dynamic tokens

22 views
Skip to first unread message

Haroldo Stenger

unread,
Dec 28, 2010, 11:38:44 AM12/28/10
to le...@googlegroups.com
hi !

I'm trying to write a parser of a type of text file that resembles xml  (tags are alone in their own lines, opening tags and closing tags differ in a fixed prefix or postfix, tags share a 'dinamic' part, namely their name). Is there an operator already written in lepl , that I can use to construct this syntax ? The problem would be similar to matching a if/then/else/endif structure , weren't it for they don't have a 'dynamic' part , although (and obviously) if they nest, recursive parsing should apply (the same as in my 'xml'-like textfile.

Any kind hint ?

thanks a lot in advance

Haroldo

Jasper St. Pierre

unread,
Dec 28, 2010, 6:31:06 PM12/28/10
to le...@googlegroups.com
Do you have a sample?

> --
> You received this message because you are subscribed to the Google Groups
> "lepl" group.
> To post to this group, send email to le...@googlegroups.com.
> To unsubscribe from this group, send email to
> lepl+uns...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/lepl?hl=en.
>

andrew cooke

unread,
Dec 29, 2010, 7:30:59 AM12/29/10
to le...@googlegroups.com

About to get on a plane, but my reply on this stack overflow question
may be relevant -
http://stackoverflow.com/questions/3535706/implementing-parser-for-markdown-like-language

Please say if it's not any help.

Andrew

andrew cooke

unread,
Dec 29, 2010, 7:32:25 AM12/29/10
to le...@googlegroups.com

Ah, but that's not dynamic. You want to check that the closing matches
the opening name? There is nothing inbuilt to do that - you'll need to
do it explicitly. I'll post an example within a few days (I hope -
assuming snow is not an issue etc etc...) Andrew

On Wed, 29 Dec 2010 07:30:59 -0500, andrew cooke <and...@acooke.org>
wrote:

Haroldo Stenger

unread,
Dec 29, 2010, 6:36:13 PM12/29/10
to le...@googlegroups.com
Thanks for any samples you could write.

basically the structure to parse looks like:

*tag1
    dwdjwojdiwei
    *tag2
       ;kljw;klklklk
         kujkjkj
    *end*tag2
lsdkjwlk
*end*tag1

*tag1 should match *end*tag1
*tagz2 should match *end*tag2

being * the fixed part in the opening tag and *end* the fixed part in the closing tag
and also
being tag1 and tag2 the dynamic parts in the opening tags , and tag1and tag2 the variable parts in the closing parts (so, they are the same in opening an closing).

Additionally, I'd like to discard whitespace in lines that contain tags, but I'd prefer to keep the same the whitespace in lines in between the tags.

I'd also like to be able to reference  the tag name hierarchy that applies to any text-in-between-tags once parsed (but I assume that is part of what I have to learn about lepl, possibly by biulding a stack of open tags in any moment , and snapshotting it for each contents I'd like to analyze)

hope that helps understand the text to parse

thank you
 
regards,
Haroldo

2010/12/29 andrew cooke <and...@acooke.org>

andrew cooke

unread,
Dec 31, 2010, 3:36:12 PM12/31/10
to le...@googlegroups.com

Hi,

Hope this helps. It's an interesting example, because the trampoline
matcher factory isn't well documented. I will add this to the next
release as an example.

Andrew

from logging import basicConfig, INFO
from lepl import *


basicConfig(level=INFO)


def line(matcher):
'''Include space before and a newline after'''
return ~Space()[:] & matcher & ~Space()[:] & ~Literal('\n')

@trampoline_matcher_factory()
def make_pair(start, end, contents):
'''Generate a matcher that checks start and end tags.
`start` must match the start and return the "label"
`end` must match the end and return the same "label"
If end fails then contents is used instead (and the matcher
repeats.
The return value is a pair that contains the label and
the contents.'''
def matcher(support, stream0):
(match, stream1) = yield start._match(stream0)
label = match[0]
result = []
while True:
# can we end?
try:
(match, stream2) = yield end._match(stream1)
if match and match[0] == label:
yield ([(label, result)], stream2)
else:
support._info('end matched, but %s != %s' %
(match[0], label))
except StopIteration:
support._info('failed to match end')
pass
# failed to end, so try matching contents instead
(match, stream2) = yield contents._match(stream1)
result += match
support._info('matched %s: %s' % (match, result))
stream1 = stream2
return matcher

nested = Delayed()

start = line(~Literal('*') & Word())
end = line(~Literal('*end*') & Word())
# note that we force at least one match here ([1:]) to avoid repeated
empty
# matches (which would give an infinite loop instead of failing)
# also, nested must go first here, or we match *bar as data
contents = nested | line(Word()[1:,~Space()[:]])

nested += make_pair(start, end, contents)

def test(result, target):
assert result == target, str(target) + ' != ' + str(result)

# start token
test(start.parse('*foo\n'), ['foo'])
test(start.parse(' *foo \n'), ['foo'])

# end token
test(end.parse('*end*foo\n'), ['foo'])
test(end.parse(' *end*foo \n'), ['foo'])

# simple pair
test(nested.parse('*foo\n*end*foo\n'), [('foo', [])])

# simple contents
test(contents.parse('ab c\n'), ['ab', 'c'])

# nested contents
test(nested.parse(
'''*foo
ab c
*end*foo
'''), [('foo', ['ab', 'c'])])

# multiple lines
test(nested.parse(
'''*foo
ab c
p q rs tu
*end*foo
'''), [('foo', ['ab', 'c', 'p', 'q', 'rs', 'tu'])])

# multiple nestings
test(nested.parse(
'''*foo
ab c
*bar
p q rs tu
*end*bar
*end*foo
'''), [('foo', ['ab', 'c', ('bar', ['p', 'q', 'rs', 'tu'])])])

# this is consistent, but perhaps not expected. if you don't want
this, then
# you need to define contents more carefully - perhaps exclude '*' from
the
# characters in Word() for example.
test(nested.parse(
'''*foo
ab c
*bar
p q rs tu
*end*baz
*end*foo
'''), [('foo', ['ab', 'c', '*bar', 'p', 'q', 'rs', 'tu', '*end*baz'])])

andrew cooke

unread,
Dec 31, 2010, 8:01:15 PM12/31/10
to le...@googlegroups.com

In svn at
https://code.google.com/p/lepl/source/browse/src/lepl/_example/dynamic.py

On Fri, 31 Dec 2010 15:36:12 -0500, andrew cooke <and...@acooke.org>
wrote:


> Hi,
>
> Hope this helps. It's an interesting example, because the trampoline
> matcher factory isn't well documented. I will add this to the next
> release as an example.

[...]

Haroldo Stenger

unread,
Jan 3, 2011, 3:56:22 AM1/3/11
to le...@googlegroups.com, and...@acooke.org
Hi Andrew ,

the grammar worked excellent. I noticed it goes slow on rather bigger inputs. I plan to run it on real big inputs. I believe the slugishness has to do with slicing into words the contents in the 'non marker' lines, besides the 'big lookahead' involved (which I don't know if should be treated differently, what your opinion on this?).

Were these lines treated as lines, not as list of words, maybe it would go faster ? I hope so ! , and also, I'm wondering if the 'line continuation' abstraction (although not explicit in this case as in the \ case, here just the end marker ends a set of lines) , could be somehow applied to the benefit of rapidness.

I  provide my dataset just in case it helps, which Iput in a string in the code/

btw, I used the attached code to parse.
 
thanks for the hard work.

best regards,
Haroldo




2010/12/31 andrew cooke <and...@acooke.org>

--
grammar_three.py

andrew cooke

unread,
Jan 3, 2011, 10:59:03 AM1/3/11
to le...@googlegroups.com

If you get a horrendous slowdown like that it's because something has
gone very wrong and it's having to search a large solution space to find
a consistent parse. In this case, it's because the input file has bad
data. You can help avoid this by making the parser more "accurate" so
that it tries less things (for example, by requiring that a starting tag
is always : followed by a capital letter, followed by only lower-case
letters).

ALSO there was a bug in the code I gave originally, which may not have
helped help (sorry). You need a "return" as follows:

@trampoline_matcher_factory()
def make_pair(start, end, contents):

def matcher(support, stream0):
(match, stream1) = yield start._match(stream0)
label = match[0]
result = []
while True:

try:
(match, stream2) = yield end._match(stream1)

if match:
if match[0] == label:


yield ([(label, result)], stream2)

return # THIS WAS MISSING
else:
support._debug('Bad end: %s' % match[0]) #
debug code to find the error in data
except StopIteration:
pass


(match, stream2) = yield contents._match(stream1)
result += match

stream1 = stream2
return matcher

The problem with the input data is here (see how making the parser more
accurate will help? and also how the debug logging above helps identify
this?):

:Att:Font
Size=9
:eFont
:Txt:Font
Size=9
Bold
:eFont

I am still looking at this; there may be other issues too.

Andrew

On Mon, 3 Jan 2011 06:56:22 -0200, Haroldo Stenger
<haroldo...@gmail.com> wrote:
> Hi Andrew ,
>
> the grammar worked excellent. I noticed it goes slow on rather bigger
> inputs. I plan to run it on real big inputs. I believe the
> slugishness
> has to do with slicing into words the contents in the 'non marker'
> lines, besides the 'big lookahead' involved (which I don't know if
> should be treated differently, what your opinion on this?).
>
> Were these lines treated as lines, not as list of words, maybe it
> would go faster ? I hope so ! , and also, I'm wondering if the 'line
> continuation' abstraction (although not explicit in this case as in
> the case, here just the end marker ends a set of lines) , could be
> somehow applied to the benefit of rapidness.
>
> I  provide my dataset just in case it helps, which Iput in a string
> in the code/
>
> btw, I used the attached code to parse.
>  
> thanks for the hard work.
>
> best regards,
> Haroldo
>
> 2010/12/31 andrew cooke
>

> test(start.parse('*foon'), ['foo'])


> test(start.parse(' *foo n'), ['foo'])
>
> # end token

> test(end.parse('*end*foon'), ['foo'])


> test(end.parse(' *end*foo n'), ['foo'])
>
> # simple pair

> test(nested.parse('*foon*end*foon'), [('foo', [])])
>
> # simple contents
> test(contents.parse('ab cn'), ['ab', 'c'])

> To post to this group, send email to le...@googlegroups.com [2].


> To unsubscribe from this group, send email to

> lepl+uns...@googlegroups.com [3].


> For more options, visit this group at

> http://groups.google.com/group/lepl?hl=en [4].


>
> --
> You received this message because you are subscribed to the Google
> Groups "lepl" group.
> To post to this group, send email to le...@googlegroups.com.
> To unsubscribe from this group, send email to
> lepl+uns...@googlegroups.com.
> For more options, visit this group at
> http://groups.google.com/group/lepl?hl=en.
>
>

> Links:
> ------
> [1] mailto:and...@acooke.org
> [2] mailto:le...@googlegroups.com
> [3] mailto:lepl%2Bunsu...@googlegroups.com
> [4] http://groups.google.com/group/lepl?hl=en

andrew cooke

unread,
Jan 3, 2011, 11:18:43 AM1/3/11
to le...@googlegroups.com
Running

def make_3():
text = get_data(1)
fix =
compile_(r'(?m)^(\s*)(?::(?:[A-Z][a-z]*)+)+?(:(?:[A-Z][a-z]*)+.*)$')
text = fix.sub(r'\1\2', text)
with get_file(3, 'w') as out:
out.write(text)

on the input I found the following errors:


pl6 lepl-hg: diff ./src/lepl/_performance/dynamic/data-1.txt
./src/lepl/_performance/dynamic/data-3.txt
392c392
< :Att:Font
---
> :Font
395c395
< :Txt:Font
---
> :Font
786c786
< :Title:ColorInfo
---
> :ColorInfo
788c788
< :Rows:ColorInfo
---
> :ColorInfo
799c799
< :Title:ColorInfo
---
> :ColorInfo
801c801
< :Rows:ColorInfo
---
> :ColorInfo
811c811
< :Title:ColorInfo
---
> :ColorInfo
813c813
< :Rows:ColorInfo
---
> :ColorInfo
819c819
< :Att:Font
---
> :Font
824c824
< :Tit:Font
---
> :Font

andrew cooke

unread,
Jan 3, 2011, 11:49:20 AM1/3/11
to le...@googlegroups.com

Using the "restricted" parser below, with the "broken" input data, but
with "hola" changed to "Hola", the parser works in a reasonable time.
This is because something like :Att:Font is not recognised as a start
marker (which helps the parser avoid doing a search for the closing
token, which does not exist, causing all the data to be read then
discarded).

Andrew


def line(matcher):


return ~Space()[:] & matcher & ~Space()[:] & ~Literal('\n')


@trampoline_matcher_factory()
def make_pair(start, end, contents):

def matcher(support, stream0):
(match, stream1) = yield start._match(stream0)
label = match[0]
result = []
while True:
try:
(match, stream2) = yield end._match(stream1)
if match:
if match[0] == label:
yield ([(label, result)], stream2)
return

else:
support._debug('Bad end: %s' % match[0])

except StopIteration:
pass
(match, stream2) = yield contents._match(stream1)
result += match
stream1 = stream2
return matcher


def base():
nested = Delayed()
start = line(~Literal(':') & Word())
end = line(~Literal(':e') & Word())
contents = nested | line(Word()[:,~Space()[:]])


nested += make_pair(start, end, contents)

return nested

def restricted():
nested = Delayed()
camel = Word(ascii_uppercase, ascii_lowercase)[1:, ...]
start = line(~Literal(':') & camel)
end = line(~Literal(':e') & camel)
anything = AnyBut('\n')[:, ...] & ~Literal('\n')
contents = nested | anything


nested += make_pair(start, end, contents)

return nested

Reply all
Reply to author
Forward
0 new messages