Parsing Greenbook Patent Data: would like to be able to build a generator

Robottaway

Aug 4, 2011, 3:59:54 PM
to lepl
Hi All,

I'm using LEPL again! I need to build a parser, and open-source it, for
patent data in the Greenbook format. I'll also be building a parser for
the XML format.

These files are pretty big, and I have no need to load a whole file into
memory at once. What I would like is for the parser to yield one patent
record at a time, so that I only ever deal with a single record.

http://paste.ofcode.org/7wyi2QsvsV2bCjPXPn8WWk

That's what I have so far; this code doesn't attempt the generator yet. I
tried using config.no_full_first_match() on PATN_node, which did allow me
to match just the first record, and I was thinking I could use the stream
data that returns to then parse the next one? I couldn't get much further,
but maybe that is an approach I should look into?

So the syntax I would like is:

for patent in greenbook('/location/of/patent/data.txt'):
    print patent

Eventually I plan to have a way of abstracting over the file format,
Greenbook or XML:

for patent in parse_patents('/location/of/patent/data.txt'):
    print patent

The patent value returned will expose the same interface regardless of
whether the source is the XML or the txt format.
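
To make that concrete, here is a rough sketch of the dispatch layer I have
in mind (greenbook and parse_xml are hypothetical per-format generators,
each yielding one record at a time):

import os

def parse_patents(path):
    """Yield patent records one at a time, whatever the source format."""
    if os.path.splitext(path)[1].lower() == '.xml':
        return parse_xml(path)  # hypothetical XML-backed generator
    return greenbook(path)      # hypothetical Greenbook-backed generator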

Any help would be greatly appreciated!

Cheers,
Rob Ottaway

andrew cooke

Aug 4, 2011, 4:22:06 PM
to le...@googlegroups.com

i had an example like this when i was worried about memory use.

there's a "hack" that makes it work. let me dig it out...

ok, this is the "output problem" discussed here -
http://acooke.org/lepl/resources.html

hope that helps (i'm at work, so i haven't taken time to look at your code,
but i think what i linked to is very relevant. i hope).
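
roughly, the trick is to match a single record, keep the stream the match
hands back, and restart from it. a sketch only - this assumes your
single-record PATN_node matcher, and that .match() yields (result list,
remaining stream) pairs; see the linked page for the real details:

def greenbook(path):
    matcher = PATN_node()                 # your single-record matcher
    matcher.config.no_full_first_match()  # ok to match only the first record
    stream = open(path).read()
    while stream:
        try:
            results, stream = next(matcher.match(stream))
        except StopIteration:
            break                         # no more records matched
        yield results[0]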

andrew

Robottaway

Sep 28, 2011, 12:27:20 PM
to lepl
Thank you for the help, Andrew - it worked great.

For anyone interested the code for the parser is here:

https://bitbucket.org/robottaway/greenbook-parser/src/3718bd3393d9/patent_grab/greenbook.py

andrew cooke

Sep 28, 2011, 1:06:50 PM
to le...@googlegroups.com

great - thanks for the feedback (it's useful to know that the memory
"workaround" works).

andrew

Robottaway

Oct 9, 2011, 4:28:45 PM
to lepl
Hi Andrew,

Your iterable pattern works great, but now I'm experiencing a problem
where memory use gets way out of hand. I didn't notice this until I
stopped using my small dev file, which contains only a handful of patents,
and parsed a real file containing 1200+ patents. I have actually converted
my parsing so that I pull out each patent section and parse a single
patent at a time, rather than parsing them all in one go while yielding at
each, as in your workaround. My new approach doesn't seem to help, though;
I still use far too much memory. In fact I'm pretty sure it uses the same
amount either way, so I'm thinking it's unrelated to the size of the
input. I force garbage collection by calling gc.collect(), but that frees
none of the memory used.
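
For reference, the shape of my one-patent-at-a-time version is roughly
this (patent_record stands in for the real single-record grammar, and I'm
assuming each Greenbook record begins with a PATN line):

def iter_patent_texts(path):
    # split the raw file into per-patent chunks, one PATN section each
    chunk = []
    for line in open(path):
        if line.startswith('PATN') and chunk:
            yield ''.join(chunk)
            chunk = []
        chunk.append(line)
    if chunk:
        yield ''.join(chunk)

def greenbook(path):
    for text in iter_patent_texts(path):
        yield patent_record.parse(text)[0]  # fresh parse per patent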

I'm using top to monitor memory use, and after parsing just a few patents
I can see free memory drop by 200+ MB. As I let it run it eventually eats
up all the memory and my machine starts freezing up.

You can see the parser here:
https://bitbucket.org/robottaway/patent-grab/src/b7cdc3179899/patent_grab/greenbook.py#cl-198

It's fairly complex compared to other parsers I've built in the past.
It's definitely fast enough; I just need to solve the memory issue. I'm
thinking there may be something in my code that causes a loop of some
sort, where the parser generates new objects often enough to account for
the large memory usage.

Have you seen issues like this before? Is this parser maybe too complex
for a tool like LEPL? I could use ANTLR if I have to, but I don't relish
the thought of J2EE, since I hoped to use some other Python tools in this
project.

Anyhow, thank you for all your help so far!

andrew cooke

Oct 9, 2011, 4:46:44 PM
to le...@googlegroups.com

lepl will eat memory like crazy because it needs to backtrack (so it needs to
keep a record of all previous state).

however, you typically don't need that state ("real life" grammars don't need
to backtrack all the way to the start). so you can limit the amount of state
that lepl stores.

the solution is described here -
http://acooke.org/lepl/resources.html#the-input-solution (basically, use
config.low_memory()).

however, be warned - it will run more slowly (since it has to track memory use
and "unlink" data). and if you give too small a value for the length then you
may fail to parse the data (if you discard too much backtracking information).

i should also say that this is a very infrequently used option and was buggy
for some time. so check you're using the latest version (5.0).
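
for concreteness, something like this - the argument is the amount of
backtracking state to retain, and the value here is purely illustrative
(too small and the parse fails, as above):

matcher = PATN_node()                 # your single-record matcher, as before
matcher.config.no_full_first_match()
matcher.config.low_memory(500)        # cap retained state; tune the value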

hope that helps,
andrew

andrew cooke

Oct 9, 2011, 4:48:32 PM
to le...@googlegroups.com

ps if speed is an issue, see the post on this list from Luca (title "[LEPL]
Lepl and Cython"). that may help...