Urgent issue: the Amara equivalent of streamed XSLT with massive XML as source and RDF as output


Chimezie Ogbuji

Mar 3, 2011, 12:48:52 PM
to akara, uc...@ogbuji.net
Hello, Uche, and all.

I have a very large number of massive XML documents I need to convert
to RDF/XML using as much streaming capability as possible in Amara.
Normally I would use XSLT for this, but given my desire to do it
efficiently, I need the processing pipeline to be fully streamed and
lazy. In fact, I originally started with XSLT, and it worked without
memory issues, but it was much slower than I imagine it could be if
set up to stream the processing lazily. In addition, some of the
things I need to do require capabilities that XPath and XSLT do not
have (robust regex matching, primarily), and I felt that porting the
transform to Python would be the way to go if I wanted to do this
efficiently.
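
To illustrate the sort of thing I mean, here is a minimal, hypothetical example of the regex-driven value handling involved (the pattern and field are invented for illustration, not taken from the actual transform):

# Hypothetical illustration only: the kind of regex matching XPath 1.0/XSLT 1.0 lack.
import re

# e.g. split free-text pagination like "1021-9" into start/end pages
PAGINATION = re.compile(r'^(?P<start>\d+)-(?P<end>\d+)$')

def split_pagination(text):
    match = PAGINATION.match(text.strip())
    return (match.group('start'), match.group('end')) if match else None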

The documents are enormous and have on the order of 30,000 <Article>
elements 3 levels deep that are the primary target of the
transformation. The very large documents (15 MB on average) have the
following structure:

<MedlineCitationSet>
  <MedlineCitation>
    <Article>..etc..</Article>
  </MedlineCitation>
  <MedlineCitation> .. </MedlineCitation>
</MedlineCitationSet>

My approach to setting up a streaming XML transformation pipeline
using the latest version of Amara 2 (version '2.0a5' from PyPI) is the
following basic script outline:

class ArticleHandler(..etc..):
    def __init__(self, feed, ..etc..):
        self.feed = feed
        ..etc..
    def handleArticle(self, node, ..etc..):
        ..etc..
        self.feed.send( ..etc.. )
        self.feed.send( ..etc.. )

def main(..etc..):
    ..snip..
    output = StringIO()
    w = structwriter(indent=u"yes", stream=output)
    feed = w.cofeed(
        ROOT(
            E_CURSOR(
                (RDF.RDFNS, u'rdf:RDF'))))
    handler = ArticleHandler(feed, ..etc..)

    @coroutine
    def receive_nodes(..etc..):
        while True:
            node = yield
            handler.handleArticle(node.Article, ..etc..)
        return

    target = receive_nodes(..etc..)
    pushtree(doc, u'MedlineCitation', target.send,
             entity_factory=entity_base)
    handler.feed.close()
    target.close()
    return output.getvalue()

Where handleArticle on ArticleHandler is the entry point to a series
of calls that use structwriter to generate the resulting XML
programmatically.
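
For illustration, one such call might look roughly like this (the element names, namespace constants, and articleUri() helper are invented placeholders for this sketch, not the actual transform):

# Hypothetical sketch only: one rdf:Description sent through the open rdf:RDF cursor.
# RDF_NS, DC_NS and articleUri() are invented placeholders, not the actual transform.
def handleArticle(self, articleNode, graphUri):
    self.feed.send(
        E((RDF_NS, u'rdf:Description'),
          {(RDF_NS, u'rdf:about'): articleUri(articleNode)},
          E((DC_NS, u'dc:title'), unicode(articleNode.ArticleTitle))))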

So, basically, I'm using pushtree to efficiently dispatch events only
for the inner elements I care about, and the coroutine-based
structwriter to generate content. My understanding of these two
approaches is that together they should be the most efficient pipeline
available in Amara for streaming, Python-based XML transformation,
with very minimal memory overhead as well. However, I'm getting what
looks to be a fatal memory error when I run this on a single one of
these gigantic files (below). Is there something I'm doing wrong?
Perhaps the combination of coroutines with pushtree and structwriter
is creating a circular reference that is not obvious to me?

Thanks in advance.

Python(7528) malloc: *** mmap(size=262144) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
.. repeated hundreds of times ..
Traceback (most recent call last):
  ..etc..
  File "..etc..", line ..etc.., in receive_nodes
    handler.handleArticle(node.Article,..etc..)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/writers/struct.py", line 235, in cofeed
    buf.send(val)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/writers/struct.py", line 270, in cofeed
    buf.send(val)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/writers/struct.py", line 461, in do
    sink.feed(obj)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/writers/struct.py", line 176, in feed
    self.feed(subobj, prefixes)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/writers/struct.py", line 197, in feed
    self.feed(subobj, prefixes)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/writers/struct.py", line 174, in feed
    self.feed(lead, prefixes)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/writers/struct.py", line 176, in feed
    self.feed(subobj, prefixes)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/xpath/expressions/__init__.py", line 62, in evaluate
    self.compile(compiler)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/xpath/expressions/nodesets.py", line 39, in _make_block
    'STORE_ATTR', 'node',
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/xpath/compiler/assembler.py", line 51, in emit
    add(instr)
MemoryError


Dave Pawson

Mar 3, 2011, 12:58:58 PM
to ak...@googlegroups.com
On 3 March 2011 17:48, Chimezie Ogbuji <chim...@gmail.com> wrote:
> Hello, Uche, and all.
>
> I have a very large number of massive XML documents

> The documents are enormous and have on the order of 30,000 <Article>


> elements 3 levels deep that are the primary target of the
> transformation.  The very large documents (15 MB on average) have the
> following structure:


Mike Kay, with Saxon and XSLT 2, estimates about four times the
document size in memory requirements. So < 100 MB isn't too bad?
XSLT 2 has the regex support you need, if you feel you can leave
Python for this job?

HTH

--
Dave Pawson
XSLT XSL-FO FAQ.
Docbook FAQ.
http://www.dpawson.co.uk

Uche Ogbuji

Mar 3, 2011, 3:11:30 PM
to ak...@googlegroups.com, Chimezie Ogbuji
I'm wondering whether the memory issue could be with the structwriter, rather than pushtree.  I've beaten pushtree pretty hard, and I think it should be able to handle your level of usage.

Could you try replacing the handler.handleArticle line in receive_nodes with a simple

print repr(node)

Or the like to at least make sure pushtree is not the issue?  If that does run through without memory issues, I have a couple of follow-up ideas.
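
In other words, something roughly like this (a sketch of the diagnostic only, with the other arguments dropped):

@coroutine
def receive_nodes():
    while True:
        node = yield
        # diagnostic only: no handleArticle/structwriter in the loop,
        # so any remaining memory growth would implicate pushtree
        print repr(node)
    return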

--Uche







--
Uche Ogbuji                       http://uche.ogbuji.net
Weblog: http://copia.ogbuji.net
Poetry ed @TNB: http://www.thenervousbreakdown.com/author/uogbuji/
Founding Partner, Zepheira        http://zepheira.com
Linked-in: http://www.linkedin.com/in/ucheogbuji
Articles: http://uche.ogbuji.net/tech/publications/
Friendfeed: http://friendfeed.com/uche
Twitter: http://twitter.com/uogbuji
http://www.google.com/profiles/uche.ogbuji

Chimezie Ogbuji

Mar 3, 2011, 3:57:15 PM
to Uche Ogbuji, ak...@googlegroups.com
I replaced the handleArticle invocation with print repr(node) and got
a bunch of the following printed to STDOUT:

<MedlineCitation at 0x2be43078: name u'MedlineCitation', 0 namespaces, 2 attributes, 17 children>

It finished without any memory issues.

On Thu, Mar 3, 2011 at 3:11 PM, Uche Ogbuji <uc...@ogbuji.net> wrote:
> I'm wondering whether the memory issue could be with the structwriter,
> rather than pushtree.  I've beaten pushtree pretty hard, and I think it
> should be able to handle your level of usage.
> Could you try replacing the handler.handleArticle line in receive_nodes to a
> simple
> print repr(node)
> Or the like to at least make sure pushtree is not the issue?  If that does
> run through without memory issues, I have a couple of follow-up ideas.
> --Uche

-- Chime

Uche Ogbuji

Mar 3, 2011, 4:32:02 PM
to Chimezie Ogbuji, ak...@googlegroups.com
Whew! Well that's good news and bad news.  The good news is that pushtree is not the problem, which is good because it's designed to handle this case.

The bad news is that I suspect it's structwriter.  I don't think that's been tested to emit 30K elements, and it sounds like it's failing such a test ;)

Or wait!  Maybe not.  I just noticed this:

   output = StringIO()
   w = structwriter(indent=u"yes",stream=output)

So you're creating a StringIO buffer to hold the entire XML output.  That's never going to work unless you've got a mainframe handy ;)

Just so you know, structwriter with StringIO is almost never needed; just use structencoder instead (it's also in the tutorial).  But that would break in this case too.  What you want to do is write directly to a file, so you're buffering on disk, not in memory:

   f = open('output.xml', 'w')
   w = structwriter(indent=u"yes",stream=f)

See if that helps.

--Uche

Chimezie Ogbuji

Mar 3, 2011, 9:50:47 PM
to Uche Ogbuji, ak...@googlegroups.com
On Thu, Mar 3, 2011 at 4:32 PM, Uche Ogbuji <uc...@ogbuji.net> wrote:
> Whew! Well that's good news and bad news.  The good news is that pushtree is
> not the problem, which is good because it's designed to handle this case.

Ok.

> The bad news is that I suspect it's structwriter.  I don't think that's been
> tested to emit 30K elements, and sounds like it's failing such a test ;)

> ..snip..


> So you're creating a StringIO buffer to hold the entire XML output.  That's
> never going to work unless you got a mainframe handy ;)
> Just so you know, structwriter with StringIO is almost never needed.  Just
> use structencoder instead (it's also in the tutorial).  But that would break
> in this case too.  What you want to do is write directly to the file, so
> you're buffering on disk, not in memory
>    f = open('output.xml', 'w')
>    w = structwriter(indent=u"yes",stream=f)

Ahh yes, thanks. I made the change so structwriter now takes a file
object instead, as you showed; however, I'm getting the same error,
although the malloc traceback has only one entry (rather than
hundreds). I also added a counter for the number of elements
dispatched by pushtree up to the error, and the final count is 17k in
this case.

Python(13743) malloc: *** mmap(size=16777216) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug

Traceback (most recent call last):
  ..snip..
    pushtree(doc, u'MedlineCitation', target.send, entity_factory=entity_base)
  ..snip..
    for grantEl in articleNode.xml_select('GrantList/Grant'):
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/xpath/util.py", line 52, in simple_evaluate
    return ctx.evaluate(expr)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/xpath/__init__.py", line 224, in evaluate
    return parsed.evaluate(self)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/xpath/expressions/__init__.py", line 64, in evaluate
    docstring=unicode(self))
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/xpath/compiler/__init__.py", line 45, in compile
    firstlineno)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/xpath/compiler/assembler.py", line 60, in assemble
    stacksize = self._compute_stack_size()
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/xpath/compiler/assembler.py", line 73, in _compute_stack_size
    stack_effect = {


-- Chime

Uche Ogbuji

Mar 4, 2011, 9:50:49 AM
to Chimezie Ogbuji, ak...@googlegroups.com
OK so maybe there is indeed a memory problem with structwriter.  If you get to it before me could you create a ticket at:

http://foundry.zepheira.com/projects/amara ?

My main idea for a workaround in that case is to see if it works to create the top element separately from its children.  So for example, at the beginning of things do:

f.write('<rdf:RDF...>')

then at the end do:

f.write('</rdf:RDF>')

And instead of structwriter.cofeed, create each separate rdf:Description or whatever as if it were a separate top-level document, to that same file stream f.  So you are not holding one top-level structure open while you add 17K+ child objects to it.  I hope that makes sense.  Not as convenient as cofeed, and you might have to pull a few tricks to avoid a mess with output namespaces, but it should get you over the memory hump for now.
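
Very roughly, and untested, something of this shape (ARTICLE_URI() and the child content are placeholders; suppressing the XML declaration that each fragment would otherwise get is one of the tricks I mean):

# Rough, untested sketch: write the rdf:RDF wrapper by hand and emit each record
# through its own ROOT, so no single structure stays open across 17K+ records.
# ARTICLE_URI() and the child elements are placeholders for this illustration.
from amara.writers.struct import structwriter, E, ROOT

RDF_NS = u'http://www.w3.org/1999/02/22-rdf-syntax-ns#'

f = open('output.rdf', 'w')
f.write('<rdf:RDF xmlns:rdf="%s">\n' % RDF_NS)

def emitArticle(node):
    w = structwriter(indent=u"yes", stream=f)
    w.feed(ROOT(
        E((RDF_NS, u'rdf:Description'),
          {(RDF_NS, u'rdf:about'): ARTICLE_URI(node)},  # placeholder helper
          # ... child elements built from the Article node ...
          )))
    f.write('\n')

# ... pushtree(doc, u'MedlineCitation', ...) drives emitArticle as before ...

f.write('</rdf:RDF>\n')
f.close()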


--Uche

Augusto Herrmann

Mar 8, 2011, 3:38:14 PM
to akara
Hi, Chime.

If what you want to write are RDF triples in RDF/XML serialization,
how about using rdflib instead of Amara's structwriter, holding all
the triples in a graph and serializing it only at the end?
And if there are too many triples to hold the whole graph in memory,
maybe you could send them to a triple store and postpone the RDF/XML
serialization until the final step.
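
Something like this minimal sketch, assuming rdflib's Graph API and a hypothetical articleTriples() generator that yields (subject, predicate, object) tuples for each Article:

from rdflib import Graph

g = Graph()
for article in articles:                    # hypothetical iterable of Article nodes
    for triple in articleTriples(article):  # hypothetical: yields (s, p, o) tuples
        g.add(triple)

# RDF/XML serialization happens only once, at the very end
g.serialize(destination='output.rdf', format='xml')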

Cheers,
Augusto Herrmann

On 4 Mar, 12:50, Uche Ogbuji <u...@ogbuji.net> wrote:
> OK so maybe there is indeed a memory problem with structwriter.  If you get
> to it before me could you create a ticket at:
>
> http://foundry.zepheira.com/projects/amara?
>
> My main idea for a workaround in that case is to see if it works to create
> the top element separately from its children.  So for example, at the
> beginning of things do:
>
> f.write('<rdf:RDF...>')
>
> then at the end do:
>
> f.write('</rdf:RDF>')
>
> And instead of structwriter.cofeed, create each separate rdf:Description or
> whatever as if it were a separate top-level document, to that same file
> stream f.  So you are not holding one top-level structure open while you add
> 17K+ child objects to it.  I hope that makes sense.  Not as convenient as
> cofeed, and you might have to pull a few tricks to avoid a mess with output
> namespaces, but it should get you over the memory hump for now.
>
> --Uche
>
>
>
> On Thu, Mar 3, 2011 at 7:50 PM, Chimezie Ogbuji <chime...@gmail.com> wrote:

Mark Baker

Mar 8, 2011, 5:13:00 PM
to ak...@googlegroups.com, Chimezie Ogbuji
Hey Chime,

FWIW, I just ran into what appears to be a memory leak with pushtree.
Reported here;

https://foundry.zepheira.com/issues/1294

Mark.

Uche Ogbuji

Mar 8, 2011, 5:22:34 PM
to ak...@googlegroups.com, Mark Baker, Chimezie Ogbuji
Hmm.  This could be a memory leak, but I would not necessarily rely on objgraph to determine that.  Do you actually run out of memory in that case?  Note that Chimezie did not run out of memory when he narrowed down his code to pushtree alone, as reported in this thread.  David Beazley did a bunch of testing of pure pushtree with huge files last year.  On the other hand, if we do learn that we can rely on objgraph, that would be a useful tool to have on hand.



Uche Ogbuji

Mar 8, 2011, 10:59:21 PM
to ak...@googlegroups.com, Mark Baker, Chimezie Ogbuji
My skepticism turned out to be unfounded.  I've updated that issue with more info. It seems you have to be dealing with at least 1GB of XML for it to manifest, but there is a leak.

--Uche

Chimezie Ogbuji

Mar 18, 2011, 9:46:48 AM
to Uche Ogbuji, ak...@googlegroups.com
Hey Uche. I've been under the gun finding workarounds for memory
issues and haven't had the chance to come back to this thread.

On Fri, Mar 4, 2011 at 9:50 AM, Uche Ogbuji <uc...@ogbuji.net> wrote:
> OK so maybe there is indeed a memory problem with structwriter.  If you get
> to it before me could you create a ticket at:
> http://foundry.zepheira.com/projects/amara ?

Okay. I've also been able to confirm a possible memory leak with
pushtree for documents on the order of 10 MB or so. More about this
later.

> My main idea for a workaround in that case is to see if it works to create
> the top element separately from its children.  So for example, at the
> beginning of things do:
> f.write('<rdf:RDF...>')
> then at the end do:
> f.write('</rdf:RDF>')
> And instead of structwriter.cofeed, create each separate rdf:Description or
> whatever as if it were a separate top-level document, to that same file
> stream f.

Okay, so write each element under the root into a standalone string
and append that string within the rdf:RDF skeleton?

> So you are not holding one top-level structure open while you add
> 17K+ child objects to it.  I hope that makes sense.

I think it does. The first thing that comes to mind is all the
bookkeeping that would no longer be done.

> Not as convenient as
> cofeed, and you might have to pull a few tricks to avoid a mess with output
> namespaces, but it should get you over the memory hump for now.

Yes, this would be a bit tricky, but it's an alternative to my current
workaround of XSLT, which does seem to be doing well despite the
massive memory usage.

Chimezie Ogbuji

Mar 18, 2011, 10:24:16 AM
to ak...@googlegroups.com, Augusto Herrmann
Hello, Augusto,

On Tue, Mar 8, 2011 at 3:38 PM, Augusto Herrmann <hell...@gmail.com> wrote:
> If what you want to write are RDF triples in RDF/XML serialization,
> how about instead of using amara's structwriter you use rdflib to hold
> all the triples in a graph and serialize it only at the end?

I did try that, but the RDFLib IOMemory store doesn't seem to be very
efficient for large RDF graphs and the memory overhead is what I was
hoping to avoid by streaming directly from the source XML to the
resulting RDF/XML.

> And if the triples are too many to hold the whole graph in memory,
> maybe you could send them to a triple store and postpone the RDF/XML
> serialization to the final step.

Most triple stores, unfortunately, require an RDF document on hand in
order to load with maximum efficiency and live updates tend to not be
very efficient.


-- Chime

Chimezie Ogbuji

Mar 18, 2011, 10:40:02 AM
to Uche Ogbuji, ak...@googlegroups.com, Mark Baker
I've also been able to confirm what appears to be a leak. The other
scenario in which I've been using pushtree is while processing large
SPARQL XML result documents. In particular, I seem to be getting the
same kind of failures when using pushtree alone to seek
sparql/results/result elements, produce bindery elements, fetch the
values, and post-process them. Theoretically, since it is an
event-based system, all resources needed to handle each result
document (out of thousands that are streaming back and forth between
the SPARQL server and the application) should be relinquished from one
to the next. However, over time, I watched the memory footprint
monotonically increase on a machine with 32GB of memory, and after
some time (in particular, after several such documents between
20-50MB) it failed with the same kind of error.

Eventually, I used a size threshold on the XML documents to decide
when the (presumably DOM-based) Amara parser should be used instead. I
set the value at about 10MB, but still kept getting memory failures.
Eventually, I had to resort to using the DOM parser alone.
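
Concretely, the threshold check amounts to a size comparison along these lines (a hypothetical sketch; the names and the comparison direction are assumptions, not the actual code):

# Hypothetical sketch: larger result documents fall back to the DOM-style parse
PUSHTREE_SIZE_THRESHOLD = 10 * 1024 * 1024   # ~10MB, per the value mentioned above
asCoRoutine = len(sparqlResult) < PUSHTREE_SIZE_THRESHOLD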

The difference between the two was the following:

@coroutine
def receive_nodes():
    while True:
        sparqlSoln = yield
        handleSoln(sparqlSoln)
    return

if asCoRoutine:
    target = receive_nodes()
    pushtree(sparqlResult, u'result', target.send,
             entity_factory=entity_base)
    target.close()
else:
    print >> sys.stderr, "Parsing size %s SPARQL XML doc as DOM" % (len(sparqlResult))
    # Use DOM instead of event-based pushtree (memory issues)
    doc = bindery.parse(sparqlResult,
        prefixes={
            u'sparql': u'http://www.w3.org/2005/sparql-results#'})
    for soln in doc.xml_select('/sparql:sparql/sparql:results/sparql:result'):
        handleSoln(soln)

Using just the DOM parser, I ran a large set of jobs again and watched
the memory footprint monotonically increase but at a slower rate that
seemed to be the result of other data structures I was accumulating.
Eventually, it failed with the following (same) error:

Python(8403) malloc: *** mmap(size=262144) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug

Traceback (most recent call last):
  ..snip..
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/bindery/__init__.py", line 17, in parse
    doc = tree.parse(obj, uri, entity_factory=entity_factory, standalone=standalone, validate=validate)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/tree.py", line 69, in parse
    return _parse(inputsource(obj, uri), flags, entity_factory=entity_factory, rule_handler=rule_handler)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/bindery/nodes.py", line 625, in xml_element_factory
    eclass = type(class_name, (self.xml_element_base,), dict(xml_child_pnames={}))
MemoryError

I searched for information on that particular error and found:

[[[
In your case malloc is attempting to allocate 384kb of memory using mmap
by mapping anonymous memory (swap) into the address space. (The
advantage of doing it this way is that the space can be unmapped when no
longer needed.)

With no appropriate address space left for this chunk, you get the
error. You are either trying to use too much stuff or there is a memory
leak. ]]] -- http://mail.python.org/pipermail/python-list/2009-January/1189377.html

Since the error is the same as the one I was getting with pushtree and
appears to be memory-exhaustion related, and given the different
processing approaches (and the rate of monotonic increase in memory
usage), this seems to suggest that there is indeed a memory leak, but
perhaps one that only manifests with files large enough for the
problem to show up after repeated processing.
