Urgent issue: the Amara equivalent of streamed XSLT with massive XML as source and RDF as output
Date: Thu, 3 Mar 2011 09:48:52 -0800 (PST)
Subject: Urgent issue: the Amara equivalent of streamed XSLT with massive XML
as source and RDF as output
From: Chimezie Ogbuji <chime...@gmail.com>
To: akara <akara@googlegroups.com>
Cc: u...@ogbuji.net
Hello, Uche, and all.
I have a very large number of massive XML documents that I need to
convert to RDF/XML, using as much of Amara's streaming capability as
possible. Normally I would use XSLT for this, but for efficiency I need
the processing pipeline to be fully streamed and lazy. In fact, I began
with XSLT, and it ran without memory issues but much more slowly than I
imagine it could if the processing were set up to stream lazily. In
addition, some of what I need to do requires capabilities that XPath and
XSLT lack (primarily robust regex matching), so I felt that porting the
transform to Python was the way to go.
Each document is large (15 MB on average) and contains on the order of
30,000 <Article> elements, 3 levels deep, which are the primary target
of the transformation. The documents have the following structure:
<MedlineCitationSet>
  <MedlineCitation>
    <Article>..etc..</>
  </MedlineCitation>
  <MedlineCitation> .. </>
</MedlineCitationSet>
My approach to setting up a streaming XML transformation pipeline
using the latest version of Amara 2 (version '2.0a5' from PyPI) is the
following basic script outline:
class ArticleHandler(..etc..):
    def __init__(self, feed, ..etc..):
        self.feed = feed
        ..etc..

    def handleArticle(self, node, ..etc..):
        ..etc..
        # Each send() pushes one chunk of output to the structwriter feed
        self.feed.send( ..etc.. )
        self.feed.send( ..etc.. )

def main(..etc..):
    ..snip..
    output = StringIO()
    w = structwriter(indent=u"yes", stream=output)
    # Open the rdf:RDF root element and leave it open for incremental feeding
    feed = w.cofeed(
        ROOT(
            E_CURSOR(
                (RDF.RDFNS, u'rdf:RDF'))))
    handler = ArticleHandler(feed, ..etc..)

    @coroutine
    def receive_nodes(..etc..):
        # Receives one MedlineCitation subtree per send() from pushtree
        while True:
            node = yield
            handler.handleArticle(node.Article, ..etc..)
        return

    target = receive_nodes(..etc..)
    pushtree(doc, u'MedlineCitation', target.send,
             entity_factory=entity_base)
    handler.feed.close()
    target.close()
    return output.getvalue()
Where handleArticle on ArticleHandler is the entry point to a series
of calls that use structwriter to generate the resulting XML
programmatically.
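
For readers unfamiliar with the pattern, here is a minimal sketch of the
coroutine-sink idea using only the Python standard library (written for
Python 3, not the Python 2.6 used above). It is not Amara's API: the
coroutine decorator, rdf_sink, demo, and the dc:title output shape are
all hypothetical stand-ins chosen for illustration.

```python
import functools
import io
from xml.sax.saxutils import escape


def coroutine(func):
    """Prime a generator so callers can send() to it immediately."""
    @functools.wraps(func)
    def start(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)  # advance to the first yield
        return gen
    return start


@coroutine
def rdf_sink(stream):
    """Hypothetical sink: writes one rdf:Description per title received."""
    stream.write(
        '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
        ' xmlns:dc="http://purl.org/dc/elements/1.1/">\n'
    )
    try:
        while True:
            title = yield  # wait for the next item pushed by the producer
            stream.write(
                '  <rdf:Description><dc:title>%s</dc:title></rdf:Description>\n'
                % escape(title)
            )
    except GeneratorExit:
        # close() was called: finish the document
        stream.write('</rdf:RDF>\n')


def demo():
    output = io.StringIO()
    sink = rdf_sink(output)
    for title in ["A & B", "C"]:
        sink.send(title)  # output is written as each item arrives
    sink.close()
    return output.getvalue()
```

The point of the design is that the producer never holds more than one
item at a time; whether the sink buffers internally is a separate
question.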
So, basically, I'm using pushtree to dispatch events only for the inner
elements I care about, and the coroutine-based structwriter to generate
content. My understanding is that together these should form the most
efficient pipeline available in Amara for streaming, Python-based XML
transformation, and that the memory overhead should be minimal. However,
I'm getting what looks like a fatal memory error when I run this on just
one of these files (below). Is there something I'm doing wrong? Perhaps
the combination of coroutines, pushtree, and structwriter is creating a
reference cycle that is not obvious to me?
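
As a point of comparison, a bounded-memory baseline for the same document
shape can be sketched with the standard library's iterparse (Python 3).
This is an assumption-laden illustration rather than a replacement for
pushtree: iter_articles and SAMPLE are hypothetical, and the ArticleTitle
child element is made up for the sample data.

```python
import io
import xml.etree.ElementTree as ET

# Tiny made-up sample with the same outer structure as the real files
SAMPLE = b"""<MedlineCitationSet>
  <MedlineCitation><Article><ArticleTitle>First</ArticleTitle></Article></MedlineCitation>
  <MedlineCitation><Article><ArticleTitle>Second</ArticleTitle></Article></MedlineCitation>
</MedlineCitationSet>"""


def iter_articles(source):
    """Yield each <Article> once its enclosing <MedlineCitation> is complete."""
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "MedlineCitation":
            article = elem.find("Article")
            if article is not None:
                yield article
            # Detach processed citations so the tree does not keep growing
            root.clear()


# Each article must be fully consumed before the next one is requested
titles = [a.findtext("ArticleTitle") for a in iter_articles(io.BytesIO(SAMPLE))]
```

If a baseline like this stays flat in memory on one of the real files
while the full pipeline does not, that would point at the output side
rather than the parsing side.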
Thanks in advance.
Python(7528) malloc: *** mmap(size=262144) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
.. repeated hundreds of times ..
Traceback (most recent call last):
  ..etc..
  File "..etc..", line ..etc.., in receive_nodes
    handler.handleArticle(node.Article,..etc..)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/writers/struct.py", line 235, in cofeed
    buf.send(val)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/writers/struct.py", line 270, in cofeed
    buf.send(val)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/writers/struct.py", line 461, in do
    sink.feed(obj)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/writers/struct.py", line 176, in feed
    self.feed(subobj, prefixes)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/writers/struct.py", line 197, in feed
    self.feed(subobj, prefixes)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/writers/struct.py", line 174, in feed
    self.feed(lead, prefixes)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/writers/struct.py", line 176, in feed
    self.feed(subobj, prefixes)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/xpath/expressions/__init__.py", line 62, in evaluate
    self.compile(compiler)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/xpath/expressions/nodesets.py", line 39, in _make_block
    'STORE_ATTR', 'node',
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/xpath/compiler/assembler.py", line 51, in emit
    add(instr)
MemoryError