Message from discussion Urgent issue: the Amara equivalent of streamed XSLT with massive XML as source and RDF as output

Date: Thu, 3 Mar 2011 09:48:52 -0800 (PST)
Subject: Urgent issue: the Amara equivalent of streamed XSLT with massive XML
 as source and RDF as output
From: Chimezie Ogbuji <chime...@gmail.com>
To: akara <akara@googlegroups.com>
Cc: u...@ogbuji.net

Hello, Uche, and all.

I have a very large number of massive XML documents that I need to
convert to RDF/XML, using as much of Amara's streaming capability as
possible.  Normally I would use XSLT for this, but for efficiency I
need the processing pipeline to be fully streamed and lazy.  In fact,
I started out with XSLT, and it ran without memory issues but much
more slowly than I imagine it would if the processing were set up to
stream lazily.  In addition, some of the things I need to do require
capabilities that XPath and XSLT lack (primarily robust regex
matching), so I felt that porting the transform to Python was the way
to go if I wanted to do this efficiently.

The documents are enormous (15 MB on average), and each contains on
the order of 30,000 <Article> elements, three levels deep, which are
the primary target of the transformation.  They have the following
structure:

<MedlineCitationSet>
  <MedlineCitation>
    <Article>..etc..</Article>
  </MedlineCitation>
  <MedlineCitation> .. </MedlineCitation>
</MedlineCitationSet>

My approach to setting up a streaming XML transformation pipeline
using the latest version of Amara 2 (version '2.0a5' from PyPI)
follows this basic script outline:

class ArticleHandler(..etc..):
    def __init__(self, feed, ..etc..):
        self.feed = feed
    ..etc..
    def handleArticle(self, node, ..etc..):
        ..etc..
        self.feed.send( ..etc.. )
        self.feed.send( ..etc.. )

def main(..etc..):
    ..snip..
    output = StringIO()
    w = structwriter(indent=u"yes",stream=output)
    feed = w.cofeed(
            ROOT(
                E_CURSOR(
                    (RDF.RDFNS, u'rdf:RDF'))))
    handler = ArticleHandler(feed,..etc..)

    @coroutine
    def receive_nodes(..etc..):
        while True:
            node = yield
            handler.handleArticle(node.Article,..etc..)
        return

    target = receive_nodes(..etc..)
    pushtree(doc, u'MedlineCitation', target.send, entity_factory=entity_base)
    handler.feed.close()
    target.close()
    return output.getvalue()

Here, handleArticle on ArticleHandler is the entry point to a series
of calls that use structwriter to generate the resulting XML
programmatically.
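
For readers unfamiliar with the coroutine idiom used in receive_nodes,
here is a minimal standalone sketch.  It assumes the decorator does
nothing more than prime the generator so that send() can be called
right away.  The coroutine function defined here is a hypothetical
stand-in, not Amara's own, and the list-based handler is only for
illustration.

```python
import functools


def coroutine(func):
    """Hypothetical stand-in: build the generator and advance it to its first yield."""
    @functools.wraps(func)
    def start(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)  # prime the generator so the first send() delivers a value
        return gen
    return start


@coroutine
def receive_nodes(handle):
    # Loops until close() is called; close() raises GeneratorExit at the
    # yield, which ends the loop, so nothing after it would ever run.
    while True:
        node = yield
        handle(node)


seen = []                      # illustrative handler: just record what arrives
target = receive_nodes(seen.append)
for item in ("a", "b", "c"):
    target.send(item)          # each send() resumes the loop body once
target.close()
```

Each send() hands exactly one value to the handler, so the coroutine
itself holds no more than the current node.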

So, basically, I'm using pushtree to dispatch events only for the
inner elements I care about, and the coroutine-based structwriter to
generate content.  My understanding is that this combination should be
the most efficient pipeline available in Amara for streaming,
Python-based XML transformation, and that its memory overhead should
be minimal.  However, I get what looks like a fatal memory error when
I run this on just one of these gigantic files (see below).  Is there
something I'm doing wrong?  Perhaps the use of coroutines with
pushtree and structwriter is causing a cycle that is not obvious to
me?
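
As a point of comparison for the pipeline described above, here is a
minimal sketch of the same push-and-feed pattern using only the
standard library.  It is not Amara and says nothing about the cause of
the error.  It assumes xml.etree.ElementTree.iterparse in place of
pushtree and a plain generator in place of structwriter.  The
ArticleTitle child and the <titles>/<title> output are hypothetical
placeholders, not part of the real transform.

```python
import io
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Tiny stand-in for one of the large input files (ArticleTitle is assumed).
SAMPLE = b"""<MedlineCitationSet>
  <MedlineCitation><Article><ArticleTitle>First</ArticleTitle></Article></MedlineCitation>
  <MedlineCitation><Article><ArticleTitle>Second &amp; last</ArticleTitle></Article></MedlineCitation>
</MedlineCitationSet>"""


def writer(out):
    """Generator-based sink: open a wrapper element, then emit one line per value sent."""
    out.write("<titles>\n")
    try:
        while True:
            title = yield
            out.write("  <title>%s</title>\n" % escape(title))
    finally:
        # Runs when close() is called, so the wrapper element is always closed.
        out.write("</titles>\n")


def transform(source, out):
    sink = writer(out)
    next(sink)  # prime the generator before the first send()
    context = ET.iterparse(source, events=("start", "end"))
    _, root = next(context)  # the first event is the start of the root element
    for event, elem in context:
        if event == "end" and elem.tag == "MedlineCitation":
            sink.send(elem.findtext("Article/ArticleTitle", default=""))
            root.clear()  # drop finished citations so the tree does not grow
    sink.close()


out = io.StringIO()
transform(io.BytesIO(SAMPLE), out)
result = out.getvalue()
```

The root.clear() call after each citation is what keeps memory flat in
this sketch.  Without it, iterparse still builds the whole tree even
though the elements are handled one at a time.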

Thanks in advance.

Python(7528) malloc: *** mmap(size=262144) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
.. repeated hundreds of times ..
Traceback (most recent call last):
  ..etc..
  File "..etc..", line ..etc.., in receive_nodes
    handler.handleArticle(node.Article,..etc..)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/writers/struct.py", line 235, in cofeed
    buf.send(val)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/writers/struct.py", line 270, in cofeed
    buf.send(val)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/writers/struct.py", line 461, in do
    sink.feed(obj)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/writers/struct.py", line 176, in feed
    self.feed(subobj, prefixes)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/writers/struct.py", line 197, in feed
    self.feed(subobj, prefixes)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/writers/struct.py", line 174, in feed
    self.feed(lead, prefixes)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/writers/struct.py", line 176, in feed
    self.feed(subobj, prefixes)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/xpath/expressions/__init__.py", line 62, in evaluate
    self.compile(compiler)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/xpath/expressions/nodesets.py", line 39, in _make_block
    'STORE_ATTR', 'node',
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/xpath/compiler/assembler.py", line 51, in emit
    add(instr)
MemoryError