> The documents are enormous and have on the order of 30,000 <Article>
> elements 3 levels deep that are the primary target of the
> transformation. The very large documents (15 MB on average) have the
> following structure:
Michael Kay, with Saxon (XSLT 2), estimates about four times the document
size in memory requirements. So < 100 MB isn't too bad?
XSLT 2 has the regex support you need, if you feel you can leave Python
for this job?
HTH
--
Dave Pawson
XSLT XSL-FO FAQ.
Docbook FAQ.
http://www.dpawson.co.uk
--
<MedlineCitation at 0x2be43078: name u'MedlineCitation', 0 namespaces,
2 attributes, 17 children>
It finished without any memory issues.
On Thu, Mar 3, 2011 at 3:11 PM, Uche Ogbuji <uc...@ogbuji.net> wrote:
> I'm wondering whether the memory issue could be with the structwriter,
> rather than pushtree. I've beaten pushtree pretty hard, and I think it
> should be able to handle your level of usage.
> Could you try replacing the handler.handleArticle line in receive_nodes to a
> simple
> print repr(node)
> Or the like to at least make sure pushtree is not the issue? If that does
> run through without memory issues, I have a couple of follow-up ideas.
> --Uche
-- Chime
Ok.
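The suggested diagnostic can be sketched as a generator-based target like the one used elsewhere in this thread. Note that pushtree itself is Amara-specific; the plain feeding loop and the fake node strings below merely stand in for it:

```python
# A stand-in for the suggested diagnostic: a coroutine target that just
# prints each node's repr (and counts them). In the real script, Amara's
# pushtree would call target.send(node) for each matching element.

def coroutine(func):
    """Advance a generator to its first yield so .send() works immediately."""
    def start(*args, **kwargs):
        gen = func(*args, **kwargs)
        next(gen)
        return gen
    return start

@coroutine
def receive_nodes(counter):
    while True:
        node = yield
        counter.append(node)   # count nodes seen before any failure
        print(repr(node))      # minimal handler: proves the driver isn't the issue

seen = []
target = receive_nodes(seen)
for fake_node in ["<MedlineCitation 1>", "<MedlineCitation 2>"]:
    target.send(fake_node)     # pushtree would do this per matched element
target.close()
```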
> The bad news is that I suspect it's structwriter. I don't think that's been
> tested to emit 30K elements, and it sounds like it's failing such a test ;)
> ..snip..
> So you're creating a StringIO buffer to hold the entire XML output. That's
> never going to work unless you've got a mainframe handy ;)
> Just so you know, structwriter with StringIO is almost never needed. Just
> use structencoder instead (it's also in the tutorial). But that would break
> in this case too. What you want to do is write directly to the file, so
> you're buffering on disk, not in memory:
> f = open('output.xml', 'w')
> w = structwriter(indent=u"yes",stream=f)
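The buffer-on-disk advice above can be sketched with the standard library's XMLGenerator; this is a stdlib analogue, not Amara's structwriter API, and the element names are illustrative:

```python
# Write XML incrementally to a file so output accumulates on disk,
# not in an in-memory StringIO buffer.
from xml.sax.saxutils import XMLGenerator

with open('output.xml', 'w', encoding='utf-8') as f:
    w = XMLGenerator(f, encoding='utf-8')
    w.startDocument()
    w.startElement('Articles', {})
    for i in range(3):                      # imagine ~30,000 of these
        w.startElement('Article', {'id': str(i)})
        w.characters('payload %d' % i)
        w.endElement('Article')
    w.endElement('Articles')
    w.endDocument()
```

Each element is flushed to the stream as it is emitted, so memory use stays flat regardless of document size.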
Ahh yes. Thanks. I made the change, so structwriter now takes a file
object instead, as you showed; however, I'm getting the same error,
although the malloc traceback now has only one entry (rather than
hundreds). I also added a counter for the number of elements triggered
by pushtree up to the error, and the final count is 17K in this case.
Python(13743) malloc: *** mmap(size=16777216) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
..snip..
pushtree(doc, u'MedlineCitation', target.send, entity_factory=entity_base)
..snip..
for grantEl in articleNode.xml_select('GrantList/Grant'):
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/xpath/util.py",
line 52, in simple_evaluate
return ctx.evaluate(expr)
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/xpath/__init__.py",
line 224, in evaluate
return parsed.evaluate(self)
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/xpath/expressions/__init__.py",
line 64, in evaluate
docstring=unicode(self))
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/xpath/compiler/__init__.py",
line 45, in compile
firstlineno)
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/xpath/compiler/assembler.py",
line 60, in assemble
stacksize = self._compute_stack_size()
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/xpath/compiler/assembler.py",
line 73, in _compute_stack_size
stack_effect = {
-- Chime
FWIW, I just ran into what appears to be a memory leak with pushtree.
Reported here;
https://foundry.zepheira.com/issues/1294
Mark.
On Fri, Mar 4, 2011 at 9:50 AM, Uche Ogbuji <uc...@ogbuji.net> wrote:
> OK so maybe there is indeed a memory problem with structwriter. If you get
> to it before me could you create a ticket at:
> http://foundry.zepheira.com/projects/amara ?
Okay. I've also been able to confirm a possible memory leak with
pushtree for documents on the order of 10 MB or so. More about this
later.
> My main idea for a workaround in that case is to see if it works to create
> the top element separately from its children. So for example, at the
> beginning of things do:
> f.write('<rdf:RDF...>')
> then at the end do:
> f.write('</rdf:RDF>')
> And instead of structwriter.cofeed, create each separate rdf:Description or
> whatever as if it were a separate top-level document, to that same file
> stream f.
Okay, so write each element under the root into a standalone string
and append that string within the rdf:RDF skeleton?
> So you are not holding one top-level structure open while you add
> 17K+ child objects to it. I hope that makes sense.
I think it does. The first thing that comes to mind is all the
bookkeeping that would no longer be done.
> Not as convenient as
> cofeed, and you might have to pull a few tricks to avoid a mess with output
> namespaces, but it should get you over the memory hump for now.
Yes, this would be a bit tricky, but it's an alternative to my current
workaround of using XSLT, which does seem to be doing well despite the
massive memory usage.
On Tue, Mar 8, 2011 at 3:38 PM, Augusto Herrmann <hell...@gmail.com> wrote:
> If what you want to write are RDF triples in RDF/XML serialization,
> how about instead of using amara's structwriter you use rdflib to hold
> all the triples in a graph and serialize it only at the end?
I did try that, but RDFLib's IOMemory store doesn't seem to be very
efficient for large RDF graphs, and the memory overhead is exactly what
I was hoping to avoid by streaming directly from the source XML to the
resulting RDF/XML.
> And if the triples are too many to hold the whole graph in memory,
> maybe you could send them to a triple store and postpone the RDF/XML
> serialization to the final step.
Most triple stores, unfortunately, require an RDF document on hand in
order to load with maximum efficiency, and live updates tend not to be
very efficient.
-- Chime
In the end, I used a size threshold to decide which XML documents
should be handled by the (presumably DOM-based) Amara parser instead. I
set the value at about 10 MB but still kept getting memory failures,
and eventually had to resort to using the DOM parser alone.
The difference between the two was the following:
@coroutine
def receive_nodes():
    while True:
        sparqlSoln = yield
        handleSoln(sparqlSoln)
    return

if asCoRoutine:
    target = receive_nodes()
    pushtree(sparqlResult, u'result', target.send,
             entity_factory=entity_base)
    target.close()
else:
    print >> sys.stderr, "Parsing size %s SPARQL XML doc as DOM" % (
        len(sparqlResult))
    # Use DOM instead of event-based pushtree (memory issues)
    doc = bindery.parse(sparqlResult,
        prefixes={u'sparql': u'http://www.w3.org/2005/sparql-results#'})
    for soln in doc.xml_select('/sparql:sparql/sparql:results/sparql:result'):
        handleSoln(soln)
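For comparison, a bounded-memory streaming approach can be sketched with the standard library's iterparse, a stdlib analogue to pushtree rather than the Amara API. The element names follow the SPARQL results format used above; the tiny inline document and handleSoln body are illustrative:

```python
import io
import xml.etree.ElementTree as ET

SRX = '{http://www.w3.org/2005/sparql-results#}'

sparql_xml = (
    '<sparql xmlns="http://www.w3.org/2005/sparql-results#">'
    '<results>'
    '<result><binding name="s"><uri>urn:a</uri></binding></result>'
    '<result><binding name="s"><uri>urn:b</uri></binding></result>'
    '</results></sparql>'
)

handled = []

def handleSoln(soln):
    handled.append(soln.find(SRX + 'binding/' + SRX + 'uri').text)

# Stream one <result> at a time; clearing each element after handling it
# means the tree never holds more than one solution's subtree in memory.
for event, elem in ET.iterparse(io.StringIO(sparql_xml), events=('end',)):
    if elem.tag == SRX + 'result':
        handleSoln(elem)
        elem.clear()

# handled is now ['urn:a', 'urn:b']
```

The explicit `elem.clear()` is what keeps memory flat; without it, iterparse still builds the full tree behind the scenes.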
Using just the DOM parser, I ran a large set of jobs again and watched
the memory footprint monotonically increase, but at a slower rate that
seemed to be the result of other data structures I was accumulating.
Eventually, it failed with the following (same) error:
Python(8403) malloc: *** mmap(size=262144) failed (error code=12)
*** error: can't allocate region
*** set a breakpoint in malloc_error_break to debug
Traceback (most recent call last):
..snip..
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/bindery/__init__.py",
line 17, in parse
doc = tree.parse(obj, uri, entity_factory=entity_factory,
standalone=standalone, validate=validate)
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/tree.py",
line 69, in parse
return _parse(inputsource(obj, uri), flags,
entity_factory=entity_factory,rule_handler=rule_handler)
File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/amara/bindery/nodes.py",
line 625, in xml_element_factory
eclass = type(class_name, (self.xml_element_base,),
dict(xml_child_pnames={}))
MemoryError
I searched for information on that particular error and found:
[[[
In your case malloc is attempting to allocate 384kb of memory using mmap
by mapping anonymous memory (swap) into the address space. (The
advantage of doing it this way is that the space can be unmapped when no
longer needed.)
With no appropriate address space left for this chunk, you get the
error. You are either trying to use too much stuff or there is a memory
leak. ]]] -- http://mail.python.org/pipermail/python-list/2009-January/1189377.html
Since the error is the same as the one I was getting with pushtree,
appears to be memory-exhaustion related, and occurs under two different
processing approaches (with the same monotonic increase in memory
usage), this suggests that there is indeed a memory leak, though
perhaps one that only manifests after repeated processing of files
large enough to expose it.
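One way to confirm the kind of monotonic growth described here is to sample the process's peak RSS between batches with the stdlib resource module. This is a diagnostic sketch, not something from the thread, and it is Unix-only:

```python
import resource
import sys

def peak_rss_bytes():
    """Peak resident set size so far. ru_maxrss is reported in bytes
    on macOS but in kilobytes on Linux."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return peak if sys.platform == 'darwin' else peak * 1024

before = peak_rss_bytes()
junk = [b'x' * 1024 for _ in range(10000)]   # stand-in for one parsing batch
del junk                                     # release the per-batch data
after = peak_rss_bytes()
# A leak shows up as the peak ratcheting up batch after batch even
# though each batch's data has been released.
```

Logging this value once per document would distinguish a true leak (peak climbs forever) from normal high-water-mark behavior (peak plateaus after the largest document).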