Asynchronous XML Parsing


MaZderMind

Apr 7, 2011, 2:43:47 PM
to nodejs
Hi

Is there a way to parse XML asynchronously? I.e., I want to process one
XML tag per second:

parser.on('tagStart', function(tag, attr, cb) {
    console.log('currently processing', tag);
    setTimeout(function() {
        console.log('finished processing', tag);
        cb();
    }, 1000);
});

None of the XML parsers seems to support this. In a real-world app I'd
like to replace the setTimeout with some database query.

Peter

Nicholas Campbell

Apr 7, 2011, 3:18:27 PM
to nod...@googlegroups.com, MaZderMind
To do this you'd want to use something SAX-based, not DOM, and trigger a specific event when an element is found.

As for doing it every second... not entirely sure. I don't see the purpose of processing one tag per interval.

- Nick Campbell

http://digitaltumbleweed.com





MaZderMind

Apr 7, 2011, 4:42:03 PM
to nodejs
Hi Nick,

thank you for your answer.


I need something SAX-related, yes, but none of the node XML modules
uses a callback to send the next tag (which would allow async XML
processing); they all rely on synchronous handlers:

https://github.com/robrighter/node-xml/blob/master/example.js
cb.onStartElementNS(function(elem, attrs, prefix, uri, namespaces) {
    sys.puts("=> Started: " + elem + " uri=" + uri + " (Attributes: " +
        JSON.stringify(attrs) + " )");
});

https://github.com/isaacs/sax-js/blob/master/examples/pretty-print.js
printer.onopentag = function (tag) {
    this.indent();
    this.level ++;
    sys.print("<"+tag.name);
    for (var i in tag.attributes) {
        sys.print(" "+i+"=\""+entity(tag.attributes[i])+"\"");
    }
    sys.print(">");
}

https://github.com/astro/node-expat/blob/master/test.js
p.addListener('startElement', function(name, attrs) {
    evs_received.push(['startElement', name, attrs]);
});

https://github.com/shimondoodkin/node-expat/blob/master/test.js
p.addListener('startElement', function(name, attrs) {
    evs_received.push(['startElement', name, attrs]);
});


As for doing it every second: that was just an example. In a real-world
application the timeout would be replaced with a database query.
And I'm talking about gigabytes of XML
(http://planet.osm.org/full-experimental/), so I can't read all tags
first and only then start processing the database queries.

Peter


Matt

Apr 7, 2011, 4:49:05 PM
to nod...@googlegroups.com, MaZderMind
What you actually want is a pull parser by the sounds of things.


?

MaZderMind

Apr 7, 2011, 4:59:26 PM
to nodejs
Yes, a pull parser would do, too:

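// parser.next() is a hypothetical pull API: it returns the next tag,
// or a falsy value once the input is exhausted.
function workWithTag(tag)
{
    if (!tag) return; // end of input, nothing left to process

    console.log('currently processing', tag);
    setTimeout(function() {
        console.log('finished processing', tag);

        // only now ask the parser for the next tag
        workWithTag(parser.next());
    }, 1000);
}
workWithTag(parser.next());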

Now we need a port of expat with a pull interface. I'll give it a try,
thank you!

Peter

MaZderMind

Apr 7, 2011, 6:08:06 PM
to nodejs
I forked the node-expat lib and added stop/resume methods. This solves
this issue. I sent a pull request to astro.

https://github.com/MaZderMind/node-expat

Peter

Laurie Harper

Apr 7, 2011, 6:50:21 PM
to nod...@googlegroups.com
On 2011-04-07, at 4:42 PM, MaZderMind wrote:
> I need something SAX related, yes, but none of the node/xml modules
> uses a callback to send the next tag (which would allow async xml
> processing) but rely on synchronous handlers:


All your examples showed tags being passed via a callback... Do you mean you need a parser that doesn't send you the next tag until you ask for it?

Unless you're working with very large XML documents, you can achieve that easily enough by buffering the start-tag events in a queue and processing the queue sequentially. And if you are working with large documents, you can just pause/resume the input stream you're reading the XML from. No need for pause/resume support in the parser on top of that.
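A minimal sketch of the queue approach, assuming nothing about any particular XML module: the synchronous handler only pushes each tag into a queue, and a worker handles one item at a time. The names `createSerialQueue`, `worker` and `onIdle` are made up for illustration, and the `setTimeout` stands in for a database query.

```javascript
// Buffer incoming items and run one asynchronous worker call at a time.
// `worker(item, done)` must call done() when it has finished with the item.
// `onIdle` (optional) is called whenever the queue has run empty.
function createSerialQueue(worker, onIdle) {
    const pending = [];
    let busy = false;

    function next() {
        if (busy) return;                 // a worker call is still running
        if (pending.length === 0) {
            if (onIdle) onIdle();
            return;
        }
        busy = true;
        const item = pending.shift();
        worker(item, function done() {
            busy = false;
            next();                       // continue with the next buffered item
        });
    }

    return {
        push: function (item) {
            pending.push(item);
            next();
        }
    };
}

// Usage: a synchronous SAX-style handler would only call queue.push(tag).
const processed = [];
const queue = createSerialQueue(function (tag, done) {
    setTimeout(function () {              // stand-in for a database query
        processed.push(tag);
        done();
    }, 10);
}, function () {
    console.log('processed in order:', processed.join(', '));
});
['osm', 'node', 'way'].forEach(function (tag) { queue.push(tag); });
```

The queue itself is unbounded, so on its own it only suits documents whose events fit in memory; for larger inputs the source still has to be slowed down somehow.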
--
Laurie Harper
http://laurie.holoweb.net/

Matt

Apr 7, 2011, 7:29:45 PM
to nod...@googlegroups.com, Laurie Harper
It depends on the XML parser - if it takes a filename (rather than a "parse_more" type interface) then you have no control over how fast it parses it.





mscdex

Apr 7, 2011, 10:18:16 PM
to nodejs
On Apr 7, 4:59 pm, MaZderMind <goo...@mazdermind.de> wrote:
> Yes, a pull parser would do, too:

FWIW libxmljs has a push parser and it works very well. I use it any
time I need to parse XML async.

MaZderMind

Apr 8, 2011, 6:56:21 AM
to nodejs
Yes, I work with very large (>50 GB in a single file) OpenStreetMap
dumps (linked above). Pausing the input stream is a possibility, but
when pausing stdin, some hundred elements are still processed before it
really halts. This requires me to send a hundred parallel (!) queries
to the database before I can write back the read XML elements.
Nevertheless, expat offers the functionality to stop/resume the parser
events, and I successfully integrated it into the nodejs binding.
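
One plausible reason for the overrun (an assumption, not something measured in this thread): a push parser emits every event of a chunk synchronously, so pausing the input inside a handler can only hold back further chunks. The toy below illustrates this; `createToyPushParser` and `createToySource` are invented stand-ins that do no real XML parsing or stream I/O.

```javascript
const { EventEmitter } = require('events');

// Toy stand-in for a push parser: NOT real XML parsing, it only emits
// one 'startElement' event per opening tag found in a chunk.
function createToyPushParser() {
    const parser = new EventEmitter();
    parser.write = function (chunk) {
        const re = /<([A-Za-z][\w-]*)/g;
        let m;
        while ((m = re.exec(chunk)) !== null) {
            parser.emit('startElement', m[1]);
        }
    };
    return parser;
}

// Toy stand-in for an input stream that delivers pre-cut chunks.
function createToySource(chunks) {
    let paused = false;
    return {
        pause: function () { paused = true; },
        isPaused: function () { return paused; },
        pump: function (onChunk) {
            for (const chunk of chunks) {
                if (paused) break;        // pausing only stops further chunks
                onChunk(chunk);
            }
        }
    };
}

const toyParser = createToyPushParser();
const toySource = createToySource([
    '<osm><node/><node/><node/>',
    '<node/><node/></osm>'
]);

let eventsAfterPause = 0;
toyParser.on('startElement', function () {
    if (toySource.isPaused()) eventsAfterPause++;
    toySource.pause();                    // ask the source to stop at once
});
toySource.pump(function (chunk) { toyParser.write(chunk); });

console.log('elements delivered after pause():', eventsAfterPause);
```

In this toy the second chunk is held back, but the rest of the first chunk still arrives, which is why stopping inside the parser itself gives finer control than pausing the source.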

Astro is on his way to release a new version of the node-expat module
to npm.

Peter

MaZderMind

Apr 8, 2011, 6:57:48 AM
to nodejs
> > Yes, a pull parser would do, too:
>
> FWIW libxmljs has a push parser and it works very well. I use it any
> time I need to parse XML async.
There are a lot of *push* parsers for nodejs but no *pull* parser.
But it's possible to build one on the basis of the stop/resume methods
of the new node-expat.
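
A rough sketch of that idea, assuming only that the parser emits 'startElement' and offers stop() and resume(), and that stop() takes effect before the next 'startElement' is emitted. `handleOneAtATime` and `createToyStoppableParser` are hypothetical names; the toy parser replays a fixed list of element names so the sketch runs without any native module.

```javascript
const { EventEmitter } = require('events');

// Stop the parser when an element arrives and resume it only after the
// asynchronous handler has called back, so elements are handled one by one.
function handleOneAtATime(parser, handler) {
    parser.on('startElement', function (name, attrs) {
        parser.stop();
        handler(name, attrs, function done() {
            parser.resume();
        });
    });
}

// Toy stand-in for a stoppable parser (illustration only).
function createToyStoppableParser(names) {
    const parser = new EventEmitter();
    let index = 0;
    let stopped = false;

    function run() {
        while (!stopped && index < names.length) {
            parser.emit('startElement', names[index++], {});
        }
        if (!stopped && index === names.length) {
            index++;                      // make sure 'end' fires only once
            parser.emit('end');
        }
    }

    parser.stop = function () { stopped = true; };
    parser.resume = function () { stopped = false; run(); };
    parser.start = run;
    return parser;
}

// Usage: each element is finished before the next one is started.
const order = [];
const stoppable = createToyStoppableParser(['osm', 'node', 'way']);
handleOneAtATime(stoppable, function (name, attrs, done) {
    order.push('start ' + name);
    setTimeout(function () {              // stand-in for a database query
        order.push('finish ' + name);
        done();
    }, 10);
});
stoppable.on('end', function () {
    console.log(order.join('\n'));
});
stoppable.start();
```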

Peter