Fast XML parser?

Octavian Rasnita

Oct 25, 2012, 7:33:15 AM

to begi...@perl.org

Hi,

Can you recommend an XML parser which is faster than XML::Twig?

I need an XML parser that can parse XML files chunk by chunk and that works much faster than XML::Twig. I tried that module, but it is very slow.

I tried something like the code below. I also tried a version that just opens the file and parses it using regular expressions; the inelegant regexp version is 25 times faster than the one that uses XML::Twig, and it also uses less memory.

If you think there is a module for parsing XML which would work faster than regular expressions, or if I can substantially improve the program which uses XML::Twig then please tell me about it. If regexp will still be faster, I will use regexp.

Thanks.

use strict;
use warnings;
use XML::Twig;

my $xml = 'path/to/xml/file.xml';

my $t = XML::Twig->new( twig_handlers => {
    Lexem => sub {
        my ( $t, $lexem ) = @_;

        my $id             = $lexem->att( 'id' );
        my $timestamp      = $lexem->first_child( 'Timestamp' )->text;
        my $lexem_text     = $lexem->first_child( 'Form' )->text;
        my @inflected_form = $lexem->children( 'InflectedForm' );

        for my $inflected_form ( @inflected_form ) {
            my $inflection_id   = $inflected_form->first_child( 'InflectionId' )->text;
            my $inflection_text = $inflected_form->first_child( 'Form' )->text;
        }

        # Free the memory used by everything parsed so far.
        $t->purge;

        return 1;
    },
} );

# safe_parsefile traps parse errors and returns false, leaving the message in $@.
$t->safe_parsefile( $xml ) or die "Cannot parse $xml: $@";
$t->purge;

--Octavian

Shlomi Fish

Oct 25, 2012, 8:07:08 AM

to Octavian Rasnita, begi...@perl.org

Hi Octavian,

On Thu, 25 Oct 2012 14:33:15 +0300
"Octavian Rasnita" <oras...@gmail.com> wrote:

> Hi,
>
> Can you recommend an XML parser which is faster than XML::Twig?
>
> I need to use an XML parser that can parse the XML files chunk by chunk and
> which works faster (much faster) than XML::Twig, because I tried using this
> module but it is very slow.

XML::LibXML contains several event-based parsers including the SAX parser and
the pull-parser. Can you try using them?
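For reference, the SAX interface mentioned above can be sketched roughly as follows. This is only a minimal illustration under stated assumptions: the handler class name LexemHandler and the inline sample document are made up for the example, and it collects just the id and the top-level <Form> of each <Lexem>. Since every event is dispatched to a Perl callback, it is worth benchmarking before expecting it to beat XML::Twig.

```perl
use strict;
use warnings;

# Hypothetical handler class (the name is an assumption for this sketch).
package LexemHandler;
use parent 'XML::SAX::Base';

sub start_element {
    my ( $self, $el ) = @_;
    push @{ $self->{path} }, $el->{Name};
    if ( $el->{Name} eq 'Lexem' ) {
        # Attributes without a namespace are keyed as '{}name'.
        $self->{current} = { id => $el->{Attributes}{'{}id'}{Value}, Form => '' };
    }
}

sub characters {
    my ( $self, $data ) = @_;
    my @path = @{ $self->{path} || [] };
    # Collect only the text of a <Form> directly under <Lexem>.
    if ( @path >= 2 && $path[-1] eq 'Form' && $path[-2] eq 'Lexem' ) {
        $self->{current}{Form} .= $data->{Data};
    }
}

sub end_element {
    my ( $self, $el ) = @_;
    pop @{ $self->{path} };
    if ( $el->{Name} eq 'Lexem' ) {
        push @{ $self->{lexems} }, delete $self->{current};
    }
}

package main;
use XML::LibXML::SAX;

# Tiny sample in the shape of the file discussed in this thread.
my $sample = '<Lexems><Lexem id="1"><Timestamp>1346826989</Timestamp><Form>aa</Form>'
           . '<InflectedForm><InflectionId>84</InflectionId><Form>aa</Form></InflectedForm>'
           . '</Lexem></Lexems>';

my $handler = LexemHandler->new;
XML::LibXML::SAX->new( Handler => $handler )->parse_string($sample);

print "$_->{id}: $_->{Form}\n" for @{ $handler->{lexems} };
```

For a real file, parse_uri or parse_file on the same parser object would replace parse_string.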

Regards,

Shlomi Fish

--
-----------------------------------------------------------------
Shlomi Fish http://www.shlomifish.org/
Interview with Ben Collins-Sussman - http://shlom.in/sussman

Modern Perl — the 3‐D Movie. In theatres near you.

Please reply to list if it's a mailing list post - http://shlom.in/reply .

Michiel Beijen

Oct 25, 2012, 11:30:11 AM

to Octavian Rasnita, begi...@perl.org

Hi Octavian,

On Thu, Oct 25, 2012 at 1:33 PM, Octavian Rasnita <oras...@gmail.com> wrote:

> Can you recommend an XML parser which is faster than XML::Twig?

Did you try XML::LibXML ?
https://www.metacpan.org/module/XML::LibXML

--
Michiel

Michiel Beijen

Oct 25, 2012, 11:33:01 AM

to Octavian Rasnita, begi...@perl.org

I'm sorry, I did not see Shlomi's reply, it was in my spam folder for
some reason.

Octavian Rasnita

Oct 28, 2012, 11:45:15 AM

to Shlomi Fish, begi...@perl.org

From: "Shlomi Fish" <shl...@shlomifish.org>

> Hi Octavian,
>
> On Thu, 25 Oct 2012 14:33:15 +0300
> "Octavian Rasnita" <oras...@gmail.com> wrote:
>
> > Hi,
> >
> > Can you recommend an XML parser which is faster than XML::Twig?
> >
> > I need to use an XML parser that can parse the XML files chunk by chunk
> > and which works faster (much faster) than XML::Twig, because I tried using
> > this module but it is very slow.
>
> XML::LibXML contains several event-based parsers including the SAX parser
> and the pull-parser. Can you try using them?
>
> Regards,
>
> Shlomi Fish

Hi Shlomi,

I tried to use XML::LibXML::Reader, which uses the pull parser, and I read
that:

""
However, it is also possible to mix Reader with DOM. At every point the
user may copy the current node (optionally expanded into a complete
sub-tree) from the processed document to another DOM tree, or to
instruct the Reader to collect sub-document in form of a DOM tree
""

So I tried:

use XML::LibXML::Reader;

my $xml = 'path/to/xml/file.xml';

my $reader = XML::LibXML::Reader->new( location => $xml )
    or die "cannot read $xml";

while ( $reader->nextElement( 'Lexem' ) ) {
    my $id = $reader->getAttribute( 'id' );    # works fine

    my $doc = $reader->document;

    my $timestamp  = $doc->getElementsByTagName( 'Timestamp' );    # doesn't work well
    my @lexem_text = $doc->getElementsByTagName( 'Form' );         # doesn't work well
}

So I could read the attribute, but I couldn't get the sub-elements: for
example, when I printed $timestamp, it sometimes printed its value two or
three times in a row.
I couldn't find an example of using XML::LibXML to read an XML file element
by element and then read each element's child elements directly.

The XML I want to parse looks like the one below. It is just much bigger.
I want to read one by one each <Lexem> element (and I've done this
successfully), then read its id attribute (also done this well), but I also
want to read its sub elements, using something like:

$reader->read_some_element('Form')
or
$reader->{Form}

which should read just the <Form> element directly below the <Lexem> element,
but not the <Form> elements below <InflectedForm>.

and then read the elements under the <InflectedForm> element using something
like:

$reader->read_another_element( '/InflectedForm/Form' )
or like
$reader->{InflectedForm}{Form}

or using the $doc object...

I tried to use a lot of methods for reading the elements of the current
<Lexem> element, but with no good results.

<?xml version="1.0" encoding="UTF-8"?>
<Lexems>
<Lexem id="1">
<Timestamp>1346826989</Timestamp>
<Form>aa</Form>
<InflectedForm>
<InflectionId>84</InflectionId>
<Form>aa</Form>
</InflectedForm>
</Lexem>
<Lexem id="2">
<Timestamp>1346826989</Timestamp>
<Form>aaa</Form>
<InflectedForm>
<InflectionId>84</InflectionId>
<Form>aaa</Form>
</InflectedForm>
</Lexem>
<Lexem id="3">
<Timestamp>1346826989</Timestamp>
<Form>aaleni'an</Form>
<InflectedForm>
<InflectionId>25</InflectionId>
<Form>aaleni'an</Form>
</InflectedForm>
<InflectedForm>
<InflectionId>26</InflectionId>
<Form>aaleni'an</Form>
</InflectedForm>
</Lexem>
</Lexems>

Thanks.

Octavian.

Shlomi Fish

Oct 29, 2012, 4:09:53 AM

to begi...@perl.org

Hi Octavian,

On Sun, 28 Oct 2012 17:45:15 +0200
"Octavian Rasnita" <oras...@gmail.com> wrote:

> From: "Shlomi Fish" <shl...@shlomifish.org>
>
> Hi Octavian,
>
>
>

> Hi Shlomi,
>
> I tried to use XML::LibXML::Reader which uses the pool parser, and I read
> that:
>
> ""
> However, it is also possible to mix Reader with DOM. At every point the
> user may copy the current node (optionally expanded into a complete
> sub-tree) from the processed document to another DOM tree, or to
> instruct the Reader to collect sub-document in form of a DOM tree
> ""
>
> So I tried:
>
> use XML::LibXML::Reader;
>
> my $xml = 'path/to/xml/file.xml';
>
> my $reader = XML::LibXML::Reader->new( location => $xml ) or die "cannot
> read $xml";
>
> while ( $reader->nextElement( 'Lexem' ) ) {
> my $id = $reader->getAttribute( 'id' ); #works fine
>
> my $doc = $reader->document;
>
> my $timestamp = $doc->getElementsByTagName( 'Timestamp' ); #Doesn't work
> well
> my @lexem_text = $doc->getElementsByTagName( 'Form' ); #Doesn't work
> fine
>
> }
>

I'm not sure you should do ->document. I cannot tell you off-hand how to do it
right, but I can try to investigate when I have some spare cycles.

Regards,

Shlomi Fish

--
-----------------------------------------------------------------
Shlomi Fish http://www.shlomifish.org/

Funny Anti-Terrorism Story - http://shlom.in/enemy

What does “IDK” stand for? I don’t know.

Shlomi Fish

Oct 29, 2012, 4:18:48 AM

to begi...@perl.org

OK, after a short amount of investigation, I found that this program works:

[CODE]

use strict;
use warnings;

use XML::LibXML::Reader;

my $xml = 'Lexems.xml';

my $reader = XML::LibXML::Reader->new( location => $xml )
    or die "cannot read $xml";

while ( $reader->nextElement( 'Lexem' ) ) {
    my $id = $reader->getAttribute( 'id' );    # works fine

    # Deep copy of the current <Lexem> element as a DOM node.
    my $doc = $reader->copyCurrentNode(1);

    my $timestamp = $doc->getElementsByTagName( 'Timestamp' );

    my @lexem_text = $doc->getElementsByTagName( 'Form' );
}

[/CODE]

Note that you can also use XPath for looking up XML information.
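As a rough sketch of the XPath route (not a benchmarked solution), relative expressions such as './Form' select only direct children of the copied node, so the <Form> under <Lexem> is kept apart from the ones under <InflectedForm>. The function name lexems_via_xpath, the hash layout and the inline sample are assumptions made for this example.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical helper: returns one hash per <Lexem>, read via relative XPath.
sub lexems_via_xpath {
    my ($reader) = @_;
    my @lexems;

    while ( $reader->nextElement('Lexem') ) {
        # Deep copy of the current <Lexem>, so XPath can run on its subtree.
        my $node = $reader->copyCurrentNode(1);

        push @lexems, {
            id        => $node->getAttribute('id'),
            timestamp => $node->findvalue('./Timestamp'),
            form      => $node->findvalue('./Form'),    # direct child only
            inflected => [
                map {
                    +{
                        id   => $_->findvalue('./InflectionId'),
                        form => $_->findvalue('./Form'),
                    }
                } $node->findnodes('./InflectedForm')
            ],
        };
    }
    return \@lexems;
}

# Small sample in the shape of the file from this thread.
my $sample = <<'XML';
<Lexems>
<Lexem id="3">
<Timestamp>1346826989</Timestamp>
<Form>aaleni'an</Form>
<InflectedForm>
<InflectionId>25</InflectionId>
<Form>aaleni'an</Form>
</InflectedForm>
<InflectedForm>
<InflectionId>26</InflectionId>
<Form>aaleni'an</Form>
</InflectedForm>
</Lexem>
</Lexems>
XML

my $lexems = lexems_via_xpath( XML::LibXML::Reader->new( string => $sample ) );
print "$_->{id}: $_->{form}\n" for @$lexems;
```

For a file on disk, the reader would be created with location => $xml as in the program above.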

Regards,

Shlomi Fish

--
-----------------------------------------------------------------
Shlomi Fish http://www.shlomifish.org/

List of Text Processing Tools - http://shlom.in/text-proc

Sophie: Let’s suppose you have a table with 2^n cups…
Jack: Wait a second! Is ‘n’ a natural number?

Rob Coops

Oct 29, 2012, 4:41:05 AM

to begi...@perl.org

> --
> To unsubscribe, e-mail: beginners-...@perl.org
> For additional commands, e-mail: beginne...@perl.org
> http://learn.perl.org/
>
>
>
A little late, I know, but still...

Last year I was asked to process a large number of XML files: two sets of
1.6M files that needed to be compared element by element and that, with some
fuzzy logic, had to come out the same. Things like floating-point precision
could change (1.00 = 1), and in some cases data could show up in a different
order (repeating elements for multiple items on an order). The whole idea
was that system A, which took flat text output from a mainframe and
translated it to XML for consumption by a web service, was being replaced by
system B, which did the same thing on an entirely different software stack.

Of course this needed to go as fast as possible, as we simply could not sit
around for a few days while the computer did its thing. LibXML was my
saviour, and using XPath was the fastest solution. Though it is possible to
do the DOM thing, you end up with the DOM being translated to XPath under
the hood (at least the performance seemed to indicate that). After a lot of
testing with pretty much every XML parser I could find, LibXML with XPath
was really the fastest.
If you are going for speed, you will want to avoid any copy operations you
can and use references as much as possible. Even though a memory copy of
some 100 bytes is a very fast operation, over a few million files the little
time it takes adds up to a lot longer than you would like.

When you are looking at speed first and foremost, try to avoid anything
that would slow you down. A copy of information is slow, so don't do it if
you can avoid it. A reference to a memory location is slightly harder to
work with in programming but a lot faster. A translation from DOM to XPath
would take you time to do, and the computer needs that time too. If it is
pure speed you are after, avoid this as well.
Once you are sure you are as fast as you can be, add a benchmark to the code
and try individual optimisations that might or might not be faster... you
would be surprised how the perl internals are sometimes a lot faster with
some operations than with others, even though intuitively you would not
have expected this to be the case.

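A benchmark of the kind described above could be set up with the core Benchmark module, roughly as in this sketch. The two candidate functions (form_via_regex and form_via_xpath) and the one-element sample are assumptions chosen only to show the mechanics; real measurements should use representative data.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);
use XML::LibXML;

# One small record in the shape of the file from this thread.
my $record = '<Lexem id="1"><Timestamp>1346826989</Timestamp><Form>aa</Form></Lexem>';

# Candidate 1 (hypothetical): pull the value out with a regular expression.
sub form_via_regex {
    my ($xml) = @_;
    my ($form) = $xml =~ m{<Form>(.*?)</Form>}s;
    return $form;
}

# Candidate 2 (hypothetical): parse the record and look the value up with XPath.
sub form_via_xpath {
    my ($xml) = @_;
    return XML::LibXML->load_xml( string => $xml )->findvalue('/Lexem/Form');
}

# Run each candidate for at least one CPU second and print a comparison table.
cmpthese( -1, {
    regex => sub { form_via_regex($record) },
    xpath => sub { form_via_xpath($record) },
} );
```

Checking that both candidates return the same value before timing them avoids comparing code that does different work.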
In my case, as it was a once-in-25-years kind of major change, I didn't do
too much benchmarking, since the code would be discarded at the end of the
project (well, stored in a dusty old SVN repository for others to reuse and,
realistically, never to be looked at again). I got it to go fast enough for
a regular run of 1.6M files on a daily basis for as long as the project
needed to feel comfortable enough with the tests to go to production. But if
your code is to live a longer life, more benchmarking becomes worth the
effort with every few additional months that the code is expected to be
around and in regular use.

Regards,

Rob

Octavian Rasnita

Oct 29, 2012, 8:29:21 AM

to Rob Coops, begi...@perl.org

From: "Rob Coops" <rco...@gmail.com>

On Mon, Oct 29, 2012 at 9:18 AM, Shlomi Fish <shl...@shlomifish.org> wrote:

> On Mon, 29 Oct 2012 10:09:53 +0200
> Shlomi Fish <shl...@shlomifish.org> wrote:
>
> > Hi Octavian,
> > On Sun, 28 Oct 2012 17:45:15 +0200
> > "Octavian Rasnita" <oras...@gmail.com> wrote:
> > > From: "Shlomi Fish" <shl...@shlomifish.org>
> > > Hi Octavian,

...

> OK, after a short amount of investigation, I found that this program works:
>
> [CODE]
>
> use strict;
> use warnings;
>
> use XML::LibXML::Reader;
>
> my $xml = 'Lexems.xml';
>
> my $reader = XML::LibXML::Reader->new( location => $xml ) or die "cannot
> read
> $xml";
>
> while ( $reader->nextElement( 'Lexem' ) ) {
> my $id = $reader->getAttribute( 'id' ); #works fine
>
> my $doc = $reader->copyCurrentNode(1);
> my $timestamp = $doc->getElementsByTagName( 'Timestamp' );
> my @lexem_text = $doc->getElementsByTagName( 'Form' );
> }
>
> [/CODE]
>
> Note that you can also use XPath for looking up XML information.
>
> Regards,
>
> Shlomi Fish
> --
> -----------------------------------------------------------------
> Shlomi Fish http://www.shlomifish.org/

> A little late I know but still...

Unfortunately it is not so late. :-)

> LibXML was my saviour and using XPath was the fastest solution. Though it is possible to
> do the DOM thing you end up with the DOM being translated to XPath under
> the hood (at least the performance seemed to indicate that). After a lot of
> testing and using pretty much any XML parser I could find using LibXML and
> XPath was really the fastest.
> If you are going for speed then you will want to avoid any copy operations
> you can and you will want to as much as possible use references. Because
> even though a memory copy of some 100 bytes is a very fast operation on a
> few million files the the little time it takes kind of adds up to a lot
> longer then you would like it to.

**
Can you give me, or point me to, some examples of using XPath with XML::LibXML?

I tried to use XML::XPath, but it tries to load the entire document into memory, so it is not a good option.

Octavian

Octavian Rasnita

Oct 29, 2012, 8:32:11 AM

to Shlomi Fish, begi...@perl.org

From: "Shlomi Fish" <shl...@shlomifish.org>

I followed the way you suggested, and it works fine, however it is very slow.

I've done:

while ( $reader->nextElement( 'Lexem' ) ) {
    my $id = $reader->getAttribute( 'id' );

    my $doc = $reader->copyCurrentNode(1);

    my $timestamp  = $doc->findnodes( 'Timestamp' );
    my $lexem_text = $doc->findnodes( 'Form' );

    my $inflected_forms = $doc->findnodes( 'InflectedForm' );

    for my $inflected_form ( $inflected_forms->get_nodelist ) {
        my $inflection_id  = $inflected_form->findnodes( './InflectionId' );
        my $inflection_dia = $inflected_form->findnodes( './Form' );
    }
}

I tried to find a way of using XPath but I couldn't find a good one, and it seems that copy of that node takes a pretty long time.
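One alternative that avoids copying each node is to stay in the Reader and collect values event by event. The following is only a sketch under assumptions: the function name stream_lexems and the hash layout are invented for the example, it assumes the element names from the sample file, and it skips empty elements. Whether it is actually faster than copyCurrentNode would need to be measured on the real file.

```perl
use strict;
use warnings;
use XML::LibXML::Reader qw(:types);

# Hypothetical helper: builds one plain hash per <Lexem> from Reader events
# and hands it to the callback, without copying nodes into a DOM.
sub stream_lexems {
    my ( $reader, $on_lexem ) = @_;
    my ( $lexem, $inflected, $field );

    while ( $reader->read ) {
        my $type = $reader->nodeType;

        if ( $type == XML_READER_TYPE_ELEMENT ) {
            next if $reader->isEmptyElement;    # no end event follows these
            my $name = $reader->name;
            if ( $name eq 'Lexem' ) {
                $lexem = { id => $reader->getAttribute('id'), InflectedForm => [] };
                undef $field;
            }
            elsif ( $name eq 'InflectedForm' ) {
                $inflected = {};
                undef $field;
            }
            else {
                $field = $name;    # Timestamp, Form or InflectionId
            }
        }
        elsif ( $type == XML_READER_TYPE_TEXT ) {
            next unless $lexem && defined $field;
            # Text belongs to the open <InflectedForm> if there is one.
            my $target = $inflected || $lexem;
            $target->{$field} .= $reader->value;
        }
        elsif ( $type == XML_READER_TYPE_END_ELEMENT ) {
            my $name = $reader->name;
            if ( $name eq 'InflectedForm' ) {
                push @{ $lexem->{InflectedForm} }, $inflected;
                undef $inflected;
            }
            elsif ( $name eq 'Lexem' ) {
                $on_lexem->($lexem);
                undef $lexem;
            }
            else {
                undef $field;
            }
        }
    }
    return;
}

# Small sample in the shape of the file from this thread.
my $sample = <<'XML';
<Lexems>
<Lexem id="3">
<Timestamp>1346826989</Timestamp>
<Form>aaleni'an</Form>
<InflectedForm>
<InflectionId>25</InflectionId>
<Form>aaleni'an</Form>
</InflectedForm>
<InflectedForm>
<InflectionId>26</InflectionId>
<Form>aaleni'an</Form>
</InflectedForm>
</Lexem>
</Lexems>
XML

my @seen;
stream_lexems( XML::LibXML::Reader->new( string => $sample ), sub { push @seen, shift } );
printf "%s: %s (%d inflected forms)\n",
    $_->{id}, $_->{Form}, scalar @{ $_->{InflectedForm} } for @seen;
```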

Octavian

Jenda Krynicky

Oct 30, 2012, 6:09:06 PM

to begi...@perl.org

From: "Octavian Rasnita" <oras...@gmail.com>
To: <begi...@perl.org>
Subject: Fast XML parser?
Date sent: Thu, 25 Oct 2012 14:33:15 +0300

> Hi,
>
> Can you recommend an XML parser which is faster than XML::Twig?
>
> I need to use an XML parser that can parse the XML files chunk by chunk and which works faster (much faster) than XML::Twig, because I tried using this module but it is very slow.
>
> I tried something like the code below, but I have also tried a version
> that just opens the file and parses it using regular expressions,
> however the unelegant regexp version is 25 times faster than the one
> which uses XML::Twig, and it also uses less memory.
>
> If you think there is a module for parsing XML which would work faster
> than regular expressions, or if I can substantially improve the
> program which uses XML::Twig then please tell me about it. If regexp
> will still be faster, I will use regexp.

You did not specify what you want to do with the lexemes. Anyway,
you might try something like this:

use strict;
use warnings;
use XML::Rules;
use Data::Dumper;

my $parser = XML::Rules->new(
    stripspaces => 7,
    rules => {
        _default      => 'content',
        InflectedForm => 'as array',
        Lexem => sub {
            #print Dumper($_[1]);
            print "$_[1]->{Form}\n";
            foreach (@{$_[1]->{InflectedForm}}) {
                print " $_->{InflectionId}: $_->{Form}\n";
            }
        },
    }
);

$parser->parse(\*DATA);

# The XML declaration must be the very first thing after __DATA__.
__DATA__
<?xml version="1.0" encoding="UTF-8"?>
<Lexems>
<Lexem id="1">

...

XML::Rules sits on top of XML::Parser::Expat so I would not expect
this to be 25 times faster than XML::Twig, but it might be a bit
quicker. Or not.

Jenda
===== Je...@Krynicky.cz === http://Jenda.Krynicky.cz =====
When it comes to wine, women and song, wizards are allowed
to get drunk and croon as much as they like.
-- Terry Pratchett in Sourcery

Octavian Rasnita

Oct 31, 2012, 2:29:50 AM

to Jenda Krynicky, begi...@perl.org

From: "Jenda Krynicky" <Je...@Krynicky.cz>

Hi Jenda,

I tried your program above, modified as below, but it gives the error:

Free to wrong pool 3967d8 not 20202020 at e:/usr/lib/XML/Parser/Expat.pm line 470.

I was able to install XML::Rules under Windows using cpanm with no problems, so it should be working...

The program:

use strict;
use XML::Rules;
use Data::Dumper;

my $parser = XML::Rules->new(
    stripspaces => 7,
    rules => {
        _default      => 'content',
        InflectedForm => 'as array',
        Lexem => sub {
            #print Dumper($_[1]);

            #print "$_[1]->{Form}\n";

            foreach (@{$_[1]->{InflectedForm}}) {

                #print " $_->{InflectionId}: $_->{Form}\n";
            }
        },
    }
);

my $file = '/path/to/file.xml';

open my $xml, '<:utf8', $file or die "Cannot open $file: $!";

$parser->parse( $xml );

Thanks.

Octavian

Octavian Rasnita

Oct 31, 2012, 3:19:25 AM

to Jenda Krynicky, begi...@perl.org

From: "Jenda Krynicky" <Je...@Krynicky.cz>

I forgot to say that the script I previously sent to the list also crashed Perl and it popped an error window with:

perl.exe - Application Error
The instruction at "0x7c910f20" referenced memory at "0x00000004". The memory could not be "read". Click on OK to terminate the program

I created a smaller XML file with only ~100 lines and ran that script again, and it worked fine.

But it doesn't work with the entire XML file, which is larger than 200 MB: it crashes Perl and I don't know why.

Strangely, I've seen that it now just crashes Perl without returning that "Free to wrong pool" error.

Octavian

Jenda Krynicky

Oct 31, 2012, 12:39:51 PM

to begi...@perl.org

From: "Octavian Rasnita" <oras...@gmail.com>

> I forgot to say that the script I previously sent to the list also crashed Perl and it popped an error window with:
>
> perl.exe - Application Error
> The instruction at "0x7c910f20" referenced memory at "0x00000004". The memory could not be "read". Click on OK to terminate the program
>
> I have created a smaller XML file with only ~ 100 lines and I ran agan that script, and it worked fine.
>
> But it doesn't work with the entire xml file which has more than 200 MB, because it crashes Perl and I don't know why.
>
> And strange, but I've seen that now it just crashes Perl, but it doesn't return that "Free to wrong pool" error.
>
> Octavian

That must be something either within your perl or the
XML::Parser::Expat. What versions of those two do you have? Any
chance you could update?

Rob Coops

Oct 31, 2012, 12:48:07 PM

to begi...@perl.org

The memory issue is really an issue with the module itself. I have had those
problems as well: the more complex the XML structure, the more memory it
takes up and the sooner you run out. I simply moved on to other modules, as
I could not afford to spend my time trying to figure out a workaround.

Regards,

Rob Coops

Octavian Rasnita

Oct 31, 2012, 2:02:12 PM

to Jenda Krynicky, begi...@perl.org

From: "Jenda Krynicky" <Je...@Krynicky.cz>

> From: "Octavian Rasnita" <oras...@gmail.com>
>> I forgot to say that the script I previously sent to the list also
>> crashed Perl and it popped an error window with:
>>
>> perl.exe - Application Error
>> The instruction at "0x7c910f20" referenced memory at "0x00000004". The
>> memory could not be "read". Click on OK to terminate the program
>>
>> I have created a smaller XML file with only ~ 100 lines and I ran agan
>> that script, and it worked fine.
>>
>> But it doesn't work with the entire xml file which has more than 200 MB,
>> because it crashes Perl and I don't know why.
>>
>> And strange, but I've seen that now it just crashes Perl, but it doesn't
>> return that "Free to wrong pool" error.
>>
>> Octavian
>
> That must be something either within your perl or the
> XML::Parser::Expat. What versions of those two do you have? Any
> chance you could update?
>
>
> Jenda

> perl -v
This is perl 5, version 14, subversion 2 (v5.14.2) built for
MSWin32-x86-multi-thread
(with 1 registered patch, see perl -V for more detail)
Copyright 1987-2011, Larry Wall
Binary build 1402 [295342] provided by ActiveState
http://www.ActiveState.com
Built Oct 7 2011 15:49:44
...

> cpanm XML::Parser::Expat
Set up gcc environment - 3.4.5 (mingw-vista special r3)
XML::Parser::Expat is up to date. (2.41)

I think Perl is also new enough...

Anyway, I solved the problem by parsing the XML content using regular
expressions, and it works very fast this way.
And the regexp solution is neither uglier nor harder to maintain than using
an XML parser...

Octavian
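The regexp program itself was not posted, so the following is only a guess at what such an approach might look like, not Octavian's actual code. It reads one <Lexem>...</Lexem> record at a time by using the closing tag as the input record separator. The function name, the hash layout and the sample data are assumptions, and the sketch relies on the regular, machine-generated layout shown earlier in the thread: it does not decode entities or handle CDATA, comments or attributes in a different order.

```perl
use strict;
use warnings;

# Hypothetical helper: calls $on_lexem with one hash per <Lexem> record.
sub parse_lexems_with_regex {
    my ( $fh, $on_lexem ) = @_;

    # Each read returns everything up to and including the next </Lexem>.
    local $/ = '</Lexem>';

    while ( my $chunk = <$fh> ) {
        # The final chunk holds only the closing </Lexems> tag; skip it.
        my ( $id, $body ) = $chunk =~ m{<Lexem\s+id="([^"]*)"\s*>(.*)</Lexem>}s
            or next;

        # Collect the <InflectedForm> blocks first...
        my @inflected;
        while ( $body =~ m{<InflectedForm>(.*?)</InflectedForm>}gs ) {
            my $inner = $1;
            my ($inflection_id) = $inner =~ m{<InflectionId>(.*?)</InflectionId>}s;
            my ($form)          = $inner =~ m{<Form>(.*?)</Form>}s;
            push @inflected, { InflectionId => $inflection_id, Form => $form };
        }

        # ...then remove them, so the remaining <Form> is the top-level one.
        $body =~ s{<InflectedForm>.*?</InflectedForm>}{}gs;
        my ($timestamp) = $body =~ m{<Timestamp>(.*?)</Timestamp>}s;
        my ($form)      = $body =~ m{<Form>(.*?)</Form>}s;

        $on_lexem->( {
            id            => $id,
            Timestamp     => $timestamp,
            Form          => $form,
            InflectedForm => \@inflected,
        } );
    }
    return;
}

# Small sample in the shape of the file from this thread.
my $sample = <<'XML';
<?xml version="1.0" encoding="UTF-8"?>
<Lexems>
<Lexem id="2">
<Timestamp>1346826989</Timestamp>
<Form>aaa</Form>
<InflectedForm>
<InflectionId>84</InflectionId>
<Form>aaa</Form>
</InflectedForm>
</Lexem>
<Lexem id="3">
<Timestamp>1346826989</Timestamp>
<Form>aaleni'an</Form>
<InflectedForm>
<InflectionId>25</InflectionId>
<Form>aaleni'an</Form>
</InflectedForm>
<InflectedForm>
<InflectionId>26</InflectionId>
<Form>aaleni'an</Form>
</InflectedForm>
</Lexem>
</Lexems>
XML

# An in-memory filehandle stands in for the real file here.
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";

my @records;
parse_lexems_with_regex( $fh, sub { push @records, shift } );
close $fh;

printf "%s: %s (%d inflected forms)\n",
    $_->{id}, $_->{Form}, scalar @{ $_->{InflectedForm} } for @records;
```

This trades correctness on arbitrary XML for speed, which may be acceptable when the input format is fixed and under your control.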