Can you recommend an XML parser which is faster than XML::Twig?
I need an XML parser that can parse XML files chunk by chunk and that works much faster than XML::Twig. I tried this module, but it is very slow.
I tried something like the code below. I have also tried a version that just opens the file and parses it using regular expressions; the inelegant regexp version is 25 times faster than the one which uses XML::Twig, and it also uses less memory.
If you know of a module for parsing XML that would work faster than regular expressions, or if I can substantially improve the program which uses XML::Twig, then please tell me about it. If the regexp version is still faster, I will use regexps.
Thanks.
use XML::Twig;
my $xml = 'path/to/xml/file.xml';
my $t = XML::Twig->new( twig_handlers => {
    # Called once for every complete <Lexem> element.
    Lexem => sub {
        my ( $t, $lexem ) = @_;
        my $id         = $lexem->att( 'id' );
        my $timestamp  = $lexem->first_child( 'Timestamp' )->text;
        my $lexem_text = $lexem->first_child( 'Form' )->text;
        my @inflected_form = $lexem->children( 'InflectedForm' );
        for my $inflected_form ( @inflected_form ) {
            my $inflection_id   = $inflected_form->first_child( 'InflectionId' )->text;
            my $inflection_text = $inflected_form->first_child( 'Form' )->text;
        }
    },
} );
$t->parsefile( $xml );
> Can you recommend an XML parser which is faster than XML::Twig?
> I need to use an XML parser that can parse the XML files chunk by chunk and
> which works faster (much faster) than XML::Twig, because I tried using this
> module but it is very slow.
XML::LibXML contains several event-based parsers including the SAX parser and
the pull-parser. Can you try using them?
Regards,
Shlomi Fish
Hi Shlomi,
I tried to use XML::LibXML::Reader, which uses the pull parser, and I read that:
""
However, it is also possible to mix Reader with DOM. At every point the
user may copy the current node (optionally expanded into a complete
sub-tree) from the processed document to another DOM tree, or to
instruct the Reader to collect sub-document in form of a DOM tree
""
So I tried:
use XML::LibXML::Reader;
my $xml = 'path/to/xml/file.xml';
my $reader = XML::LibXML::Reader->new( location => $xml ) or die "cannot read $xml";
while ( $reader->nextElement( 'Lexem' ) ) {
    my $id = $reader->getAttribute( 'id' ); #works fine
    my $doc = $reader->document;
    my $timestamp = $doc->getElementsByTagName( 'Timestamp' ); #Doesn't work well
    my @lexem_text = $doc->getElementsByTagName( 'Form' ); #Doesn't work fine
}
So I could get that attribute, but I couldn't get the rest of the sub-elements. For example, when I printed the variable $timestamp, it sometimes printed its value two or three times in a row.
I couldn't find an example of using XML::LibXML to read the XML file element by element and then read each element's child elements directly.
The XML I want to parse looks like the one below. It is just much bigger.
I want to read each <Lexem> element one by one (I've done this successfully), then read its id attribute (this also works), but I also want to read its sub-elements, using something like:
$reader->read_some_element('Form')
or
$reader->{Form}
which should read just the <Form> element right below the <Lexem> element, but not the <Form> elements below <InflectedForm>,
and then read the elements under the <InflectedForm> element using something like:
$reader->read_another_element( '/InflectedForm/Form' )
or like
$reader->{InflectedForm}{Form}
or using the $doc object...
I tried a lot of methods for reading the elements of the current <Lexem> element, but with no good results.
> while ( $reader->nextElement( 'Lexem' ) ) {
> my $id = $reader->getAttribute( 'id' ); #works fine
> my $doc = $reader->document;
> my $timestamp = $doc->getElementsByTagName( 'Timestamp' ); #Doesn't work well
> my @lexem_text = $doc->getElementsByTagName( 'Form' ); #Doesn't work fine
> }
I'm not sure you should do ->document. I cannot tell you off-hand how to do it
right, but I can try to investigate when I have some spare cycles.
OK, after a short amount of investigation, I found that this program works:
[CODE]
use strict;
use warnings;
use XML::LibXML::Reader;
my $xml = 'Lexems.xml';
my $reader = XML::LibXML::Reader->new( location => $xml )
    or die "cannot read $xml";
while ( $reader->nextElement( 'Lexem' ) ) {
    my $id = $reader->getAttribute( 'id' ); #works fine
    # Deep-copy the current <Lexem> subtree into a DOM node.
    my $doc = $reader->copyCurrentNode(1);
    my $timestamp = $doc->getElementsByTagName( 'Timestamp' );
    my @lexem_text = $doc->getElementsByTagName( 'Form' );
}
[/CODE]
Note that you can also use XPath for looking up XML information.
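As an illustration of the XPath remark, here is a minimal sketch (not part of the original messages) that runs relative XPath expressions on the copied Lexem subtree. The sample document is an assumption reconstructed from the element names mentioned in this thread, and the root element name Lexems is hypothetical. A relative expression such as 'Form' matches only direct children of the context node, so it should not pick up the Form elements nested under InflectedForm.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;

# Hypothetical sample data; the real file layout was not posted in the thread.
my $xml = <<'XML';
<Lexems>
  <Lexem id="1">
    <Timestamp>2012-10-25</Timestamp>
    <Form>run</Form>
    <InflectedForm>
      <InflectionId>10</InflectionId>
      <Form>runs</Form>
    </InflectedForm>
    <InflectedForm>
      <InflectionId>11</InflectionId>
      <Form>ran</Form>
    </InflectedForm>
  </Lexem>
  <Lexem id="2">
    <Timestamp>2012-10-26</Timestamp>
    <Form>walk</Form>
  </Lexem>
</Lexems>
XML

my $reader = XML::LibXML::Reader->new( string => $xml )
    or die "cannot parse the sample XML";

my @lexems;
while ( $reader->nextElement('Lexem') ) {
    my $id    = $reader->getAttribute('id');
    my $lexem = $reader->copyCurrentNode(1);    # deep copy of this <Lexem>

    # Relative XPath: only children of <Lexem> itself are matched.
    my $timestamp = $lexem->findvalue('Timestamp');
    my $form      = $lexem->findvalue('Form');

    # One hash per <InflectedForm>, queried relative to that node.
    my @inflected = map {
        +{ id => $_->findvalue('InflectionId'), form => $_->findvalue('Form') }
    } $lexem->findnodes('InflectedForm');

    push @lexems,
        { id => $id, timestamp => $timestamp, form => $form, inflected => \@inflected };
}

printf "%s: %s (%d inflected forms)\n",
    $_->{id}, $_->{form}, scalar @{ $_->{inflected} }
    for @lexems;
```

For a real file, `string => $xml` would become `location => $path`, and each record would be processed inside the loop rather than collected in an array.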
On Mon, Oct 29, 2012 at 9:18 AM, Shlomi Fish <shlo...@shlomifish.org> wrote:
> On Mon, 29 Oct 2012 10:09:53 +0200
> Shlomi Fish <shlo...@shlomifish.org> wrote:
> > Hi Octavian,
> > On Sun, 28 Oct 2012 17:45:15 +0200
> > "Octavian Rasnita" <orasn...@gmail.com> wrote:
> Note that you can also use XPath for looking up XML information.
> --
> To unsubscribe, e-mail: beginners-unsubscr...@perl.org
> For additional commands, e-mail: beginners-h...@perl.org
> http://learn.perl.org/
A little late I know but still...
Last year I was asked to process a large number of XML files (2 x 1.6M
files) that needed to be compared element by element and, with some fuzzy
logic, needed to come out the same. Things like floating point precision
could change (1.00 = 1), and in some cases data could show up in a different
order (repeating elements for multiple items on an order). The whole idea
was that system A, which took flat text output from a mainframe and
translated it to XML for consumption by a web service, was being replaced by
system B, which did the same thing on an entirely different software stack.
Of course this needed to go as fast as possible, as we simply could not sit
around for a few days while the computer did its thing. LibXML was my
saviour, and using XPath was the fastest solution. Though it is possible to
do the DOM thing, you end up with the DOM being translated to XPath under
the hood (at least the performance seemed to indicate that). After a lot of
testing with pretty much every XML parser I could find, LibXML with
XPath was really the fastest.
If you are going for speed then you will want to avoid any copy operations
you can, and you will want to use references as much as possible. Even
though a memory copy of some 100 bytes is a very fast operation, over a
few million files the little time it takes adds up to a lot longer than
you would like.
When you are looking at speed, first and foremost try to avoid anything
that would slow you down. A copy of information is slow, so don't do it if
you can avoid it. A reference to a memory location is slightly harder to
work with in programming but a lot faster. A translation from DOM to XPath
would take you time to do, and the computer needs time for it too. If it is
pure speed you are after, avoid this as well.
Once you are sure you are as fast as you can be, add a benchmark to the code
and try individual optimisations that might or might not be faster... you
would be surprised how the perl internals are sometimes a lot faster with
some operations than with others, even though intuitively you would not
have expected this to be the case.
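A minimal sketch of the kind of benchmark meant here, using the core Benchmark module. It is an illustration only, not code from the project described above: the two helper subs, count_by_copy and count_by_ref, and the test record are hypothetical. It compares handing a string to a sub by value with handing it over by reference. Which one wins, and by how much, depends on the perl version and the data, so the point is to measure rather than assume.

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Hypothetical test record: one <Lexem> with 1000 <Form> children.
my $record = '<Lexem id="1">' . ( '<Form>x</Form>' x 1000 ) . '</Lexem>';

# Variant 1: the sub receives the text and stores it in its own variable.
sub count_by_copy {
    my ($text) = @_;
    my $n = () = $text =~ /<Form>/g;    # count the matches
    return $n;
}

# Variant 2: the sub receives only a reference to the text.
sub count_by_ref {
    my ($ref) = @_;
    my $n = () = $$ref =~ /<Form>/g;
    return $n;
}

# Both variants must give the same answer before timing means anything.
die "variants disagree"
    unless count_by_copy($record) == count_by_ref( \$record );

# Run each variant for at least 1 CPU second and print a comparison table.
cmpthese(
    -1,
    {
        copy      => sub { count_by_copy($record) },
        reference => sub { count_by_ref( \$record ) },
    }
);
```

The same pattern works for comparing whole parsing strategies: wrap each candidate in a sub, check that they agree on the output, then let cmpthese report the relative rates.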
For my case, as it was a once-in-every-25-years kind of major change, I
didn't do too much benchmarking, as the code would be discarded at the end of
the project (well, stored in a dusty old SVN repository for others to reuse
and, realistically, never to be looked at again). I got it to go fast enough
for a regular run of 1.6M files on a daily basis for as long as the project
needed to feel comfortable enough with the tests to go to production. But if
your code is to live a longer life, it becomes worth the effort to do more
benchmarking with every few additional months that the code is expected to
be around and in regular use.
> A little late I know but still...
Unfortunately it is not so late. :-)
> LibXML was my saviour and using XPath was the fastest solution. Though it
> is possible to do the DOM thing you end up with the DOM being translated
> to XPath under the hood (at least the performance seemed to indicate
> that). After a lot of testing and using pretty much any XML parser I could
> find using LibXML and XPath was really the fastest.
> If you are going for speed then you will want to avoid any copy operations
> you can and you will want to as much as possible use references. Because
> even though a memory copy of some 100 bytes is a very fast operation on a
> few million files the little time it takes kind of adds up to a lot
> longer than you would like it to.
Can you give me, or point me to, some examples of using XPath with XML::LibXML?
I tried to use XML::XPath, but it tries to load the entire document into memory, so it is not a good fit.
> Note that you can also use XPath for looking up XML information.
> Regards,
> Shlomi Fish
> --
> Shlomi Fish http://www.shlomifish.org/
I followed the approach you suggested, and it works fine; however, it is very slow.
Here is what I did:
while ( $reader->nextElement( 'Lexem' ) ) {
    my $id = $reader->getAttribute( 'id' );
    my $doc = $reader->copyCurrentNode(1);
    my $timestamp = $doc->findnodes( 'Timestamp' );
    my $lexem_text = $doc->findnodes( 'Form' );
    my $inflected_forms = $doc->findnodes( 'InflectedForm' );
    for my $inflected_form ( $inflected_forms->get_nodelist ) {
        my $inflection_id = $inflected_form->findnodes( './InflectionId' );
        my $inflection_dia = $inflected_form->findnodes( './Form' );
    }
}
I tried to find a way of using XPath but I couldn't find a good one, and it seems that copying that node takes a pretty long time.
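One possible way to avoid the per-record copy is to stay in the pull parser and pick the values out of the stream directly, using the element name and nesting depth to tell the two kinds of Form apart. The sketch below is an illustration only, not code from the thread: the sample data and the text_of helper are hypothetical, and whether it is actually faster than copyCurrentNode on a large file would have to be measured.

```perl
use strict;
use warnings;
use XML::LibXML::Reader;    # node-type constants are exported by default

# Hypothetical sample data; the real file layout was not posted in the thread.
my $xml = <<'XML';
<Lexems>
  <Lexem id="1">
    <Timestamp>2012-10-25</Timestamp>
    <Form>run</Form>
    <InflectedForm>
      <InflectionId>10</InflectionId>
      <Form>runs</Form>
    </InflectedForm>
    <InflectedForm>
      <InflectionId>11</InflectionId>
      <Form>ran</Form>
    </InflectedForm>
  </Lexem>
  <Lexem id="2">
    <Timestamp>2012-10-26</Timestamp>
    <Form>walk</Form>
  </Lexem>
</Lexems>
XML

# Hypothetical helper: collect the text of the leaf element the reader is on,
# leaving the reader positioned on that element's end tag.
sub text_of {
    my ($reader) = @_;
    return '' if $reader->isEmptyElement;
    my $text = '';
    while ( $reader->read ) {
        my $type = $reader->nodeType;
        last if $type == XML_READER_TYPE_END_ELEMENT;
        $text .= $reader->value
            if $type == XML_READER_TYPE_TEXT || $type == XML_READER_TYPE_CDATA;
    }
    return $text;
}

my $reader = XML::LibXML::Reader->new( string => $xml )
    or die "cannot parse the sample XML";

# Collected here only to keep the sketch short; with a large file each record
# would be processed and dropped as soon as it is complete.
my @lexems;
my $lexem_depth = 0;

while ( $reader->read ) {
    next unless $reader->nodeType == XML_READER_TYPE_ELEMENT;
    my $name  = $reader->name;
    my $depth = $reader->depth;
    next unless @lexems || $name eq 'Lexem';    # ignore anything before the first record

    if ( $name eq 'Lexem' ) {
        push @lexems, { id => $reader->getAttribute('id'), inflected => [] };
        $lexem_depth = $depth;
    }
    elsif ( $name eq 'Timestamp' ) {
        $lexems[-1]{timestamp} = text_of($reader);
    }
    elsif ( $name eq 'InflectedForm' ) {
        push @{ $lexems[-1]{inflected} }, {};
    }
    elsif ( $name eq 'InflectionId' ) {
        $lexems[-1]{inflected}[-1]{id} = text_of($reader);
    }
    elsif ( $name eq 'Form' ) {
        my $text = text_of($reader);
        if ( $depth == $lexem_depth + 1 ) {
            $lexems[-1]{form} = $text;    # <Form> directly below <Lexem>
        }
        else {
            $lexems[-1]{inflected}[-1]{form} = $text;    # below <InflectedForm>
        }
    }
}

printf "%s: %s (%d inflected forms)\n",
    $_->{id}, $_->{form}, scalar @{ $_->{inflected} }
    for @lexems;
```

The trade-off is that the structure of the document is now encoded in the loop, so the code is more fragile than the XPath version if the layout changes.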
From: "Octavian Rasnita" <orasn...@gmail.com>
To: <beginn...@perl.org>
Subject: Fast XML parser?
Date sent: Thu, 25 Oct 2012 14:33:15 +0300
> Hi,
> Can you recommend an XML parser which is faster than XML::Twig?
You did not specify what you want to do with the lexemes. Anyway, you might try something like this:
XML::Rules sits on top of XML::Parser::Expat so I would not expect this to be 25 times faster than XML::Twig, but it might be a bit quicker. Or not.
Jenda
===== Je...@Krynicky.cz === http://Jenda.Krynicky.cz =====
When it comes to wine, women and song, wizards are allowed to get drunk and croon as much as they like.
-- Terry Pratchett in Sourcery
> You did not specify what do you want to do with the lexemes anyway
> you might try something like this:
> use strict;
> use XML::Rules;
> use Data::Dumper;
> XML::Rules sits on top of XML::Parser::Expat so I would not expect
> this to be 25 times faster than XML::Twig, but it might be a bit
> quicker. Or not.
> Jenda
Hi Jenda,
I tried your program above, modified as below, but it gives the error:
Free to wrong pool 3967d8 not 20202020 at e:/usr/lib/XML/Parser/Expat.pm line 470.
I was able to install XML::Rules under Windows using cpanm with no problems, so it should be working...
I forgot to say that the script I previously sent to the list also crashed Perl, and it popped up an error window with:
perl.exe - Application Error
The instruction at "0x7c910f20" referenced memory at "0x00000004". The memory could not be "read". Click on OK to terminate the program
I created a smaller XML file with only ~ 100 lines and ran that script again, and it worked fine.
But it doesn't work with the entire XML file, which is more than 200 MB, because it crashes Perl and I don't know why.
Strangely, it now just crashes Perl without returning that "Free to wrong pool" error.
> But it doesn't work with the entire xml file which has more than 200 MB, because it crashes Perl and I don't know why.
> And strange, but I've seen that now it just crashes Perl, but it doesn't return that "Free to wrong pool" error.
> Octavian
That must be something either within your perl or the XML::Parser::Expat. What versions of those two do you have? Any chance you could update?
Jenda
On Wed, Oct 31, 2012 at 5:39 PM, Jenda Krynicky <Je...@krynicky.cz> wrote:
> That must be something either within your perl or the
> XML::Parser::Expat. What versions of those two do you have? Any
> chance you could update?
> Jenda
The memory issue is really an issue with the module itself. I have had those
problems as well: the more complex the XML structure, the more memory it
takes up and the faster you will run out. I simply moved on to other
modules, as I could not afford to spend my time trying to figure out a
workaround.
> That must be something either within your perl or the
> XML::Parser::Expat. What versions of those two do you have? Any
> chance you could update?
> Jenda
> perl -v
This is perl 5, version 14, subversion 2 (v5.14.2) built for MSWin32-x86-multi-thread
(with 1 registered patch, see perl -V for more detail)
Copyright 1987-2011, Larry Wall
Binary build 1402 [295342] provided by ActiveState http://www.ActiveState.com Built Oct 7 2011 15:49:44
...
> cpanm XML::Parser::Expat
Set up gcc environment - 3.4.5 (mingw-vista special r3)
XML::Parser::Expat is up to date. (2.41)
I think Perl is also new enough...
Anyway, I solved the problem by parsing the XML content using regular expressions, and it works very fast this way.
And the regexp solution is no uglier or harder to maintain than using an XML parser...
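The regexp version itself was never posted, so the following is only a guess at what such an approach might look like, not the actual script. It assumes a regular, machine-generated file with the element names used earlier in the thread; the sample data is hypothetical. It reads one record at a time by setting the input record separator to the closing Lexem tag. It does no entity decoding and would break on CDATA sections, comments, or attributes on the inner elements.

```perl
use strict;
use warnings;

# Hypothetical sample data; the real file layout was not posted in the thread.
my $xml = <<'XML';
<Lexems>
  <Lexem id="1">
    <Timestamp>2012-10-25</Timestamp>
    <Form>run</Form>
    <InflectedForm>
      <InflectionId>10</InflectionId>
      <Form>runs</Form>
    </InflectedForm>
    <InflectedForm>
      <InflectionId>11</InflectionId>
      <Form>ran</Form>
    </InflectedForm>
  </Lexem>
  <Lexem id="2">
    <Timestamp>2012-10-26</Timestamp>
    <Form>walk</Form>
  </Lexem>
</Lexems>
XML

# An in-memory filehandle stands in for the real file here.
open my $fh, '<', \$xml or die "cannot open in-memory file: $!";

my @lexems;
{
    local $/ = '</Lexem>';    # each read returns text up to the next </Lexem>
    while ( my $chunk = <$fh> ) {

        # The final chunk holds only the closing root tag and is skipped.
        next unless $chunk =~ /<Lexem\s+id="([^"]*)"/;
        my %lexem = ( id => $1 );

        # Cut the inflected forms out first, so that their <Form> elements
        # cannot be mistaken for the <Form> of the lexem itself.
        my @inflected;
        while ( $chunk =~ s{<InflectedForm>(.*?)</InflectedForm>}{}s ) {
            my $body = $1;
            my ($inflection_id) = $body =~ m{<InflectionId>(.*?)</InflectionId>}s;
            my ($form)          = $body =~ m{<Form>(.*?)</Form>}s;
            push @inflected, { id => $inflection_id, form => $form };
        }

        ( $lexem{timestamp} ) = $chunk =~ m{<Timestamp>(.*?)</Timestamp>}s;
        ( $lexem{form} )      = $chunk =~ m{<Form>(.*?)</Form>}s;
        $lexem{inflected} = \@inflected;
        push @lexems, \%lexem;
    }
}
close $fh;

printf "%s: %s (%d inflected forms)\n",
    $_->{id}, $_->{form}, scalar @{ $_->{inflected} }
    for @lexems;
```

Because only one record is held in memory at a time, memory use stays flat regardless of file size, which matches the observation that the regexp version used less memory.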