Are there any options in parsing large XML files?

George Petasis

Jun 13, 2012, 3:08:45 AM

Hi all,

Is there an extension that will help me parse a large (30 GB) XML
file?

George

arjenmarkus

Jun 13, 2012, 3:36:20 AM

I do not know whether the performance will be acceptable but the usual
extensions for parsing XML are tclxml and tdom (see the Wiki for both).
However, given the size, you will need to use a SAX approach - that is:
parse it as you go along, rather than keep everything in memory.

I do not know the specific capabilities of these packages wrt SAX.

Regards,

Arjen

Georgios Petasis

Jun 13, 2012, 3:42:02 AM

to arjen.m...@gmail.com

I know that tdom cannot support this (it loads everything into memory).
Does TclXML support SAX?

George

arjenmarkus

Jun 13, 2012, 4:37:07 AM

On 2012-06-13 09:42, Georgios Petasis wrote:

>
> I know that tdom cannot support this (it loads everything into memory).
> Does TclXML support SAX?
>
> George

From the documentation I get the impression that tclxml does use a
SAX approach, as you register callbacks for the various "events" that
occur during the parsing process.

Regards,

Arjen

quiet_lad

Jun 13, 2012, 4:39:24 AM

xml is evil
ask ww.cat-v.org
http://harmful.cat-v.org/software/xml/

Matthias Kraft

Jun 13, 2012, 10:42:15 AM

Georgios Petasis wrote:
> I know that tdom cannot support this (it loads everything into memory).
> Does TclXML support SAX?

Haven't used it myself, but from what I read you can use tdom's expat
command directly to implement SAX parsing. See here:

http://tdom.github.com/expat.html

kind regards
--
Matthias Kraft
Software AG, Germany

Georgios Petasis

Jun 13, 2012, 10:47:14 AM

to Matthias Kraft

On 13/6/2012 17:42, Matthias Kraft wrote:
> Georgios Petasis wrote:
>> I know that tdom cannot support this (it loads everything into memory).
>> Does TclXML support SAX?
>
> Haven't used it myself, but from what I read you can use tdom's expat
> command directly to implement SAX parsing. See here:
>
> http://tdom.github.com/expat.html
>
> kind regards

Hm, I didn't know about this, seems interesting.

Regards,

George

rene

Jun 13, 2012, 2:41:31 PM

Not an extension, but some years ago I used sp (an SGML parser) to
parse 2 GB files (see www.jclark.com/sp/). It is written in C++.
If you want to use it, I can dig up some more information.

HTH
rene

Georgios Petasis

Jun 21, 2012, 4:08:58 AM

In the end I used tdom's expat command. It was easy to use, and processing
is fast enough (a few seconds for a 700 MB XML file).

George

arjenmarkus

Jun 21, 2012, 5:37:38 AM

Hi George,

That sounds quite usable. I keep wondering why these files are so large
and whether XML is the right format, but you probably have no choice ;).

Regards,

Arjen

Donal K. Fellows

Jun 21, 2012, 6:57:20 AM

On 21/06/2012 10:37, arjenmarkus wrote:
> that sounds quite useable. I keep wondering why these files are so large
> and if XML is the right format, but you probably have no choice ;).

Remember, the alternative is often just a large proprietary-format file
instead. (Or, worse, ASN.1, which introduces the "wonderful" OID…)

Donal.

DrS

Jun 21, 2012, 9:46:49 AM

On 6/21/2012 4:08 AM, Georgios Petasis wrote:
>
> Finally I used tdom's expat command. It was easy to use, and processing
> is fast enough (a few seconds for a 700 MB XML file).
>
> George
>

I would be interested in seeing how you configured the parser. The expat
help page indicates that you need to provide many commands that do the
actual parsing. Are there any default specs that would work for any
XML file, or do you need to customize these commands for each schema?

DrS

Georgios Petasis

Jun 21, 2012, 10:04:43 AM

Yes, they are Wikipedia dumps; they come in XML and are quite large...

George

Georgios Petasis

Jun 21, 2012, 10:07:55 AM

No, you can configure only the ones you need. You can skip the others.
See how simple my code was for parsing Wikipedia data:

## Remember the name of the element we are currently inside.
proc handle_start {name attributes args} {
    set ::CurrentTag $name
};# handle_start

## Leaving an element: no current tag until the next one starts.
proc handle_end {name args} {
    set ::CurrentTag {}
};# handle_end

## Dispatch character data according to the enclosing element.
## Note: expat may deliver the text of one element in several calls.
proc handle_text {data args} {
    switch -exact -- $::CurrentTag {
        title {set ::CurrentTitle [string trim $data]}
        id    {set ::CurrentId $data}
        text  {
            # do something with the text
        }
    }
};# handle_text

## Create a streaming xml parser...
package require tdom
expat xml -elementstartcommand  handle_start \
          -elementendcommand    handle_end \
          -characterdatacommand handle_text

## Open the input file (always in utf-8).
## $in must already hold the path of the XML dump.
set fd [open $in]
fconfigure $fd -encoding utf-8

## Parse the xml data...
xml parsechannel $fd

## Done! Free parser and close file...
close $fd
xml free
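
One caveat with character data handlers in general: expat is free to
deliver the text of a single element in several pieces (for example
around an entity reference such as &amp;), so a handler that stores only
the latest piece can end up with a truncated value. A minimal sketch of
one way around this, assuming leaf elements only, is to append in the
text handler and read the buffer in the end handler. The names
text_start, text_data, text_end, ::Buffer, ::Titles and the parser name
"sketch" are hypothetical and chosen only for illustration:

```tcl
package require tdom

# Illustrative globals (not from the code above).
set ::Buffer {}
set ::Titles {}

proc text_start {name attributes} {
    # A new element begins: start collecting its text from scratch.
    # (This discards mixed content of the parent, so it suits leaf elements.)
    set ::Buffer {}
}

proc text_data {data} {
    # May be called more than once per element; append every piece.
    append ::Buffer $data
}

proc text_end {name} {
    # The element is complete, so the buffer now holds its full text.
    if {$name eq "title"} {
        lappend ::Titles [string trim $::Buffer]
    }
    set ::Buffer {}
}

expat sketch -elementstartcommand  text_start \
             -elementendcommand    text_end \
             -characterdatacommand text_data

# A tiny in-memory document stands in for the real dump.
sketch parse {<page><title>Tcl &amp; Tk</title><id>42</id></page>}
sketch free

puts [lindex $::Titles 0]
```

For a real dump, the parse call would be replaced by parsechannel on the
opened file, exactly as above.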

George

Arjen Markus

Jun 22, 2012, 3:23:02 AM

On Thursday, 21 June 2012 12:57:20 UTC+2, Donal K. Fellows wrote:

I myself work with formats such as NetCDF and HDF, which are designed for
large, simply structured data such as matrices. The libraries to access and
store data in these formats are quite general and flexible. But they are not
suited to storing the kind of information George is dealing with (dumps of
Wikipedia pages).

Regards,

Arjen