
Splitting huge XML files into fixed-size well-formed parts


Malapha

Mar 17, 2008, 6:43:20 AM
Hi,

I am kind of depressed :-) I want to split XML files with sizes
greater than 2 GB into smaller chunks. As I don't want to end up with
billions of files, I want those split files to have configurable
sizes like 250 MB. Each file should be well formed, having an exact
copy of the header (and footer as the closing of the header) from the
original file. Furthermore, a table should be generated where I can
see that file X is separated into part N, with a timestamp:

Table:

Original filename|Name of Part N|Size of Part N|Timestamp

The Original XML-Files look like this:
<?xml ...>
<Headerelement with some infos to be copied 1to1>
<OfferInfo>
<OfferID></OfferID>
...
</OfferInfo>
<OfferInfo>
<OfferID></OfferID>
...
</OfferInfo>
<OfferInfo>
<OfferID></OfferID>
...
</OfferInfo>
</Headerelement>

All in all, I ended up reading the XML processing docs for gawk,
but it seems I am lacking some deeper programming skills. Could
someone please help?

Thx
Malapha

Janis Papanagnou

Mar 17, 2008, 8:37:27 AM
Malapha wrote:
> Hi,
>
> I am kind of depressed :-) I want to split XML files with sizes
> greater than 2 GB into smaller chunks. As I don't want to end up with
> billions of files, I want those split files to have configurable
> sizes like 250 MB. Each file should be well formed, having an exact
> copy of the header (and footer as the closing of the header) from the
> original file. Furthermore, a table should be generated where I can
> see that file X is separated into part N, with a timestamp:

A nice and well-described little homework assignment with clear requirements.

I'd abstain from splitting the file according to file size in MB
and suggest taking a simpler measure for splitting instead, like the
number of XML blocks or the number of lines.

>
> Table:
>
> Orginalfilename|Name of PartN|Size of PartN|Timestamp
>
>
>
> The Original XML-Files look like this:
> <?xml ...>
> <Headerelement with some infos to be copied 1to1>
> <OfferInfo>
> <OfferID></OfferID>
> ...
> </OfferInfo>
> <OfferInfo>
> <OfferID></OfferID>
> ...
> </OfferInfo>
> <OfferInfo>
> <OfferID></OfferID>
> ...
> </OfferInfo>
> </Headerelement>
>
>
>
> All in all I ended up with reading the XML processing docus with gawk,
> but as it seems I am lacking some deeper programming skills..

Given your data above, you can solve all of that with basic awk
pattern-matching capabilities; no deeper skills required. What have you
tried so far?

> Could
> someone please help?

Since, apparently, you don't have a complex XML structure, the use of
xgawk seems unnecessary. The quick way I'd go would be...

Save everything in a variable until you match the /Headerelement/.
Write that header to a file whose name contains a part counter.
Write everything up to the end of the block /<\/OfferInfo>/ to the
file whose name contains that counter, while counting lines.
If the number of lines exceeds some constant value, write the constant
trailer, close() the file, and increase the counter for the part files.
To create a separate table, just write out the information you already
have to a file with a fixed name (use awk's date functions or, if
unavailable, an external date program and getline).
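
Roughly, as an untested sketch of those steps (the line limit, the part
file naming, and the table.txt name are just placeholders, the part size
is approximated by its line count, and it assumes line-oriented input
like your sample plus gawk's strftime()):

gawk -v max=1000000 '
    NR == 1, /<Headerelement/ {            # collect the header lines
        header = header $0 ORS
        if (/<Headerelement/) trailer = "</Headerelement>"
        next
    }
    $0 ~ trailer { next }                  # skip the original closing tag
    {
        if (lines == 0) {                  # start a new part file
            out = FILENAME "_part" (++part) ".xml"
            printf "%s", header > out
        }
        print > out
        lines++
    }
    /<\/OfferInfo>/ && lines > max {       # block done and part large enough
        print trailer > out
        close(out)
        print FILENAME ";" out ";" lines ";" strftime() >> "table.txt"
        lines = 0
    }
    END {                                  # finish the last part
        if (lines) {
            print trailer > out
            print FILENAME ";" out ";" lines ";" strftime() >> "table.txt"
        }
    }
' original.xml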

If you have concrete questions feel free to ask.
(Or did you mean to write that program for you?)

Janis

>
> Thx
> Malapha

Malapha

Mar 17, 2008, 9:35:37 AM
On 17 Mar., 13:37, Janis Papanagnou <Janis_Papanag...@hotmail.com>
wrote:

> Malapha wrote:
> > Hi,
>
> > I am kind of depressed :-) I want to split XML files with sizes
> > greater than 2 GB into smaller chunks. As I don't want to end up with
> > billions of files, I want those split files to have configurable
> > sizes like 250 MB. Each file should be well formed, having an exact
> > copy of the header (and footer as the closing of the header) from the
> > original file. Furthermore, a table should be generated where I can
> > see that file X is separated into part N, with a timestamp:
>
> A nice and well-described little homework assignment with clear requirements.
>
> I'd abstain from splitting the file according to file size in MB
> and suggest taking a simpler measure for splitting instead, like the
> number of XML blocks or the number of lines.
>

I totally agree with you. Using the number of XML blocks as an
approximation for file size is good enough.
The problem I see is that using line counts only works when the XML
document contains end-of-line characters. If the input data file has no
EOLs, I run into problems. So I came to the solution of using the xgawk
framework in order to make use of the "node hopping" technique. This
gives me the possibility to count the Offers without having to solve
the problems mentioned above.

>
> > All in all I ended up with reading the XML processing docus with gawk,
> > but as it seems I am lacking some deeper programming skills..
>
> Given your data above you can solve that all with basic awk pattern
> matching capabilities, no deeper skills required. What have you tried
> so far?

As I come from the VBA world, I tried to get familiar with awk. What
I do have is a theoretical solution in the form of a structured process
diagram :-)

Copy Header and Footer from Original to Var
Set Start_Offer = First Offer (from <Offer> to </Offer>)
Set End_Transaction = 0
Set Part = 0
Set FileSize = 0
Set MaxFileSize = 250
while not EOF(OriginalXMLFile)
    Part = Part + 1
    Open NewFile OriginalXMLFileName + Part + ".xml"
    Paste Header from Var to NewFile
    While filesize(NewFile) < MaxFileSize do
        Copy Offer (Start_Offer) from OriginalXMLFile to NewFile
        Start_Offer = Start_Offer + 1
    wend
    Paste Footer from Var to NewFile
wend

I am right now trying to translate this into awk. Please don't ask me
how far I am, it's frustrating :-)


> Save everything in a variable until you match the /Headerelement/.
> Write that header to a file whose name contains a variable as number.
> Write everything until the end of the block /<\/OfferInfo>/ to the
> file whose name contains a variable as number, while counting lines.
> If the number of lines exceeded some constant value write the constant
> trailer, and close() the file, and increase the variable that counts
> the files. To create a separate table just write out the information
> you already have to a file with fixed name (use awk's date functions
> or if unavailable an external date program and getline).

This looks very much like my approach - so I am quite happy that I am
not that wrong...


Hermann Peifer

Mar 17, 2008, 3:20:36 PM
Malapha wrote:
> On 17 Mrz., 13:37, Janis Papanagnou <Janis_Papanag...@hotmail.com>
> wrote:
>> Malapha wrote:
>>> Hi,
>>> I am kind of depressed :-) I want to split XML files with sizes
>>> greater than 2 GB into smaller chunks. As I don't want to end up with
>>> billions of files, I want those split files to have configurable
>>> sizes like 250 MB. Each file should be well formed, having an exact
>>> copy of the header (and footer as the closing of the header) from the
>>> original file. Furthermore, a table should be generated where I can
>>> see that file X is separated into part N, with a timestamp:
>> A nice and well-described little homework assignment with clear requirements.
>>
>> I'd abstain from splitting the file according to file size in MB
>> and suggest taking a simpler measure for splitting instead, like the
>> number of XML blocks or the number of lines.
>>
>
> I totally agree with you. Using the number of XML blocks as an
> approximation for file size is good enough.
> The problem I see is that using line counts only works when the XML
> document contains end-of-line characters. If the input data file has no
> EOLs, I run into problems. So I came to the solution of using the xgawk
> framework in order to make use of the "node hopping" technique. This
> gives me the possibility to count the Offers without having to solve
> the problems mentioned above.
>

Missing line breaks could be added via a preprocessing step with
$ xmllint --format bigfile.xml > formatted_bigfile.xml

I don't know how xmllint performs with a 2G file. On my old laptop, I am
running out of memory when trying to re-format a 600M file. However, you
might have better hardware available.

There are also other XML command line tools around that have some
"pretty print" option. xmlstarlet is one of them.

>>> All in all I ended up with reading the XML processing docus with gawk,
>>> but as it seems I am lacking some deeper programming skills..
>> Given your data above you can solve that all with basic awk pattern
>> matching capabilities, no deeper skills required. What have you tried
>> so far?
>

Before going deeper into xgawk: try to reformat the file as suggested
above. Then, as suggested by Janis, you could use regular awk for
the splitting task.

Hermann

Jürgen Kahrs

Mar 17, 2008, 4:33:46 PM
Malapha wrote:

> I totally agree with you. Using numbers of XML block as an
> approximation for filesize is well enough.

You may use the variable XMLLEN in xgawk.
Accumulate XMLLEN and you get a very precise
approximation for file size.

xgawk -lxml '{l+= XMLLEN};END{print l}' mssecure.xml
2309349

ll mssecure.xml
-rw-r--r-- 1 kahrs users 2309349 12. Jan 2005 mssecure.xml

Hermann Peifer

Mar 17, 2008, 7:01:43 PM
Malapha wrote:
>
> As I come from the VBA world - I tried to get familiar with awk. What
> I do have is theoretical solution in form of a structured process
> diagram :-)
>
> Copy Header and Footer from Original to Var
> Set Start_Offer = First Offer (from <Offer> to </Offer>)
> Set End_Transaction = 0
> Set Part = 0
> Set FileSize = 0
> Set MaxFileSize = 250
> while not Start_Offer < EOF(OriginalXMLFile)
> Part=part+1
> Open NewFile OriginalXMLFileName + Part + ".xml"
> Paste Header from Var to NewFile
> While filesize(NewFile)<MaxFileSize do
> Copy Offer (Start_Offer) from OriginalXMLDatei to NewFile
> Start_Offer=Start_Offer + 1
> wend
> Paste Footer from Var to NewFile
> wend
>
> I am right now trying to translate this into awk.. Please dont ask me
> how far i am, its frustrating :-)
>
>

Below is one solution for splitting into well-formed chunks, here: 100
OfferInfos each. There might be better solutions (I just don't know
them ;-). It only works if the XML data is in "pretty print" format,
like the sample data you posted.


$ cat split_bigfile.awk

BEGIN { new_chunk = 1 ; size = 100 }

# Remember the XML declaration and the root element, build the footer
NR == 1 { header = $0 ; next }
NR == 2 { header = header ORS $0 ; footer = "</" substr($1,2) ">" ; next }

# Copy everything except the original closing root tag
$0 !~ footer {
    if (new_chunk) {
        outfile = "chunk" sprintf("%07d", num) ".xml"
        print header > outfile
        new_chunk = 0
    }
    print > outfile
}

# Decide if it's time to add a footer and start a new chunk
/<\/OfferInfo>/ {
    num = int(count++/size)
    if (num > prev_num) {
        print footer > outfile
        new_chunk = 1
    }
    prev_num = num
}

# Footer for the last chunk, but avoid double footers
END { if (!new_chunk) print footer > outfile }
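
Usage would be something like:

$ awk -f split_bigfile.awk formatted_bigfile.xml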

Malapha

Mar 18, 2008, 11:42:50 AM
On 17 Mar., 21:33, Jürgen Kahrs <Juergen.KahrsDELETET...@vr-web.de>
wrote:

Wow, this looks promising!!

Unfortunately I caught a cold, so I am not able to test it in the
office. But thanks a lot. I'll try to combine it with the suggestions
Hermann made.

Thanks
Malapha

Malapha

Mar 18, 2008, 11:43:54 AM

Hermann, you are great. As I wrote to Jürgen, I am unable to
check it right now. But as soon as possible I'll give it a try!!

Thanks again
Malapha

Hermann Peifer

Mar 18, 2008, 3:49:03 PM


Here is the xgawk version of the same script. It works fine for me with
your test data. No pre-formatting of bigfile.xml is needed. However, for
this solution you need to have xgawk and the xmlcopy.awk library
available. In xmlcopy.awk, I made a minor change at the very end:

# printf( "%s", token )
return token

Usage of the script: xgawk -f split_big_xmlfile.awk bigfile.xml

$ cat split_big_xmlfile.awk

# Include the xmlcopy.awk library
# Make sure that xgawk finds it
@include xmlcopy

BEGIN { new_chunk = 1 ; size = 100 }

# Remember XML declaration of bigfile.xml
XMLDECLARATION { header = XmlCopy() }

# Remember root element, define the footer
XMLSTARTELEM && XMLDEPTH == 1 {
    header = header XmlCopy()
    footer = "</" XMLSTARTELEM ">"
}

# Only care about OfferInfos and their children
XMLPATH ~ /OfferInfo/ {
    if (new_chunk) {
        outfile = "chunk" sprintf("%07d", num) ".xml"
        printf "%s", header > outfile
        new_chunk = 0
    }
    printf "%s", XmlCopy() > outfile
}

# Decide if it's time to add a footer and start a new chunk
XMLENDELEM == "OfferInfo" {
    num = int(++count/size)
    if (num > prev_num) {
        print footer > outfile
        new_chunk = 1
    }
    prev_num = num
}

# Footer for the last chunk; avoid double footers if, at the end, count%size == 0
END { if (!new_chunk) print footer > outfile }

Hermann Peifer

Mar 19, 2008, 9:05:17 AM


Just in case someone is interested, here is yet another version of
the same script, where chunk size is defined in bytes (and checked via
XMLLEN, as suggested by Juergen).

Hermann

$ cat split_big_xmlfile.awk

# Include the xmlcopy.awk library
# Make sure that xgawk finds it
@include xmlcopy

# new_chunk can be anything here, but not 0 or ""
# size value defines approx. chunk size in bytes
# you might have to worry about XMLCHARSET (or not)
BEGIN {
    new_chunk = size = 250000000
    # XMLCHARSET = "ISO-8859-1"
}

# Remember original XML declaration
XMLDECLARATION { header = XmlCopy() }

# Remember original root element, define the footer
XMLSTARTELEM && XMLDEPTH == 1 {
    header = header ORS XmlCopy() ORS
    footer = ORS "</" XMLSTARTELEM ">"
}

# Only care about these elements and their children
XMLPATH ~ /OfferInfo/ {
    if (new_chunk) {
        outfile = "chunk" sprintf("%07d", num) ".xml"
        printf "%s", header > outfile
        new_chunk = ""
    }
    printf "%s", XmlCopy() > outfile
    chunk_size += XMLLEN
}

# Decide if it's time to add a footer and start with a new chunk
XMLENDELEM == "OfferInfo" && chunk_size > size {
    printf "%s", footer > outfile
    num++
    new_chunk = "it's time now"
    chunk_size = 0
}

END {
    # Footer for the last chunk, but avoid double footers
    if (!new_chunk) printf "%s", footer > outfile

    # Print XMLERRORs, if any. Xgawk is somewhat lazy in
    # this respect and might silently die, if you don't have:
    if (XMLERROR)
        printf("XMLERROR '%s' at row %d col %d len %d\n",
               XMLERROR, XMLROW, XMLCOL, XMLLEN)
}
}

Malapha

Mar 19, 2008, 6:11:08 PM

I am at a loss for words! Thanks a lot. BTW, I already searched for the
xmlcopy.awk script but without luck. xgawk and the utils are installed,
but not xmlcopy.awk. Do you have a URL?

Malapha

Hermann Peifer

Mar 20, 2008, 4:52:22 AM

On my Linux laptop, it is here: /usr/local/share/xgawk/xmlcopy.awk

It is part of the latest xgawk release: xgawk-3.1.6-20080101.tar.gz
https://sourceforge.net/project/showfiles.php?group_id=133165

A third place is the source code repository, see here:
http://xmlgawk.cvs.sourceforge.net/xmlgawk/xmlgawk/awklib/xml/

Hermann

Malapha

Mar 25, 2008, 6:39:31 AM
> It is part of the latest xgawk release: xgawk-3.1.6-20080101.tar.gz
> https://sourceforge.net/project/showfiles.php?group_id=133165

>
> A third place is the source code repository, see here:
> http://xmlgawk.cvs.sourceforge.net/xmlgawk/xmlgawk/awklib/xml/
>
> Hermann

Thanks again. I got everything up and running, and it worked :-) I
also modified xmlcopy.awk as suggested.

Here are some benchmarks:
Type                 Minutes       Size
BYTESHRED_XMLCOPY    7,966666667   322 MB
COUNTSHRED           0,583333333   322 MB
COUNTSHRED_XMLCOPY   7,55          322 MB

Countshred_XMLCOPY uses the XmlCopy method. As you can see, the
text-based method (Hermann's first one) is by far the fastest, with the
disadvantage that the XML input file has to be in pretty-print format.
I am still struggling with which methodology to use. As I have file
sizes of up to 3 GB, "COUNTSHRED" seems to be the one.

One more question: in my XML files there is another tag next to
<OfferInfo>, named <CancelOfferInfo>. Where do I need to place this in
the code so that it also gets processed?


Many thanks
Mala

Hermann Peifer

Mar 25, 2008, 9:32:31 AM
On Mar 25, 11:39 am, Malapha <mala...@gmx.net> wrote:
>
> Thanks again. I got everything up and running - and it worked :-) I
> also modified XMLCOPY as suggested.
>
> Here are some benchmarks:
> Type                    Minutes         Size
> BYTESHRED_XMLCOPY       7,966666667     322 MB
> COUNTSHRED              0,583333333     322 MB
> COUNTSHRED_XMLCOPY      7,55            322 MB
>
> Countshred_XMLCOPY uses the XmlCopy method. As you can see, the
> text-based method (Hermann's first one) is by far the fastest, with the
> disadvantage that the XML input file has to be in pretty-print format.
> I am still struggling with which methodology to use. As I have file
> sizes of up to 3 GB, "COUNTSHRED" seems to be the one.
>

If you already have "nicely formatted" XML files (or manage to get
there via xmllint --format), then I'd recommend using the faster
solution with regular awk.

If not... then you have to use xgawk in combination with XmlCopy. Some
performance tuning might be possible. I guess that Jürgen might have
some good ideas.

> One more question: In my XML Files there is another tag next to the
> <OfferInfo>, named <CancelOfferInfo>. Where do I need to place this in
> the code, so that it also gets processed?
>

There is no single answer to this question, as there are 3 scripts now
with slightly different code. However, these rules will find both
OfferInfo and CancelOfferInfo elements:

/<.*OfferInfo>/ {do something with regular awk}

XMLPATH ~ /OfferInfo/ {do something with xgawk}

Another xgawk option could be to define the condition via XMLDEPTH,
e.g.:

XMLDEPTH > 2 {do something}
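
For the regular awk script, one option would be to widen the closing-tag
rule accordingly (assuming the CancelOfferInfo blocks sit at the same
level as the OfferInfo blocks), e.g.:

# end a chunk after either kind of block
/<\/(Cancel)?OfferInfo>/ {
    num = int(count++/size)
    if (num > prev_num) {
        print footer > outfile
        new_chunk = 1
    }
    prev_num = num
}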

Hermann

Malapha

Mar 26, 2008, 1:01:38 PM
After having decided to use the fastest way, please let me come back to
the original problem: I also want to have some logging of the
shredding process at runtime, after each chunk is finished, so that the
file sizes on the filesystem after the shredding correspond with the
values of the table attributes in the logfile. Here is my idea:

logfile = "shredder_log.txt"
cmd_original_length = "ls -l " FILENAME " | gawk '{print $5;}'"
cmd_original_length | getline original_size
cmd_part_length = "ls -l " outfile " | gawk '{print $5;}'"
cmd_part_length | getline part_size
print outfile ";" sprintf("%03d", num+1) ";" FILENAME ";" strftime("%m
%d%Y%H%M%S", systime()) ";" original_size ";" part_size >> logfile

So far so good, but I have problems with the placement of that piece of
code. I tried several places in the script, but it is either too early
(the file does not yet exist -> size 0), in the middle of the process
(wrong file size), or too late.

Many regards
Mala

Hermann Peifer

Mar 26, 2008, 2:57:10 PM
Malapha wrote:

> After having decided to use the fastest way, please let me come back to
> the original problem: I also want to have some logging of the
> shredding process at runtime, after each chunk is finished, so that the
> file sizes on the filesystem after the shredding correspond with the
> values of the table attributes in the logfile. Here is my idea:
>
> logfile = "shredder_log.txt"
> cmd_original_length = "ls -l " FILENAME " | gawk '{print $5;}'"
> cmd_original_length | getline original_size
> cmd_part_length = "ls -l " outfile " | gawk '{print $5;}'"
> cmd_part_length | getline part_size
> print outfile ";" sprintf("%03d", num+1) ";" FILENAME ";" strftime("%m
> %d%Y%H%M%S", systime()) ";" original_size ";" part_size >> logfile
>

Does this really have to be logged at runtime? I would do this after the
splitting is done. Furthermore: "Using getline is almost always the
wrong approach", to quote one of the regulars in this group. So far, I
followed this advice and managed to avoid getline constructions in my
scripts.
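
As a rough sketch of the after-the-splitting approach (assuming GNU ls
output with the size in field 5 and the name in the last field, plus
gawk's strftime(); the original file name is just passed in as a
variable here):

$ ls -l chunk*.xml | gawk -v orig=bigfile.xml '
    { print orig ";" $NF ";" $5 ";" strftime("%m%d%Y%H%M%S") }
' >> shredder_log.txt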

Hermann
