XML parsing in Hadoop

geertvanlandeghem

Sep 13, 2011, 10:08:20 AM

to bigdatabe

Hello all,

Does anybody have experience parsing XML files with Hadoop?

We found some information online about xadoop, the Mahout XML
parser, the Pig XML loader... but before jumping into each I wanted to
ask whether any of you have used one of these to get results out of
big XML files (the file to be parsed is 19 GB, bzipped).

thanks in advance,

Geert Van Landeghem

Davy Suvee

Sep 15, 2011, 7:07:20 AM

to bigdatabe

Hi Geert,

I haven't done this myself, but the book “Hadoop: The Definitive Guide”
gives some pointers on how to achieve this. In short:
- Your files are bzipped, hence splittable. This allows several Hadoop
nodes to each process part of the big file.
- Hadoop supports XML processing out of the box through the
StreamXmlRecordReader class (part of Hadoop Streaming). It returns
whole XML records even when the content of an individual record is
spread across different splits. You initialize the record reader with
the specific start and end tag.
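
To make the start/end-tag idea concrete, here is a minimal Python sketch of cutting a text stream into records between a begin tag and an end tag. It is only an illustration of the concept, not Hadoop's implementation: the function name `iter_xml_records`, the `chunk_size` parameter and the sample tags are all made up for this example. Reading in small chunks stands in for a record that straddles a boundary, much like a record that spans two splits.

```python
import io


def iter_xml_records(stream, begin_tag, end_tag, chunk_size=64 * 1024):
    """Yield every substring from begin_tag through end_tag in a text stream.

    Plain substring matching: a begin tag with attributes, nested records,
    or tags inside CDATA/comments are NOT handled (assumption of the sketch).
    """
    buffer = ""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of input; an unfinished record is silently dropped
        buffer += chunk
        while True:
            start = buffer.find(begin_tag)
            if start == -1:
                # Keep only a tail that could still hold a partial begin tag.
                keep = len(begin_tag) - 1
                buffer = buffer[-keep:] if keep > 0 else ""
                break
            end = buffer.find(end_tag, start + len(begin_tag))
            if end == -1:
                # Record is incomplete: keep it and wait for the next chunk.
                buffer = buffer[start:]
                break
            stop = end + len(end_tag)
            yield buffer[start:stop]
            buffer = buffer[stop:]


if __name__ == "__main__":
    sample = io.StringIO("<root><rec>a</rec><rec>b</rec></root>")
    # A tiny chunk size forces records to straddle chunk boundaries.
    for record in iter_xml_records(sample, "<rec>", "</rec>", chunk_size=5):
        print(record)
```

For a bzipped file, the stream could come from Python's `bz2.open(path, "rt")`, which decompresses while reading, so the whole file never has to sit in memory.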

Hope this helps out ...

Davy


Eric Charles

Sep 15, 2011, 11:53:00 AM

to bigd...@googlegroups.com

Hi,
I have never used it, but it does not look that straightforward:
http://oobaloo.co.uk/articles/2010/1/20/processing-xml-in-hadoop.html

Eric


Daan Gerits

Sep 16, 2011, 2:34:19 AM

to bigd...@googlegroups.com

Geert, Davy, Eric,

There is a Pig XMLLoader to which you supply a tag; it splits your XML file according to that tag. You still end up with XML fragments, but since they are relatively small it shouldn't be too hard to write a Pig UDF to parse them.
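
As a rough sketch of what parsing one such fragment could look like, here is a small Python function that turns a single XML fragment into a flat list of field values. It is not a complete Pig UDF; the function name `fragment_to_row` and the sample element names (`page`, `title`, `id`) are made up for illustration, and it assumes each fragment is a small, well-formed element.

```python
import xml.etree.ElementTree as ET


def fragment_to_row(fragment, fields):
    """Parse one XML fragment and return the text of the requested child elements.

    Missing or empty children become empty strings so every row has the same width.
    """
    element = ET.fromstring(fragment)  # raises ET.ParseError on malformed input
    return [(element.findtext(name) or "").strip() for name in fields]


if __name__ == "__main__":
    fragment = "<page><title> Hadoop </title><id>42</id></page>"
    print(fragment_to_row(fragment, ["title", "id", "missing"]))
```

Because each fragment is small, parsing it fully in memory like this is cheap; the expensive part remains cutting the large file into fragments in the first place.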

I think the main problem lies in defining the splits for the MapReduce processing. You need to split the large XML file, and the only way to do that is to divide it by a given tag. XML isn't a great format for large files due to its structure.

Daan

