Apache Pig + Elephant-Bird + SequenceFiles


samee...@gmail.com

Oct 23, 2013, 2:42:24 PM
to elephant...@googlegroups.com
Hi All,
I have a lot of small (~2 to 3 MB) XML files that I would like to process. I am thinking along the following lines; please let me know if you have any thoughts on this.

1. Create SequenceFiles such that each SequenceFile is between 60 and 64 MB and no XML file is split across two SequenceFiles.
Is it possible to use Elephant-Bird for this task?

2. Write a Pig script that loads the SequenceFiles, then iterates over the individual XML files and analyzes them.
I was planning to use Elephant-Bird to read the SequenceFiles. Any thoughts on this would be highly appreciated.
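
To make step 1 concrete, here is a rough Python sketch of just the grouping logic; it is not Elephant-Bird code and it does not write any SequenceFiles. The names `plan_batches` and `MAX_BATCH_BYTES` are made up for illustration. It assumes the file sizes are known up front, and it ignores SequenceFile overhead (header, keys, sync markers), so the real files would come out somewhat larger than the planned payload.

```python
from typing import Iterable, List, Tuple

# Assumed upper bound per SequenceFile payload (64 MB); hypothetical constant.
MAX_BATCH_BYTES = 64 * 1024 * 1024


def plan_batches(
    files: Iterable[Tuple[str, int]],
    max_bytes: int = MAX_BATCH_BYTES,
) -> List[List[str]]:
    """Group (name, size_in_bytes) pairs into batches of whole files.

    Files are taken in the order given. A new batch is started whenever
    adding the next file would push the current batch past max_bytes, so
    no file is ever split across two batches. A single file larger than
    max_bytes ends up alone in its own batch.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each resulting batch would then be written out as one SequenceFile, for example with the file name as the key and the XML content as the value. With files of 2 to 3 MB, every batch except possibly the last should land within a few MB of the limit.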

samee...@gmail.com

Oct 23, 2013, 2:50:15 PM
to elephant...@googlegroups.com
I also wanted to mention that each XML file is for a given entity and is quite deeply nested. Any thoughts on parsing this complex structure would also be great; most likely I will need to write a UDF.
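
To illustrate the kind of parsing I have in mind, here is a minimal Python sketch that flattens one nested XML document into (path, value) pairs using the standard library's `xml.etree.ElementTree`. The function name `flatten_xml` and the sample element names are hypothetical, and wiring this into an actual Pig UDF is not shown.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def flatten_xml(xml_text: str) -> List[Tuple[str, str]]:
    """Return (path, value) pairs for every attribute and non-empty text node.

    Paths look like /entity/address/city; attributes are reported as
    /entity/@id. Repeated sibling tags produce repeated paths.
    """
    rows: List[Tuple[str, str]] = []

    def walk(elem: ET.Element, parent_path: str) -> None:
        here = parent_path + "/" + elem.tag
        # Attributes first, sorted by name for a stable order.
        for key, value in sorted(elem.attrib.items()):
            rows.append((here + "/@" + key, value))
        # Then the element's own text, if it is not just whitespace.
        text = (elem.text or "").strip()
        if text:
            rows.append((here, text))
        # Finally recurse into the children in document order.
        for child in elem:
            walk(child, here)

    walk(ET.fromstring(xml_text), "")
    return rows


sample = (
    '<entity id="42"><name>Acme</name>'
    "<address><city>Oslo</city></address></entity>"
)
for path, value in flatten_xml(sample):
    print(path, value)
```

A flat list like this would map naturally onto a bag of (path, value) tuples, which may be easier to filter and group than the nested structure itself.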