Apache Pig + Elephant-Bird + SequenceFiles


samee...@gmail.com

Oct 23, 2013, 2:42:24 PM
to elephant...@googlegroups.com
Hi All,
I have a large number of small XML files (roughly 2 to 3 MB each) that I would like to process. I am thinking along the following lines; please let me know if you have any thoughts.

1. Create SequenceFiles such that each one is between 60 and 64 MB and no XML file is split across two SequenceFiles.
Is it possible to use Elephant-Bird for this task?

2. Write a Pig script that loads the SequenceFiles, then iterates over the individual XML files and analyzes them.
I was planning to use Elephant-Bird to read the SequenceFiles. Any thoughts on this would be highly appreciated.
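
To make the size constraint in step 1 concrete, here is a rough sketch of how the XML files could be grouped before anything is written. It only plans the batches; it does not write SequenceFiles and does not use Elephant-Bird. The function name `plan_batches`, the `(name, size)` input shape, and the 64 MB cap are my own assumptions for illustration, and SequenceFile overhead (keys, headers, sync markers) is ignored.

```python
from typing import Iterable, List, Tuple

# Assumed upper bound per SequenceFile, taken from the 60 to 64 MB target above.
MAX_BATCH_BYTES = 64 * 1024 * 1024


def plan_batches(
    files: Iterable[Tuple[str, int]],
    max_bytes: int = MAX_BATCH_BYTES,
) -> List[List[str]]:
    """Greedily group (name, size_in_bytes) pairs into batches.

    A batch is closed as soon as the next file would push it past max_bytes,
    so a file is never split across two batches. A file larger than
    max_bytes ends up in a batch of its own.
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name, size in files:
        # Close the current batch if this file does not fit into it.
        if current and current_size + size > max_bytes:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each batch would then become one SequenceFile, for example with the file name as the key and the XML content as the value. Since every input file is at most about 3 MB, every batch except possibly the last one should come out above 61 MB of payload, which is in the range I am after.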

samee...@gmail.com

Oct 23, 2013, 2:50:15 PM
to elephant...@googlegroups.com
I also wanted to mention that each XML file describes a single entity and is quite deeply nested. Any thoughts on parsing this complex structure would also be great; most likely I will need to write a UDF.
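
For the nested structure, one option I am considering is to flatten each document into (path, value) pairs that map easily onto Pig tuples. Below is a minimal sketch of that idea using Python's standard `xml.etree.ElementTree`. The function name `flatten_xml` and the path format are hypothetical, and the element names in the usage note are made up; this is not wired into Pig or Elephant-Bird and would still need to be wrapped in a UDF or a streaming script.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def flatten_xml(xml_text: str) -> List[Tuple[str, str]]:
    """Flatten one XML document into (path, value) pairs in document order.

    Attributes are emitted as "<path>/@<attribute>" and non-empty element
    text as "<path>". Repeated sibling elements share the same path.
    """
    root = ET.fromstring(xml_text)
    rows: List[Tuple[str, str]] = []

    def walk(elem: ET.Element, path: str) -> None:
        # Attributes of the current element come first.
        for attr, value in elem.attrib.items():
            rows.append((path + "/@" + attr, value))
        # Then the element's own text, skipping whitespace-only content.
        text = (elem.text or "").strip()
        if text:
            rows.append((path, text))
        # Finally recurse into child elements, extending the path.
        for child in elem:
            walk(child, path + "/" + child.tag)

    walk(root, root.tag)
    return rows
```

For a made-up document such as `<entity id="7"><name>Acme</name></entity>`, this yields the pairs `("entity/@id", "7")` and `("entity/name", "Acme")`. Because repeated siblings share a path, an index would have to be added to the path if their order matters for the analysis.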