Hello,
I am a newbie to Hadoop, and I have a project I would like to improve using Hadoop with Pig or Hive; I do not know which one is better or faster.
The idea would be to retrieve data from several sources, each of which gives us a compressed file (tar.gz) containing several XML files.
I would like to retrieve the data and unpack it in parallel; then, using Pig, I guess I could extract the information I am interested in from the XML files.
All outputs would then need to be merged into one and loaded into an Oracle table, perhaps using Sqoop.
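For that Oracle load, a Sqoop export over the merged HDFS output might look roughly like this (the connection string, credentials, table name and directory are placeholders, not real values):

```shell
sqoop export \
  --connect jdbc:oracle:thin:@dbhost:1521:ORCL \
  --username scott --password tiger \
  --table TARGET_TABLE \
  --export-dir /user/etl/merged_output \
  --input-fields-terminated-by '\t'
```

This assumes the merged output is plain delimited text in HDFS and that the Oracle JDBC driver is available to Sqoop.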
Also, each compressed file contains several XML files, but one of them holds information that is currently used to open the other XML files, for example:
HHL0251.tar.gz :
- objects.xml -> has information of its own, and also has a field (an Id) that can be used to open 100.xml
- 100.xml
..
- 149.xml
I wonder whether it is possible to loop over objects.xml using Pig and load each referenced file (100.xml, etc.). I will also need to parse objects.xml and load it into another table.
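The "follow the Id" step can be sketched outside of Pig as well; here is a minimal Python version, assuming objects.xml contains `<object id="...">` elements whose id gives the name of the companion file (the tag and attribute names are guesses, to be adjusted to the real schema):

```python
import xml.etree.ElementTree as ET

def referenced_files(objects_xml_bytes):
    """Parse objects.xml and yield the file names its Id fields point at.

    The <object id="..."> layout is an assumption; the real schema may differ.
    """
    root = ET.fromstring(objects_xml_bytes)
    for obj in root.iter("object"):
        yield obj.get("id") + ".xml"
```

In Pig itself this relationship would more naturally be expressed as a JOIN on the Id field after loading both sides, for example with an XML loader such as Piggybank's XMLLoader, rather than as an explicit loop.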
This data will grow on a daily basis, so I would need some scalability.
Currently this Extract, Transform and Load is done with scripts: we loop over the sources we have to connect to, extract each file, parse it, and load it into the database. On each iteration we need to load the previously loaded data into memory to keep some consistency before appending the next chunk of information from the next source.
I think that using Hadoop could speed up the process, as there could be 4~5 sources, each containing more than 30 XML files of more than 50 MB each.