Hello,
I am a newbie to Hadoop, and I have a project I would like to improve using Hadoop with Pig or Hive; I do not know which one is better or faster.
The idea would be to retrieve data from several sources, each of which gives us a compressed file (tar.gz) containing several XML files.
I would like to retrieve the data and unpack it in parallel; then, using Pig, I guess I could extract the information I am interested in from the XML files.
All outputs would then need to be merged into one and loaded into an Oracle table, perhaps using Sqoop.
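For that Oracle load, a Sqoop export over the merged HDFS output might look roughly like this (the connection string, credentials, table name and directory are placeholders, not real values):

```shell
sqoop export \
  --connect jdbc:oracle:thin:@dbhost:1521:ORCL \
  --username scott --password tiger \
  --table TARGET_TABLE \
  --export-dir /user/etl/merged_output \
  --input-fields-terminated-by '\t'
```

This assumes the merged output is plain delimited text in HDFS and that the Oracle JDBC driver is available to Sqoop.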
Also, each compressed file contains several XML files, but one of them holds information that is currently used to open the other XML files, for example:
HHL0251.tar.gz :
- objects.xml -> has information of its own, and also has a field (an Id) that can be used to open 100.xml
- 100.xml
..
- 149.xml
I wonder whether it is possible to loop over objects.xml using Pig and load each referenced file (100.xml, etc.). I will also need to parse objects.xml and load it into another table.
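The "follow the Id" step can be sketched outside of Pig as well; here is a minimal Python version, assuming objects.xml contains `<object id="...">` elements whose id gives the name of the companion file (the tag and attribute names are guesses, to be adjusted to the real schema):

```python
import xml.etree.ElementTree as ET

def referenced_files(objects_xml_bytes):
    """Parse objects.xml and yield the file names its Id fields point at.

    The <object id="..."> layout is an assumption; the real schema may differ.
    """
    root = ET.fromstring(objects_xml_bytes)
    for obj in root.iter("object"):
        yield obj.get("id") + ".xml"
```

In Pig itself this relationship would more naturally be expressed as a JOIN on the Id field after loading both sides, for example with an XML loader such as Piggybank's XMLLoader, rather than as an explicit loop.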
This data will grow on a daily basis, so I would need some scalability.
Currently this Extract, Transform and Load is done with scripts: we loop over the sources we have to connect to, extract each file, parse it, and load it into the database. On each iteration we need to load the previously loaded data into memory to keep some consistency before appending the next chunk of information from the next source.
I think that using Hadoop could speed up the process, as there could be 4~5 sources, each containing more than 30 XML files of more than 50 MB each.