From: Julian Reyes
Sent: November 12, 2015 11:33:35am PST
To: cascading-user
Subject: ETL Telco XML data using Hadoop/Cascading
--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/cascading-user.
To view this discussion on the web visit https://groups.google.com/d/msgid/cascading-user/05aa56e3-3fe6-4acb-8dab-235d72ddacb7%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
From: Julian Reyes
Sent: November 13, 2015 2:03:33am PST
To: cascading-user
Subject: Re: ETL Telco XML data using Hadoop/Cascading
Hi Ken,
Thanks for your reply.
Currently we connect to external sources to get those xxx.tar.gz files. It's basically a shell script that connects to the external source, bundles the XML files (objects.xml, 100.xml, ...) with tar czf, and then SCPs the tar.gz back to our system, where we extract it into a folder with tar xzf. I guess we could then copy that folder into HDFS and start a job as soon as we get the first chunk of XML files?
Sometimes there are connectivity issues when retrieving the tar.gz from the sources. We have a function that retries up to 3 times; if it still fails, it aborts and the process is tried again later, as it's crucial to have all the files from the sources.
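To make the retry behaviour concrete, here is a minimal Java sketch of "try up to N times, then give up so the caller can reschedule". The class name RetryFetch, the method withRetries and the pause between attempts are hypothetical names chosen for illustration; the real logic lives in the shell script, and the task passed in stands in for the tar/SCP step.

```java
import java.util.concurrent.Callable;

public class RetryFetch {

    /**
     * Runs the task up to maxAttempts times and returns its first successful result.
     * If every attempt fails, the last exception is rethrown so the caller can
     * abort and schedule the whole fetch again later.
     */
    public static <T> T withRetries(Callable<T> task, int maxAttempts, long pauseMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e;
                // Pause before the next attempt, but not after the final one.
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);
                }
            }
        }
        throw last;
    }
}
```

A call such as RetryFetch.withRetries(fetchTask, 3, 5000L) would match the "up to 3 times" rule, with fetchTask being whatever wraps the actual transfer.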
The XML files are not well structured and are quite big, so we need to parse them and pull out the information we need to create the different outputs. Is it possible to parse the XML files using Cascading? I have seen quite a lot of Java examples but none in Groovy, and I don't know whether Groovy is supported.
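As an illustration of the kind of parsing meant here, below is a rough sketch in plain Java using the JDK's streaming StAX API, which reads a large file element by element instead of loading it whole. It does not use any Cascading classes. The class name XmlFieldExtractor and the element name passed in are hypothetical, and the sketch assumes the input is at least well-formed XML and that the target elements contain only text.

```java
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class XmlFieldExtractor {

    /**
     * Streams through the XML and collects the trimmed text of every element
     * whose local name equals elementName. Only the current element is held
     * in memory, so the size of the file is not a limiting factor.
     */
    public static List<String> extract(Reader in, String elementName)
            throws XMLStreamException {
        XMLStreamReader reader = XMLInputFactory.newInstance().createXMLStreamReader(in);
        List<String> values = new ArrayList<>();
        try {
            while (reader.hasNext()) {
                if (reader.next() == XMLStreamConstants.START_ELEMENT
                        && elementName.equals(reader.getLocalName())) {
                    // getElementText() reads up to the matching end tag and
                    // throws if the element contains child elements.
                    values.add(reader.getElementText().trim());
                }
            }
        } finally {
            reader.close();
        }
        return values;
    }
}
```

Something like XmlFieldExtractor.extract(new FileReader("objects.xml"), "counter") would return the text of each counter element, where "counter" is a made-up element name rather than one taken from the real files.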
From: Julian Reyes
Sent: November 14, 2015 10:25:10am PST