Hi Ivan,
Regarding Pangool usage: we are currently moving out of the proof-of-concept stage and into production for a couple of small modules, and it has been quite successful.
My task was to streamline and optimise the ingestion of raw logs into our Hive data warehouse. The original process had a lot of steps:
1. raw log files sit in the UNIX file system
2. Flume converts the raw data into structured data (CSV) and performs some decoration
3. an Oozie workflow does extra data enrichment by joining with other data sources (unfortunately requiring a reduce-side join), deduplication, group-bys, rollups, etc. Some Oozie actions were Pig scripts, others standard Java MapReduce jobs. Oozie provided the branching logic.
The problems with that process were:
1. Too many different tools and components, making it nearly impossible to automate end-to-end testing. The business logic was also spread all over the place, and finding issues could be a pain, since over the years different developers had made various undocumented modifications to the configs. There was no bird's-eye view of the process;
2. For our purposes Flume was overhead: slow, and with some unexplained issues that required too many workarounds, so everybody wanted to exclude it from the deployment. There were unit tests for the separate decorators, but not for the whole data flow;
3. The Oozie workflow.xml files were becoming too large, with too many branches, and they were not unit tested;
4. There was no single IDE which could be used to jump smoothly from one piece of code to another.
There was a temptation to put everything into Apache Pig and write the load functions, UDFs, etc., but implementing the branching logic with nested if ... else ... structures would be a pain, would still require Oozie, and would once again spread out the business logic, which is exactly what we wanted to avoid.
Doing it all in pure Java MapReduce would solve a lot of these issues:
1. all logic in one place and in one language
2. no restrictions on branching logic within the mappers and reducers
3. Main class can do the workflow branching logic (see the sketch after this list)
4. easy to unit test components and automate end-to-end testing of the workflow
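To give an idea of what I mean by branching in the Main class, here is a rough sketch; the job factory methods and the "duplicates" marker path are made-up placeholders for illustration, not our real code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    public class IngestDriver {

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("feed.name", args[0]); // parameters reach mappers/reducers via the Configuration

        if (!buildEnrichmentJob(conf, args).waitForCompletion(true)) {
          System.exit(1); // fail fast, no Oozie kill node required
        }

        // Plain Java replaces an Oozie decision node: only run deduplication
        // if the enrichment step flagged duplicates.
        FileSystem fs = FileSystem.get(conf);
        boolean hasDuplicates = fs.exists(new Path(args[1], "duplicates"));
        if (hasDuplicates && !buildDedupJob(conf, args).waitForCompletion(true)) {
          System.exit(1);
        }
        System.exit(0);
      }

      private static Job buildEnrichmentJob(Configuration conf, String[] args) throws Exception {
        Job job = Job.getInstance(conf, "enrichment");
        // real mapper/reducer/input/output wiring goes here
        return job;
      }

      private static Job buildDedupJob(Configuration conf, String[] args) throws Exception {
        Job job = Job.getInstance(conf, "dedup");
        // real wiring goes here
        return job;
      }
    }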
The downside of the pure Java approach is that it can be quite painful to implement inner joins, multiple inputs, multiple outputs, passing parameters from the main method to the mappers, and data schemas.
Pangool takes away that pain; in a way it introduces a bit of Pig-ness into Java (see the sketch after this list):
1. All data schemas are in Java classes
2. multiple inputs are a breeze
3. multiple outputs are built in
4. joins are easy
5. you are not restricted to a paradigm as with Pig or Hive; it is always possible to access the Hadoop API directly
6. It is easy to use OOP patterns with Pangool and implement data flows
7. Custom serialisation gives you the tools to work with hierarchical data structures, which are a pain in Pig and Hive.
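To make points 1, 2 and 4 concrete, here is a sketch from memory of what a reduce-side join looks like with Pangool's tuple MR API. The feeds, field names and class names are invented for illustration, and the exact package names and signatures may differ between Pangool versions:

    // Sketch from memory of the Pangool tuple MR API; exact signatures may differ per version.
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    import com.datasalt.pangool.io.Fields;
    import com.datasalt.pangool.io.ITuple;
    import com.datasalt.pangool.io.Schema;
    import com.datasalt.pangool.io.Tuple;
    import com.datasalt.pangool.tuplemr.Criteria.Order;
    import com.datasalt.pangool.tuplemr.OrderBy;
    import com.datasalt.pangool.tuplemr.TupleMRBuilder;
    import com.datasalt.pangool.tuplemr.TupleMRException;
    import com.datasalt.pangool.tuplemr.TupleMapper;
    import com.datasalt.pangool.tuplemr.TupleReducer;
    import com.datasalt.pangool.tuplemr.mapred.lib.input.HadoopInputFormat;
    import com.datasalt.pangool.tuplemr.mapred.lib.output.HadoopOutputFormat;

    public class LogEnrichJob {

      // Data schemas live in plain Java instead of Pig/Hive DDL (point 1).
      static final Schema USERS = new Schema("users", Fields.parse("user_id:string, country:string"));
      static final Schema LOGS = new Schema("logs", Fields.parse("user_id:string, url:string"));

      public static class UsersMapper extends TupleMapper<LongWritable, Text> {
        private Tuple tuple;

        public void setup(TupleMRContext context, Collector collector) throws IOException, InterruptedException {
          tuple = new Tuple(USERS);
        }

        public void map(LongWritable key, Text value, TupleMRContext context, Collector collector)
            throws IOException, InterruptedException {
          String[] f = value.toString().split(",");
          tuple.set("user_id", f[0]);
          tuple.set("country", f[1]);
          collector.write(tuple);
        }
      }

      public static class LogsMapper extends TupleMapper<LongWritable, Text> {
        private Tuple tuple;

        public void setup(TupleMRContext context, Collector collector) throws IOException, InterruptedException {
          tuple = new Tuple(LOGS);
        }

        public void map(LongWritable key, Text value, TupleMRContext context, Collector collector)
            throws IOException, InterruptedException {
          String[] f = value.toString().split(",");
          tuple.set("user_id", f[0]);
          tuple.set("url", f[1]);
          collector.write(tuple);
        }
      }

      public static class JoinReducer extends TupleReducer<Text, NullWritable> {
        public void reduce(ITuple group, Iterable<ITuple> tuples, TupleMRContext context, Collector collector)
            throws IOException, InterruptedException, TupleMRException {
          String country = "unknown";
          for (ITuple tuple : tuples) {
            if ("users".equals(tuple.getSchema().getName())) {
              country = tuple.get("country").toString(); // "users" tuples arrive first, see setOrderBy below
            } else {
              collector.write(new Text(tuple.get("user_id") + "\t" + tuple.get("url") + "\t" + country),
                  NullWritable.get());
            }
          }
        }
      }

      public static boolean run(String logsIn, String usersIn, String out, Configuration conf) throws Exception {
        TupleMRBuilder mr = new TupleMRBuilder(conf, "log-enrichment");
        mr.addIntermediateSchema(USERS);
        mr.addIntermediateSchema(LOGS);
        // Two inputs, each with its own mapper and intermediate schema (point 2).
        mr.addInput(new Path(usersIn), new HadoopInputFormat(TextInputFormat.class), new UsersMapper());
        mr.addInput(new Path(logsIn), new HadoopInputFormat(TextInputFormat.class), new LogsMapper());
        // Reduce-side join (point 4): group both schemas by the common field and
        // sort "users" tuples ahead of "logs" tuples within each group.
        mr.setGroupByFields("user_id");
        mr.setOrderBy(new OrderBy().add("user_id", Order.ASC).addSchemaOrder(Order.ASC));
        mr.setTupleReducer(new JoinReducer());
        mr.setOutput(new Path(out), new HadoopOutputFormat(TextOutputFormat.class), Text.class, NullWritable.class);
        return mr.createJob().waitForCompletion(true);
      }

      public static void main(String[] args) throws Exception {
        System.exit(run(args[0], args[1], args[2], new Configuration()) ? 0 : 1);
      }
    }

The whole join lives in one Java file, which is what makes it possible to unit test it end to end and step through it in the IDE.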
By implementing some classes to manage the temporary directories and to determine the success or failure of a MapReduce job, it was also possible to reduce the amount of logic in Oozie and make it unit testable.
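A hypothetical sketch of such a helper (the class and method names are invented, it just illustrates the idea):

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;

    public class JobStep {

      private final Configuration conf;
      private final Path tmpOutput;

      public JobStep(Configuration conf, Path tmpOutput) {
        this.conf = conf;
        this.tmpOutput = tmpOutput;
      }

      // Clears the step's temporary output directory before the job runs.
      public void prepare() throws IOException {
        FileSystem fs = tmpOutput.getFileSystem(conf);
        if (fs.exists(tmpOutput)) {
          fs.delete(tmpOutput, true);
        }
      }

      // Runs the job and reports success only if it completed and left the
      // _SUCCESS marker in its output directory. This decision used to live
      // in Oozie; here it is plain Java and can be unit tested.
      public boolean run(Job job) throws IOException, InterruptedException, ClassNotFoundException {
        prepare();
        boolean completed = job.waitForCompletion(true);
        FileSystem fs = tmpOutput.getFileSystem(conf);
        return completed && fs.exists(new Path(tmpOutput, "_SUCCESS"));
      }
    }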
We also had to write some custom input/output formats for unit testing. That would have been easier if some of the properties in the Pangool classes were not private, or if there were protected accessors.
Now when we run "mvn package", the Pangool-based workflows are unit tested, and this can be done on any machine without requiring a deployed Hadoop stack.
Detecting issues with data became quite simple: just put the input data into the workflow's unit test and debug on your dev machine :D. If the input is too large, having error codes that can be traced directly to the Java code, plus the tuple/string that caused the problem, is a life saver.
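As an example, a minimal end-to-end test of the hypothetical LogEnrichJob from the earlier sketch could look like this; with Hadoop's default local job runner the whole workflow executes inside the test JVM, so no cluster is needed:

    import static org.junit.Assert.assertTrue;

    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;

    import org.apache.hadoop.conf.Configuration;
    import org.junit.Test;

    public class LogEnrichJobTest {

      @Test
      public void joinsLogsWithUsers() throws Exception {
        File work = Files.createTempDirectory("log-enrich-test").toFile();
        File logs = new File(work, "logs.csv");
        File users = new File(work, "users.csv");
        File out = new File(work, "out");

        Files.write(logs.toPath(), "u1,/home\n".getBytes(StandardCharsets.UTF_8));
        Files.write(users.toPath(), "u1,NL\n".getBytes(StandardCharsets.UTF_8));

        assertTrue("workflow should succeed",
            LogEnrichJob.run(logs.getPath(), users.getPath(), out.getPath(), new Configuration()));
        assertTrue("reducer output should exist", new File(out, "part-r-00000").exists());
      }
    }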
Maybe later I will share some of the patterns I came up with, which simplified the implementation of new data feeds and the workflow management in Java.
Thanks,
Alexei