Hi all,
LiveRamp is a very heavy user of Cascading (over 100k flows/day), and we've built a lot of tooling for our data processing applications. One of the things we've built and recently open-sourced (it's been used internally for years) is Workflow2, a data pipeline orchestrator:
https://github.com/liveramp/workflow2
One of the most useful features of this framework for us is its tight Cascading integration (https://github.com/liveramp/workflow2#hadoop-integration). We wanted to make it easy for devs to write Cascading workflows, and we wanted to use metrics from those jobs (primarily MapReduce counters) to suggest how to tune applications for better performance.
We open-sourced this project hoping other users in the Cascading space would find this useful. Some cautions, since this is a newly public project:
- This is built around the Cascading 2.x API; we have not yet tried to upgrade to Cascading 3.
- The Cascading integration is built only for the Hadoop Cascading runner. We haven't tried to use any other runner.
- We've done our best to strip out assumptions about our environment, but I'm sure we missed a couple of things.
Hopefully this is interesting to some of you even with those caveats.
Thanks,
Ben