Cascading and Oozie


Hadoop Inquirer

Apr 18, 2013, 4:57:59 PM
to cascadi...@googlegroups.com
Hi,

We are trying to decide upon a workflow engine to use as a primary architecture for scheduling (with time and data dependencies) and gluing together a variety of Hive, Pig, custom Java M/R, Mahout, Sqoop etc. jobs that also allows us to integrate ETL-type workflows in and out of HDFS/HBase/Hive from/into RDBMS, Netezza, Teradata, Solr, Tableau, etc.

Seems like Cascading / Scalding and Oozie are both alternatives, but we doubt that either has robust scheduling capabilities, and both seem to lack a solid graphical user interface that would let us easily configure, manage, and monitor hundreds of such jobs.

Also, it seems like Cascading's connectors to HBase and Solr have seen 0 commits in the last 2 years. Is there still active development and support for such sources/sinks?

If someone could share their experience in such a scenario, that would be great.

Thanks in advance,
H.I.


Oscar Boykin

Apr 18, 2013, 7:38:59 PM
to cascadi...@googlegroups.com
We do use HBase with Scalding, but we use the tap in this repo:


Works fine for us.





--
You received this message because you are subscribed to the Google Groups "cascading-user" group.
To unsubscribe from this group and stop receiving emails from it, send an email to cascading-use...@googlegroups.com.
To post to this group, send email to cascadi...@googlegroups.com.
Visit this group at http://groups.google.com/group/cascading-user?hl=en.
For more options, visit https://groups.google.com/groups/opt_out.
 
 



--
Oscar Boykin :: @posco :: http://twitter.com/posco

Chris K Wensel

Apr 18, 2013, 11:45:49 PM
to cascadi...@googlegroups.com
for the level of effort of gluing all that stuff together and making it something I would bet my business on, I'd just ignore all of those apps and just use the different parts of the Cascading tool chain. 

with Cascading, Cascalog, Scalding, and Lingual (ANSI-SQL + JDBC Driver), I can't imagine much more that you would need other than Talend (or cron) for time-based submission (Nathan, the originator of Cascalog, wrote a Sqoop replacement called db-migrate in hours with Cascading).

this way you are just debugging one thing (Cascading), not two dozen things. ok two things, Hadoop and Cascading.

besides, if you are gluing all that stuff together, is xml really the syntax (xml isn't a language) you would want to do that in? if so, i'd probably just use Ant, it's been around longer.. grin

ckw


Alex Dean

Apr 19, 2013, 5:07:46 AM
to cascadi...@googlegroups.com
Agree with Chris - not sure what value something like Oozie adds. At Snowplow we script our Cascading (Scalding) jobs in Ruby using Elasticity, which I can't recommend enough:


If you want to script/schedule 100s of jobs using a graphical UI, then take a look at Chronos - a scheduling tool (non-MR-specific) that Airbnb open-sourced for this exact purpose:


A

Dean Wampler

Apr 19, 2013, 8:50:45 AM
to cascadi...@googlegroups.com
For what it's worth (not a whole lot ;), if you go the bash route, you might look at my little weekend project earlier this year, Stampede, which provides some bash scripts to help implement workflows driven by make.


dean
Dean Wampler, Ph.D.
@deanwampler
http://polyglotprogramming.com

Ken Krugler

Apr 19, 2013, 12:22:41 PM
to cascadi...@googlegroups.com
Hi there,

On Apr 18, 2013, at 1:57pm, Hadoop Inquirer wrote:

> Hi,
>
> We are trying to decide upon a workflow engine to use as a primary architecture for scheduling (with time and data dependencies) and gluing together a variety of Hive, Pig, custom Java M/R, Mahout, Sqoop etc. jobs that also allows us to integrate ETL-type workflows in and out of HDFS/HBase/Hive from/into RDBMS, Netezza, Teradata, Solr, Tableau, etc.
>
> Seems like Cascading / Scalding and Oozie are both alternatives, but we doubt that either has robust scheduling capabilities, and both seem to lack a solid graphical user interface that would let us easily configure, manage, and monitor hundreds of such jobs.
>
> Also, it seems like Cascading's connectors to HBase and Solr have seen 0 commits in the last 2 years. Is there still active development and support for such sources/sinks?

Don't know about HBase, but about a month ago I updated cascading.solr for Solr 4.1 (see https://github.com/ScaleUnlimited/cascading.solr).

> If someone could share their experience in such a scenario, that would be great.

We've created workflows that combine Sqoop and Cascading, using a driver app written in Java.

This gets run using our client's internal cron-like system, but that could essentially be anything that's capable of firing off commands and reporting results returned via stdout/stderr.

It processes upwards of 1T DB records, so it certainly scales :)

And they've been using it for the past two years.
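
As a rough illustration of the driver pattern described above (a sketch only, not the actual client code; it is in Python rather than Java for brevity, and `run_step`, `run_workflow`, and `EXAMPLE_STEPS` are hypothetical names), a driver can run each step as an external command, pass its stdout/stderr through, and stop at the first failure so the scheduler sees a non-zero exit code:

```python
import subprocess
import sys


def run_step(command):
    """Run one external command and return (exit_code, stdout, stderr)."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout, result.stderr


def run_workflow(steps, out=sys.stdout, err=sys.stderr):
    """Run (name, command) steps in order, stopping at the first failure.

    Returns 0 if every step succeeded, otherwise the failing step's exit
    code, so a cron-like scheduler can tell success from failure.
    """
    for name, command in steps:
        code, stdout, stderr = run_step(command)
        # Pass the step's own output through so the scheduler can log it.
        out.write(stdout)
        err.write(stderr)
        if code != 0:
            err.write(f"step '{name}' failed with exit code {code}\n")
            return code
        out.write(f"step '{name}' succeeded\n")
    return 0


# Placeholder steps (assumption): a real workflow would launch something like
# a Sqoop import followed by a Cascading job here. Harmless Python one-liners
# stand in for them so the sketch runs anywhere.
EXAMPLE_STEPS = [
    ("import", [sys.executable, "-c", "print('rows imported')"]),
    ("transform", [sys.executable, "-c", "print('flow complete')"]),
]
```

A wrapper script would end with `sys.exit(run_workflow(EXAMPLE_STEPS))`, which is all a cron-like system needs: one command to fire, output on stdout/stderr, and an exit code to check.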

We've also connected Cascading jobs with Mahout jobs, using the MapReduceFlow class.

HTH,

-- Ken

--------------------------
Ken Krugler
custom big data solutions & training
Hadoop, Cascading, Cassandra & Solr

Hadoop Inquirer

Apr 19, 2013, 4:49:46 PM
to cascadi...@googlegroups.com
Hi guys,

Thanks a lot for your VERY helpful replies and perspectives. Really appreciate it.

We can be much more confident of our choices now!

Regards,
H.I.



--
You received this message because you are subscribed to a topic in the Google Groups "cascading-user" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/cascading-user/ZbknHKo8Z3I/unsubscribe?hl=en.
To unsubscribe from this group and all its topics, send an email to cascading-use...@googlegroups.com.

Danielle Felder

May 8, 2018, 4:53:29 AM
to cascading-user
If you are still looking, you might find real user reviews for a variety of scheduling tools on IT Central Station to be helpful.

As an example, users interested in Cascading or Oozie also read reviews for Automic Workload Automation. This user writes, "There's a lot of flexibility because the product allows you to do many tasks, in multiple ways, so you can choose the way that works best for your environment." You can read the rest of his review here.

I hope this is helpful.