Scalding vs Cascading


Daniel Yanos

Apr 15, 2015, 9:52:19 AM
to cascadi...@googlegroups.com
I'm interested in understanding how Scalding compares to Cascading. Specifically, does Scalding offer only a subset of the functionality available in Cascading, or do the two have exactly the same set of functionality?

Thanks, 

- dan

Oscar Boykin

Apr 15, 2015, 3:45:09 PM
to cascadi...@googlegroups.com
Scalding is arguably a superset of Cascading, because you can use Cascading within Scalding jobs without any real penalty.

What Scalding mostly adds is an easy way to do the most common things, so the amount of code is much smaller. It also leverages the idioms already present in Scala (and functional programming in general), so your code feels like you are just working with collections, with most method names coming from the collection methods in the standard API. Learning Scalding therefore carries over to plain Scala and Spark, which all feel very similar.
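To make the "feels like collections" point concrete, here is a minimal sketch of a word count written against plain Scala collections only, so it runs without Scalding on the classpath. The object name and the tokenize helper are made up for illustration. In Scalding's typed API the same shape is usually expressed with flatMap and groupBy on a TypedPipe, though the exact job wiring (sources, sinks, Args) is not shown here.

```scala
// Plain Scala collections only; no Scalding dependency is needed to run this.
object CollectionStyleWordCount {
  // Hypothetical helper: split a line into lowercase words, dropping empties.
  def tokenize(line: String): Seq[String] =
    line.toLowerCase.split("\\s+").filter(_.nonEmpty).toSeq

  // flatMap + groupBy, the same method names the standard collections use.
  def wordCount(lines: Seq[String]): Map[String, Long] =
    lines
      .flatMap(tokenize)
      .groupBy(identity)
      .map { case (word, occurrences) => word -> occurrences.size.toLong }
}
```

Calling CollectionStyleWordCount.wordCount(Seq("a b a", "b c")) counts "a" and "b" twice and "c" once.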

In Scalding's typed API (see TypedPipe[T]: http://twitter.github.io/scalding/), there is a lot of added value:

1) It is type-safe, so when jobs compile, they almost always run to completion. Whether your logic is what you intend is up to you, but you won't get class cast exceptions, and the compiler will give you hints when you've made an error. This is the standard at Twitter now, and we are in the process of migrating our last Pig jobs to this API.

2) The typed API can do a lot of "compositions" for you, so you don't have to think hard about how to fit Cascading constructs together in a way that makes sense and is fast. If it compiles, it is going to make sense and run. This allows us to do things like automatically composing reduce operations with joins, which reduces the number of map/reduce steps and is almost always a win. This is an example of where a good type system can help, because we encode the rules of how things fit together in the types. Scala's type system is considerably more powerful than Java's, so it helps here.

3) We have a type to model a flow, called Execution[T]. This means you run some flow (or flows) and get some result of type T. Because Executions can be composed in interesting and powerful ways, this type lets us loop in a really clean way, which helps when implementing machine learning algorithms that often need this kind of looping. See an example of k-means: https://twitter.com/argyris/status/525423799425335297
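The looping idea in point 3 can be sketched without Scalding. The Step[T] type below is a toy stand-in written for this example, not Scalding's Execution[T]. It only mimics the general shape of a deferred computation with map and flatMap, and the 1-D k-means is a deliberately tiny illustration rather than the linked implementation. All names here (Step, KMeans1D, update, loop) are assumptions.

```scala
// Toy stand-in for a deferred computation; NOT Scalding's Execution[T].
final case class Step[T](run: () => T) {
  def map[U](f: T => U): Step[U] = Step(() => f(run()))
  def flatMap[U](f: T => Step[U]): Step[U] = Step(() => f(run()).run())
}

object Step {
  // Wrap a value lazily; nothing is evaluated until run() is called.
  def from[T](value: => T): Step[T] = Step(() => value)
}

object KMeans1D {
  // One pass: assign each point to its nearest centroid, then recompute means.
  // A centroid that attracts no points is left where it is.
  def update(points: Seq[Double], centroids: Seq[Double]): Step[Seq[Double]] =
    Step.from {
      val clusters = points.groupBy(p => centroids.minBy(c => math.abs(c - p)))
      centroids.map { c =>
        clusters.get(c).map(ps => ps.sum / ps.size).getOrElse(c)
      }
    }

  // Loop by composition: each step decides whether to schedule another one.
  def loop(points: Seq[Double], centroids: Seq[Double], maxIters: Int): Step[Seq[Double]] =
    if (maxIters <= 0) Step.from(centroids)
    else update(points, centroids).flatMap { next =>
      if (next == centroids) Step.from(next) // converged, stop here
      else loop(points, next, maxIters - 1)
    }
}
```

Nothing runs until run() is invoked on the final Step, for example KMeans1D.loop(points, initialCentroids, 10).run(). In a real job each update would presumably be a flow over TypedPipes rather than an in-memory pass.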

--
Oscar Boykin :: @posco :: http://twitter.com/posco