cascading success story: hopefully :-)


bradford cross

Aug 14, 2009, 3:26:04 PM
to cascadi...@googlegroups.com
Reposted from the Clojure group because I thought folks here might be interested in what we have done with Cascading.

We have just released flightcaster.com, which uses statistical inference and machine learning to predict flight delays ahead of the airlines (initial results appear to do so with 85-90% accuracy).

The webserver and webapp are all Rails running on the Heroku platform, which also serves our BlackBerry and iPhone apps.

The research and data heavy lifting is all in Clojure.

Distributed data mining is done via a custom layer on top of Cascading (which is itself a layer on top of Hadoop). Everything runs on EC2 and S3 using the very nice Cloudera AMIs and deployment scripts.

In addition to the machine learning, the layer atop Cascading performs all the complex data filtering and transformation operations, including distributed joins across heterogeneous data sources and transformations into a time series view. That view is fed to the machine learning computations, which are rolled into mappers and reducers. Remember, this is data from airlines and the FAA; it is not pretty. Web data is messy, but we have lots of good frameworks, libs and sanitizers for web data.
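
To make the join and time series steps concrete, here is a minimal single-machine sketch in Python (not our Clojure code, and not the Cascading API). The field names `flight`, `t`, `status` and `airport`, and the sample rows, are made up for illustration; in a real workflow the grouping would happen in the shuffle between map and reduce rather than in an in-memory dict.

```python
from collections import defaultdict

def join_by_key(left, right, key):
    """Inner-join two lists of dicts on `key` (a reduce-side join in miniature)."""
    grouped = defaultdict(list)
    for row in right:
        grouped[row[key]].append(row)
    # Emit one merged row for every matching (left, right) pair.
    return [{**lrow, **rrow} for lrow in left for rrow in grouped.get(lrow[key], [])]

def to_time_series(rows, key, time_field):
    """Group rows by `key` and order each group by `time_field`."""
    series = defaultdict(list)
    for row in rows:
        series[row[key]].append(row)
    return {k: sorted(v, key=lambda r: r[time_field]) for k, v in series.items()}

# Hypothetical records from two differently shaped sources.
airline_updates = [
    {"flight": "XX100", "t": 2, "status": "delayed"},
    {"flight": "XX100", "t": 1, "status": "on time"},
]
faa_rows = [{"flight": "XX100", "airport": "AAA"}]

joined = join_by_key(airline_updates, faa_rows, "flight")
series = to_time_series(joined, "flight", "t")
statuses = [row["status"] for row in series["XX100"]]
```

Each flight ends up with its updates in time order, each carrying the fields from both sources.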

We wrapped Cascading in a thin layer that wraps Clojure functions in Cascading function objects and injects them into individual steps in the workflows. This gets us very close to normal function composition for the client code. Ultimately, we want to compose Cascading workflows in the same way as we would do vanilla function composition for small test runs on our local machines. This is an execution-agnostic programming model: client code doesn't bear the signs of distributed execution.
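
A rough sketch of the execution-agnostic idea, in Python rather than Clojure. The two runner classes, the step functions and the 15-minute threshold are all hypothetical; `StepwiseRunner` only imitates an engine that runs each function as its own workflow step and does not use Cascading at all.

```python
from functools import reduce

def compose_steps(*steps):
    """Compose per-record functions left to right into a single function."""
    return lambda record: reduce(lambda acc, step: step(acc), steps, record)

class LocalRunner:
    """Stand-in for a small test run: plain function composition."""
    def run(self, steps, records):
        pipeline = compose_steps(*steps)
        return [pipeline(r) for r in records]

class StepwiseRunner:
    """Stand-in for a workflow engine: each function becomes its own step."""
    def run(self, steps, records):
        data = list(records)
        for step in steps:
            # A distributed engine would run this pass as a map phase.
            data = [step(r) for r in data]
        return data

# Client code: ordinary functions with no sign of how they will be executed.
def parse_delay(record):
    return {**record, "delay_min": int(record["delay_min"])}

def flag_late(record):
    return {**record, "late": record["delay_min"] >= 15}  # threshold is made up

steps = [parse_delay, flag_late]
records = [
    {"flight": "XX100", "delay_min": "5"},
    {"flight": "XX200", "delay_min": "42"},
]

local_result = LocalRunner().run(steps, records)
stepwise_result = StepwiseRunner().run(steps, records)
```

The same list of functions goes to either runner unchanged, and both produce the same records.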

As a beneficial side effect, we found that this model forces us to have more fine-grained abstractions, because each operation must ultimately be injectable into a map-reduce phase; otherwise your parallelism will be unnecessarily coarse-grained. This steers us clear of monolithic uber-expressions.

Another aspect of the design that allows us to do this is that the data transformations write out Clojure data structure literals, so we are entirely insulated from the normal Hadoop input/output formats. The wrapper layer just uses the normal Clojure reader to read in the strings from Hadoop and applies the vanilla Clojure functions to the data structures. But we are not limited to Clojure data structure literals. We also inject other readers that can read other strings into Clojure data structures; for example, we use Dan Larkin's wonderful JSON lib for the initial reads of the raw JSON data we store.
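
The reader injection can be sketched the same way. This is a Python analogy, not our code: `ast.literal_eval` stands in for the Clojure reader, `repr` for writing a data structure literal, and `json.loads` for the injected JSON reader. The record shape and the `total_delay` function are hypothetical.

```python
import ast
import json

def apply_to_lines(reader, fn, lines):
    """Parse each text line with the injected `reader`, then apply the plain function `fn`."""
    return [fn(reader(line)) for line in lines]

def total_delay(record):
    """An ordinary function over a data structure; it knows nothing about I/O formats."""
    return sum(record["delays"])

# Intermediate data is written as a language-level literal, one record per line...
record = {"flight": "XX100", "delays": [5, 10]}
literal_lines = [repr(record)]
# ...while the raw input arrives as JSON.
json_lines = ['{"flight": "XX200", "delays": [20, 22]}']

from_literals = apply_to_lines(ast.literal_eval, total_delay, literal_lines)
from_json = apply_to_lines(json.loads, total_delay, json_lines)
```

Only the reader changes between the two calls; the function applied to the data is the same.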

All the analytical code is custom, so we don't use many 3rd party libs outside of Cascading, Hadoop, and the invaluable jets3t for working with S3. Oh, and of course, since we do so much with temporal analysis, joda-time is the only way to work with dates in a sane way on the JVM. :-)

If you travel a lot, check us out: flightcaster.com ... we have iPhone and BlackBerry apps.  Unfortunately this is domestic US air travel only at the moment, due to the difficulty of obtaining data for international carriers and aviation agencies.

Chris K Wensel

Aug 14, 2009, 3:39:59 PM
to cascadi...@googlegroups.com
Simply Brilliant.

ckw