> Repub'd from clojure group because I thought folks might be
> interested in what we have done with cascading.
> We have just released flightcaster.com which uses statistical
> inference and machine learning to predict flight delays in advance
> of the airlines (initial results suggest 85-90% accuracy).
> The webserver and webapp are all rails running on the Heroku
> platform, which also serves our blackberry and iphone apps.
> The research and data heavy lifting is all in Clojure.
> Distributed data mining is done via a custom layer on top of
> cascading (which is a layer on top of hadoop). It all runs on EC2 and
> S3 using the very nice cloudera AMIs and deployment scripts.
> In addition to the machine learning, the layer atop cascading
> performs all the complex data filtering and transformation
> operations, including distributed joins from heterogeneous data
> sources and transformations into a time series view that is fed to
> the machine learning computations that are rolled into mappers and
> reducers. Remember, this is data from airlines and the FAA; it is
> not pretty. Web data is messy too, but we have lots of good frameworks,
> libs and sanitizers for web data.
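
The join-then-time-series step described above can be illustrated with a reduce-side join, a standard map-reduce technique. This is only a sketch in Python of the general idea, not FlightCaster's layer or Cascading's API; the record fields (`flight`, `ts`, `status`, `ground_delay`) and the function names are hypothetical.

```python
from collections import defaultdict

def map_tagged(source, records):
    """Map phase: emit (join key, (source tag, record)) pairs."""
    for rec in records:
        yield rec["flight"], (source, rec)

def reduce_to_series(pairs):
    """Shuffle plus reduce: group by key, then order each group by time."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return {
        key: sorted(items, key=lambda item: item[1]["ts"])
        for key, items in groups.items()
    }

# Hypothetical records from two differently shaped sources.
airline = [{"flight": "AA100", "ts": 2, "status": "boarding"}]
faa = [{"flight": "AA100", "ts": 1, "ground_delay": True},
       {"flight": "UA7", "ts": 3, "ground_delay": False}]

pairs = list(map_tagged("airline", airline)) + list(map_tagged("faa", faa))
series = reduce_to_series(pairs)
```

Each value in `series` is one flight's events from both sources in timestamp order, which is the shape a downstream time-series computation would consume.
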
> We wrapped cascading in a thin layer that we use to wrap clojure
> functions in the cascading function objects and inject those into
> individual steps in the workflows. This gets us very close to
> normal function composition for the client code. Ultimately, we
> want to be able to do normal function composition to compose
> cascading workflows in the same way as we would do vanilla
> function composition for small test runs on our local machines.
> This is an execution agnostic programming model; client code doesn't
> bear the signs of distributed execution.
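
To make the "execution agnostic" idea concrete, here is a minimal sketch in Python rather than Clojure. It is not the wrapper layer from the post: `run_local`, `run_staged`, the record fields, and the 15-minute threshold are all assumptions made for illustration. The point is only that the same list of plain functions can be composed for a local test run or handed to a runner that turns each one into its own stage.

```python
from functools import reduce

def compose(*steps):
    """Compose record-level functions left to right into one function."""
    return lambda record: reduce(lambda acc, step: step(acc), steps, record)

# Hypothetical record-level steps: plain functions with no knowledge of
# how or where they will be executed.
def parse_delay(record):
    return {**record, "delay_min": int(record["delay_min"])}

def flag_late(record):
    # The 15-minute threshold is an assumption for this example.
    return {**record, "late": record["delay_min"] >= 15}

def run_local(steps, records):
    """Small test run: ordinary function composition over a list."""
    pipeline = compose(*steps)
    return [pipeline(r) for r in records]

def run_staged(steps, partitions):
    """Stand-in for a distributed runner: each step becomes its own map
    stage applied partition by partition (no real cluster involved)."""
    for step in steps:
        partitions = [[step(r) for r in part] for part in partitions]
    return [r for part in partitions for r in part]

steps = [parse_delay, flag_late]
records = [{"flight": "AA100", "delay_min": "42"},
           {"flight": "UA7", "delay_min": "3"}]
local = run_local(steps, records)
staged = run_staged(steps, [records[:1], records[1:]])
```

Both runners produce the same records, and neither `parse_delay` nor `flag_late` contains anything specific to how it is executed.
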
> As a beneficial side effect, we found that this model forces us to
> have more fine-grained abstractions, because each operation must
> ultimately be injectable into a map-reduce phase; otherwise your
> parallelism will be unnecessarily coarse-grained. This steers us
> clear of monolithic uber-expressions.
> Another aspect of the design that allows us to do this is that the
> data transformations write out clojure data structure literals, so
> we are entirely insulated from the normal hadoop input/output
> formats...the wrapper layer just uses the normal clojure reader to
> read in the strings from hadoop and apply the vanilla clojure
> functions to the data structures. But we are not limited to only
> clojure data structure literals. We also inject other readers that
> can parse other strings into clojure data structures; for example, we
> use Dan Larkin's wonderful json lib for the initial reads of the raw
> json data we store.
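
The reader mechanism can be sketched as follows, again in Python and only as an analogy: `ast.literal_eval` stands in for the Clojure reader, `json.loads` for the JSON library, and `READERS`, `apply_to_lines`, and the record fields are hypothetical names. Each stage reads one string per record, applies an ordinary function, and writes the result back out as a data literal that the next stage can read.

```python
import ast
import json

# Pluggable readers: each turns one line of text into a data structure.
READERS = {
    "literal": ast.literal_eval,  # data structure literals
    "json": json.loads,           # raw JSON input
}

def apply_to_lines(fn, lines, reader="literal"):
    """Read each line, apply a plain function, write the result as a literal."""
    read = READERS[reader]
    return [repr(fn(read(line))) for line in lines]

def add_total(rec):
    return {**rec, "total": rec["dep_delay"] + rec["arr_delay"]}

# First stage reads raw JSON; later stages read the literals written before.
raw_json = ['{"dep_delay": 10, "arr_delay": 5}']
stage1 = apply_to_lines(add_total, raw_json, reader="json")
stage2 = apply_to_lines(lambda rec: rec["total"], stage1)
```

Because every stage exchanges plain text lines, the functions themselves never touch the storage format; only the choice of reader changes.
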
> All the analytical code is custom, so we don't use many 3rd party
> libs outside of cascading, hadoop, and the invaluable jets3t for
> working with s3. Oh, and of course, since we do so much temporal
> analysis, joda-time is the only way to work with dates in a sane
> way on the jvm. :-)
> If you travel a lot, check us out: flightcaster.com ... we have
> iphone and blackberry apps. Unfortunately this is domestic US air
> travel only at the moment due to the difficulty of obtaining data
> for international carriers and aviation agencies.