Repub'd from clojure group because I thought folks might be interested in what we have done with cascading.
We have just released
flightcaster.com
which uses statistical inference and machine learning to predict flight
delays in advance of airlines (initial results appear to do so with 85
- 90 % accuracy.)
The
webserver and
webapp are all rails running on the
Heroku platform, which also serves our blackberry and
iphone apps.
The
research and data heavy lifting is all in
Clojure.
Distributed data mining is done via a
custom layer on top of cascading (which is a layer on top of
hadoop.)
All run on EC2 and S3 using the very nice
cloudera AMIs and deployment scripts.
In addition to the machine learning, the layer atop cascading
performs all the complex data filtering and transformation
operations; including distributed joins from heterogeneous data sources
and transformations into a time series view that is fed to the machine
learning computations that are rolled into mappers and reducers.
Remember, this is data from airlines and the FAA, it is not pretty.
Web data is messy but we have lots of good frameworks, libs and
sanitizers for web data.
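To give a flavor of the join-into-time-series reshaping (purely illustrative - these fns and the tiny records are made up, and the real thing runs as cascading steps, not in-memory):

```clojure
;; Illustrative only: join two heterogeneous sources on flight id,
;; then roll the merged records into a time-ordered view of the kind
;; that gets fed to the learning steps.
(defn join-on [k xs ys]
  (let [idx (into {} (map (juxt k identity) ys))]
    (keep (fn [x]
            (when-let [y (idx (k x))]
              (merge x y)))
          xs)))

(defn time-series [events]
  (sort-by :timestamp events))

(def faa-events      [{:flight "UA123" :timestamp 2 :wx :storm}])
(def airline-records [{:flight "UA123" :delay 42}])

(time-series (join-on :flight faa-events airline-records))
;; => ({:flight "UA123", :timestamp 2, :wx :storm, :delay 42})
```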
We wrapped cascading in a thin layer that we use to wrap
clojure functions in the cascading function objects and inject those into individual steps in the
workflows.
This gets us very close to normal function composition for the client
code. Ultimately, we want to be able to do normal function composition
to compose cascading
workflows
in the same way as we would do vanilla function composition for
small test runs on our local machines. This is an execution agnostic
programming model; client code doesn't bear the signs of distributed
execution.
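In case it helps make that concrete, here's a toy sketch of what "execution agnostic" buys you; the names (run-local, wrap-step) are hypothetical, not our actual wrapper API:

```clojure
;; Toy sketch only - run-local and wrap-step are hypothetical names,
;; not the real wrapper API.
(defn parse-delay [record]
  (update-in record [:delay-minutes] #(Integer/parseInt %)))

(defn late? [record]
  (> (:delay-minutes record) 15))

;; Small local test run: plain function composition over a seq.
(defn run-local [records]
  (filter late? (map parse-delay records)))

;; In the distributed case, the same fns get wrapped in cascading
;; Function objects and injected into workflow steps; here we just
;; sketch that shape as data.
(defn wrap-step [f]
  {:type :map-step :fn f})

(def workflow [(wrap-step parse-delay) (wrap-step late?)])

(run-local [{:flight "UA123" :delay-minutes "42"}
            {:flight "DL9"   :delay-minutes "3"}])
;; => ({:flight "UA123", :delay-minutes 42})
```

The client code above never mentions hadoop at all - that's the point.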
As a beneficial side effect, we found that this
model forces us to have more fine grained abstractions - because each
operation must ultimately be
injectable into a map-reduce phase; otherwise your
parallelism will be unnecessarily coarse-grained. This steers us clear of monolithic
uber-expressions.
Another aspect of the design that allows us to do this is that the data transformations write out
clojure data structure literals, so we are entirely insulated from the normal
hadoop input/output formats...the wrapper layer just uses the normal
clojure reader to read in the strings from
hadoop and apply the vanilla
clojure functions to the data structures. But we are not limited to only
clojure data structure literals. We also inject other readers that can read other strings to
clojure data structures; for example, we use Dan
Larkin's wonderful
json lib for the initial reads of the raw
json data we store.
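A rough sketch of the reader-injection idea (names hypothetical; the actual json reader is Dan Larkin's lib, which isn't shown here):

```clojure
;; Hypothetical sketch of the pluggable-reader layer. hadoop hands the
;; wrapper plain strings; a (String -> data) reader fn turns each one
;; into a clojure data structure before the vanilla fns see it.
(def literal-reader
  ;; default path: clojure data-structure literals via the reader
  read-string)

;; For raw json sources, a reader built on a json lib (we use Dan
;; Larkin's) slots in the same way - any String -> data fn works.
(defn decode-lines [reader-fn lines]
  (map reader-fn lines))

(first (decode-lines literal-reader ["{:flight \"UA123\" :delay 42}"]))
;; => {:flight "UA123", :delay 42}
```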
All the analytical code is custom, so we don't use many 3rd party libs outside of cascading,
hadoop, and the invaluable jets3t for working with s3. Oh, and of course - since we do so much with temporal analysis -
joda-time is the only way to work with dates in a sane way on the
jvm. :-)
If you travel a lot, check us out:
flightcaster.com ... we have
iphone
and blackberry apps. Unfortunately this is domestic US air travel only
at the moment due to the difficulty of obtaining data for
international carriers and aviation agencies.