Haskell for BigData


Andrei Varanovich Mar 16, 2012 1:01 AM
Posted in group: parallel-haskell
Hello all,

Haskell brings us a very rich toolkit for all kinds of concurrency and
parallelism, including classic thread-based concurrency, Data Parallel
Haskell (DPH) and, recently, Cloud Haskell[2].
However, if we look at this ecosystem from the BigData
perspective[1] (i.e. distributed parallel computing), the following
components are missing:

* Integration with a distributed file system, such as HDFS (Hadoop
Distributed File System[5]). That would allow distributed
computations to be performed on distributed data.

* A data aggregation framework on top of it (I would not call it a
MapReduce framework, just because in Haskell we'd definitely expect a
richer set of primitives).

The closest existing examples are Hadoop[3] and DryadLINQ[4].

I was thinking about writing a Google Summer of Code project proposal; is
there anybody in this group potentially interested in mentoring?

Currently I see the project scope as follows:
1. Build Haskell APIs for HDFS. This project can serve as
inspiration: https://github.com/kim/hdfs-haskell Basically it's a
binding to the native libhdfs.
2. Use Cloud Haskell primitives to build an execution plan for
distributed data aggregation. This requires some research; for example,
DPH could be used to parallelize local computations on a single node.
3. Build a high-level API (such as map / reduce) whose jobs are
automatically load-balanced and distributed across the cluster.
4. Run performance benchmarks and compare the results with Hadoop/Dryad.
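
To make item 3 a bit more concrete, here is a rough, purely local sketch
of what the user-facing side of such an API might look like. Everything
in it (Job, partitionInput, shuffle, runJob, wordCount) is a hypothetical
name I made up for illustration; it uses only the base library and does
not call Cloud Haskell or HDFS. The chunks produced by partitionInput
merely stand in for cluster nodes, so this models the data flow, not the
actual distribution or load balancing.

```haskell
import Data.List (sortBy)
import Data.Ord (comparing)

-- | Hypothetical job description: the mapper emits key/value pairs for
-- one input record, the reducer folds all values that share a key.
data Job a k v r = Job
  { mapper  :: a -> [(k, v)]
  , reducer :: k -> [v] -> r
  }

-- | Split the input into at most n roughly equal chunks.
-- Each chunk stands in for the share of data held by one node.
partitionInput :: Int -> [a] -> [[a]]
partitionInput n xs = go xs
  where
    nodes = max 1 n
    size  = max 1 ((length xs + nodes - 1) `div` nodes)
    go [] = []
    go ys = let (chunk, rest) = splitAt size ys in chunk : go rest

-- | The "shuffle" phase: sort by key, then collect the values per key.
-- sortBy is stable, so values keep their original relative order.
shuffle :: Ord k => [(k, v)] -> [(k, [v])]
shuffle = foldr step [] . sortBy (comparing fst)
  where
    step (k, v) ((k', vs) : rest) | k == k' = (k, v : vs) : rest
    step (k, v) acc                         = (k, [v]) : acc

-- | Run a job locally: map every chunk, shuffle, then reduce per key.
-- In a real system the per-chunk mapping would run on remote nodes.
runJob :: Ord k => Int -> Job a k v r -> [a] -> [(k, r)]
runJob n job input =
  [ (k, reducer job k vs) | (k, vs) <- shuffle mapped ]
  where
    mapped = concatMap (concatMap (mapper job)) (partitionInput n input)

-- | Example job: count word occurrences across lines of text.
wordCount :: Job String String Int Int
wordCount = Job
  { mapper  = \line -> [ (w, 1) | w <- words line ]
  , reducer = \_ counts -> sum counts
  }
```

For example, runJob 2 wordCount ["a b a", "b c"] groups the emitted pairs
by word and sums them per key. A richer set of primitives than plain
map / reduce would presumably mean more combinators over Job-like values,
but that part is exactly what the research in item 2 would have to decide.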

Thanks,
Andrei

[1] http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation
[2] https://github.com/jepst/CloudHaskell
[3] http://hadoop.apache.org/
[4] http://research.microsoft.com/en-us/projects/dryadlinq/
[5] http://hadoop.apache.org/hdfs/