Haskell for BigData


Andrei Varanovich

Mar 16, 2012, 4:01:48 AM
to parallel-haskell
Hello all,

Haskell gives us a very rich toolkit for all kinds of concurrency and
parallelism, including classic thread-based concurrency, Data Parallel
Haskell (DPH) and, recently, Cloud Haskell[2].
However, if we look at this ecosystem from the Big Data perspective[1]
(i.e. distributed parallel computing), the following components are
missing:

* Integration with a distributed file system, such as HDFS (the Hadoop
Distributed File System[5]). That would allow us to perform distributed
computations on distributed data.

* A data aggregation framework on top of it (I would not call it a
MapReduce framework, simply because in Haskell we would expect a richer
set of primitives).

The closest examples are Hadoop[3] and DryadLINQ[4].

I was thinking about writing a Google Summer of Code project proposal; is
there anybody in this group potentially interested in mentoring?

Currently I see the project scope as follows:
1. Build Haskell APIs for HDFS. This project can be used for
inspiration: https://github.com/kim/hdfs-haskell (basically a binding to
the native libhdfs).
2. Use Cloud Haskell primitives to build an execution plan for
distributed data aggregation. This requires some research; for example,
DPH could be used to parallelize local computations on a single node.
3. Build a high-level API (such as map/reduce) that is automatically
load-balanced and distributed across the cluster (a rough sketch of the
intended shape follows below).
4. Performance benchmarks; comparison with Hadoop/Dryad.
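
To make point 3 a bit more concrete, here is a very rough sketch of the shape of API I have in mind. All the names (distributedMapReduce, Partition) are invented for illustration; nothing here is meant as a final design:

import Data.Binary (Binary)

-- Sketch only: a Partition is whatever slice of the distributed input a
-- single node holds locally (e.g. an HDFS block).
newtype Partition a = Partition [a]

-- The user supplies a mapper and a reducer; the framework would be
-- responsible for shipping them to the nodes that hold the data, balancing
-- the load and collecting the results. The Binary constraints are there so
-- that values can be serialised and sent between nodes.
distributedMapReduce
  :: (Binary a, Binary k, Binary v, Binary r, Ord k)
  => (a -> [(k, v)])   -- mapper, run near the data
  -> (k -> [v] -> r)   -- reducer, run after grouping by key
  -> [Partition a]
  -> IO [r]
distributedMapReduce = undefined  -- to be built on top of Cloud Haskell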

Thanks,
Andrei

[1] http://www.mckinsey.com/Insights/MGI/Research/Technology_and_Innovation/Big_data_The_next_frontier_for_innovation
[2] https://github.com/jepst/CloudHaskell
[3] http://hadoop.apache.org/
[4] http://research.microsoft.com/en-us/projects/dryadlinq/
[5] http://hadoop.apache.org/hdfs/

Jeff Epstein

Mar 16, 2012, 9:04:37 AM
to dotn...@gmail.com, parallel-haskell
Hi Andrei,

I think that building a big data framework for Haskell is a great idea.

Since you've expressed an interest in building on the Cloud Haskell
API, I'd also like to suggest taking a look at CH's Remote.Task
module. Although it's implemented on top of CH's messaging primitives,
and therefore not good at dealing with big data, I think the Task API
could be productively ported to another backend, such as HDFS or
what-have-you. This should be easier since the new CH implementation
supports pluggable backends.

Jeff

Andrei Varanovich

Mar 16, 2012, 11:39:34 AM
to Jeff Epstein, parallel-haskell
Hi Jeff,

Yes, I am currently thinking of using the Cloud Haskell API; great job has been done so far!

Today it is really difficult to compare CH even with Hadoop, simply because of the lack of distributed file system support (I think you acknowledge that in your paper as well).
Plus, I am strongly convinced that Haskell should provide a generic distributed data aggregation framework, not just map/reduce. This is where Haskell would definitely score: if you look closer at Hadoop MapReduce, building type-safe aggregation primitives is a research topic in itself, because "in Hadoop, the connection between the two phases of a MapReduce computation is unsafe" (we recently had a guest talk about this: http://softlang.wikidot.com/talkdoerre0911 ).
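
To illustrate what I mean by an unsafe connection between the phases, here is a tiny, purely local sketch (not a proposed API): the intermediate key/value types k and v appear in both phase types, so a mismatch between the mapper's output and the reducer's input is rejected by the compiler instead of surfacing at runtime.

import Data.Function (on)
import Data.List     (groupBy, sortBy)
import Data.Ord      (comparing)

-- The intermediate types k and v are shared by both signatures, so the
-- compiler checks the map -> reduce connection.
mapPhase :: (a -> [(k, v)]) -> [a] -> [(k, v)]
mapPhase f = concatMap f

reducePhase :: Ord k => (k -> [v] -> r) -> [(k, v)] -> [r]
reducePhase g = map (\grp -> g (fst (head grp)) (map snd grp))
              . groupBy ((==) `on` fst)
              . sortBy (comparing fst)

-- Local word count, for example:
-- reducePhase (\w ns -> (w, sum ns)) (mapPhase (\l -> [(w, 1 :: Int) | w <- words l]) ls)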

I will think about "pluggable backends", thanks for the pointer.

Thanks,
Andrei

Johan Tibell

Mar 16, 2012, 2:04:35 PM
to dotn...@gmail.com, parallel-haskell
Hi Andrei!

I think BigData is an area where Haskell could really shine and I'd
definitely like to see more work in that area.

It's important to scope summer of code projects well. The projects
that tend to do well are those which are well-understood, incremental
improvements on top of existing libraries and/or infrastructure.
Designing new APIs from scratch is much more difficult, requiring good
knowledge both of the application area (e.g. Hadoop) and of Haskell
API design*. It's by no means impossible, but if you want to design
something from scratch I would try to tackle a small piece of the
whole ecosystem and do it well**.

Wrapping Hadoop streaming might just be feasible in one summer. Note
that it's by no means trivial, though. Just coming up with an API (à la
Accelerate) for distributed computations is a difficult feat.

* Have a look at anything designed by Bryan O'Sullivan for inspiration.
** I just had a look at the HDFS API. It's quite large!
http://hadoop.apache.org/hdfs/docs/r0.22.0/api/index.html The S3 REST
API looks more reasonable to tackle.

Cheers,
Johan

Andrei Varanovich

Mar 19, 2012, 11:45:19 AM
to Johan Tibell, parallel-haskell
Hi Johan,

Thanks for your feedback! Actually, I was looking for a proper research challenge in this area as well (which should be useful for my PhD); but you're right, the project idea should be decomposed in such a way that very concrete deliverables can be expected from the GSoC part of it.

As to the HDFS API, you're right: it's a huge task to cover it completely. But I think only part of it is needed here: access to the files, and the ability to read them in a streaming fashion, like this: http://wiki.apache.org/hadoop/HadoopDfsReadWriteExample

There is a C wrapper (http://hadoop.apache.org/common/docs/current/libhdfs.html) for the HDFS APIs (which covers the essential part and does not aim to cover the administration API, for instance).
Here is the header file that declares the HDFS interfaces exposed via C: http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/native/hdfs.h?view=markup
For example, this is how the "open file" operation looks:

hdfsFile hdfsOpenFile(hdfsFS fs, const char* path, int flags, int bufferSize, short replication, tSize blockSize) {

/*
JAVA EQUIVALENT:
File f = new File(path);
FSData{Input|Output}Stream f{is|os} = fs.create(f);
return f{is|os};
*/
......

Using the FFI, this C library can be exposed to Haskell.
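
To give an idea of the binding side, here is a rough sketch for the function above. The representation of the opaque handles and of tSize/flags is my assumption and would need to be checked against hdfs.h:

{-# LANGUAGE ForeignFunctionInterface #-}
import Foreign.C.String (CString, withCString)
import Foreign.C.Types  (CInt(..), CShort(..))
import Foreign.Ptr      (Ptr)

-- Opaque handles corresponding to hdfsFS and hdfsFile in hdfs.h.
data HdfsFS_
data HdfsFile_
type HdfsFS   = Ptr HdfsFS_
type HdfsFile = Ptr HdfsFile_

-- Mirrors the C declaration quoted above; tSize is assumed to be a 32-bit int.
foreign import ccall unsafe "hdfs.h hdfsOpenFile"
  c_hdfsOpenFile :: HdfsFS -> CString -> CInt -> CInt -> CShort -> CInt
                 -> IO HdfsFile

-- Thin Haskell wrapper; error handling (checking for NULL) is omitted here.
hdfsOpenFile :: HdfsFS -> FilePath -> Int -> Int -> Int -> Int -> IO HdfsFile
hdfsOpenFile fs path flags bufSize repl blockSize =
  withCString path $ \cpath ->
    c_hdfsOpenFile fs cpath (fromIntegral flags) (fromIntegral bufSize)
                   (fromIntegral repl) (fromIntegral blockSize)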

Providing Haskell wrappers for Hadoop streaming (http://wiki.apache.org/hadoop/HadoopStreaming) might indeed be a good starting point for the project in general. Hadoop streaming is essentially about creating and distributing jobs (mapper/reducer/combiner) which can themselves be written in other languages. I have already had a look at the Remote.Task package from Cloud Haskell; it could be used as a Haskell-based backend to replace Hadoop jobs.
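
For what it's worth, a streaming mapper in Haskell is almost trivial, since Hadoop streaming only talks to the job over stdin/stdout with tab-separated key/value lines. A word-count mapper, for example, would be roughly:

-- Read input lines from stdin and emit one "word<TAB>1" line per word on
-- stdout; Hadoop streaming groups by key before the reducer sees the data.
main :: IO ()
main = interact $ \input ->
  unlines [ w ++ "\t1" | line <- lines input, w <- words line ]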

I agree that this might be a non-typical GSoC project, simply because there is a *very* big "design" part involved. For now I tend to think that it would be more reasonable to frame this effort as a research project with proper community involvement (since certain community members have already expressed an interest in joining this effort).

Thanks,
Andrei

Johan Tibell

Mar 19, 2012, 1:23:30 PM
to Andrei Varanovich, parallel-haskell
Hi Andrei,

On Mon, Mar 19, 2012 at 8:45 AM, Andrei Varanovich <dotn...@gmail.com> wrote:
> There is a C wrapper (http://hadoop.apache.org/common/docs/current/libhdfs.html) for the HDFS APIs (which covers the essential part and does not aim to cover the administration API, for instance).
> Here is the header file that declares the HDFS interfaces exposed via C: http://svn.apache.org/viewvc/hadoop/common/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/native/hdfs.h?view=markup

I took a look at how much work it would be to wrap this header file.
Should be feasible. I took a stab at the API design:

https://gist.github.com/2119955

I would support a GSoC project that implemented this, on the condition
that the student was willing to do the polish needed to make this
*the* hdfs API in Haskell. There are too many so-so quality packages
on Hackage. We don't need another.

Here's what makes a library *the* library in some category:

* Good API design.
* Awesome documentation.
* Tests!
* Benchmarks.
* Blog posts, mailing list posts, StackOverflow posts, etc. that
spread awareness of the library.

-- Johan

Kim Altintop

Mar 19, 2012, 2:12:16 PM
to johan....@gmail.com, Andrei Varanovich, parallel-haskell
On Mon, Mar 19, 2012 at 6:23 PM, Johan Tibell <johan....@gmail.com> wrote:
> I took a look at how much work it would be to wrap this header file.
> Should be feasible. I took a stab at the API design:

As my own humble attempt (which I eventually decided not to release on
hackage) has been mentioned on this thread, I'd like to chime in here:
are you guys aware that libhdfs is not really "native", but a JNI
wrapper which runs the actual networking code inside a JVM? I'm not
/so/ sure how useful a binding would be in practice, or what the
advantages would be over just an API wrapper around WebHDFS[0].

Also, there's the more fundamental problem that you don't get data
locality over either interface, which may or may not pose a problem
depending on the size of your dataset. Admittedly, that'd be a much
harder problem to solve, and could only be tackled from the Java side
of things. Sadly.

-- kim

[0] http://hadoop.apache.org/common/docs/r1.0.0/webhdfs.html

Andrei Varanovich

Mar 19, 2012, 3:28:40 PM
to Kim Altintop, johan....@gmail.com, parallel-haskell
Hi Kim,

On Mar 19, 2012, at 7:12 PM, Kim Altintop <kim.al...@gmail.com> wrote:

> On Mon, Mar 19, 2012 at 6:23 PM, Johan Tibell <johan....@gmail.com> wrote:
>> I took a look at how much work it would be to wrap this header file.
>> Should be feasible. I took a stab at the API design:
>
> As my own humble attempt (which I eventually decided not to release on
> hackage) has been mentioned on this thread, I'd like to chime in here:
> are you guys aware that libhdfs is not really "native", but a JNI
> wrapper which runs the actual networking code inside a JVM? I'm not
> /so/ sure how useful a binding would be in practice, or what the
> advantages would be over just an API wrapper around WebHDFS[0].

There is no big difference, except that I think using the FFI would just be simpler.
And perhaps a bit faster, since REST works over HTTP, which is a text-based protocol.

> Also, there's the more fundamental problem that you don't get data
> locality over either interface, which may or may not pose a problem
> depending on the size of your dataset. Admittedly, that'd be a much
> harder problem to solve, and could only be tackled from the Java side
> of things. Sadly.

I think this is mostly because a huge number of factors affect how the data is partitioned.
E.g. you can use the balancer (the org.apache.hadoop.hdfs.server.balancer package in the Hadoop source code) [1].
There is also some work on data redistribution based on the computing capacity of the nodes [2].

What would be a use case where we clearly benefit from knowledge of the data block location? (That's what you mean by getting data locality from the API, right?)

Kim Altintop

Mar 19, 2012, 5:09:29 PM
to Andrei Varanovich, johan....@gmail.com, parallel-haskell
Andrei,

On Mon, Mar 19, 2012 at 8:28 PM, Andrei Varanovich <dotn...@gmail.com> wrote:
> There is no big difference, except that I think using the FFI would just be
> simpler.

Simpler in what sense? As a library user there wouldn't be much of a
difference, but deployment would certainly be harder, since you need
to make sure your computing nodes have a JVM installed. In addition to
any Haskell packages your program might depend on. It might also be
harder to debug if parts of your program run on a different runtime
platform altogether.

> And perhaps a bit faster, since REST works over HTTP, which is a text-based
> protocol.

That might or might not be true; I haven't seen any benchmarks yet.
Consider that we're talking about transferring relatively large chunks
of data here, so the parsing overhead is most likely negligible.


> What would be a use case where we clearly benefit from knowledge of
> the data block location? (That's what you mean by getting data locality
> from the API, right?)

Well, the basic idea of MapReduce is to "run the code near the data".
Accordingly, Hadoop ships the mappers to the data nodes and feeds them
only the portions of the input local to their respective node (or at
least prefers to do so). The reduce phase is a little more
complicated, but this simplification already illustrates how data
processing jobs can be run over huge amounts of data without becoming
IO-bound by the need to transfer that data over the network.

That being said, and depending on what you're trying to do, you might
still get away with doing just that (transferring the whole input over
the network), or you could parallelize by having each computing node
pull a distinct byte-range of the input. But such an approximation will
likely not scale to really large datasets (and clusters, where
saturating your network becomes something you'll really want to avoid
:)).
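
Just to make the byte-range idea concrete (purely illustrative, and ignoring HDFS block boundaries entirely): given the total size of the input and the number of computing nodes, each node can be handed an (offset, length) pair and fetch only that slice:

-- Split a file of 'total' bytes into 'n' contiguous (offset, length) ranges,
-- one per computing node; the last node picks up the remainder.
byteRanges :: Integer -> Int -> [(Integer, Integer)]
byteRanges total n =
  [ (i * chunk, if i == lastIdx then total - i * chunk else chunk)
  | i <- [0 .. lastIdx] ]
  where
    chunk   = total `div` fromIntegral n
    lastIdx = fromIntegral n - 1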

-- kim

Andrei Varanovich

Mar 21, 2012, 2:48:19 PM
to Kim Altintop, johan....@gmail.com, parallel-haskell
Minor remarks below:

On Mar 19, 2012, at 10:09 PM, Kim Altintop <kim.al...@gmail.com> wrote:

> Andrei,
>
> On Mon, Mar 19, 2012 at 8:28 PM, Andrei Varanovich <dotn...@gmail.com> wrote:
>> There is no big difference, except that I think using the FFI would just be
>> simpler.
>
> Simpler in what sense? As a library user there wouldn't be much of a
> difference, but deployment would certainly be harder, since you need
> to make sure your computing nodes have a JVM installed. In addition to
> any Haskell packages your program might depend on. It might also be
> harder to debug if parts of your program run on a different runtime
> platform altogether.

Do you normally separate computing and data nodes?
If HDFS is up and running, that means a JVM is installed as well.
Not sure I follow the reasoning about potential issues with the JVM.

>
>> What would be a use case where we clearly benefit from knowledge of
>> the data block location? (That's what you mean by getting data locality
>> from the API, right?)
>
> Well, the basic idea of MapReduce is to "run the code near the data".
> Accordingly, Hadoop ships the mappers to the data nodes and feeds them
> only the portions of the input local to their respective node (or at
> least prefers to do so). The reduce phase is a little more
> complicated, but this simplification already illustrates how data
> processing jobs can be run over huge amounts of data without becoming
> IO-bound by the need to transfer that data over the network.
>
> That being said, and depending on what you're trying to do, you might
> still get away with doing just that (transferring the whole input over
> the network), or you could parallelize by having each computing node
> pull a distinct byte-range of the input. But such an approximation will
> likely not scale to really large datasets (and clusters, where
> saturating your network becomes something you'll really want to avoid
> :)).
>

As we discussed in the private thread, hdfsGetHosts provides some information about data block locations, so this will be useful for a future implementation of the data aggregation framework in Haskell. Such a framework will (hopefully) benefit from this information. Unfortunately, there is no way to attach metadata to the files, which would seem a more generic approach (say, attaching computation units that re-balance the data blocks based on this metadata).
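
For reference, here is a rough sketch of how hdfsGetHosts could be exposed on the Haskell side. The declarations below reflect my reading of hdfs.h and should be double-checked (tOffset is represented as Int64, and the char*** result is marshalled into one list of hosts per block, then released with hdfsFreeHosts):

{-# LANGUAGE ForeignFunctionInterface #-}
import Data.Int              (Int64)
import Foreign.C.String      (CString, peekCString, withCString)
import Foreign.Marshal.Array (peekArray0)
import Foreign.Ptr           (Ptr, nullPtr)

type HdfsFS = Ptr ()  -- opaque filesystem handle

foreign import ccall unsafe "hdfs.h hdfsGetHosts"
  c_hdfsGetHosts :: HdfsFS -> CString -> Int64 -> Int64 -> IO (Ptr (Ptr CString))
foreign import ccall unsafe "hdfs.h hdfsFreeHosts"
  c_hdfsFreeHosts :: Ptr (Ptr CString) -> IO ()

-- For each block in the given byte range of a file, the hosts that store a
-- replica; a locality-aware scheduler could then prefer to start the task
-- for a block on one of those hosts.
blockHosts :: HdfsFS -> FilePath -> Int64 -> Int64 -> IO [[String]]
blockHosts fs path start len =
  withCString path $ \cpath -> do
    arr <- c_hdfsGetHosts fs cpath start len
    if arr == nullPtr
      then return []
      else do
        perBlock <- peekArray0 nullPtr arr   -- NULL-terminated, one entry per block
        hosts    <- mapM (\b -> peekArray0 nullPtr b >>= mapM peekCString) perBlock
        c_hdfsFreeHosts arr
        return hosts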

> -- kim
