Clojure for big data

Ray Miller

Oct 18, 2017, 7:03:11 AM

to clo...@googlegroups.com

Hi,

Here at Metail we have been using Clojure for medium-sized data processing on AWS EMR. We started out with Cascalog about 5 years ago, switched to Parkour 2 years ago, and are now considering a move to Spark.

My question is: is Clojure still a good choice for medium/large data processing on EMR?

This is partly prompted by the lack of activity on the GitHub repos. Are the Parkour, Flambo and Sparkling libraries rock solid, or simply not getting enough use to trigger bugs and feature requests?

The #bigdata channel over on Clojurians slack is also suspiciously quiet, as are many of the Google groups.

Ray.

Christopher Small

Oct 19, 2017, 1:53:44 AM

to Clojure

Hi Ray

> This is partly prompted by the lack of activity on the GitHub repos.

Maybe you have higher standards here than I do... last commits on Flambo and Sparkling were 3 and 2 months ago, respectively. That doesn't raise any alarm bells for me personally. Moreover, looking at the contributor graphs, I don't particularly get the impression that the projects have ground to a halt:

https://github.com/gorillalabs/sparkling/graphs/contributors

https://github.com/yieldbot/flambo/graphs/contributors

I haven't used either, but I've heard good things from folks who've used Flambo. Last I heard, Flambo was a pretty key component of Yieldbot's infrastructure, and they seem to be doing well, so I wouldn't expect the project to go away any time soon. I don't know as much about Sparkling, but it seems to have actually started as a fork of Flambo, so I'd imagine the APIs are at least somewhat similar, and if one went defunct, you'd probably have a migration path towards the other.

You may also want to take a look at Onyx: https://github.com/onyx-platform/onyx. It's written from the ground up in Clojure, and is really wonderfully designed, with a very data-centric (Clojuric) API. They have a very active Gitter chat room (https://gitter.im/onyx-platform/onyx), and the developers are very friendly and helpful folks. You should know ahead of time that in contrast with Spark and MR, which are "batch centric" technologies, Onyx is foundationally built on a streaming model, with support for typical batch processes built on top of this streaming base. IIRC, this is modeled after some of the Dataflow work Google has been doing, and due to the shifting economics around the cost of data transmission, this approach ends up being pretty competitive for batch workflows, while also offering a path towards more seamless streaming workflows should such a setup benefit you.

I haven't spent a ton of time on the Clojurians slack channel, or any big data Google groups, but there is a Clojure Datascience site/chat room that I host which has at least some activity. Most of the chatter there has been more on the side of statistics, machine learning, data viz and such, and less specifically on big data per se, but we'd welcome you to join and broaden the discussion: https://gitter.im/metasoarous/clojure-datascience. There's actually been an uptick in activity there since the Conj, and I'd love to see that momentum continue.

Good luck

Chris

William Parker

Oct 19, 2017, 10:20:03 AM

to Clojure

Perhaps it is an obvious point, but as with other Java libraries, it is possible to use libraries from the Java big data ecosystem (e.g. Spark) directly from Clojure using interop, or to consume Clojure code as part of processing infrastructure written in other JVM languages. We've had considerable success using both approaches. The experience with interop has been pretty smooth. The only significant hassles I recall were having to write serialization/deserialization logic that we would have gotten out of the box with Java classes, and AOT-compiled code interacting in strange ways with Hadoop's classloading (we never satisfactorily diagnosed this and eventually just stopped using AOT).

Erik Assum

Oct 19, 2017, 10:27:00 AM

to clo...@googlegroups.com

FWIW, I filed an issue and submitted a PR against flambo this summer.

It was merged on the same day and released two weeks later, so flambo seems active when needed.

Erik.

Christopher Penrose

Oct 19, 2017, 4:12:24 PM

to Clojure

> The #bigdata channel over on Clojurians slack is also suspiciously quiet, as are many of the Google groups.
>
> Ray.

I worked with Sparkling and Flambo about a year ago. While Mr. Macbeth is a fellow Portlander and has a solid API, I found Sparkling to be somewhat more direct and compact. For ETL via Hadoop I wouldn't hesitate to try either of these libraries. I found them to be stable and preferable to using Spark in Scala. However, I used Powderkeg (https://github.com/HCADatalab/powderkeg) a bit and found it the most intriguing. Christophe Grand last updated Powderkeg three hours ago (as of the time of my posting, obviously). Powderkeg relies heavily on Clojure transducers and is the only Clojure Spark library I am aware of which doesn't require AOT compilation: you can use a Clojure REPL to directly spawn jobs on a Spark cluster. If you are interested in Clojure interoperability with Spark, I would look at Powderkeg first.

If you require Spark Streaming, you might be better off writing Scala, or considering another streaming solution such as Storm. The closest I have come to getting Spark Streaming to work in Clojure was with Powderkeg. It might be worth seeing if Powderkeg has made progress in this area.
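For anyone who hasn't met the transducers mentioned above, here is a minimal sketch in plain Clojure (core functions only). It is not Powderkeg's API; the sample log lines and the names `lines` and `ok-paths` are made up for illustration. The point is that the pipeline is defined once, independently of where the data comes from, which is presumably what makes transducers attractive as a way to describe cluster jobs.

```clojure
(ns example.transducers
  (:require [clojure.string :as str]))

;; Hypothetical records; in a real job these would come from the cluster.
(def lines ["GET /a 200" "GET /b 404" "POST /a 200" "GET /a 500"])

;; A transducer: a composed pipeline that knows nothing about its source
;; or its destination. Steps run left to right: split, filter, extract.
(def ok-paths
  (comp (map #(str/split % #" "))
        (filter (fn [[_ _ status]] (= status "200")))
        (map second)))

;; Apply the pipeline while pouring into a vector.
(into [] ok-paths lines)
;; => ["/a" "/a"]

;; Reuse the same pipeline, this time reducing to a count per path.
(transduce ok-paths
           (completing (fn [acc path] (update acc path (fnil inc 0))))
           {}
           lines)
;; => {"/a" 2}
```

The same `ok-paths` value is used unchanged in both calls; only the reducing step differs.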

Christopher Small

Oct 19, 2017, 5:01:52 PM

to clo...@googlegroups.com

Thanks for the helpful information Christopher. I'll have to look at Powderkeg.

The AOT issue is a big one. Being able to launch things from the REPL is huge. That's actually one of the many advantages of Onyx over Storm (if you're looking at the streaming side of things). Towards the end of my time using Storm I became increasingly frustrated with the project. At the time, it was an Apache Incubator project, and development had slowed to a crawl. The Clojure API became woefully incompatible with more recent Clojures, preventing us from upgrading for some time. They also began shifting focus away from the Clojure API, and in turn the documentation became badly out of date. I've heard that some of these issues got a bit better as the project came out of Incubator status, but others remain. In contrast, Onyx has been very well maintained, has excellent documentation, and doesn't suffer any of the AOT issues.

Chris

Alex Miller

Oct 19, 2017, 5:56:13 PM

to Clojure

Check out https://www.youtube.com/watch?v=OxUHgP4Ox5Q for his talk about it.

Christopher Penrose

Oct 19, 2017, 7:56:55 PM

to Clojure

And I will have to look at Onyx much more closely :)
