Clojure interop with Spark

Tim Clemons

Jul 9, 2020, 5:36:41 PM
to Clojure
I'm putting together a big data system centered around using Spark Streaming for data ingest and Spark SQL for querying the stored data.  I've been investigating what options there are for implementing Spark applications using Clojure.  It's been close to a decade since either sparkling or flambo received any updates, and it doesn't look like either will accommodate recent distributions of Spark.  I've found powderkeg an interesting option, and I like how it supports remote REPLs and the use of transducers rather than wrapped Scala fns.  However, it looks like it has also gone a few years without commits, and I've heard loose talk that the developers have moved on to other pursuits.

Part of the problem seems to be Spark.  The project seems unapologetic about breaking interfaces and willing to sacrifice third-party code that tries to track Spark's development.

So my options seem to be the following:

1. Deploy an older version of Spark that's compatible with one of the above-mentioned libraries.  While we don't need to be bleeding edge, deploying a three-year-old version just to accommodate my preferred language is hard to justify.

2. Create a merge to update one of those libraries to more recent versions of Spark and be prepared to maintain it internally for the lifespan of this project.  This may be vastly overestimating my personal heroics.

3. Code my own solution from scratch using Java/Scala interop, sketching out just enough of a Clojure wrapper to suit my ends.

4. Learn Scala.

I realize that Spark isn't the only game in town (Onyx, for example).  However, I'm working with a team of developers who are not familiar with Clojure (though I'm working to be an advocate).  I chose Spark as an established solution that supports multiple languages and handles both streaming and batch processing.

Any insights?  Any solutions I'm overlooking?



Jeff Stokes

Jul 9, 2020, 5:52:48 PM
to Clojure
Hey Tim,

We at Amperity have used Sparkling for our Clojure Spark interop in the past. After a few years of fighting, we eventually ended up with sparkplug (https://github.com/amperity/sparkplug), which we now use to run all of our production Spark jobs. There is built-in support for proper function serialization, including wrappers around the Java RDD APIs. We also have some basic support for REPL interaction, but this is fairly limited. We also run on a newer version of Spark (2.4.4), and haven't had issues with the library when upgrading or changing Spark versions.

Let me know if I can help if you're interested!

-Jeff

Alex Ott

Jul 10, 2020, 2:22:30 AM
to clo...@googlegroups.com
From a Spark perspective, I would strongly advise using the DataFrame API as much as possible, including Spark Structured Streaming instead of Spark Streaming. The main reason is more optimized execution of the code, thanks to all the optimizations that Catalyst is able to make. However, I don't really see any libraries that wrap the DataFrame API.


--
With best wishes,                    Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)

Dominic Parry

Jul 10, 2020, 2:28:34 AM
to clo...@googlegroups.com
Another option is Apache Beam. We use it quite extensively. There are a few options for Clojure wrappers (we use datasplash), and Beam has libraries for a number of popular languages.

 
Kind Regards,
Dom Parry