[ANN] Onyx: Distributed data processing in Clojure


Michael Drogalis

Sep 19, 2014, 4:24:12 PM
to clo...@googlegroups.com
I'm happy to open source Onyx, a new kind of distributed data processing framework for Clojure and the JVM.


GitHub: https://github.com/MichaelDrogalis/onyx

Thanks!
-- @MichaelDrogalis

Rangel Spasov

Sep 19, 2014, 11:26:53 PM
to clo...@googlegroups.com
Looks interesting! Curious about differences, advantages/disadvantages of HornetQ vs ZeroMQ?

Mike Drogalis

Sep 20, 2014, 9:39:27 PM
to clo...@googlegroups.com
Daniel: Haha, yes! Shame that I tried to be smooth in open sourcing it, and managed to botch it in the worst possible manner.

Rangel: I can't speak to ZeroMQ, but I chose HornetQ because of its performance, support for transactions, and support for clustering. That being said, everything that touches HornetQ does so through an interface, and I'd be interested in making that part pluggable, too.


Christopher Small

Sep 21, 2014, 2:21:45 AM
to clo...@googlegroups.com
Hi Daniel

This looks like a great project!

Is it possible with Onyx to define a workflow with transformers that take values from multiple sources (inputs or other transformers)?

Chris

Henrik Eneroth

Sep 21, 2014, 4:59:28 AM
to clo...@googlegroups.com
Cool beans! 

Mike Drogalis

Sep 21, 2014, 12:36:32 PM
to clo...@googlegroups.com
Thanks Christopher!

At the moment, elements of a workflow must strictly be keywords. I'm planning to allow sets of keywords in the roots of the tree to enable that kind of expression. For now, you can do the following:

If you have inputs A, B, and C that all need to pipe into transformation D and then to output E, you could say...

{:A {:D :E}
 :B {:D :E}
 :C {:D :E}}

But also, since workflows are *just* data, you can write a "microcompiler":
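
A minimal sketch of such a microcompiler, assuming a hypothetical `expand-roots` helper (not part of Onyx) and the set-of-keywords root notation mentioned above, which Onyx itself does not accept yet. It rewrites a workflow whose root keys may be sets into the plain keyword-only map shown earlier:

```clojure
;; Hypothetical helper, not an Onyx API: expands set-valued root keys
;; into one keyword root per element, each sharing the same subtree.
(defn expand-roots
  [workflow]
  (reduce-kv
   (fn [acc root subtree]
     (if (set? root)
       ;; #{:A :B} -> one entry per input, all pointing at the same subtree
       (into acc (map (fn [input] [input subtree]) root))
       ;; plain keyword roots pass through untouched
       (assoc acc root subtree)))
   {}
   workflow))

(def compiled
  (expand-roots {#{:A :B :C} {:D :E}}))
;; compiled is now a plain map with :A, :B, and :C as roots,
;; each mapped to {:D :E}
```

Because the input and the output are both ordinary maps, the result can be handed to anything that already accepts the keyword-only form.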


I plan on using the same compilation approach in the future to support datalog a la Cascalog. Using data gives us really good reach.


Christopher Small

Sep 21, 2014, 1:33:14 PM
to clo...@googlegroups.com
Beautiful :-) Thanks.

Chris

Michael Drogalis

Sep 22, 2014, 9:29:37 AM
to clo...@googlegroups.com
And finally, the StrangeLoop talk: https://www.youtube.com/watch?v=vG47Gui3hYE&feature=youtu.be&a

If you have an example you'd like me to add, just make an issue on this repository:


Thanks everyone!



Huahai Yang

Sep 30, 2014, 12:55:01 AM
to clo...@googlegroups.com
Hi Michael,

Great work. I enjoyed your Strange Loop talk.

In the README.md, you stated that Onyx "competes against Storm, Cascading, Map/Reduce, Dryad, Apache Sqoop, Twitter Crane". Could you please shed some light on a comparison with Spark? Apparently Spark is on the road to becoming a favorite among some data scientists.

In a broader sense, I am wondering what Clojure's answer to Spark would be, having seen the huge boost Spark's popularity has given Scala. It would be great to hear some opinions here. Thanks.

-huahai



Christopher Small

Sep 30, 2014, 1:05:34 AM
to clo...@googlegroups.com
Regarding the broader sense, I've heard good things about flambo (https://github.com/yieldbot/flambo), but haven't tried it. Of course, it's always nice to have something that's written in the language you're working with; as nice as JVM interop is, it can have its warts. So, the question stands.

Another thing to check out is the Parallel Universe offerings. Their Galaxy cluster's distributed data capabilities bear similarity to some (not all) of the offerings of Spark. And while that particular project doesn't seem to be Clojure ready (so far as I know), they have some other work that is; in particular Pulsar, which is an actor system with lightweight threads closely modeled after Erlang. Perhaps with their clear interest in Clojure, they could be coaxed into creating Clojure bindings :-)

Chris

Marshall Bockrath-Vandegrift

Sep 30, 2014, 9:26:12 AM
to clo...@googlegroups.com
I've been reading the docs and am having some trouble understanding exactly how data moves around within a pipeline, specifically regarding data locality and serialization.  It reads like every data segment is de/serialized (as EDN?) by Onyx itself and bounced through HornetQ between steps, or am I missing something?

Mike Drogalis

Sep 30, 2014, 4:46:19 PM
to clo...@googlegroups.com
Hey Marshall,

No problem, there aren't docs on this yet - developer blindness. I'll write them shortly. I just wrote a bit about how this works on the Onyx mailing list: https://groups.google.com/d/msg/onyx-user/xniQcgCPEn8/oCOZX77vZH4J

What you described is mostly accurate, though. Data is serialized with Fressian by default. Between tasks, data goes on the wire to HornetQ. I highly recommend using a 10-gigabit switch to connect nodes in/across racks in your data center. As I said on the mailing list, at the moment the data hits the disk on each task. I'm going to change this in the future, but it's the best I could do for now.


Mike Drogalis

Sep 30, 2014, 5:04:55 PM
to clo...@googlegroups.com
Hello Huahai!

I appreciate the kind words. :) I'm not a Spark user, but my understanding is that it's extremely fast, offers batch and streaming via mirrored APIs, and presents a functional interface to express computation. Onyx diverges in its aggressive use of data structures to express computations as maps and vectors, rather than expression as functions over collections. It also builds batching operations on top of streaming operations. I believe Spark does it the other way around. Also, Spark is much faster than Onyx. That's a result of their team being more talented than me. :P
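
To make the "computation as data" contrast concrete, here is a rough sketch. It is not Onyx's API: `task-fns`, `run-pipeline`, and the `[:inc :double]` pipeline are made-up names for illustration. The pipeline is plain EDN that could arrive as text from anywhere, and a small interpreter turns it into function calls:

```clojure
(require '[clojure.edn :as edn])

;; Hypothetical registry mapping task names (data) to implementations (code).
(def task-fns
  {:inc    inc
   :double #(* 2 %)})

;; The pipeline is a plain vector of keywords, here parsed from a string
;; to stress that it is data, not a composition of functions in source code.
(def pipeline
  (edn/read-string "[:inc :double]"))

(defn run-pipeline
  "Applies each named task, in order, to every segment."
  [fns pipeline segments]
  ;; comp applies right-to-left, so reverse to run tasks in listed order
  (let [step (apply comp (reverse (map fns pipeline)))]
    (map step segments)))

(run-pipeline task-fns pipeline [1 2 3])
;; => (4 6 8)
```

The function-over-collections style would instead write `(map (comp #(* 2 %) inc) [1 2 3])` directly; the data-driven version keeps the shape of the computation separate from the code that runs it.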

I'm trying to accomplish two things with Onyx. Primarily, I'm trying to rip apart all of the different things that contemporary distributed computation frameworks do, and put them back together in a composable manner. That's really what the heart of my talk was about, and I'm going to be blogging about this a lot in the next few weeks. The critical thing to take apart is the structure of the computation - as a simple data structure! I don't know of a lot of frameworks that do this.

The second thing that I'm trying to do is get reach to the browser. When you describe your computation as data, you can create it in JavaScript - something a lot of customer solutions need at the moment. Couple that with using regular functions to describe your computations, and you get the magical ability to use something like Cljx to cross-compile and do computational sampling in the browser. I think Clojure has the raw power to be a serious competitor in large scale data processing, given these goals. If not Onyx, something else for sure.

So that's what I'm trying to do. Time will tell if any of this is a good idea. :P
