How does Druid compare to Apache Spark?

Prashant Deva

Aug 13, 2014, 10:09:00 AM

to druid-de...@googlegroups.com

It seems Spark is designed for fast analytics too.

While the comparison docs have comparisons against a bunch of different systems, there isn't one for Spark.

I would love to get some idea of how Druid and Spark compare.

Eric Tschetter

Aug 13, 2014, 11:51:07 AM

to druid-de...@googlegroups.com

Prashant,

Spark is a back-office analytics platform. It started as
infrastructure to support faster iterative algorithms, like machine
learning type stuff where you keep applying an operation over and over
again until you converge. It is fantastic at providing analysts with
the ability to run queries and analyze large amounts of data with a
wide array of different algorithms.

We've heard feedback from others who have tried Spark and chosen Druid
that Spark is not currently the greatest choice for powering
applications. That is, if you have a web UI that is powered entirely
by Spark queries, you are going to be reluctant to expose that to
users outside of your analytics team. Instead, you would use Spark to
generate some other data set, load that into a key-value store or
database or something like Druid and serve your application from that.

Druid is infrastructure built to power an application. It does not
have the query flexibility that Spark provides and it doesn't
maintain/update state, so it's not currently the greatest choice for
high scale iterative machine learning algorithms that need to update
large amounts of state between iterations.

For the queries that it does provide, however, it was designed to be
able to provide answers to those queries quickly and in a highly
concurrent environment. More specifically, it is designed to allow
query latency and concurrency requirements to be dictated by how much
$$ you want to throw at the problem. If you want to hit a specific
query latency or level of concurrent queries, you can do that by
increasing the amount of hardware available.
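
As an illustration of the kind of query Druid is built to answer quickly, here is a minimal sketch of a native timeseries query assembled as JSON. The datasource name `page_views`, the metric column `count`, and the output names are hypothetical; in a real deployment a query like this is typically sent as an HTTP POST body to a Druid broker.

```python
import json


def build_timeseries_query(datasource, interval, granularity="hour"):
    """Build a Druid native timeseries query as a plain dict."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,      # hypothetical datasource name
        "granularity": granularity,
        "intervals": [interval],       # ISO-8601 interval, start/end
        "aggregations": [
            # Count the number of rows Druid scanned in each time bucket.
            {"type": "count", "name": "rows"},
            # Sum a hypothetical ingested metric column called "count".
            {"type": "longSum", "name": "events", "fieldName": "count"},
        ],
    }


query = build_timeseries_query("page_views", "2014-08-01/2014-08-13")
payload = json.dumps(query, indent=2)
print(payload)
```

The query shape is fixed and simple (time buckets plus aggregations), which is the trade Eric describes: less flexibility than Spark in exchange for predictable latency under concurrency.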

Does that make sense?

--Eric

Kiran Patchigolla

Aug 13, 2014, 12:15:44 PM

to druid-de...@googlegroups.com

Though I would love to see a Spark SQL-Druid integration (similar to Spark SQL-Parquet: http://databricks.com/blog/2014/03/26/spark-sql-manipulating-structured-data-using-spark-2.html).

Prashant,

We use Druid to power our query-based dashboards and analysis, and we use Spark Streaming to power our alerting.

Paul Otto

Nov 11, 2014, 12:01:14 PM

to druid-de...@googlegroups.com

Eric,

I do DevOps consulting with a big data / data science team at TWC where we were just having this very discussion. I'd love to get your thoughts on using Tachyon with Spark and Druid.

FWIW, it would be really cool to work with you since you'd clearly fit right in. ;)

Regards,

Paul Otto

Otto Ops

Fangjin Yang

Nov 11, 2014, 12:39:11 PM

to druid-de...@googlegroups.com

Hi Paul, we've discussed using Tachyon as a potential deep storage before. HY (creator of Tachyon) is a good friend of mine and tells me all the time I should integrate Druid and Tachyon :P. I need to find some time to do this integration and hopefully get it done before the end of the year.

One key thing to understand about the architecture of Druid is that it bakes computation together with storage, unlike many other solutions out there (Impala, Presto, Spark SQL, etc.) where the data is stored on a distributed file system (HDFS) and then loaded into the query layer before computation is done. The tradeoff is that Druid queries are faster (by rudimentary benchmarks, orders of magnitude faster), at the cost of needing to build the immutable segments that Druid scans for computations.

Paul Otto

Nov 11, 2014, 12:53:13 PM

to druid-de...@googlegroups.com

Hi Fangjin,

Thanks for your response and the information on this! I'm pretty intimately familiar with the architecture of Druid - one of the things I do in my spare time is read your many commits to keep up with what the latest advancements in Druid will be. ;) Our team has been running it with production-level loads since the first quarter of 2014. We're using MapR-FS + NFS to provide access to deep storage on the Historicals and Indexers. Spent many months optimizing our architecture thanks to your many hints on here, and have built a Puppet module to handle all the configurations throughout our Druid clusters.

I'd love to help as you get moving on the Tachyon integration in the coming weeks.

Regards,

Paul

Fangjin Yang

Nov 11, 2014, 12:55:28 PM

to druid-de...@googlegroups.com

Hi Paul, that's awesome to hear about your deployment! Hopefully I can find a lull of time between Thanksgiving and Christmas to create a Tachyon module.

Ethan Wolf

Nov 11, 2014, 1:04:53 PM

to druid-de...@googlegroups.com

Hi Fangjin,

The ingest we run uses Spark Streaming to create JSON that is written to Kafka for ingest by Druid realtime indexers. (We unfortunately haven't gotten time to play around with Tranquility.)
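
For readers unfamiliar with this kind of pipeline, the serialization step might look roughly like the sketch below. It is only an illustration under stated assumptions: the field names (`timestamp`, `page`, `country`, `count`) are hypothetical, the Kafka producer call is omitted entirely, and the timestamp column name and format would need to match whatever the Druid ingestion spec expects.

```python
import json
from datetime import datetime, timezone


def to_druid_event(epoch_seconds, dimensions, metrics):
    """Flatten one record into a JSON string with an ISO-8601 UTC timestamp."""
    ts = datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)
    event = {"timestamp": ts.strftime("%Y-%m-%dT%H:%M:%SZ")}
    event.update(dimensions)  # e.g. string columns to filter or group by
    event.update(metrics)     # e.g. numeric columns to aggregate
    # sort_keys keeps the output stable, which makes messages easy to diff.
    return json.dumps(event, sort_keys=True)


# Hypothetical record; in a streaming job this would run once per output row,
# and the resulting string would be handed to a Kafka producer.
message = to_druid_event(1415664000, {"page": "home", "country": "US"}, {"count": 1})
print(message)
```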

One thing Paul and I have briefly discussed is using Tachyon to pass the data between the Spark process and the Druid realtime indexers.

Do you think that idea has any wheels, or is it just a bad idea? We haven't had time to really consider details since it is all in a future dream world where we have more time :)

Another thing we've discussed is integrating the Tranquility Beams into our Spark Streaming process the same way you guys have it in your Storm processes.

But it is hard to find time when the current solution is working beautifully. Do you know if anyone else has worked on Spark Streaming/Tranquility integration?

Ethan

Fangjin Yang

Nov 11, 2014, 1:42:38 PM

to druid-de...@googlegroups.com

Hi Ethan, see inline.

On Tuesday, November 11, 2014 10:04:53 AM UTC-8, Ethan Wolf wrote:

> Hi Fangjin,
> The ingest we run uses Spark Streaming to create JSON that is written to Kafka for ingest by Druid realtime indexers. (We unfortunately haven't gotten time to play around with Tranquility.)

Cool! You guys should blog about it :)

> One thing Paul and I have briefly discussed is using Tachyon to pass the data between the Spark process and the Druid realtime indexers.

Do you mean having data in Spark Streaming feed into Druid? If yes, then this is very similar to the setup we have in house. If no, I'd have to understand what you are thinking a bit more.

> Do you think that idea has any wheels, or is it just a bad idea? We haven't had time to really consider details since it is all in a future dream world where we have more time :)
>
> Another thing we've discussed is integrating the Tranquility Beams into our Spark Streaming process the same way you guys have it in your Storm processes.

Tranquility is a library that can be embedded into different setups, and the main value add is that it automatically manages working with the indexing service (more thoughts here: https://groups.google.com/forum/#!searchin/druid-development/fangjin$20yang$20%22thoughts%22/druid-development/aRMmNHQGdhI/muBGl0Xi_wgJ). We use Tranquility with Storm in production but we are evaluating Samza.

> But it is hard to find time when the current solution is working beautifully. Do you know if anyone else has worked on Spark Streaming/Tranquility integration?

I'm not sure about this. We evaluated Spark Streaming in the early days and chose Storm because of its maturity at that time.

Ethan Wolf

Nov 11, 2014, 5:36:02 PM

to druid-de...@googlegroups.com

Thanks Fangjin. Also inlined:

On Tuesday, November 11, 2014 11:42:38 AM UTC-7, Fangjin Yang wrote:

> Hi Ethan, see inline.
>
> On Tuesday, November 11, 2014 10:04:53 AM UTC-8, Ethan Wolf wrote:
>> Hi Fangjin,
>> The ingest we run uses Spark Streaming to create JSON that is written to Kafka for ingest by Druid realtime indexers. (We unfortunately haven't gotten time to play around with Tranquility.)
>
> Cool! You guys should blog about it :)

I'm extraordinarily shy and withdrawn, so I leave all the blogging to others. (I actually have to adhere to company policies and prefer novel writing to blogging!)

But yes, at some point before I die I'll get approval to talk about things.

Druid has been key in our architecture. If I ever actually find time to go to Strata, it would be cool to fill you in.

>> One thing Paul and I have briefly discussed is using Tachyon to pass the data between the Spark process and the Druid realtime indexers.
>
> Do you mean having data in Spark Streaming feed into Druid? If yes, then this is very similar to the setup we have in house. If no, I'd have to understand what you are thinking a bit more.

Yes. Spark Streaming is currently serializing JSON to Kafka, and then the Druid indexers read it.

Long term, using Tachyon seemed like it might be more efficient and remove a small bit of hassle, since it is all running on the same cluster and Kafka offsets can be finicky beasts.

>> Do you think that idea has any wheels, or is it just a bad idea? We haven't had time to really consider details since it is all in a future dream world where we have more time :)
>>
>> Another thing we've discussed is integrating the Tranquility Beams into our Spark Streaming process the same way you guys have it in your Storm processes.
>
> Tranquility is a library that can be embedded into different setups, and the main value add is that it automatically manages working with the indexing service (more thoughts here: https://groups.google.com/forum/#!searchin/druid-development/fangjin$20yang$20%22thoughts%22/druid-development/aRMmNHQGdhI/muBGl0Xi_wgJ). We use Tranquility with Storm in production but we are evaluating Samza.

Ya, we use Spark Streaming with some of our stateful computations. I wouldn't be surprised if a Samza->Druid architecture ends up with similarities to what we have, although my knowledge of Samza is very peripheral. There are too many options.
