Using Druid together with Elasticsearch (newbie)


Richard Siebeling

Mar 30, 2014, 3:49:50 PM
to druid-de...@googlegroups.com
Hi,

We're in the process of developing a new BI system based on Apache Spark, D3, and possibly Druid and/or Elasticsearch.
Apache Spark processes the data coming from different sources, and I would like to use Druid to store the data that is explored in the frontend.
However, it would be great to use Elasticsearch to index all text fields, so that all dimensions and measures are computed by Apache Spark, stored in memory by Druid, and all (relevant) text fields are indexed by Elasticsearch.

An example use case: the user enters a search string, which is handled by Elasticsearch. Elasticsearch returns all the relevant document IDs, and these IDs are used to filter the data in Druid. The user can then use Druid to explore the data that was initially filtered by Elasticsearch.
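
A minimal sketch of that flow, assuming the document ID is stored as a Druid dimension. The data source name `events`, the dimension name `documentID`, and the interval are hypothetical placeholders, and the sketch only builds the query body; it does not talk to either system. It takes the `_id` of each hit from an Elasticsearch search response and turns the list into a Druid filter made of `selector` filters combined with `or`.

```python
def extract_doc_ids(es_response):
    """Collect the _id of every hit in an Elasticsearch search response body."""
    return [hit["_id"] for hit in es_response["hits"]["hits"]]


def build_druid_query(doc_ids, datasource="events", id_dimension="documentID",
                      interval="2014-03-01/2014-04-01"):
    """Build a Druid timeseries query restricted to the given document IDs.

    datasource, id_dimension and interval are hypothetical placeholders.
    """
    # One "selector" filter per ID, combined with a single "or" filter.
    id_filter = {
        "type": "or",
        "fields": [
            {"type": "selector", "dimension": id_dimension, "value": str(doc_id)}
            for doc_id in doc_ids
        ],
    }
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "day",
        "filter": id_filter,
        "aggregations": [{"type": "count", "name": "rows"}],
        "intervals": [interval],
    }


# Example with a hand-written response in the shape Elasticsearch returns.
sample_response = {"hits": {"hits": [{"_id": "a1"}, {"_id": "b2"}]}}
query = build_druid_query(extract_doc_ids(sample_response))
```

The resulting dictionary would be serialized to JSON and posted to a Druid broker. A search that matches many documents produces a correspondingly large filter, so this approach probably suits searches that narrow the data to a modest number of IDs.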

Is something like this possible?

thanks in advance,
Richard

Fangjin Yang

Mar 31, 2014, 2:15:16 PM
to druid-de...@googlegroups.com
Hi Richard, this is definitely possible.

Netflix has written up a blog post of a similar system using Druid.

shrinidhi chaudhari

Sep 4, 2014, 10:07:50 AM
to druid-de...@googlegroups.com
Hi Fangjin,
I do not understand the thought behind Suro's (Netflix) setup. If all the logs are already indexed in Elasticsearch, why would one need to have them in Druid?
I plan to study the architecture of Druid to answer the above question, but if you could give me some major areas where Druid has an advantage over Elasticsearch, it would be very helpful.

Eric Tschetter

Sep 4, 2014, 10:57:10 AM
to druid-de...@googlegroups.com
Shrinidhi,

The simple answer is that Elasticsearch's hardware requirements to
ingest large amounts of data and provide fast aggregates on top of it
are significantly higher. "Significant" is relative, but it was
enough to introduce different infrastructure for different cases.

The thing Elasticsearch does well that Druid does not do is provide
access to the raw event-level data.

--Eric

shrinidhi chaudhari

Sep 4, 2014, 3:14:49 PM
to druid-de...@googlegroups.com
Thanks Eric,
Wish-list item: Elasticsearch in the "Druid vs ...." section in the docs.

* starts reading druid white paper *






--
Shrinidhi Chaudhari.

Bae, Jae Hyeon

Sep 4, 2014, 4:09:38 PM
to druid-de...@googlegroups.com
An Elasticsearch aggregation is a very expensive operation compared to a Druid GroupBy query. This is the main reason why we (Netflix) started using Druid together with Elasticsearch.

In 2012, Elasticsearch was at version 0.20.x, and its facet search was really slow and consumed a lot of memory. In the worst case, a facet search brought down the whole cluster.

Elasticsearch has since remodeled facet search as 'aggregations', but I still observed a 4-fold or 5-fold aggregation bring the cluster down with an OOM error. A 4-fold or 5-fold aggregation corresponds to a Druid GroupBy query with 4 or 5 dimensions.
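
To make the "4-fold" comparison concrete, here is a rough sketch that builds both query bodies for the same four dimensions: a Druid groupBy query and an Elasticsearch request with one `terms` aggregation nested per dimension. The dimension names, the data source name `events`, and the interval are hypothetical; nothing here is taken from the Netflix setup.

```python
def druid_group_by(dimensions, datasource="events",
                   interval="2014-09-01/2014-09-02"):
    """Druid groupBy query over the given dimensions (placeholder names)."""
    return {
        "queryType": "groupBy",
        "dataSource": datasource,
        "granularity": "all",
        "dimensions": list(dimensions),
        "aggregations": [{"type": "count", "name": "rows"}],
        "intervals": [interval],
    }


def es_nested_terms(fields):
    """Elasticsearch request body with one terms aggregation nested per field."""
    if not fields:
        raise ValueError("at least one field is required")
    body = None
    # Build from the innermost level outwards so the first field ends up outermost.
    for field in reversed(fields):
        level = {"terms": {"field": field}}
        if body is not None:
            level["aggs"] = body
        body = {"by_" + field: level}
    # "size": 0 asks for aggregation results only, without the matching documents.
    return {"size": 0, "aggs": body}


dims = ["country", "device", "browser", "os"]  # hypothetical dimension names
druid_query = druid_group_by(dims)
es_request = es_nested_terms(dims)
```

In the Elasticsearch body every extra level of nesting creates sub-buckets under each parent bucket, so the number of buckets can grow multiplicatively with the number of dimensions, which is one plausible reading of the memory pressure described above.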

So, I recommend using Druid for faster aggregation and Elasticsearch for full-text search and retrieving full documents.


Eric Tschetter

Sep 4, 2014, 4:17:24 PM
to druid-de...@googlegroups.com
Jae,

So the ingestion rate/volume isn't an issue for you guys anymore? I
seem to remember you saying before that you could only get ~1k
events/second indexed per node with ES.

--Eric

Bae, Jae Hyeon

Sep 4, 2014, 4:21:42 PM
to druid-de...@googlegroups.com
I forgot to mention ingestion rate. On the same hardware, indexing throughput is not even comparable between Druid and ES. On the m1.xlarge AWS EC2 instance type, the ingestion rates differ by a factor of 10 or even more.

If our data flow rate is 10k messages per second, a single Druid realtime node can consume all messages without any delay, but for ES I need to allocate 10 m1.xlarge instances, with one replica and enough shards. Even with 10 m1.xlarge instances, ES facet search performance was terrible.
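
The sizing behind those numbers can be written out as a small calculation. This is only a back-of-the-envelope sketch: the ~1k events/second per ES node is the figure Eric recalls earlier in the thread, the Druid per-node rate is an assumption taken from "a single realtime node keeps up with 10k messages per second", and any extra load from replication is not modeled.

```python
import math


def nodes_needed(flow_rate, per_node_rate):
    """Smallest number of nodes whose combined ingest rate covers flow_rate."""
    return math.ceil(flow_rate / per_node_rate)


flow_rate = 10_000            # messages per second, from the post
es_per_node = 1_000           # ~1k events/second per ES node, as recalled in the thread
druid_per_node = 10_000       # assumption: one realtime node handles the full flow

es_nodes = nodes_needed(flow_rate, es_per_node)
druid_nodes = nodes_needed(flow_rate, druid_per_node)
print(es_nodes, druid_nodes)
```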

I am not blaming ES itself; an inverted index is simply not a structure optimized for aggregation.


Eric Tschetter

Sep 4, 2014, 4:24:12 PM
to druid-de...@googlegroups.com
Btw, thanks for adding info straight from the horse's mouth, Jae!

This will make its way into a "Druid vs. ElasticSearch" post now :).

--Eric

Arjun Iyer

Oct 5, 2014, 8:51:06 PM
to druid-de...@googlegroups.com
Jae, is this situation mitigated if you use docValues in Elasticsearch?

Arjun

google...@fullscale180.com

Nov 7, 2014, 10:48:13 AM
to druid-de...@googlegroups.com
What version of Elasticsearch was used? A lot has changed in Elasticsearch to better support this type of workload since 0.90, and yet more in the recent 1.4 release.

I would do some load testing; you might be surprised at how well Elasticsearch handles metrics today.

Fangjin Yang

Nov 7, 2014, 10:52:24 AM
to druid-de...@googlegroups.com
Druid public benchmarks are here: http://druid.io/blog/2014/03/17/benchmarking-druid.html

Would be really interesting to see a comparison. I'm an architecture guy, so I'm a lot more curious: what architecture changes in Lucene allow for fast aggregates?
