Hello!
select
id,
count(distinct(other_id))
from some_table
where (id = 'a' or id = 'b' or id = 'c'...) --ellipsis because there are ~100 of these
group by id
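(One incidental cleanup on my own query: I believe the ~100 OR clauses can be written equivalently as an IN list, which at least reads better, though I wouldn't expect it to change performance much -- ellipsis again standing in for the full list:)

select
id,
count(distinct other_id)
from some_table
where id in ('a', 'b', 'c', ...)
group by id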
To give some further context:
- Each record in some_table with a given id has ~1MM other_ids, of which ~98% are unique (so I guess other_id is a high-cardinality column?).
- Restating ^^ to some extent: the total # of records in the filtered set is ~100MM (i.e. the query below yields ~100MM):
select count(0)
from some_table
where (id = 'a' or id = 'b' or id = 'c'...) --ellipsis because there are ~100 of these
My thoughts on what might be going wrong:
- count(distinct) is by far the most likely culprit? I understand that count(distinct()) is a well-known problem in big data, with (it appears to me) HLL being the canonical solution.
- Maybe group by is slowing things down? I understand that groupBy has its own tuning complexities and concerns in Druid, and I don't know whether my cluster is well tuned in this regard.
- Maybe the bottleneck is that I need more segments loaded onto historical nodes than they have capacity for (we're using memory-mapped storage)?
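For reference, the approximate variant I mention trying further down looks roughly like this (assuming Druid SQL's built-in approx_count_distinct, which I understand is HLL-based; ellipsis as in the original query):

select
id,
approx_count_distinct(other_id)
from some_table
where (id = 'a' or id = 'b' or id = 'c'...)
group by id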
My debugging questions break down into:
- What aspect(s) of my query is (are) likely causing the slowness?
- Based on your thoughts re ^^, what are some possible paths forward? (e.g. scaling historical node JVM heap memory? preprocessing data before loading it into Druid to avoid the count(distinct())?)
- Sort of related to #2, is this more a case of:
- My Druid cluster probably not being well tuned to the task, or
- This being an inherently hard problem where the actual right thing to do is use HLL?
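To make the "preprocess before loading" idea above a bit more concrete: my understanding is that if other_id were rolled up into an HLL sketch column at ingestion time (hypothetical column name other_id_hll, and assuming the druid-datasketches extension is loaded), the query would become something like the below -- though I'm not sure whether that's the right trade-off here:

select
id,
approx_count_distinct_ds_hll(other_id_hll)
from some_table
where (id = 'a' or id = 'b' or id = 'c'...)
group by id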
And, of course, please let me know if you have any thoughts on how to debug more granularly, etc. (FWIW, I tried running EXPLAIN on the queries using count(distinct()) and approx_count_distinct, but I don't know enough at this point to pick out salient differences between the two plans... happy to provide those if they would be helpful).
Thanks in advance for your patience and time in helping a n00b along.
Matt