Hello!
select
id,
count(distinct(other_id))
from some_table
where (id = 'a' or id = 'b' or id = 'c'...) --ellipsis because there are ~100 of these
group by id
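(One incidental cleanup on my own query: I believe the ~100 OR clauses can be written equivalently as an IN list, which at least reads better, though I wouldn't expect it to change performance much -- ellipsis again standing in for the full list:)

select
id,
count(distinct other_id)
from some_table
where id in ('a', 'b', 'c', ...)
group by id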
To give some further context:
- Each record in some_table with a given id has ~1MM other_ids, of which ~98% are unique (so I guess other_id is a high-cardinality column?).
- Restating ^^ to some extent: the total # of records in the filtered set is ~100MM (i.e. the query below yields ~100MM):
select count(0)
from some_table
where (id = 'a' or id = 'b' or id = 'c'...) --ellipsis because there are ~100 of these
My thoughts on what might be going wrong:
- count(distinct) is by far the most likely culprit? I understand that count(distinct()) is a well-known problem in big data, with (it appears to me) HLL being the canonical solution.
- Maybe group by is slowing things down? I understand that groupBy has its own tuning complexities and concerns in Druid, and I don't know whether my cluster is well tuned in this regard.
- Maybe the bottleneck is that I need more segments loaded onto historical nodes than they have capacity for (we're using memory-mapped storage)?
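For reference, the approximate variant I mention trying further down looks roughly like this (assuming Druid SQL's built-in approx_count_distinct, which I understand is HLL-based; ellipsis as in the original query):

select
id,
approx_count_distinct(other_id)
from some_table
where (id = 'a' or id = 'b' or id = 'c'...)
group by id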
My debugging questions break down into:
- What aspect(s) of my query is (are) likely causing the slowness?
- Based on your thoughts re ^^, what are some possible paths forward? (e.g. scaling historical node JVM heap memory? preprocessing data before loading it into Druid to avoid the count(distinct())?)
- Sort of related to #2, is this more a case of:
- My Druid cluster probably not being well tuned to the task, or
- This being an inherently hard problem where the actual right thing to do is use HLL?
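To make the "preprocess before loading" idea above a bit more concrete: my understanding is that if other_id were rolled up into an HLL sketch column at ingestion time (hypothetical column name other_id_hll, and assuming the druid-datasketches extension is loaded), the query would become something like the below -- though I'm not sure whether that's the right trade-off here:

select
id,
approx_count_distinct_ds_hll(other_id_hll)
from some_table
where (id = 'a' or id = 'b' or id = 'c'...)
group by id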
And, of course, please let me know if you have any thoughts on how to debug more granularly, etc. (FWIW, I tried running EXPLAIN on the queries using count(distinct()) and approx_count_distinct, but I don't know enough at this point to pick out salient differences between the two plans... happy to provide those if they would be helpful).
Thanks in advance for your patience and time in helping a n00b along.
Matt