Druid DistinctCount Extensions Issues with Multiple Dimensions

Tom Harnasz

unread,

Nov 26, 2024, 10:01:34 AM11/26/24

to Druid User

Hi there,

When using `druid/extensions-contrib/distinctcount`, we noticed that the returned counts differ when grouping by single and multiple dimensions.

We noticed that single-dimension counts are always correct, whereas multiple-dimension counts are sometimes correct or vastly out. We haven't figured out a pattern to this so far.

Is anyone else experiencing the same issues? We will be sure to follow up with reproducible examples shortly.

Cheers,

Tom

Tom Harnasz

unread,

Nov 28, 2024, 4:49:40 AM11/28/24

to Druid User

To follow up on the thread above, we've not found the root cause of the issue yet, but we have isolated it to some tuning options.

When we increase either "bufferGrouperInitialBuckets" or reduce "bufferGrouperMaxLoadFactor," the count becomes accurate for higher cardinality result sets.

Whilst we haven't investigated too much into the source of the code where these tuning options are used (https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/BufferHashGrouper.java#L137)

Our first line of enquiry is to determine whether the Grouper has collisions and how the extension aggregator interacts with it.

We'd appreciate it if anyone could give any pointers or direction!

Just a note for anyone who may stumble across this: at the time of writing, it states in the Contrib extension docs:

"There are some limitations, when used with groupBy, the groupBy keys' numbers should not exceed maxIntermediateRows in every segment."

However, maxIntermediateRows is a config setting that doesn't apply to the V2 GroupyBy enabled by default; we tried tuning this, and it doesn't have any influence.

Peter Marshall

unread,

Dec 2, 2024, 2:31:07 AM12/2/24

to Druid User

Hey Tom! So this extension hasn't been maintained in... ermmm... (checks source) ... 6 years.

You do have other options nowadays:

1) For the interactive query API:

- HLL / Thetasketches for approximates

- Turn off approximate count distinct in the query context

2) Do the asynchronous queries thing.

There's a public python notebook on COUNT DISTINCT here - it focuses on the interactive use cases.

https://github.com/implydata/learn-druid/blob/main/notebooks/03-query/03-approx-count-distinct.ipynb

Hope this helps...

Tom Harnasz

unread,

Dec 20, 2024, 11:29:36 AM12/20/24

to Druid User

Hi Peter!

Thanks for your input. I appreciate it. Unfortunately, those suggestions won't work for us.

We're happy to assist in patching this, but it would be helpful to understand the problem.

Many thanks,

Tom

Paul Oyston

unread,

Jan 7, 2025, 7:14:26 AMJan 7

to Druid User

Hi Peter,

I just wanted to follow up on this, unfortunately for our queries we just can't rely on a approximation given our reporting needs, and given the volume of the data using count distinct (with approximations turned off) we can't complete the queries in a reasonable time and in many cases this will time out on our frontend before the druid query can complete.

Our pattern is as follows, we have a dimension which we can guarantee will be unique per segment and as such we want to bypass the need to pass back all of these values to a central process to distinct count. Given what the extension does this seems like a perfect fit, and again we're happy to assist in fixing this if we could get a bit of an explanation into why this is happening and what a possible route to remediation would look like.

Alternatively is there a way to trick the SQL engine into doing this instead? My thought being that if you select a non-estimated distinct count grouped by the segment size then an additional step wouldn't be required?

Ben Krug

unread,

Jan 7, 2025, 2:02:26 PMJan 7

to druid...@googlegroups.com

I don't know this extension, but here's a WAG in case it helps.

If the counts are over-counts, then I'd ask whether your partitioning hash is built on all the dimensions you're counting?

The extension doc page says that if you're not partitioned by hash on the dimension, it can overcount.

Paul Oyston

VP Architecture

This email may contain confidential material; unintended recipients must not disseminate, use, or act upon any information in it. If you received this email in error, please contact the sender and permanently delete the email.
Performance Horizon Group Limited | Registered in England & Wales 07188234 | Level 8, West One, Forth Banks, Newcastle upon Tyne, NE1 3PA

--
You received this message because you are subscribed to the Google Groups "Druid User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-user+...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/druid-user/d27c51d3-c328-4af1-a3f7-ebde2528d108n%40googlegroups.com.

Reply all

Reply to author

Forward