Druid DistinctCount Extensions Issues with Multiple Dimensions


Tom Harnasz

Nov 26, 2024, 10:01:34 AM
to Druid User
Hi there,

When using `druid/extensions-contrib/distinctcount`, we noticed that the returned counts differ when grouping by single and multiple dimensions.

Single-dimension counts are always correct, whereas multiple-dimension counts are sometimes correct and sometimes wildly off. We haven't found a pattern to this so far.

Is anyone else experiencing the same issues? We will be sure to follow up with reproducible examples shortly.

Cheers,

Tom

Tom Harnasz

Nov 28, 2024, 4:49:40 AM
to Druid User
To follow up on the thread above, we've not found the root cause of the issue yet, but we have isolated it to some tuning options.

When we either increase "bufferGrouperInitialBuckets" or reduce "bufferGrouperMaxLoadFactor", the count becomes accurate for higher-cardinality result sets.
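For anyone who wants to try the same comparison, here is a minimal Python sketch of a native groupBy query with those two options overridden. It assumes the options are set per query through the query context. The datasource, dimension names, field name, interval, and tuning values are all made-up placeholders, not the ones from our cluster.

```python
import json


def build_groupby_query(dimensions, initial_buckets=None, max_load_factor=None):
    """Build a native groupBy query that uses the distinctCount aggregator.

    The two optional arguments are passed through as query context overrides;
    when omitted, the cluster defaults apply.
    """
    context = {}
    if initial_buckets is not None:
        context["bufferGrouperInitialBuckets"] = initial_buckets
    if max_load_factor is not None:
        context["bufferGrouperMaxLoadFactor"] = max_load_factor

    return {
        "queryType": "groupBy",
        "dataSource": "events",                    # hypothetical datasource
        "granularity": "all",
        "intervals": ["2024-11-01/2024-12-01"],    # hypothetical interval
        "dimensions": list(dimensions),
        "aggregations": [
            # "visitor_id" is a hypothetical column name
            {"type": "distinctCount", "name": "uv", "fieldName": "visitor_id"}
        ],
        "context": context,
    }


# Same query twice: once with defaults, once with arbitrary example overrides.
baseline = build_groupby_query(["country", "device"])
tuned = build_groupby_query(
    ["country", "device"], initial_buckets=1_000_000, max_load_factor=0.5
)

print(json.dumps(tuned, indent=2))
```

Posting both payloads to the native query endpoint and diffing the "uv" values per group should show whether the overrides change the result on a given dataset.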

We haven't yet looked closely at the source where these tuning options are used (https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/BufferHashGrouper.java#L137). Our first line of enquiry is to determine whether the Grouper has collisions and how the extension's aggregator interacts with it.

We'd appreciate it if anyone could give any pointers or direction!

Just a note for anyone who may stumble across this: at the time of writing, it states in the Contrib extension docs:

"There are some limitations, when used with groupBy, the groupBy keys' numbers should not exceed maxIntermediateRows in every segment."

However, maxIntermediateRows is a config setting that doesn't apply to the groupBy v2 engine, which is enabled by default; we tried tuning it, and it had no influence.

Peter Marshall

Dec 2, 2024, 2:31:07 AM
to Druid User
Hey Tom! So this extension hasn't been maintained in... ermmm... (checks source) ... 6 years.
You do have other options nowadays:

1) For the interactive query API:
 - HLL / Theta sketches for approximates
 - Turn off approximate count distinct in the query context
2) Do the asynchronous queries thing.
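
As a rough illustration of option 1, the Python sketch below builds Druid SQL API request bodies for the three variants: exact COUNT(DISTINCT) with `useApproximateCountDistinct` set to false in the query context, and the DataSketches functions `APPROX_COUNT_DISTINCT_DS_HLL` and `APPROX_COUNT_DISTINCT_DS_THETA` (which need the druid-datasketches extension loaded). The table and column names are made up for the example.

```python
import json

# Hypothetical table and column names, for illustration only.
TABLE = "events"
DIMENSIONS = ["country", "device"]
COLUMN = "visitor_id"

# One SQL expression per counting strategy.
EXPRESSIONS = {
    "exact": 'COUNT(DISTINCT "{col}")',
    "hll": 'APPROX_COUNT_DISTINCT_DS_HLL("{col}")',
    "theta": 'APPROX_COUNT_DISTINCT_DS_THETA("{col}")',
}


def count_distinct_payload(mode):
    """Return a request body for the Druid SQL API for the given mode."""
    if mode not in EXPRESSIONS:
        raise ValueError(f"unknown mode: {mode}")

    dims = ", ".join(f'"{d}"' for d in DIMENSIONS)
    expr = EXPRESSIONS[mode].format(col=COLUMN)
    sql = f'SELECT {dims}, {expr} AS uv FROM "{TABLE}" GROUP BY {dims}'

    # Exact counting requires turning off the approximate default.
    context = {"useApproximateCountDistinct": False} if mode == "exact" else {}
    return {"query": sql, "context": context}


for mode in ("exact", "hll", "theta"):
    print(json.dumps(count_distinct_payload(mode), indent=2))
```

Each payload can be sent as JSON to the SQL endpoint; comparing the three "uv" columns for the same groups gives a feel for the accuracy trade-off.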

There's a public Python notebook on COUNT DISTINCT here - it focuses on the interactive use cases.

Hope this helps...

Tom Harnasz

Dec 20, 2024, 11:29:36 AM
to Druid User
Hi Peter!

Thanks for your input. I appreciate it. Unfortunately, those suggestions won't work for us.

We're happy to assist in patching this, but it would be helpful to understand the problem.

Many thanks,

Tom