To follow up on the thread above, we've not found the root cause of the issue yet, but we have isolated it to some tuning options.
When we either increase "bufferGrouperInitialBuckets" or reduce "bufferGrouperMaxLoadFactor", the count becomes accurate for higher-cardinality result sets.
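For anyone who wants to try the same thing: as far as we understand, both options can be overridden per query through the groupBy query context. Below is a minimal Python sketch of building such a query. The datasource, interval, dimension, the "count" aggregator (standing in for the extension aggregator) and the two tuning values are all hypothetical placeholders, not our real query or recommended settings.

```python
import json


def with_grouper_tuning(query, initial_buckets, max_load_factor):
    """Return a copy of a native Druid query with the two grouper options set in its context."""
    tuned = dict(query)
    context = dict(query.get("context", {}))  # copy so the original query is left untouched
    context["bufferGrouperInitialBuckets"] = initial_buckets
    context["bufferGrouperMaxLoadFactor"] = max_load_factor
    tuned["context"] = context
    return tuned


# Hypothetical groupBy query; every name and value here is a placeholder.
base_query = {
    "queryType": "groupBy",
    "dataSource": "example_datasource",
    "granularity": "all",
    "intervals": ["2020-01-01/2020-01-02"],
    "dimensions": ["example_dimension"],
    "aggregations": [{"type": "count", "name": "rows"}],
}

# Illustrative values only: more initial buckets and a lower load factor.
tuned_query = with_grouper_tuning(base_query, initial_buckets=65536, max_load_factor=0.5)
print(json.dumps(tuned_query, indent=2))
```

Running the same query with and without the overrides and comparing the counts is a quick way to check whether you are hitting the same behaviour.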
We haven't yet dug far into the source code where these tuning options are used (
https://github.com/apache/druid/blob/master/processing/src/main/java/org/apache/druid/query/groupby/epinephelinae/BufferHashGrouper.java#L137),
but our first line of enquiry is to determine whether the Grouper has collisions and how the extension aggregator interacts with it.
We'd appreciate it if anyone could give any pointers or direction!
Just a note for anyone who may stumble across this: at the time of writing, the Contrib extension docs state:
"There are some limitations, when used with groupBy, the groupBy keys' numbers should not exceed maxIntermediateRows in every segment."
However, maxIntermediateRows is a config setting that doesn't apply to the V2 groupBy that is enabled by default; we tried tuning it, and it had no effect.