Discrepancy in COUNT(DISTINCT) and DISTINCT Queries in Apache Druid

aparna bala

unread,

Feb 17, 2025, 2:01:45 AMFeb 17

to Druid User

Hi Druid Community,

We are experiencing an issue while querying distinct values in Apache Druid. When running the following query:

SELECT DISTINCT Country FROM "MunicipalWaste1";

It returns 98 records.

However, when running:

SELECT COUNT(DISTINCT Country) FROM "MunicipalWaste1";

It returns 96 records.

We are unsure why there is a mismatch between the two results.

Additionally, in another dataset where the column has unique values, using COUNT(DISTINCT) sometimes returns a higher count than expected.

SELECT count(DISTINCT "CountryID") FROM "MunicipalWaste1"

It returns 99 records. But actually there is only 98 records in total.

We have attached screenshots and dataset for reference. Could you please help us understand why this discrepancy occurs and how we can resolve it?

Thanks in advance for your support!

Screenshot 2025-02-17 122415.png

MuncipalWaste_data.csv

Screenshot 2025-02-17 120926.png

Screenshot 2025-02-17 121008.png

Screenshot 2025-02-17 122441.png

Tamilselvan Murugesan

unread,

Feb 17, 2025, 3:35:19 AMFeb 17

to druid...@googlegroups.com

Subject: Clarification on COUNT(DISTINCT) Discrepancy

Dear Aparna,

Thank you for your query. The discrepancy arises due to the difference in how COUNT(DISTINCT) and SELECT DISTINCT are processed in Apache Druid. To ensure an exact count of distinct values, you can use the following query:

select count (cnt) from (select distinct "country" as cnt from "municipalwaste")

This query retrieves all distinct Country values in a subquery and then counts them, avoiding approximation. It should return the exact count of 98 records.

Please let us know if you need further clarification.

Best regards,

Tamil Selvan Murugesan.

--
You received this message because you are subscribed to the Google Groups "Druid User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-user+...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/druid-user/5590fbb5-aa8c-45b4-aa51-9bf9b2191802n%40googlegroups.com.

John Kowtko

unread,

Feb 17, 2025, 10:14:06 AMFeb 17

to Druid User

As Tamil mentions, Druid by default uses an approximate distinct count algorithm, even for SQL "count(distinct)" ... this is due to the origin of the product. You can supply the query context parameter useApproximateCountDistinct to FALSE to force it to use the exact algorithm (which turns it into a nested group by)

https://druid.apache.org/docs/latest/querying/sql-query-context#sql-query-context-parameters.

If you have any more questions about this let us know.

Thanks. John

aparna bala

unread,

Feb 18, 2025, 12:23:52 AMFeb 18

to Druid User

Dear Team,

Thank you for the detailed explanations and helpful suggestions. The insights on approximate distinct count and the alternative approach to ensure accuracy were very useful. I appreciate the prompt support and guidance.

Best regards,
Aparna

Peter Marshall

unread,

Feb 20, 2025, 3:08:18 AMFeb 20

to Druid User

There's a Python notebook in learn-druid that covers this topic, too - so you can see what happens.

https://github.com/implydata/learn-druid/blob/main/notebooks/03-query/03-approx-count-distinct.ipynb

Reply all

Reply to author

Forward