Hi Druid Community,
We are experiencing an issue while querying distinct values in Apache Druid. When running the following query:
It returns 98 records.
However, when running:
It returns 96 records.
We are unsure why there is a mismatch between the two results.
Additionally, in another dataset where the column has unique values, using COUNT(DISTINCT) sometimes returns a higher count than expected.
SELECT count(DISTINCT "CountryID") FROM "MunicipalWaste1"
It returns 99 records. But actually there is only 98 records in total.
We have attached screenshots and dataset for reference. Could you please help us understand why this discrepancy occurs and how we can resolve it?
Thanks in advance for your support!
Dear Aparna,
Thank you for your query. The discrepancy arises due to the difference in how COUNT(DISTINCT) and SELECT DISTINCT are processed in Apache Druid. To ensure an exact count of distinct values, you can use the following query:
select count (cnt) from (select distinct "country" as cnt from "municipalwaste")
This query retrieves all distinct Country
values in a subquery and then counts them, avoiding approximation. It should return the exact count of 98 records.
Please let us know if you need further clarification.
Best regards,
Tamil Selvan Murugesan.
--
You received this message because you are subscribed to the Google Groups "Druid User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-user+...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/druid-user/5590fbb5-aa8c-45b4-aa51-9bf9b2191802n%40googlegroups.com.
Thank you for the detailed explanations and helpful suggestions. The insights on approximate distinct count and the alternative approach to ensure accuracy were very useful. I appreciate the prompt support and guidance.
Best regards,
Aparna