Counting Distinct Values in a TimeSeries Query

Joshua Schumacher

unread,

Jan 28, 2015, 4:06:54 PM1/28/15

to druid-de...@googlegroups.com

I am really new to Druid and have been searching madly for info on doing a Count(Distinct(roomName)) in a TimeSeries Query.

My current Query is:

{  
   "queryType": "timeseris",
   "dataSource": "projector",
   "granularity":"fifteen_minute",
   "filter":{"type":"selector","dimension":"powerStatus","value":"00vP1"},
   "aggregations":[
          {"type":"count", "name":"projectorsOn"}
    ],
    "intervals":["{{{UTC ISO Timestamp}}}/ {{{UTC ISO Timestamp}}}"]
}

The data is flying in via kafka and most of the time it only logs a projector's status once every 15 minutes. So it worked. Then I realized that it didn't always do this, and I couldn't easily clean the data. So I looked up and found something about HLL: http://stackoverflow.com/questions/22875651/how-to-apply-hyperloglog-to-a-timeseries-stream

Could someone please walk me through how to pull this off. I have a feeling that my system will be using Count(Distinct)) for timeseries queries a lot?

Eric Tschetter

unread,

Jan 28, 2015, 4:20:31 PM1/28/15

to druid-de...@googlegroups.com

Josh,

Check out the Cardinality aggregator, I believe it'll give you what you want:

http://druid.io/docs/0.6.171/Aggregations.html

--Eric

> --
> You received this message because you are subscribed to the Google Groups
> "Druid Development" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to druid-developm...@googlegroups.com.
> To post to this group, send email to druid-de...@googlegroups.com.
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/druid-development/37fa4dca-d605-44f6-b9d2-d6957d514833%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

Joshua Schumacher

unread,

Jan 28, 2015, 4:42:10 PM1/28/15

to druid-de...@googlegroups.com

Could you check me on this? I can't be wrong on this query. This will return the number of rows that have powerStatus="00vP1" for 15-minute buckets, but won't count the same roomName more than once.

{ 
 "queryType": "timeseris",
 "dataSource": "projector",
 "granularity":"fifteen_minute",
 "filter":{"type":"selector","dimension":"powerStatus","value":"00vP1"},
 "aggregations":[


      {"type":"cardinality", 
       "name":"projectorsOn",
       "fieldNames":["roomName"],
       "byRow":true


      }
    ],
 "intervals":["{{{UTC ISO Timestamp}}}/ {{{UTC ISO Timestamp}}}"]
}

Fangjin Yang

unread,

Jan 28, 2015, 4:53:17 PM1/28/15

to druid-de...@googlegroups.com

Hi Joshua,

I believe you want to set "byRow" to false (the default).

Druid supports two ways of computing uniques. The first is to use the cardinality aggregator like what you are doing. This computes the approximate cardinality of a set of strings, where that set of strings is included in the segment file. One other option is to look at the "hyperUnique" aggregator. Using this aggregator, you can create hyperloglog indexes at ingestion time that can be used later on for queries. This method avoids storing the actual strings in the set you wish to run distinct counts over and should generally reduce your data storage sizes.

Let me know if this makes sense.

-- FJ

Eric Tschetter

unread,

Jan 28, 2015, 10:00:20 PM1/28/15

to druid-de...@googlegroups.com

Note also that the cardinality computation is approximate. It is not
exact. It's within a decent error (~2%), but it's not going to be
exactly correct.

--Eric

> https://groups.google.com/d/msgid/druid-development/af95ee6a-fe4f-4c62-b1c1-491555b68b80%40googlegroups.com.

Joshua Schumacher

unread,

Jan 28, 2015, 11:21:05 PM1/28/15

to druid-de...@googlegroups.com

Thanks you guys really helped me out with this. I'm sure I'll be back soon with more questions. Google is only so good.

Reply all

Reply to author

Forward