Precision of cardinality aggregator

md...@skyportsystems.com

unread,

Sep 23, 2015, 12:05:17 PM9/23/15

to Druid User

Hi all,

I'm attempting to use a cardinality aggregator to count distinct values for one dimension within groups constructed across another dimension. I notice that the output of the cardinality aggregator is a non-integer - is there a way to get exact counts (even at the cost of some performance)?

For example, this query:

{
    "queryType": "groupBy",
    "dataSource": "observations",
    "granularity": "fifteen_minute",
    "aggregations": [
        { "type": "cardinality", "name": "ip_count", "fieldNames": [ "remote_ip" ] },
        { "type": "cardinality", "name": "fqdn_count", "fieldNames": [ "remote_fqdn" ] },
        { "type": "count", "name": "count" }
    ],
    "intervals": "2015-08-03T17:00:00Z/2015-08-03T18:00:00Z",
    "dimensions": [ { "type": "default", "dimension": "remote_ip", "outputName": "remote_ip" } ]
}

Returns this response:

[
    {
        "version": "v1",
        "timestamp": "2015-08-03T17:45:00.000Z",
        "event": {
            "ip_count": 1.0002442201269182,
            "count": 3,
            "remote_ip": null,
            "fqdn_count": 2.000977198748901
        }
    },
    {
        "version": "v1",
        "timestamp": "2015-08-03T17:45:00.000Z",
        "event": {
            "ip_count": 1.0002442201269182,
            "count": 7,
            "remote_ip": "10.10.194.190",
            "fqdn_count": 1.0002442201269182
        }
    }
]

I guess "count" is the number of actual underlying data source events in each group, which isn't what I want. The other counts are almost right, but I am hesitant to just round/truncate and treat it as accurate...

Thanks,

Mike

Gian Merlino

unread,

Sep 23, 2015, 12:56:16 PM9/23/15

to Druid User

Hey Mike,

If you want the *number of events ingested* you can get an exact count by doing a "longSum" at query time over a "count" metric that you created at indexing time. See "Counting the number of ingested events" here: http://druid.io/docs/latest/ingestion/schema-design.html

On the other hand, if you want the *number of unique values of some column* then you have three options.

1) You can get an fast approximate count with the "cardinality" aggregator. The error rate of the approximation generally averages 3%, give or take a couple percent.

2) You can get an even faster approximate count, and possibly avoid the need to store the original column altogether (if you weren't using it for anything but counting uniques) by using a "hyperUnique" aggregator at both query time and at ingestion time. This uses the same algorithm as (1) so the error rate is the same.

3) You can get an exact count of unique values by using a nested groupBy. The inner query should group by the column you want to count unique values of, and the outer query should simply do a "count" aggregation. This will be exact but will be slower, and use more memory, than (1) or (2). It may not be viable for columns that have a very large number of unique values (more than can fit into memory on your query broker).

--
You received this message because you are subscribed to the Google Groups "Druid User" group.
To unsubscribe from this group and stop receiving emails from it, send an email to druid-user+...@googlegroups.com.
To post to this group, send email to druid...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/druid-user/9fe51ef6-878f-4f67-88e1-9dfdb48eefd7%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

md...@skyportsystems.com

unread,

Sep 23, 2015, 1:14:51 PM9/23/15

to Druid User

Hi Gian,

Thanks for the tip about the nested groupBy query - I'll take a look. I'm interested in the number of unique values in a column here.

I didn't include this initially to keep the question simple, but in reality I'm running a query that uses a JavaScript aggregator to group IP addresses by subnet - e.g. with a mask size of 8, and distinct values of 4.4.4.4, 10.1.2.55 and 10.10.100.1 I'd like the output to contain two groups, one with subnet 4.0.0.0 and count of 1, and another with subnet 10.0.0.0 and count of 2.

Is there a way to leverage the fact that I'm already using a javascript aggregator on the column in question to get unique counts for each group?

Thanks,

Mike

Fangjin Yang

unread,

Sep 24, 2015, 8:32:05 PM9/24/15

to Druid User

The best ways to get approximate distinct counts are with the cardinality aggregator (which runs the hyperloglog algorithm over a string column), or the faster approach which is to index a column as a hyperloglog and run native hyperUnique queries over the column.

Reply all

Reply to author

Forward