How do I utilize the HyperUnique aggregator?


Mark

May 27, 2016, 5:53:39 PM
to Druid User
I am not quite sure how to link the "hyperUnique" metric to the computation of cardinality. Has anyone had experience with this who can provide an example? Any help is appreciated.



From the Druid docs, the "cardinality" aggregator:

Computes the cardinality of a set of Druid dimensions, using HyperLogLog to estimate the cardinality. Please note that this aggregator will be much slower than indexing a column with the hyperUnique aggregator.
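For comparison, the docs of that era gave the cardinality aggregator roughly this shape (note the plural "fieldNames" and the "byRow" flag, both of which come up later in this thread):

{ "type" : "cardinality", "name" : <output_name>, "fieldNames" : [ <dimension1>, <dimension2>, ... ], "byRow" : false }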



And the "hyperUnique" aggregator, which is what applies here:

Uses HyperLogLog to compute the estimated cardinality of a dimension that has been aggregated as a "hyperUnique" metric at indexing time.

{ "type" : "hyperUnique", "name" : <output_name>, "fieldName" : <metric_name> }

Below is a quick outline of my current thinking.  Thoughts?


Druid Hadoop-based Batch Ingestion JSON:


{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "special_report-V1",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "csv",
          "columns" : ["dim1","dim2","dim3","dim4","dim5"],
          "timestampSpec": {
            "column": "msgDate",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": ["dim1","dim2","dim3","dim4","dim5"],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec" : [
        {"type": "count", "name": "count"},
        { "type" : "hyperUnique", "name" : "dim2_count", "fieldName" : "dim2" }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "HOUR",
        "queryGranularity" : "HOUR",
        "intervals": ["2016-05-19T10:00:00.000Z/2016-05-19T12:00:00.000Z"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "granularity",
        "dataGranularity": "HOUR",
        "inputPath": "/tmp/special-reports",
        "filePattern": ".*.csv"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "targetPartitionSize": 0
      }
    }
  }
}


Druid Query JSON:


{
  "queryType": "groupBy",
  "dataSource": "special_report-V1",
  "granularity": "day",
  "dimensions": ["dim1"],
  "aggregations": [
    { "type": "cardinality", "name": "dim2Count1", "fieldNames": ["dim2"], "byRow":false },
    { "type": "cardinality", "name": "dim2Count2", "fieldNames": ["dim2_count"], "byRow":false },
    {"type": "count","name": "count"}
  ],
  "postAggregations": [
  ],
  "intervals": [ "2016-05-19T10:00:00.000Z/2016-05-19T12:00:00.000Z" ]
}




Gian Merlino

May 27, 2016, 6:17:01 PM
to druid...@googlegroups.com
Hey Mark,

If you use a hyperUnique at ingestion time, you should use a hyperUnique at query time too. At query time, "hyperUnique" works on columns created with "hyperUnique" and "cardinality" works on regular string columns.
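To make that concrete (an illustrative sketch using the column names from this thread; the aggregator name "uniqueDim2" is just a placeholder, not from Gian's message):

{ "type" : "hyperUnique", "name" : "uniqueDim2", "fieldName" : "dim2_count" }

reads the pre-aggregated "dim2_count" hyperUnique column, while

{ "type" : "cardinality", "name" : "uniqueDim2", "fieldNames" : ["dim2"], "byRow" : false }

computes the estimate from the raw "dim2" strings at query time, which is slower.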

Gian


Mark

May 27, 2016, 6:28:50 PM
to Druid User

So if I wish to utilize the "hyperUnique" aggregator, does the following make sense based on my previous Druid Ingestion JSON?

Druid Ingestion JSON Snippet:

"metricsSpec" : [{"type": "count", "name": "count"}, { "type" : "hyperUnique", "name" : "dim2_count", "fieldName" : "dim2" }]


Druid Query JSON:

{
 "queryType": "groupBy",
 "dataSource": "special_report-V1",
 "granularity": "day",
 "dimensions": ["dim1"],
 "aggregations": [
   { "type": "hyperUnique", "name": "dim2_HyperUniqueCount", "fieldNames": ["dim2_count"], "byRow":false },
   {"type": "count","name": "count"}
 ],
 "postAggregations": [
 ],
 "intervals": [ "2016-05-19T10:00:00.000Z/2016-05-19T12:00:00.000Z" ]
}

Mark

May 31, 2016, 3:28:14 PM
to Druid User
Thanks @Gian for your input.

Unfortunately, when I run the updated Druid indexing spec and query above, I get a hyperUnique value of 0. Any suggestions?

Jonathan Wei

May 31, 2016, 8:28:24 PM
to druid...@googlegroups.com
Hi Mark,

Can you try changing:

   { "type": "hyperUnique", "name": "dim2_HyperUniqueCount", "fieldNames": ["dim2_count"], "byRow":false },

to:

   { "type": "hyperUnique", "name": "dim2_HyperUniqueCount", "fieldName": "dim2_count", "byRow":false },



The hyperUnique aggregator only accepts a single field, which is why it takes "fieldName" (singular) rather than "fieldNames".

Thanks,
Jon


Fangjin

May 31, 2016, 8:31:00 PM
to Druid User
Also, "byRow" isn't a field in the hyperUnique aggregator.

Mark

May 31, 2016, 10:18:51 PM
to Druid User
Thanks, it worked! I missed the subtle "fieldName" (vs. "fieldNames") difference between this aggregator's declaration and the others (http://druid.io/docs/latest/querying/aggregations.html). Nice catch.

My updated Query JSON is given below.

Druid Ingestion JSON Snippet:

"metricsSpec" : [{"type": "count", "name": "count"}, { "type" : "hyperUnique", "name" : "dim2_count", "fieldName" : "dim2" }]


Druid Query JSON:

{
 "queryType": "groupBy",
 "dataSource": "special_report-V1",
 "granularity": "day",
 "dimensions": ["dim1"],
 "aggregations": [
   { "type": "hyperUnique", "name": "dim2_HyperUniqueCount", "fieldName": "dim2_count" },
   {"type": "count","name": "count"}
 ],
 "postAggregations": [
 ],
 "intervals": [ "2016-05-19T10:00:00.000Z/2016-05-19T12:00:00.000Z" ]
}
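As an aside, if you eventually want to use the estimate inside a post-aggregation (for example, average uniques per row), Druid's "hyperUniqueCardinality" post-aggregator can reference the hyperUnique aggregator by its output name. A sketch of what the empty "postAggregations" block above could hold (illustrative only, not part of the working spec):

 "postAggregations": [
   { "type": "arithmetic", "name": "avg_uniques_per_row", "fn": "/",
     "fields": [
       { "type": "hyperUniqueCardinality", "fieldName": "dim2_HyperUniqueCount" },
       { "type": "fieldAccess", "name": "count", "fieldName": "count" }
     ]
   }
 ]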


For those reading this post, I thought I would also include a link to a helpful article on aggregations: https://theza.ch/2015/04/05/introduction-to-indexing-aggregation-and-querying-in-druid/ .

Bingo

Dec 30, 2020, 4:02:10 AM
to Druid User
Thanks, really helpful.