How do I utilize the HyperUnique aggregator?


Mark

May 27, 2016, 5:53:39 PM
to Druid User
I am not quite sure how to link the "hyperUnique" metric to the computation of cardinality. Has anyone had experience with this who can provide an example? Any help is appreciated.



From the Druid docs, the "cardinality" aggregator:

Computes the cardinality of a set of Druid dimensions, using HyperLogLog to estimate the cardinality. Please note that this aggregator will be much slower than indexing a column with the hyperUnique aggregator.
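For comparison, the docs of that era gave the cardinality aggregator roughly this shape (note the plural "fieldNames" and the "byRow" flag, both of which come up later in this thread):

{ "type" : "cardinality", "name" : <output_name>, "fieldNames" : [ <dimension1>, <dimension2>, ... ], "byRow" : false }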



And the "hyperUnique" aggregator, which is what applies here:

Uses HyperLogLog to compute the estimated cardinality of a dimension that has been aggregated as a "hyperUnique" metric at indexing time.

{ "type" : "hyperUnique", "name" : <output_name>, "fieldName" : <metric_name> }

Below is a quick outline of my current thinking.  Thoughts?


Druid Hadoop-based Batch Ingestion JSON:


{
  "type": "index_hadoop",
  "spec": {
    "dataSchema": {
      "dataSource": "special_report-V1",
      "parser": {
        "type": "string",
        "parseSpec": {
          "format": "csv",
          "columns" : ["dim1","dim2","dim3","dim4","dim5"],
          "timestampSpec": {
            "column": "msgDate",
            "format": "auto"
          },
          "dimensionsSpec": {
            "dimensions": ["dim1","dim2","dim3","dim4","dim5"],
            "dimensionExclusions": [],
            "spatialDimensions": []
          }
        }
      },
      "metricsSpec" : [
        {"type": "count", "name": "count"},
        { "type" : "hyperUnique", "name" : "dim2_count", "fieldName" : "dim2" }
      ],
      "granularitySpec" : {
        "type" : "uniform",
        "segmentGranularity" : "HOUR",
        "queryGranularity" : "HOUR",
        "intervals": ["2016-05-19T10:00:00.000Z/2016-05-19T12:00:00.000Z"]
      }
    },
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "granularity",
        "dataGranularity": "HOUR",
        "inputPath": "/tmp/special-reports",
        "filePattern": ".*.csv"
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "targetPartitionSize": 0
      }
    }
  }
}


Druid Query JSON:


{
  "queryType": "groupBy",
  "dataSource": "special_report-V1",
  "granularity": "day",
  "dimensions": ["dim1"],
  "aggregations": [
    { "type": "cardinality", "name": "dim2Count1", "fieldNames": ["dim2"], "byRow":false },
    { "type": "cardinality", "name": "dim2Count2", "fieldNames": ["dim2_count"], "byRow":false },
    {"type": "count","name": "count"}
  ],
  "postAggregations": [
  ],
  "intervals": [ "2016-05-19T10:00:00.000Z/2016-05-19T12:00:00.000Z" ]
}




Gian Merlino

May 27, 2016, 6:17:01 PM
to druid...@googlegroups.com
Hey Mark,

If you use a hyperUnique at ingestion time, you should use a hyperUnique at query time too. At query time, "hyperUnique" works on columns created with "hyperUnique" and "cardinality" works on regular string columns.
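To make that concrete (an illustrative sketch using the column names from this thread; the aggregator name "uniqueDim2" is just a placeholder, not from Gian's message):

{ "type" : "hyperUnique", "name" : "uniqueDim2", "fieldName" : "dim2_count" }

reads the pre-aggregated "dim2_count" hyperUnique column, while

{ "type" : "cardinality", "name" : "uniqueDim2", "fieldNames" : ["dim2"], "byRow" : false }

computes the estimate from the raw "dim2" strings at query time, which is slower.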

Gian


Mark

May 27, 2016, 6:28:50 PM
to Druid User

So if I wish to utilize the "hyperUnique" aggregator, does the following make sense based on my previous Druid Ingestion JSON?

Druid Ingestion JSON Snippet:

"metricsSpec" : [{"type": "count", "name": "count"}, { "type" : "hyperUnique", "name" : "dim2_count", "fieldName" : "dim2" }]


Druid Query JSON:

{
 "queryType": "groupBy",
 "dataSource": "special_report-V1",
 "granularity": "day",
 "dimensions": ["dim1"],
 "aggregations": [
   { "type": "hyperUnique", "name": "dim2_HyperUniqueCount", "fieldNames": ["dim2_count"], "byRow":false },
   {"type": "count","name": "count"}
 ],
 "postAggregations": [
 ],
 "intervals": [ "2016-05-19T10:00:00.000Z/2016-05-19T12:00:00.000Z" ]
}

Mark

May 31, 2016, 3:28:14 PM
to Druid User
Thanks @Gian for your input.

Unfortunately, when I run the updated Druid indexing spec and query above, I get a hyperUnique value of 0. Any suggestions?

Jonathan Wei

May 31, 2016, 8:28:24 PM
to druid...@googlegroups.com
Hi Mark,

Can you try changing:

   { "type": "hyperUnique", "name": "dim2_HyperUniqueCount", "fieldNames": ["dim2_count"], "byRow":false },

to:

   { "type": "hyperUnique", "name": "dim2_HyperUniqueCount", "fieldName": "dim2_count", "byRow":false },



The hyperUnique aggregator only accepts a single field, which is why it takes "fieldName" (singular) rather than "fieldNames".

Thanks,
Jon


Fangjin

May 31, 2016, 8:31:00 PM
to Druid User
Also, "byRow" isn't a field in the hyperUnique aggregator.

Mark

May 31, 2016, 10:18:51 PM
to Druid User
Thanks, it worked! I missed the subtle "fieldName" (vs. "fieldNames") difference between this aggregator's declaration and the others (http://druid.io/docs/latest/querying/aggregations.html). Nice catch.

My updated Query JSON is given below.

Druid Ingestion JSON Snippet:

"metricsSpec" : [{"type": "count", "name": "count"}, { "type" : "hyperUnique", "name" : "dim2_count", "fieldName" : "dim2" }]


Druid Query JSON:

{
 "queryType": "groupBy",
 "dataSource": "special_report-V1",
 "granularity": "day",
 "dimensions": ["dim1"],
 "aggregations": [
   { "type": "hyperUnique", "name": "dim2_HyperUniqueCount", "fieldName": "dim2_count" },
   {"type": "count","name": "count"}
 ],
 "postAggregations": [
 ],
 "intervals": [ "2016-05-19T10:00:00.000Z/2016-05-19T12:00:00.000Z" ]
}
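As an aside, if you eventually want to use the estimate inside a post-aggregation (for example, average uniques per row), Druid's "hyperUniqueCardinality" post-aggregator can reference the hyperUnique aggregator by its output name. A sketch of what the empty "postAggregations" block above could hold (illustrative only, not part of the working spec):

 "postAggregations": [
   { "type": "arithmetic", "name": "avg_uniques_per_row", "fn": "/",
     "fields": [
       { "type": "hyperUniqueCardinality", "fieldName": "dim2_HyperUniqueCount" },
       { "type": "fieldAccess", "name": "count", "fieldName": "count" }
     ]
   }
 ]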


For those reading this post, I thought I would also include a link to a helpful article on aggregations: https://theza.ch/2015/04/05/introduction-to-indexing-aggregation-and-querying-in-druid/ .

Bingo

Dec 30, 2020, 4:02:10 AM
to Druid User
Thanks, really helpful.