count(distinct(dimension)) cannot translate to hyperUnique


Zha Rui

Aug 30, 2016, 12:11:15 PM
to sparklinedata
Hi experts:

I created a batch ingestion spec and defined a hyperUnique metric in it. When I query the dataSource through sparkline, I found that count(distinct(dimension)) is not translated to a hyperUnique aggregation. Is this a bug, a mistake on my side, or is this feature simply not supported yet?
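For context, a hyperUnique metric is declared in the metricsSpec of the ingestion spec like this (the names here are illustrative, not copied from my real spec):

    "metricsSpec" : [
      { "type" : "hyperUnique", "name" : "unique_visitors", "fieldName" : "visitor_id" }
    ]

With such a metric in place, I expected a query like "select count(distinct visitor_id) from tbl" to be rewritten into a Druid hyperUnique aggregation, but it is not.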

Zha Rui

harish

Aug 30, 2016, 12:28:26 PM
to sparklinedata
We haven't added support for HLL-based approximate count distinct; so far we have focused on exact count distinct.
I think it is relatively easy to add. Can you provide some details about your use case?
Also, can you share your indexing spec?

Harish.

Zha Rui

Aug 30, 2016, 11:28:37 PM
to sparklinedata
Hi Harish:

Thank you for the reply. My use case is estimating the number of unique visitors across 100+ million events. The cardinality of visitors is very large, so performance is poor with exact count distinct.

Besides, I read the TPCH Benchmark you wrote, which says the Count-Distinct aggregation in TPCH Q1 uses Druid's Cardinality Aggregator. I also noticed there is a "pushHLLTODruid" option among the Druid datasource options.

I also checked the source code. The unapply method of the DruidNativeAggregator class instantiates CountDistinctAggregate, and inside CountDistinctAggregate only a HyperLogLogPlusPlus AggregateFunction makes it return a non-None result; but for count(distinct ...) the AggregateFunction is obviously not HyperLogLogPlusPlus.
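In other words, the behaviour I am describing is roughly this (my paraphrase, not the actual sparklinedata source):

    import org.apache.spark.sql.catalyst.expressions.aggregate.{
      AggregateExpression, AggregateFunction, HyperLogLogPlusPlus}

    // Paraphrased sketch of the matching logic I observed.
    object CountDistinctAggregate {
      // Returns Some(...) only when the aggregate function is Spark's
      // HyperLogLogPlusPlus (approx_count_distinct). A plain Count with
      // isDistinct = true falls through to None, which is why
      // count(distinct(dimension)) never reaches the Druid translation.
      def unapply(aggExpr: AggregateExpression): Option[AggregateFunction] =
        aggExpr.aggregateFunction match {
          case hll: HyperLogLogPlusPlus => Some(hll)
          case _                        => None
        }
    }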

On Wednesday, August 31, 2016 at 12:28:26 AM UTC+8, harish wrote:

Zha Rui

Aug 31, 2016, 4:56:07 AM
to sparklinedata
If Spark cannot push aggregates down to an external datasource, I think this issue will not be easy to resolve.


On Wednesday, August 31, 2016 at 12:28:26 AM UTC+8, harish wrote:
We haven't added support for HLL-based approximate count distinct; so far we have focused on exact count distinct.

harish

Aug 31, 2016, 12:02:45 PM
to sparklinedata
We do push down aggregates to Druid; that is one of our many unique features.
At the translation level this may be an easy addition: support one more function, Spark's HyperLogLogPlusPlus, and map it to Druid's CardinalityAggregationSpec.
See DruidNativeAggregator for how we map simple Aggregate Expressions (Sum/Min/Max etc.) to Druid.
But we need to spend time to ensure we cover all the cases and give the user the right options to set.
I think the semantics of Spark's HyperLogLogPlusPlus and Druid's Cardinality aggregator are the same, but I have to spend more time on this.
One of the things we guarantee is that you get the same answers when we rewrite queries to use Druid. So if that does not hold here, we have to
ensure the user explicitly chooses this behavior. Maybe we should introduce a new Agg Function in Spark to surface Druid's approximate aggregation.
We also have to make sure this works with Grouping Sets/Cube/Rollup and in the case of Joins.
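To make the translation idea concrete, here is a rough sketch (the class and method names are hypothetical, not code from the repo):

    import org.apache.spark.sql.catalyst.expressions.AttributeReference
    import org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus

    // Hypothetical shape of Druid's cardinality aggregator:
    // {"type":"cardinality","name":...,"fieldNames":[...],"byRow":false}.
    case class CardinalityAggregationSpec(
        `type`: String,
        name: String,
        fieldNames: List[String],
        byRow: Boolean)

    // Map Spark's approx_count_distinct to the Druid aggregator.
    def toDruidCardinality(hll: HyperLogLogPlusPlus,
                           outputName: String): Option[CardinalityAggregationSpec] =
      hll.child match {
        // Only a direct column reference can be pushed down.
        case ar: AttributeReference =>
          Some(CardinalityAggregationSpec("cardinality", outputName,
            List(ar.name), byRow = false))
        case _ => None
      }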

I need some time to work through the above; I am currently busy getting release 0.3 ready, and will look at this next week.
But if you are interested in taking a stab at developing this feature, I am happy to work with you.

Harish.

Zha Rui

Sep 1, 2016, 3:25:21 AM
to sparklinedata
Thank you so much Harish! I just implemented hyperUnique support in sparklinedata. What I did (the core pieces are sketched below the list):

1. Added an option called "hyperUniqueMapping" to the Druid datasource options.
2. Added a HyperUniqueAggregate object implementation in AggregateTransform.scala.
3. Added a HyperUniqueAggregationSpec case class in DruidQuerySpec.scala.
4. Added HyperUniqueAggregate pattern matching in DruidNativeAggregate.
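In sketch form (condensed here; the real code I wrote has more cases and error handling):

    import org.apache.spark.sql.catalyst.expressions.aggregate.{
      AggregateExpression, HyperLogLogPlusPlus}

    // Step 3: new Druid aggregation spec in DruidQuerySpec.scala; it
    // serializes to {"type":"hyperUnique","name":...,"fieldName":...}.
    case class HyperUniqueAggregationSpec(
        `type`: String,
        name: String,
        fieldName: String)

    // Step 2: new extractor in AggregateTransform.scala. It fires for
    // Spark's approx_count_distinct (HyperLogLogPlusPlus) when the column
    // is listed in the "hyperUniqueMapping" option from step 1.
    object HyperUniqueAggregate {
      def unapply(aggExpr: AggregateExpression): Option[HyperLogLogPlusPlus] =
        aggExpr.aggregateFunction match {
          case hll: HyperLogLogPlusPlus => Some(hll)
          case _                        => None
        }
    }

    // Step 4: in DruidNativeAggregate, the new pattern produces the new
    // spec; druidMetricFor is a stand-in for the option lookup:
    //   case HyperUniqueAggregate(hll) =>
    //     HyperUniqueAggregationSpec("hyperUnique", outName, druidMetricFor(hll))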

Finally I can run queries like "select time, approx_count_distinct(visitor) from tbl group by time". Thank you again!

Zha Rui

On Thursday, September 1, 2016 at 12:02:45 AM UTC+8, harish wrote:

harish

Sep 1, 2016, 10:46:54 AM
to sparklinedata
Great! If you are OK with this, send a pull request.
I think we should support both HyperUniqueAggregationSpec and CardinalityAggregationSpec; the mapping should depend on the datatype of the metric (see the sketch below).
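Roughly what I have in mind, reusing the two spec classes sketched earlier in this thread and assuming they share a common AggregationSpec trait (DruidColumn here is a stand-in, not a real class in the repo):

    // Sketch: choose the Druid aggregator from the column's Druid datatype.
    case class DruidColumn(name: String, druidType: String)

    def approxCountDistinctSpec(col: DruidColumn,
                                outputName: String): AggregationSpec =
      col.druidType match {
        case "hyperUnique" =>
          // Ingested as a hyperUnique metric: Druid folds the stored sketches.
          HyperUniqueAggregationSpec("hyperUnique", outputName, col.name)
        case _ =>
          // Plain dimension: Druid builds an HLL sketch at query time.
          CardinalityAggregationSpec("cardinality", outputName,
            List(col.name), byRow = false)
      }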