We push down aggregates to Druid, which is one of our many unique features.
At the translation level this may be an easy addition: supporting one more function, Spark's HyperLogLogPlusPlus, and mapping it to Druid's CardinalityAggregationSpec.
See DruidNativeAggregator for how we map simple Aggregate Expressions (Sum/Min/Max, etc.) to Druid.
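For reference, the Druid side of such a mapping would be a cardinality aggregation spec along these lines (the `name` and dimension below are made up for illustration; newer Druid versions use `"fields"` instead of `"fieldNames"`):

```json
{
  "type": "cardinality",
  "name": "approx_distinct_users",
  "fieldNames": ["user_id"],
  "byRow": false
}
```

Note that `byRow: false` estimates the cardinality of the union of the dimensions' values, while `true` estimates the number of distinct dimension combinations; the translation would have to pick the one matching Spark's semantics.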
But we need to spend time ensuring we cover all the cases and give the user the right options to set.
I think the semantics of Spark's HyperLogLogPlusPlus and Druid's Cardinality Aggregator are the same, but I have to spend more time on this.
One of the things we guarantee is that you get the same answers when we rewrite queries to use Druid. So if that is not the case here, we have to
ensure the user explicitly opts into this behavior. Maybe we should introduce a new Agg Function in Spark to surface Druid's approximate aggregation.
We also have to make sure this works with Grouping Sets/Cube/Rollup and in the presence of Joins.
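To sketch the surface area to cover: Spark exposes HyperLogLogPlusPlus through `approx_count_distinct`, so a query like the following (table and column names are hypothetical) combines the approximate distinct count with Grouping Sets, and each grouping would need its own correctly translated Druid aggregation:

```sql
SELECT country, city, approx_count_distinct(user_id) AS u
FROM orders
GROUP BY country, city
GROUPING SETS ((country, city), (country), ())
```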
I need some time to work through the above; I'm currently busy getting release 0.3 ready. I will look at this next week.
But if you are interested in taking a stab at developing this feature, I'd be happy to work with you.
Harish.