Confusion around (non-hll) distinct

53 views

Skip to first unread message

fran...@automatic.com

unread,

May 4, 2016, 6:17:06 PM5/4/16

to citus-users

From the documentation:

"Citus supports count(distinct) aggregates in several ways. If the count(distinct) aggregate is on the distribution column, Citus can directly push down the query to the workers. If not, Citus needs to repartition the underlying data in the cluster to parallelize count(distinct) aggregates and avoid pulling all rows to the master."

But Citus does not seem to be performing this repartitioning:

select count(distinct car_id) from distrib_mcuser_tripsummary;
ERROR:  cannot compute aggregate (distinct)
DETAIL:  table partitioning is unsuitable for aggregate (distinct)

Am I missing something that enables automatic repartitioning here?

It seems to be happening automatically for distributed joins on non-distribution columns--I'm using the task-tracker executor.

I'm also curious about the reasons why certain SQL features are unsupported:

select distinct car_id from distrib_mcuser_tripsummary;
ERROR:  cannot perform distributed planning on this query
DETAIL:  Distinct clause is currently unsupported

when this works to yield distinct values:

select car_id from distrib_mcuser_tripsummary group by car_id;

select count(*) from (select car_id from distrib_mcuser_tripsummary group by car_id) subquery;
ERROR:  cannot perform distributed planning on this query 
DETAIL:  Subqueries without aggregates are not supported yet

but this workaround for count distinct runs, despite the aggregate not touching any data:

select count(*) from (select car_id, count(1) from distrib_mcuser_tripsummary group by car_id) subquery;

Thanks,

Franklin

Marco Slot

unread,

May 6, 2016, 7:29:53 AM5/6/16

to citus-users

On Thursday, May 5, 2016 at 12:17:06 AM UTC+2, fran...@automatic.com wrote:

"Citus supports count(distinct) aggregates in several ways. If the count(distinct) aggregate is on the distribution column, Citus can directly push down the query to the workers. If not, Citus needs to repartition the underlying data in the cluster to parallelize count(distinct) aggregates and avoid pulling all rows to the master."

But Citus does not seem to be performing this repartitioning:

select count(distinct car_id) from distrib_mcuser_tripsummary; ERROR: cannot compute aggregate (distinct) DETAIL: table partitioning is unsuitable for aggregate (distinct)

Am I missing something that enables automatic repartitioning here?

No, sorry, this seems to be an issue with the documentation. Citus does not yet automatically re-partitioned data for count distinct queries. The recommended way is to use HLL.

I'm also curious about the reasons why certain SQL features are unsupported:

select distinct car_id from distrib_mcuser_tripsummary; ERROR: cannot perform distributed planning on this query DETAIL: Distinct clause is currently unsupported
when this works to yield distinct values:
select car_id from distrib_mcuser_tripsummary group by car_id;

I think it just hasn't been implemented yet, but maybe you can add to the issue: https://github.com/citusdata/citus/issues/144

select count(*) from (select car_id from distrib_mcuser_tripsummary group by car_id) subquery; ERROR: cannot perform distributed planning on this query DETAIL: Subqueries without aggregates are not supported yet
but this workaround for count distinct runs, despite the aggregate not touching any data:

select count(*) from (select car_id, count(1) from distrib_mcuser_tripsummary group by car_id) subquery;

The subquery logic is currently a bit too restrictive. We are working on making it more flexible. In the upcoming 5.1 release (or the master branch in the GitHub repo) you can also just do select count(distinct car_id) from distrib_mcuser_tripsummary, assuming car_id is the distribution column.