From the documentation:
"Citus supports count(distinct) aggregates in several ways. If the count(distinct) aggregate is on the distribution column, Citus can directly push down the query to the workers. If not, Citus needs to repartition the underlying data in the cluster to parallelize count(distinct) aggregates and avoid pulling all rows to the master."
But Citus does not seem to be performing this repartitioning:
select count(distinct car_id) from distrib_mcuser_tripsummary;
ERROR: cannot compute aggregate (distinct)
DETAIL: table partitioning is unsuitable for aggregate (distinct)
Am I missing something that enables automatic repartitioning here?
It seems to be happening automatically for distributed joins on non-distribution columns--I'm using the task-tracker executor.
I'm also curious about the reasons why certain SQL features are unsupported:
select distinct car_id from distrib_mcuser_tripsummary;
ERROR: cannot perform distributed planning on this query
DETAIL: Distinct clause is currently unsupported
when this works to yield distinct values:
select car_id from distrib_mcuser_tripsummary group by car_id;
select count(*) from (select car_id from distrib_mcuser_tripsummary group by car_id) subquery;
ERROR: cannot perform distributed planning on this query
DETAIL: Subqueries without aggregates are not supported yet
but this workaround for count distinct runs, despite the aggregate not touching any data:
select count(*) from (select car_id, count(1) from distrib_mcuser_tripsummary group by car_id) subquery;
Thanks,
Franklin