Mike,
Thanks for the patch. It looks good to go. I can check it directly into the GitHub master branch tomorrow, unless you want to post a formal pull request tonight.
In my review of the patch, I discovered a few opportunities for follow-up work.
1. The pseudo-handling of multi-column distinct via multiple distinct nodes (currently guarded by hard-coded restrictions) needs to be eliminated -- chaining single-column distinct nodes couldn't really produce a correct multi-column distinct.
2. The existing distinct nodes should be generalized to take a list of expressions rather than a single column reference.
This expansion of functionality will eventually need to make its way down to the distinct executor and into some unit tests.
Multi-expression GROUP BY could provide a model for this -- or actually a replacement (more on this below).
3. There are opportunities for further optimization specific to "distinct" processing of special cases. If the distinct expression (or any element of a distinct expression list) is a reference to a partitioning column, all the work can be done by the distributed distinct node and the final distinct node can be left off. Actually, this also applies to GROUP BY, but I don't recall seeing this optimization in the "Aggregation" push-down either.
4. The placement of a projection node above the distinct node -- if one is ever needed -- seems like a missed optimization: the only column values that can usefully flow into a distinct node are the exact same ones that flow out of it. It seems like we could be needlessly sending/receiving overly wide rows if a required projection is not pushed below the distributed distinct node.
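To illustrate the reasoning behind item 3, here is a rough sketch in Python rather than anything resembling the actual planner or executor code. The names (partition_rows, distinct, distributed_distinct), the modulo partitioning of an integer column, and the sample rows are all made up for the example. It shows that when the distinct key includes the partitioning column, the per-partition results are already disjoint, so no final distinct pass is needed; when the key omits the partitioning column, duplicates can survive across partitions.

```python
from collections import defaultdict

def partition_rows(rows, part_idx, n_partitions):
    """Route rows to partitions by an integer partitioning column (assumed scheme)."""
    parts = defaultdict(list)
    for row in rows:
        parts[row[part_idx] % n_partitions].append(row)
    return list(parts.values())

def distinct(rows, key_idxs):
    """Emit each distinct key tuple once, in first-seen order."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(row[i] for i in key_idxs)
        if key not in seen:
            seen.add(key)
            out.append(key)
    return out

def distributed_distinct(rows, key_idxs, part_idx, n_partitions):
    """Run distinct on each partition and concatenate, with NO final distinct."""
    combined = []
    for part in partition_rows(rows, part_idx, n_partitions):
        combined.extend(distinct(part, key_idxs))
    return combined

rows = [(1, "a"), (1, "a"), (2, "a"), (3, "b"), (1, "b"), (2, "a")]

# Key includes the partitioning column (index 0): the concatenation is already distinct.
with_part_col = distributed_distinct(rows, key_idxs=(0, 1), part_idx=0, n_partitions=2)

# Key omits the partitioning column: the same value can come back from several partitions.
without_part_col = distributed_distinct(rows, key_idxs=(1,), part_idx=0, n_partitions=2)
```

The same argument carries over to GROUP BY: if a grouping key contains the partitioning column, no group can span two partitions.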
Actually, items 2 and 3 make me want to re-write distinct as a special case of GROUP BY and possibly eliminate the redundancy of a separate distinct executor. It's hard for me to imagine that there's a lot to be gained at execution time from hard-coding the special-case constraints of distinct vs. GROUP BY (no extra columns -- just the same set coming in, getting grouped, and going out).
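As a rough sketch of what "distinct as a special case of GROUP BY" could look like, here is a toy version in Python. None of this is existing executor code; group_by, distinct, and the key/aggregate function arguments are hypothetical names for the illustration. The point is only that distinct over a list of expressions is a GROUP BY on those expressions with no aggregates and no extra output columns.

```python
def group_by(rows, key_fns, agg_fns=()):
    """Group rows by a list of key expressions; emit key columns plus aggregates."""
    groups = {}
    for row in rows:
        key = tuple(fn(row) for fn in key_fns)
        groups.setdefault(key, []).append(row)
    # One output row per group, in first-seen order (dicts keep insertion order).
    return [key + tuple(agg(members) for agg in agg_fns)
            for key, members in groups.items()]

def distinct(rows, key_fns):
    """Distinct is GROUP BY with no aggregates: only the key columns go out."""
    return group_by(rows, key_fns)

rows = [(1, "x"), (1, "x"), (2, "x"), (1, "y")]
key_fns = (lambda r: r[0], lambda r: r[1].upper())  # arbitrary expressions, not just column refs

distinct_rows = distinct(rows, key_fns)
counted_rows = group_by(rows, key_fns, agg_fns=(len,))  # same grouping plus COUNT(*)
```

Whether a dedicated distinct executor would still pay for itself at execution time is exactly the open question; the sketch only shows that the two operations share the same grouping step.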
I'll log issues for all of these tomorrow.
I'll send another message shortly with some ideas about pull-based execution.
--paul