Hi!
At
https://github.com/greenplum-db/gpdb/pull/10548 I tried to improve
the grouping sets planning so that it would generate two plans that only
differ in the "locus", and decide which is cheaper based on cost.
However, the planner's cost model doesn't take into account that one of
the plans allows the Finalize Aggregation step to run in parallel across
all segments, while in the other plan, it's performed in a single
process, on the QD node.
This can be demonstrated with a slightly simpler query than in the PR.
For example:
Plan A:
postgres=# explain select a, b, sum(a+b) from test_info group by a,b;
QUERY PLAN
--------------------------------------------------------------------------------------------
Finalize HashAggregate (cost=21101.45..21101.49 rows=4 width=16)
Group Key: a, b
-> Gather Motion 3:1 (slice1; segments: 3)
(cost=21101.00..21101.36 rows=12 width=16)
-> Partial HashAggregate (cost=21101.00..21101.12 rows=4
width=16)
Group Key: a, b
-> Seq Scan on test_info (cost=0.00..11101.00
rows=333334 width=8)
Optimizer: Postgres query optimizer
(7 rows)
Plan B:
QUERY PLAN
-------------------------------------------------------------------------------------------------------
Gather Motion 3:1 (slice1; segments: 3) (cost=21101.45..21101.57
rows=4 width=16)
-> Finalize HashAggregate (cost=21101.45..21101.49 rows=2 width=16)
Group Key: a, b
-> Redistribute Motion 3:3 (slice2; segments: 3)
(cost=21101.00..21101.36 rows=4 width=16)
Hash Key: a, b
-> Partial HashAggregate (cost=21101.00..21101.12
rows=4 width=16)
Group Key: a, b
-> Seq Scan on test_info (cost=0.00..11101.00
rows=333334 width=8)
Optimizer: Postgres query optimizer
(9 rows)
The plans are identical except for the Motions. Plan A has a Gather
Motion between the Partial and Finalize stages, and performs the
Finalize Aggregate step in the QD node, while plan B has a Redistribute
Motion, performs the Finalize Aggregate step in parallel, and then
Gathers the final result.
Plan A is better when you have a small number of groups in the final
result, while plan B is better with a large number of groups because the
final aggregation can be performed in parallel. Currently, the planner
tries to maximize parallelism, and will only produce a path for plan B,
and plan A isn't even considered. It would be nice to consider both options.
The problem is that our cost model doesn't take that difference in
parallelism into account. Plan B will always look more expensive. The
cost is exactly the same up to the Finalize Aggregate step, but plan B
contains an extra Gather at the top, making it more expensive.
How should we deal with that?
In PostgreSQL, parallelism is reflected in the cost model so that the
cost of a plan node reflects the time that will be spent on that node,
not the total effort across all workers. In other words, the cost of a
node is divided by the numbers of parallel workers executing it. For
example:
postgres=# explain select count(*) from test_info;
QUERY PLAN
-------------------------------------------------------------------------
Aggregate (cost=16925.00..16925.01 rows=1 width=8)
-> Seq Scan on test_info (cost=0.00..14425.00 rows=1000000 width=0)
(2 rows)
vs.
QUERY PLAN
--------------------------------------------------------------------------------------------
Finalize Aggregate (cost=10633.55..10633.56 rows=1 width=8)
-> Gather (cost=10633.33..10633.54 rows=2 width=8)
Workers Planned: 2
-> Partial Aggregate (cost=9633.33..9633.34 rows=1 width=8)
-> Parallel Seq Scan on test_info (cost=0.00..8591.67
rows=416667 width=0)
(5 rows)
There are some fudge factors that complicate it a somewhat, but in a
very rough sense, the Parallel Seq Scan with two workers is considered
to have 1/2 the cost of a serial Seq Scan.
I'd like us to adopt this same model for costing Greenplum MPP plans.
It's a big change, though, and could affect a lot of plans. And I'm not
sure what it would look like at the code level.
Thoughts? How does GPORCA model parallelism in the cost estimates?
- Heikki