Cost modeling parallelism


Heikki Linnakangas

Jul 30, 2020, 3:47:45 AM
to Greenplum Developers
Hi!

At https://github.com/greenplum-db/gpdb/pull/10548 I tried to improve
the grouping sets planning so that it would generate two plans that only
differ in the "locus", and decide which is cheaper based on cost.
However, the planner's cost model doesn't take into account that one of
the plans allows the Finalize Aggregation step to run in parallel across
all segments, while in the other plan, it's performed in a single
process, on the QD node.

This can be demonstrated with a slightly simpler query than in the PR.
For example:

Plan A:

postgres=# explain select a, b, sum(a+b) from test_info group by a,b;
                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Finalize HashAggregate  (cost=21101.45..21101.49 rows=4 width=16)
   Group Key: a, b
   ->  Gather Motion 3:1  (slice1; segments: 3)  (cost=21101.00..21101.36 rows=12 width=16)
         ->  Partial HashAggregate  (cost=21101.00..21101.12 rows=4 width=16)
               Group Key: a, b
               ->  Seq Scan on test_info  (cost=0.00..11101.00 rows=333334 width=8)
 Optimizer: Postgres query optimizer
(7 rows)

Plan B:

                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)  (cost=21101.45..21101.57 rows=4 width=16)
   ->  Finalize HashAggregate  (cost=21101.45..21101.49 rows=2 width=16)
         Group Key: a, b
         ->  Redistribute Motion 3:3  (slice2; segments: 3)  (cost=21101.00..21101.36 rows=4 width=16)
               Hash Key: a, b
               ->  Partial HashAggregate  (cost=21101.00..21101.12 rows=4 width=16)
                     Group Key: a, b
                     ->  Seq Scan on test_info  (cost=0.00..11101.00 rows=333334 width=8)
 Optimizer: Postgres query optimizer
(9 rows)


The plans are identical except for the Motions. Plan A has a Gather
Motion between the Partial and Finalize stages, and performs the
Finalize Aggregate step in the QD node, while plan B has a Redistribute
Motion, performs the Finalize Aggregate step in parallel, and then
Gathers the final result.

Plan A is better when you have a small number of groups in the final
result, while plan B is better with a large number of groups because the
final aggregation can be performed in parallel. Currently, the planner
tries to maximize parallelism, and will only produce a path for plan B,
and plan A isn't even considered. It would be nice to consider both options.

The problem is that our cost model doesn't take that difference in
parallelism into account. Plan B will always look more expensive. The
cost is exactly the same up to the Finalize Aggregate step, but plan B
contains an extra Gather at the top, making it more expensive.
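Toy arithmetic makes the asymmetry concrete. The numbers below are invented
purely for illustration, not actual planner costs:

```python
# Toy cost numbers, invented for illustration -- not real planner output.
FINALIZE_AGG_COST = 300.0   # total work of the Finalize Aggregate step
MOTION_COST = 50.0          # cost of one Motion (Gather or Redistribute)

# Plan A: Gather Motion, then Finalize Aggregate on the QD.
plan_a = MOTION_COST + FINALIZE_AGG_COST

# Plan B: Redistribute Motion, Finalize Aggregate on the segments,
# then a Gather Motion on top.
plan_b = MOTION_COST + FINALIZE_AGG_COST + MOTION_COST

# Under a "total effort" cost model, plan B can never win:
assert plan_b > plan_a

# If the parallel Finalize Aggregate were instead costed as the time
# spent per process (total work divided across the 3 segments that
# execute it), plan B could come out cheaper:
plan_b_parallel = MOTION_COST + FINALIZE_AGG_COST / 3 + MOTION_COST
assert plan_b_parallel < plan_a
```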

How should we deal with that?

In PostgreSQL, parallelism is reflected in the cost model so that the
cost of a plan node reflects the time that will be spent on that node,
not the total effort across all workers. In other words, the cost of a
node is divided by the number of parallel workers executing it. For
example:


postgres=# explain select count(*) from test_info;
QUERY PLAN
-------------------------------------------------------------------------
Aggregate (cost=16925.00..16925.01 rows=1 width=8)
-> Seq Scan on test_info (cost=0.00..14425.00 rows=1000000 width=0)
(2 rows)


vs.

                                 QUERY PLAN
--------------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=10633.55..10633.56 rows=1 width=8)
   ->  Gather  (cost=10633.33..10633.54 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=9633.33..9633.34 rows=1 width=8)
               ->  Parallel Seq Scan on test_info  (cost=0.00..8591.67 rows=416667 width=0)
(5 rows)

There are some fudge factors that complicate it somewhat, but in a very
rough sense, a Parallel Seq Scan with two workers is considered to have
half the cost of a serial Seq Scan.
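Very roughly, the divisor works like this (a Python sketch modeled on
get_parallel_divisor() in PostgreSQL's costsize.c; the per-worker leader
discount is one of those fudge factors):

```python
# Sketch of PostgreSQL's parallel cost divisor, modeled on
# get_parallel_divisor() in costsize.c: each worker counts as 1, and the
# leader also contributes, but its contribution shrinks as it spends
# more time servicing the workers.

def parallel_divisor(workers: int, leader_participates: bool = True) -> float:
    divisor = float(workers)
    if leader_participates:
        # The leader does less useful work the more workers it manages.
        leader_contribution = 1.0 - 0.3 * workers
        if leader_contribution > 0:
            divisor += leader_contribution
    return divisor

# With 2 workers the divisor is 2.4, which matches the EXPLAIN above:
# 1,000,000 rows / 2.4 = ~416,667 rows per Parallel Seq Scan instance.
assert abs(parallel_divisor(2) - 2.4) < 1e-9
assert round(1_000_000 / parallel_divisor(2)) == 416667
```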

I'd like us to adopt this same model for costing Greenplum MPP plans.
It's a big change, though, and could affect a lot of plans. And I'm not
sure what it would look like at the code level.

Thoughts? How does GPORCA model parallelism in the cost estimates?

- Heikki

Jinbao Chen

Jul 30, 2020, 5:26:30 AM
to Greenplum Developers, Heikki Linnakangas

I am wondering whether we can get enough benefit to justify the additional complexity.

If the number of groups is small, the cost of the Finalize Aggregate is very small in both plan A and plan B. If the number of groups is large, plan B is obviously the better choice.

Plan A has an advantage when a GroupAgg is used in one-stage aggregation, because the Finalize Aggregate no longer needs to sort; but the single-point bottleneck problem is then likely to be more serious.

So I think considering plan A doesn't bring much benefit, but it does add complexity.

Jinbao Chen

Aug 4, 2020, 4:27:27 AM
to Greenplum Developers, Jinbao Chen, Heikki Linnakangas
Hi,
I thought about it more carefully, and now I think it is necessary to consider the difference in parallelism. Please ignore my previous reply.

Heikki Linnakangas

Aug 4, 2020, 4:43:25 AM
to Jinbao Chen (Pivotal), Greenplum Developers
Ok. There are actually two slightly different issues here:

1. The cost model doesn't currently take parallelism into account.
Performing a computation in parallel on all QEs, and performing the same
computation in a single node, both have the same total cost.

2. We don't generate Paths to perform the same computation in multiple
loci. In the example discussed here, it's a GROUP BY, but the same
applies to joins too. If you have a join between A and B, and the
planner has a choice to redistribute A or B, it will only generate a
Path for one of the choices. The other choice might seem more expensive
"locally" at the join node, but because it produces a different
distribution of the data, it might avoid a Motion elsewhere in the plan,
making the overall plan cheaper.

Even if we decide that it's not worth it to address issue 2 right now, I
think it would still make sense to address issue 1, and change the cost
model so that the cost and row count of each individual node represents
the cost in a single segment, not the sum of all work across all nodes.
That would be consistent with the way PostgreSQL represents costs of
parallel plans.
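As a sketch of what that convention could look like, a hypothetical helper
(not actual GPDB code) that divides a node's estimated total work by the
number of processes in its locus:

```python
# Hypothetical helper illustrating the proposed convention: store each
# node's cost and row count as per-process values, dividing the total
# estimated work by the number of processes executing the node
# (1 for the QD / SingleQE, the segment count for a hashed locus).

def per_process(total_cost: float, total_rows: float, num_processes: int):
    return total_cost / num_processes, total_rows / num_processes

# A Finalize Aggregate representing 300 units of total work over 4 rows:
qd_cost, _ = per_process(300.0, 4, 1)    # plan A: single QD process
seg_cost, _ = per_process(300.0, 4, 3)   # plan B: 3 segments in parallel

# With per-process costing, the parallel Finalize now looks cheaper,
# so the extra Motion in plan B has something to pay for itself against.
assert seg_cost < qd_cost
```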

- Heikki

Jinbao Chen

Aug 4, 2020, 5:54:35 AM
to Greenplum Developers, Heikki Linnakangas, Jinbao Chen
We can first try to solve issue 1 in the aggregation path, which is the part that suffers the most from it. Issue 1 doesn't only affect the cost of a single node: when considering spilling and streaming, if we don't take the number of executing QEs into account, we have no way to estimate the number of groups each node sees or the amount of memory available to it, so the number of spilled tuples and output tuples cannot be estimated either.

I checked the performance pipeline and found that wrong Aggregate plans were the main reason the master branch is slower than 6X.
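A sketch of that point, with invented sizes (the constants below are
hypothetical, not planner code): whether a QE spills depends on the groups
that one QE sees and on its own work_mem, not on cluster-wide totals.

```python
# Invented sizes for illustration -- not actual planner code.
WORK_MEM_KB = 1024          # memory budget of a single QE
ENTRY_KB = 0.1              # assumed size of one hash-table group entry
TOTAL_GROUPS = 100_000
SEGMENTS = 3

# After a Redistribute Motion on the grouping key, each QE sees roughly
# an even share of the groups.
groups_per_qe = TOTAL_GROUPS / SEGMENTS
hash_table_kb = groups_per_qe * ENTRY_KB

# The spill decision is per QE: its share of the groups vs. its work_mem.
spills = hash_table_kb > WORK_MEM_KB

# Costing with the cluster-wide group count would overestimate the
# per-QE hash-table size threefold.
assert hash_table_kb < TOTAL_GROUPS * ENTRY_KB
```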

Zhenghua Lyu

Aug 8, 2020, 7:21:47 AM
to Heikki Linnakangas, Greenplum Developers
> I'd like us to adopt this same model for costing Greenplum MPP plans.
> It's a big change, though, and could affect a lot of plans. And I'm not
> sure what it would look like at the code level.

I vote for this approach. I think it will help planner performance a lot.
Now is a good time to do it.

From: Heikki Linnakangas <linnak...@vmware.com>
Sent: Thursday, July 30, 2020 3:47 PM
To: Greenplum Developers <gpdb...@greenplum.org>
Subject: Cost modeling parallelism
 

Heikki Linnakangas

Sep 1, 2020, 3:08:34 AM
to Zhenghua Lyu, Greenplum Developers
On 08/08/2020 14:21, Zhenghua Lyu wrote:
> I'd like us to adopt this same model for costing Greenplum MPP plans.
> It's a big change, though, and could affect a lot of plans. And I'm not
> sure what it would look like at the code level.
> I vote for this approach. I think it will help planner performance a lot.
> Now is a good time to do it.

To close the loop here: I pushed PR
https://github.com/greenplum-db/gpdb/pull/10676 to do this now, commit
c5f6dbbe84c43741db1a1c6e31bac1e3bb37bfaa.

I'm going to continue on https://github.com/greenplum-db/gpdb/pull/10548
now, to address the original issue that started this discussion.

- Heikki