Correct plan of general & segmentGeneral path containing volatile functions


Zhenghua Lyu

Jul 4, 2020, 1:32:51 PM
to Greenplum Developers
Hi, there is a previous thread https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/JLZYOrgy0tQ on the same subject. Since then we have found more issues and created many more test cases covering a wider scope, so I am opening a new thread. The following is for the GitHub PR: https://github.com/greenplum-db/gpdb/pull/10418
The PR solves a key problem: if a segmentGeneral (replicated table) or general locus path contains volatile functions, it can no longer be treated as general or segmentGeneral.

Please help review it. Thanks in advance!

===================================================

Background of planner’s locus


Greenplum Planner uses locus to model the data distribution of paths. Some typical examples are:

  • A function scan of generate_series has General locus (because the complete information is available at every single place)

  • The scan path of a replicated table has SegmentGeneral locus (because the data is stored only on segments, not on the master, and the information is complete at every single segment)

  • …


General and segmentGeneral locus imply that the corresponding slice, no matter which or how many segments execute it, produces the same result data set. Thus, in some cases, General and segmentGeneral can be treated like broadcast.


But what if a segmentGeneral or general locus path contains volatile functions? Volatile functions, by definition, do not guarantee the same result across invocations. In such cases the path loses this property and cannot be treated as *general. Previously, the Greenplum planner did not handle these cases correctly.
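The point above can be illustrated with a toy model. This is a hypothetical Python sketch, not Greenplum code; the names (scan_with_filter, the "segments" loop) are made up for illustration only.

```python
import random

# Toy model (not Greenplum code): a General/segmentGeneral path promises
# that every segment executing it produces the identical result set.
def scan_with_filter(rows, qual):
    """Simulate a Seq Scan with a filter qual applied per row."""
    return [r for r in rows if qual(r)]

rows = list(range(10))

# A stable (immutable) qual: all three simulated "segments" agree,
# so the path may safely keep its *general locus.
stable_results = [scan_with_filter(rows, lambda r: r > 4) for _ in range(3)]
assert stable_results[0] == stable_results[1] == stable_results[2]

# A volatile qual, like "a > random()": each evaluation may differ, so the
# three "segments" are no longer guaranteed to agree, and the path loses
# the *general property.
volatile_results = [
    scan_with_filter(rows, lambda r: r > random.random() * 10)
    for _ in range(3)
]
```

This is exactly the hazard in the replicated-table case: each segment holds a full copy of the data, but a volatile filter evaluated independently per segment can keep different rows on each copy.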


The wrong plans

We first noticed this issue when Heikki raised an issue on Github: Replicated table has inconsistent content after UPDATE #10226. After some investigation, we found other bad cases. Some bad plans are listed below; the complete list can be found here: https://gist.github.com/kainwen/ad43f4ae88dc56f3c6eb73056b195e8b


create table t_hashdist(a int, b int, c int) distributed by (a);
CREATE
create table t_rep(a int, b int, c int) distributed replicated;
CREATE

---- pushed down filter
explain (costs off, verbose) select * from t_rep, t_hashdist where t_rep.a > random();
QUERY PLAN                                                                         
-------------------------------------------------------------------------------------
Gather Motion 3:1  (slice1; segments: 3)                                           
  Output: t_rep.a, t_rep.b, t_rep.c, t_hashdist.a, t_hashdist.b, t_hashdist.c      
  ->  Nested Loop                                                                  
        Output: t_rep.a, t_rep.b, t_rep.c, t_hashdist.a, t_hashdist.b, t_hashdist.c
        ->  Seq Scan on public.t_hashdist                                          
              Output: t_hashdist.a, t_hashdist.b, t_hashdist.c                     
        ->  Materialize                                                            
              Output: t_rep.a, t_rep.b, t_rep.c                                    
              ->  Seq Scan on public.t_rep                                         
                    Output: t_rep.a, t_rep.b, t_rep.c                              
                    Filter: ((t_rep.a)::double precision > random())               
Optimizer: Postgres query optimizer                                                
(12 rows)

-- join qual
explain (costs off, verbose) select * from t_hashdist, t_rep x, t_rep y where x.a + y.a > random();

QUERY PLAN                                                                            
----------------------------------------------------------------------------------------
Gather Motion 3:1  (slice1; segments: 3)                                              
  Output: t_hashdist.a, t_hashdist.b, t_hashdist.c, x.a, x.b, x.c, y.a, y.b, y.c      
  ->  Nested Loop                                                                     
        Output: t_hashdist.a, t_hashdist.b, t_hashdist.c, x.a, x.b, x.c, y.a, y.b, y.c
        ->  Nested Loop                                                               
              Output: x.a, x.b, x.c, y.a, y.b, y.c                                    
              Join Filter: (((x.a + y.a))::double precision > random())               
              ->  Seq Scan on public.t_rep x                                          
                    Output: x.a, x.b, x.c                                             
              ->  Materialize                                                         
                    Output: y.a, y.b, y.c                                             
                    ->  Seq Scan on public.t_rep y                                    
                          Output: y.a, y.b, y.c                                       
        ->  Materialize                                                               
              Output: t_hashdist.a, t_hashdist.b, t_hashdist.c                        
              ->  Seq Scan on public.t_hashdist                                       
                    Output: t_hashdist.a, t_hashdist.b, t_hashdist.c                  
Optimizer: Postgres query optimizer                                                   
(18 rows)

The root cause is: if a general or segmentGeneral locus path contains volatile functions, it cannot be treated as general or segmentGeneral.


Proposal on fix method

The idea is: when we find the pattern (a general or segmentGeneral locus path containing volatile functions), we create a motion path above it to turn its locus into SingleQE, and then create a projection path on top. The core job then becomes choosing the places to check:


  1. For a single base rel, we only need to check its restrictions; this is at the bottom of the planner, in the function set_rel_pathlist

  2. When creating a join path, if the join locus is general or segmentGeneral, check its join qual for volatile functions

  3. When handling a subquery, the planner invokes the function set_subquery_pathlist; at the end of this function, check the targetlist and havingQual

  4. When creating a limit path, apply the same check-and-fix logic

  5. Correctly handle make_subplan


The ORDER BY and GROUP BY clauses are reflected in the targetlist, so they are handled by Step 3 above.
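The check-and-fix step shared by the places above can be sketched as follows. This is a hypothetical Python model of the idea only; names such as Path, contains_volatile, and bring_to_singleqe are invented for illustration and are not the actual Greenplum C functions.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative subset of volatile functions; the real planner consults
# pg_proc's provolatile, not a hard-coded list.
VOLATILE_FUNCS = {"random", "timeofday"}

@dataclass
class Path:
    locus: str                              # "General", "SegmentGeneral", "SingleQE", ...
    exprs: list = field(default_factory=list)   # quals / targetlist entries (function names)
    subpath: Optional["Path"] = None

def contains_volatile(exprs):
    return any(fn in VOLATILE_FUNCS for fn in exprs)

def bring_to_singleqe(path):
    """If a General/SegmentGeneral path has volatile exprs, gather it to a
    single QE and put a projection path on top, as the proposal describes."""
    if path.locus in ("General", "SegmentGeneral") and contains_volatile(path.exprs):
        motion = Path(locus="SingleQE", subpath=path)    # motion: gather to one QE
        return Path(locus="SingleQE", subpath=motion)    # projection path above it
    return path

# A volatile qual on a replicated-table scan forces the locus to SingleQE:
fixed = bring_to_singleqe(Path(locus="SegmentGeneral", exprs=["random"]))
assert fixed.locus == "SingleQE"
```

Once the locus is SingleQE, the qual is evaluated exactly once, and the result can be broadcast again if a join partner needs it, which is what the corrected plans below show.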


A PR is opened: Correct plan of general & segmentGeneral path with volatile functions. #10418

DMLs on Replicated table

UPDATE and DELETE statements on a replicated table are special: they have to be dispatched to every segment for execution. So if they contain volatile functions in their targetList or WHERE clause, we have to reject such statements:


  1. For the targetList, we check it in the function create_motion_path_for_upddel

  2. The WHERE clause is handled in the query planner: when we find the pattern and are about to fix it, we additionally check whether we are updating or deleting a replicated table, and if so, we reject the statement.


For INSERT statements, correct planner behavior makes them work automatically. Refer to the cases in the gist link above and the one below.
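The rejection rule for replicated-table DML can be sketched as a single predicate. Again a hypothetical Python model: the function name and the error text are illustrative, not the planner's actual API or message.

```python
# Illustrative subset; the real planner checks provolatile in pg_proc.
VOLATILE_FUNCS = {"random", "timeofday"}

def check_replicated_dml(cmd, is_replicated, target_exprs, where_exprs):
    """Reject UPDATE/DELETE on a replicated table when the targetList or
    WHERE clause contains a volatile function; otherwise allow the plan."""
    volatile = any(fn in VOLATILE_FUNCS
                   for fn in target_exprs + where_exprs)
    if cmd in ("UPDATE", "DELETE") and is_replicated and volatile:
        raise ValueError("volatile function in UPDATE/DELETE "
                         "on a replicated table is not supported")
    return True

# A volatile SET expression on a hash-distributed table is fine:
assert check_replicated_dml("UPDATE", False, ["random"], [])
# A stable UPDATE on a replicated table is fine:
assert check_replicated_dml("UPDATE", True, ["upper"], [])
```

The rule captures why Heikki's #10226 produced inconsistent replica contents: each segment evaluated the volatile expression independently, so the full-table copies diverged.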

The correct plans after fixing

The corrected plans for the wrong cases above are listed below. For the whole list, please refer to https://gist.github.com/kainwen/cd96c8fc35b68c9d87d73af7f269ba99


create table t_hashdist(a int, b int, c int) distributed by (a);
CREATE
create table t_rep(a int, b int, c int) distributed replicated;
CREATE

---- pushed down filter
explain (costs off, verbose) select * from t_rep, t_hashdist where t_rep.a > random();
QUERY PLAN
-------------------------------------------------------------------------------------
Gather Motion 3:1  (slice1; segments: 3)
  Output: t_rep.a, t_rep.b, t_rep.c, t_hashdist.a, t_hashdist.b, t_hashdist.c
  ->  Nested Loop
        Output: t_rep.a, t_rep.b, t_rep.c, t_hashdist.a, t_hashdist.b, t_hashdist.c
        ->  Seq Scan on public.t_hashdist
              Output: t_hashdist.a, t_hashdist.b, t_hashdist.c
        ->  Materialize
              Output: t_rep.a, t_rep.b, t_rep.c
              ->  Broadcast Motion 1:3  (slice2; segments: 1)
                    Output: t_rep.a, t_rep.b, t_rep.c
                    ->  Result
                          Output: t_rep.a, t_rep.b, t_rep.c
                          ->  Seq Scan on public.t_rep
                                Output: t_rep.a, t_rep.b, t_rep.c
                                Filter: ((t_rep.a)::double precision > random())
Optimizer: Postgres query optimizer
(16 rows)

-- join qual
explain (costs off, verbose) select * from t_hashdist, t_rep x, t_rep y where x.a + y.a > random();
QUERY PLAN
----------------------------------------------------------------------------------
Nested Loop
  Output: t_hashdist.a, t_hashdist.b, t_hashdist.c, x.a, x.b, x.c, y.a, y.b, y.c
  ->  Result
        Output: x.a, x.b, x.c, y.a, y.b, y.c
        ->  Gather Motion 1:1  (slice1; segments: 1)
              Output: x.a, x.b, x.c, y.a, y.b, y.c
              ->  Nested Loop
                    Output: x.a, x.b, x.c, y.a, y.b, y.c
                    Join Filter: (((x.a + y.a))::double precision > random())
                    ->  Seq Scan on public.t_rep x
                          Output: x.a, x.b, x.c
                    ->  Materialize
                          Output: y.a, y.b, y.c
                          ->  Seq Scan on public.t_rep y
                                Output: y.a, y.b, y.c
  ->  Materialize
        Output: t_hashdist.a, t_hashdist.b, t_hashdist.c
        ->  Gather Motion 3:1  (slice2; segments: 3)
              Output: t_hashdist.a, t_hashdist.b, t_hashdist.c
              ->  Seq Scan on public.t_hashdist
                    Output: t_hashdist.a, t_hashdist.b, t_hashdist.c
Optimizer: Postgres query optimizer
(22 rows)



Zhenghua Lyu

Jul 5, 2020, 1:13:24 AM
to Greenplum Developers
A diff of the plans before and after this patch can be found at: https://www.diffnow.com/report/jeupd
It is helpful for understanding the change.

From: Zhenghua Lyu <zl...@vmware.com>
Sent: Sunday, July 5, 2020 1:32 AM
To: Greenplum Developers <gpdb...@greenplum.org>
Subject: Correct plan of general & segmentGeneral path containing volatile functions
 