Correct plan of general & segmentGeneral path containing volatile functions


Zhenghua Lyu

Jul 4, 2020, 1:32:51 PM
to Greenplum Developers
Hi, there is a previous thread https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/JLZYOrgy0tQ on the same subject. Since then we have found more issues and created many more test cases covering a wider scope, so I am opening a new thread. The following is for the GitHub PR: https://github.com/greenplum-db/gpdb/pull/10418
The PR solves a key problem: if a segmentGeneral (replicated table) or general locus path contains volatile functions, it can no longer be treated as general or segmentGeneral.

Please help review it. Thanks in advance!

===================================================

Background of planner’s locus


Greenplum Planner uses locus to model the data distribution of paths. Some typical examples are:

  • A function scan of generate_series has General locus (because the complete information is available at every single place)

  • The scan path of a replicated table has SegmentGeneral locus (because the data is stored only on segments, not on the master, and the information is complete at every single segment)

  • …


General and segmentGeneral locus imply that the corresponding slice, no matter which or how many segments execute it, produces the same result data set. Thus, in some cases, General and segmentGeneral can be treated like broadcast.


But what if a segmentGeneral or general locus path contains volatile functions? Volatile functions, by definition, do not guarantee the same result across invocations. In such cases the path loses this property and cannot be treated as *general. Previously, the Greenplum planner did not handle these cases correctly.
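The point above can be illustrated with a toy model. This is a hypothetical Python sketch, not Greenplum code; the names (scan_with_filter, the "segments" loop) are made up for illustration only.

```python
import random

# Toy model (not Greenplum code): a General/segmentGeneral path promises
# that every segment executing it produces the identical result set.
def scan_with_filter(rows, qual):
    """Simulate a Seq Scan with a filter qual applied per row."""
    return [r for r in rows if qual(r)]

rows = list(range(10))

# A stable (immutable) qual: all three simulated "segments" agree,
# so the path may safely keep its *general locus.
stable_results = [scan_with_filter(rows, lambda r: r > 4) for _ in range(3)]
assert stable_results[0] == stable_results[1] == stable_results[2]

# A volatile qual, like "a > random()": each evaluation may differ, so the
# three "segments" are no longer guaranteed to agree, and the path loses
# the *general property.
volatile_results = [
    scan_with_filter(rows, lambda r: r > random.random() * 10)
    for _ in range(3)
]
```

This is exactly the hazard in the replicated-table case: each segment holds a full copy of the data, but a volatile filter evaluated independently per segment can keep different rows on each copy.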


The wrong plans

We first noticed this issue when Heikki raised an issue on Github: Replicated table has inconsistent content after UPDATE #10226. After some investigation, we found other bad cases. Some bad plans are listed below; the complete list can be found here: https://gist.github.com/kainwen/ad43f4ae88dc56f3c6eb73056b195e8b


create table t_hashdist(a int, b int, c int) distributed by (a);
CREATE
create table t_rep(a int, b int, c int) distributed replicated;
CREATE

---- pushed down filter
explain (costs off, verbose) select * from t_rep, t_hashdist where t_rep.a > random();
QUERY PLAN                                                                         
-------------------------------------------------------------------------------------
Gather Motion 3:1  (slice1; segments: 3)                                           
  Output: t_rep.a, t_rep.b, t_rep.c, t_hashdist.a, t_hashdist.b, t_hashdist.c      
  ->  Nested Loop                                                                  
        Output: t_rep.a, t_rep.b, t_rep.c, t_hashdist.a, t_hashdist.b, t_hashdist.c
        ->  Seq Scan on public.t_hashdist                                          
              Output: t_hashdist.a, t_hashdist.b, t_hashdist.c                     
        ->  Materialize                                                            
              Output: t_rep.a, t_rep.b, t_rep.c                                    
              ->  Seq Scan on public.t_rep                                         
                    Output: t_rep.a, t_rep.b, t_rep.c                              
                    Filter: ((t_rep.a)::double precision > random())               
Optimizer: Postgres query optimizer                                                
(12 rows)

-- join qual
explain (costs off, verbose) select * from t_hashdist, t_rep x, t_rep y where x.a + y.a > random();

QUERY PLAN                                                                            
----------------------------------------------------------------------------------------
Gather Motion 3:1  (slice1; segments: 3)                                              
  Output: t_hashdist.a, t_hashdist.b, t_hashdist.c, x.a, x.b, x.c, y.a, y.b, y.c      
  ->  Nested Loop                                                                     
        Output: t_hashdist.a, t_hashdist.b, t_hashdist.c, x.a, x.b, x.c, y.a, y.b, y.c
        ->  Nested Loop                                                               
              Output: x.a, x.b, x.c, y.a, y.b, y.c                                    
              Join Filter: (((x.a + y.a))::double precision > random())               
              ->  Seq Scan on public.t_rep x                                          
                    Output: x.a, x.b, x.c                                             
              ->  Materialize                                                         
                    Output: y.a, y.b, y.c                                             
                    ->  Seq Scan on public.t_rep y                                    
                          Output: y.a, y.b, y.c                                       
        ->  Materialize                                                               
              Output: t_hashdist.a, t_hashdist.b, t_hashdist.c                        
              ->  Seq Scan on public.t_hashdist                                       
                    Output: t_hashdist.a, t_hashdist.b, t_hashdist.c                  
Optimizer: Postgres query optimizer                                                   
(18 rows)

The root cause is: if a general or segmentGeneral locus path contains volatile functions, it cannot be treated as general or segmentGeneral.


Proposal on fix method

The idea is: when we find the pattern (a general or segmentGeneral locus path containing volatile functions), we create a motion path above it to turn its locus into SingleQE, and then create a projection path on top. The core job then becomes choosing the places to check:


  1. For a single base rel, we only need to check its restrictions; this is at the bottom of the planner, in the function set_rel_pathlist

  2. When creating a join path, if the join locus is general or segmentGeneral, check its join qual for volatile functions

  3. When handling a subquery, the planner invokes the function set_subquery_pathlist; at the end of this function, check the targetlist and havingQual

  4. When creating a limit path, apply the same check-and-fix logic

  5. Correctly handle make_subplan


The ORDER BY and GROUP BY clauses are reflected in the targetlist, so they are handled by Step 3 above.
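The check-and-fix step shared by the places above can be sketched as follows. This is a hypothetical Python model of the idea only; names such as Path, contains_volatile, and bring_to_singleqe are invented for illustration and are not the actual Greenplum C functions.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative subset of volatile functions; the real planner consults
# pg_proc's provolatile, not a hard-coded list.
VOLATILE_FUNCS = {"random", "timeofday"}

@dataclass
class Path:
    locus: str                              # "General", "SegmentGeneral", "SingleQE", ...
    exprs: list = field(default_factory=list)   # quals / targetlist entries (function names)
    subpath: Optional["Path"] = None

def contains_volatile(exprs):
    return any(fn in VOLATILE_FUNCS for fn in exprs)

def bring_to_singleqe(path):
    """If a General/SegmentGeneral path has volatile exprs, gather it to a
    single QE and put a projection path on top, as the proposal describes."""
    if path.locus in ("General", "SegmentGeneral") and contains_volatile(path.exprs):
        motion = Path(locus="SingleQE", subpath=path)    # motion: gather to one QE
        return Path(locus="SingleQE", subpath=motion)    # projection path above it
    return path

# A volatile qual on a replicated-table scan forces the locus to SingleQE:
fixed = bring_to_singleqe(Path(locus="SegmentGeneral", exprs=["random"]))
assert fixed.locus == "SingleQE"
```

Once the locus is SingleQE, the qual is evaluated exactly once, and the result can be broadcast again if a join partner needs it, which is what the corrected plans below show.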


A PR is opened: Correct plan of general & segmentGeneral path with volatile functions. #10418

DMLs on Replicated table

UPDATE and DELETE statements on a replicated table are special: they have to be dispatched to every segment for execution. So if they contain volatile functions in their targetList or WHERE clause, we have to reject such statements:


  1. For the targetList, we check it in the function create_motion_path_for_upddel

  2. The WHERE clause is handled in the query planner: when we find the pattern and are about to fix it, we additionally check whether we are updating or deleting a replicated table, and if so, we reject the statement.


For INSERT statements, correct planner behavior makes them work automatically. Refer to the cases in the gist link above and the one below.
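The rejection rule for replicated-table DML can be sketched as a single predicate. Again a hypothetical Python model: the function name and the error text are illustrative, not the planner's actual API or message.

```python
# Illustrative subset; the real planner checks provolatile in pg_proc.
VOLATILE_FUNCS = {"random", "timeofday"}

def check_replicated_dml(cmd, is_replicated, target_exprs, where_exprs):
    """Reject UPDATE/DELETE on a replicated table when the targetList or
    WHERE clause contains a volatile function; otherwise allow the plan."""
    volatile = any(fn in VOLATILE_FUNCS
                   for fn in target_exprs + where_exprs)
    if cmd in ("UPDATE", "DELETE") and is_replicated and volatile:
        raise ValueError("volatile function in UPDATE/DELETE "
                         "on a replicated table is not supported")
    return True

# A volatile SET expression on a hash-distributed table is fine:
assert check_replicated_dml("UPDATE", False, ["random"], [])
# A stable UPDATE on a replicated table is fine:
assert check_replicated_dml("UPDATE", True, ["upper"], [])
```

The rule captures why Heikki's #10226 produced inconsistent replica contents: each segment evaluated the volatile expression independently, so the full-table copies diverged.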

The correct plans after fixing

The corrected plans for the wrong cases above are listed below. For the whole list, please refer to https://gist.github.com/kainwen/cd96c8fc35b68c9d87d73af7f269ba99


create table t_hashdist(a int, b int, c int) distributed by (a);
CREATE
create table t_rep(a int, b int, c int) distributed replicated;
CREATE

---- pushed down filter
explain (costs off, verbose) select * from t_rep, t_hashdist where t_rep.a > random();
QUERY PLAN
-------------------------------------------------------------------------------------
Gather Motion 3:1  (slice1; segments: 3)
  Output: t_rep.a, t_rep.b, t_rep.c, t_hashdist.a, t_hashdist.b, t_hashdist.c
  ->  Nested Loop
        Output: t_rep.a, t_rep.b, t_rep.c, t_hashdist.a, t_hashdist.b, t_hashdist.c
        ->  Seq Scan on public.t_hashdist
              Output: t_hashdist.a, t_hashdist.b, t_hashdist.c
        ->  Materialize
              Output: t_rep.a, t_rep.b, t_rep.c
              ->  Broadcast Motion 1:3  (slice2; segments: 1)
                    Output: t_rep.a, t_rep.b, t_rep.c
                    ->  Result
                          Output: t_rep.a, t_rep.b, t_rep.c
                          ->  Seq Scan on public.t_rep
                                Output: t_rep.a, t_rep.b, t_rep.c
                                Filter: ((t_rep.a)::double precision > random())
Optimizer: Postgres query optimizer
(16 rows)

-- join qual
explain (costs off, verbose) select * from t_hashdist, t_rep x, t_rep y where x.a + y.a > random();
QUERY PLAN
----------------------------------------------------------------------------------
Nested Loop
  Output: t_hashdist.a, t_hashdist.b, t_hashdist.c, x.a, x.b, x.c, y.a, y.b, y.c
  ->  Result
        Output: x.a, x.b, x.c, y.a, y.b, y.c
        ->  Gather Motion 1:1  (slice1; segments: 1)
              Output: x.a, x.b, x.c, y.a, y.b, y.c
              ->  Nested Loop
                    Output: x.a, x.b, x.c, y.a, y.b, y.c
                    Join Filter: (((x.a + y.a))::double precision > random())
                    ->  Seq Scan on public.t_rep x
                          Output: x.a, x.b, x.c
                    ->  Materialize
                          Output: y.a, y.b, y.c
                          ->  Seq Scan on public.t_rep y
                                Output: y.a, y.b, y.c
  ->  Materialize
        Output: t_hashdist.a, t_hashdist.b, t_hashdist.c
        ->  Gather Motion 3:1  (slice2; segments: 3)
              Output: t_hashdist.a, t_hashdist.b, t_hashdist.c
              ->  Seq Scan on public.t_hashdist
                    Output: t_hashdist.a, t_hashdist.b, t_hashdist.c
Optimizer: Postgres query optimizer
(22 rows)



Zhenghua Lyu

Jul 5, 2020, 1:13:24 AM
to Greenplum Developers
A diff of the plans before and after this patch can be found at: https://www.diffnow.com/report/jeupd
It is helpful for understanding the change.

From: Zhenghua Lyu <zl...@vmware.com>
Sent: Sunday, July 5, 2020 1:32 AM
To: Greenplum Developers <gpdb...@greenplum.org>
Subject: Correct plan of general & segmentGeneral path containing volatile functions
 