Previously, Greenplum has done some work to get rid of interconnect deadlock. The following links are in
- https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/gMa1tW0x_fk/m/wuzvGXBaBAAJ (the 1st discussion in
gpdb-dev mailing list: Deadlock hazards and joins)
- https://github.com/greenplum-db/gpdb/pull/7492 (Github PR: Fix motion hazard between outer and joinqual
#7492)
-
https://github.com/greenplum-db/gpdb/pull/11862 (Github PR: Prefetch NonJoinQual to avoid motion hazard. #11862)
-
https://github.com/greenplum-db/gpdb/issues/15719 (Github Issue: motion deadlock #15719)
The 1st discussion of the above Link1 talks about the interconnect deadlock between join's outer and inner.
The 2nd,3rd,4th links above are motion deadlock between outer and SubPlan, the difference is where the
SubPlans come from: join qual, plan qual, target list, check option (updatable view).
Introduction to Motion
==================
A QE executes the plan it gets from QD. The QE will figure out the Slice belong to it and only
executes that piece of plan. In Greenplum, a plan is cut into pieces by Motions. A QE might
fetch tuples from child slices (it is possible for a QE to have more than one child slices), and
do some computation and then send out the results to its parent slice. A slice can have multiple
child slices but at most one parent slice.
Motion Plan node's execution model is also one-tuple-a-time. Lets now consider when a motion
executor return NULL to identify that it finishes. When all possible sender procs send EOS to the
receiver proc, then ExecMotion on that specific QE will return NULL.
To fetch all tuples from a motion utile ExecMotion returns NULL is sort of like
the sync operation in MPI programming MPI_Barrier
. Sort over motion, Material over motion
or Hash over motion might have this kind of effect.
Plan that only fetching one tuple from a motion is async: QEs in this slice might reach different
part of code path.
The above is key to understand motion deadlock.
-
Deadlock Pattern: Abstract Description
================================
Without any specific handling, the below is interconnect UDP deadlock pattern:
S is the slice that has more than one child slices: C1, C2, ...
S's executor algorithm match motion deadlock pattern if it is like:
1. fetch one single tuple from C1 (via Motion A)
2. and then fetch all tuples from C2 (via Motion B)
Why? What is the deadlock wait chain in the above pattern?
Lets discuss using the following symbols:
Slices: S, C1, C2 (C1, C2 are child slices of S).
S.x: the QE in Slice S on seg x
C1.x: the QE in Slice C1 on seg x
C2.x: the QE in Slice C2 on seg x
It is easy to build deadlock using the following steps:
- make C1 only send data to seg1 and contains much data
- make C1 only contains data on seg1, so only C1.1 will not be IDLE
- make C2 only send data to seg0 and contains much data
- make C2 only contains data on seg1, so only C2.1 will not be IDLE
Wait events in deadlock chain are:
- S.0 is waiting for one tuple from Motion A
- it cannot get any tuples because we know C1 only sends data to seg1
- so S.0 is actually waiting for the end of Motion A
- it is waiting for C1.1 to end
- S.0 ----> C1.1
- C1.1 is waiting for S.0's ACK, its sender buffer is full and thus blocked
- C1.1 ----> S.0
- S.0 already fetch one tuple from C1 (via motion A) and now it is fetching tuples from C2 (via motion B)
- so it is waiting for C2.1 to end
- S.0 --> C2.1
- C2.1 is sending data to S.0 and its sender buffer is full it waits for ACK from S.0
- C2.1 ----> S.0
- S.0's main thread can now only be waked up by motion A's package
Deadlock happens!
Deadlock Cases: Concrete Examples
==============================
Current
main branch top commit is: df114f0d3d2
To play with all the following cases, you can apply the patch (in
attachment: 0001-revert-prefetch-inner-joinqual-qual-logic.patch
)
to revert previous prefetch fix method. Some of the cases (deadlock
involving SubPlan in target list and check option) do not need any
patches, just the above main branch has the deadlock bug.
GUCs setting:
turn off autovacuum and then set GUCs as below to help reproduce:
set gp_interconnect_queue_depth =1;
set gp_interconnect_snd_queue_depth =1;
set gp_autostats_mode = none;
set disable_cost = 1e20;
Play all cases under 3-seg demo-cluster.
##Case 1: Outer & Inner Deadlock (need to play with the revert prefetch patch)
create table t_motion_deadlock_1(a
int, b
int);
create table t_motion_deadlock_2(a
int, b
int);
-- make t_motion_deadlock_1 only contains much data
-- in segment 1, and due to a = b, all data will also
-- be redistributed to segment 1 on b.
insert into t_motion_deadlock_1
select i,i
from generate_series(1,
30000)i;
delete from t_motion_deadlock_1
where gp_segment_id
<> 1;
-- the same skill to let t_motion_deadlock_2 only contain
-- much data on segment 0 and only redistributed t segment 0
-- by b.
insert into t_motion_deadlock_2
select i,i
from generate_series(1,
30000)i;
delete from t_motion_deadlock_2
where gp_segment_id
<> 0;
insert into t_motion_deadlock_2
select y.a, x.b
from t_motion_deadlock_1 x, t_motion_deadlock_2 y
limit 10;
-- below plan should redistribute both inner and outer
-- hash join
explain (costs off,
verbose)
select count(1)
from
t_motion_deadlock_1 x,
t_motion_deadlock_2 y
where x.b
= y.b;
QUERY PLAN
------------------------------------------------------------------------------
Finalize Aggregate
Output:
count(1)
-> Gather Motion
3:1 (slice1; segments:
3)
Output: (PARTIAL count(1))
->
Partial Aggregate
Output:
PARTIAL count(1)
->
Hash Join
Hash Cond: (x.b
= y.b)
-> Redistribute Motion
3:3 (slice2; segments:
3)
Output: x.b
Hash Key:
x.b
-> Seq Scan
on public.t_motion_deadlock_1 x
Output: x.b
->
Hash
Output: y.b
-> Redistribute Motion
3:3 (slice3; segments:
3)
Output: y.b
Hash Key:
y.b
-> Seq Scan
on public.t_motion_deadlock_2 y
Output: y.b
Optimizer: Postgres query optimizer
-- ==============================================
-- outer plan & joinqual
-- ==============================================
create table t_motion_deadlock_1(a
int, b
int);
create table t_motion_deadlock_2(a
int, b
int);
create table t_motion_deadlock_3(a
int, b
int);
insert into t_motion_deadlock_1
select i,i
from generate_series(1,
10000)i;
delete from t_motion_deadlock_1
where gp_segment_id
<> 1;
insert into t_motion_deadlock_2
select i,i
from generate_series(1,
30)i;
insert into t_motion_deadlock_3
select i,i
from generate_series(1,
10000)i;
-- hash join
explain (costs off,
verbose)
select count(1)
from
t_motion_deadlock_1 x join t_motion_deadlock_2 y
on x.b
= y.a
and
x.b + y.a
> (select count(1)
from t_motion_deadlock_3 z
where z.b
< x.a
+ y.b);
QUERY PLAN
--------------------------------------------------------------------------------------------
Finalize Aggregate
Output:
count(1)
-> Gather Motion
3:1 (slice1; segments:
3)
Output: (PARTIAL count(1))
->
Partial Aggregate
Output:
PARTIAL count(1)
->
Hash Join
Hash Cond: (x.b
= y.a)
Join Filter:
((x.b + y.a)
> (SubPlan
1))
-> Redistribute Motion
3:3 (slice2; segments:
3)
Output: x.b, x.a
Hash Key:
x.b
-> Seq Scan
on public.t_motion_deadlock_1 x
Output: x.b, x.a
->
Hash
Output: y.a, y.b
-> Seq Scan
on public.t_motion_deadlock_2 y
Output: y.a, y.b
SubPlan 1
->
Aggregate
Output:
count(1)
-> Result
Filter: (z.b
< (x.a
+ y.b))
-> Materialize
Output: z.b
-> Broadcast Motion
3:3 (slice3; segments:
3)
Output: z.b
-> Seq Scan
on public.t_motion_deadlock_3 z
Output: z.b
Optimizer: Postgres query optimizer
- fetch a tuple from outer's motion
- match a tuple using hash join
- check join qual (this will evaluate the SubPlan which contains a motion)
- note the SubPlan has a materialize over broadcast to make it rescannable
## Case 3: outer & upper level plannode's plan qual
NOTE: this case does not need any patch, current main branch and 6X all will deadlock.
The table and data is just the same as Case 2, and the SQL is also the same, just using ORCA.
gpadmin=# explain (costs
off,
verbose)
select count(1)
from
t_motion_deadlock_1 x join t_motion_deadlock_2 y
on x.b
= y.a
and
x.b + y.a
> (select count(1)
from t_motion_deadlock_3 z
where z.b
< x.a
+ y.b);
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
Finalize Aggregate
Output:
count(1)
-> Gather Motion
3:1 (slice1; segments:
3)
Output: (PARTIAL count(1))
->
Partial Aggregate
Output:
PARTIAL count(1)
-> Result
Filter: ((t_motion_deadlock_1.b
+ t_motion_deadlock_2.a)
> (SubPlan
1))
->
Hash Join
Output: t_motion_deadlock_1.a, t_motion_deadlock_1.b, t_motion_deadlock_2.a, t_motion_deadlock_2.b
Hash Cond: (t_motion_deadlock_1.b
= t_motion_deadlock_2.a)
-> Redistribute Motion
3:3 (slice2; segments:
3)
Output: t_motion_deadlock_1.a, t_motion_deadlock_1.b
Hash Key:
t_motion_deadlock_1.b
-> Seq Scan
on public.t_motion_deadlock_1
Output: t_motion_deadlock_1.a, t_motion_deadlock_1.b
->
Hash
Output: t_motion_deadlock_2.a, t_motion_deadlock_2.b
-> Seq Scan
on public.t_motion_deadlock_2
Output: t_motion_deadlock_2.a, t_motion_deadlock_2.b
SubPlan 1
->
Aggregate
Output:
count(1)
-> Result
Filter: (t_motion_deadlock_3.b
< (t_motion_deadlock_1.a
+ t_motion_deadlock_2.b))
-> Materialize
Output: t_motion_deadlock_3.b
-> Broadcast Motion
3:3 (slice3; segments:
3)
Output: t_motion_deadlock_3.b
-> Seq Scan
on public.t_motion_deadlock_3
Output: t_motion_deadlock_3.b
Optimizer: Pivotal Optimizer (GPORCA)
Deadlock pattern:
- The Result plan (over a HashJoin, and contains a SubPlan in its filter) fetch a tuple from its left tree (Hash Join)
- The HashJoin fetch one tuple from its outer via Motion
- The Hash Join match a tuple and return to its parent (the Result)
- The Result node evaluate its fitler (plan qual) which will evalute the SubPlan containing the motion
- note the SubPlan has a materialize over broadcast to make it rescannable
## Case 4: outer & join's planqual (very very similar to the above case 3)
The table and data is just the same as Case 2.
explain (costs off,
verbose)
select count(1)
from
t_motion_deadlock_1 x left join t_motion_deadlock_2 y
on x.b
= y.a
where
x.a is null or exists (select random()
from t_motion_deadlock_3 z
where z.b
< x.a
+ y.b);
QUERY PLAN
--------------------------------------------------------------------------------------
Finalize Aggregate
Output:
count(1)
-> Gather Motion
3:1 (slice1; segments:
3)
Output: (PARTIAL count(1))
->
Partial Aggregate
Output:
PARTIAL count(1)
->
Hash Left Join
Hash Cond: (x.b
= y.a)
Filter: ((x.a
IS NULL)
OR (SubPlan
1))
-> Redistribute Motion
3:3 (slice3; segments:
3)
Output: x.b, x.a
Hash Key:
x.b
-> Seq Scan
on public.t_motion_deadlock_1 x
Output: x.b, x.a
->
Hash
Output: y.a, y.b
-> Seq Scan
on public.t_motion_deadlock_2 y
Output: y.a, y.b
SubPlan 1
-> Result
Filter: (z.b
< (x.a
+ y.b))
-> Materialize
Output: z.b
-> Broadcast Motion
3:3 (slice2; segments:
3)
Output: z.b
-> Seq Scan
on public.t_motion_deadlock_3 z
Output: z.b
Optimizer: Postgres query optimizer
Deadlock pattern: it is very similar to the above case, just the SubPlan appear in HashJoin's plan qual (not join filter).
## Case 5: outer & SubPlan in Target List
create table t_motion_deadlock_1(a
int, b
int);
create table t_motion_deadlock_2(a
int, b
int);
create table t_motion_deadlock_3(a
int, b
int);
insert into t_motion_deadlock_1
select i,i
from generate_series(1,
30000)i;
delete from t_motion_deadlock_1
where gp_segment_id
<> 1;
insert into t_motion_deadlock_2
select i,i
from generate_series(1,
30)i;
insert into t_motion_deadlock_3
select i,i
from generate_series(1,
10000)i;
-- hash join
explain (costs off,
verbose)
select
(select count(1)
from t_motion_deadlock_3 z
where z.a
< x.a
+ y.b ) s
from t_motion_deadlock_1 x
join t_motion_deadlock_2 y
on x.b
= y.a;
QUERY PLAN
--------------------------------------------------------------------------------
Gather Motion 3:1 (slice1; segments: 3)
Output: ((SubPlan 1))
-> Hash Join
Output: (SubPlan 1)
Hash Cond: (x.b = y.a)
-> Redistribute Motion 3:3 (slice3; segments: 3)
Output: x.a, x.b
Hash Key: x.b
-> Seq Scan on public.t_motion_deadlock_1 x
Output: x.a, x.b
-> Hash
Output: y.b, y.a
-> Seq Scan on public.t_motion_deadlock_2 y
Output: y.b, y.a
SubPlan 1
-> Aggregate
Output: count(1)
-> Result
Filter: (z.a < (x.a + y.b))
-> Materialize
Output: z.a
-> Broadcast Motion 3:3 (slice2; segments: 3)
Output: z.a
-> Seq Scan on public.t_motion_deadlock_3 z
Output: z.a
Optimizer: Postgres query optimizer
Deadlock pattern:
- Hash Join fetch a tuple from its outer via Motion
- match a tuple and return to its parent
- The parent needs to evaluate the target list which contains a SubPlan with Motion
- note the SubPlan has a materialize over broadcast to make it rescannable
## Case 6: outer & check option SubPlans
create table t1(a
int);
create table t2(a
int);
create table t3(a
int);
create view mv as
select a
from t1
where exists (select 1 from t2
where t2.a
= t1.a)
with check option;
insert into t1
select generate_series(1,
30000);
delete from t1
where gp_segment_id
<> 2;
insert into t2
select generate_series(1,
30000);
delete from t2
where gp_segment_id
<> 2;
explain (costs off,
verbose)
update mv
set a
= 6;
explain (costs off, verbose) update mv set a = 6;
QUERY PLAN
------------------------------------------------------------------------
Update on public.t1
-> Explicit Redistribute Motion 3:3 (slice1; segments: 3)
Output: 6, t1.ctid, t1.gp_segment_id, t2.ctid, (DMLAction)
-> Split
Output: 6, t1.ctid, t1.gp_segment_id, t2.ctid, DMLAction
-> Hash Join
Output: 6, t1.ctid, t1.gp_segment_id, t2.ctid
Inner Unique: true
Hash Cond: (t1.a = t2.a)
-> Seq Scan on public.t1
Output: t1.ctid, t1.gp_segment_id, t1.a
-> Hash
Output: t2.ctid, t2.a
-> HashAggregate
Output: t2.ctid, t2.a
Group Key: t2.a
-> Seq Scan on public.t2
Output: t2.ctid, t2.a
SubPlan 1
-> Result
Filter: (t2_1.a = t1.a)
-> Materialize
Output: t2_1.a
-> Broadcast Motion 3:3 (slice2; segments: 3)
Output: t2_1.a
-> Seq Scan on public.t2 t2_1
Output: t2_1.a
SubPlan 2
-> Broadcast Motion 3:3 (slice3; segments: 3)
Output: t2_2.a
-> Seq Scan on public.t2 t2_2
Output: t2_2.a
Optimizer: Postgres query optimizer
Deadlock pattern:
- the ModifyTable node fetch a tuple from its outer via motion
- it check the tuple based on check options which are SubPlans containing motion
- one is normal subplan, there is a material over the broadcast motion
- the other is hash subplan, if the hash table is not built, it will firstly fetch all tuples from motion and build the hash table
More on Interconnect UDP deadlock: when No SubPlan at all
=================================================
When
there is no subplan at all, Join Plan is the only plan that might fall into the deadlock pattern.
- Recursive
Union
- Append Node
- Sequence Node
All the above plan will all firstly finish a part of child plan and then turn to next, so they do not match
deadlock pattern's 1st step : fetch one tuple from motion.
Also note, I believe Merge Join will never hit deadlock (even without prefetch). Reasons:
- Merge join has to make sure both side in order
- If outer plan contain motion, there must be a Sort over that motion
- Sort will fetch all tuples from the motion and then return
- so it also does not match the 1st step of deadlock pattern.
But seems previous engineers still write the same code for merge join.
The fix method is to prefetch inner when outer contain motion for Join Plan.
Planner introduces a lot of code to hint this.
ORCA's rule is so simple, always prefetch inner except for nestloop index scan plan.
I think we can take ORCA's idea, remove all code from planner and decide this runtime.
- for hash join always prefetch inner,
- for merge join keeps it as Postgres
- for nestloop join:
- if there is no params, always prefetch
- if there is params, seems more need to dig into.
Anyway, current code for this part can work.
More on Interconnect UDP deadlock: when there come SubPlans
=====================================================
SubPlans can come easily: you can always write SQL in target list (not pull-up-able) or
in where condition or join condition. So when considering SubPlans, interconnect UDP
deadlock has nothing to do with Join.
So we can always create such kind of plan:
- make a plan contains motion (no Sort or Material over it)
- wrap it in a subquery and then write a SQL in target list
Thus there is an interesting TODO: go through all cases in Greenplum see all patterns of Motion.
Currently in my mind:
- agg
- join
- update|delete with join
- split update
- subplan
- table value expression (scatter clause?)
- ....????
Anyway, for such kind of deadlock, prefetch_joinqual and prefetch_qual is not the correct way to fix.
|
Here are some reminders before you submit the pull request Add tests for the change Document changes Communicate in the mailing list if needed Pass make installcheck Review a PR in return to ...
|
The idea is:
- when the 1st time to execute a plannode
- if it contains subPlans, prefetch all the subPlans
- if it is hash subplan, build the hash table
- for normal subplan, find all materaial over motion, execute then.
Any discussion and comments are welcomed!