On Interconnect UDP deadlock in Greenplum

117 views
Skip to first unread message

Zhenghua Lyu

unread,
Jul 2, 2023, 9:20:29 AM7/2/23
to gpdb...@greenplum.org
Hi all,
    
    Interconnect UDP is the most widely used interconnect type in Greenplum. Interconnect UDP deadlock is
    deadlock among QEs due to the current interconnect UDP's implementation. When it happens, it is super
    hard to debug to even identify that is interconnect deadlock.
 
   Historical Background
   ==================

   Previously, Greenplum has done some work to get rid of interconnect deadlock. The following links are in
   time order (from old times to now):

  1. https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/gMa1tW0x_fk/m/wuzvGXBaBAAJ (the 1st discussion in gpdb-dev mailing list: Deadlock hazards and joins)
  2. https://github.com/greenplum-db/gpdb/pull/7492 (Github PR: Fix motion hazard between outer and joinqual #7492)
  3. https://github.com/greenplum-db/gpdb/pull/11862 (Github PR: Prefetch NonJoinQual to avoid motion hazard. #11862)
  4. https://github.com/greenplum-db/gpdb/issues/15719 (Github Issue: motion deadlock #15719)

    The 1st discussion of the above Link1 talks about the interconnect deadlock between join's outer and inner.
    The 2nd,3rd,4th links above are motion deadlock between outer and SubPlan, the difference is where the
    SubPlans come from: join qual, plan qual, target list, check option (updatable view).

   Introduction to Motion
   ==================

   
    A QE executes the plan it gets from QD. The QE will figure out the Slice belong to it and only
    executes that piece of plan. In Greenplum, a plan is cut into pieces by Motions. A QE might
    fetch tuples from child slices (it is possible for a QE to have more than one child slices), and
    do some computation and then send out the results to its parent slice. A slice can have multiple
    child slices but at most one parent slice.

    Motion Plan node's execution model is also one-tuple-a-time. Lets now consider when a motion
    executor return NULL to identify that it finishes. When all possible sender procs send EOS to the
    receiver proc, then ExecMotion on that specific QE will return NULL.

    To fetch all tuples from a motion utile ExecMotion returns NULL is sort of like
    the sync operation in MPI programming MPI_Barrier​. Sort over motion, Material over motion
    or Hash over motion might have this kind of effect.
  
    Plan that only fetching one tuple from a motion is async: QEs in this slice might reach different
    part of code path.
 
    The above is key to understand motion deadlock.
    
   Deadlock Pattern: Abstract Description
   ================================

    Without any specific handling, the below is interconnect UDP deadlock pattern:

    S is the slice that has more than one child slices: C1, C2, ...
    S's executor algorithm match motion deadlock pattern if it is like:

    1. fetch one single tuple from C1 (via Motion A)
    2. and then fetch all tuples from C2 (via Motion B)

    Why? What is the deadlock wait chain in the above pattern?
    Lets discuss using the following symbols:
    
    Slices: S, C1, C2 (C1, C2 are child slices of S).
    S.x: the QE in Slice S on seg x
    C1.x: the QE in Slice C1 on seg x
    C2.x: the QE in Slice C2 on seg x

    It is easy to build deadlock using the following steps:
  1. make C1 only send data to seg1 and contains much data
  2. make C1 only contains data on seg1, so only C1.1 will not be IDLE
  3. make C2 only send data to seg0 and contains much data
  4. make C2 only contains data on seg1, so only C2.1 will not be IDLE
    Wait events in deadlock chain are:

  1. S.0 is waiting for one tuple from Motion A
    1. it cannot get any tuples because we know C1 only sends data to seg1
    2. so S.0 is actually waiting for the end of Motion A
    3. it is waiting for C1.1 to end
    4. S.0 ----> C1.1
  2. C1.1 is waiting for S.0's ACK, its sender buffer is full and thus blocked
    1. C1.1 ----> S.0
  3. S.0 already fetch one tuple from C1 (via motion A) and now it is fetching tuples from C2 (via motion B)
    1. so it is waiting for C2.1 to end
    2. S.0 --> C2.1
  4. C2.1 is sending data to S.0 and its sender buffer is full it waits for ACK from S.0
    1. C2.1 ----> S.0
  5. S.0's main thread can now only be waked up by motion A's package
    Deadlock happens!

   Deadlock Cases: Concrete Examples
   ==============================

   Current main branch top commit is: df114f0d3d2
    To play with all the following cases, you can apply the patch (in
    attachment: 0001-revert-prefetch-inner-joinqual-qual-logic.patch​)
    to revert previous prefetch fix method. Some of the cases (deadlock
    involving SubPlan in target list and check option) do not need any
    patches, just the above main branch has the deadlock bug.

    GUCs setting:
    turn off autovacuum and then set GUCs as below to help reproduce:

    set gp_interconnect_queue_depth =1;
    set gp_interconnect_snd_queue_depth =1;
    set gp_autostats_mode = none;
    set disable_cost = 1e20;

    Play all cases under 3-seg demo-cluster.

    ##Case 1: Outer & Inner Deadlock (need to play with the revert prefetch patch)
   
    
create table t_motion_deadlock_1(a int, b int);
create table t_motion_deadlock_2(a int, b int);

-- make t_motion_deadlock_1 only contains much data
-- in segment 1, and due to a = b, all data will also
-- be redistributed to segment 1 on b.
insert into t_motion_deadlock_1 select i,i from generate_series(1, 30000)i;
delete from t_motion_deadlock_1 where gp_segment_id <> 1;

-- the same skill to let t_motion_deadlock_2 only contain
-- much data on segment 0 and only redistributed t segment 0
-- by b.
insert into t_motion_deadlock_2 select i,i from generate_series(1, 30000)i;
delete from t_motion_deadlock_2 where gp_segment_id <> 0;
insert into t_motion_deadlock_2
select y.a, x.b from t_motion_deadlock_1 x, t_motion_deadlock_2 y limit 10;
-- below plan should redistribute both inner and outer
-- hash join
explain (costs off, verbose)
select count(1)
from
  t_motion_deadlock_1 x,
  t_motion_deadlock_2 y
where x.b = y.b;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Finalize Aggregate
   Output: count(1)
   ->  Gather Motion 3:1  (slice1; segments: 3)
         Output: (PARTIAL count(1))
         ->  Partial Aggregate
               Output: PARTIAL count(1)
               ->  Hash Join
                     Hash Cond: (x.b = y.b)
                     ->  Redistribute Motion 3:3  (slice2; segments: 3)
                           Output: x.b
                           Hash Key: x.b
                           ->  Seq Scan on public.t_motion_deadlock_1 x
                                 Output: x.b
                     ->  Hash
                           Output: y.b
                           ->  Redistribute Motion 3:3  (slice3; segments: 3)
                                 Output: y.b
                                 Hash Key: y.b
                                 ->  Seq Scan on public.t_motion_deadlock_2 y
                                       Output: y.b
 Optimizer: Postgres query optimizer
    
Deadlock pattern:

Hash Join Slice's execute algorithm (without prefetch inner):
  1. fetch one tuple from outer plan to see if any
  2. if there is no tuple from outer plan, then no need to build hash table, just return NULL
  3. if there is any tuple from outer, then fetch all tuples from inner to build the Hash Table 
 
##Case 2: outer & subplan in joinqual

-- ==============================================
-- outer plan & joinqual
-- ==============================================
create table t_motion_deadlock_1(a int, b int);
create table t_motion_deadlock_2(a int, b int);
create table t_motion_deadlock_3(a int, b int);

insert into t_motion_deadlock_1 select i,i from generate_series(1, 10000)i;
delete from t_motion_deadlock_1 where gp_segment_id <> 1;
insert into t_motion_deadlock_2 select i,i from generate_series(1, 30)i;
insert into t_motion_deadlock_3 select i,i from generate_series(1, 10000)i;

-- hash join
explain (costs off, verbose)
select count(1)
from
  t_motion_deadlock_1 x join t_motion_deadlock_2 y
  on x.b = y.a and
     x.b + y.a > (select count(1) from t_motion_deadlock_3 z where z.b < x.a + y.b);

                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Finalize Aggregate
   Output: count(1)
   ->  Gather Motion 3:1  (slice1; segments: 3)
         Output: (PARTIAL count(1))
         ->  Partial Aggregate
               Output: PARTIAL count(1)
               ->  Hash Join
                     Hash Cond: (x.b = y.a)
                     Join Filter: ((x.b + y.a) > (SubPlan 1))
                     ->  Redistribute Motion 3:3  (slice2; segments: 3)
                           Output: x.b, x.a
                           Hash Key: x.b
                           ->  Seq Scan on public.t_motion_deadlock_1 x
                                 Output: x.b, x.a
                     ->  Hash
                           Output: y.a, y.b
                           ->  Seq Scan on public.t_motion_deadlock_2 y
                                 Output: y.a, y.b
                     SubPlan 1
                       ->  Aggregate
                             Output: count(1)
                             ->  Result
                                   Filter: (z.b < (x.a + y.b))
                                   ->  Materialize
                                         Output: z.b
                                         ->  Broadcast Motion 3:3  (slice3; segments: 3)
                                               Output: z.b
                                               ->  Seq Scan on public.t_motion_deadlock_3 z
                                                     Output: z.b
 Optimizer: Postgres query optimizer

Deadlock pattern:

Hash Join Slice's execute algorithm (without prefetch joinqual):
  1. fetch a tuple from outer's motion
  2. match a tuple using hash join
  3. check join qual (this will evaluate the SubPlan which contains a motion)
    1. note the SubPlan has a materialize over broadcast to make it rescannable
## Case 3: outer & upper level plannode's plan qual

NOTE: this case does not need any patch, current main branch and 6X all will deadlock.

The table and data is just the same as Case 2, and the SQL is also the same, just using ORCA.
gpadmin=# explain (costs off, verbose) select count(1)
from
  t_motion_deadlock_1 x join t_motion_deadlock_2 y
  on x.b = y.a and
     x.b + y.a > (select count(1) from t_motion_deadlock_3 z where z.b < x.a + y.b);
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate
   Output: count(1)
   ->  Gather Motion 3:1  (slice1; segments: 3)
         Output: (PARTIAL count(1))
         ->  Partial Aggregate
               Output: PARTIAL count(1)
               ->  Result
                     Filter: ((t_motion_deadlock_1.b + t_motion_deadlock_2.a) > (SubPlan 1))
                     ->  Hash Join
                           Output: t_motion_deadlock_1.a, t_motion_deadlock_1.b, t_motion_deadlock_2.a, t_motion_deadlock_2.b
                           Hash Cond: (t_motion_deadlock_1.b = t_motion_deadlock_2.a)
                           ->  Redistribute Motion 3:3  (slice2; segments: 3)
                                 Output: t_motion_deadlock_1.a, t_motion_deadlock_1.b
                                 Hash Key: t_motion_deadlock_1.b
                                 ->  Seq Scan on public.t_motion_deadlock_1
                                       Output: t_motion_deadlock_1.a, t_motion_deadlock_1.b
                           ->  Hash
                                 Output: t_motion_deadlock_2.a, t_motion_deadlock_2.b
                                 ->  Seq Scan on public.t_motion_deadlock_2
                                       Output: t_motion_deadlock_2.a, t_motion_deadlock_2.b
                     SubPlan 1
                       ->  Aggregate
                             Output: count(1)
                             ->  Result
                                   Filter: (t_motion_deadlock_3.b < (t_motion_deadlock_1.a + t_motion_deadlock_2.b))
                                   ->  Materialize
                                         Output: t_motion_deadlock_3.b
                                         ->  Broadcast Motion 3:3  (slice3; segments: 3)
                                               Output: t_motion_deadlock_3.b
                                               ->  Seq Scan on public.t_motion_deadlock_3
                                                     Output: t_motion_deadlock_3.b
 Optimizer: Pivotal Optimizer (GPORCA)

Deadlock pattern:
  1. The Result plan (over a HashJoin, and contains a SubPlan in its filter) fetch a tuple from its left tree (Hash Join)
  2. The HashJoin fetch one tuple from its outer via Motion
  3. The Hash Join match a tuple and return to its parent (the Result)
  4. The Result node evaluate its fitler (plan qual) which will evalute the SubPlan containing the motion
    1. note the SubPlan has a materialize over broadcast to make it rescannable

## Case 4: outer & join's planqual (very very similar to the above case 3)

The table and data is just the same as Case 2.

explain (costs off, verbose)
select count(1)
from
  t_motion_deadlock_1 x left join t_motion_deadlock_2 y
  on x.b = y.a
where
   x.a is null or exists (select random() from t_motion_deadlock_3 z where z.b < x.a + y.b);
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Finalize Aggregate
   Output: count(1)
   ->  Gather Motion 3:1  (slice1; segments: 3)
         Output: (PARTIAL count(1))
         ->  Partial Aggregate
               Output: PARTIAL count(1)
               ->  Hash Left Join
                     Hash Cond: (x.b = y.a)
                     Filter: ((x.a IS NULL) OR (SubPlan 1))
                     ->  Redistribute Motion 3:3  (slice3; segments: 3)
                           Output: x.b, x.a
                           Hash Key: x.b
                           ->  Seq Scan on public.t_motion_deadlock_1 x
                                 Output: x.b, x.a
                     ->  Hash
                           Output: y.a, y.b
                           ->  Seq Scan on public.t_motion_deadlock_2 y
                                 Output: y.a, y.b
                     SubPlan 1
                       ->  Result
                             Filter: (z.b < (x.a + y.b))
                             ->  Materialize
                                   Output: z.b
                                   ->  Broadcast Motion 3:3  (slice2; segments: 3)
                                         Output: z.b
                                         ->  Seq Scan on public.t_motion_deadlock_3 z
                                               Output: z.b
 Optimizer: Postgres query optimizer

Deadlock pattern: it is very similar to the above case, just the SubPlan appear in HashJoin's plan qual (not join filter).

## Case 5: outer & SubPlan in Target List

create table t_motion_deadlock_1(a int, b int);
create table t_motion_deadlock_2(a int, b int);
create table t_motion_deadlock_3(a int, b int);

insert into t_motion_deadlock_1 select i,i from generate_series(1, 30000)i;
delete from t_motion_deadlock_1 where gp_segment_id <> 1;
insert into t_motion_deadlock_2 select i,i from generate_series(1, 30)i;
insert into t_motion_deadlock_3 select i,i from generate_series(1, 10000)i;

-- hash join
explain (costs off, verbose)
select
  (select count(1) from t_motion_deadlock_3 z where z.a < x.a + y.b ) s
from t_motion_deadlock_1 x join t_motion_deadlock_2 y on x.b = y.a;

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)
   Output: ((SubPlan 1))
   ->  Hash Join
         Output: (SubPlan 1)
         Hash Cond: (x.b = y.a)
         ->  Redistribute Motion 3:3  (slice3; segments: 3)
               Output: x.a, x.b
               Hash Key: x.b
               ->  Seq Scan on public.t_motion_deadlock_1 x
                     Output: x.a, x.b
         ->  Hash
               Output: y.b, y.a
               ->  Seq Scan on public.t_motion_deadlock_2 y
                     Output: y.b, y.a
         SubPlan 1
           ->  Aggregate
                 Output: count(1)
                 ->  Result
                       Filter: (z.a < (x.a + y.b))
                       ->  Materialize
                             Output: z.a
                             ->  Broadcast Motion 3:3  (slice2; segments: 3)
                                   Output: z.a
                                   ->  Seq Scan on public.t_motion_deadlock_3 z
                                         Output: z.a
 Optimizer: Postgres query optimizer

Deadlock pattern:
  1. Hash Join fetch a tuple from its outer via Motion
  2. match a tuple and return to its parent
  3. The parent needs to evaluate the target list which contains a SubPlan with Motion
    1. note the SubPlan has a materialize over broadcast to make it rescannable
## Case 6: outer & check option SubPlans

create table t1(a int);
create table t2(a int);
create table t3(a int);

create view mv as
select a from t1
where exists (select 1 from t2 where t2.a = t1.a)
with check option;

insert into t1 select generate_series(1, 30000);
delete from t1 where gp_segment_id <> 2;

insert into t2 select generate_series(1, 30000);
delete from t2 where gp_segment_id <> 2;

explain (costs off, verbose) update mv set a = 6;
explain (costs off, verbose) update mv set a = 6;
                               QUERY PLAN
------------------------------------------------------------------------
 Update on public.t1
   ->  Explicit Redistribute Motion 3:3  (slice1; segments: 3)
         Output: 6, t1.ctid, t1.gp_segment_id, t2.ctid, (DMLAction)
         ->  Split
               Output: 6, t1.ctid, t1.gp_segment_id, t2.ctid, DMLAction
               ->  Hash Join
                     Output: 6, t1.ctid, t1.gp_segment_id, t2.ctid
                     Inner Unique: true
                     Hash Cond: (t1.a = t2.a)
                     ->  Seq Scan on public.t1
                           Output: t1.ctid, t1.gp_segment_id, t1.a
                     ->  Hash
                           Output: t2.ctid, t2.a
                           ->  HashAggregate
                                 Output: t2.ctid, t2.a
                                 Group Key: t2.a
                                 ->  Seq Scan on public.t2
                                       Output: t2.ctid, t2.a
   SubPlan 1
     ->  Result
           Filter: (t2_1.a = t1.a)
           ->  Materialize
                 Output: t2_1.a
                 ->  Broadcast Motion 3:3  (slice2; segments: 3)
                       Output: t2_1.a
                       ->  Seq Scan on public.t2 t2_1
                             Output: t2_1.a
   SubPlan 2
     ->  Broadcast Motion 3:3  (slice3; segments: 3)
           Output: t2_2.a
           ->  Seq Scan on public.t2 t2_2
                 Output: t2_2.a
 Optimizer: Postgres query optimizer

Deadlock pattern:
  1. the ModifyTable node fetch a tuple from its outer via motion
  2. it check the tuple based on check options which are SubPlans containing motion
    1. one is normal subplan, there is a material over the broadcast motion
    2. the other is hash subplan, if the hash table is not built, it will firstly fetch all tuples from motion and build the hash table
More on Interconnect UDP deadlock: when No SubPlan at all
=================================================

When there is no subplan at all, Join Plan is the only plan that might fall into the deadlock pattern.
  • Recursive Union
  • Append Node
  • Sequence Node
All the above plan will all firstly finish a part of child plan and then turn to next, so they do not match
deadlock pattern's 1st step : fetch one tuple from motion.

Also note, I believe Merge Join will never hit deadlock (even without prefetch). Reasons:
  • Merge join has to make sure both side in order
  • If outer plan contain motion, there must be a Sort over that motion
    • Sort will fetch all tuples from the motion and then return
    • so it also does not match the 1st step of deadlock pattern.
But seems previous engineers still write the same code for merge join.

The fix method is to prefetch inner when outer contain motion for Join Plan.

Planner introduces a lot of code to hint this.
ORCA's rule is so simple, always prefetch inner except for nestloop index scan plan.

I think we can take ORCA's idea, remove all code from planner and decide this runtime.
  • for hash join always prefetch inner,
  • for merge join keeps it as Postgres
  • for nestloop join:
    • if there is no params, always prefetch
    • if there is params, seems more need to dig into.
Anyway, current code for this part can work.

More on Interconnect UDP deadlock: when there come SubPlans
=====================================================

SubPlans can come easily: you can always write SQL in target list (not pull-up-able) or
in where condition or join condition. So when considering SubPlans, interconnect UDP
deadlock has nothing to do with Join.

So we can always create such kind of plan:
  • make a plan contains motion (no Sort or Material over it)
  • wrap it in a subquery and then write a SQL in target list
Thus there is an interesting TODO: go through all cases in Greenplum see all patterns of Motion.
Currently in my mind:
  • agg
  • join
  • update|delete with join
  • split update
  • subplan
  • table value expression (scatter clause?)
  • ....????

Anyway, for such kind of deadlock, prefetch_joinqual and prefetch_qual is not the correct way to fix.
We should focus on SubPlans. I open a PR to fix the issue: https://github.com/greenplum-db/gpdb/pull/15842
Here are some reminders before you submit the pull request Add tests for the change Document changes Communicate in the mailing list if needed Pass make installcheck Review a PR in return to ...
The idea is:
  • when the 1st time to execute a plannode
  • if it contains subPlans, prefetch all the subPlans
    • if it is hash subplan, build the hash table
    • for normal subplan, find all materaial over motion, execute then.

Any discussion and comments are welcomed!


0001-revert-prefetch-inner-joinqual-qual-logic.patch

Hongxu Ma

unread,
Jul 5, 2023, 12:21:29 AM7/5/23
to gpdb...@greenplum.org, Zhenghua Lyu
Read all contents of this document, look good for me.

And two small questions:
  • improvement idea of your PR "prefetch all subplans + material them", does this method increase the memory consumption of a query (more material nodes than before?)? not sure if we executed some of them by steam mode before.
  • do we consider writing an auxiliary script to check if deadlock happened on the hang processes of a query? like we did via GDB manually
    • when user find query hangs, they can run the script on the processes to verify if deadlock happens
    • it's helpful to collect more scenarios
    • (maybe it's not very hard to write, python + gdb lib helps)
Thanks!

From: 'Zhenghua Lyu' via Greenplum Developers <gpdb...@greenplum.org>
Sent: Sunday, July 2, 2023 9:20 PM
To: gpdb...@greenplum.org <gpdb...@greenplum.org>
Subject: On Interconnect UDP deadlock in Greenplum
 
!! External Email
!! External Email: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender.

Jesse Zhang

unread,
Jul 6, 2023, 3:08:37 AM7/6/23
to Zhenghua Lyu, gpdb...@greenplum.org
On Sun, Jul 2, 2023 at 6:20 AM 'Zhenghua Lyu' via Greenplum Developers
<gpdb...@greenplum.org> wrote:
>
>
> Also note, I believe Merge Join will never hit deadlock (even without prefetch). Reasons:
>
> Merge join has to make sure both side in order
> If outer plan contain motion, there must be a Sort over that motion
>
> Sort will fetch all tuples from the motion and then return
> so it also does not match the 1st step of deadlock pattern.
>
> But seems previous engineers still write the same code for merge join.
>

I agree with the conclusion but the reasoning might be more nuanced.
Note that it's also possible that one side of the merge join is over a
gather motion that merges sorted input (e.g. from an index scan). You
should (and you'll be able to) show that we won't need to worry about
motion hazard in this case either.

Zhenghua Lyu

unread,
Jul 6, 2023, 3:19:30 AM7/6/23
to pvtl-cont-sbjesse, gpdb...@greenplum.org
Hi Jesse,
     merge gather motion's gangsize is 1, so it is not possible to lead to motion deadlock.

From: Jesse Zhang <sbj...@gmail.com>
Sent: Thursday, July 6, 2023 3:07 PM
To: Zhenghua Lyu <zl...@vmware.com>
Cc: gpdb...@greenplum.org <gpdb...@greenplum.org>
Subject: Re: On Interconnect UDP deadlock in Greenplum
 
!! External Email
Reply all
Reply to author
Forward
0 new messages