On Interconnect UDP deadlock in Greenplum

117 views

Skip to first unread message

Zhenghua Lyu

unread,

Jul 2, 2023, 9:20:29 AM7/2/23

to gpdb...@greenplum.org

Hi all,

Interconnect UDP is the most widely used interconnect type in Greenplum. Interconnect UDP deadlock is

deadlock among QEs due to the current interconnect UDP's implementation. When it happens, it is super

hard to debug to even identify that is interconnect deadlock.

Historical Background

==================

Previously, Greenplum has done some work to get rid of interconnect deadlock. The following links are in

time order (from old times to now):

https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/gMa1tW0x_fk/m/wuzvGXBaBAAJ (the 1^st discussion in gpdb-dev mailing list: Deadlock hazards and joins)
https://github.com/greenplum-db/gpdb/pull/7492 (Github PR: Fix motion hazard between outer and joinqual #7492)
https://github.com/greenplum-db/gpdb/pull/11862 (Github PR: Prefetch NonJoinQual to avoid motion hazard. #11862)
https://github.com/greenplum-db/gpdb/issues/15719 (Github Issue: motion deadlock #15719)

The 1^st discussion of the above Link1 talks about the interconnect deadlock between join's outer and inner.

The 2nd,3rd,4^th links above are motion deadlock between outer and SubPlan, the difference is where the

SubPlans come from: join qual, plan qual, target list, check option (updatable view).

Introduction to Motion

==================

A QE executes the plan it gets from QD. The QE will figure out the Slice belong to it and only

executes that piece of plan. In Greenplum, a plan is cut into pieces by Motions. A QE might

fetch tuples from child slices (it is possible for a QE to have more than one child slices), and

do some computation and then send out the results to its parent slice. A slice can have multiple

child slices but at most one parent slice.

Motion Plan node's execution model is also one-tuple-a-time. Lets now consider when a motion

executor return NULL to identify that it finishes. When all possible sender procs send EOS to the

receiver proc, then ExecMotion on that specific QE will return NULL.

To fetch all tuples from a motion utile ExecMotion returns NULL is sort of like

the sync operation in MPI programming MPI_Barrier. Sort over motion, Material over motion

or Hash over motion might have this kind of effect.

Plan that only fetching one tuple from a motion is async: QEs in this slice might reach different

part of code path.

The above is key to understand motion deadlock.

Deadlock Pattern: Abstract Description

================================

Without any specific handling, the below is interconnect UDP deadlock pattern:

S is the slice that has more than one child slices: C1, C2, ...

S's executor algorithm match motion deadlock pattern if it is like:

1. fetch one single tuple from C1 (via Motion A)

2. and then fetch all tuples from C2 (via Motion B)

Why? What is the deadlock wait chain in the above pattern?

Lets discuss using the following symbols:

Slices: S, C1, C2 (C1, C2 are child slices of S).

S.x: the QE in Slice S on seg x

C1.x: the QE in Slice C1 on seg x

C2.x: the QE in Slice C2 on seg x

It is easy to build deadlock using the following steps:

make C1 only send data to seg1 and contains much data
make C1 only contains data on seg1, so only C1.1 will not be IDLE
make C2 only send data to seg0 and contains much data
make C2 only contains data on seg1, so only C2.1 will not be IDLE

Wait events in deadlock chain are:

S.0 is waiting for one tuple from Motion A

it cannot get any tuples because we know C1 only sends data to seg1
so S.0 is actually waiting for the end of Motion A
it is waiting for C1.1 to end
S.0 ----> C1.1

C1.1 is waiting for S.0's ACK, its sender buffer is full and thus blocked

C1.1 ----> S.0

S.0 already fetch one tuple from C1 (via motion A) and now it is fetching tuples from C2 (via motion B)

so it is waiting for C2.1 to end
S.0 --> C2.1

C2.1 is sending data to S.0 and its sender buffer is full it waits for ACK from S.0

C2.1 ----> S.0

S.0's main thread can now only be waked up by motion A's package

Deadlock happens!

Deadlock Cases: Concrete Examples

==============================

Current main branch top commit is: df114f0d3d2

To play with all the following cases, you can apply the patch (in

attachment: 0001-revert-prefetch-inner-joinqual-qual-logic.patch)

to revert previous prefetch fix method. Some of the cases (deadlock

involving SubPlan in target list and check option) do not need any

patches, just the above main branch has the deadlock bug.

GUCs setting:

turn off autovacuum and then set GUCs as below to help reproduce:

set gp_interconnect_queue_depth =1;
set gp_interconnect_snd_queue_depth =1;

set gp_autostats_mode = none;
set disable_cost = 1e20;

Play all cases under 3-seg demo-cluster.

##Case 1: Outer & Inner Deadlock (need to play with the revert prefetch patch)

create table t_motion_deadlock_1(a
int, b
int);create table t_motion_deadlock_2(a
int, b
int);

-- make t_motion_deadlock_1 only contains much data
-- in segment 1, and due to a = b, all data will also
-- be redistributed to segment 1 on b.
insert into t_motion_deadlock_1
select i,i
from generate_series(1,
30000)i;
delete from t_motion_deadlock_1
where gp_segment_id
<> 1;

-- the same skill to let t_motion_deadlock_2 only contain
-- much data on segment 0 and only redistributed t segment 0
-- by b.
insert into t_motion_deadlock_2
select i,i
from generate_series(1,
30000)i;
delete from t_motion_deadlock_2
where gp_segment_id
<> 0;
insert into t_motion_deadlock_2
select y.a, x.b
from t_motion_deadlock_1 x, t_motion_deadlock_2 y
limit 10;
-- below plan should redistribute both inner and outer
-- hash join
explain (costs off,
verbose)
select count(1)
from
  t_motion_deadlock_1 x,
  t_motion_deadlock_2 y
where x.b
= y.b;
                                  QUERY PLAN
------------------------------------------------------------------------------
 Finalize Aggregate
   Output:
count(1)
   ->  Gather Motion
3:1  (slice1; segments:
3)
         Output: (PARTIAL count(1))
         -> 
Partial Aggregate
               Output:
PARTIAL count(1)
               -> 
Hash Join
                     Hash Cond: (x.b
= y.b)
                     ->  Redistribute Motion
3:3  (slice2; segments:
3)
                           Output: x.b
                           Hash Key:
 x.b
                           ->  Seq Scan
on public.t_motion_deadlock_1 x
                                 Output: x.b
                     -> 
Hash
                           Output: y.b
                           ->  Redistribute Motion
3:3  (slice3; segments:
3)
                                 Output: y.b
                                 Hash Key:
 y.b
                                 ->  Seq Scan
on public.t_motion_deadlock_2 y
                                       Output: y.b
 Optimizer: Postgres query optimizer

Deadlock pattern:

Hash Join Slice's execute algorithm (without prefetch inner):

fetch one tuple from outer plan to see if any
if there is no tuple from outer plan, then no need to build hash table, just return NULL
if there is any tuple from outer, then fetch all tuples from inner to build the Hash Table

##Case 2: outer & subplan in joinqual

-- ==============================================-- outer plan & joinqual
-- ==============================================
create table t_motion_deadlock_1(a
int, b
int);
create table t_motion_deadlock_2(a
int, b
int);
create table t_motion_deadlock_3(a
int, b
int);

insert into t_motion_deadlock_1
select i,i
from generate_series(1,
10000)i;
delete from t_motion_deadlock_1
where gp_segment_id
<> 1;
insert into t_motion_deadlock_2
select i,i
from generate_series(1,
30)i;
insert into t_motion_deadlock_3
select i,i
from generate_series(1,
10000)i;

-- hash join
explain (costs off,
verbose)
select count(1)
from
  t_motion_deadlock_1 x join t_motion_deadlock_2 y
  on x.b
= y.a
and
     x.b + y.a
> (select count(1)
from t_motion_deadlock_3 z
where z.b
< x.a
+ y.b);

                                         QUERY PLAN
--------------------------------------------------------------------------------------------
 Finalize Aggregate
   Output:
count(1)
   ->  Gather Motion
3:1  (slice1; segments:
3)
         Output: (PARTIAL count(1))
         -> 
Partial Aggregate
               Output:
PARTIAL count(1)
               -> 
Hash Join
                     Hash Cond: (x.b
= y.a)
                     Join Filter:
 ((x.b + y.a)
> (SubPlan
1))
                     ->  Redistribute Motion
3:3  (slice2; segments:
3)
                           Output: x.b, x.a
                           Hash Key:
 x.b
                           ->  Seq Scan
on public.t_motion_deadlock_1 x
                                 Output: x.b, x.a
                     -> 
Hash
                           Output: y.a, y.b
                           ->  Seq Scan
on public.t_motion_deadlock_2 y
                                 Output: y.a, y.b
                     SubPlan 1
                       -> 
Aggregate
                             Output:
count(1)
                             ->  Result
                                   Filter: (z.b
< (x.a
+ y.b))
                                   ->  Materialize
                                         Output: z.b
                                         ->  Broadcast Motion
3:3  (slice3; segments:
3)
                                               Output: z.b
                                               ->  Seq Scan
on public.t_motion_deadlock_3 z
                                                     Output: z.b
 Optimizer: Postgres query optimizer

Deadlock pattern:

Hash Join Slice's execute algorithm (without prefetch joinqual):

fetch a tuple from outer's motion
match a tuple using hash join
check join qual (this will evaluate the SubPlan which contains a motion)

note the SubPlan has a materialize over broadcast to make it rescannable

## Case 3: outer & upper level plannode's plan qual

NOTE: this case does not need any patch, current main branch and 6X all will deadlock.

The table and data is just the same as Case 2, and the SQL is also the same, just using ORCA.

gpadmin=# explain (costs
off,
verbose)
select count(1)from
  t_motion_deadlock_1 x join t_motion_deadlock_2 y
  on x.b
= y.a
and
     x.b + y.a
> (select count(1)
from t_motion_deadlock_3 z
where z.b
< x.a
+ y.b);
                                                          QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------
 Finalize Aggregate
   Output:
count(1)
   ->  Gather Motion
3:1  (slice1; segments:
3)
         Output: (PARTIAL count(1))
         -> 
Partial Aggregate
               Output:
PARTIAL count(1)
               ->  Result
                     Filter: ((t_motion_deadlock_1.b
+ t_motion_deadlock_2.a)
> (SubPlan
1))
                     -> 
Hash Join
                           Output: t_motion_deadlock_1.a, t_motion_deadlock_1.b, t_motion_deadlock_2.a, t_motion_deadlock_2.b
                           Hash Cond: (t_motion_deadlock_1.b
= t_motion_deadlock_2.a)
                           ->  Redistribute Motion
3:3  (slice2; segments:
3)
                                 Output: t_motion_deadlock_1.a, t_motion_deadlock_1.b
                                 Hash Key:
 t_motion_deadlock_1.b
                                 ->  Seq Scan
on public.t_motion_deadlock_1
                                       Output: t_motion_deadlock_1.a, t_motion_deadlock_1.b
                           -> 
Hash
                                 Output: t_motion_deadlock_2.a, t_motion_deadlock_2.b
                                 ->  Seq Scan
on public.t_motion_deadlock_2
                                       Output: t_motion_deadlock_2.a, t_motion_deadlock_2.b
                     SubPlan 1
                       -> 
Aggregate
                             Output:
count(1)
                             ->  Result
                                   Filter: (t_motion_deadlock_3.b
< (t_motion_deadlock_1.a
+ t_motion_deadlock_2.b))
                                   ->  Materialize
                                         Output: t_motion_deadlock_3.b
                                         ->  Broadcast Motion
3:3  (slice3; segments:
3)
                                               Output: t_motion_deadlock_3.b
                                               ->  Seq Scan
on public.t_motion_deadlock_3
                                                     Output: t_motion_deadlock_3.b

 Optimizer: Pivotal Optimizer (GPORCA)

Deadlock pattern:

The Result plan (over a HashJoin, and contains a SubPlan in its filter) fetch a tuple from its left tree (Hash Join)
The HashJoin fetch one tuple from its outer via Motion
The Hash Join match a tuple and return to its parent (the Result)
The Result node evaluate its fitler (plan qual) which will evalute the SubPlan containing the motion

note the SubPlan has a materialize over broadcast to make it rescannable

## Case 4: outer & join's planqual (very very similar to the above case 3)

The table and data is just the same as Case 2.

explain (costs off,
verbose)select count(1)
from
  t_motion_deadlock_1 x left join t_motion_deadlock_2 y
  on x.b
= y.a
where
   x.a is null or exists (select random()
from t_motion_deadlock_3 z
where z.b
< x.a
+ y.b);
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Finalize Aggregate
   Output:
count(1)
   ->  Gather Motion
3:1  (slice1; segments:
3)
         Output: (PARTIAL count(1))
         -> 
Partial Aggregate
               Output:
PARTIAL count(1)
               -> 
Hash Left Join
                     Hash Cond: (x.b
= y.a)
                     Filter: ((x.a
IS NULL)
OR (SubPlan
1))
                     ->  Redistribute Motion
3:3  (slice3; segments:
3)
                           Output: x.b, x.a
                           Hash Key:
 x.b
                           ->  Seq Scan
on public.t_motion_deadlock_1 x
                                 Output: x.b, x.a
                     -> 
Hash
                           Output: y.a, y.b
                           ->  Seq Scan
on public.t_motion_deadlock_2 y
                                 Output: y.a, y.b
                     SubPlan 1
                       ->  Result
                             Filter: (z.b
< (x.a
+ y.b))
                             ->  Materialize
                                   Output: z.b
                                   ->  Broadcast Motion
3:3  (slice2; segments:
3)
                                         Output: z.b
                                         ->  Seq Scan
on public.t_motion_deadlock_3 z
                                               Output: z.b
 Optimizer: Postgres query optimizer

Deadlock pattern: it is very similar to the above case, just the SubPlan appear in HashJoin's plan qual (not join filter).

## Case 5: outer & SubPlan in Target List

create table t_motion_deadlock_1(a
int, b
int);create table t_motion_deadlock_2(a
int, b
int);
create table t_motion_deadlock_3(a
int, b
int);

insert into t_motion_deadlock_1
select i,i
from generate_series(1,
30000)i;
delete from t_motion_deadlock_1
where gp_segment_id
<> 1;
insert into t_motion_deadlock_2
select i,i
from generate_series(1,
30)i;
insert into t_motion_deadlock_3
select i,i
from generate_series(1,
10000)i;

-- hash join
explain (costs off,
verbose)
select
  (select count(1)
from t_motion_deadlock_3 z
where z.a
< x.a
+ y.b ) s

from t_motion_deadlock_1 x
join t_motion_deadlock_2 y
on x.b
= y.a;

                                   QUERY PLAN
--------------------------------------------------------------------------------
 Gather Motion 3:1  (slice1; segments: 3)
   Output: ((SubPlan 1))
   ->  Hash Join
         Output: (SubPlan 1)
         Hash Cond: (x.b = y.a)
         ->  Redistribute Motion 3:3  (slice3; segments: 3)
               Output: x.a, x.b
               Hash Key: x.b
               ->  Seq Scan on public.t_motion_deadlock_1 x
                     Output: x.a, x.b
         ->  Hash
               Output: y.b, y.a
               ->  Seq Scan on public.t_motion_deadlock_2 y
                     Output: y.b, y.a
         SubPlan 1
           ->  Aggregate
                 Output: count(1)
                 ->  Result
                       Filter: (z.a < (x.a + y.b))
                       ->  Materialize
                             Output: z.a
                             ->  Broadcast Motion 3:3  (slice2; segments: 3)
                                   Output: z.a
                                   ->  Seq Scan on public.t_motion_deadlock_3 z
                                         Output: z.a

 Optimizer: Postgres query optimizer

Deadlock pattern:

Hash Join fetch a tuple from its outer via Motion
match a tuple and return to its parent
The parent needs to evaluate the target list which contains a SubPlan with Motion

note the SubPlan has a materialize over broadcast to make it rescannable

## Case 6: outer & check option SubPlans

create table t1(a int);
create table t2(a int);

create table t3(a int);

create view mv as

select a from t1

where exists (select 1 from t2 where t2.a = t1.a)

with check option;

insert into t1 select generate_series(1, 30000);

delete from t1 where gp_segment_id <> 2;

insert into t2 select generate_series(1, 30000);

delete from t2 where gp_segment_id <> 2;

explain (costs off, verbose) update mv set a = 6;

explain (costs off, verbose) update mv set a = 6;
QUERY PLAN

------------------------------------------------------------------------

Update on public.t1

-> Explicit Redistribute Motion 3:3 (slice1; segments: 3)

Output: 6, t1.ctid, t1.gp_segment_id, t2.ctid, (DMLAction)

-> Split

Output: 6, t1.ctid, t1.gp_segment_id, t2.ctid, DMLAction

-> Hash Join

Output: 6, t1.ctid, t1.gp_segment_id, t2.ctid

Inner Unique: true

Hash Cond: (t1.a = t2.a)

-> Seq Scan on public.t1

Output: t1.ctid, t1.gp_segment_id, t1.a

-> Hash

Output: t2.ctid, t2.a

-> HashAggregate

Output: t2.ctid, t2.a

Group Key: t2.a

-> Seq Scan on public.t2

Output: t2.ctid, t2.a

SubPlan 1

-> Result

Filter: (t2_1.a = t1.a)

-> Materialize

Output: t2_1.a

-> Broadcast Motion 3:3 (slice2; segments: 3)

Output: t2_1.a

-> Seq Scan on public.t2 t2_1

Output: t2_1.a

SubPlan 2

-> Broadcast Motion 3:3 (slice3; segments: 3)

Output: t2_2.a

-> Seq Scan on public.t2 t2_2

Output: t2_2.a
Optimizer: Postgres query optimizer

Deadlock pattern:

the ModifyTable node fetch a tuple from its outer via motion
it check the tuple based on check options which are SubPlans containing motion

one is normal subplan, there is a material over the broadcast motion
the other is hash subplan, if the hash table is not built, it will firstly fetch all tuples from motion and build the hash table

More on Interconnect UDP deadlock: when No SubPlan at all

=================================================

When there is no subplan at all, Join Plan is the only plan that might fall into the deadlock pattern.

Recursive Union
Append Node
Sequence Node

All the above plan will all firstly finish a part of child plan and then turn to next, so they do not match

deadlock pattern's 1^st step : fetch one tuple from motion.

Also note, I believe Merge Join will never hit deadlock (even without prefetch). Reasons:

Merge join has to make sure both side in order
If outer plan contain motion, there must be a Sort over that motion

Sort will fetch all tuples from the motion and then return
so it also does not match the 1^st step of deadlock pattern.

But seems previous engineers still write the same code for merge join.

The fix method is to prefetch inner when outer contain motion for Join Plan.

Planner introduces a lot of code to hint this.

ORCA's rule is so simple, always prefetch inner except for nestloop index scan plan.

I think we can take ORCA's idea, remove all code from planner and decide this runtime.

for hash join always prefetch inner,
for merge join keeps it as Postgres
for nestloop join:

if there is no params, always prefetch
if there is params, seems more need to dig into.

Anyway, current code for this part can work.

More on Interconnect UDP deadlock: when there come SubPlans

=====================================================

SubPlans can come easily: you can always write SQL in target list (not pull-up-able) or

in where condition or join condition. So when considering SubPlans, interconnect UDP

deadlock has nothing to do with Join.

So we can always create such kind of plan:

make a plan contains motion (no Sort or Material over it)
wrap it in a subquery and then write a SQL in target list

Thus there is an interesting TODO: go through all cases in Greenplum see all patterns of Motion.

Currently in my mind:

agg
join
update|delete with join
split update
subplan
table value expression (scatter clause?)
....????

Anyway, for such kind of deadlock, prefetch_joinqual and prefetch_qual is not the correct way to fix.

We should focus on SubPlans. I open a PR to fix the issue: https://github.com/greenplum-db/gpdb/pull/15842

Refactor motion deadlock by kainwen · Pull Request #15842 · greenplum-db/gpdb

Here are some reminders before you submit the pull request Add tests for the change Document changes Communicate in the mailing list if needed Pass make installcheck Review a PR in return to ...

github.com

The idea is:

when the 1^st time to execute a plannode
if it contains subPlans, prefetch all the subPlans

if it is hash subplan, build the hash table
for normal subplan, find all materaial over motion, execute then.

Any discussion and comments are welcomed!

0001-revert-prefetch-inner-joinqual-qual-logic.patch

Hongxu Ma

unread,

Jul 5, 2023, 12:21:29 AM7/5/23

to gpdb...@greenplum.org, Zhenghua Lyu

Read all contents of this document, look good for me.

And two small questions:

improvement idea of your PR "prefetch all subplans + material them", does this method increase the memory consumption of a query (more material nodes than before?)? not sure if we executed some of them by steam mode before.
do we consider writing an auxiliary script to check if deadlock happened on the hang processes of a query? like we did via GDB manually

when user find query hangs, they can run the script on the processes to verify if deadlock happens
it's helpful to collect more scenarios
(maybe it's not very hard to write, python + gdb lib helps)

Thanks!

From: 'Zhenghua Lyu' via Greenplum Developers <gpdb...@greenplum.org>
Sent: Sunday, July 2, 2023 9:20 PM
To: gpdb...@greenplum.org <gpdb...@greenplum.org>
Subject: On Interconnect UDP deadlock in Greenplum

!! External Email

!! External Email: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender.

Jesse Zhang

unread,

Jul 6, 2023, 3:08:37 AM7/6/23

to Zhenghua Lyu, gpdb...@greenplum.org

On Sun, Jul 2, 2023 at 6:20 AM 'Zhenghua Lyu' via Greenplum Developers
<gpdb...@greenplum.org> wrote:
>
>
> Also note, I believe Merge Join will never hit deadlock (even without prefetch). Reasons:
>
> Merge join has to make sure both side in order
> If outer plan contain motion, there must be a Sort over that motion
>
> Sort will fetch all tuples from the motion and then return
> so it also does not match the 1st step of deadlock pattern.
>
> But seems previous engineers still write the same code for merge join.
>

I agree with the conclusion but the reasoning might be more nuanced.
Note that it's also possible that one side of the merge join is over a
gather motion that merges sorted input (e.g. from an index scan). You
should (and you'll be able to) show that we won't need to worry about
motion hazard in this case either.

Zhenghua Lyu

unread,

Jul 6, 2023, 3:19:30 AM7/6/23

to pvtl-cont-sbjesse, gpdb...@greenplum.org

Hi Jesse,

merge gather motion's gangsize is 1, so it is not possible to lead to motion deadlock.

From: Jesse Zhang <sbj...@gmail.com>
Sent: Thursday, July 6, 2023 3:07 PM
To: Zhenghua Lyu <zl...@vmware.com>
Cc: gpdb...@greenplum.org <gpdb...@greenplum.org>
Subject: Re: On Interconnect UDP deadlock in Greenplum