Hi!
In PR 1583 [1], Shreedhar and I pondered the mechanism that we have in
the planner/executor to avoid so-called "deadlock hazards". If I
understood correctly, we keep track of whether there is a Motion node on
both sides of a join, and if so, we force one side of the join to be
fully materialized before fetching the first tuple from the other side.
This effectively serializes the Motion nodes, so that only one of them
is active at any time.
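For concreteness, this is the kind of situation I mean. The table names
below are made up, but assuming both tables are hash-distributed on
their first column, a join on the other columns would typically need a
Redistribute Motion on both sides (or a Broadcast on one side, depending
on costs):

create table a (dist_key integer, x integer) distributed by (dist_key);
create table b (dist_key integer, y integer) distributed by (dist_key);
explain select * from a, b where a.x = b.y;

With the prefetch mechanism, one of those Motions would be drained
completely before the first row is fetched from the other.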
There are comments explaining how we avoid that situation by forcing
the "prefetching" of one side of a join, and there's even a debugging
GUC, gp_enable_motion_deadlock_sanity, that performs additional checks
on plans and warns if a plan that poses this "deadlock hazard" is
nevertheless created.
However, I could not find an explanation anywhere of what the underlying
deadlock problem is! Anyone remember?
I've been trying to think about this, but I don't see any fundamental
reason why a join with a Motion node on both sides would deadlock. For
the giggles, I removed all the code related to deadlock hazards [2]. And
it seems to pass all the regression tests, except for one issue:
In the segspace test, with gp_cte_sharing=on, this query gets stuck:
> regression=# explain update foo set j=m.cc1 from (
> with ctesisc as
> (select * from testsisc order by i2)
> select t1.i1 as cc1, t1.i2 as cc2
> from ctesisc as t1, ctesisc as t2
> where t1.i1 = t2.i2 ) as m;
> QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------
>  Update  (slice0; segments: 3)  (rows=2 width=18)
>    ->  Explicit Redistribute Motion 3:3  (slice3; segments: 3)  (cost=2.70..3.91 rows=2 width=18)
>          ->  Nested Loop  (cost=2.70..3.91 rows=2 width=18)
>                ->  Seq Scan on foo  (cost=0.00..1.01 rows=1 width=14)
>                ->  Materialize  (cost=2.70..2.80 rows=4 width=4)
>                      ->  Broadcast Motion 3:3  (slice2; segments: 3)  (cost=2.26..2.69 rows=4 width=4)
>                            ->  Subquery Scan m  (cost=2.26..2.56 rows=2 width=4)
>                                  ->  Hash Join  (cost=2.26..2.52 rows=2 width=8)
>                                        Hash Cond: share0_ref2.i2 = share0_ref1.i1
>                                        ->  Redistribute Motion 3:3  (slice1; segments: 3)  (cost=1.02..1.25 rows=1 width=4)
>                                              Hash Key: share0_ref2.i2
>                                              ->  Shared Scan (share slice:id 1:0)  (cost=1.02..1.23 rows=1 width=8)
>                                        ->  Hash  (cost=1.23..1.23 rows=1 width=8)
>                                              ->  Shared Scan (share slice:id 2:0)  (cost=1.02..1.23 rows=1 width=8)
>                                                    ->  Sort  (cost=1.02..1.02 rows=1 width=8)
>                                                          Sort Key: public.testsisc.i2
>                                                          ->  Seq Scan on testsisc  (cost=0.00..1.01 rows=1 width=8)
>  Settings: gp_cte_sharing=on
>  Optimizer status: legacy query optimizer
> (19 rows)
It seems that the outer side of the join, with the first Shared Scan
node, returns 0 rows. Because of that, the Hash Join node never bothers
to execute the inner side - the join returns 0 rows regardless of the
inner side, if there are no rows from the outer side. So, the second
Shared Scan node is never executed. And the SharedInputScan code
doesn't seem to handle that too well: when shutting down the executor,
the "producer" side of the Shared Scan waits until the "consumer" side
has read all the data, but the consumer never even started, so the
producer waits forever.
So, there is that "deadlock hazard". However, I don't see why it's
restricted to joins. It seems like the same could happen in any plan
with a "cross-slice" Shared Scan node.
And indeed, I was able to construct a case that doesn't involve any
joins, and locks up indefinitely with an unmodified git checkout:
create table testsisc (i1 integer, i2 integer);
insert into testsisc values (1, 1);
set gp_cte_sharing=on;
with ctesisc as
(select * from testsisc order by i2)
select i1 from ctesisc where 1=2
union all
select count(*) from ctesisc t;
The plan looks like this:
> regression=# explain with ctesisc as
> (select * from testsisc order by i2)
> select i1 from ctesisc where 1=2
> union all
> select count(*) from ctesisc t;
> QUERY PLAN
> --------------------------------------------------------------------------------------------------------------
>  Gather Motion 3:1  (slice3; segments: 3)  (cost=1.02..2.55 rows=2 width=6)
>    ->  Append  (cost=1.02..2.55 rows=1 width=6)
>          ->  Subquery Scan "*SELECT* 1"  (cost=1.02..1.24 rows=1 width=4)
>                ->  Result  (cost=1.02..1.23 rows=1 width=4)
>                      One-Time Filter: false
>                      ->  Shared Scan (share slice:id 3:0)  (cost=1.02..1.23 rows=1 width=8)
>                            ->  Sort  (cost=1.02..1.02 rows=1 width=8)
>                                  Sort Key: public.testsisc.i2
>                                  ->  Seq Scan on testsisc  (cost=0.00..1.01 rows=1 width=8)
>          ->  Redistribute Motion 1:3  (slice2; segments: 1)  (cost=1.30..1.32 rows=1 width=8)
>                Hash Key: (count((count(*))))
>                ->  Aggregate  (cost=1.30..1.31 rows=1 width=8)
>                      ->  Gather Motion 3:1  (slice1; segments: 3)  (cost=1.23..1.28 rows=1 width=8)
>                            ->  Aggregate  (cost=1.23..1.24 rows=1 width=8)
>                                  ->  Subquery Scan t  (cost=1.02..1.23 rows=1 width=0)
>                                        ->  Shared Scan (share slice:id 1:0)  (cost=1.02..1.23 rows=1 width=8)
>  Settings: gp_cte_sharing=on
>  Optimizer status: legacy query optimizer
> (18 rows)
The trick here is similar to what the join did in the original query:
the Shared Scan in the first branch of the Append is never executed,
because of the One-Time Filter in the Result node.
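As a sanity check, I would expect the same query to work fine with
gp_cte_sharing=off (the default), because each reference to the CTE is
then planned separately and there is no Shared Scan involved at all:

set gp_cte_sharing=off;
with ctesisc as
(select * from testsisc order by i2)
select i1 from ctesisc where 1=2
union all
select count(*) from ctesisc t;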
So my conclusions are:
1. There is a bug with cross-slice Shared Scans in general that needs to
be fixed.
2. There is no particular danger with Motion nodes below joins.
Am I missing something?
[1] https://github.com/greenplum-db/gpdb/pull/1583
[2] https://github.com/hlinnaka/gpdb/tree/motion-hazard
- Heikki