How to avoid performance regression of expanding a partition table?

268 views
Skip to first unread message

z...@pivotal.io

unread,
Nov 26, 2018, 8:05:09 AM11/26/18
to Greenplum Developers
Hi all,

     Now the MVP version gpdb is almost finished in the master branch of GPDB. (Several small tasks remained: a random numsegments pipeline job and fix github issue 6309).

     Anyway, the new gpexpand works now.

     It is a good time to talk about how to avoid a big performance regression of the new expand for handling partition tables compared to gpdb5.

     Now we expand a partition table including all the middle-level partitions and leaf partitions in one single transaction. It is a very simple implementation but has three big drawbacks:
     1. expand all the tables in one transaction, the rollback cost is so big(suppose we have expanded 99 tables, just left one leaf partition to finish all the job, it fails, then the users have to
         rollback all the tables
     2. expand table has to acquire the access exclusive lock on the root table, so any read/write to any partitions(root, middle-level, leaf) is not allowed.
     3. In GPDB5, we expand each leaf partition one by one and each in a single transaction. So we can concurrently expand several leaf partitions in the meantime. And in current master branch's code,
         we can only expand each leaf serially

    I think the performance problem is very important and we have to consider improving it.
    MPP team have looked into this problem and have a proposal for this problem. Our solution is posted at https://groups.google.com/a/greenplum.org/forum/#!topic/gpdb-dev/IoLekI-jz1Q
   
    To emphasize the special part of expanding a partition table, I think it is better to open a new thread and I paste MPP team's solution here.
   
    Any idea to solve the performance regression problem of expanding a partition table is appreciated, thanks in advance!

    ====================

   The core idea is to divide the process into two steps.
 
   T is a partition table, R is its root, and M1, M2, …, Mk is its middle-level partition tables, and L1, L2, L3, …, Ln is its leaf partitions. Then:

Lemma: If R, M1, M2, …, Mk are hash-distributed and L1, L2, L3, …, Ln is either hash-distributed or randomly distributed. We can correctly do URD operations on the parition table. But if R is randomly distributed and L1, L2, L3, …, Ln is either hash-distributed or randomly distributed, things will go wrong.


Proof: insert|select|update|delete for partiton tables are always dispatch to all segments. And insert will generate a redistributed motion which computes the target segment id using the root’s distributed columns(This is the key point why things never go wrong). But if the root table is randomly distributed, the redistribtued motion will compute target segment id randomly, things might go wrong if the target partition is hash-distributed.


Add a new syntax and its semantic is shown as below:

  1. Syntax: Alter table {tabname} set with(reorganize=true|false) expand table;

  2. The options in with clause can only contain at most one option, and if provided, the option’s name can only be ‘reorganize’(this is the semantic in GPDB4 and 5, we keep it same)

  3. if the table is partition table and it is the root

    1. Check ‘reorganize’ can only be false, otherwise throw error

    2. If it is a partital table, we suppose partiton tables each subpart has only one consistent numsegments, and we do following two things(in a single transaction, and holds Access ExclusiveLock for all the tables involved):

      1. For root and its middle-level tables, increase the table policy’s numsegments to make it a full table, do not change any other fields in this table’s policy

      2. For all its leaf tables, increase the table policy’s numsegments to make it a full table, and if it is originally hash-distributed, change it to randomly-distributed, or do nothing to any other fields in this table’s policy

    3. Else, throw an warning that this table does not need reshuffle

  4. Else if Else if the table is partition table and it is not the root

    1. If the table is partital table, throw an error|warning to notice user that should use this statement for root table

    2. Else it is a full table, throw an error|warning that no need to expand


Currently, the first stage of gpexpand in online, it does not need restart the cluster and the cluster is always available(except for modifying the catalog, this is not allowed). And thanks to the numsegments field in gp_distribution_policy, we do not need change hash-distributed table’s policy to randomly distribtued, so that planner might take this advantage to generate plans without motion.


At the second stage of gpexpand

  • for each non-partion table, gpexpand simply invokes alter table {tab} set (reorganize=true) expand table to rebalance data and increase the table’s numsegments in policy(but do not change any other fields in policy)

  • For partion table, the task is complex if we want to guarantee performance

    • In first transaction, invoke  alter table {tab} set (reorganize=true) expand table  on root table to finish expand process

    • After the first transaction, the root|middle|leaf tables are all full tables, so to rebalance data is the job of alter table set with(reorganize=true) distributed by (c1, ...) we invoke this statement for each leaf table one by one and each in a single transaction to rebalance their data and change their policy to the same policy as the root’s.



Heikki Linnakangas

unread,
Nov 26, 2018, 8:33:30 AM11/26/18
to z...@pivotal.io, Greenplum Developers
On 26/11/2018 15:05, z...@pivotal.io wrote:
> Lemma: If R, M1, M2, …, Mk are hash-distributed and L1, L2, L3, …, Ln is
> either hash-distributed or randomly distributed. We can correctly do URD
> operations on the parition table. But if R is randomly distributed and
> L1, L2, L3, …, Ln is either hash-distributed or randomly distributed,
> things will go wrong.

Things also go wrong, if you insert rows directly to the partition.

In general, I don't like this idea that the root partition and leafs can
have different distribution keys, but we have to somehow guarantee that
the parent's distribution key is always "compatible" with the child's
key. Life would be much simpler, if you only ever had to look at the
distribution key of the actual partition you're inserting to.

If we're going to support the scenario that different partitions have
different distribution policies, then let's bite the bullet and make
sure that everything works when the distribution policies are different.
In fact, I don't think there's any limitation in the planner/executor
today, that the partitions would need to have matching distribution
keys. If there is, let's just fix that.

So my thinking is that we don't need any extra syntax. To expand a
partitioned table:

1. ALTER TABLE ONLY root_partition EXPAND TABLE

This changes the gp_distribution_policy of the root partition. Since
there's no data in it, this finishes quickly. Partitions are unmodified.
This doesn't really affect planning or much else, except that it sets
the policy for any new partitions that you create.

2. For each partition: ALTER TABLE partition EXPAND TABLE

Moves the data, and updates gp_distribution_policy, for that partition.

- Heikki

z...@pivotal.io

unread,
Nov 26, 2018, 8:56:38 AM11/26/18
to Greenplum Developers, z...@pivotal.io
Hi Heikki, thanks for your reply.

Things also go wrong, if you insert rows directly to the partition. 
 
Directly insert into the partition is correct since the leaf partition's policy is random.  What is the case that causes things wrong?

So my thinking is that we don't need any extra syntax. To expand a 
partitioned table: 
1. ALTER TABLE ONLY root_partition EXPAND TABLE 
 
This changes the gp_distribution_policy of the root partition. Since 
there's no data in it, this finishes quickly. Partitions are unmodified. 
This doesn't really affect planning or much else, except that it sets 
the policy for any new partitions that you create. 
2. For each partition: ALTER TABLE partition EXPAND TABLE 
Moves the data, and updates gp_distribution_policy, for that partition. 

I am not sure if I correctly understand your method, but there is a counter-example mentioned in my previous proposal:  

Suppose old cluster contains n segments, and we add m new segments(so in the end we have n + m segments),

The first step will set the root partition's numsegments  to the new cluster size n+m.

Then suppose between the first and second steps, user does the following queries:

1. insert a tuple into the root that will fall into the newly added segments -- insert's plan is based on root's policy
2. user cannot select the just inserted tuples because the scan slice is not dispatched to newly added segments.


gpadmin=# CREATE TABLE rank (id int, year int)
gpadmin
-# DISTRIBUTED BY (id)
gpadmin
-# PARTITION BY RANGE (year)
gpadmin
-# ( START (2006) END (2016) EVERY (1),
gpadmin
(#   DEFAULT PARTITION extra );
NOTICE
:  CREATE TABLE will create partition "rank_1_prt_extra" for table "rank"
NOTICE
:  CREATE TABLE will create partition "rank_1_prt_2" for table "rank"
NOTICE
:  CREATE TABLE will create partition "rank_1_prt_3" for table "rank"
NOTICE
:  CREATE TABLE will create partition "rank_1_prt_4" for table "rank"
NOTICE
:  CREATE TABLE will create partition "rank_1_prt_5" for table "rank"
NOTICE
:  CREATE TABLE will create partition "rank_1_prt_6" for table "rank"
NOTICE
:  CREATE TABLE will create partition "rank_1_prt_7" for table "rank"
NOTICE
:  CREATE TABLE will create partition "rank_1_prt_8" for table "rank"
NOTICE
:  CREATE TABLE will create partition "rank_1_prt_9" for table "rank"
NOTICE
:  CREATE TABLE will create partition "rank_1_prt_10" for table "rank"
NOTICE
:  CREATE TABLE will create partition "rank_1_prt_11" for table "rank"
CREATE TABLE
gpadmin
=# set Test_print_direct_dispatch_info = on;
SET
gpadmin
=# set allow_system_table_mods = on;
SET
gpadmin
=# update gp_distribution_policy set numsegments = 2 ;
UPDATE
12
gpadmin
=# update gp_distribution_policy set numsegments = 3 where localoid = 'rank'::regclass;
UPDATE
1
gpadmin
=# select relname, g.* from pg_class p, gp_distribution_policy g where g.localoid = p.oid;
     relname      
| localoid | attrnums | policytype | numsegments
------------------+----------+----------+------------+-------------
 rank_1_prt_5    
|    32789 | {1}      | p          |           2
 rank_1_prt_6    
|    32793 | {1}      | p          |           2
 rank_1_prt_7    
|    32797 | {1}      | p          |           2
 rank_1_prt_8    
|    32801 | {1}      | p          |           2
 rank_1_prt_9    
|    32805 | {1}      | p          |           2
 rank            
|    32769 | {1}      | p          |           3
 rank_1_prt_extra
|    32772 | {1}      | p          |           2
 rank_1_prt_2    
|    32777 | {1}      | p          |           2
 rank_1_prt_3    
|    32781 | {1}      | p          |           2
 rank_1_prt_4    
|    32785 | {1}      | p          |           2
 rank_1_prt_10    
|    32809 | {1}      | p          |           2
 rank_1_prt_11    
|    32813 | {1}      | p          |           2
(12 rows)

gpadmin=# explain insert into rank values (1, 2009);
                    QUERY PLAN
--------------------------------------------------
 Insert on rank  (cost=0.00..0.01 rows=1 width=0)
   ->  Result  (cost=0.00..0.01 rows=1 width=0)
 Planning time: 2.772 ms
 Optimizer: legacy query optimizer
(4 rows)

gpadmin=# select * from rank;
INFO:  (slice 1) Dispatch command to PARTIAL contents: 0 1
 id | year
----+------
(0 rows)

gpadmin=# insert into rank values (1, 2009);
INFO:  (slice 0) Dispatch command to SINGLE content
INFO:  Distributed transaction command 'Distributed Prepare' to SINGLE content
INFO:  Distributed transaction command 'Distributed Commit Prepared' to SINGLE content
INSERT 0 1
gpadmin=# select * from rank;
INFO:  (slice 1) Dispatch command to PARTIAL contents: 0 1
 id | year
----+------
(0 rows)

Heikki Linnakangas

unread,
Nov 26, 2018, 9:33:23 AM11/26/18
to z...@pivotal.io, Greenplum Developers
On 26/11/2018 15:56, z...@pivotal.io wrote:
> Hi Heikki, thanks for your reply.
>
> Things also go wrong, if you insert rows directly to the partition.
>
> Directly insert into the partition is correct since the leaf partition's
> policy is random.  What is the case that causes things wrong?

Oh, right. I was thinking that it's important for correctness, that new
rows are inserted to segment that would be correct under the new
distribution policy. But it's not, so that's fine.

> So my thinking is that we don't need any extra syntax. To expand a
> partitioned table:
> 1. ALTER TABLE ONLY root_partition EXPAND TABLE
>
> This changes the gp_distribution_policy of the root partition. Since
> there's no data in it, this finishes quickly. Partitions are
> unmodified.
> This doesn't really affect planning or much else, except that it sets
> the policy for any new partitions that you create.
> 2. For each partition: ALTER TABLE partition EXPAND TABLE
> Moves the data, and updates gp_distribution_policy, for that partition.
>
> I am not sure if I correctly understand your method, but there is a
> counter-example mentioned in my previous proposal:
>
> Suppose old cluster contains n segments, and we add m new segments(so in
> the end we have n + m segments),
>
> The first step will set the root partition's numsegments  to the new
> cluster size n+m.
>
> Then suppose between the first and second steps, user does the following
> queries:
>
> 1. insert a tuple into the root that will fall into the newly added
> segments -- insert's plan is based on root's policy

Hmm. I would expect the insert's plan to be based on the distribution
policy of the partition that the tuple belongs to. Not on the policy of
the root partition. But I guess the executor isn't wired that way,
currently. How invasive would it be to fix that?

- Heikki
Message has been deleted

z...@pivotal.io

unread,
Nov 26, 2018, 9:53:01 AM11/26/18
to Greenplum Developers, z...@pivotal.io
Hi Heikki,

Hmm. I would expect the insert's plan to be based on the distribution 
policy of the partition that the tuple belongs to. Not on the policy of 
the root partition. But I guess the executor isn't wired that way, 
currently. How invasive would it be to fix that? 

I don't think it can be easily fixed.

Think of a redistributed motion version plan. 

gpadmin=# explain insert into rank values (1, 2009), (2, 2010), (3, 1000);
                                       QUERY PLAN
----------------------------------------------------------------------------------------
 
Insert on rank  (cost=0.00..0.04 rows=1 width=8)
   
->  Redistribute Motion 1:3  (slice1; segments: 1)  (cost=0.00..0.04 rows=3 width=8)
         
Hash Key: "*VALUES*".column1
         
->  Values Scan on "*VALUES*"  (cost=0.00..0.04 rows=1 width=8)
 
Planning time: 0.315 ms
 
Optimizer: legacy query optimizer
(6 rows) 


    
The values scan is often(I don't check whether it can be on entry) spawned on segment(a single QE).

And on the segment, we do not maintain partition info in the catalog, that is why pengzhou's PR (https://github.com/greenplum-db/gpdb/commit/cfe3f386fb5b29e18641ea70f162def30ff1e21c#diff-2b1247515e9aec2cf5fc7b716eb677c7R14585) dispatch a bitmap of partition info to the QE.

Mike Roth

unread,
Nov 26, 2018, 9:54:15 AM11/26/18
to Heikki Linnakangas, z...@pivotal.io, Greenplum Developers, Bhuvnesh Chaudhary, Venkatesh Raghavan


On Nov 26, 2018, at 9:33 AM, Heikki Linnakangas <hlinna...@pivotal.io> wrote:

Hmm. I would expect the insert's plan to be based on the distribution policy of the partition that the tuple belongs to. Not on the policy of the root partition. But I guess the executor isn't wired that way, currently. How invasive would it be to fix that?

Isn’t this also going to be different for ORCA?  Planner uses the individual leaf partitions however ORCA IIRC will only look at the root partition (does so for dynamic partition elimination).

Can I ask someone familiar with ORCA to chime in here?

~ mike


z...@pivotal.io

unread,
Nov 26, 2018, 9:59:51 AM11/26/18
to Greenplum Developers, hlinna...@pivotal.io, z...@pivotal.io, bchau...@pivotal.io, vrag...@pivotal.io
Isn’t this also going to be different for ORCA?  Planner uses the individual leaf partitions however ORCA IIRC will only look at the root partition (does so for dynamic partition elimination).

Can I ask someone familiar with ORCA to chime in here?

One thing that needs to keep in mind: currently, orca cannot deal with partial tables, it will fallback to the planner.

I test the plan generated by ORCA

gpadmin=# explain insert into rank values (1, 2009);
                                QUERY PLAN
---------------------------------------------------------------------------
 
Insert  (cost=0.00..0.02 rows=1 width=8)
   
->  Result  (cost=0.00..0.00 rows=1 width=12)
         
->  Partition Selector for rank  (cost=0.00..0.00 rows=1 width=8)
               
->  Result  (cost=0.00..0.00 rows=1 width=8)
                     
->  Result  (cost=0.00..0.00 rows=1 width=8)
                           
->  Result  (cost=0.00..0.00 rows=1 width=1)
 
Planning time: 46.522 ms
 
Optimizer: PQO version 3.9.0
(8 rows)



gpadmin
=# explain insert into rank values (1, 2009), (2, 2010), (3, 1000);
                                    QUERY PLAN
-----------------------------------------------------------------------------------

 
Insert  (cost=0.00..0.05 rows=1 width=8)
   
->  Result  (cost=0.00..0.00 rows=1 width=12)
         
->  Partition Selector for rank  (cost=0.00..0.00 rows=1 width=8)
               
->  Result  (cost=0.00..0.00 rows=1 width=8)
                     
->  Values Scan on "Values"  (cost=0.00..0.00 rows=1 width=8)
 
Planning time: 10.756 ms
 
Optimizer: PQO version 3.9.0
(7 rows)


It is different from the planner. 

Want to hear more advice from ORCA experts. Thanks!

Heikki Linnakangas

unread,
Nov 26, 2018, 10:06:08 AM11/26/18
to z...@pivotal.io, Greenplum Developers
On 26/11/2018 16:52, z...@pivotal.io wrote:
> Hmm. I would expect the insert's plan to be based on the distribution
> policy of the partition that the tuple belongs to. Not on the policy of
> the root partition. But I guess the executor isn't wired that way,
> currently. How invasive would it be to fix that?
>
>
> I don't think it can be easily fixed.

Too bad :-(

> Think of a redistributed motion version plan.
>
> |
> gpadmin=# explain insert into rank values (1, 2009), (2, 2010), (3, 1000);
>                                        QUERY PLAN
> ----------------------------------------------------------------------------------------
> Inserton rank (cost=0.00..0.04rows=1width=8)
> ->RedistributeMotion1:3(slice1;segments:1)(cost=0.00..0.04rows=3width=8)
> HashKey:"*VALUES*".column1
> ->ValuesScanon "*VALUES*"(cost=0.00..0.04rows=1width=8)
> Planningtime:0.315ms
> Optimizer:legacy query optimizer
> (6rows)
>
> |
>
> The values scan is often(I don't check whether it can be on entry)
> spawned on segment(a single QE).

Yes, the Redistribute Motion would need to determine which partition the
tuple belongs to, and compute the target segment based on that.

> And on the segment, we do not maintain partition info in the catalog,
> that is why pengzhou's PR
> (https://github.com/greenplum-db/gpdb/commit/cfe3f386fb5b29e18641ea70f162def30ff1e21c#diff-2b1247515e9aec2cf5fc7b716eb677c7R14585)
> dispatch a bitmap of partition info to the QE.

The executor has the partition information available at execution time,
in the segments too. It's passed along with the query plan from the QD.
It needs it in a plan like above, in any case, to determine which
partition to insert to.

- Heikki

Robert Eckhardt

unread,
Nov 26, 2018, 10:23:26 AM11/26/18
to Zhenghua Lyu, gpdb...@greenplum.org
On Mon, Nov 26, 2018 at 8:05 AM <z...@pivotal.io> wrote:
>
> Hi all,
>
> Now the MVP version gpdb is almost finished in the master branch of GPDB. (Several small tasks remained: a random numsegments pipeline job and fix github issue 6309).
>
> Anyway, the new gpexpand works now.
>
> It is a good time to talk about how to avoid a big performance regression of the new expand for handling partition tables compared to gpdb5.

I'm not sure I'd be ready to put this in bold quite yet. Clearly this
behavior is less than ideal, however, it will be fixed automatically
when we get parallel execution. Furthermore, we are getting gains from
not redistributing the entire table which can't be discounted.
Reducing the movement might in fact more than make up for the level of
parallelism we had.

Questions:
1) what is the performance now? is there some data set you could put
some numbers to?
2) why do you think this is going to be an issue?

From my perspective I would like the transition period to be safe and
as fast as possible.

I dislike the idea of having different distributions with in the same
partition hierarchy. I don't really have a good reason for this, it
just seems complicated and prone to error.

It seems like we might be breaking a bunch of assumptions that both
Orca and the planner have. If that is a temporary state then ok, as
long as it is accounted for. If it is a long term state I have bigger
worries about performance and other such matters while in the
intermediate state.

-- Rob

Heikki Linnakangas

unread,
Nov 26, 2018, 10:42:52 AM11/26/18
to z...@pivotal.io, Greenplum Developers
So, if we're going to take advantage of the intermediate state, where
the root partition is marked as Hash distributed, and the leafs are a
mix of random and hash distributed, let's take a closer look at the
proposed syntax:

On 26/11/2018 15:05, z...@pivotal.io wrote:
> Add a new syntax and its semantic is shown as below:
>
> 1. Syntax: Alter table {tabname} set with(reorganize=true|false) expand table;
>
> 2. The options in with clause can only contain at most one option, and if
> provided, the option’s name can only be ‘reorganize’(this is the semantic
> in GPDB4 and 5, we keep it same)
>
> 3. if the table is partition table and it is the root
>
> 1. Check ‘reorganize’ can only be false, otherwise throw error
>
> 2. If it is a partital table, we suppose partiton tables each subpart has
> only one consistent numsegments, and we do following two things(in a
> single transaction, and holds Access ExclusiveLock for all the tables
> involved):
>
> 1. For root and its middle-level tables, increase the table policy’s
> numsegments to make it a full table, do not change any other
> fields in this table’s policy
>
> 2. For all its leaf tables, increase the table policy’s numsegments
> to make it a full table, and if it is originally hash-distributed,
> change it to randomly-distributed, or do nothing to any other
> fields in this table’s policy
>
> 3. Else, throw an warning that this table does not need reshuffle
>
> 4. Else if Else if the table is partition table and it is not the root
>
> 1. If the table is partital table, throw an error|warning to notice
> user that should use this statement for root table
>
> 2. Else it is a full table, throw an error|warning that no need to
> expand

These semantics seem very complicated. I don't like how it works
differently with root and leaf partitions. I don't really understand
what the reorganize flag means. I thought it meant "force redistribution
of data, even if the change wouldn't require it". Which might be useful
to rebalance data in a randomly distributed table, or to recover if data
is distributed incorrectly for some reason.

Here's a counter-proposal, how I'd envision this to work. I'm not sure
how much this differs from what you're proposing. If it's the same
thing, I hope this is a simpler way of stating it.


Parent-Child Distribution Policy Invariant
------------------------------------------

If a partitioned table is Hash distributed, then all its leaf partitions
must also be Hash partitioned on the same distribution key, with the
same 'numsegments', or randomly distributed.

If a partitioned table is Randomly distributed, then all the leafs must
be leaf partitioned as well.

We maintain this invariant in all ALTER TABLE commands. We already check
this in ALTER TABLE SET DISTRIBUTED BY, so this isn't new.


ALTER TABLE <table> EXPAND TABLE
--------------------------------

Sets gp_distribution_policy.numsegments of the table to match the
current cluster size, and moves all data. If the table is a partitioned
table, recurses to all partitions.


ALTER TABLE ONLY <table> EXPAND TABLE
-------------------------------------

Sets gp_distribution_policy.numsegments of the table to match the
current cluster size, and moves all data. Does *not* recurse to partitions.

If <table> is a partitioned table, or an individual partition belonging
to a partitioned table, and the operation would violate the Parent-Child
Distribution Policy Invariant, throw an error.



Expanding a partitioned table with these commands
-------------------------------------------------

You could just run "ALTER TABLE root EXPAND TABLE", and have it recurse
to all the leafs. But if you want to do it in smaller pieces, you can:

1. For each leaf:
ALTER TABLE ONLY <leaf> SET DISTRIBUTED RANDOMLY

2. ALTER TABLE <root> EXPAND TABLE

3. For each leaf:
ALTER TABLE ONLY <leaf> SET DISTRIBUTED BY (<root's distribution key>)



The state after 1 is valid and doesn't break the Parent-Child
Distribution Policy Invariant, because children are allowed to be
randomly distributed, regardless of the parent.

The state after 2 is valid, because the command recurses to all
children, so the root and all the partitions will get the same
'numsegments'. Because all the leafs are randomly distributed, this
command doesn't move any data, and finishes quickly.

Step 3 moves all the data, one partition at a time.

- Heikki

Zhenghua Lyu

unread,
Nov 26, 2018, 10:55:21 AM11/26/18
to Heikki Linnakangas, Greenplum Developers
Hi heikki

You solution is just what I state in my proposal.

We might combine 1 and 2 into a single txn?

Thanks.
--
Best Regards,
Zhenghua Lyu

Ashwin Agrawal

unread,
Nov 26, 2018, 1:16:08 PM11/26/18
to Zhenghua Lyu, Heikki Linnakangas, gpdb...@greenplum.org

It would be extremely helpful to follow along the discussion if the table/matrix can be portraited to clarify exactly when (means for all operations and direct child access vs via root) root distribution policy or intermediate root's policy or child's policy is used.

Zhenghua Lyu

unread,
Nov 26, 2018, 8:17:07 PM11/26/18
to Greenplum Developers

Best Regards,
Zhenghua Lyu


---------- Forwarded message ---------
From: Zhenghua Lyu <z...@pivotal.io>
Date: Tue, Nov 27, 2018 at 9:16 AM
Subject: Re: How to avoid performance regression of expanding a partition table?
To: Heikki Linnakangas <hlinna...@pivotal.io>


1. For each leaf: 
    ALTER TABLE ONLY <leaf> SET DISTRIBUTED RANDOMLY
2. ALTER TABLE <root> EXPAND TABLE 

Sorry, I missed something. 

Just setting the leaf's policy to random but does not change its numsegments to the new cluster size is not enough. 

Currently, set-distributed-by-statement does not change numsegments. 

after the first two steps, if leafs' numsegments are not updated, the counter-example of insert still exists.


Best Regards,
Zhenghua Lyu

Heikki Linnakangas

unread,
Nov 27, 2018, 1:45:49 AM11/27/18
to Zhenghua Lyu, Greenplum Developers
On 27/11/2018 03:16, Zhenghua Lyu wrote:
> 1. For each leaf:
>
>     ALTER TABLE ONLY <leaf> SET DISTRIBUTED RANDOMLY
>
> 2. ALTER TABLE <root> EXPAND TABLE
>
>
> Sorry, I missed something.
>
> Just setting the leaf's policy to random but does not change its
> numsegments to the new cluster size is not enough.

It's subtle. Remember that ALTER TABLE, without ONLY, recurses to all
children. The "ALTER TABLE <root> EXPAND TABLE" command in step 2 sets
changes numsegments on all the leaf's, too.

The dance that gpexpand is doing with these commands is subtle. But the
commands are consistent and make sense when you look at each command
individually.

- Heikki

Zhenghua Lyu

unread,
Nov 27, 2018, 2:00:20 AM11/27/18
to Heikki Linnakangas, Greenplum Developers
I got. We reach an agreement. Thanks!

Step 2 will only change numsegments without moving any data. 

Step1 and 2 is just the code implemented by pengzhou in his PR(last week).

Thanks.

Best Regards,
Zhenghua Lyu

z...@pivotal.io

unread,
Nov 27, 2018, 8:05:01 PM11/27/18
to Greenplum Developers, z...@pivotal.io
Hi Rob,

Questions: 
1) what is the performance now? is there some data set you could put 
some numbers to? 
2) why do you think this is going to be an issue? 

Thanks for your reply and questions.

MPP team is busy with tests and other jobs. Will answer your questions later. Sorry for that.

Thanks.

Scott Kahler

unread,
Nov 29, 2018, 1:25:14 PM11/29/18
to Zhenghua Lyu, gpdb...@greenplum.org
In the current expand when you expand a table, going from random back to distributed. I believe the process is:

being
read tuples from disk
shuffle tuples to appropriate segment
write tuples to disk
swap old table for new table
drop old table
commit

is that correct?

with jump consistent hashing does that process change such that we 

begin
read only tuple that need to be rehashed
send tuples to appropriate segment
insert tuples into table
delete moved tuples from current table
commit

--

Scott Kahler | Pivotal, Greenplum Product Management  | ska...@pivotal.io | 816.237.0610

Jesse Zhang

unread,
Nov 29, 2018, 1:33:21 PM11/29/18
to Scott Kahler, z...@pivotal.io, gpdb...@greenplum.org
We still read all tuples, but we only send the ones that need moving for insert / delete. There should be a minimal portion if data that needs movement. This is still transactional (if the process is interrupted for any reason, the pre-expasion data is visible) and should always result in less overall I/O than swapping in a new table.

Cheers,
Jesse

--
You received this message because you are subscribed to the Google Groups "Greenplum Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gpdb-dev+u...@greenplum.org.

Ning Yu

unread,
Nov 29, 2018, 7:42:12 PM11/29/18
to Jesse Zhang, Scott Kahler, Zhenghua Lv, gpdb...@greenplum.org
Also want to put some in-theory data on how many data are transferred when expanding a table from M segments to N segments.

For traditional reorganization:
- each new segment transfer in 1/N data;
- each old segment transfer in 1/N data;
- each old segment transfer out 1/M data;
- all segments transfer out 1/1 data in total;

For new reshuffle:
- each new segment transfer in 1/N data;
- each old segment transfer out 1/M - 1/N = (N - M) / MN data;
- all segments transfer out (N - M) / N data in total;

We can see that the load on new segments are the same in the two approaches, but reshuffle has less load on old segments.

Ashwin Agrawal

unread,
Jun 9, 2021, 1:31:16 PM6/9/21
to Greenplum Developers, Mike Roth, zlyu, Greenplum Developers, Bhuvnesh Chaudhary, Venkatesh Raghavan, Heikki Linnakangas
Discussed with Bhuvnesh and Team to get details from ORCA side for this finally.
Here is the response:
If the child and root distribution mismatches, the question was, would orca produce wrong results.
ORCA continues to produce correct results, because we will consider the distribution to be random and add motions accordingly if required. 

ORCA does not fallback if the parent and child distribution is mismatched. However, if there is a mismatch in the parent and child partitions (i,e parent is hash distributed and child is randomly distributed), ORCA considers the table to be randomly distributed and it will ensure correct
results.


Details:
CREATE TABLE p1 (a int, b int) partition by range(b) (start(1) end(4) every(1));

set allow_system_table_mods = on;
1) Set numsegments to the size of the newly expanded cluster for all the tables including root
update gp_distribution_policy set numsegments = 4 where localoid in (select localoid from gp_distribution_policy);

2) Set distribution policy from the childs to be random (now the parent has hash distribution and child has randomly)
update gp_distribution_policy set distkey = '', distclass = '' where localoid in (select localoid from gp_distribution_policy where localoid::regclass::text like 'p1_%');

3) Insert data to root partitioned table, data is tributed by hash and goes to 1 segment
INSERT INTO p1 select 1, 1 from generate_series(1,1000)i;
bchaudhary=# SELECT DISTINCT gp_segment_id, a FROM p1;
gp_segment_id | a
---------------+---
1 | 1

4) Insert data to child partitioned table, data is distributed randomly
INSERT INTO p1_1_prt_1 select 1, 1 from generate_series(1,1000)i;
SELECT DISTINCT gp_segment_id, a FROM p1;
bchaudhary=# SELECT DISTINCT gp_segment_id, a FROM p1;
gp_segment_id | a
---------------+---
2 | 1
0 | 1
3 | 1
1 | 1

Consider this plan: It adds redistribution motion on top of p1 since there is a mismatch of distribution between parent and root.

bchaudhary=# explain select * from p1 p11, p1 p12 where p11.a=p12.a;
NOTICE: One or more columns in the following table(s) do not have statistics: p1
HINT: For non-partitioned tables, run analyze <table_name>(<column_list>). For partitioned tables, run analyze rootpartition <table_name>(<column_list>). See log for columns missing statistics.
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
Gather Motion 4:1 (slice3; segments: 4) (cost=0.00..862.43 rows=2000 width=16)
-> Hash Join (cost=0.00..862.33 rows=500 width=16)
Hash Cond: (p1.a = p1_1.a)
-> Redistribute Motion 4:4 (slice1; segments: 4) (cost=0.00..431.03 rows=500 width=8)
Hash Key: p1.a
-> Sequence (cost=0.00..431.01 rows=500 width=8)
-> Partition Selector for p1 (dynamic scan id: 1) (cost=10.00..100.00 rows=25 width=4)
Partitions selected: 3 (out of 3)
-> Dynamic Seq Scan on p1 (dynamic scan id: 1) (cost=0.00..431.01 rows=500 width=8)
-> Hash (cost=431.03..431.03 rows=500 width=8)
-> Redistribute Motion 4:4 (slice2; segments: 4) (cost=0.00..431.03 rows=500 width=8)
Hash Key: p1_1.a
-> Sequence (cost=0.00..431.01 rows=500 width=8)
-> Partition Selector for p1 (dynamic scan id: 2) (cost=10.00..100.00 rows=25 width=4)
Partitions selected: 3 (out of 3)
-> Dynamic Seq Scan on p1 p1_1 (dynamic scan id: 2) (cost=0.00..431.01 rows=500 width=8)
Optimizer: Pivotal Optimizer (GPORCA)

Reply all
Reply to author
Forward
0 new messages