Lemma: If R, M1, M2, …, Mk are hash-distributed and L1, L2, L3, …, Ln is either hash-distributed or randomly distributed. We can correctly do URD operations on the parition table. But if R is randomly distributed and L1, L2, L3, …, Ln is either hash-distributed or randomly distributed, things will go wrong.
Proof: insert|select|update|delete for partiton tables are always dispatch to all segments. And insert will generate a redistributed motion which computes the target segment id using the root’s distributed columns(This is the key point why things never go wrong). But if the root table is randomly distributed, the redistribtued motion will compute target segment id randomly, things might go wrong if the target partition is hash-distributed.
Add a new syntax and its semantic is shown as below:
Syntax: Alter table {tabname} set with(reorganize=true|false) expand table;
The options in with clause can only contain at most one option, and if provided, the option’s name can only be ‘reorganize’(this is the semantic in GPDB4 and 5, we keep it same)
if the table is partition table and it is the root
Check ‘reorganize’ can only be false, otherwise throw error
If it is a partital table, we suppose partiton tables each subpart has only one consistent numsegments, and we do following two things(in a single transaction, and holds Access ExclusiveLock for all the tables involved):
For root and its middle-level tables, increase the table policy’s numsegments to make it a full table, do not change any other fields in this table’s policy
For all its leaf tables, increase the table policy’s numsegments to make it a full table, and if it is originally hash-distributed, change it to randomly-distributed, or do nothing to any other fields in this table’s policy
Else, throw an warning that this table does not need reshuffle
Else if Else if the table is partition table and it is not the root
If the table is partital table, throw an error|warning to notice user that should use this statement for root table
Else it is a full table, throw an error|warning that no need to expand
Currently, the first stage of gpexpand in online, it does not need restart the cluster and the cluster is always available(except for modifying the catalog, this is not allowed). And thanks to the numsegments field in gp_distribution_policy, we do not need change hash-distributed table’s policy to randomly distribtued, so that planner might take this advantage to generate plans without motion.
At the second stage of gpexpand
for each non-partion table, gpexpand simply invokes alter table {tab} set (reorganize=true) expand table to rebalance data and increase the table’s numsegments in policy(but do not change any other fields in policy)
For partion table, the task is complex if we want to guarantee performance
In first transaction, invoke alter table {tab} set (reorganize=true) expand table on root table to finish expand process
After the first transaction, the root|middle|leaf tables are all full tables, so to rebalance data is the job of alter table set with(reorganize=true) distributed by (c1, ...) we invoke this statement for each leaf table one by one and each in a single transaction to rebalance their data and change their policy to the same policy as the root’s.
Things also go wrong, if you insert rows directly to the partition.
So my thinking is that we don't need any extra syntax. To expand a
partitioned table:
1. ALTER TABLE ONLY root_partition EXPAND TABLE
This changes the gp_distribution_policy of the root partition. Since
there's no data in it, this finishes quickly. Partitions are unmodified.
This doesn't really affect planning or much else, except that it sets
the policy for any new partitions that you create.
2. For each partition: ALTER TABLE partition EXPAND TABLE
Moves the data, and updates gp_distribution_policy, for that partition.
gpadmin=# CREATE TABLE rank (id int, year int)
gpadmin-# DISTRIBUTED BY (id)
gpadmin-# PARTITION BY RANGE (year)
gpadmin-# ( START (2006) END (2016) EVERY (1),
gpadmin(# DEFAULT PARTITION extra );
NOTICE: CREATE TABLE will create partition "rank_1_prt_extra" for table "rank"
NOTICE: CREATE TABLE will create partition "rank_1_prt_2" for table "rank"
NOTICE: CREATE TABLE will create partition "rank_1_prt_3" for table "rank"
NOTICE: CREATE TABLE will create partition "rank_1_prt_4" for table "rank"
NOTICE: CREATE TABLE will create partition "rank_1_prt_5" for table "rank"
NOTICE: CREATE TABLE will create partition "rank_1_prt_6" for table "rank"
NOTICE: CREATE TABLE will create partition "rank_1_prt_7" for table "rank"
NOTICE: CREATE TABLE will create partition "rank_1_prt_8" for table "rank"
NOTICE: CREATE TABLE will create partition "rank_1_prt_9" for table "rank"
NOTICE: CREATE TABLE will create partition "rank_1_prt_10" for table "rank"
NOTICE: CREATE TABLE will create partition "rank_1_prt_11" for table "rank"
CREATE TABLE
gpadmin=# set Test_print_direct_dispatch_info = on;
SET
gpadmin=# set allow_system_table_mods = on;
SET
gpadmin=# update gp_distribution_policy set numsegments = 2 ;
UPDATE 12
gpadmin=# update gp_distribution_policy set numsegments = 3 where localoid = 'rank'::regclass;
UPDATE 1
gpadmin=# select relname, g.* from pg_class p, gp_distribution_policy g where g.localoid = p.oid;
relname | localoid | attrnums | policytype | numsegments
------------------+----------+----------+------------+-------------
rank_1_prt_5 | 32789 | {1} | p | 2
rank_1_prt_6 | 32793 | {1} | p | 2
rank_1_prt_7 | 32797 | {1} | p | 2
rank_1_prt_8 | 32801 | {1} | p | 2
rank_1_prt_9 | 32805 | {1} | p | 2
rank | 32769 | {1} | p | 3
rank_1_prt_extra | 32772 | {1} | p | 2
rank_1_prt_2 | 32777 | {1} | p | 2
rank_1_prt_3 | 32781 | {1} | p | 2
rank_1_prt_4 | 32785 | {1} | p | 2
rank_1_prt_10 | 32809 | {1} | p | 2
rank_1_prt_11 | 32813 | {1} | p | 2
(12 rows)
gpadmin=# explain insert into rank values (1, 2009); QUERY PLAN-------------------------------------------------- Insert on rank (cost=0.00..0.01 rows=1 width=0) -> Result (cost=0.00..0.01 rows=1 width=0) Planning time: 2.772 ms Optimizer: legacy query optimizer(4 rows)
gpadmin=# select * from rank;INFO: (slice 1) Dispatch command to PARTIAL contents: 0 1 id | year----+------(0 rows)
gpadmin=# insert into rank values (1, 2009);INFO: (slice 0) Dispatch command to SINGLE contentINFO: Distributed transaction command 'Distributed Prepare' to SINGLE contentINFO: Distributed transaction command 'Distributed Commit Prepared' to SINGLE contentINSERT 0 1gpadmin=# select * from rank;INFO: (slice 1) Dispatch command to PARTIAL contents: 0 1 id | year----+------(0 rows)Hmm. I would expect the insert's plan to be based on the distribution
policy of the partition that the tuple belongs to. Not on the policy of
the root partition. But I guess the executor isn't wired that way,
currently. How invasive would it be to fix that?
gpadmin=# explain insert into rank values (1, 2009), (2, 2010), (3, 1000);
QUERY PLAN
----------------------------------------------------------------------------------------
Insert on rank (cost=0.00..0.04 rows=1 width=8)
-> Redistribute Motion 1:3 (slice1; segments: 1) (cost=0.00..0.04 rows=3 width=8)
Hash Key: "*VALUES*".column1
-> Values Scan on "*VALUES*" (cost=0.00..0.04 rows=1 width=8)
Planning time: 0.315 ms
Optimizer: legacy query optimizer
(6 rows)
On Nov 26, 2018, at 9:33 AM, Heikki Linnakangas <hlinna...@pivotal.io> wrote:Hmm. I would expect the insert's plan to be based on the distribution policy of the partition that the tuple belongs to. Not on the policy of the root partition. But I guess the executor isn't wired that way, currently. How invasive would it be to fix that?
Isn’t this also going to be different for ORCA? Planner uses the individual leaf partitions however ORCA IIRC will only look at the root partition (does so for dynamic partition elimination).Can I ask someone familiar with ORCA to chime in here?
gpadmin=# explain insert into rank values (1, 2009);
QUERY PLAN
---------------------------------------------------------------------------
Insert (cost=0.00..0.02 rows=1 width=8)
-> Result (cost=0.00..0.00 rows=1 width=12)
-> Partition Selector for rank (cost=0.00..0.00 rows=1 width=8)
-> Result (cost=0.00..0.00 rows=1 width=8)
-> Result (cost=0.00..0.00 rows=1 width=8)
-> Result (cost=0.00..0.00 rows=1 width=1)
Planning time: 46.522 ms
Optimizer: PQO version 3.9.0
(8 rows)
gpadmin=# explain insert into rank values (1, 2009), (2, 2010), (3, 1000);
QUERY PLAN
-----------------------------------------------------------------------------------
Insert (cost=0.00..0.05 rows=1 width=8)
-> Result (cost=0.00..0.00 rows=1 width=12)
-> Partition Selector for rank (cost=0.00..0.00 rows=1 width=8)
-> Result (cost=0.00..0.00 rows=1 width=8)
-> Values Scan on "Values" (cost=0.00..0.00 rows=1 width=8)
Planning time: 10.756 ms
Optimizer: PQO version 3.9.0
(7 rows)
1. For each leaf:
ALTER TABLE ONLY <leaf> SET DISTRIBUTED RANDOMLY
2. ALTER TABLE <root> EXPAND TABLE
Questions:
1) what is the performance now? is there some data set you could put
some numbers to?
2) why do you think this is going to be an issue?
--
You received this message because you are subscribed to the Google Groups "Greenplum Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to gpdb-dev+u...@greenplum.org.