Orca and planner have different default dist policies for CTAS

Ning Yu

olvasatlan,

2018. nov. 12. 23:11:242018. 11. 12.

– gpdb...@greenplum.org

Hi,

I noticed that orca and planner have different default distribution policy for CTAS statements:

set optimizer to off;

SET

create table t_planner as values (1);

psql:/tmp/ctas.sql:5: NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'column1' as the Greenplum Database data distribution key for this table.

HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.

SELECT 1

set optimizer to on;

SET

create table t_gporca as values (1);

psql:/tmp/ctas.sql:8: NOTICE: Table doesn't have 'DISTRIBUTED BY' clause. Creating a NULL policy entry.

SELECT 1

\d+ t_planner

Table "public.t_planner"

---------+---------+-----------+---------+--------------+-------------

Distributed by: (column1)

\d+ t_gporca

Table "public.t_gporca"

---------+---------+-----------+---------+--------------+-------------

Distributed randomly

We can see that t_planner is hash distributed while t_gporca is randomly distributed. Both master and 5X have the same behavior. I also verified some of other CREATE TABLE statements, here is the result (the planner & orca columns are the results):

| create table t | planner | orca |

|:------------------------------------------:|:-------:|:--------:|

| as values (1) | by (c1) | randomly |

| as select c1 from generate_series(1,10) c1 | by (c1) | randomly |

| as select * from t1 | by (c1) | by (c1) |

| as table t1 | by (c1) | by (c1) |

We can see that the dist policies set by orca depend on the AS clause, is this by design? Or is this a bug? Personally I found it a bit confusing.

Best Regards

Ning

Bhuvnesh Chaudhary

olvasatlan,

2018. nov. 12. 23:45:042018. 11. 12.

– n...@pivotal.io, gpdb...@greenplum.org

Hello Ning,

It's an expected behavior for CTAS statements. Refer to https://gpdb.docs.pivotal.io/43180/ref_guide/sql_commands/CREATE_TABLE_AS.html

From the docs:

The Greenplum Database server configuration parameter gp_create_table_random_default_distribution controls the default table distribution policy if the DISTRIBUTED BY clause is not specified when you create a table. Greenplum Database follows these rules to create a table if a distribution policy is not specified.

If the legacy query optimizer creates the table, and the value of the parameter is off, the table distribution policy is determined based on the command.
If the legacy query optimizer creates the table, and the value of the parameter is on, the table distribution policy is random.
If GPORCA creates the table, the table distribution policy is random. The parameter value has no affect.

Also in the table you shared, for the below as well ORCA will chose random distribution, so its consistent for CTAS to consider random distribution with ORCA.

- as select * from t1

- as table t1

create table t1 (a int) distributed by (a);

create table t2 as select * from t1;

NOTICE: Table doesn't have 'DISTRIBUTED BY' clause. Creating a NULL policy entry.

SELECT 0

create table t3 as table t1;

NOTICE: Table doesn't have 'DISTRIBUTED BY' clause. Creating a NULL policy entry.

However, i am not sure of the background why we decided it this way, but i see your point.

Thanks,
Bhuvnesh Chaudhary
Mobile: +1-973.906.6976

Ning Yu

olvasatlan,

2018. nov. 13. 0:51:412018. 11. 13.

– Bhuvnesh Chaudhary, gpdb...@greenplum.org

On Tue, Nov 13, 2018 at 12:44 PM, Bhuvnesh Chaudhary <bchau...@pivotal.io> wrote:

Hello Ning,

It's an expected behavior for CTAS statements. Refer to https://gpdb.docs.pivotal.io/43180/ref_guide/sql_commands/CREATE_TABLE_AS.html

From the docs:
The Greenplum Database server configuration parameter gp_create_table_random_default_distribution controls the default table distribution policy if the DISTRIBUTED BY clause is not specified when you create a table. Greenplum Database follows these rules to create a table if a distribution policy is not specified.
If the legacy query optimizer creates the table, and the value of the parameter is off, the table distribution policy is determined based on the command.
If the legacy query optimizer creates the table, and the value of the parameter is on, the table distribution policy is random.
If GPORCA creates the table, the table distribution policy is random. The parameter value has no affect.

Cool, thanks for the information.

Also in the table you shared, for the below as well ORCA will chose random distribution, so its consistent for CTAS to consider random distribution with ORCA.
- as select * from t1
- as table t1
create table t1 (a int) distributed by (a);
create table t2 as select * from t1;
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause. Creating a NULL policy entry.
SELECT 0
create table t3 as table t1;
NOTICE: Table doesn't have 'DISTRIBUTED BY' clause. Creating a NULL policy entry.

I double checked why I saw different results with you on above two statements. The source table t1 used in my experiments is a __partially distributed tables__ which has numsegments=1 (my cluster is a 3-primary demo cluster) as I'm working on this feature recently, when orca detects any partially distributed tables it will fallback to planner (introduced in 4eb65a53dea6a6f380415f3a0fe3934ac20ce8ca) so the tables are created in planner style. When I replaced t1 with a fully distributed table I got the same results with you, the new tables created by orca are randomly distributed.

If orca always create randomly distributed tables I will comfortable enough. So I think the real problem is on the fallback logic, we should let planner creates randomly distributed tables when falling back from orca. I will have a try.

Thanks

Ning

Válasz mindenkinek

Válasz a szerzőnek

Továbbítás