Hi,
I noticed that orca and planner have different default distribution policy for CTAS statements:
set optimizer to off;
SET
create table t_planner as values (1);
psql:/tmp/ctas.sql:5: NOTICE: Table doesn't have 'DISTRIBUTED BY' clause -- Using column(s) named 'column1' as the Greenplum Database data distribution key for this table.
HINT: The 'DISTRIBUTED BY' clause determines the distribution of data. Make sure column(s) chosen are the optimal data distribution key to minimize skew.
SELECT 1
set optimizer to on;
SET
create table t_gporca as values (1);
psql:/tmp/ctas.sql:8: NOTICE: Table doesn't have 'DISTRIBUTED BY' clause. Creating a NULL policy entry.
SELECT 1
\d+ t_planner
Table "public.t_planner"
Column | Type | Modifiers | Storage | Stats target | Description
---------+---------+-----------+---------+--------------+-------------
column1 | integer | | plain | |
Distributed by: (column1)
\d+ t_gporca
Table "public.t_gporca"
Column | Type | Modifiers | Storage | Stats target | Description
---------+---------+-----------+---------+--------------+-------------
column1 | integer | | plain | |
Distributed randomly
We can see that t_planner is hash distributed while t_gporca is randomly distributed. Both master and 5X have the same behavior. I also verified some of other CREATE TABLE statements, here is the result (the planner & orca columns are the results):
| create table t | planner | orca |
|:------------------------------------------:|:-------:|:--------:|
| as values (1) | by (c1) | randomly |
| as select c1 from generate_series(1,10) c1 | by (c1) | randomly |
| as select * from t1 | by (c1) | by (c1) |
| as table t1 | by (c1) | by (c1) |
We can see that the dist policies set by orca depend on the AS clause, is this by design? Or is this a bug? Personally I found it a bit confusing.
Best Regards
Ning