'CREATE TABLE LIKE' Command and 'INCLUDING STORAGE' for Append-Optimized Tables

286 views

Skip to first unread message

Alexandra Wang

unread,

Jun 1, 2023, 11:32:26 AM6/1/23

to gpdb...@greenplum.org, fhua...@vmware.com

Hello Greenplum hackers,

TL;DR: the 'CREATE TABLE LIKE INCLUDING STORAGE' command currently
presents undocumented behavior for append-optimized tables,
particularly when multiple 'LIKE' clauses are involved.

Recently, I stumbled upon a GPDB_12_MERGE_FIXME related to the
CREATE TABLE LIKE command with the INCLUDING STORAGE option.

In upstream Postgres, the INCLUDING STORAGE option is relevant only to
toast management. Specifically, when this option is specified, the
attstorge field from the source table's pg_attribute entry gets copied
over.

Here's an example that demonstrates the use of INCLUDING STORAGE for
heap tables in PG16dev. I have also included INCLUDING COMPRESSION and
INCLUDING ALL for comparison. (The COMPRESSION keyword is only
available in PG14+, hence I used upstream Postgres for this
experiment. GP7 exhibits the same behavior with heap tables)

### INCLUDING STORAGE behavior in Postgres

1. INCLUDING STORAGE is only relevant to toast management, and the
storage option is column-specific.

2. INCLUDING COMPRESSION is also only relevant to toast management,
and the compression option is also column-specific.

3. INCLUDING STORAGE and INCLUDING COMPRESSION don't necessarily
depend on each other.

4. Even with INCLUDING ALL, the per-table property "fillfactor" is not
copied. (Nor is the table access method, due to upstream Postgres
having only one in-core access method, this aspect is hard to
demonstrate on the main branch)

Returning to Greenplum, commit 951065c48cc as part of PR 7609 extends
the use of INCLUDING STORAGE, making it responsible also for copying
append-optimized table's encoding options. This specific behavior of
Greenplum is currently not well-documented.(See both 6X doc and 7Xdoc)

To understand how INCLUDING STORAGE works with append-optimized
tables, I carried out the following queries:

```sql
-- pre-cleanup
DROP TABLE IF EXISTS heap;
DROP TABLE IF EXISTS heap2;
DROP TABLE IF EXISTS ao;
DROP TABLE IF EXISTS ao2;
DROP TABLE IF EXISTS aoco;
DROP TABLE IF EXISTS aoco2;
DROP TABLE IF EXISTS like_heap;
DROP TABLE IF EXISTS like_heap_storage;
DROP TABLE IF EXISTS like_ao;
DROP TABLE IF EXISTS like_ao_storage;
DROP TABLE IF EXISTS like_aoco;
DROP TABLE IF EXISTS like_aoco_storage;
DROP TABLE IF EXISTS like_heap_heap_storage;
DROP TABLE IF EXISTS like_heap_ao_storage;
DROP TABLE IF EXISTS like_heap_aoco_storage;

-- create two identical source heap tables
CREATE TABLE heap (a text) WITH (fillfactor = 50);
ALTER TABLE heap ALTER COLUMN a SET STORAGE MAIN;
CREATE TABLE heap2 (a2 text) WITH (fillfactor = 50);
ALTER TABLE heap2 ALTER COLUMN a2 SET STORAGE MAIN;

-- create two identical source ao_row tables
CREATE TABLE ao (b text) USING ao_row WITH (compresstype=zstd,compresslevel=5,blocksize=65536);
ALTER TABLE ao ALTER COLUMN b SET STORAGE MAIN;
CREATE TABLE ao2 (b2 text) USING ao_row WITH (compresstype=zstd,compresslevel=5,blocksize=65536);
ALTER TABLE ao2 ALTER COLUMN b2 SET STORAGE MAIN;

-- create two identical source ao_column tables
CREATE TABLE aoco (c text ENCODING (compresstype='zlib', compresslevel=3, blocksize=8192)) USING ao_column;
ALTER TABLE aoco ALTER COLUMN c SET STORAGE MAIN;
CREATE TABLE aoco2 (c2 text ENCODING (compresstype='zlib', compresslevel=3, blocksize=8192)) USING ao_column;
ALTER TABLE aoco2 ALTER COLUMN c2 SET STORAGE MAIN;

-- create table likes (single)
create table like_heap (like heap);
create table like_heap_storage (like heap including storage);
create table like_ao (like ao);
create table like_ao_storage (like ao including storage);
create table like_aoco (like aoco);
create table like_aoco_storage (like aoco including storage);

-- create table likes (multiple including storage w/ same AMs)
create table like_heap_heap_storage (like heap including storage, like
heap2 including storage);
create table like_ao_ao_storage (like ao including storage, like ao2 including storage); -- ERROR: parameter "blocksize" specified more than once
create table like_aoco_aoco_storage (like aoco including storage, like aoco2 including storage); -- ERROR: parameter "blocksize" specified more than once

-- create table likes (multiple including storage w/ mixed AMs)
create table like_ao_aoco_storage (like ao including storage, like aoco including storage); -- ERROR: parameter "blocksize" specified more than once
create table like_heap_ao_storage (like heap including storage, like ao including storage);
create table like_heap_aoco_storage (like heap including storage, like aoco including storage);

\d+ heap
\d+ ao
\d+ aoco

\d+ like_heap
\d+ like_heap_storage
\d+ like_ao
\d+ like_ao_storage
\d+ like_aoco
\d+ like_aoco_storage

\d+ like_heap_heap_storage
\d+ like_heap_ao_storage
\d+ like_heap_aoco_storage
```

Note: the above output is from 7X. 6X behaves almost the same, only
difference is that on 6X the ERRORed query errors on "compresslevel"
instead of "blocksize".

### INCLUDING STORAGE behavior in Greenplum 7 and 6

1. For a single like .. including storage clause with append-optimized
tables, all the following properties are copied as expected:
- The toast related storage options such as "main", "extended"
- The table access method
- The per-table encoding options for both ao_row and ao_column tables
- The per-column encoding options for ao_column tables

I did not find documentation that clearly explains what STORAGE means in

this command.

2. For multiple LIKE .. INCLUDING STORAGE clauses, we encounter an
ERROR if there are more than one append-optimized source tables. This
is because both ao_row and ao_column tables have default encoding
options, and currently, the same encoding option can't be specified
more than once, even if the values are the same. I did not find documentation

about this.

3. For multiple LIKE .. INCLUDING STORAGE clauses, if we mix heap
source tables and an append-optimized source table, the new table will
always be an append-optimized table and will take the per-table
properties from the source append-optimized table. I did not find

documentation about this.

Huansong and I had an offline discussion and we think that the
original toast related STORAGE options and the additional encoding
options for append-optimized tables are really not the same thing.
Here are our reasons:

1. The original STORAGE option is only relevant to toast management,
hence it is only useful if the table AM needs inline and/or
out-of-line toast tables. In Greenplum, ao_row tables may need toast,
while ao_column tables never need toast.

2. The original toast related STORAGE option is per-column, whereas
the append-optimized additions can be per-table.

3. When INCLUDING STORAGE is specified for a source append-optimized
table, the access method also gets copied along with the encoding
options, whereas the toast related STORAGE option is table AM
agnostic.

4. For an ao_column table, it is arguable that the encoding options
can be per-column, which is similar to the toast-related storage
options. However, it is also similar to the COMPRESSION options that
were later introduced in PG14. In upstream Postgres, INCLUDING STORAGE
and INCLUDING COMPRESSION are not the same thing either.

### Questions:

1. Given right now 7X and 6X have the same behavior, do we want to
keep this behavior moving forward?

2. To make things easier, should we just disallow mixing AMs for
multiple INCLUDING STORAGE?

3. Alternatively, do we want to consider using a different keyword for
the append-optimized encoding options? For example, something like:

CREATE TABLE t3 (LIKE t1 INCLUDING ENCODING, LIKE t2 INCLUDING ENCODING) USING ao_column;

We'd like to hear your comments and thoughts!

Regards,

Alex

Ashwin Agrawal

unread,

Jun 5, 2023, 8:44:51 AM6/5/23

to Alexandra Wang, gpdb...@greenplum.org, fhua...@vmware.com

On Thu, Jun 1, 2023 at 9:02 PM Alexandra Wang <alexandr...@gmail.com> wrote:

Thanks for the detailed explanation.

### Questions:

1. Given right now 7X and 6X have the same behavior, do we want to
keep this behavior moving forward?

Given the confusing nature of the behaviour, fine to change the same for 7X (don't expect too many users of the same).

If we do a good job at clearly highlighting the change compared to 6X in the migration doc from 6X to 7X,

then the risk can easily be mitigated I believe.

2. To make things easier, should we just disallow mixing AMs for
multiple INCLUDING STORAGE?

I don't think we should enforce that restriction. Instead I believe the way upstream quotes [1]

"If multiple specifications are made for the same kind of object, the last one is used." similar is that we should code and clarify behavior.

Does that help?

3. Alternatively, do we want to consider using a different keyword for
the append-optimized encoding options? For example, something like:

CREATE TABLE t3 (LIKE t1 INCLUDING ENCODING, LIKE t2 INCLUDING ENCODING) USING ao_column;

I would vote for having a separate keyword instead of overloading the upstream STORAGE keyword.

As whatif wishes to include ENCODING but not TOAST, current implementation doesn't allow for that.

May similarly need INCLUDING AM

1] https://www.postgresql.org/docs/current/sql-createtable.html

Ashwin Agrawal (VMware)

Alexandra Wang

unread,

Jun 5, 2023, 11:10:14 PM6/5/23

to Ashwin Agrawal, gpdb...@greenplum.org, fhua...@vmware.com

On Mon, Jun 5, 2023 at 5:44 AM Ashwin Agrawal <ashwi...@gmail.com> wrote:

On Thu, Jun 1, 2023 at 9:02 PM Alexandra Wang <alexandr...@gmail.com> wrote:
1. Given right now 7X and 6X have the same behavior, do we want to
keep this behavior moving forward?

Given the confusing nature of the behaviour, fine to change the same for 7X (don't expect too many users of the same).
If we do a good job at clearly highlighting the change compared to 6X in the migration doc from 6X to 7X,
then the risk can easily be mitigated I believe.

Sounds good; I'd like to improve it in 7X as well.

2. To make things easier, should we just disallow mixing AMs for
multiple INCLUDING STORAGE?

I don't think we should enforce that restriction. Instead I believe the way upstream quotes [1]
"If multiple specifications are made for the same kind of object, the last one is used." similar is that we should code and clarify behavior.
Does that help?

1] https://www.postgresql.org/docs/current/sql-createtable.html

The upstream suggestion sounds good to me. However, I couldn't find an existing
case in upstream that supports this quote. The existing INCLUDING/EXCLUDING
options all depend on a column (e.g. STORAGE, COMPRESSION, GENERATED, DEFAULTS,
IDENTITIY) or a set of columns (e.g. COMMENTS, INDEXES, STATISTICS). Since
duplicate column names aren't allowed with multiple LIKEs in the first place, I
don't see a situation where two LIKE options would conflict in upstream
Postgres. Therefore, if we were to code something that follows the upstream
quote, we'd likely be setting the first example for it.

3. Alternatively, do we want to consider using a different keyword for
the append-optimized encoding options? For example, something like:

CREATE TABLE t3 (LIKE t1 INCLUDING ENCODING, LIKE t2 INCLUDING ENCODING) USING ao_column;

I would vote for having a separate keyword instead of overloading the upstream STORAGE keyword.
As whatif wishes to include ENCODING but not TOAST, current implementation doesn't allow for that.
May similarly need INCLUDING AM

OK. Now we have 4 votes to stop overloading the upstream STORAGE keyword for
encodings (The merge FIXME itself, Huansong, Ashwin, and me).

I also like the idea of having an INCLUDING AM option. However, given that AO
encodings and AM are inherently coupled, we should evaluate multiple options.
I've listed three of them below, but other alternatives are welcome:

### Option 1:

INCLUDING STORAGE: only includes toast-related elements, like in upstream
INCLUDING ENCODING: per-table encodings + per-column encodings for ao_column
tables. If multiple specifications are given, take the last one.
INCLUDING AM: only includes AM. If multiple specifications, take the last one.

Pros:
- Naming makes most sense.
Cons:
- If INCLUDING ENCODING is specified but AM is not in any way, we'd error if
the default AM is not compatible with the encoding options.
- If INCLUDING ENCODING is specified and AM is specified by USING, WITH, or
INCLUDING AM from a different table, compatibility check is needed.

### Option 2:

INCLUDING STORAGE: only includes toast-related elements, like in upstream
INCLUDING ENCODING (or INCLUDING COLUMN_ENCODING for clarity?): per-column
encodings for ao_column tables only. We need to ensure the AM is set to
ao_column through USING, WITH, INCLUDING AM, or default AM.
INCLUDING AM (or INCLUDING AM_ENCODING): AM + per-table encodings. If multiple
specifications, take the last one.

Pros:
- COLUMN_ENCODING for ao_column tables aligns more with other existing LIKE
options given its per-column nature. We don't need to worry about selecting
the last specification.
- We could also propose to upstream to add INCLUDING AM, so that fillfactor as
a per-table property that is relavent only to heap tables also gets included.
(Right now, even INCLUDING ALL in upstream doesn't include fillfactor)
Cons:
- If INCLUDING ENCODING is present for one table, and INCLUDING AM is for
another, they better have the same AM. So we still need a compatibility
check, or make one option invalidate the other.

### Option 3:

INCLUDING STORAGE: only includes toast-related elements, like in upstream
INCLUDING ENCODING (or AM/AM_ENCODING): AM + per-table encodings + per-column
encodings

Pros:
- This is most similar to what we already have.
- Given that AM and encodings are coupled for AO tables, having them together
avoids additional compatibility checks between INCLUDING AM and INCLUDING
ENCODING. (We'd still need to check compatibility if using USING and/or WITH
clause)
Cons:
- All the existing confusions arise as two different properties are included in
one option.

For more context, my recent PR
https://github.com/greenplum-db/gpdb/issues/15676 added test cases that
demonstrate the current behavior on 7X and 6X when both INCLUDING STORAGE and
USING/WITH options are present in the same CREATE TABLE statement. I think the
test cases in that PR could be helpful if you need more context to help decide
which new syntax works best for us.

I'd love to hear your comments and thoughts.

Alex (VMware)

Huansong Fu

unread,

Jun 6, 2023, 6:08:40 PM6/6/23

to Alexandra Wang, Greenplum Developers

+1 to Option 1. Even the cons look like good things to me because there’s more predictability.

As you said there’re so many ways to specify AM in the CREATE TABLE we should just let

INCLUDING ENCODING handle encoding options only.

INCLUDING ENCODING: per-table encodings + per-column encodings for ao_column
tables.

By per-table encoding do you mean the pg_class.reloptions? I think we don’t need to pass those.

Even if we do we should introduce a general syntax (INCLUDING RELOPT?) for all AMs but not just CO.

Best regards,

Huansong

On Jun 5, 2023, at 8:44 AM, Ashwin Agrawal <ashwi...@gmail.com> wrote:

!! External Email

!! External Email: This email originated from outside of the organization. Do not click links or open attachments unless you recognize the sender.

Alexandra Wang

unread,

Jun 7, 2023, 12:34:38 PM6/7/23

to Huansong Fu, Greenplum Developers

Thank you Huansong for your input, it was very helpful!

On Tue, Jun 6, 2023 at 3:08 PM Huansong Fu <fhua...@vmware.com> wrote:

+1 to Option 1. Even the cons look like good things to me because there’s more predictability.
As you said there’re so many ways to specify AM in the CREATE TABLE we should just let

INCLUDING ENCODING handle encoding options only.

INCLUDING ENCODING: per-table encodings + per-column encodings for ao_column
tables.

By per-table encoding do you mean the pg_class.reloptions? I think we don’t need to pass those.

Even if we do we should introduce a general syntax (INCLUDING RELOPT?) for all AMs but not just CO.

Yes, I meant the pg_class.reloptions. I reviewed the create_table_like_gp test

that was added by commit
https://github.com/greenplum-db/gpdb/commit/951065c48cc1e636ca9bf54bde440c93482ad9bc,
it looks like we wanted to copy the pg_class.reloptions for append-optimized
tables as well, which essentially serve as "per-table" encoding options.

So I went ahead and started working on PR
https://github.com/greenplum-db/gpdb/pull/15714 which incorporates three key
words as Huansong suggested:

- INCLUDING ENCODING: includes per-attributed encoding for ao_column tables

- INCLUDING RELOPT: includes per-table encoding-related reloptions such as
"blocksize", "compresslevel", "compresstype" for a ao_row or ao_column
tables, as well as more generic reloptions such as "fillfactor" for heap
tables (the latter is yet to be implemented)

- INCLUDING AM: includes access method

Please take a look and let me know if this makes sense. Comments and thoughts
are welcome either on this thread or on the PR.

Alexandra Wang (VMware)

Alexandra Wang

unread,

Jun 12, 2023, 4:32:04 PM6/12/23

to Ashwin Agrawal, gpdb...@greenplum.org, fhua...@vmware.com

Update:

On Mon, Jun 5, 2023 at 8:10 PM Alexandra Wang <alexandr...@gmail.com> wrote:

On Mon, Jun 5, 2023 at 5:44 AM Ashwin Agrawal <ashwi...@gmail.com> wrote:
On Thu, Jun 1, 2023 at 9:02 PM Alexandra Wang <alexandr...@gmail.com> wrote:
2. To make things easier, should we just disallow mixing AMs for
multiple INCLUDING STORAGE?

I don't think we should enforce that restriction. Instead I believe the way upstream quotes [1]
"If multiple specifications are made for the same kind of object, the last one is used." similar is that we should code and clarify behavior.
Does that help?
1] https://www.postgresql.org/docs/current/sql-createtable.html

The upstream suggestion sounds good to me. However, I couldn't find an existing
case in upstream that supports this quote. The existing INCLUDING/EXCLUDING
options all depend on a column (e.g. STORAGE, COMPRESSION, GENERATED, DEFAULTS,
IDENTITIY) or a set of columns (e.g. COMMENTS, INDEXES, STATISTICS). Since
duplicate column names aren't allowed with multiple LIKEs in the first place, I
don't see a situation where two LIKE options would conflict in upstream
Postgres. Therefore, if we were to code something that follows the upstream
quote, we'd likely be setting the first example for it.

So for this quote from upstream documentation:

```

If multiple specifications are made for the same kind of object, the last one is used.

```

Huansong pointed out in [his comments for PR
15714](https://github.com/greenplum-db/gpdb/pull/15714#pullrequestreview-1472960655)
that this rule applies to multiple "INCLUDING|EXCLUDING AM|RELOPT|ENCODING"
like_options in the same LIKE clause, rather than multiple LIKE clauses with
the same like_option.

I agree with Huansong that we should simply disallow multiple LIKE claues with
INCLUDING AM or INCLUDING RELOPT.

If any objections please comment on [the
PR](https://github.com/greenplum-d/gpdb/pull/15714).

---

Alexandra Wang (VMware)

Ashwin Agrawal

unread,

Jun 13, 2023, 12:35:03 AM6/13/23

to Alexandra Wang, gpdb...@greenplum.org, fhua...@vmware.com

On Tue, Jun 13, 2023 at 2:02 AM Alexandra Wang <alexandr...@gmail.com> wrote:

Update:

So for this quote from upstream documentation:

```
If multiple specifications are made for the same kind of object, the last one is used.
```

Huansong pointed out in [his comments for PR
15714](https://github.com/greenplum-db/gpdb/pull/15714#pullrequestreview-1472960655)
that this rule applies to multiple "INCLUDING|EXCLUDING AM|RELOPT|ENCODING"
like_options in the same LIKE clause, rather than multiple LIKE clauses with
the same like_option.

I agree with Huansong that we should simply disallow multiple LIKE claues with
INCLUDING AM or INCLUDING RELOPT.

Sounds fine for now. Just-in-case in future we get some use-case to open the door we can evaluate that time.

Having fewer doors open is generally helpful (in this context) :-)

Thanks Huansong and Alex for digging into the details.

Ashwin Agrawal (VMware)

Reply all

Reply to author

Forward

0 new messages