Default DISTRIBUTED BY for a table with no columns


Heikki Linnakangas

Jan 26, 2016, 8:35:48 AM
to Greenplum Developers
If you create a table that has no columns at all, how is the data
distributed? I would've guessed that it behaves the same as a table that
doesn't have any hashable columns, and is distributed randomly. What
actually happens is that the data is stored *in the master*.

postgres=# create table m();
NOTICE: Table has no attributes to distribute on.
CREATE TABLE
postgres=# alter table m add column a int;
ALTER TABLE
postgres=# insert into m select i from generate_series(1,5) i;
INSERT 0 5
postgres=# select a, gp_segment_id from m;
 a | gp_segment_id
---+---------------
 1 |            -1
 2 |            -1
 3 |            -1
 4 |            -1
 5 |            -1
(5 rows)

I suspect that's an oversight. I don't think there's any user-accessible
way to create a table that's stored in the master rather than in
segments, and I believe that's on purpose. A resourceful DBA might be
able to abuse this to that effect, but I don't think we want to
support this.

I suggest that we change that case to do DISTRIBUTED RANDOMLY instead.
While we're at it, I don't think it's necessary to have a bespoke
message string for this case ("NOTICE: Table has no attributes to
distribute on") and we could use the more general message for this too:

NOTICE: Table doesn't have 'distributed by' clause, and no column type
is suitable for a distribution key. Creating a NULL policy entry.

If that's OK, all we need to do is to remove the explicit check for the
case of a table with no attributes from transformDistributedBy().
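For illustration, here is a sketch (not captured from a live system) of what the zero-column session would look like once the special case is removed and the table falls through to the general NULL-policy path:

```sql
postgres=# create table m();
NOTICE: Table doesn't have 'distributed by' clause, and no column type
is suitable for a distribution key. Creating a NULL policy entry.
CREATE TABLE
postgres=# alter table m add column a int;
ALTER TABLE
postgres=# insert into m select i from generate_series(1,5) i;
INSERT 0 5
postgres=# select a, gp_segment_id from m;
-- rows would now show gp_segment_id >= 0 (stored on the segments),
-- rather than -1 (stored on the master) as in the session above
```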

- Heikki

Daniel Gustafsson

Jan 26, 2016, 10:47:34 AM
to Heikki Linnakangas, Greenplum Developers
I think that's OK (and perfectly sound); it follows how we handle tables with no
columns that can be used as distribution keys (e.g.: create table t (a point);). Also,
we already state this in the documentation:

"If a table does not have an eligible column, Greenplum distributes the rows
randomly or in round-robin fashion."

While that sentence probably wasn’t referring to the above case, it still applies.
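For reference, the non-hashable-column case mentioned above would look roughly like this (session sketched from the notice text quoted earlier in the thread, not captured from a live system):

```sql
postgres=# create table t (a point);
NOTICE: Table doesn't have 'distributed by' clause, and no column type
is suitable for a distribution key. Creating a NULL policy entry.
CREATE TABLE
-- with a NULL policy, rows are distributed randomly or round-robin
-- across the segments, as the documentation sentence above describes
```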

cheers ./daniel

Ivan Novick

Jan 26, 2016, 11:49:44 AM
to Heikki Linnakangas, Greenplum Developers
> I suspect that's an oversight. I don't think there's any user-accessible way to create a table that's stored in the master rather than in segments, and I believe that's on purpose. A resourceful DBA might be able to abuse this to that effect, but I don't think we want to support this.
Agreed +1, this should not be supported IMO.

Cheers,
Ivan 

Jon

Jan 29, 2016, 12:11:16 PM
to Greenplum Developers, hlinna...@pivotal.io
I've always wanted master-only tables for managing ETL metadata so you don't have the overhead of MPP for these really small tables.  

Ivan Novick

Jan 29, 2016, 11:24:53 PM
to Jon, Greenplum Developers, Heikki Linnakangas
Hi Jon,

If we do something like that, we need a dedicated feature that is designed for it.  There are several similar and related use cases to what you are asking for that make a lot of sense, but IMO this particular email thread is about an accidental backdoor and not a thought-through feature.

Can you summarize to the email list what type of use case you are thinking about, so we can think from the ground up about the best approach to solve it?  Perhaps in a new email thread.

Cheers,
Ivan

Jon Roberts

Jan 30, 2016, 11:47:50 AM
to Ivan Novick, Greenplum Developers, Heikki Linnakangas
I see two use cases for master-only tables.

1.  Metadata tables.  There is often a need for fast OLTP interaction when managing data loading.  When a load started and stopped, how many rows were loaded, error messages, etc. are very useful to store in a database.  One solution is to use a normal distributed table in Greenplum, but the interaction is relatively slow compared to, say, PostgreSQL, which is designed for OLTP.  Another solution is to create a separate database just to manage ETL functions.

It would be a simple solution to use the Greenplum master database to store this information, for two reasons.  First, it reduces overall risk because your ETL processing isn't dependent on yet another database.  Secondly, most production systems have a standby master, so your ETL data gets built-in HA.

2.  Replication.  Longer term, I would like to see the ability to replicate in realtime to the master database from external databases like Oracle, SQL Server, PostgreSQL, etc.  Many tools, like Attunity Replicate, are designed for OLTP-to-OLTP replication.  Master-only tables would allow a tool like that to work in an optimal fashion with Greenplum as the target.  We would of course need to drain these tables into distributed tables, but this could be done in batches.

A bit off topic, but a third use case could be for HAWQ.  Just like #2 above, HAWQ could benefit from this replication.  It would allow UPDATE and DELETE statements and then, in batches, drain the data to tables stored in HDFS.  That would drive adoption for HAWQ and make it easier to target HAWQ for data replication.
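To make use case 1 concrete, an ETL metadata table might look something like the sketch below.  The table and column names are purely illustrative, and the DISTRIBUTED MASTERONLY clause is hypothetical syntax that does not exist today:

```sql
create table etl_load_log (
    load_id      bigserial,
    target_table text,
    started_at   timestamptz,
    finished_at  timestamptz,
    rows_loaded  bigint,
    error_msg    text
) distributed masteronly;  -- hypothetical clause for a master-only table

-- frequent, small OLTP-style updates like this are exactly what is
-- slow against a normal distributed table:
update etl_load_log
   set finished_at = now(), rows_loaded = 12345
 where load_id = 1;
```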

If a master-only table is a "feature" instead of an accidental backdoor, then why would it matter if in the future people create master-only tables in this manner?  


Jon Roberts
Principal Engineer | jrob...@pivotal.io | 615-426-8661