Projections

89 views
Skip to first unread message

Andrey Borodin

unread,
Feb 12, 2024, 3:33:34 AMFeb 12
to Greenplum Developers, Кирилл Решке, xi...@yandex-team.ru, smia...@yandex-team.ru
Hey gp-hackers,

Kirill recently produced a PoC to discuss projections in GP [0]. Due
to some glitches of CLA bot he was not able to sign CLA (btw, please
fix it). But let's discuss general approach anyway.
Basic idea is to define projection like this:

CREATE TABLE aott(i int, j int) DISTRIBUTED BY (i);
CREATE PROJECTION p1 ON aott DISTRIBUTED BY (j);

This will double size of the aott table. But with projection p1,
planner will consider join\aggregation pathes with motion elimination
corresponding on distribution key.

To do so we added new relkind == RELKIND_PROJECTION, which affects
relation very similarly to Postgres indexes: every table modification
also modifies all projections.
Also we can consider partial projections: take only several columns
from original table. Or, perhaps, partial projections. However, this
does not seem very useful in case of AOCS (which is considered primary
use-case for projections).

What do you think?


Best regards, Andrey Borodin.

[0] https://github.com/yezzey-gp/ygp/pull/6/files

Zhenghua Lyu

unread,
Feb 12, 2024, 5:45:37 AMFeb 12
to Andrey Borodin, Greenplum Developers, Кирилл Решке, xi...@yandex-team.ru, smia...@yandex-team.ru
Hi, thanks a lot for the proposal.
Seems like it is going to bring projections feature (like Vertica) to GPDB?

I have not looked into this much, but some implementations idea from me is:

* seems it is easy to add base relation's scan paths thus no need to add a new type of Locus
* then let optimizer can choose the best path

Best,
Zhenghua Lyu


--
To unsubscribe from this group and stop receiving emails from it, send an email to gpdb-dev+u...@greenplum.org.


This electronic communication and the information and any files transmitted with it, or attached to it, are confidential and are intended solely for the use of the individual or entity to whom it is addressed and may contain information that is confidential, legally privileged, protected by privacy laws, or otherwise restricted from disclosure to anyone else. If you are not the intended recipient or the person responsible for delivering the e-mail to the intended recipient, you are hereby notified that any use, copying, distributing, dissemination, forwarding, printing, or copying of this e-mail is strictly prohibited. If you received this e-mail in error, please return the e-mail to the sender, delete it from your computer, and destroy any printed copy of it.

Garagashli Niyazi

unread,
Feb 12, 2024, 6:02:35 PMFeb 12
to Greenplum Developers, Andrey Borodin, Кирилл Решке, xi...@yandex-team.ru, smia...@yandex-team.ru
Hi everyone,

First of all, I do believe the proposal to be a good QoL improvement, thank you for this initiative!

Here are some additional features of PROJECTION that could make it even more useful:

1) Make an option of asynchronous PROJECTION update. For example,
CREATE PROJECTION p1 ON aott DISTRIBUTED BY (j) REFRESH EVERY DAY AT 00:00;

This will allow separating the load of the base tables (that can happen hourly) from potential business logic, such as daily mart refreshes. Otherwise, an existing process may very well double in execution time, which is not always desirable.

2) Add an option of OBT-style projections (see StarRocks for reference: https://docs.starrocks.io/docs/using_starrocks/Materialized_view/)

I understand that this may be out of scope for the initial proposition, and may be too complex to even be possible, but just hear me out. Suppose you have normalized tables, each with their specific distribution keys, and you have a query (a view or a script materializing the mart) - running that is costly by definition. If many users perform that, costs can be prohibitive. You can obviously create a table to materialize that query, but then you'd have to find all the places and users that utilize it, and have them change it to 'SELECT * FROM MAT_QUERY;'. With a projection, however, you can do it in background and have optimizer recognize it without users changing anything. So a multiple join query will be processed by optimizer to that select * statement internally. And, again, you can set a schedule for refreshing the data within it.

Thanks!

Ashwin Agrawal

unread,
Feb 12, 2024, 6:24:40 PMFeb 12
to Garagashli Niyazi, Greenplum Developers, Andrey Borodin, xi...@yandex-team.ru, smia...@yandex-team.ru
On Mon, Feb 12, 2024 at 3:02 PM Garagashli Niyazi <zlob...@gmail.com> wrote:
1) Make an option of asynchronous PROJECTION update. For example,
CREATE PROJECTION p1 ON aott DISTRIBUTED BY (j) REFRESH EVERY DAY AT 00:00;

This will allow separating the load of the base tables (that can happen hourly) from potential business logic, such as daily mart refreshes. Otherwise, an existing process may very well double in execution time, which is not always desirable.

That totally depends on what kind of data access guarantees are portrayed for Projections. It's more similar to Materialized views or Normal tables to always get up to date results. Possibly can provide both kind of options as depends on use case and flexibility exposed.

- Ashwin (Broadcom)

Ashwin Agrawal

unread,
Feb 12, 2024, 7:23:47 PMFeb 12
to Andrey Borodin, Greenplum Developers, Кирилл Решке, xi...@yandex-team.ru, smia...@yandex-team.ru
Great to hear about the functional addition.
To start with I don't like the usage of PROJECTION word, as folks will confuse and try to draw it parallel to Vertica Projections.

One of the ways to think about this is as enhanced or type of materialized view

CREATE TABLE aott(i int, j int) DISTRIBUTED BY (i);
CREATE MATERIALIZED VIEW mv as SELECT * FROM aott DISTRIBUTED BY (j);

main enhancement being
- in-core automation present (incremental refresh of materialized view kind-of) to keep the view updated as base table is changing
- planner modified to make use of projection which possible can be done for MV as well so that users don't have to modify their queries to explicitly use MV

Some notes/questions:
- So essentially with this feature COPY operation (which is loading data to a table) is able to reshuffle/motion the tuples to other nodes (to populate its projection) while loading?
- Partial projections for AOCS should still be helpful to reduce full double size of the table if known for queries which will use projection don't fetch all the columns data
- What all use cases do you envision for PROJECTIONS other than just to avoid MOTIONS during query processing?
- So all schema alteration (AT ADD COLUMN and such) operations have to be mirrored to PROJECTIONS as well?
- Can PROJECTIONS have different storage (table type and tablespace) and compression compared to the base table?
- Does creating PROJECTION by default carry associated objects with it like INDEXES (unique index can't be carried but rest)?
- Not just the table's size on disk doubles also WAL, (physical and logical) backups, etc... Possibly should have code to just exclude them from logical backups for sure
- Creating projection for REPLICATED or RANDOM tables seems aren't helpful
- What could be perceived as direct dispatch before in presence of PROJECTIONS may no more be direct dispatch and require 2PC (though I am sure PROJECTION use case is for OLAP centric workload and not OLTP), just for correction of code though will have to plug the things here

Gives a feeling of a lot more static planning and overhead for usage of PROJECTIONS. Instead something more dynamic like how caching works transparently to users would be great if it can get going (understand its a lot more complicated). First time query performs the motion of tuples based on base table, that gets cached (in some temp table of sorts) and used for ongoing queries. Though goes away based on time retention or restart of cluster and such. It's probably partial caching of the interim query results bucket item.

- Ashwin (Broadcom)

Adam Lee

unread,
Feb 20, 2024, 1:36:54 AMFeb 20
to Ashwin Agrawal, Andrey Borodin, Greenplum Developers, Кирилл Решке, xi...@yandex-team.ru, smia...@yandex-team.ru
On Tue, Feb 13, 2024 at 8:23 AM 'Ashwin Agrawal' via Greenplum
Developers <gpdb...@greenplum.org> wrote:
>
> Some notes/questions:
> - So essentially with this feature COPY operation (which is loading data to a table) is able to reshuffle/motion the tuples to other nodes (to populate its projection) while loading?

It would be more than great, but I guess not, this feature could be
done without touching COPY. Did I miss something?

> - Can PROJECTIONS have different storage (table type and tablespace) and compression compared to the base table?

I'm thrilled.

--
Adam

--
Reply all
Reply to author
Forward
0 new messages