Yezzey is functioning. Yeneid is coming. The Future of compute-storage separation in GP.


Andrey M. Borodin

Oct 25, 2023, 1:53:06 PM
to gpdb...@greenplum.org, Kirill Reshke, Vsevolod Grabelnikov, Aleksei Luzan, Владимир Бородин, Дмитрий Смаль
Hi folks,

As you may know, in September we launched Yezzey [0], an extension that allows tables to be moved transparently between local filesystem storage and cloud storage, and back, in native format. A discussion on the topic is available here [1].

In fact, Yezzey is just one more step towards making GreenplumDB cloud-native, more manageable, and easier to deploy and use. The previous step was point-in-time recovery using WAL-G [2]. And in this post I want to talk about further steps forward.

To be fair, Greenplum is not a heavily optimized analytical database; there is a lot of room for performance improvement. Many competing systems boast better execution engines, vectorized and specialized data handling, query compilation, and more sophisticated optimization techniques, but none of this matters much: the MPP architecture lets you simply throw more hardware at the problem. And given its unique combination of analytical capabilities and open-source nature, GreenplumDB has nothing to fear from the results of a hundred benchmarks.

As our next steps, we are going to concentrate on the following topics:

1. Auto Scaling: Currently, cdbhash() uses the number of segments as a hash parameter. This leads to scaling issues with gpexpand on a large cluster. We would like to implement an access method, similar to AO, that materializes the table metadata and decouples hash ranges from particular segments. That would allow us to rebind a hash range to a new segment instantly: offload the file to S3 and declare it bound to the new segment; it can even be left in the cache on the original segment until it is evicted. A short illustration of today's hash-to-segment binding follows the list.
As part of our autoscaling strategy, we will be deprecating the heap access method for non-catalog tables. This will allow for zero-cost table addition and removal.
2. Coordination service for sharing tables between clusters, fully S3-backed. It enables reading and writing from multiple clusters, eliminates the need for standby clusters, and opens up new data usage opportunities. Metadata is served by the coordination service when the source cluster is unavailable, and the service also maintains writer locks during writes.
3. Incrementally maintained materialized views (a.k.a. projections). One of the key strengths of GP is its ability to join large tables. However, distributed hash joins usually cause significant network utilization. To mitigate this, we can maintain materialized views with different distribution keys, which can make the Motion unnecessary. To enable this feature, we will need to introduce a new relation type, the projection (a specific kind of materialized view), and teach the planner to consider it (see the sketch after this list).
4. Caching. Currently, Yezzey relies fully on object storage cache, but some S3 implementations charge customers according to the number of GET requests. To reduce these numbers, we need proper local caching in place.
5. ANN - approximate nearest neighbour integration for AI. Currently, pgvector with HNSW has taken over the world. We need to integrate this feature into GP as well (an example follows the list).
6. QUIC - QUIC Motion, or at least Motion compression; see [4].
7. Coordination-less mode - given the metadata table service from (2), we should be able to plan queries on each cluster node.
8. Backup, Offload, and Table-Sharing Storage - Currently, Yezzey stores data in different object storage buckets. However, we can think about creating a backup bucket and using offload buckets in the same way, as homogeneous parts of the storage system. This would allow us to convert a local table to a Yezzey table instantly.
9. Built-in Time Travel - SELECTing a table at a specific point in the near past should be no problem, unless the visibility map is heavily modified. Also, dropping a table should be no problem; the table is stored in backup in exactly the same format, making table drops equivalent to table evictions from the cache.
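
To make item 1 a bit more concrete, here is what the current binding looks like from SQL; the table is hypothetical, while gp_segment_id and gp_distribution_policy are existing Greenplum system columns/catalogs. Because cdbhash() reduces the hash of the distribution key by the current segment count, every row is physically pinned to one segment, which is why adding segments today forces data redistribution:

    -- Hypothetical hash-distributed table.
    CREATE TABLE sales (
        sale_id bigint,
        amount  numeric
    ) DISTRIBUTED BY (sale_id);

    -- Where did the rows land? gp_segment_id is a system column.
    SELECT gp_segment_id, count(*)
    FROM sales
    GROUP BY gp_segment_id
    ORDER BY gp_segment_id;

    -- The distribution policy recorded for the table.
    SELECT *
    FROM gp_distribution_policy
    WHERE localoid = 'sales'::regclass;

The proposal above would turn the hash-range-to-segment mapping into a metadata-level binding over files offloaded to S3, so the answer to the first query could change without rewriting any data.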
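
For item 3, a minimal sketch of the idea using only syntax Greenplum already has (CREATE MATERIALIZED VIEW accepts a DISTRIBUTED BY clause); the tables and columns are made up. What the projection relation type would add is automatic substitution by the planner and incremental maintenance:

    -- Large fact table, distributed on its own key.
    CREATE TABLE orders (
        order_id    bigint,
        customer_id bigint,
        amount      numeric
    ) DISTRIBUTED BY (order_id);

    CREATE TABLE customers (
        customer_id bigint,
        region      text
    ) DISTRIBUTED BY (customer_id);

    -- A "projection" of orders, re-distributed on the join key.
    CREATE MATERIALIZED VIEW orders_by_customer AS
        SELECT order_id, customer_id, amount
        FROM orders
    DISTRIBUTED BY (customer_id);

    -- Both sides are hashed on customer_id, so the join is co-located
    -- and needs no Redistribute Motion of the order rows.
    SELECT c.region, sum(o.amount)
    FROM orders_by_customer o
    JOIN customers c USING (customer_id)
    GROUP BY c.region;

Today the user has to query and refresh such a view explicitly; the planner will not rewrite a query against orders to use it.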
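
And for item 5, the target is essentially what pgvector already offers on vanilla PostgreSQL (HNSW landed in pgvector 0.5.0); the table is hypothetical, and the Greenplum-specific work is making index builds and scans behave well across segments and with offloaded storage:

    CREATE EXTENSION IF NOT EXISTS vector;

    -- vector(3) keeps the example literal short; real embeddings
    -- typically have hundreds of dimensions.
    CREATE TABLE doc_embeddings (
        doc_id    bigint,
        embedding vector(3)
    );

    -- HNSW index for approximate nearest-neighbour search.
    CREATE INDEX ON doc_embeddings USING hnsw (embedding vector_l2_ops);

    -- Ten nearest neighbours of a query vector by L2 distance.
    SELECT doc_id
    FROM doc_embeddings
    ORDER BY embedding <-> '[0.1, 0.2, 0.3]'
    LIMIT 10;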

No doubt, these steps are ambitious. It is likely that we will only be able to implement a subset of these plans. However, this is an ordered list of the things we want to achieve in Greenplum DB. As always, we commit to developing these projects in an open-source manner. Any discussion is greatly appreciated. I sincerely hope that some of the resulting developments will make their way into the upstream codebase.


Best regards, Andrey Borodin.

[0] https://github.com/yezzey-gp/yezzey/blob/v1.8/notes/announce.md
[1] https://groups.google.com/a/greenplum.org/g/gpdb-dev/c/ImJz6DlwT_A
[2] https://github.com/wal-g/wal-g/releases/tag/v2.0.0
[3] https://cloud.yandex.com
[4] https://github.com/greenplum-db/gpdb/pull/16045

Ivan Novick

Oct 25, 2023, 3:14:28 PM
to Andrey M. Borodin, gpdb...@greenplum.org, Kirill Reshke, Vsevolod Grabelnikov, Aleksei Luzan, Владимир Бородин, Дмитрий Смаль
This is very interesting!

Please do keep in mind that the Greenplum mission and strategy is to stay aligned with the PostgreSQL project, so very large deviations that go in the opposite direction would be against the Greenplum strategy.

But there are definitely a lot of valuable ideas here.

The materialized view topic has a lot of interest and potential for Greenplum. I wonder if we can somehow separate out some of the proposals inspired by this grand vision to improve Greenplum incrementally, in smaller chunks.

This is a good discussion for many to participate in.

Thank you Andrey!

-----------------------------------------
Ivan Novick
Director of Product Management
VMware Data Solutions



Ashwin Agrawal

Nov 8, 2023, 7:49:56 PM
to Andrey M. Borodin, gpdb...@greenplum.org, Kirill Reshke, Vsevolod Grabelnikov, Aleksei Luzan, Владимир Бородин, Дмитрий Смаль

On Wed, Oct 25, 2023 at 10:53 AM Andrey M. Borodin <x4...@yandex-team.ru> wrote:
Hi folks,

As you may know, in September we launched Yezzey [0], an extension that allows tables to be moved transparently between local filesystem storage and cloud storage, and back, in native format. A discussion on the topic is available here [1].

In fact, Yezzey is just one more step towards making GreenplumDB cloud-native, more manageable, and easier to deploy and use. The previous step was point-in-time recovery using WAL-G [2]. And in this post I want to talk about further steps forward.

To be fair, Greenplum is not a heavily optimized analytical database; there is a lot of room for performance improvement. Many competing systems boast better execution engines, vectorized and specialized data handling, query compilation, and more sophisticated optimization techniques, but none of this matters much: the MPP architecture lets you simply throw more hardware at the problem. And given its unique combination of analytical capabilities and open-source nature, GreenplumDB has nothing to fear from the results of a hundred benchmarks.

Greenplum is not bad at benchmarks either :-)
Yes, anytime you compare super-specialized databases to the general-purpose capabilities of Greenplum, benchmarks can show shortcomings. There is a lot of room for performance improvements, 100% agree, and they must be pursued as well, even if in today's world you can throw more hardware at the problem (cloud vendors love to state that line and design solutions for it, obviously :)). Efficient utilization of the underlying hardware still remains a very relevant goal.

As our next steps, we are going to concentrate on the following topics:

Very interesting ideas, and thanks a lot for sharing. Many of them (catalog service, no support for heap tables) read very similar to HAWQ (a Greenplum fork created many years back by the Greenplum team for the Hadoop ecosystem); refer to [1].


1. Auto Scaling: Currently, cdbhash() uses the number of segments as a hash parameter. This leads to scaling issues with gpexpand on a large cluster. We would like to implement an access method, similar to AO, that materializes the table metadata and decouples hash ranges from particular segments. That would allow us to rebind a hash range to a new segment instantly: offload the file to S3 and declare it bound to the new segment; it can even be left in the cache on the original segment until it is evicted.
As part of our autoscaling strategy, we will be deprecating the heap access method for non-catalog tables. This will allow for zero-cost table addition and removal.

It would be helpful if you could share more on this materialized-metadata aspect and how it would be leveraged for dynamic segment mapping. Also, will co-located joins between different tables continue to be a thing under this scheme?

Deprecating heap for non-catalog tables means all current native indexes are out of the picture as well. Ouch! I am assuming you will then be implementing some new min-max-style indexing to improve I/O efficiency and reduce traffic to S3.
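
Something in the spirit of PostgreSQL's BRIN, perhaps, which keeps only per-block-range min/max summaries so that a scan can skip block ranges that cannot match. A sketch on a hypothetical table (what you actually build may of course look quite different and live outside the heap AM):

    -- Hypothetical append-only-style fact table.
    CREATE TABLE events (
        event_time timestamptz,
        payload    jsonb
    );

    -- BRIN stores only min/max per block range, so the index stays tiny
    -- and a scan can skip block ranges whose [min, max] cannot match.
    CREATE INDEX events_time_brin ON events USING brin (event_time);

    -- Only block ranges overlapping the predicate need to be read.
    SELECT count(*)
    FROM events
    WHERE event_time >= now() - interval '1 day';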

2. Coordination service for sharing tables between clusters, fully S3-backed. It enables reading and writing from multiple clusters, eliminates the need for standby clusters, and opens up new data usage opportunities. Metadata is served by the coordination service when the source cluster is unavailable, and the service also maintains writer locks during writes.

HAWQ did exactly the same. Let me provide some input up front: a lot of metadata needs to be streamed to the segments while dispatching queries (along with the plans) to accomplish this. We have faced many situations where queries fetch catalog information at runtime because it is locally available. All of that needs to be figured out up front and streamed to the segments for processing, given they cannot fetch anything locally at runtime (UDFs, type information, and a lot of such details). It's fun, though.

Sharing tables between clusters, with coordinated locks for them, reads like a fun challenge to take on.

3. Incrementally maintained materialized views (a.k.a. projections). One of the key strengths of GP is its ability to join large tables. However, distributed hash joins usually cause significant network utilization. To mitigate this, we can maintain materialized views with different distribution keys, which can make the Motion unnecessary. To enable this feature, we will need to introduce a new relation type, the projection (a specific kind of materialized view), and teach the planner to consider it.

This is one of the aspects I am really interested to hear more about. I am not quite able to visualize the pattern for figuring out the different distribution keys. Will the onus be on users, or will you build some kind of ML on top of query patterns to create these projections?
Plus, you didn't clarify how these views will be incrementally maintained, something upstream has been discussing for years with no core implementation so far. This topic seems slightly unrelated to the other architectural aspects aimed at making it Snowflaky... so it can be discussed in a separate forum with more details, as it applies to the current architecture as well.
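
For context, what core PostgreSQL offers today is only wholesale recomputation; incremental maintenance lives out of core (e.g. the pg_ivm extension). Taking the hypothetical orders_by_customer view from the sketch earlier in the thread:

    -- Today: full recomputation, rescanning the entire base table.
    REFRESH MATERIALIZED VIEW orders_by_customer;

    -- In PostgreSQL, CONCURRENTLY (which needs a unique index on the
    -- view) avoids blocking readers, but the work is still proportional
    -- to the whole view rather than to the delta.
    REFRESH MATERIALIZED VIEW CONCURRENTLY orders_by_customer;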

I am slightly torn between this point and the first one. Given the data is in S3 and not really tied to a given node, ideally Motions in this new world should look very different, if they are required at all. There is no point in one node fetching the data from S3 and then streaming it over the network to another node, right?

4. Caching. Currently, Yezzey relies fully on object storage cache, but some S3 implementations charge customers according to the number of GET requests. To reduce these numbers, we need proper local caching in place.
 
 
5. ANN - approximate nearest neighbour integration for AI. Currently, pgvector with HNSW has taken over the world. We need to integrate this feature into GP as well.

Commercial offerings from VMware for GPDB 7 have these integrated, so yes, they do work great with GPDB.
 
6. QUIC - QUIC Motion, or at least Motion compression; see [4].

Motion compression is 100% something we are interested in. We were busy with the 7 GA; we will definitely spend more time on this contribution and respond to take it forward.
 
7. Coordination-less mode - given the metadata table service from (2), we should be able to plan queries on each cluster node.
8. Backup, Offload, and Table-Sharing Storage - Currently, Yezzey stores data in different object storage buckets. However, we can think about creating a backup bucket and using offload buckets in the same way, as homogeneous parts of the storage system. This would allow us to convert a local table to a Yezzey table instantly.
9. Built-in Time Travel - SELECTing a table at a specific point in the near past should be no problem, unless the visibility map is heavily modified. Also, dropping a table should be no problem; the table is stored in backup in exactly the same format, making table drops equivalent to table evictions from the cache.

No doubt, these steps are ambitious. It is likely that we will only be able to implement a subset of these plans. However, this is an ordered list of the things we want to achieve in Greenplum DB. As always, we commit to developing these projects in an open-source manner. Any discussion is greatly appreciated. I sincerely hope that some of the resulting developments will make their way into the upstream codebase.

Great to see the ambitious, forward-looking plan based on Greenplum. The team is super happy and would love to continue discussing the finer details of each of these topics as you unfold and work on them. So keep them coming, and yes, we are more than happy to absorb relevant pieces into Greenplum core, which helps keep it aligned with the PostgreSQL architecture and open to all the current use cases it caters to.


[1] https://hawq.apache.org/docs/userguide/2.3.0.0-incubating/overview/HAWQArchitecture.html

--
Ashwin Agrawal (VMware)