WAL-G, S3, and delta-backups

Andrey Borodin

Jan 13, 2022, 10:44:18 AM1/13/22
to gpdb...@greenplum.org, Daniil Zakhlystov, Dmitry Pavlov, Кирилл Решке, Vladimir Borodin, Иван Шаламов
Hi team!

At Yandex.Cloud we are working on efficient and reliable backups for Greenplum with WAL-G.
WAL-G uses physical backups and WAL to restore a cluster to some consistent recovery point. Recovery points can be created on demand by calling WAL-G or a GP function directly. Basebackups and the cluster history are stored in S3 in a compressed and encrypted format.
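
For illustration, creating a named recovery point directly from SQL might look like the following minimal sketch (this assumes the gp_create_restore_point() function available in recent Greenplum versions; the exact name and signature may differ between releases):

-- Create a cluster-wide named restore point on the coordinator and all segments
-- (function assumed from recent Greenplum versions).
SELECT gp_create_restore_point('recovery_point_2022_01_13');
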
The most notable advantages of this approach are:
- extremely efficient: e.g. a few hundred terabytes are restored within a few hours
- point-in-time restore (you can have many points, e.g. every hour or every 10 minutes, but not exactly the same PITR as Postgres has)
- backup does not block DDL
These advantages come at a cost, e.g.:
- you can only restore the whole cluster
- restore will consume all available network bandwidth, or FS throughput; either way, the operation can be stressful for the infrastructure.

Delta backups for PostgreSQL
WAL-G for Postgres stores basebackups (physical copies of cluster files) and history (WAL files). A basebackup is restored with high parallelism to utilize all available network bandwidth. WAL is applied by the Postgres startup process, one record at a time. It's disgusting to watch 1 CPU utilized at 100% on a 64-CPU box.
To alleviate this, WAL-G has delta backups. A delta backup is just like a basebackup, but we only store blocks that have changed since the previous backup. SLRU/VM/FSM etc. go as-is into every delta backup. A delta backup can be seen as a compressed bunch of WAL files. These delta backups are applied in parallel, a few orders of magnitude faster than WAL.

Delta backups for Greenplum
Greenplum heap tables can be backed up exactly like PostgreSQL tables. But the AO/AOCS tables are of special interest. AFAIK AO segments are immutable; if they are present in one backup it would make sense to just store a link to their S3 image. This opens up interesting possibilities.

When a delta/full backup is done we could store only new AO segments. This is a very low-hanging fruit that could save a lot of network bandwidth.
When a segment runs low on disk quota we could just remove some AO segment files locally and make a note: when a removed file is requested, read it from S3.
For this, we could provide some archive_command-like mechanics. For example, we could name it segment_archive_command/segment_restore_command. We could start from manually archiving segments, then move to automatic eviction of unnecessary data to S3.
Finally, I'd be happy if we could do cluster resharding through S3: when you create a new shard, just mark all segments as "evicted" and the segment is ready to run.

This could allow very fast adding/removing of segments. But this is only possible if we can move the AO segment files intact.

Thanks for reading so far. What do you think?

Best regards, Andrey Borodin.

Ashwin Agrawal

Jan 13, 2022, 8:36:57 PM1/13/22
to Andrey Borodin, Daniil Zakhlystov, Dmitry Pavlov, Vladimir Borodin, gpdb...@greenplum.org, Иван Шаламов, Кирилл Решке
Exciting, acknowledged. Will have full tech reply next week. Thank you so much for initiating the discussion.

Andrew Borodin

Jan 14, 2022, 12:15:09 AM1/14/22
to Ashwin Agrawal, Andrey Borodin, Daniil Zakhlystov, Dmitry Pavlov, Vladimir Borodin, gpdb...@greenplum.org, Иван Шаламов, Кирилл Решке
On Fri, Jan 14, 2022 at 6:36 AM Ashwin Agrawal <ashwi...@gmail.com> wrote:
>
> Exciting, acknowledged. Will have full tech reply next week. Thank you so much for initiating the discussion.

Thanks, Ashwin!
And sorry for posting twice. My first message from corp mail was not
delivered, so I used my personal email. And then it was delivered :)
So, I'll continue from the personal box.

We will start with the simplest part: back up AO files only once,
maintain a reference count, and reuse the file whenever possible. To
start the project we need to solve one small technical problem. When
WAL-G sees a file on the FS we need to identify whether the file is an
AO segment or not. And if it is, WAL-G needs to generate some unique
key for the file. But the latter seems trivial - we can just take the
file path+name.
What is the most efficient way to tell an AO segment from a regular
heap segment? Can we do this without interacting with the DB?

Thanks!

Best regards, Andrey Borodin.

Ashwin Agrawal

Jan 18, 2022, 8:14:15 PM1/18/22
to Andrey Borodin, Greenplum Developers, Daniil Zakhlystov, Dmitry Pavlov, Кирилл Решке, Vladimir Borodin, Иван Шаламов
On Thu, Jan 13, 2022 at 7:44 AM Andrey Borodin <x4...@yandex-team.ru> wrote:
Hi team!

At Yandex.Cloud we are working on efficient and reliable backups for Greenplum with WAL-G.
WAL-G uses physical backups and WAL to restore a cluster to some consistent recovery point. Recovery points can be created on demand by calling WAL-G or a GP function directly. Basebackups and the cluster history are stored in S3 in a compressed and encrypted format.

Curious to learn more about what's stored as part of the cluster history?
 
The most notable advantages of this approach are:
- extremely efficient: e.g. a few hundred terabytes are restored within a few hours
- point-in-time restore (you can have many points, e.g. every hour or every 10 minutes, but not exactly the same PITR as Postgres has)
- backup does not block DDL

Really nice advantages compared to logical backup; really good to provide this as an alternative to users.

Delta backups for Greenplum
Greenplum heap tables can be backed up exactly like PostgreSQL tables. But the AO/AOCS tables are of special interest. AFAIK AO segments are immutable; if they are present in one backup it would make sense to just store a link to their S3 image. This opens up interesting possibilities.

Need to define immutable in this context. Based on my reading of the whole thread, it seems "immutable" is misunderstood for AO segment files.

AO segment files are never written in-place logically (yes, immutable in that way). Segment files do grow over time and are written past the logical EOF (End-Of-File) on subsequent inserts/writes (so files are mutated from that perspective).

Let's take this example:
CREATE TABLE AO(a int) WITH (APPENDONLY=TRUE);
BEGIN;
INSERT INTO AO SELECT * FROM generate_series(0,100);
ABORT;
INSERT INTO AO SELECT * FROM generate_series(0,100);

This would write to the segment file, let's say at offsets 0-100. Though, as the transaction aborted, the next insert will overwrite the same 0-100 offsets. Looking at the file physically, no growth will be seen, but the contents have changed/mutated.

Important to note: AO tables can't use an infinite number of segment files; the number is finite. At most 127 segment numbers (files for row-oriented tables, and logical entities representing physical files for column-oriented tables) can be used by an AO table. This limitation comes from the number of bits reserved in the TID (AOTID) to represent the segment file number, which is needed for indexes to work. At least in the current state this is the limitation. If we wish to alter the design so they don't have this limitation, that's a totally different story.

When a delta/full backup is done we could store only new AO segments. This is a very low-hanging fruit that could save a lot of network bandwidth.

Detecting new segment files after a backup would be easy. Though I'm curious to learn what logic will be used to decide when to back up an existing segment file vs. skip it (as that might be the more regular case). As explained above, we do add/modify data in segment files. Also, after a cycle of vacuum and compaction, segment files do get reused as well.

gpbackup uses the modcount field from the pg_aoseg/pg_aocsseg tables (at segment file level) for incremental logical backups. Not sure if this is something that can be leveraged for physical delta backups as well, though it requires a catalog lookup and is not something based purely on the filesystem.
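
For illustration, such a lookup might look like the following sketch (the relation name is hypothetical, since every AO table has its own pg_aoseg.pg_aoseg_<oid> relation, and gp_dist_random() is used here to fan the query out to the segments):

-- Per-segment-file metadata for a hypothetical AO table with oid 16385.
-- eof and modcount could tell a delta backup whether a segment file has
-- changed since the previous backup.
SELECT gp_segment_id, segno, eof, tupcount, modcount
FROM gp_dist_random('pg_aoseg.pg_aoseg_16385');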

When a segment runs low on disk quota we could just remove some AO segment files locally and make a note: when a removed file is requested, read it from S3.
For this, we could provide some archive_command-like mechanics. For example, we could name it segment_archive_command/segment_restore_command. We could start from manually archiving segments, then move to automatic eviction of unnecessary data to S3.

Interesting to offload AO files to S3 in native GPDB form. Seems mostly good for partitions/tables with zero (or near-zero) write activity. Users currently do this mostly by converting AO partitions to external partitions, though they lose out on keeping them in the native GPDB AO format and hence lose compression and such. Seems a good longer-term goal once delta backups and such are completely hashed out.

A similar goal could perhaps be accomplished with an S3-compatible filesystem (which could do caching and such) mounted as a different tablespace. Then it's just a matter of altering a table's tablespace from xfs to the S3-compatible (imaginary) FS.
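
As a rough sketch of that alternative (the tablespace name, table name, and mount point are hypothetical; this is just standard tablespace SQL, assuming an S3-backed filesystem is already mounted):

-- A tablespace on a hypothetical S3-backed mount, and moving a cold table onto it.
CREATE TABLESPACE s3_cold LOCATION '/mnt/s3fs/gpdata';
ALTER TABLE sales_2020 SET TABLESPACE s3_cold;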

Finally, I'd be happy if we could do cluster resharding through S3: when you create a new shard, just mark all segments as "evicted" and the segment is ready to run.

The main aspect of resharding is extracting the required data out of its current physical location (segment files). Just as it's important to make the data available to new nodes, it's equally important to remove non-belonging data from old nodes. This removal aspect is not discussed in the proposal.

Sure, with this we can give users a quick, fake feeling of cluster expansion, but on query we will have to make the data for the new nodes quickly available. For AO tables we need to populate metadata tables like pg_aoseg at the time of expansion, even if we wish to pull the AO segment files lazily. To populate these metadata tables we need to know the EOF/tuple count and such. Hence, I'm missing how we would accomplish the stated goal without doing the real work. I am failing to connect the dots here.

I am trying to see why S3 is an enabler for this resharding optimization. The resharding optimization, I agree, lowers the barrier of segment creation by lazily shuffling only the required/queried data instead of proactively doing it for every table. Though even today new nodes can be added online without any data shuffling, and data movement happens in the background via gpexpand. So, using S3 or not, if we can lazily pull data, we should be able to apply the concept even without S3.

The main aspect is how to keep the data separated to easily do resharding, with or without S3. Or eliminate the need for static hashing altogether and hash things at runtime based on the number of workers. All of these seem like much bigger architectural changes and not small incremental steps forward.

Excited to see and take forward the delta backups and WAL-G support for Greenplum as initial iteration. All other things read very (snow)flaky as of now :-)

--
Ashwin Agrawal (VMware)

Ashwin Agrawal

Jan 18, 2022, 8:31:20 PM1/18/22
to Andrew Borodin, Andrey Borodin, Daniil Zakhlystov, Dmitry Pavlov, Vladimir Borodin, Greenplum Developers, Иван Шаламов, Кирилл Решке
On Thu, Jan 13, 2022 at 9:15 PM Andrew Borodin <ambor...@gmail.com> wrote:
And sorry for posting twice. My first message from corp mail was not
delivered, so I used my personal email. And then it was delivered :)
So, I'll continue from the personal box.

Sounds good, no probs.
 
We will start with the simplest part: back up AO files only once,
maintain a reference count, and reuse the file whenever possible. To
start the project we need to solve one small technical problem. When
WAL-G sees a file on the FS we need to identify whether the file is an
AO segment or not. And if it is, WAL-G needs to generate some unique
key for the file. But the latter seems trivial - we can just take the
file path+name.
What is the most efficient way to tell an AO segment from a regular
heap segment? Can we do this without interacting with the DB?

Yes, it's currently a challenge; nothing at the filesystem level helps us detect HEAP vs. AO relfilenodes. We faced a similar issue for checksum verification in pg_basebackup as well, refer to [1]. Even upstream, nothing currently exists for table AMs to detect the AM type at the filesystem level.

In past discussions, some solutions were to either add a MAGIC header to AO files or create a special fork to indicate that it's an AO table. All of these ideas require an on-disk change and are not easy to upgrade. An alternative explored was to maybe try reading BLOCK_SIZE worth of data and see if it fits the HEAP format or not. Again, all such hacks are not a very conclusive way to know for sure just at the filesystem level.

I think for now the safest way is to look at the catalog and then know whether it's AO or not. I will continue to think more on it and see if I can come up with further ideas.
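
For illustration, a minimal catalog-based sketch (assuming a GPDB 6-style catalog, where pg_class.relstorage distinguishes storage types; GPDB 7 moved this to table access methods, so the query would differ there):

-- Map relations to their storage type so a backup tool can tell AO/AOCS from heap.
-- relstorage: 'h' = heap, 'a' = append-optimized (row), 'c' = append-optimized (column).
SELECT c.oid, c.relname, c.relfilenode, c.relstorage
FROM pg_class c
WHERE c.relstorage IN ('a', 'c');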


-- 
Ashwin Agrawal (VMware)

Andrew Borodin

Apr 15, 2022, 12:02:18 AM4/15/22
to Ashwin Agrawal, Andrey Borodin, Greenplum Developers, Daniil Zakhlystov, Dmitry Pavlov, Кирилл Решке, Vladimir Borodin, Иван Шаламов
Hi all!

On Wed, Jan 19, 2022 at 6:14 AM Ashwin Agrawal <ashwi...@gmail.com> wrote:
>
> Excited to see and take forward the delta backups and WAL-G support for Greenplum as initial iteration. All other things read very (snow)flaky as of now :-)
>

Dan has implemented sharing of AO files between backups [0]. According
to our calculations this will save us ~90% of backup traffic and
storage. Gradually moving on to (snow)flaky things.

Best regards, Andrey Borodin.

[0] https://github.com/wal-g/wal-g/pull/1257

Andrey Borodin

Jun 2, 2022, 8:28:44 AM6/2/22
to Greenplum Developers, ambor...@gmail.com, Andrey Borodin, Greenplum Developers, Daniil Zakhlystov, Dmitry Pavlov, Кирилл Решке, Vladimir Borodin, Ashwin Agrawal
On Friday, April 15, 2022 at 9:02:18 AM UTC+5 ambor...@gmail.com wrote:
Gradually moving on to (snow)flaky things.

A few weeks ago we released WAL-G 2.0 with support for Greenplum online backups [0]. The current implementation employs shared storage of AO/AOCS segments, which in effect is almost as effective as Postgres incremental backups. But we will gradually push it further to real delta backups of AOCS segments based on the logical EOF.

Meanwhile we are starting to work on automatic offload of AOCS segments to S3. The most appropriate place to detect that a file is missing from the FS seems to be the Postgres smgr subsystem. Luckily, similar work has been done in Postgres by engineers from Neon. So we propose to restore the pluggable smgr mechanism and reuse it.

What do you think?

 

Ashwin Agrawal

Jun 6, 2022, 2:45:08 PM6/6/22
to Andrey Borodin, Greenplum Developers, ambor...@gmail.com, Andrey Borodin, Daniil Zakhlystov, Dmitry Pavlov, Кирилл Решке, Vladimir Borodin
On Thu, Jun 2, 2022 at 5:28 AM Andrey Borodin <x...@double.cloud> wrote:
On Friday, April 15, 2022 at 9:02:18 AM UTC+5 ambor...@gmail.com wrote:
Gradually moving on to (snow)flaky things.

A few weeks ago we released WAL-G 2.0 with support for Greenplum online backups [0]. The current implementation employs shared storage of AO/AOCS segments, which in effect is almost as effective as Postgres incremental backups. But we will gradually push it further to real delta backups of AOCS segments based on the logical EOF.

Meanwhile we are starting to work on automatic offload of AOCS segments to S3. The most appropriate place to detect that a file is missing from the FS seems to be the Postgres smgr subsystem. Luckily, similar work has been done in Postgres by engineers from Neon. So we propose to restore the pluggable smgr mechanism and reuse it.

What do you think?

Sounds like the right direction/approach for the implementation. Will take a look at the PR.

--
Ashwin Agrawal (VMware)

Andrey Borodin

Apr 7, 2023, 12:31:13 PM4/7/23
to Ashwin Agrawal, Andrey Borodin, Greenplum Developers, Andrey Borodin, Daniil Zakhlystov, Dmitry Pavlov, Кирилл Решке, Vladimir Borodin
Hello world!

On Mon, Jun 6, 2022 at 11:45 PM Ashwin Agrawal <ashwi...@gmail.com> wrote:
>
> Sounds like the right direction/approach for the implementation. Will take a look at the PR.

I'm here for a small status update. The project goes on.

So far we have a working version of automatic S3 data offloading. It
is available here [0], but it only works with the patched GP6 version
from here [1]. During the development process a lot of hidden
obstacles were uncovered, mainly around WAL replay of offloading on
standbys.
Currently, data caching is a big TODO item that has not been approached
yet. Also, the current version requires manual actions to inform the
system which relations can be offloaded without significant workload
interference.

According to our estimates we hope to have a production-ready version in 2023.

Thanks!


Best regards

[0] https://github.com/yezzey-gp/yezzey
[1] https://github.com/yezzey-gp/gp_modified_for_yezzey

Andrey Borodin

Sep 28, 2023, 11:08:03 AM9/28/23
to Greenplum Developers, Andrey Borodin, Andrey Borodin, Greenplum Developers, Andrey Borodin, Daniil Zakhlystov, Dmitry Pavlov, Кирилл Решке, Vladimir Borodin, Ashwin Agrawal
Hi!

On Friday, April 7, 2023 at 19:31:13 UTC+3, Andrey Borodin wrote:

According to our estimates we hope to have a production-ready version in 2023.

We have released Yezzey version 1.8 and launched a publicly available Managed Greenplum with S3 data offloading.
According to our benchmarks [0], S3 tables are slightly slower than tables read from local SSDs.

Any feedback or pull requests will be greatly appreciated.

Best regards, Andrey Borodin.

 