Hi folks,
I want to start a discussion on whether we should have the wal_compression GUC on or off by default.
The current default value is off; Postgres has had wal_compression off by default since the GUC was introduced.
However, as we move forward with Greenplum 7, it is worth evaluating whether we should turn wal_compression on by default.
Greenplum 7 differs from Postgres in that we strictly use ZSTD as the compression algorithm.
In Postgres, the user has the option to use other compression algorithms, e.g., pglz, lz4, and zstd.
Furthermore, the ZSTD compression level is tunable; it can go from 1 to 22. Right now, we are using the default compression level of 3.
Compression level 3 is the value recommended by ZSTD, as it covers most general use cases.
The question is, should we change this value?
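(For anyone who wants to play with the levels outside of Greenplum, here is a minimal standalone libzstd sketch; the 32K filler buffer is made up purely for illustration, and this is not how the server actually invokes libzstd:)

/* zstd_level_demo.c -- build with: cc zstd_level_demo.c -lzstd
 * Standalone sketch only: compresses one made-up 32K buffer at levels
 * 1, 3, and 19 with libzstd's one-shot API, to show the level knob and
 * the size/time trade-off being discussed.
 */
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

int main(void)
{
    const size_t src_size = 32 * 1024;            /* stand-in for one 32K page */
    char *src = malloc(src_size);
    for (size_t i = 0; i < src_size; i++)
        src[i] = (char) (i % 251);                /* semi-compressible filler */

    size_t bound = ZSTD_compressBound(src_size);  /* worst-case output size */
    char *dst = malloc(bound);

    int levels[] = {1, 3, 19};
    for (int i = 0; i < 3; i++)
    {
        size_t n = ZSTD_compress(dst, bound, src, src_size, levels[i]);
        if (ZSTD_isError(n))
            fprintf(stderr, "level %d failed: %s\n", levels[i], ZSTD_getErrorName(n));
        else
            printf("level %2d: %zu -> %zu bytes\n", levels[i], src_size, n);
    }

    free(src);
    free(dst);
    return 0;
}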
I’ve done some performance testing to see the potential benefits of setting wal_compression to on, along with an exploration of different ZSTD compression levels.
The performance testing was conducted on two different Greenplum 7 clusters, a multi-node and a single-node setup:
The GUC settings used for most of the tests:
Summary
Beyond this point are the test information, the data, and an analysis of the results.
Note: the main test performed here is based on prior testing from the Postgres mailing list.
The test creates tables, inserts rows, and then updates the values.
Originally, only the updates were measured; however, I’ve modified the tests to collect some additional interesting data.
9 short and 1 long, short changed
hundred tiny fields, all changed
hundred tiny fields, half changed
hundred tiny fields, half nulled
one short and one long field, no change
ten tiny fields, all changed
two short fields, both changed
two short fields, no change
two short fields, one changed
Also, I will mostly present aggregated data for the tests above, since showing each individual test result would be too verbose.
The scale mentioned below is a multiplier for the number of rows inserted.
1000 scale will be around 14GB – 21GB
5000 scale will be around 40GB – 55GB
10000 scale will be around 55GB – 83GB
Single node test
scenario: create + insert
type | zstd compression level | duration delta | wal_generated delta | wal_bytes delta
heap | 1 | -0.1% | -0.4% | -0.4%
scenario: update
type | scale | zstd compression level | duration delta | wal_generated delta | wal_bytes delta
AO/CO| 1000 | 3 | 2.1% | -0.3% | -0.2%
heap | 1000 | 1 | -4.9% | -31.1% | -31.3%
heap | 1000 | 3 | 7% | -31% | -31%
heap | 1000 | 19 | 1442.0% | -30.5% | -33.3%
heap | 5000 | 1 | -3.2% | -31.0% | -31.6%
heap | 5000 | 3 | 4.4% | -31.3% | N/A
heap | 10000 | 3 | 3% | -31% | -32%
heap | 50000 | 1 | -3.9% | -31.0% | -31.6%
scenario: create + insert + update
type | scale | zstd compression level | duration delta | wal_generated delta | wal_bytes delta
heap | 1000 | 1 | -1.8% | -19.8% | -20.0%
GUC wal_consistency_checking=heap
type | scale | zstd compression level | duration delta | wal_generated delta | wal_bytes delta
heap | 1000 | 1 | -43.1% | -86.4% | -86.7%
Based on the results from the update scenario, increasing the compression level also increases the duration in return for compressing the data a bit more.
However, the increase in duration is very costly compared to the little extra compression we get in return.
Hence, the tests beyond this point are limited to compression levels 1 and 3 (the default).
A very similar result was observed in testing done on Postgres.
Those results show that, past a certain point, the full-page image (FPI) size no longer decreases even as the ZSTD compression level increases.
descr | rel_size | fpi_size | time (s)
int column no compression | 429 MB | 727 MB | 13.15
int column zstd default level | 429 MB | 523 MB | 14.23
int column zstd level 1 | 429 MB | 524 MB | 13.94
int column zstd level 10 | 429 MB | 523 MB | 23.46
int column zstd level 19 | 429 MB | 523 MB | 103.71
int/text no compression | 344 MB | 558 MB | 10.08
int/text zstd default level | 344 MB | 415 MB | 11.48
int/text zstd level 1 | 344 MB | 418 MB | 11.25
int/text zstd level 10 | 344 MB | 415 MB | 20.59
int/text zstd level 19 | 344 MB | 413 MB | 62.64
Multi-node test
scenario: create only
type | scale | zstd compression level | duration delta | wal_generated delta | wal_bytes delta
heap | 1000 | 3 | 0.1% | -7.8% | -4.1%
scenario: update
type | scale | zstd compression level | duration delta | wal_generated delta | wal_bytes delta
heap | 5000 | 3 | -9.6% | -29.2% | -28.1%
heap | 10000 | 1 | -11.7% | -31.0% | -33.7%
heap | 10000 | 3 | -10.7% | -27.4% | -27.8%
The surprising result here is that the multi-node setup actually gained a ~10% speedup, unlike the single-node setup.
The wal_bytes and wal_generated deltas are within the expected range of a 20-30% reduction in size.
perf capturing
All perf records were captured on host sdw1.
The CPU % values below were taken from flamegraphs generated from the perf records.
The flamegraphs showed that the extra CPU usage comes mainly from ZSTD_compress and ZSTD_getErrorString.
Multi-node mirrorless
testname | compression=3 on | off
9 short and 1 long, short changed | 5.55% | 0.90%
hundred tiny fields, all changed | 2.56% | 1.06%
hundred tiny fields, half changed | 2.83% | 1.05%
hundred tiny fields, half nulled | 3.07% | 1.05%
one short and one long field, no change | 1.56% | 0.90%
ten tiny fields, all changed | 10.98% | 2.04%
two short fields, both changed | 11.06% | 1.82%
two short fields, no change | 11.63% | 1.82%
two short fields, one changed | 12.07% | 1.75%
Multi-node with mirrors
testname | compression=3 on | off
9 short and 1 long, short changed | 1.64% | 1.25%
hundred tiny fields, all changed | 3.81% | 4.13%
hundred tiny fields, half changed | 3.63% | 3.94%
hundred tiny fields, half nulled | 3.81% | 3.48%
one short and one long field, no change | 1.36% | 1.24%
ten tiny fields, all changed | 6.52% | 5.74%
two short fields, both changed | 7.35% | 6.60%
two short fields, no change | 4.05% | 3.32%
two short fields, one changed | 4.29% | 3.80%
testname | compression=1 on | off
9 short and 1 long, short changed | 1.72% | 1.20%
hundred tiny fields, all changed | 3.84% | 3.73%
hundred tiny fields, half changed | 4.18% | 3.85%
hundred tiny fields, half nulled | 3.89% | 3.66%
one short and one long field, no change | 1.28% | 1.35%
ten tiny fields, all changed | 7.09% | 6.03%
two short fields, both changed | 7.02% | 6.68%
two short fields, no change | 3.81% | 3.95%
two short fields, one changed | 4.03% | 3.85%
testname | compression=3 on with mirror caught up | off with mirror caught up
9 short and 1 long, short changed | 1.79% | 1.37%
hundred tiny fields, all changed | 3.84% | 3.10%
hundred tiny fields, half changed | 3.87% | 3.75%
hundred tiny fields, half nulled | 3.83% | 3.59%
one short and one long field, no change | 1.41% | 1.29%
ten tiny fields, all changed | 7.37% | 6.89%
two short fields, both changed | 7.01% | 7.23%
two short fields, no change | 4.23% | 4.05%
two short fields, one changed | 3.68% | 3.54%
Based on the results here, even when waiting for the mirror to decompress and replay the WAL, compression still does not take a significant amount of CPU time.
The multi-node setup with mirrors uses less overall Postgres CPU than the mirrorless setup. This is probably because the data needs to be sent to the mirrors?
It seems that the overall Postgres CPU usage with wal_compression on, compared to off, increased by 0.5% to 1.5%.
The increase in CPU cycles for Postgres does not seem like much for a host with 16 segments (8 primaries + 8 mirrors).
TPC-DS load of 1TB with ZSTD compression 3
AO/CO
off | 1003.607694 seconds
on | 962.35554 seconds
duration delta (off -> on) -4.11%
Heap
off | 1551.47638 seconds
on | 1241.385109 seconds
duration delta (off -> on) -19.99%
Surprisingly, even AO/CO received a 4% benefit from turning compression on. This might be related to auxiliary tables being heap.
As for TPC-DS with heap tables, we can see an almost 20% improvement in speed. (Note: I had another run that came in at about a 6% improvement.)
But overall, this is a significant improvement for the TPC-DS load.
Memory
ZSTD does its own memory allocation, which means tracking its memory consumption is not simple. Unfortunately, ZSTD's support for custom memory allocation is still not stable. This is something we may need to revisit in the future.
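(For context, the custom memory allocation support I'm referring to is ZSTD_customMem / ZSTD_createCCtx_advanced, which still sits behind ZSTD_STATIC_LINKING_ONLY rather than the stable API. Purely a sketch of what tracking could look like if/when that stabilizes - untested, and not something we do today:)

/* zstd_mem_probe.c -- build with: cc zstd_mem_probe.c -lzstd
 * Sketch only: routes libzstd's internal allocations through counting
 * wrappers so the compression context's footprint can be observed.
 * ZSTD_customMem / ZSTD_createCCtx_advanced are behind
 * ZSTD_STATIC_LINKING_ONLY, i.e. not stable API, which is exactly the
 * problem mentioned above.
 */
#define ZSTD_STATIC_LINKING_ONLY
#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

static size_t total_requested = 0;

static void *counting_alloc(void *opaque, size_t size)
{
    (void) opaque;
    total_requested += size;      /* crude running total; ignores frees */
    return malloc(size);
}

static void counting_free(void *opaque, void *address)
{
    (void) opaque;
    free(address);
}

int main(void)
{
    ZSTD_customMem mem = { counting_alloc, counting_free, NULL };
    ZSTD_CCtx *cctx = ZSTD_createCCtx_advanced(mem);

    char src[32 * 1024] = {0};    /* stand-in for one 32K page */
    char dst[64 * 1024];
    size_t n = ZSTD_compressCCtx(cctx, dst, sizeof(dst), src, sizeof(src), 3);
    if (ZSTD_isError(n))
        fprintf(stderr, "compress failed: %s\n", ZSTD_getErrorName(n));

    printf("bytes requested by libzstd so far: %zu\n", total_requested);
    ZSTD_freeCCtx(cctx);
    return 0;
}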
The memory consumption did not increase much for the single-node test; the max memory usage went from 8.8% to 9.0%. The memory usage was captured by watching ps for all postgres processes every half second.
Keep in mind that as we increase the number of segments per host, ZSTD will use more memory, since each segment does its own WAL compression.
I would like to hear your thoughts on whether wal_compression should be on by default, and if so, which ZSTD compression level: 1 or 3?
Feedback on what should be investigated further, to better determine which direction we should go, would also be great.
Also, if anything is unclear please let me know.
Thank you very much.
Regards,
Marbin
> Multi-node with mirrors
> testname | compression=3 on | off
> 9 short and 1 long, short changed | 1.64% | 1.25%
What exactly does 1.64% mean? Is it the total CPU time of zstd relative
to all PG processes or relative to all processes?
Also what does 1.25% mean? (since compression is off, how is there a
number?)
> testname | compression=3 on with mirror caught up | off with mirror caught up
This is the most interesting configuration, because we are also
factoring in CPU overhead from decompression. The numbers are definitely
encouraging.
Do you have a variant of this: compression=1 on with mirror caught up?
> TPC-DS load of 1TB with ZSTD compression 3
> AO/CO
> off | 1003.607694 seconds
Wow, it took just 1000s for loading 1TB of data? And was this in a
mirrored multi-host setting?
> The memory consumption did not increase much for the single node test; the max memory usage went from 8.8% to 9.0%.
Which test case did this correspond to?
At any rate, I don't think we have to worry about memory too much. If
there aren't any leaks etc, which your experiments seem to have
eliminated, then we are good. This is because, at a time, a backend can
only be compressing a 32K page. The input size to the compression
algorithm will always be 32K. So, all good!
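To put a rough number on the output side as well, ZSTD_compressBound() gives the worst-case compressed size for a 32K input (trivial sketch, build with -lzstd):

#include <stdio.h>
#include <zstd.h>

int main(void)
{
    /* Worst-case compressed output for a single 32K page. */
    printf("compressBound(32K) = %zu bytes\n", ZSTD_compressBound(32 * 1024));
    return 0;
}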
Some follow-up questions:
1. What is the zstd version that you used?
2. Did you have indexes on the tables for any of the runs?
These are my conclusions so far (I will wait for your replies to my
questions before considering these final):
1. We should definitely enable it by default in guc.c.
However, there is no point in doing wal compression if there is no
network to push WAL to. So, this means that we shouldn't enable it in a
mirrorless setting (the edge case being a mirrorless cluster with WAL archiving
turned on, where there might still be some benefit).
2. Even though we may be able to squeeze some CPU cycles from setting
compresslevel = 1, given the marginal improvements in CPU your
experiments show, I would lean toward favoring the default
compresslevel = 3. This is because:
* It is what upstream does.
* It is the one that ZSTD advocates and the setting that the zstd
dev team would pay more attention to and optimize in newer releases.
* There is not enough of a CPU gap between level=3 and level=1.
* There might be a workload in which level=3 pulls ahead in terms of
WAL compression ratio (and thus overall duration).
> Local machine: *** Zstandard CLI (64-bit) v1.5.5, by Yann Collet ***
> Multi-node machine: 1.4.4+dfsg-3ubuntu0.1
Interesting. I checked the min required version for zstd in our code:
if test "$with_zstd" = yes; then
dnl zstd_errors.h was renamed from error_public.h in v1.1.1
PKG_CHECK_MODULES([ZSTD], [libzstd >= 1.1.1])
fi
I think it's high time we bump it. I think we will be more confident in
turning wal_compression on by default that way! (PG requires at least
1.4.0 on devel)
Any thoughts on what min version to require? (My opinion is to require
at least 1.4.0)
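Independent of the configure-time check, we could in principle also assert the floor at compile time wherever zstd.h is included - just a sketch, not a proposal for a specific file:

#include <zstd.h>

/* ZSTD_VERSION_NUMBER is major*10000 + minor*100 + release, so 1.4.0 == 10400. */
#if ZSTD_VERSION_NUMBER < 10400
#error "libzstd 1.4.0 or newer is required"
#endif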
> There might be more testing needed for mirrorless, and the results may be specific to each datastore.
> For example, vSAN by default does not do "data locality".
> It pools all the disk groups (groups of disks within a host) together to create a vSAN datastore, and
> all the writes go through the vSAN network. VMDKs are split into components and may reside on different disk groups.
> So, if vSAN is not doing the compression by default, there is a possibility for some performance gain.
I see, so for high WAL volume, it might increase throughput in such a
setting. This discussion can also be extended to cloud-based deployments
with non-local storage, I guess.
With regard to the numbers you posted for the mirrorless deployment, like:
testname | compression=3 on | off
9 short and 1 long, short changed | 5.55% | 0.90%
Even though we take a hit on PG process CPU usage, what was the net
effect on the overall duration of each test? Was the spike in CPU usage
offset by networked-storage-writes?
> I think 1.4.0 is a good next step and will future-proof us.
> There are some advanced functionalities that became stable in 1.4.0 and are
> possibly used by ZSTD in base backup. I'm not sure when we will be pulling in ZSTD for base backup,
> but once that happens, a higher version of ZSTD will be a requirement.
We don't have any plans for that at the moment. We don't even do the
gzip compression that is available with pg_basebackup in 7X from
gprecoverseg. But, we definitely should IMO. I will let others weigh
in, in terms of priority. But, just to be clear, even if we backport
that support in pg_basebackup, 1.4.0 should be more than enough, right?
Or did you see anything that would benefit from going even higher in
pg_basebackup?
It includes a few bug fixes, including a fix for a rare data corruption bug, which could only be triggered in a niche use case, when doing all of the following: using multithreading mode, with an overlap size >= 512 MB, using a strategy >=
ZSTD_btlazy, and compressing more than 4 GB. None of the default compression levels meet these requirements (not even the --ultra ones).
That said, when we raise the PR to turn wal_compression on by default,
let's bump the version in a separate commit.
It might need some dev/release pipeline changes as well.
> For mirrorless deployment, going from off to on, on is slower by 4 to 8%.
> Unfortunately, I do not have the metrics for network-storage-writes since I do not have access to the vSphere environment.
> I think more investigation will be needed to iron things out for mirrorless deployments on vSphere, but with the current findings, it is safe to say that there will be a performance hit if wal_compression is on for mirrorless deployments.
Understood.
This is mainly a question of what we document. I think we can say that
wal_compression *might not* provide a benefit in mirrorless settings.
It completely depends on the environment - non local GCP storage can be
so slow (I felt the pain last year when experimenting with some readahead
stuff), that the extra CPU might be worth it for high WAL volume
workloads!