Catalog Autovacuum


Andrew

Feb 8, 2023, 11:24:08 AM
to gpdb...@greenplum.org
# TL;DR
* Bloated catalogs can cause slowness for all user workloads and are a common cause of customer tickets.
* We have deemed catalog-only autovacuum to be prompt, fair to user workloads, and a good
  preventative measure for GPDB7 clusters.
* The default scope of autovacuum will be restricted to tables under pg_catalog and pg_toast only
  for all databases in the cluster (we have introduced gp_autovacuum_scope for users to adjust this).
* We will be exploring the viability of autovacuum on user tables soon.
* Catalog-only autovacuum will execute as a daemon on each segment individually.
* Catalog-only autovacuum will bypass resource queues and possibly resource groups as well.
* We are keeping the autovacuum threshold GUCs at their PostgreSQL-provided defaults for catalog-only
  autovacuum, since catalog tables are heap tables.
***

# Catalog Autovacuum
## Motivation
* Regularly vacuuming a database is vital for performance
* Hundreds of customer tickets citing “bloat”
* Catalog tables in particular should be low-cost to vacuum and high-impact
    * Every command/query involves catalog lookups
* pg_class, pg_attribute, pg_type, etc. can easily get bloated over time. It is very easy to
  accidentally generate a lot of catalog bloat with normal OLAP usage patterns (e.g. creating
  intermediate tables inside complex UDFs with loops, or creating intermediate tables in numerous
  explicit transaction blocks)

## Question 1: Promptness
* It should kick in as soon as the bloat threshold is reached
    * With some acceptable delay
* Controlled by autovacuum_naptime and the number of databases
    * If the installation has N databases, a new worker will be launched ~ every
      autovacuum_naptime/N seconds
    * If a server has 60 databases and autovacuum_naptime is set to 60 seconds, then an autovacuum
      worker starts roughly every second
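
As a rough illustration of that cadence (a sketch assuming psql access to the coordinator; launch
interval ≈ autovacuum_naptime / number of databases):

```sh
psql -d postgres -Atc "SHOW autovacuum_naptime;"
psql -d postgres -Atc "SELECT count(*) FROM pg_database;"
```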

## Question 2: What exactly is “catalog”?
### PG
* pg_catalog, pg_toast

### GP
* pg_catalog, pg_toast, pg_aoseg

#### pg_aoseg
* A regular vacuum on an AO table already vacuums its aoseg table.
* These AO catalog tables will have a very different bloat profile as compared to regular
  catalog tables. Their bloat will be DML dependent and not DDL dependent, making the amount of
  bloat a fraction of the tuples in the parent table. Bloat will only be exacerbated by OLTP
  workloads on these tables. Since we are worried enough to disable autovacuum on user tables,
  should we exclude these too?
* OTOH one may argue that toast tables also fall in this bucket and they are actually processed
  alongside regular tables in the autovacuum workflow.
* Contention with user-table vacuums?
    * The portion of a regular AO vacuum that vacuums the aux tables will NOT conflict with a vacuum
      on the parent AO table, so we are good here.
* Thus we have decided to add more levels to the autovacuum scope GUC: `catalog`, `catalog_ao_aux`,
  and `user` (TBD), where `catalog` covers pg_catalog and pg_toast, while `catalog_ao_aux` covers
  pg_catalog + pg_toast + pg_aoseg. [1]
* We have also decided to add an option to VACUUM, `AO_AUX_ONLY`, indicating that only the
  auxiliary tables of the indicated AO table should be vacuumed (see the sketch below).
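
A minimal sketch of using these knobs, assuming the GUC and the VACUUM option land with the names
above and that `AO_AUX_ONLY` takes the parenthesized-option form (`mydb` and `my_ao_table` are
placeholders):

```sh
# Widen the autovacuum scope to include AO auxiliary tables cluster-wide,
# then reload settings on all segments.
gpconfig -c gp_autovacuum_scope -v catalog_ao_aux
gpstop -u

# Manually vacuum only the auxiliary tables of a single AO table.
psql -d mydb -c "VACUUM (AO_AUX_ONLY) my_ao_table;"
```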


## Question 3: Mode of execution
* Is running autovacuum on the QD and having each VACUUM dispatch to the segments advantageous, or
  is the current idea of executing on each segment independently better?
    * Running from the coordinator has the advantage of predictability: all the segments will
      execute the vacuum together.
    * This approach is NOT viable at all for user tables because of potential data skew; the
      coordinator does not have enough information to make triggering decisions, only the QEs do.
      It is only viable for catalog tables.
    * We primarily care about predictability because we don't want catalog-only autovacuum to
      suddenly become unfair (especially on segments with a lot of skew). GP's performance is often
      decided by the slowest segment. The concern is that if there is heavy skew on any given
      segment and autovacuum kicks in and becomes unfair, we will end up with even worse performance.
        * This is a concern only if catalog-only-autovacuum is unfair. We do not see evidence of
          that being the case.

## Question 4: Fairness
* Does it adversely affect system performance?
    * This is what most of this presentation is about.
    * Too much IO?
    * Too much CPU consumption?
    * Too much memory consumption?
        * Can it create OOMs?
    * Interaction w/ resource management
        * Resource queues?
            * Automatic vacuums will bypass resource queues, which is fine since we want to only
              restrict user workloads.
            * This is because of how ResLockPortal() is called (only for role=dispatch and non-su).
            * Autovacuum executes as a superuser (BOOTSTRAP_SUPERUSERID, OID 10).
        * Resource groups?
            * Autovacuum/autoanalyze will go into the admin resource group. Do we want to bypass
              that?
            * Even emergency (anti-wraparound) autovacuum will go into the admin resource group. We
              should bypass it in that case at least.
        * A discussion of this topic has been started.

## Catalog Bloat
We used a shell script and psql to create a table inside an aborted transaction in a loop, once
every 2 seconds. One round creates 2,909 dead tuples on pg_class and 15,015 dead tuples on
pg_attribute, plus some on others such as pg_type. The SQL for one round is below, followed by a
sketch of the driver loop.

```sql
BEGIN;
CREATE TABLE bloat_tbl(i int, j int, k int, l int, m int, n int, o int, p int)
DISTRIBUTED BY (i) PARTITION BY RANGE (j) (START (0) END (1000) EVERY (1));
ABORT;
```
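
A minimal sketch of the driver loop described above, assuming the SQL block is saved as
`bloat_round.sql` (the file and database names are illustrative):

```sh
# Re-run the aborted CREATE TABLE every 2 seconds to accumulate dead catalog tuples.
while true; do
    psql -d bloatdb -f bloat_round.sql
    sleep 2
done
```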

## Testing overview
* We profiled a cluster with catalog bloat, with and without autovacuum enabled
* All tests conducted with a 3-segment cluster on a developer workstation
* Settings were kept to defaults, except for autovacuum and autovacuum logging
* The entire cluster was shut down, cleaned out, and rebuilt between each test run
* As a follow-up to comments on our original presentation, we re-ran all analyses with seed bloat
  present to test pressure on shared_buffers. Specifically, we ran genBloat against a bloatdb
  until it was ~10GB, then stopped and ran the tests as described.
* These test results followed the same patterns both with and without the presence of seed bloat, so
  we show only the results with seed bloat here.


## TPC-DS Methodology
We ran TPC-DS on a 3-segment cluster on a developer workstation. TPC-DS defaults to AOCO tables;
we tested both AOCO and heap.

Options set for testing:
```sh
GEN_DATA_SCALE="10" <------- 10GB of data loaded and used
MULTI_USER_COUNT="2" <------- 2 concurrent users running a suite of queries
```

## TPC-DS Results – AOCO

| Metric | No Autovacuum | With Autovacuum |
| --- | ---: | ---: |
| Load (seconds) | 251 | 278 |
| Analyze (seconds) | 133 | 154 |
| 1 User Queries (seconds) | 817 | 831 |
| Concurrent Queries (seconds) | 2246 | 2257 |
| Throughput Test Time (seconds) | 1128 | 1133 |
| TPC-DS v2.2.0 Score | 26948 | 25861 |

## TPC-DS Results – Heap

| Metric | No Autovacuum | With Autovacuum |
| --- | ---: | ---: |
| Load (seconds) | 291 | 300 |
| Analyze (seconds) | 125 | 140 |
| 1 User Queries (seconds) | 848 | 893 |
| Concurrent Queries (seconds) | 2343 | 2438 |
| Throughput Test Time (seconds) | 1172 | 1224 |
| TPC-DS v2.2.0 Score | 25014 | 23989 |


## CPU Profiling Methodology
We used the perf tool to capture CPU load for 120 seconds during the 05-sql portion of TPC-DS.

This was started manually, and is somewhat subject to normal variance on the workstation. We
attempted to hold variables constant for AV vs. no-AV runs so that they would be comparable.

Commands:
```sh
sudo perf record -a -- sleep 120
sudo perf diff noAVperf.data withAVperf.data --dsos …/bin/postgres
```

## CPU Profiling Results
* Note: baseline is no AV, delta is with AV
### AOCO
```
  Baseline  Delta Abs  Symbol
     2.48%     +0.60%      [.] ExecScanHashBucket
     2.83%     -0.47%      [.] slot_deform_heap_tuple
     0.75%     -0.35%      [.] fill_val
     0.75%     -0.34%      [.] heap_compute_data_size
     2.84%     +0.31%      [.] aocs_getnext
     0.48%     -0.26%      [.] sub_abs
     2.20%     -0.26%      [.] AllocSetAlloc
     0.56%     -0.24%      [.] heap_fill_tuple
     8.96%     +0.23%      [.] pg_strtok
     0.31%     -0.23%      [.] div_var
     0.51%     -0.22%      [.] hash_bytes
```


### Heap
```
  Baseline  Delta Abs  Symbol
     6.86%     +4.91%      [.] slot_deform_heap_tuple
     4.57%     -2.11%      [.] ExecScanHashBucket
     5.85%     +1.12%      [.] ExecInterpExpr
     1.14%     -1.02%      [.] tuplehash_lookup_hash_internal
     0.44%     +0.98%      [.] heapgettup_pagemode
     1.17%     -0.74%      [.] tuplehash_insert_hash_internal
     0.26%     +0.72%      [.] heapgetpage
     0.68%     -0.68%      [.] pg_mblen
     0.73%     -0.66%      [.] BufFileRead
     2.99%     +0.64%      [.] pg_checksum_block
     0.88%     -0.61%      [.] hash_bytes
     1.09%     -0.59%      [.] tts_minimal_store_tuple
     0.58%     -0.58%      [.] pg_mbstrlen_with_len
```

## Memory Profiling Methodology
We used a combination of code review and instrumentation to capture this information.

Autovacuum uses its own memory context, "Autovacuum Launcher". Each table vacuumed clears and
re-uses a small child context, "Autovacuum Portal". The "Vacuum" context is a child of this portal
context and is used within vacuum() for the actual work. The portal context is kept small and is
not allowed to accumulate allocations across tables, so the allocation overhead of vacuuming stays
small: there is not much memory usage, and maintenance_work_mem is fully respected. Thus the design
of this section of the code results in a very low risk of OOM crashes.

We tracked the memory usage of both “Autovacuum Launcher” and “Vacuum” contexts throughout TPC-DS
and report the maximum memory allocated per AV worker at any given time.
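
For reference, a similar snapshot can be taken without custom instrumentation by asking an
autovacuum worker to dump its memory contexts from gdb (a sketch only, not the exact
instrumentation behind the numbers below; the process-title pattern may vary by platform):

```sh
# Find a running autovacuum worker and dump its memory context tree.
AV_PID=$(pgrep -f "postgres: autovacuum worker" | head -n1)
sudo gdb -p "$AV_PID" -batch -ex 'call MemoryContextStats(TopMemoryContext)'
# The "Grand total: ... bytes" lines land in that segment's log (the worker's stderr).
```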

## Memory Profiling Results
### AOCO
**AV worker: Grand total: 270336 bytes in 7 blocks; 140240 free (11 chunks); 130096 used**
So, a maximum of ~270 KB allocated per worker.

### Heap
**AV worker: Grand total: 40960 bytes in 4 blocks; 24168 free (11 chunks); 16792 used**
So, a maximum of ~41 KB allocated per worker.

## Disk Profiling Methodology
To measure disk read/write load, we began with bcc's filetop.py, heavily modified to capture and
aggregate file usage by PID until cancelled.

To identify AV worker PIDs, we instrumented StartAutoVacWorker to also write a pidfile to /tmp so
that short-lived PIDs could be differentiated.

In general, this section of results was the most challenging to capture, and has the most room for
improved methodology.

## Disk Profiling Results
### No Autovacuum AOCO
| PID | READS | WRITES | RKBYTES | WKBYTES | CMD |
| --- | ---: | ---: | ---: | ---: | --- |
| 541293 | 87724 | 69051 | 2741.38 | 2211.31 | postgres: other (top PID) |

Total postgres file usage: READS: 9,018,723, WRITES: 7,487,920, RKBYTES: 277,713.1, WKBYTES:  2695.06

### With Autovacuum AOCO
| PID | READS | WRITES | RKBYTES | WKBYTES | CMD |
| --- | ---: | ---: | ---: | ---: | --- |
| 463016 | 97977 | 41935 | 3061.78 | 1465.26 | postgres: other (top PID) |
| 481228 | 6349 | 5402 | 186.59 | 173.22 | postgres: autovacuum (top autovacuum PID) |

Total postgres file usage: READS: 9,423,600, WRITES: 7,374,779, RKBYTES: 288,774.42, WKBYTES:    189.1

### No Autovacuum Heap
| PID | READS | WRITES | RKBYTES | WKBYTES | CMD |
| --- | ---: | ---: | ---: | ---: | --- |
| 296239 | 262851 | 225224 | 8214.09 | 7034.01 | postgres: other (top PID) |

Total postgres file usage: READS: 30,950,945, WRITES: 9,929,437, RKBYTES: 966,533.21, WKBYTES:   4267.8

### With Autovacuum Heap
| PID | READS | WRITES | RKBYTES | WKBYTES | CMD |
| --- | ---: | ---: | ---: | ---: | --- |
| 217785 | 282199 | 197937 | 8818.72 | 6180.8 | postgres: other (top PID) |
| 235871 | 7507 | 6958 | 224.05 | 219.56 | postgres: autovacuum (top autovacuum PID) |

Total postgres file usage: READS: 32,273,330, WRITES: 9,954,598, RKBYTES: 1,005,771.19, WKBYTES:      0.0

## References
2. Citus presentation on tuning autovacuum:
3. Robert Haas on tuning maintenance_work_mem:
5. Azure Postgres autovacuum recommendations:



Brent Doil

May 17, 2023, 3:25:32 PM
to Andrew, gpdb...@greenplum.org
We followed up on the catalog autovacuum performance and used the following testing steps to further analyze performance on a larger scale.
  • Used TPC-DS with scale factor 10 as the workload.
  • Compiled the server with '-O3'.
  • gpdemo cluster with primaries and mirrors on separate NVMe disks:
    • fsync = on
    • gp_vmem_protect_limit = 32768
    • max_connections = 750
  • Scenario 1: Generate catalog bloat, then run TPC-DS.
  • Scenario 2: Begin with no bloat, generate bloat while running TPC-DS.
  • We wanted to generate a lot of bloat quickly:
    • Used dynamic SQL to build the DDL for a 1200-column (bool) heap table with 1000 partitions
      (sketched below).
    • Ran BEGIN..CREATE TABLE..ABORT via script; the command ran twice each across 6 databases.
    • The result was ~730 million dead catalog tuples in the cluster.
    • For scenario 2 I ran a single BEGIN..CREATE TABLE..ABORT transaction in a tight loop.
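
A sketch of that bloat generator (the exact script isn't reproduced here; the table/column names
and the int key used for distribution and partitioning are illustrative):

```sh
# Build "c1 bool, c2 bool, ..., c1200 bool" and create the table inside an
# aborted transaction, leaving only dead catalog tuples behind.
COLS=$(seq -f "c%g bool" 1 1200 | paste -sd, -)
psql -d bloatdb <<SQL
BEGIN;
CREATE TABLE bloat_wide (id int, $COLS)
DISTRIBUTED BY (id)
PARTITION BY RANGE (id) (START (0) END (1000) EVERY (1));
ABORT;
SQL
```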
Metrics used:
  • DB perf:
    • TPC-DS v2 score
    • gp_stat_progress_vacuum
  • Disk I/O
    • sudo iotop -b -o -a -d 60 -t
    • iostat -zxs 60
  • CPU
    • perf record -F 800 -g --call-graph dwarf -o /data/perf_test_default.out -a -- sleep 300
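
For example, per-segment vacuum progress can be polled during the run (assuming
gp_stat_progress_vacuum exposes gp_segment_id alongside the usual pg_stat_progress_vacuum columns):

```sh
psql -d postgres -c "SELECT gp_segment_id, datname, relid, phase,
                            heap_blks_scanned, heap_blks_total
                     FROM gp_stat_progress_vacuum;"
```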

Summary of findings:
  • If there is significant catalog bloat, concurrent AV during TPC-DS with default AV settings is a
    performance hit of 20-30%. NOTE: tests were run on a local demo cluster with primaries and
    mirrors on separate mounts.
  • Starting with no catalog bloat, generating consistent bloat during TPC-DS with default AV
    settings is a performance hit of ~15%.

  • AV generates a non-trivial amount of WAL. How quickly WAL is sync'd to the mirrors was the primary bottleneck.

  • VACUUM uses a 256KB ring buffer instead of shared_buffers. When a page in the ring buffer is
    dirtied, it's not evicted immediately. When the ring buffer fills with dirty pages, a WAL flush
    is forced. Due to the small buffer size, this causes a lot of WAL flushes. This issue has been
    addressed in a recent upstream discussion:
    https://www.postgresql.org/message-id/20230111182720....@awork3.anarazel.de Even so, increasing
    shared_buffers can improve performance by reducing cache misses when reading into the ring buffer.

  • The segment AV launchers act independently, and each uses its own calculation of
    autovacuum_vacuum_cost_limit. This needs to be taken into consideration if segments are sharing
    disks or NICs.

  • Catalog tables vary widely in size: pg_attribute/pg_type/pg_class/pg_stat_last_operation can be
    very large while others are small.

  • The default autovacuum_vacuum_threshold = 50 and autovacuum_vacuum_scale_factor = 0.2 mean a
    table is eligible for AV when its dead tuples exceed autovacuum_vacuum_threshold +
    autovacuum_vacuum_scale_factor * reltuples, i.e. 50 + 20% of the table's row count (for example,
    200,050 dead tuples for a 1,000,000-row table).

  • The recommendation is to reduce the scale factor so AV is triggered more often with less overall
    work to do on any given table:
    • autovacuum_vacuum_threshold = 500 - bump this up from 50 to 500 so we aren't constantly
      triggering AV on small catalog updates.
    • autovacuum_vacuum_scale_factor = 0.05 - reduce the scale factor to 5% of table size.
  • autovacuum_work_mem is used by AV to allocate memory for the TID array of dead tuples while
    scanning a relation. If the array fills up, AV performs vacuum/pruning/index cleanup on those
    TIDs and, when finished, resumes the scan. We can reduce the number of scan passes by setting
    this value higher. Note that by default it is -1 and falls back to maintenance_work_mem.
  • The recommendation is to increase it slightly so fewer passes are performed on larger catalog
    tables:
    • autovacuum_work_mem = 128MB

  • The default cost model on PG12 treats disk writes as 2x the cost of disk reads. In reality the skew is much further toward writes. Upstream has changed the default cost for vacuum_cost_page_miss from 10 to 2. Recommendation is to follow suit. Note that this will also affect normal vacuum.
    • vacuum_cost_page_miss = 2

  • To modify the aggressiveness of AV, autovacuum_vacuum_cost_limit is the primary tuning knob. When this limit is hit, the worker will sleep for autovacuum_vacuum_cost_delay. 

  • It's generally better to keep autovacuum_vacuum_cost_delay low to avoid large bursts of IO followed by troughs. Recommend tuning between 1ms-5ms.

  • autovacuum_vacuum_cost_limit can be increased to make AV more aggressive. Setting it to 1000
    sees a performance hit of 20% while increasing AV throughput by ~10-15%.

  • shared_buffers should be increased by default. Need to follow-up on tuning this.

  • Most AV settings can be changed with gpconfig and gpstop -u (see the sketch below). Currently,
    settings are re-read each time a new table is chosen. Cost limit, cost delay, and AV altogether
    can be modified this way to change AV priority. Upstream has recently introduced the ability to
    change these settings while working on a single table, which might be useful for customers in
    the future. https://github.com/postgres/postgres/commit/7d71d3dd080b9b147402db3365fe498f74704231
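
A minimal sketch of applying the recommendations above with gpconfig and then reloading (the
values are the ones suggested in this thread; adjust to taste):

```sh
gpconfig -c autovacuum_vacuum_threshold -v 500
gpconfig -c autovacuum_vacuum_scale_factor -v 0.05
gpconfig -c autovacuum_work_mem -v 131072          # kB units, i.e. 128MB
gpconfig -c vacuum_cost_page_miss -v 2
gpconfig -c autovacuum_vacuum_cost_delay -v 2      # ms units, within the suggested 1-5ms range
gpstop -u                                          # reload config without restarting the cluster
```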

  • Generally, CPU and memory constraints aren't a concern for AV workers. Memory is limited by
    autovacuum_work_mem/maintenance_work_mem. perf reporting shows CPU cycles for AV code paths as
    3-5% of the overall system in the worst case. IO is a more realistic concern: AV competes with
    WAL readers/writers, bgwriters, checkpointers, etc. for disk IO, and these can be a very
    significant portion of overall IO.

  • Disk IO for all these background processes should be constrained by resource groups. The best way to allocate IO for these tasks is by throttling latency first, not IOPS or throughput. Underallocating IO resources for AV/WAL workers can bring the system to a halt.

  • The biggest performance boost was increasing wait_for_replication_threshold from 1024 to 4096.
    At the default 1024, local disks were saturated at 3k IOPS and 250MB/s. Upping it to 4096
    increased these to 4-5k IOPS and 1.5GB/s. Shipping the WAL over the network needs to be
    throttled, so modifying this is not recommended in production clusters. An advanced tuning guide
    may want to touch on this.
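
A benchmark-only sketch of that change (not for production clusters, per the caveat above; the
value is in kB of WAL written before waiting for replication):

```sh
gpconfig -c wait_for_replication_threshold -v 4096
gpstop -u
```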

  • gpload is a big culprit for creating bloat. It creates and drops external tables as part of the workflow, bloating pg_foreign_table and pg_attribute. gpload also generates the bloat in a single transaction, resulting in a large burst of AV work when it commits/aborts. More work should be done to either improve gpload or push customers to alternative tools for loading data.

The next step is to perform a similar exercise focused on autoanalyze.

Regards,
Brent



From: Andrew <gpdb...@andrewrepp.com>
Sent: Wednesday, February 8, 2023 11:23 AM
To: gpdb...@greenplum.org <gpdb...@greenplum.org>
Subject: Catalog Autovacuum
 