# TL;DR
* Bloated catalogs can slow down all user workloads and are a common cause of customer tickets.
* We have deemed catalog-only autovacuum to be prompt, fair to user workloads, and a good
preventative measure for GPDB7 clusters.
* The default scope of autovacuum will be restricted to tables under pg_catalog and pg_toast only,
for all databases in the cluster (we have introduced gp_autovacuum_scope for users to adjust this).
* We will be exploring the viability of autovacuum on user tables soon.
* Catalog-only autovacuum will execute as a daemon on each segment individually.
* Catalog-only autovacuum will bypass resource queues and possibly resource groups as well.
* We are keeping the autovacuum threshold GUCs at their PG-provided defaults for catalog-only
autovacuum, since catalog tables are heap tables.
***
# Catalog Autovacuum
## Motivation
* Regularly vacuuming a database is vital for performance
* Hundreds of customer tickets citing “bloat”
* Catalog tables in particular should be low-cost to vacuum and are high-impact
* Every command/query involves catalog lookups
* pg_class, pg_attribute, pg_type, etc. can easily get bloated over time. It is very easy to
accidentally generate a lot of catalog bloat with normal OLAP usage patterns (e.g. creating
intermediate tables inside complex UDFs with loops, or creating intermediate tables in numerous
explicit transaction blocks)
## Question 1: Promptness
* It should kick in as soon as the bloat threshold is reached
* With some acceptable delay
* Controlled by autovacuum_naptime and the number of databases
* If the installation has N databases, a new worker will be launched roughly every
autovacuum_naptime/N seconds
* For example, if a server has 60 databases and autovacuum_naptime is set to 60 seconds, then an
autovacuum worker starts every second (see the sketch below)
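To sanity-check the expected cadence on a live cluster, the relevant knobs can be inspected
directly (a minimal sketch; `autovacuum_naptime` and `autovacuum_max_workers` are standard PG GUCs):
```sql
-- Inspect the knobs that control autovacuum launch cadence (PG defaults shown)
SHOW autovacuum_naptime;      -- 1min by default
SHOW autovacuum_max_workers;  -- 3 by default
-- N = databases eligible for autovacuum; expected launch cadence ~ naptime / N
SELECT count(*) AS n_databases FROM pg_database WHERE datallowconn;
```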
## Question 2: What exactly is “catalog”?
### PG
* pg_catalog, pg_toast
### GP
* pg_catalog, pg_toast, pg_aoseg
#### pg_aoseg
* Regular vacuum on AO tables already vacuums the aoseg table.
* These AO catalog tables will have a very different bloat profile as compared to regular
catalog tables. Their bloat will be DML dependent and not DDL dependent, making the amount of
bloat a fraction of the tuples in the parent table. Bloat will only be exacerbated by OLTP
workloads on these tables. Since we are worried enough to disable autovacuum on user tables,
should we exclude these too?
* On the other hand, one may argue that toast tables also fall into this bucket, and they are
actually processed alongside regular tables in the autovacuum workflow.
* Contention with user-table vacuums?
* The portion of a regular AO vacuum that vacuums the aux tables will NOT conflict with a
vacuum on the parent AO table. So we are good here.
* Thus we have decided to add more levels to the autovacuum scope GUC: `catalog`, `catalog_ao_aux`,
and `user` (TBD), where `catalog` covers pg_catalog and pg_toast, and `catalog_ao_aux` covers
pg_catalog, pg_toast, and pg_aoseg. [1]
* We have also decided to add an option to VACUUM, `AO_AUX_ONLY`, indicating that only the
auxiliary tables of the indicated AO table should be vacuumed (both are sketched below).
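For illustration, here is how these knobs might look in use (a sketch only: the GUC value names
and the VACUUM option syntax are as proposed above and may change; `my_ao_table` is a hypothetical
AO table):
```sql
-- Proposed: widen autovacuum's scope to include the AO auxiliary tables
-- (in practice this would likely be set cluster-wide, e.g. via gpconfig)
SET gp_autovacuum_scope = 'catalog_ao_aux';

-- Proposed: manually vacuum only the auxiliary tables of one AO table
VACUUM (AO_AUX_ONLY) my_ao_table;
```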
## Question 3: Mode of execution
* Is it advantageous to run autovacuum on the QD and have each VACUUM dispatch to the segments,
or is the current idea of executing on each segment independently better?
* Running from the coordinator has the advantage of predictability: all the segments will
execute the vacuum together.
* This approach is NOT viable at all for user tables. This is because of potential data
skew. The coordinator does not have enough info for triggering decisions; only the QEs
do. This is only viable for catalog tables.
* We primarily care about predictability since we don't want catalog-only autovacuum to
suddenly become unfair (especially on segments with a lot of skew). GP's performance is often
determined by the slowest segment. The concern is that if there is heavy skew on any given
segment and autovacuum kicks in and becomes unfair, we will end up with even worse performance.
* This is a concern only if catalog-only autovacuum is unfair. We do not see evidence of
that being the case.
## Question 4: Fairness
* Does it adversely affect system performance?
* This is what most of this presentation is about.
* Too much IO?
* Too much CPU consumption?
* Too much memory consumption?
* Can it create OOMs?
* Interaction w/ resource management
* Resource queues?
* Automatic vacuums will bypass resource queues, which is fine since we only want to
restrict user workloads.
* This is because of how ResLockPortal() is called (only for role=dispatch and non-superusers).
* Autovacuum executes as a superuser (BOOTSTRAP_SUPERUSERID, OID 10).
* Resource groups?
* Autovacuum/autoanalyze will go into the admin resource group. Do we want to bypass
that?
* Even emergency (anti-wraparound) autovacuum will go into the admin resource group. We
should bypass it for this case at least.
* A discussion of this topic has been started (one way to observe where autovacuum workers
land is sketched below).
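As a starting point for that discussion, active autovacuum workers and their resource group can
be observed on a running cluster (a sketch; this assumes GPDB's pg_stat_activity exposes the
rsgname column, and short-lived workers may be missed):
```sql
-- Spot active autovacuum workers and the resource group they landed in
SELECT pid, rsgname, query
FROM pg_stat_activity
WHERE query LIKE 'autovacuum:%';
```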
## Catalog Bloat
We used a shell script and psql to create a table in aborted transactions in a loop every 2
seconds. One round creates 2,909 dead tuples in pg_class and 15,015 dead tuples in pg_attribute,
plus some in others such as pg_type.
```sh
# Repeatedly create catalog bloat via an aborted CREATE TABLE, once every 2 seconds
while true; do
    psql -c "BEGIN;
             CREATE TABLE bloat_tbl(i int, j int, k int, l int, m int, n int, o int, p int)
             DISTRIBUTED BY (i) PARTITION BY RANGE (j) (START (0) END (1000) EVERY (1));
             ABORT;"
    sleep 2
done
```
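The accumulating bloat can be watched via the standard statistics views; with the PG-default
thresholds we kept, autovacuum triggers on a table once n_dead_tup exceeds
autovacuum_vacuum_threshold + autovacuum_vacuum_scale_factor * reltuples (50 + 0.2 * reltuples
by default):
```sql
-- Top bloated catalog tables by dead-tuple count
SELECT relname, n_live_tup, n_dead_tup
FROM pg_stat_all_tables
WHERE schemaname = 'pg_catalog'
ORDER BY n_dead_tup DESC
LIMIT 5;
```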
## Testing overview
* We profiled a cluster with catalog bloat, with and without autovacuum enabled
* All tests conducted with a 3-segment cluster on a developer workstation
* Settings were kept to defaults, except for autovacuum and autovacuum logging
* The entire cluster was shut down, cleaned out, and rebuilt between each test run
* As a follow-up based on the comments on our original presentation, we re-ran all analyses with
seed bloat present to test pressure on shared_buffers. Specifically, we ran genBloat into a
bloatdb until it reached ~10GB, then stopped and ran the tests as described.
* These test results followed the same patterns both with and without the presence of seed bloat, so
we show only the results with seed bloat here.
## TPC-DS Methodology
We ran TPC-DS on a developer workstation with a 3-segment cluster. TPC-DS defaults to AOCO tables;
we tested both AOCO and heap.
Options set for testing:
```sh
GEN_DATA_SCALE="10"      # 10GB of data loaded and used
MULTI_USER_COUNT="2"     # 2 concurrent users running a suite of queries
```
## TPC-DS Results – AOCO
| Metric | No Autovacuum | With Autovacuum |
| --- | --- | --- |
| Load (seconds) | 251 | 278 |
| Analyze (seconds) | 133 | 154 |
| 1 User Queries (seconds) | 817 | 831 |
| Concurrent Queries (seconds) | 2246 | 2257 |
| Throughput Test Time (seconds) | 1128 | 1133 |
| TPC-DS v2.2.0 Score | 26948 | 25861 |
## TPC-DS Results – Heap
| Metric | No Autovacuum | With Autovacuum |
| --- | --- | --- |
| Load (seconds) | 291 | 300 |
| Analyze (seconds) | 125 | 140 |
| 1 User Queries (seconds) | 848 | 893 |
| Concurrent Queries (seconds) | 2343 | 2438 |
| Throughput Test Time (seconds) | 1172 | 1224 |
| TPC-DS v2.2.0 Score | 25014 | 23989 |
## CPU Profiling Methodology
We used the perf tool to capture CPU load for 120 seconds during the 05-sql portion of TPC-DS.
This was started manually and is somewhat subject to normal variance on the workstation. We
attempted to hold variables constant between the AV and no-AV runs so that they would be
comparable.
Commands:
```sh
sudo perf record -a -- sleep 120
sudo perf diff noAVperf.data withAVperf.data --dsos …/bin/postgres
```
## CPU Profiling Results
* Note: baseline is no AV, delta is with AV
### AOCO
| Baseline | Delta Abs | Symbol |
| --- | --- | --- |
| 2.48% | +0.60% | [.] ExecScanHashBucket |
| 2.83% | -0.47% | [.] slot_deform_heap_tuple |
| 0.75% | -0.35% | [.] fill_val |
| 0.75% | -0.34% | [.] heap_compute_data_size |
| 2.84% | +0.31% | [.] aocs_getnext |
| 0.48% | -0.26% | [.] sub_abs |
| 2.20% | -0.26% | [.] AllocSetAlloc |
| 0.56% | -0.24% | [.] heap_fill_tuple |
| 8.96% | +0.23% | [.] pg_strtok |
| 0.31% | -0.23% | [.] div_var |
| 0.51% | -0.22% | [.] hash_bytes |
### Heap
| Baseline | Delta Abs | Symbol |
| --- | --- | --- |
| 6.86% | +4.91% | [.] slot_deform_heap_tuple |
| 4.57% | -2.11% | [.] ExecScanHashBucket |
| 5.85% | +1.12% | [.] ExecInterpExpr |
| 1.14% | -1.02% | [.] tuplehash_lookup_hash_internal |
| 0.44% | +0.98% | [.] heapgettup_pagemode |
| 1.17% | -0.74% | [.] tuplehash_insert_hash_internal |
| 0.26% | +0.72% | [.] heapgetpage |
| 0.68% | -0.68% | [.] pg_mblen |
| 0.73% | -0.66% | [.] BufFileRead |
| 2.99% | +0.64% | [.] pg_checksum_block |
| 0.88% | -0.61% | [.] hash_bytes |
| 1.09% | -0.59% | [.] tts_minimal_store_tuple |
| 0.58% | -0.58% | [.] pg_mbstrlen_with_len |
## Memory Profiling Methodology
We used a combination of code review and instrumentation to capture this information.
Autovacuum uses its own memory context, "Autovacuum Launcher". Each table vacuumed clears and
re-uses a small child context, "Autovacuum Portal". The "Vacuum" context is a child of this portal
context and is used within vacuum() for the actual work. The portal context is kept small and is
not allowed to accumulate across tables, and the overhead for vacuuming is likewise kept small.
This code keeps allocation overhead very low: there is not much memory usage, and
maintenance_work_mem is fully respected. Thus the design of this section of the code results in
very low risk of OOM crashes.
We tracked the memory usage of both “Autovacuum Launcher” and “Vacuum” contexts throughout TPC-DS
and report the maximum memory allocated per AV worker at any given time.
## Memory Profiling Results
### AOCO
**AV worker: Grand total: 270336 bytes in 7 blocks; 140240 free (11 chunks); 130096 used**
So, a maximum of ~270 KB was allocated per worker.
### Heap
**AV worker: Grand total: 40960 bytes in 4 blocks; 24168 free (11 chunks); 16792 used**
So, a maximum of ~41 KB was allocated per worker.
## Disk Profiling Methodology
To measure disk read/write load, we began with bcc's filetop.py, heavily modified to capture and
aggregate file usage by PID until cancelled. To identify AV worker PIDs, we instrumented
StartAutoVacWorker to also write a pidfile to /tmp so that short-lived PIDs could be
differentiated. In general, this section of results was the most challenging to capture and has
the most room for improved methodology.
## Disk Profiling Results
### No Autovacuum AOCO
| PID | READS | WRITES | RKBYTES | WKBYTES | CMD |
| --- | --- | --- | --- | --- | --- |
| 541293 | 87724 | 69051 | 2741.38 | 2211.31 | postgres: other (top PID) |

Total postgres file usage: READS: 9,018,723, WRITES: 7,487,920, RKBYTES: 277,713.1, WKBYTES: 2695.06
### With Autovacuum AOCO
| PID | READS | WRITES | RKBYTES | WKBYTES | CMD |
| --- | --- | --- | --- | --- | --- |
| 463016 | 97977 | 41935 | 3061.78 | 1465.26 | postgres: other (top PID) |
| 481228 | 6349 | 5402 | 186.59 | 173.22 | postgres: autovacuum (top autovacuum PID) |

Total postgres file usage: READS: 9,423,600, WRITES: 7,374,779, RKBYTES: 288,774.42, WKBYTES: 189.1
### No Autovacuum Heap
| PID | READS | WRITES | RKBYTES | WKBYTES | CMD |
| --- | --- | --- | --- | --- | --- |
| 296239 | 262851 | 225224 | 8214.09 | 7034.01 | postgres: other (top PID) |

Total postgres file usage: READS: 30,950,945, WRITES: 9,929,437, RKBYTES: 966,533.21, WKBYTES: 4267.8
### With Autovacuum Heap
| PID | READS | WRITES | RKBYTES | WKBYTES | CMD |
| --- | --- | --- | --- | --- | --- |
| 217785 | 282199 | 197937 | 8818.72 | 6180.8 | postgres: other (top PID) |
| 235871 | 7507 | 6958 | 224.05 | 219.56 | postgres: autovacuum (top autovacuum PID) |

Total postgres file usage: READS: 32,273,330, WRITES: 9,954,598, RKBYTES: 1,005,771.19, WKBYTES: 0.0
## References
2. Citus presentation on tuning autovacuum:
3. Robert Haas on tuning maintenance_work_mem:
5. Azure Postgres autovacuum recommendations: