Catalog Autovacuum


Andrew

Feb 8, 2023, 11:24:08 AM
to gpdb...@greenplum.org
# TL;DR
* Bloated catalogs can cause slowness for all user workloads and are a common cause of customer tickets.
* We have deemed catalog-only autovacuum to be prompt, fair to user workloads, and a good
  preventative measure for GPDB7 clusters.
* The default scope of autovacuum will be restricted to tables under pg_catalog and pg_toast only
  for all databases in the cluster (we have introduced gp_autovacuum_scope for users to adjust this).
* We will be exploring the viability of autovacuum on user tables soon.
* Catalog-only autovacuum will execute as a daemon on each segment individually.
* Catalog-only autovacuum will bypass resource queues and possibly resource groups as well.
* We are keeping the autovacuum threshold GUCs at their PostgreSQL-provided defaults for catalog-only
  autovacuum, since catalog tables are heap tables.
***

# Catalog Autovacuum
## Motivation
* Regularly vacuuming a database is vital for performance
* Hundreds of customer tickets citing “bloat”
* Catalog tables in particular should be low-cost to vacuum and high-impact
    * Every command/query involves catalog lookups
* pg_class, pg_attribute, pg_type, etc. can easily get bloated over time. It is very easy to
  accidentally generate a lot of catalog bloat with normal OLAP usage patterns (e.g. creating
  intermediate tables inside complex UDFs with loops, or creating intermediate tables in numerous
  explicit transaction blocks)

## Question 1: Promptness
* It should kick in as soon as the bloat threshold is reached
    * With some acceptable delay
* Controlled by autovacuum_naptime and the number of databases
    * If the installation has N databases, a new worker will be launched ~ every
      autovacuum_naptime/N seconds
    * If a server has 60 databases and autovacuum_naptime is set to 60 seconds, then an autovacuum
      worker starts roughly every second
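
As a rough illustration of that cadence (a sketch assuming psql access to the coordinator; launch
interval ≈ autovacuum_naptime / number of databases):

```sh
psql -d postgres -Atc "SHOW autovacuum_naptime;"
psql -d postgres -Atc "SELECT count(*) FROM pg_database;"
```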

## Question 2: What exactly is “catalog”?
### PG
* pg_catalog, pg_toast

### GP
* pg_catalog, pg_toast, pg_aoseg

#### pg_aoseg
* A regular vacuum on an AO table already vacuums its aoseg table.
* These AO catalog tables will have a very different bloat profile as compared to regular
  catalog tables. Their bloat will be DML dependent and not DDL dependent, making the amount of
  bloat a fraction of the tuples in the parent table. Bloat will only be exacerbated by OLTP
  workloads on these tables. Since we are worried enough to disable autovacuum on user tables,
  should we exclude these too?
* OTOH one may argue that toast tables also fall in this bucket and they are actually processed
  alongside regular tables in the autovacuum workflow.
* Contention with user-table vacuums?
    * The portion of a regular AO vacuum that vacuums the aux tables will NOT conflict with a vacuum
      on the parent AO table, so we are good here.
* Thus we have decided to add more levels to the autovacuum scope GUC: `catalog`, `catalog_ao_aux`,
  and `user` (TBD), where `catalog` covers pg_catalog and pg_toast, while `catalog_ao_aux` covers
  pg_catalog + pg_toast + pg_aoseg. [1]
* We have also decided to add an option to VACUUM, `AO_AUX_ONLY`, indicating that only the
  auxiliary tables of the indicated AO table should be vacuumed (see the sketch below).
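
A minimal sketch of using these knobs, assuming the GUC and the VACUUM option land with the names
above and that `AO_AUX_ONLY` takes the parenthesized-option form (`mydb` and `my_ao_table` are
placeholders):

```sh
# Widen the autovacuum scope to include AO auxiliary tables cluster-wide,
# then reload settings on all segments.
gpconfig -c gp_autovacuum_scope -v catalog_ao_aux
gpstop -u

# Manually vacuum only the auxiliary tables of a single AO table.
psql -d mydb -c "VACUUM (AO_AUX_ONLY) my_ao_table;"
```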


## Question 3: Mode of execution
* Is running autovacuum on the QD and having each VACUUM dispatch to the segments advantageous, or
  is the current idea of executing on each segment independently better?
    * Running from the coordinator has the advantage of predictability: all the segments will
      execute the vacuum together.
    * This approach is NOT viable at all for user tables because of potential data skew; the
      coordinator does not have enough information to make triggering decisions, only the QEs do.
      It is only viable for catalog tables.
    * We primarily care about predictability because we don't want catalog-only autovacuum to
      suddenly become unfair (especially on segments with a lot of skew). GP's performance is often
      decided by the slowest segment. The concern is that if there is heavy skew on any given
      segment and autovacuum kicks in and becomes unfair, we will end up with even worse performance.
        * This is a concern only if catalog-only-autovacuum is unfair. We do not see evidence of
          that being the case.

## Question 4: Fairness
* Does it adversely affect system performance?
    * This is what most of this presentation is about.
    * Too much IO?
    * Too much CPU consumption?
    * Too much memory consumption?
        * Can it create OOMs?
    * Interaction w/ resource management
        * Resource queues?
            * Automatic vacuums will bypass resource queues, which is fine since we want to only
              restrict user workloads.
            * This is because of how ResLockPortal() is called (only for role=dispatch and non-su).
            * Autovacuum executes as a superuser (BOOTSTRAP_SUPERUSERID, OID 10).
        * Resource groups?
            * Autovacuum/autoanalyze will go into the admin resource group. Do we want to bypass
              that?
            * Even emergency (anti-wraparound) autovacuum will go into the admin resource group. We
              should bypass it in that case at least.
        * A discussion of this topic has been started.

## Catalog Bloat
We used a shell script and psql to create a table inside an aborted transaction in a loop, once
every 2 seconds. One round creates 2,909 dead tuples on pg_class and 15,015 dead tuples on
pg_attribute, plus some on others such as pg_type. The SQL for one round is below, followed by a
sketch of the driver loop.

```sql
BEGIN;
CREATE TABLE bloat_tbl(i int, j int, k int, l int, m int, n int, o int, p int)
DISTRIBUTED BY (i) PARTITION BY RANGE (j) (START (0) END (1000) EVERY (1));
ABORT;
```
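
A minimal sketch of the driver loop described above, assuming the SQL block is saved as
`bloat_round.sql` (the file and database names are illustrative):

```sh
# Re-run the aborted CREATE TABLE every 2 seconds to accumulate dead catalog tuples.
while true; do
    psql -d bloatdb -f bloat_round.sql
    sleep 2
done
```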

## Testing overview
* We profiled a cluster with catalog bloat, with and without autovacuum enabled
* All tests conducted with a 3-segment cluster on a developer workstation
* Settings were kept to defaults, except for autovacuum and autovacuum logging
* The entire cluster was shut down, cleaned out, and rebuilt between each test run
* As a follow-up to comments on our original presentation, we re-ran all analyses with seed bloat
  present to test pressure on shared_buffers. Specifically, we ran genBloat against a bloatdb
  until it was ~10GB, then stopped and ran the tests as described.
* These test results followed the same patterns both with and without the presence of seed bloat, so
  we show only the results with seed bloat here.


## TPC-DS Methodology
We ran TPC-DS on a 3-segment cluster on a developer workstation. TPC-DS defaults to AOCO tables;
we tested both AOCO and heap.

Options set for testing:
```sh
GEN_DATA_SCALE="10" <------- 10GB of data loaded and used
MULTI_USER_COUNT="2" <------- 2 concurrent users running a suite of queries
```

## TPC-DS Results – AOCO

| Metric | No Autovacuum | With Autovacuum |
| --- | ---: | ---: |
| Load (seconds) | 251 | 278 |
| Analyze (seconds) | 133 | 154 |
| 1 User Queries (seconds) | 817 | 831 |
| Concurrent Queries (seconds) | 2246 | 2257 |
| Throughput Test Time (seconds) | 1128 | 1133 |
| TPC-DS v2.2.0 Score | 26948 | 25861 |

## TPC-DS Results – Heap

| Metric | No Autovacuum | With Autovacuum |
| --- | ---: | ---: |
| Load (seconds) | 291 | 300 |
| Analyze (seconds) | 125 | 140 |
| 1 User Queries (seconds) | 848 | 893 |
| Concurrent Queries (seconds) | 2343 | 2438 |
| Throughput Test Time (seconds) | 1172 | 1224 |
| TPC-DS v2.2.0 Score | 25014 | 23989 |


## CPU Profiling Methodology
We used the perf tool to capture CPU load for 120 seconds during the 05-sql portion of TPC-DS.

This was started manually, and is somewhat subject to normal variance on the workstation. We
attempted to hold variables constant for AV vs. no-AV runs so that they would be comparable.

Commands:
```sh
sudo perf record -a -- sleep 120
sudo perf diff noAVperf.data withAVperf.data --dsos …/bin/postgres
```

## CPU Profiling Results
* Note: baseline is no AV, delta is with AV
### AOCO
```
  Baseline  Delta Abs  Symbol
     2.48%     +0.60%      [.] ExecScanHashBucket
     2.83%     -0.47%      [.] slot_deform_heap_tuple
     0.75%     -0.35%      [.] fill_val
     0.75%     -0.34%      [.] heap_compute_data_size
     2.84%     +0.31%      [.] aocs_getnext
     0.48%     -0.26%      [.] sub_abs
     2.20%     -0.26%      [.] AllocSetAlloc
     0.56%     -0.24%      [.] heap_fill_tuple
     8.96%     +0.23%      [.] pg_strtok
     0.31%     -0.23%      [.] div_var
     0.51%     -0.22%      [.] hash_bytes
```


### Heap
```
  Baseline  Delta Abs  Symbol
     6.86%     +4.91%      [.] slot_deform_heap_tuple
     4.57%     -2.11%      [.] ExecScanHashBucket
     5.85%     +1.12%      [.] ExecInterpExpr
     1.14%     -1.02%      [.] tuplehash_lookup_hash_internal
     0.44%     +0.98%      [.] heapgettup_pagemode
     1.17%     -0.74%      [.] tuplehash_insert_hash_internal
     0.26%     +0.72%      [.] heapgetpage
     0.68%     -0.68%      [.] pg_mblen
     0.73%     -0.66%      [.] BufFileRead
     2.99%     +0.64%      [.] pg_checksum_block
     0.88%     -0.61%      [.] hash_bytes
     1.09%     -0.59%      [.] tts_minimal_store_tuple
     0.58%     -0.58%      [.] pg_mbstrlen_with_len
```

## Memory Profiling Methodology
We used a combination of code review and instrumentation to capture this information.

Autovacuum uses its own memory context, "Autovacuum Launcher". Each table vacuumed clears and
re-uses a small child context, "Autovacuum Portal". The "Vacuum" context is a child of this portal
context and is used within vacuum() for the actual work. The portal context is kept small and is
not allowed to accumulate allocations across tables, so the allocation overhead of vacuuming stays
small: there is not much memory usage, and maintenance_work_mem is fully respected. Thus the design
of this section of the code results in a very low risk of OOM crashes.

We tracked the memory usage of both “Autovacuum Launcher” and “Vacuum” contexts throughout TPC-DS
and report the maximum memory allocated per AV worker at any given time.
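
For reference, a similar snapshot can be taken without custom instrumentation by asking an
autovacuum worker to dump its memory contexts from gdb (a sketch only, not the exact
instrumentation behind the numbers below; the process-title pattern may vary by platform):

```sh
# Find a running autovacuum worker and dump its memory context tree.
AV_PID=$(pgrep -f "postgres: autovacuum worker" | head -n1)
sudo gdb -p "$AV_PID" -batch -ex 'call MemoryContextStats(TopMemoryContext)'
# The "Grand total: ... bytes" lines land in that segment's log (the worker's stderr).
```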

## Memory Profiling Results
### AOCO
**AV worker: Grand total: 270336 bytes in 7 blocks; 140240 free (11 chunks); 130096 used**
So, a maximum of ~270 KB allocated per worker.

### Heap
**AV worker: Grand total: 40960 bytes in 4 blocks; 24168 free (11 chunks); 16792 used**
So, a maximum of ~41 KB allocated per worker.

## Disk Profiling Methodology
To measure disk read/write load, we began with bcc's filetop.py, heavily modified to capture and
aggregate file usage by PID until cancelled.

To identify AV worker PIDs, we instrumented StartAutoVacWorker to also write a pidfile to /tmp so
that short-lived PIDs could be differentiated.

In general, this section of results was the most challenging to capture, and has the most room for
improved methodology.

## Disk Profiling Results
### No Autovacuum AOCO
| PID | READS | WRITES | RKBYTES | WKBYTES | CMD |
| --- | ---: | ---: | ---: | ---: | --- |
| 541293 | 87724 | 69051 | 2741.38 | 2211.31 | postgres: other (top PID) |

Total postgres file usage: READS: 9,018,723, WRITES: 7,487,920, RKBYTES: 277,713.1, WKBYTES:  2695.06

### With Autovacuum AOCO
| PID | READS | WRITES | RKBYTES | WKBYTES | CMD |
| --- | ---: | ---: | ---: | ---: | --- |
| 463016 | 97977 | 41935 | 3061.78 | 1465.26 | postgres: other (top PID) |
| 481228 | 6349 | 5402 | 186.59 | 173.22 | postgres: autovacuum (top autovacuum PID) |

Total postgres file usage: READS: 9,423,600, WRITES: 7,374,779, RKBYTES: 288,774.42, WKBYTES:    189.1

### No Autovacuum Heap
| PID | READS | WRITES | RKBYTES | WKBYTES | CMD |
| --- | ---: | ---: | ---: | ---: | --- |
| 296239 | 262851 | 225224 | 8214.09 | 7034.01 | postgres: other (top PID) |

Total postgres file usage: READS: 30,950,945, WRITES: 9,929,437, RKBYTES: 966,533.21, WKBYTES:   4267.8

### With Autovacuum Heap
| PID | READS | WRITES | RKBYTES | WKBYTES | CMD |
| --- | ---: | ---: | ---: | ---: | --- |
| 217785 | 282199 | 197937 | 8818.72 | 6180.8 | postgres: other (top PID) |
| 235871 | 7507 | 6958 | 224.05 | 219.56 | postgres: autovacuum (top autovacuum PID) |

Total postgres file usage: READS: 32,273,330, WRITES: 9,954,598, RKBYTES: 1,005,771.19, WKBYTES:      0.0

## References
2. Citus presentation on tuning autovacuum:
3. Robert Haas on tuning maintenance_work_mem:
5. Azure Postgres autovacuum recommendations:



Brent Doil

May 17, 2023, 3:25:32 PM
to Andrew, gpdb...@greenplum.org
We followed up on the catalog autovacuum performance and used the following testing steps to further analyze performance on a larger scale.
  • Used TPC-DS with scale factor 10 as the workload.
  • Compiled the server with '-O3'.
  • gpdemo cluster with primaries and mirrors on separate NVMe disks:
    • fsync = on
    • gp_vmem_protect_limit = 32768
    • max_connections = 750
  • Scenario 1: Generate catalog bloat, then run TPC-DS.
  • Scenario 2: Begin with no bloat, generate bloat while running TPC-DS.
  • We wanted to generate a lot of bloat quickly:
    • Used dynamic SQL to build the DDL for a 1200-column (bool) heap table with 1000 partitions
      (sketched below).
    • Ran BEGIN..CREATE TABLE..ABORT via script; the command ran twice each across 6 databases.
    • The result was ~730 million dead catalog tuples in the cluster.
    • For scenario 2 I ran a single BEGIN..CREATE TABLE..ABORT transaction in a tight loop.
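
A sketch of that bloat generator (the exact script isn't reproduced here; the table/column names
and the int key used for distribution and partitioning are illustrative):

```sh
# Build "c1 bool, c2 bool, ..., c1200 bool" and create the table inside an
# aborted transaction, leaving only dead catalog tuples behind.
COLS=$(seq -f "c%g bool" 1 1200 | paste -sd, -)
psql -d bloatdb <<SQL
BEGIN;
CREATE TABLE bloat_wide (id int, $COLS)
DISTRIBUTED BY (id)
PARTITION BY RANGE (id) (START (0) END (1000) EVERY (1));
ABORT;
SQL
```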
Metrics used:
  • DB perf:
    • TPC-DS v2 score
    • gp_stat_progress_vacuum
  • Disk I/O
    • sudo iotop -b -o -a -d 60 -t
    • iostat -zxs 60
  • CPU
    • perf record -F 800 -g --call-graph dwarf -o /data/perf_test_default.out -a -- sleep 300
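
For example, per-segment vacuum progress can be polled during the run (assuming
gp_stat_progress_vacuum exposes gp_segment_id alongside the usual pg_stat_progress_vacuum columns):

```sh
psql -d postgres -c "SELECT gp_segment_id, datname, relid, phase,
                            heap_blks_scanned, heap_blks_total
                     FROM gp_stat_progress_vacuum;"
```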

Summary of findings:
  • If there is significant catalog bloat, concurrent AV during TPC-DS with default AV settings is a
    performance hit of 20-30%. NOTE: tests were run on a local demo cluster with primaries and
    mirrors on separate mounts.
  • Starting with no catalog bloat, generating consistent bloat during TPC-DS with default AV
    settings is a performance hit of ~15%.

  • AV generates a non-trivial amount of WAL. How quickly WAL is sync'd to the mirrors was the primary bottleneck.

  • VACUUM uses a 256KB ring buffer instead of shared_buffers. When a page in the ring buffer is
    dirtied, it's not evicted immediately. When the ring buffer fills with dirty pages, a WAL flush
    is forced. Due to the small buffer size, this causes a lot of WAL flushes. This issue has been
    addressed in a recent upstream discussion:
    https://www.postgresql.org/message-id/20230111182720....@awork3.anarazel.de Even so, increasing
    shared_buffers can improve performance by reducing cache misses when reading into the ring buffer.

  • The segment AV launchers act independently, and each uses its own calculation of
    autovacuum_vacuum_cost_limit. This needs to be taken into consideration if segments are sharing
    disks or NICs.

  • Catalog tables vary widely in size: pg_attribute/pg_type/pg_class/pg_stat_last_operation can be
    very large while others are small.

  • The default autovacuum_vacuum_threshold = 50 and autovacuum_vacuum_scale_factor = 0.2 mean a
    table is eligible for AV when its dead tuples exceed autovacuum_vacuum_threshold +
    autovacuum_vacuum_scale_factor * reltuples, i.e. 50 + 20% of the table's row count (for example,
    200,050 dead tuples for a 1,000,000-row table).

  • The recommendation is to reduce the scale factor so AV is triggered more often with less overall
    work to do on any given table:
    • autovacuum_vacuum_threshold = 500 - bump this up from 50 to 500 so we aren't constantly
      triggering AV on small catalog updates.
    • autovacuum_vacuum_scale_factor = 0.05 - reduce the scale factor to 5% of table size.
  • autovacuum_work_mem is used by AV to allocate memory for the TID array of dead tuples while
    scanning a relation. If the array fills up, AV performs vacuum/pruning/index cleanup on those
    TIDs and, when finished, resumes the scan. We can reduce the number of scan passes by setting
    this value higher. Note that by default it is -1 and falls back to maintenance_work_mem.
  • The recommendation is to increase it slightly so fewer passes are performed on larger catalog
    tables:
    • autovacuum_work_mem = 128MB

  • The default cost model on PG12 treats disk writes as 2x the cost of disk reads. In reality the skew is much further toward writes. Upstream has changed the default cost for vacuum_cost_page_miss from 10 to 2. Recommendation is to follow suit. Note that this will also affect normal vacuum.
    • vacuum_cost_page_miss = 2

  • To modify the aggressiveness of AV, autovacuum_vacuum_cost_limit is the primary tuning knob. When this limit is hit, the worker will sleep for autovacuum_vacuum_cost_delay. 

  • It's generally better to keep autovacuum_vacuum_cost_delay low to avoid large bursts of IO followed by troughs. Recommend tuning between 1ms-5ms.

  • autovacuum_vacuum_cost_limit can be increased to make AV more aggressive. Setting it to 1000
    sees a performance hit of 20% while increasing AV throughput by ~10-15%.

  • shared_buffers should be increased by default. Need to follow-up on tuning this.

  • Most AV settings can be changed with gpconfig and gpstop -u (see the sketch below). Currently,
    settings are re-read each time a new table is chosen. Cost limit, cost delay, and AV altogether
    can be modified this way to change AV priority. Upstream has recently introduced the ability to
    change these settings while working on a single table, which might be useful for customers in
    the future. https://github.com/postgres/postgres/commit/7d71d3dd080b9b147402db3365fe498f74704231
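
A minimal sketch of applying the recommendations above with gpconfig and then reloading (the
values are the ones suggested in this thread; adjust to taste):

```sh
gpconfig -c autovacuum_vacuum_threshold -v 500
gpconfig -c autovacuum_vacuum_scale_factor -v 0.05
gpconfig -c autovacuum_work_mem -v 131072          # kB units, i.e. 128MB
gpconfig -c vacuum_cost_page_miss -v 2
gpconfig -c autovacuum_vacuum_cost_delay -v 2      # ms units, within the suggested 1-5ms range
gpstop -u                                          # reload config without restarting the cluster
```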

  • Generally, CPU and memory constraints aren't a concern for AV workers. Memory is limited by
    autovacuum_work_mem/maintenance_work_mem. perf reporting shows CPU cycles for AV code paths as
    3-5% of the overall system in the worst case. IO is a more realistic concern: AV competes with
    WAL readers/writers, bgwriters, checkpointers, etc. for disk IO, and these can be a very
    significant portion of overall IO.

  • Disk IO for all these background processes should be constrained by resource groups. The best way to allocate IO for these tasks is by throttling latency first, not IOPS or throughput. Underallocating IO resources for AV/WAL workers can bring the system to a halt.

  • The biggest performance boost was increasing wait_for_replication_threshold from 1024 to 4096.
    At the default 1024, local disks were saturated at 3k IOPS and 250MB/s. Upping it to 4096
    increased these to 4-5k IOPS and 1.5GB/s. Shipping the WAL over the network needs to be
    throttled, so modifying this is not recommended in production clusters. An advanced tuning guide
    may want to touch on this.
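
A benchmark-only sketch of that change (not for production clusters, per the caveat above; the
value is in kB of WAL written before waiting for replication):

```sh
gpconfig -c wait_for_replication_threshold -v 4096
gpstop -u
```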

  • gpload is a big culprit for creating bloat. It creates and drops external tables as part of the workflow, bloating pg_foreign_table and pg_attribute. gpload also generates the bloat in a single transaction, resulting in a large burst of AV work when it commits/aborts. More work should be done to either improve gpload or push customers to alternative tools for loading data.

The next step is to perform a similar exercise focused on autoanalyze.

Regards,
Brent



From: Andrew <gpdb...@andrewrepp.com>
Sent: Wednesday, February 8, 2023 11:23 AM
To: gpdb...@greenplum.org <gpdb...@greenplum.org>
Subject: Catalog Autovacuum
 