Jira (PDB-4941) Evaluate current table-specific parameters (autovacuum, etc.)

15 views
Skip to first unread message

Rob Browning (Jira)

unread,
Oct 22, 2020, 3:00:03 PM10/22/20
to puppe...@googlegroups.com
Rob Browning created an issue
 
PuppetDB / Task PDB-4941
Evaluate current table-specific parameters (autovacuum, etc.)
Issue Type: Task Task
Assignee: Unassigned
Created: 2020/10/22 11:59 AM
Priority: Normal Normal
Reporter: Rob Browning

In light of the recent considerations around autovacuuming, double-check the per-table settings pdb currently establishes in migrations.

 

Add Comment Add Comment
 
This message was sent by Atlassian Jira (v8.5.2#805002-sha1:a66f935)
Atlassian logo

Rob Browning (Jira)

unread,
Oct 22, 2020, 3:25:03 PM10/22/20
to puppe...@googlegroups.com

Austin Blatt (Jira)

unread,
Nov 4, 2020, 12:46:03 PM11/4/20
to puppe...@googlegroups.com

Zachary Kent (Jira)

unread,
Nov 18, 2020, 2:53:03 PM11/18/20
to puppe...@googlegroups.com

Zachary Kent (Jira)

unread,
Jan 12, 2021, 5:18:05 PM1/12/21
to puppe...@googlegroups.com

Zachary Kent (Jira)

unread,
Jan 13, 2021, 2:12:03 PM1/13/21
to puppe...@googlegroups.com

Zachary Kent (Jira)

unread,
Jan 20, 2021, 5:27:02 PM1/20/21
to puppe...@googlegroups.com
Zachary Kent commented on Task PDB-4941
 
Re: Evaluate current table-specific parameters (autovacuum, etc.)

We currently set a table specific autovacuum_vacuum_scale_factor setting on the factset, catalog, and certnames tables. It is set to 0.75 on the catalog and certnames tables and 0.80 on the factset table. This setting controls the threshold of dead rows that must be hit before an autovacuum of the table will be scheduled. See Thresholds and Scale Factors section of the linked article for background info on this setting. This setting is toggled in two places depending on whether it is a FOSS or PE install. In FOSS the table settings are added in a migration here. In PE the settings are duplicated in the pe_databases::postgresql_settings::table_settings class.
 
I believe setting the scale factor to 0.80 means that in a steady state assuming a 30 min agent run interval the table will get vacuumed every 24 mins and will bloat up to 44% dead space before being vacuumed. It was found, sometime in the past, that spacing out autovacuums in this way helped reduce CPU use for customers. See comments in this PR-2393 for related details. Setting the scale_factor this high seems to go against what's generally recommended in the PostgreSQL docs but given what was observed in the field it may be the right choice in our case. Another observation is that our update pattern being throttled by the agent run interval means there should usually be only ~ 1 dead row per certname in the tables mentioned above before they are vacuumed.

We recently had issues with some customers related to table bloat and autovacuum not keeping up which lead to the creation of this ticket. See comments in PE-27630 for more detailed background on what was done to resolve these issues. To summarize the solution in that case, the autovacuum_vacuum_cost_limit was increased to allow the autovacuum workers to do more work before they were forced to sleep for autovacuum_vacuum_cost_delay. See this ZD-35697 ticket for background on the problem at a specific customer site. Note the excellent write up from Dylan Ratcliffe copied into the ZD ticket linked above about some of the debugging that was done to figure out this issue.

One outstanding question that came from some of the recent changes mentioned above was about the impact of setting autovacuum_analyze_scale_factor to 0.04. This change was made when the autovacuum_vacuum_scale_factor was originally decreased from the default of 0.20 to 0.08. Later when the autovacuum_vacuum_scale_factor was increased to 0.80 the autovacuum_analyze_scale_factor remained set at 0.04. See here for where the pe_databases module sets this value. Some of the concerns voiced by Nick W. about this change can be found here in the slack history. It seems possible that we may want to increase the value of this setting back to the default of 0.10. In the steady state with a 30 min agent run interval we will be analyzing tables every ~72 seconds. There are some potential issues with this analyze process keeping a transaction held open and preventing clean up of some rows (see slack message linked above). We similarly decreased the autovacuum_analyze_scale_factor to 0.01 on the catalog_resources table recently see PDB-4640 for details.

Regardless of any potential changes to the vacuum/analyze_scale_factors mentioned above, the changes we're making to no longer hold sync summary query transactions open should help to reduce the amount of dead rows which are ineligible for clean up during an autovacuum due to open transactions. Once the change to the sync summary queries is merged we should reach out to Support and see if they know if any customers would be willing to gather some before/after metrics related to the autovacuum process and table bloat. If we see improvement to the amount of bloat being created we could then see if customers would be willing to tweak the vacuum/analyze_scale_factor settings to see if we're able to reduce bloat further and decrease the frequency of the analyze runs without hurting query perf. We don't currently have an easy way to test these changes at any reasonable scale/load in house, so getting customers on board to help test this may be our best option.

Zachary Kent (Jira)

unread,
Jan 20, 2021, 5:29:03 PM1/20/21
to puppe...@googlegroups.com
Reply all
Reply to author
Forward
0 new messages