Jira (PDB-4948) Improve report/resource_event GC coordination with in flight queries


Zachary Kent (Jira)

Oct 29, 2020, 5:31:04 PM
to puppe...@googlegroups.com
Zachary Kent created an issue
 
PuppetDB / Improvement PDB-4948
Improve report/resource_event GC coordination with in flight queries
Issue Type: Improvement
Assignee: Unassigned
Created: 2020/10/29 2:30 PM
Priority: Normal
Reporter: Zachary Kent

In a hotfix targeted at 2019.8 we added an interrupter thread (in this PR) to help coordinate report/resource_event GC with sync queries. While we hope this change allows PDB sync to avoid full deadlocks with GC as seen in PE-30087, it's still possible that GC could conflict with other long-running queries outside of sync. It's also possible that GC could get unlucky and need multiple tries to delete a partition, which could cause multiple errors in the logs while sync gets cancelled repeatedly.

A more complete solution would be to allow GC to "bulldoze" other in flight queries which are blocking the AccessExclusiveLock it needs to drop an old partition. This could be accomplished by using pg_cancel_backend(<pid>) in coordination with querying pg_locks to see which queries are blocking the lock GC needs. Doing something along these lines would protect against all queries and not just those being performed by the local PDB during sync.
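A rough sketch of what this could look like in SQL, assuming Postgres 9.6+ (for pg_blocking_pids); the GC backend pid and the table name are illustrative placeholders, not PDB's actual implementation:

```sql
-- Cancel whatever is currently blocking the GC backend.
-- <gc_pid> is the pid of the backend waiting on the AccessExclusiveLock.
SELECT pg_cancel_backend(blocker)
FROM unnest(pg_blocking_pids(<gc_pid>)) AS blocker;

-- Alternatively, go through pg_locks directly: cancel other backends
-- holding granted locks on the relation GC needs to lock.
SELECT pg_cancel_backend(l.pid)
FROM pg_locks l
WHERE l.relation = 'reports'::regclass  -- parent table, as an example
  AND l.granted
  AND l.pid <> pg_backend_pid();
```

pg_cancel_backend cancels only the current query on that backend (unlike pg_terminate_backend, which kills the connection), so well-behaved clients can retry.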

If we do this we'll want to audit the error-handling/retry behavior of all queries we can think of in PDB and PE, because this could cause GC to kill any in-flight query.

 
This message was sent by Atlassian Jira (v8.5.2#805002-sha1:a66f935)

Zachary Kent (Jira)

Dec 18, 2020, 5:46:02 PM
to puppe...@googlegroups.com
Zachary Kent commented on Improvement PDB-4948
 
Re: Improve report/resource_event GC coordination with in flight queries

The work proposed in the PR related to this ticket implements a query killer which repeatedly cancels any in-flight queries while GC explicitly locks the tables it needs in order to drop a report/resource_event partition. The work there is close to done but still needs review, and there are a couple of details which should be addressed before it's merged (see the TODO comments in the PR).

I still have some concerns related to the query killing approach implemented in the PR which I'll attempt to explain below.

The main issue with the PDB GC process on partitioned tables is that the DROP TABLE command in Postgres requires an AccessExclusiveLock, which must wait on any in-flight queries and will block any subsequent queries until the lock is granted. If there are long-running queries in front of the lock request, this can hurt PDB response times and, in the worst case, cause outages.

In recent releases we've added timeouts to the PDB GC process so it won't wait forever on a lock, and we've attempted to limit/cancel our internal sync queries when GC and sync overlap. The sync summary query has been the worst offender we know of with respect to long-running queries. There is ongoing work to make the sync summary query transaction even shorter, which should further limit the impact it can have on GC. See PDB-2420 (it's not described well in the ticket, but the work there should greatly reduce the amount of time a sync summary query is held open). When the improvements to the sync summary query are released, there should be fewer situations where queries can cause GC to wait for more than 5 minutes (the current default timeout) on the lock it needs.
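The timeout described here corresponds to Postgres's lock_timeout setting. A minimal sketch of how GC can bound its wait (the partition name is hypothetical, and the exact mechanism in PDB is assumed):

```sql
BEGIN;
SET LOCAL lock_timeout = '5min';          -- give up rather than queue forever
DROP TABLE IF EXISTS reports_20201029z;   -- hypothetical partition name
COMMIT;
-- If the lock isn't granted within 5 minutes, the statement fails with
-- "canceling statement due to lock timeout", the transaction rolls back,
-- and GC can retry on the next cycle.
```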

Due to the way the reports table and its partitions are structured, we need to get an AccessExclusiveLock on reports, certnames, environments, producers, and reports_statuses during GC. When you explicitly lock these tables in Postgres, the locking is done one by one even if you issue the LOCK statement for multiple tables at once. See the name parameter section of the Postgres LOCK docs.
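The explicit locking step might look like the following sketch; even as a single statement, Postgres acquires the locks one at a time in the order listed (the partition name is hypothetical):

```sql
BEGIN;
-- Locks are acquired one by one, left to right, so another query can still
-- grab a lock on a later table while we already hold the earlier ones.
LOCK TABLE reports, certnames, environments, producers, reports_statuses
  IN ACCESS EXCLUSIVE MODE;
DROP TABLE reports_20201029z;  -- hypothetical partition name
COMMIT;
```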

Because each table is locked individually, the query killer needs to account for any queries which could come in between locking the individual tables. This means that in-flight queries will need to be cancelled multiple times while we individually lock all the tables required for GC. We're unsure exactly how this will impact the clients talking to PDB. There are quite a few race conditions we'll need to consider with this approach, and we'll need to audit all the clients talking to PDB in PE to make sure they have the retry logic to handle this. The fact that we need to issue the kill command multiple times and are unable to get the needed locks in a single command makes this approach less predictable and more risky.

With the ongoing work to limit query times, we could consider leaving the current timeout approach in place and setting this work aside in case we hit an issue that requires a hotfix related to the GC timeouts in the future. We haven't yet heard reports of GC starvation from customers running the GC timeout approach released in PE 2019.8.2. It's possible that if we improve PDB query perf we may never need to take on this added complexity/risk.

cc/ Margaret Lee ^^ is a rough explanation of why I think it might be wise to hold off on implementing this until we know for sure that it's needed.

Zachary Kent (Jira)

Dec 19, 2020, 12:15:04 PM
to puppe...@googlegroups.com
Zachary Kent commented on Improvement PDB-4948

I had a thought last night related to the issues described above. We may be able to get around the need to rely on explicitly locking the tables before issuing the drop command. The multiple explicit lock commands caused the issue of having multiple windows where other queries could come in and would need to be canceled repeatedly.

We could possibly issue the drop command directly, without any explicit locking in the transaction, and kill queries only if the GC pid ever gets blocked. It's similar to the idea above but could greatly reduce the length of the window(s) where other queries could come in and need to be killed again. If we do this we'll want to look into how Postgres internally locks the objects it needs during a drop call, and whether it locks them one at a time. If it does still lock things one at a time, there will be windows where other queries could slot in between the locking, but they will be much shorter than with issuing multiple lock commands to the backend.
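A sketch of this second option, with the drop issued directly and a separate watcher connection cancelling blockers; the pid and partition name are purely illustrative:

```sql
-- GC connection: no explicit LOCK statements, just the DROP itself.
BEGIN;
DROP TABLE reports_20201029z;  -- hypothetical partition name
COMMIT;

-- Watcher connection: poll while the DROP is pending, cancelling anything
-- that blocks the GC backend (pid 12345 here, purely illustrative).
SELECT pg_cancel_backend(blocker)
FROM unnest(pg_blocking_pids(12345)) AS blocker;
```

Because the DROP acquires its own locks internally, the windows between individual lock acquisitions (if any) are much shorter than with explicit LOCK statements issued one at a time from the client.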

We'll still want to audit PE to make sure any clients can handle their queries getting canceled properly, but this approach may be more acceptable and less risky than the one outlined above. 

 

Margaret Lee (Jira)

Jan 4, 2021, 11:38:02 AM
to puppe...@googlegroups.com
Margaret Lee commented on Improvement PDB-4948

Zachary Kent thanks for the detailed explanation. I agree that the second idea you had sounds less risky. Do you think the pros outweigh the risks for the second option? If we're making a lot of improvements that may help, and this might not be an issue, is it worth keeping the work around in case we need it, as you mentioned, and seeing whether the other improvements help our customers first?

Zachary Kent (Jira)

Jan 4, 2021, 5:08:04 PM
to puppe...@googlegroups.com
Zachary Kent commented on Improvement PDB-4948

Margaret Lee I'm still not sure whether it's worth the added risk even with the second option. I went ahead and reworked the PR I have up with a commit that switches to the second option so we'll have it if needed. If we decide we do need this, the remaining work will be to review the PR and add a feature flag so we can shut it off in case it causes trouble.

I'd still like to see if the upcoming changes to sync summary queries make people hitting the lock_timeout even less common.

Eric Thompson (Jira)

Jan 27, 2021, 2:18:05 PM
to puppe...@googlegroups.com
Eric Thompson updated an issue
Change By: Eric Thompson
Sprint: HA 2021-01-27, HA 2021-02-10

Austin Blatt (Jira)

Feb 5, 2021, 12:18:04 PM
to puppe...@googlegroups.com
Austin Blatt updated an issue
Change By: Austin Blatt
Fix Version/s: PDB 7.1.0
Fix Version/s: PDB 6.14.0

Zachary Kent (Jira)

Feb 5, 2021, 12:49:03 PM
to puppe...@googlegroups.com
Zachary Kent updated an issue
Change By: Zachary Kent
Release Notes: Enhancement
Release Notes Summary: Added a query-bulldozer which is spawned during periodic GC when PuppetDB attempts to drop partitioned tables. The bulldozer will cancel any queries blocking the GC process from getting the AccessExclusiveLocks it needs in order to drop a partition. See the https://puppet.com/docs/puppetdb/latest/configure.html#experimental-environment-variables section of the docs for information on the PDB_GC_QUERY_BULLDOZER_TIMEOUT_MS setting, which allows users to disable the query-bulldozer if needed.