The work proposed in the PR related to this ticket implements a query killer which will repeatedly cancel any in-flight queries while GC explicitly locks the tables it needs in order to drop a report/resource_event partition. The work there is close to done, but it still needs review and there are a couple of details which should be addressed before it's merged (see the TODO comments in the PR). I still have some concerns about the query-killing approach implemented in the PR, which I'll attempt to explain below.

The main issue with the PDB GC process on partitioned tables is that the DROP TABLE command in Postgres requires an AccessExclusiveLock, which must wait on any in-flight queries and will block any subsequent queries until the lock is granted. If there are long-running queries in front of the lock request, this can hurt PDB response times and, in the worst case, cause outages. In recent releases we've added timeouts to the PDB GC process so it won't wait forever on a lock, and we've attempted to limit/cancel our internal sync queries when GC and sync overlap.

The sync summary query has been the worst offender we know of with respect to long-running queries. There is ongoing work to make the sync summary query's transaction even shorter, which should further limit the impact it can have on GC. See PDB-2420 (it's not described well in the ticket, but the work there should greatly reduce the amount of time a sync summary query is held open). When the improvements to the sync summary query are released, there should be fewer situations where queries can cause GC to wait for more than 5 minutes (the current default timeout) on the lock it needs.

Due to the way the reports table and its partitions are structured, we need to get an AccessExclusiveLock on reports, certnames, environments, producers, and reports_statuses during GC. When you explicitly lock these tables in Postgres, the locking is done one-by-one even if you issue the lock statement for multiple tables at once.
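To make the current timeout approach concrete, here's a rough sketch of the statements a GC pass would issue. This is illustrative only, not PDB's actual code: the function name, the partition name, and the exact statement text are my assumptions; the table list and 5-minute default come from the discussion above.

```python
# Hypothetical sketch (not PDB's real implementation): the statements one GC
# pass would run inside a transaction to drop a partition under a lock timeout.

# Tables GC needs an AccessExclusiveLock on (from the discussion above).
GC_TABLES = ["reports", "certnames", "environments", "producers", "reports_statuses"]

def gc_drop_statements(partition, lock_timeout="5min"):
    """Build the SQL for one GC pass. Note that even though a single
    LOCK TABLE statement can name several tables, Postgres still acquires
    the locks one at a time, in the order the tables are listed."""
    return [
        # SET LOCAL scopes the timeout to this transaction only.
        f"SET LOCAL lock_timeout = '{lock_timeout}'",
        "LOCK TABLE " + ", ".join(GC_TABLES) + " IN ACCESS EXCLUSIVE MODE",
        f"DROP TABLE {partition}",
    ]
```

With `lock_timeout` set, the LOCK statement errors out instead of queuing forever behind a long-running query, which is why GC currently gives up after 5 minutes rather than starving incoming queries.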
See the parameters section of the PG LOCK docs. Because each table is locked individually, the query killer needs to account for any queries which could come in between locking the individual tables. This means that in-flight queries will need to be cancelled multiple times while we individually lock all the tables required for GC. We're unsure exactly how this will impact the clients talking to PDB. There are quite a few race conditions we'll need to consider with this approach, and we'll need to audit all the clients talking to PDB in PE to make sure they have the retry logic to handle this. The fact that we need to issue the kill-queries command multiple times, and are unable to get the needed locks in a single command, makes this approach less predictable and more risky.

With the ongoing work to limit query times, we could consider leaving the current timeout approach in place and shelving this work in case we hit an issue in the future that requires a hotfix related to the GC timeouts. We haven't yet heard reports of GC starvation from customers running the GC timeout approach released in PE 2019.8.2. It's possible that if we improve PDB query perf, we may never need to take on this added complexity/risk.

cc/ Margaret Lee ^^ is a rough explanation of why I think it might be wise to hold off on implementing this until we know for sure that it's needed.
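For reference, a minimal sketch of the repeated-cancel pattern the PR implements. The function name and the `execute` callback are assumptions of mine (not the PR's actual code); the cancel query uses the standard `pg_cancel_backend`/`pg_stat_activity` catalog functions. It also shows why the race conditions exist: each cancel/lock pair leaves a window where a new client query can slip in.

```python
# Hypothetical sketch of the query-killer approach (not the PR's real code).
# Because the tables are locked one at a time, the cancel step must repeat:
# a query admitted after table N is locked can still block the lock on N+1.

# Cancel every other active query on this database.
CANCEL_SQL = (
    "SELECT pg_cancel_backend(pid) FROM pg_stat_activity "
    "WHERE state = 'active' AND pid <> pg_backend_pid()"
)

def lock_tables_killing_queries(execute, tables):
    """Acquire an AccessExclusiveLock on each table in turn, cancelling
    in-flight queries before each acquisition. `execute` runs one SQL
    statement (e.g. a thin wrapper around cursor.execute)."""
    for table in tables:
        execute(CANCEL_SQL)  # kill anything already running
        execute(f"LOCK TABLE {table} IN ACCESS EXCLUSIVE MODE")
        # Race window: a new query can start between the cancel and the
        # lock above, or between this lock and the next table's cancel pass,
        # so clients may see their queries cancelled more than once.
```

This is what makes the approach hard to reason about: with five tables, clients can be cancelled up to five separate times during a single GC pass, and each of them needs retry logic to cope.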