Jira (PDB-5548) Partition GC's query bulldozer not guaranteed to make progress

17 views
Skip to first unread message

Austin Blatt (Jira)

unread,
Oct 5, 2022, 3:25:02 PM10/5/22
to puppe...@googlegroups.com
Austin Blatt created an issue
 
PuppetDB / Bug PDB-5548
Partition GC's query bulldozer not guaranteed to make progress
Issue Type: Bug Bug
Assignee: Unassigned
Created: 2022/10/05 12:24 PM
Priority: Normal Normal
Reporter: Austin Blatt

The query bulldozer's purpose is to clear out transactions that are holding onto locks that the GC operation needs, but in its current state it is not guaranteed to make progress.

Firstly, if a transaction is idle pg_cancel_backend doesn't result in aborting the transaction, meaning none of the locks are freed.

code
=> BEGIN TRANSACTION;
BEGIN
=*> select * from reports;

pg_cancel_backend called here from another terminal

=*> select * from certnames;
=*> COMMIT;
COMMIT
code

using pg_terminate_backend solves that by immediately causing a ROLLBACK and terminating the connection

=> BEGIN TRANSACTION;
BEGIN
 I puppetdb=*> select * from reports;
 
terminate called here -- pg_locks reports locks are immediately freed
 
=*> select * from certnames;
FATAL:  terminating connection due to administrator command
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
=> COMMIT;
WARNING:  there is no transaction in progress
COMMIT

Secondly, the bulldozer has a thread sleep of 1 second, and the Postgres deadlock detector also waits for 1 second, so we only get one chance to free a deadlocked drop table command before it is terminated by postgres. I lazily acquires multiple locks in the GC routine, so one attempt may not be enough. We should sleep for a shorter interval to give ourselves multiple chances to "free" the deadlocked GC query.

In the latest version of PE we could also consider increasing the deadlock detectors timeout.

Add Comment Add Comment
 
This message was sent by Atlassian Jira (v8.20.11#820011-sha1:0629dd8)
Atlassian logo

Austin Blatt (Jira)

unread,
Oct 5, 2022, 3:26:03 PM10/5/22
to puppe...@googlegroups.com
Austin Blatt updated an issue
Change By: Austin Blatt
The query bulldozer's purpose is to clear out transactions that are holding onto locks that the GC operation needs, but in its current state it is not guaranteed to make progress.

Firstly, if a transaction is idle {{pg_cancel_backend}} doesn't result in aborting the transaction, meaning none of the locks are freed.

{
{ code} }
=> BEGIN TRANSACTION;
BEGIN
=*> select * from reports;

pg_cancel_backend called here from another terminal

=*> select * from certnames;
=*> COMMIT;
COMMIT
{
{ code} }

using {{pg_terminate_backend}} solves that by immediately causing a ROLLBACK and terminating the connection
{code}
=> BEGIN TRANSACTION;
BEGIN

I puppetdb=*> select * from reports;

terminate called here -- pg_locks reports locks are immediately freed

=*> select * from certnames;
FATAL:  terminating connection due to administrator command
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
=> COMMIT;
WARNING:  there is no transaction in progress
COMMIT
{code}


Secondly, the bulldozer has a thread sleep of 1 second, and the Postgres deadlock detector also waits for 1 second, so we only get one chance to free a deadlocked {{drop table}} command before it is terminated by postgres. I lazily acquires multiple locks in the GC routine, so one attempt may not be enough. We should sleep for a shorter interval to give ourselves multiple chances to "free" the deadlocked GC query.

In the latest version of PE we could also consider increasing the deadlock detectors timeout.

Rob Browning (Jira)

unread,
Oct 20, 2022, 11:55:01 AM10/20/22
to puppe...@googlegroups.com

Cas Donoghue (Jira)

unread,
Oct 26, 2022, 1:20:01 PM10/26/22
to puppe...@googlegroups.com

Rob Browning (Jira)

unread,
Oct 28, 2022, 2:10:04 AM10/28/22
to puppe...@googlegroups.com

Austin Blatt (Jira)

unread,
Nov 3, 2022, 1:55:03 PM11/3/22
to puppe...@googlegroups.com

Rob Browning (Jira)

unread,
Nov 7, 2022, 11:08:03 AM11/7/22
to puppe...@googlegroups.com
Rob Browning updated an issue
Change By: Rob Browning
Fix Version/s: PDB 6.22.2
Fix Version/s: PDB 7.12.0

Rob Browning (Jira)

unread,
Nov 7, 2022, 11:23:02 AM11/7/22
to puppe...@googlegroups.com
Rob Browning updated an issue
Change By: Rob Browning
Release Notes: Bug Fix
Release Notes Summary: PuppetDB should drop old partitions (e.g. reports) more effectively.

Because dropping the partitions currently requires an exclusive lock that can block all all other queries, PuppetDB tries to eliminate any queries that are blocking a drop attempt, but the method previously used to stop queries was insufficient.  In addition, as constructed right now, competing queries can deadlock with the drop attempt.  If a deadlock doesn't resolve within the configured `deadlock_timeout`, PostgreSQL's deadlock detector will break the deadlock, and to do so in some cases it has chosen to kill the partition drop attempt every time.

To fix this problem, PuppetDB now terminates any queries competing with a drop attempt via `pg_terminate_backend` instead of `pg_cancel_backend`.  This should be very reliable (cancel was more advisory), and PuppetDB also makes sure that it has the connection it needs for the terminations before it makes a drop attempt.  Finally, it will try to issue those terminations more frequently than it did before (every fifth of a second instead of every second).

Rob Browning (Jira)

unread,
Nov 7, 2022, 11:42:03 AM11/7/22
to puppe...@googlegroups.com
Rob Browning updated an issue
Change By: Rob Browning
Release Notes Summary:
PuppetDB should drop old partitions (e.g. reports) more effectively.

Because dropping the partitions currently requires an exclusive lock that can block all all other queries, PuppetDB tries to eliminate any queries that are blocking a drop attempt, but the method previously used to stop queries was insufficient.  In addition, as constructed right now, competing queries can deadlock with the drop attempt.  If a deadlock doesn't resolve within the configured `deadlock_timeout`, PostgreSQL's deadlock detector will break the deadlock, and to do so in some cases it has chosen to kill the partition drop attempt every time.

To fix this problem, PuppetDB now terminates any queries competing with a drop attempt via `pg_terminate_backend` instead of `pg_cancel_backend`.  This should be very reliable (cancel was more advisory), and PuppetDB also makes sure that it has the connection it needs for the terminations before it makes making a drop attempt.  Finally, it will try to issue those terminations termination requests more frequently than it did before (every fifth of a second instead of every second).
Reply all
Reply to author
Forward
0 new messages