Jira (PDB-5548) Partition GC's query bulldozer not guaranteed to make progress

Austin Blatt (Jira)

unread,

Oct 5, 2022, 3:25:02 PM10/5/22

to puppe...@googlegroups.com

Austin Blatt created an issue

PuppetDB /

PDB-5548

Partition GC's query bulldozer not guaranteed to make progress

Issue Type:	Bug
Assignee:	Unassigned
Created:	2022/10/05 12:24 PM
Priority:	Normal
Reporter:	Austin Blatt

The query bulldozer's purpose is to clear out transactions that are holding onto locks that the GC operation needs, but in its current state it is not guaranteed to make progress.

Firstly, if a transaction is idle pg_cancel_backend doesn't result in aborting the transaction, meaning none of the locks are freed.

code
=> BEGIN TRANSACTION;
BEGIN
=*> select * from reports;

pg_cancel_backend called here from another terminal

=*> select * from certnames;
=*> COMMIT;
COMMIT
code

using pg_terminate_backend solves that by immediately causing a ROLLBACK and terminating the connection

=> BEGIN TRANSACTION;

BEGIN

 I puppetdb=*> select * from reports;

terminate called here -- pg_locks reports locks are immediately freed

=*> select * from certnames;

FATAL:  terminating connection due to administrator command

server closed the connection unexpectedly

        This probably means the server terminated abnormally

        before or while processing the request.

The connection to the server was lost. Attempting reset: Succeeded.

=> COMMIT;

WARNING:  there is no transaction in progress

COMMIT

Secondly, the bulldozer has a thread sleep of 1 second, and the Postgres deadlock detector also waits for 1 second, so we only get one chance to free a deadlocked drop table command before it is terminated by postgres. I lazily acquires multiple locks in the GC routine, so one attempt may not be enough. We should sleep for a shorter interval to give ourselves multiple chances to "free" the deadlocked GC query.

In the latest version of PE we could also consider increasing the deadlock detectors timeout.

Add Comment

This message was sent by Atlassian Jira (v8.20.11#820011-sha1:0629dd8)

Austin Blatt (Jira)

unread,

Oct 5, 2022, 3:26:03 PM10/5/22

to puppe...@googlegroups.com

Austin Blatt updated an issue

PuppetDB /

PDB-5548

Partition GC's query bulldozer not guaranteed to make progress

Change By:	Austin Blatt

The query bulldozer's purpose is to clear out transactions that are holding onto locks that the GC operation needs, but in its current state it is not guaranteed to make progress.

Firstly, if a transaction is idle {{pg_cancel_backend}} doesn't result in aborting the transaction, meaning none of the locks are freed.

{ { code} }
=> BEGIN TRANSACTION;
BEGIN
=*> select * from reports;

pg_cancel_backend called here from another terminal

=*> select * from certnames;
=*> COMMIT;
COMMIT
{ { code} }

using {{pg_terminate_backend}} solves that by immediately causing a ROLLBACK and terminating the connection

{code}
=> BEGIN TRANSACTION;
BEGIN

I puppetdb=*> select * from reports;

terminate called here -- pg_locks reports locks are immediately freed

=*> select * from certnames;
FATAL:  terminating connection due to administrator command
server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
The connection to the server was lost. Attempting reset: Succeeded.
=> COMMIT;
WARNING:  there is no transaction in progress
COMMIT

{code}

Secondly, the bulldozer has a thread sleep of 1 second, and the Postgres deadlock detector also waits for 1 second, so we only get one chance to free a deadlocked {{drop table}} command before it is terminated by postgres. I lazily acquires multiple locks in the GC routine, so one attempt may not be enough. We should sleep for a shorter interval to give ourselves multiple chances to "free" the deadlocked GC query.

In the latest version of PE we could also consider increasing the deadlock detectors timeout.

Add Comment

Rob Browning (Jira)

unread,

Oct 20, 2022, 11:55:01 AM10/20/22

to puppe...@googlegroups.com

Rob Browning updated an issue

PuppetDB /

PDB-5548

Partition GC's query bulldozer not guaranteed to make progress

Change By:	Rob Browning
Team:	Skeletor

Add Comment

Cas Donoghue (Jira)

unread,

Oct 26, 2022, 1:20:01 PM10/26/22

to puppe...@googlegroups.com

Cas Donoghue updated an issue

PuppetDB /

PDB-5548

Partition GC's query bulldozer not guaranteed to make progress

Change By:	Cas Donoghue
Sprint:	Skeletor 11/09/2022

Add Comment

Rob Browning (Jira)

unread,

Oct 28, 2022, 2:10:04 AM10/28/22

to puppe...@googlegroups.com

Rob Browning assigned an issue to Rob Browning

PuppetDB /

PDB-5548

Partition GC's query bulldozer not guaranteed to make progress

Change By:	Rob Browning
Assignee:	Rob Browning

Add Comment

Austin Blatt (Jira)

unread,

Nov 3, 2022, 1:55:03 PM11/3/22

to puppe...@googlegroups.com

Austin Blatt updated an issue

PuppetDB /

PDB-5548

Partition GC's query bulldozer not guaranteed to make progress

Change By:	Austin Blatt
Epic Link:	PDB-5552

Add Comment

Rob Browning (Jira)

unread,

Nov 7, 2022, 11:08:03 AM11/7/22

to puppe...@googlegroups.com

Rob Browning updated an issue

PuppetDB /

PDB-5548

Partition GC's query bulldozer not guaranteed to make progress

Change By:	Rob Browning
Fix Version/s:	PDB 6.22.2
Fix Version/s:	PDB 7.12.0

Add Comment

Rob Browning (Jira)

unread,

Nov 7, 2022, 11:23:02 AM11/7/22

to puppe...@googlegroups.com

Rob Browning updated an issue

PuppetDB /

PDB-5548

Partition GC's query bulldozer not guaranteed to make progress

Change By:	Rob Browning
Release Notes:	Bug Fix
Release Notes Summary:	PuppetDB should drop old partitions (e.g. reports) more effectively. Because dropping the partitions currently requires an exclusive lock that can block all all other queries, PuppetDB tries to eliminate any queries that are blocking a drop attempt, but the method previously used to stop queries was insufficient. In addition, as constructed right now, competing queries can deadlock with the drop attempt. If a deadlock doesn't resolve within the configured `deadlock_timeout`, PostgreSQL's deadlock detector will break the deadlock, and to do so in some cases it has chosen to kill the partition drop attempt every time. To fix this problem, PuppetDB now terminates any queries competing with a drop attempt via `pg_terminate_backend` instead of `pg_cancel_backend`. This should be very reliable (cancel was more advisory), and PuppetDB also makes sure that it has the connection it needs for the terminations before it makes a drop attempt. Finally, it will try to issue those terminations more frequently than it did before (every fifth of a second instead of every second).

Add Comment

Rob Browning (Jira)

unread,

Nov 7, 2022, 11:42:03 AM11/7/22

to puppe...@googlegroups.com

Rob Browning updated an issue

PuppetDB /

PDB-5548

Partition GC's query bulldozer not guaranteed to make progress

Change By:	Rob Browning
Release Notes Summary:

PuppetDB should drop old partitions (e.g. reports) more effectively.

Because dropping the partitions currently requires an exclusive lock that can block all all other queries, PuppetDB tries to eliminate any queries that are blocking a drop attempt, but the method previously used to stop queries was insufficient. In addition, as constructed right now, competing queries can deadlock with the drop attempt. If a deadlock doesn't resolve within the configured `deadlock_timeout`, PostgreSQL's deadlock detector will break the deadlock, and to do so in some cases it has chosen to kill the partition drop attempt every time.

To fix this problem, PuppetDB now terminates any queries competing with a drop attempt via `pg_terminate_backend` instead of `pg_cancel_backend`. This should be very reliable (cancel was more advisory), and PuppetDB also makes sure that it has the connection it needs for the terminations before it makes making a drop attempt. Finally, it will try to issue those terminations termination requests more frequently than it did before (every fifth of a second instead of every second).

Add Comment

Reply all

Reply to author

Forward