A customer recently hit an issue where PDB sync appeared to get stuck in a way similar to what's described in this ticket. In the new case the problem presented differently, but we believe the difference is due to some of the mitigations that were recently added to address long-running PDB sync queries. In the more recent case, PDB was reloaded after the service was seen to be intermittently unreachable. After the reload, we noticed that the start-up sync completed successfully, but every subsequent periodic sync reported that it was unable to run with the following log message:
[sync] Refusing to sync from ... Sync already in progress
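The guard that produces this message can be sketched as a compare-and-set flag. The following is a hypothetical Java analogue (PDB's real implementation is a Clojure atom; the class name, method, and message text here are ours for illustration): if the flag is never reset because the sync body hangs, every later sync attempt is refused.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative analogue of the "currently syncing" guard, not PDB's actual code.
class SyncGuard {
    private final AtomicBoolean currentlySyncing = new AtomicBoolean(false);

    /** Returns true if the sync body ran, false if another sync held the flag. */
    boolean runSync(Runnable syncBody) {
        if (!currentlySyncing.compareAndSet(false, true)) {
            System.out.println("[sync] Refusing to sync: sync already in progress");
            return false;
        }
        try {
            syncBody.run();
        } finally {
            // If syncBody blocks forever, this reset never runs, and every
            // subsequent periodic sync is refused -- the failure mode seen here.
            currentlySyncing.set(false);
        }
        return true;
    }
}
```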
This was caused by the global currently-syncing atom being set to true and never reset because a periodic sync got stuck. We were able to get a thread dump from PDB while it was in this state, which confirmed some of our suspicions.

It appears that the shutdown of the at-at thread pool we use to schedule tasks was still waiting to gracefully shut down a periodic sync thread that had gotten stuck at some point before the reload. Evidence of this can be seen in the thread dump. at-at uses a future to shut down and reset the thread pool we use to schedule sync; see the at-at stop-and-reset-pool! call. PDB calls this function when stopping the PDB TK sync service, seen here. Because this happens in a future in the at-at library, it doesn't block the caller; it waits until all threads running in the pool finish the job they're working on. The elapsed time of this thread (121 hrs) roughly lines up with the interval between the reload of the PuppetDB service and when the thread dump was taken afterwards, providing evidence that the at-at stop-and-reset-pool! call was stuck waiting on a thread in the pool to finish. at-at pool shutdown thread from dump.

We also see evidence in the thread dump of a sync thread that appears to be stuck waiting to deref a promise in puppetlabs/clj-http-client. The elapsed time of this thread (485 hrs) indicates that a periodic sync got stuck sometime before PDB was reloaded and was never cleared during the reload. Previously, a periodic sync stuck like this would have left a long-running query held open, but because of the 2 hr statement_timeout the customer had in place for the pe-puppetdb user, Postgres was able to continue to function even though sync was stuck. stuck sync thread from dump.

As I understand it, when TK services get a SIGHUP, the signal is intercepted and the stop/start methods of the service and its TK deps are called, but the JVM is not fully shut down.
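The interaction between the graceful pool shutdown and the undelivered promise can be sketched in Java (the real code is Clojure using at-at and clj-http-client; the class and method names below are our illustrative assumptions, not PDB's actual code). A graceful shutdown waits for running tasks, so it can never complete while a worker is blocked on a promise that is never delivered; only a hard stop, like the SIGTERM path, interrupts the stuck worker.

```java
import java.util.concurrent.*;

// Illustrative analogue of the stuck at-at pool shutdown, not PDB's actual code.
public class StuckShutdownDemo {
    /** Returns whether the pool managed to terminate within the wait. */
    static boolean shutdownCompletes(long waitMillis) throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CompletableFuture<String> response = new CompletableFuture<>(); // never delivered

        // The periodic-sync task blocks forever waiting on the response promise,
        // like the sync thread stuck in a deref inside clj-http-client.
        pool.submit(() -> response.get());

        // Graceful shutdown lets running tasks finish first, so it can never
        // complete while the worker is blocked -- cf. the 121 hr thread in the dump.
        pool.shutdown();
        boolean terminated = pool.awaitTermination(waitMillis, TimeUnit.MILLISECONDS);

        pool.shutdownNow(); // a hard stop (the SIGTERM path) interrupts the stuck worker
        return terminated;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("terminated=" + shutdownCompletes(500)); // terminated=false
    }
}
```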
This could explain the behaviour we noticed: the global currently-syncing atom was never reset due to the hung sync and at-at shutdown, so every periodic sync after the reload reported that another sync was already in progress. A full stop/start (SIGTERM) of the PDB service corrected the issue because it forced the stuck threads to shut down and reset the state in the currently-syncing atom.

We're attempting to recreate a situation in which sync can get stuck in this way but haven't been able to yet. We believe it's due to clj-http-client blocking on a promise during shutdown which never gets delivered; see where clj-http-client derefs the promise here. This deref is one of the last things in the call stack for the stuck sync thread linked above. If we're unable to recreate the error case needed to trigger this issue, we're considering using a different client for sync which doesn't have the same error handling as a way to work around/avoid the issue. I'll update the ticket as we find out more.

In the meantime, if sync stops running, a full stop/start of the PuppetDB service should resolve any cases where sync has become wedged and avoid the issue where periodic sync thinks another sync is already in progress.
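One general defensive pattern worth noting (our suggestion, not something clj-http-client or PDB currently does) is to bound the wait on the response promise, so that a delivery that never arrives surfaces as a timeout instead of a permanently wedged thread. In Java terms, with a hypothetical helper:

```java
import java.util.concurrent.*;

// Hypothetical helper: a bounded wait in place of an unbounded blocking deref.
class BoundedDeref {
    /** Wait for a response, but fail fast if it is never delivered. */
    static String awaitResponse(CompletableFuture<String> response, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        try {
            return response.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Failing here lets the sync attempt error out, release the
            // currently-syncing flag, and let the next periodic sync run.
            throw new ExecutionException("sync response never delivered", e);
        }
    }
}
```

Clojure's deref has a timeout arity that would serve the same purpose, if the client's response handling were restructured to use it.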