A customer recently hit an issue where PDB sync appeared to get stuck in a way similar to what's described in this ticket. In the new case the problem presented differently, but we believe the difference is due to some of the mitigations that were recently added to address long-running PDB sync queries. In the more recent case, PDB was reloaded after the service was seen to be intermittently unreachable. After the reload, we noticed that the start-up sync completed successfully, but every subsequent periodic sync reported that it was unable to run with the following log message:
[sync] Refusing to sync from ... Sync already in progress
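The guard that produces this message can be sketched as a compare-and-set flag. The following is a hypothetical Java analogue (PDB's real implementation is a Clojure atom; the class name, method, and message text here are ours for illustration): if the flag is never reset because the sync body hangs, every later sync attempt is refused.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Illustrative analogue of the "currently syncing" guard, not PDB's actual code.
class SyncGuard {
    private final AtomicBoolean currentlySyncing = new AtomicBoolean(false);

    /** Returns true if the sync body ran, false if another sync held the flag. */
    boolean runSync(Runnable syncBody) {
        if (!currentlySyncing.compareAndSet(false, true)) {
            System.out.println("[sync] Refusing to sync: sync already in progress");
            return false;
        }
        try {
            syncBody.run();
        } finally {
            // If syncBody blocks forever, this reset never runs, and every
            // subsequent periodic sync is refused -- the failure mode seen here.
            currentlySyncing.set(false);
        }
        return true;
    }
}
```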
This was caused by the global currently-syncing atom being set to true and never reset because a periodic sync got stuck. We were able to get a thread dump from PDB while it was in this state, which confirmed some of our suspicions.

It appears that the shutdown of the at-at thread pool we use to schedule tasks was still waiting to gracefully shut down a periodic sync thread that had gotten stuck at some point before the reload. Evidence of this can be seen in the thread dump. at-at uses a future to shut down and reset the thread pool we use to schedule sync; see the at-at stop-and-reset-pool! call. PDB calls this function when stopping the PDB TK sync service, seen here. Because this happens in a future in the at-at library, it doesn't block the caller; it waits until all threads running in the pool finish the job they're working on. The elapsed time of this thread (121 hrs) roughly lines up with the interval between the reload of the PuppetDB service and when the thread dump was taken afterwards, providing evidence that the at-at stop-and-reset-pool! call was stuck waiting on a thread in the pool to finish. at-at pool shutdown thread from dump.

We also see evidence in the thread dump of a sync thread that appears to be stuck waiting to deref a promise in puppetlabs/clj-http-client. The elapsed time of this thread (485 hrs) indicates that a periodic sync got stuck sometime before PDB was reloaded and was never cleared during the reload. Previously, a periodic sync stuck like this would have left a long-running query held open, but because of the 2 hr statement_timeout the customer had in place for the pe-puppetdb user, Postgres was able to continue to function even though sync was stuck. stuck sync thread from dump.

As I understand it, when TK services get a SIGHUP, the signal is intercepted and the stop/start methods of the service and its TK deps are called, but the JVM is not fully shut down.
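The interaction between the graceful pool shutdown and the undelivered promise can be sketched in Java (the real code is Clojure using at-at and clj-http-client; the class and method names below are our illustrative assumptions, not PDB's actual code). A graceful shutdown waits for running tasks, so it can never complete while a worker is blocked on a promise that is never delivered; only a hard stop, like the SIGTERM path, interrupts the stuck worker.

```java
import java.util.concurrent.*;

// Illustrative analogue of the stuck at-at pool shutdown, not PDB's actual code.
public class StuckShutdownDemo {
    /** Returns whether the pool managed to terminate within the wait. */
    static boolean shutdownCompletes(long waitMillis) throws InterruptedException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        CompletableFuture<String> response = new CompletableFuture<>(); // never delivered

        // The periodic-sync task blocks forever waiting on the response promise,
        // like the sync thread stuck in a deref inside clj-http-client.
        pool.submit(() -> response.get());

        // Graceful shutdown lets running tasks finish first, so it can never
        // complete while the worker is blocked -- cf. the 121 hr thread in the dump.
        pool.shutdown();
        boolean terminated = pool.awaitTermination(waitMillis, TimeUnit.MILLISECONDS);

        pool.shutdownNow(); // a hard stop (the SIGTERM path) interrupts the stuck worker
        return terminated;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("terminated=" + shutdownCompletes(500)); // terminated=false
    }
}
```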
This could explain the behaviour we noticed: the global currently-syncing atom was never reset due to the hung sync and at-at shutdown, so every periodic sync after the reload reported that another sync was already in progress. A full stop/start (SIGTERM) of the PDB service corrected the issue because it forced the stuck threads to shut down and reset the state in the currently-syncing atom.

We're attempting to recreate a situation in which sync can get stuck in this way but haven't been able to yet. We believe it's due to clj-http-client blocking on a promise during shutdown which never gets delivered; see where clj-http-client derefs the promise here. This deref is one of the last things in the call stack for the stuck sync thread linked above. If we're unable to recreate the error case needed to trigger this issue, we're considering using a different client for sync which doesn't have the same error handling as a way to work around/avoid the issue. I'll update the ticket as we find out more.

In the meantime, if sync stops running, a full stop/start of the PuppetDB service should resolve any cases where sync has become wedged and avoid the issue where periodic sync thinks another sync is already in progress.
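One general defensive pattern worth noting (our suggestion, not something clj-http-client or PDB currently does) is to bound the wait on the response promise, so that a delivery that never arrives surfaces as a timeout instead of a permanently wedged thread. In Java terms, with a hypothetical helper:

```java
import java.util.concurrent.*;

// Hypothetical helper: a bounded wait in place of an unbounded blocking deref.
class BoundedDeref {
    /** Wait for a response, but fail fast if it is never delivered. */
    static String awaitResponse(CompletableFuture<String> response, long timeoutMillis)
            throws InterruptedException, ExecutionException {
        try {
            return response.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Failing here lets the sync attempt error out, release the
            // currently-syncing flag, and let the next periodic sync run.
            throw new ExecutionException("sync response never delivered", e);
        }
    }
}
```

Clojure's deref has a timeout arity that would serve the same purpose, if the client's response handling were restructured to use it.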