Spoke with the FOSS user who reported the issue and got the results of the following query on their system, where the update factsets query was still stuck after ~9 days at that point.
select a.pid,
       pg_blocking_pids(a.pid) as blocked_by,
       a.query as blocked_query,
       array_agg(l.mode) as mode,
       array_agg(l.relation::regclass) as regclass,
       array_agg(l.granted) as granted
from pg_stat_activity a join pg_locks l on a.pid = l.pid
where cardinality(pg_blocking_pids(a.pid)) > 0
group by a.pid, a.query;
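A companion query can pull the blockers' own state alongside each blocked backend, which saves cross-referencing pids by hand. This is a sketch; it assumes each blocking pid still has a row in pg_stat_activity:

```sql
-- For each blocked backend, show the blocking pid together with the
-- blocker's current state, wait event, and query text.
select blocked.pid as blocked_pid,
       b.pid as blocking_pid,
       blocker.state,
       blocker.wait_event_type,
       blocker.query as blocking_query
from pg_stat_activity blocked
join lateral unnest(pg_blocking_pids(blocked.pid)) as b(pid) on true
join pg_stat_activity blocker on blocker.pid = b.pid;
```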
Result gist

In the results you can see that an update factsets query (pid:10624) and an autovacuum ANALYZE (pid:13060) are blocked waiting on locks held by pid:4971 (the stuck query). The start times of these and the two other oldest blocked update factsets queries are below (queries have been truncated):
pid         | 9471
query_start | 2021-02-09 17:08:11.511969-05
query       | UPDATE factsets SET timestamp = $1,environment_id = $2,produc

pid         | 13060
query_start | 2021-02-09 17:13:09.901283-05
query       | autovacuum: ANALYZE public.factsets

pid         | 10624
query_start | 2021-02-09 17:54:03.121967-05
query       | UPDATE factsets SET timestamp = $1,environment_id = $2,produc

pid         | 19375
query_start | 2021-02-09 18:46:18.568418-05
query       | UPDATE factsets SET timestamp = $1,environment_id = $2,produc
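Since the stuck backend (pid:4971) doesn't appear in the blocked-query output itself, its lock entries have to be pulled from pg_locks directly. A sketch of that check (the pid is the stuck backend's pid from the report above):

```sql
-- List every lock pid:4971 holds or is waiting for. A backend normally
-- holds an ExclusiveLock of locktype 'transactionid' on its own xid;
-- that is the lock a blocked UPDATE (e.g. pid:10624) would be
-- requesting in ShareLock mode.
select locktype, relation::regclass, transactionid, mode, granted
from pg_locks
where pid = 4971
order by granted desc, locktype;
```

If every row comes back granted = true, that would line up with the observation that pid:4971 is blocking others without waiting on anything itself.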
I'm not really sure what's going on here, but I've noticed a few things:

In the results in the gist above, the pid:10624 query appears to be blocked by pid:4971 waiting on a ShareLock. After a bit of searching around on the web, it looks like the ShareLock is probably being taken against the transaction ID of pid:4971, which is how one UPDATE queues behind another's uncommitted row changes.

It's possible for updates like this to deadlock with one another, but the examples I've found are the kind that the Postgres deadlock detector can find, interrupting the offending queries. Because the stuck query (pid:4971) isn't itself waiting on any locks, I don't think this is an AB/BA-style deadlock that Postgres could detect.

The autovacuum ANALYZE query (pid:13060) is blocked on pid:4971, waiting for an ExclusiveLock on the idx_factsets_jsonb_merged index. This is the (stable||volatile) GIN index that pid:4971 was seen doing all the I/O against.

Every other update factsets query is queued behind pid:10624 and the updates ahead of it in the queue, waiting for an exclusive lock on the factsets table.

It's possible that an update which hit gin_pending_list_limit would have to do the work of flushing the GIN pending list into the index as part of the update, and the table bloat would make that process even slower than normal. I don't think any additional slowness there due to bloat would explain the query being stuck for days, though.
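To check whether the pending list is actually a factor, the GIN metapage can be inspected with the pageinspect extension. This is a sketch, assuming the extension is available and the session has superuser rights; the index name is taken from the report above:

```sql
-- Requires the pageinspect extension (superuser only).
create extension if not exists pageinspect;

-- Current threshold at which the pending list is flushed into the index.
show gin_pending_list_limit;

-- Pending-list size, read from the GIN metapage (block 0 of the index).
select n_pending_pages, n_pending_tuples
from gin_metapage_info(get_raw_page('idx_factsets_jsonb_merged', 0));

-- The pending list can also be flushed explicitly if it has grown large:
-- select gin_clean_pending_list('idx_factsets_jsonb_merged'::regclass);
```

A large n_pending_tuples here would support the idea that updates are periodically paying the cost of merging the pending list, though as noted above that alone shouldn't account for a multi-day stall.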