Jira (PDB-5027) Investigate factset update getting stuck in DataFileRead


Zachary Kent (Jira)

Feb 16, 2021, 4:38:04 PM
to puppe...@googlegroups.com
Zachary Kent created an issue
 
PuppetDB / Bug PDB-5027
Investigate factset update getting stuck in DataFileRead
Issue Type: Bug
Assignee: Unassigned
Created: 2021/02/16 1:37 PM
Priority: Normal
Reporter: Zachary Kent

We recently had a FOSS person running PDB 6.2.0 report an issue that looked similar to one we've seen at a PE site recently (see restricted comment for a link to the internal slack message about the PE issue). In this case PDB stopped processing commands (maybe only reports) and stopped logging messages until the disk filled up 4 hours later because commands continued to be enqueued. See the FOSS slack thread for more context.

Looking into this issue we found that there were multiple commands blocked waiting on locks behind an update factset command which was stuck waiting on wait_event=DataFileRead. This query appeared to be stuck for ~2 days at that point. We also noticed that there was quite a bit of bloat in the factsets and resource_events tables. In the similar PE case CS was able to resolve the issue by running pg_cancel_backend(pid) and running pg_repack on the bloated tables.
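The mitigation CS used can be sketched roughly as follows (the pid here is a placeholder for the actual backend pid of the stuck query, and the pg_repack invocation is illustrative):

```sql
-- Cancel the stuck backend (substitute the real pid from pg_stat_activity):
SELECT pg_cancel_backend(4971);
-- If cancel has no effect, pg_terminate_backend(4971) is the heavier option.

-- Then rebuild the bloated tables online with pg_repack, run from a shell:
--   pg_repack --table public.factsets --table public.resource_events <dbname>
```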

state of bloat in the factset/resource_events tables:

 schemaname |     relname     | n_dead_tup | n_live_tup | last_autovacuum
------------+-----------------+------------+------------+-----------------
 public     | factsets        |   11067234 |      11266 |
 public     | resource_events |   12623797 |   67360341 |
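For reference, stats like these come from pg_stat_user_tables; a query along these lines should reproduce the output above:

```sql
SELECT schemaname, relname, n_dead_tup, n_live_tup, last_autovacuum
FROM pg_stat_user_tables
WHERE relname IN ('factsets', 'resource_events');
```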

 

Poking around at the factset ingestion functions, I noticed that we use the repeatable read transaction isolation level for factset submission and rely on serialization errors to cancel competing concurrent transactions. See here and the call site of that function for more detail. We should check whether repeatable read causes transactions to grab stronger row-level locks that may cause other queries to park behind an update factset query. We also need to check whether there is any way two concurrent factset updates could deadlock with each other in a way the Postgres deadlock detector wouldn't be able to see.
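As a sketch of the behavior to verify (this is not PDB's actual code, and the column names are illustrative), two concurrent repeatable read transactions updating the same factset row behave like this in stock Postgres:

```sql
-- Session A:
BEGIN ISOLATION LEVEL REPEATABLE READ;
UPDATE factsets SET timestamp = now() WHERE certname = 'node-1';

-- Session B (concurrently):
BEGIN ISOLATION LEVEL REPEATABLE READ;
UPDATE factsets SET timestamp = now() WHERE certname = 'node-1';
-- B blocks here on A's row lock.

-- Session A:
COMMIT;
-- B then fails with SQLSTATE 40001:
-- ERROR:  could not serialize access due to concurrent update
```

The question is whether anything in our usage deviates from this picture in a way that leaves a transaction parked instead of erroring out.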

See this thread in the Postgres slack channel for more related information. It looks like the FOSS person was able to determine that the Postgres pid for the stuck update factset query was doing a lot of I/O in these calls: ginEntryInsert > ginInsertItemPointers > ginFindLeafPage. It may be that when a table is heavily bloated, updating the GIN index over the jsonb factsets slows things down considerably. Overall I don't think anything other than a deadlock of some type can account for an update query being stuck open for two days, though even that wouldn't explain the constant I/O the query was doing. While working on this ticket, it would still be worth looking into the autovacuum settings on the factsets table and how GIN indexes on jsonb columns get updated. Check the GIN fast update technique to determine whether setting fastupdate would be useful in our case.
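The knobs mentioned above can be inspected and adjusted roughly like this (the index name matches the one discussed in this ticket; the scale-factor value is only an example, not a recommendation yet):

```sql
-- Inspect current settings:
SHOW gin_pending_list_limit;                         -- default is 4MB
SELECT reloptions FROM pg_class WHERE relname = 'factsets';

-- Candidate experiments:
ALTER INDEX idx_factsets_jsonb_merged SET (fastupdate = off);
ALTER TABLE factsets SET (autovacuum_vacuum_scale_factor = 0.01);
```

With fastupdate off, GIN entries are inserted at write time instead of being batched into the pending list, trading slower individual updates for no large pending-list flushes.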

This message was sent by Atlassian Jira (v8.5.2#805002-sha1:a66f935)

Zachary Kent (Jira)

Feb 23, 2021, 7:28:01 PM
to puppe...@googlegroups.com
Zachary Kent commented on Bug PDB-5027
 
Re: Investigate factset update getting stuck in DataFileRead

Spoke with the FOSS person who reported the issue and got the results of the following query from their system, where the update factset query was still stuck after ~9 days at that point.

select a.pid,
  pg_blocking_pids(a.pid) as blocked_by,
  a.query as blocked_query,
  array_agg(l.mode) as mode,
  array_agg(l.relation::regclass) as regclass,
  array_agg(l.granted) as granted
from pg_stat_activity a join pg_locks l on a.pid = l.pid  
where cardinality(pg_blocking_pids(a.pid)) > 0
group by a.pid, a.query;

Result gist

In the results you can see that an update factset query (pid:10624) and an autovacuum ANALYZE (pid:13060) are blocked waiting on locks held by pid:4971 (the stuck query).

You can see the start times of these and the two other oldest blocked update factset queries below (queries have been truncated):

pid         | 9471
query_start | 2021-02-09 17:08:11.511969-05
query       | UPDATE factsets SET timestamp = $1,environment_id = $2,produc

pid         | 13060
query_start | 2021-02-09 17:13:09.901283-05
query       | autovacuum: ANALYZE public.factsets

pid         | 10624
query_start | 2021-02-09 17:54:03.121967-05
query       | UPDATE factsets SET timestamp = $1,environment_id = $2,produc

pid         | 19375
query_start | 2021-02-09 18:46:18.568418-05
query       | UPDATE factsets SET timestamp = $1,environment_id = $2,produc

I'm not really sure what's going on here but I've noticed a few things I'll describe below:

In the results in the gist above, the pid:10624 query appears to be blocked by pid:4971 waiting on a ShareLock. After a bit of searching around on the web, it looks like the ShareLock is probably held on the transaction id of pid:4971 (a backend waiting for another transaction to commit or abort takes a ShareLock on its transaction id). It's possible for updates like this to deadlock with one another, but the examples I've found are the kind the Postgres deadlock detector can find and interrupt. Because the stuck query (pid:4971) isn't itself waiting on any locks, I don't think there is an AB/BA-type deadlock that Postgres can detect.
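To confirm that the ShareLock is against pid:4971's transaction id rather than a relation, a query along these lines (using the blocked pid from the gist) should show an ungranted lock with locktype = 'transactionid':

```sql
SELECT l.pid, l.locktype, l.transactionid, l.mode, l.granted
FROM pg_locks l
WHERE l.pid = 10624
  AND l.locktype = 'transactionid'
  AND NOT l.granted;
```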

The autovacuum ANALYZE query (pid:13060) is blocked on pid:4971 waiting for an ExclusiveLock on the idx_factsets_jsonb_merged index. This is the (stable||volatile) GIN index that pid:4971 was seen doing all the I/O against.

Every other update factset query is queued behind pid:10624 and the updates that came before it in the queue, waiting for an exclusive lock on the factsets table.

It's possible that a query which hit the gin_pending_list_limit on an update would have to do the work of flushing the pending list into the GIN index during the update, and the table bloat would make this process even slower than normal. I don't think any additional slowness there due to bloat would explain the query being stuck for days, though.
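If we want to see how large the GIN pending list has actually grown on their system, the pageinspect extension can read the index metapage (requires superuser; index name as above):

```sql
CREATE EXTENSION IF NOT EXISTS pageinspect;

SELECT n_pending_pages, n_pending_tuples
FROM gin_metapage_info(get_raw_page('idx_factsets_jsonb_merged', 0));
```

A very large n_pending_tuples would support the theory that updates are periodically stuck flushing the pending list.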
