Long WAL checkpoint durations

12 views
Skip to first unread message

Calle Pettersson

unread,
Jun 25, 2020, 1:42:29 AM6/25/20
to Prometheus Users
Hi,
Hoping for some insights on whether we're seeing issues due to scaling, or if something is odd.
We've started seeing very long WAL checkpoint durations on our largest Prometheus instance. A few days ago it was 20+ minutes, but tonight I got paged due to runaway disk usage, and when I checked the logs the last checkpoints took well over an hour. The wal/ directory is 70% of total disk usage (154G/215G).

Some instance configuration information:
It is running inside a GKE cluster on a dedicated node, with 8 cores/48G memory. We're setting tsdb retention limits to 7d and 200G. We're running 2.19.1 since a few days, before that 2.18.0.

As mentioned, this is a large instance, scraping ~18k targets and prometheus_tsdb_head_series is ~11.6M. Still, this has been working relatively stable for a few months, until something caused it to destabilize. Maybe we just reached some tipping point?

I've included logs below showing the increased duration over time since the last restart ~24h ago.

Let me know if there's any other info I can supply.

Best,
Calle

Logs:
{"caller":"head.go:645","component":"tsdb","level":"info","msg":"Replaying WAL and on-disk memory mappable chunks if any, this may take a while","ts":"2020-06-24T04:57:27.204Z"} {"caller":"head.go:682","component":"tsdb","level":"info","msg":"WAL checkpoint loaded","ts":"2020-06-24T05:16:15.391Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93060,"ts":"2020-06-24T05:16:18.986Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93061,"ts":"2020-06-24T05:16:22.684Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93062,"ts":"2020-06-24T05:16:26.556Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93063,"ts":"2020-06-24T05:16:31.028Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93064,"ts":"2020-06-24T05:16:38.358Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93065,"ts":"2020-06-24T05:16:42.451Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93066,"ts":"2020-06-24T05:16:43.950Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93067,"ts":"2020-06-24T05:16:48.091Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93068,"ts":"2020-06-24T05:16:52.752Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93069,"ts":"2020-06-24T05:16:56.866Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93070,"ts":"2020-06-24T05:17:00.974Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93071,"ts":"2020-06-24T05:17:05.708Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93072,"ts":"2020-06-24T05:17:09.800Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93073,"ts":"2020-06-24T05:17:26.553Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93074,"ts":"2020-06-24T05:17:29.828Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93075,"ts":"2020-06-24T05:17:33.221Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93076,"ts":"2020-06-24T05:17:36.731Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93077,"ts":"2020-06-24T05:17:40.160Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93078,"ts":"2020-06-24T05:17:41.211Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93079,"ts":"2020-06-24T05:17:44.711Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93080,"ts":"2020-06-24T05:17:48.304Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93081,"ts":"2020-06-24T05:17:51.804Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93082,"ts":"2020-06-24T05:17:55.257Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93083,"ts":"2020-06-24T05:17:58.633Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93084,"ts":"2020-06-24T05:17:59.922Z"} {"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93085,"ts":"2020-06-24T05:17:59.922Z"} {"caller":"head.go:709","component":"tsdb","duration":"20m32.718268711s","level":"info","msg":"WAL replay completed","ts":"2020-06-24T05:17:59.922Z"} {"caller":"head.go:792","component":"tsdb","duration":"10.702313139s","level":"info","msg":"Head GC completed","ts":"2020-06-24T05:20:02.911Z"} {"caller":"head.go:869","component":"tsdb","duration":"27m11.480969261s","first":93060,"last":93076,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T05:47:14.438Z"} {"caller":"head.go:792","component":"tsdb","duration":"10.491417535s","level":"info","msg":"Head GC completed","ts":"2020-06-24T05:49:03.732Z"} {"caller":"head.go:869","component":"tsdb","duration":"28m5.634599251s","first":93077,"last":93089,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T06:17:09.386Z"} {"caller":"head.go:792","component":"tsdb","duration":"10.926510097s","level":"info","msg":"Head GC completed","ts":"2020-06-24T06:19:18.329Z"} {"caller":"head.go:869","component":"tsdb","duration":"28m55.000491785s","first":93090,"last":93102,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T06:48:13.356Z"} {"caller":"head.go:792","component":"tsdb","duration":"10.849142474s","level":"info","msg":"Head GC completed","ts":"2020-06-24T06:50:06.635Z"} {"caller":"head.go:869","component":"tsdb","duration":"29m55.379508706s","first":93103,"last":93115,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T07:20:02.033Z"} {"caller":"head.go:792","component":"tsdb","duration":"11.353482447s","level":"info","msg":"Head GC completed","ts":"2020-06-24T07:22:01.081Z"} {"caller":"head.go:869","component":"tsdb","duration":"31m47.883179757s","first":93116,"last":93128,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T07:53:48.997Z"} {"caller":"head.go:792","component":"tsdb","duration":"11.379979457s","level":"info","msg":"Head GC completed","ts":"2020-06-24T07:55:45.663Z"} {"caller":"head.go:869","component":"tsdb","duration":"33m43.864027371s","first":93129,"last":93141,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T08:29:29.573Z"} {"caller":"head.go:792","component":"tsdb","duration":"11.842346357s","level":"info","msg":"Head GC completed","ts":"2020-06-24T08:31:55.605Z"} {"caller":"head.go:869","component":"tsdb","duration":"34m44.030004224s","first":93142,"last":93156,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T09:06:39.650Z"} {"caller":"head.go:792","component":"tsdb","duration":"11.816391001s","level":"info","msg":"Head GC completed","ts":"2020-06-24T09:08:43.310Z"} {"caller":"head.go:869","component":"tsdb","duration":"36m18.979807122s","first":93157,"last":93171,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T09:45:02.312Z"} {"caller":"head.go:792","component":"tsdb","duration":"11.679089866s","level":"info","msg":"Head GC completed","ts":"2020-06-24T09:47:30.274Z"} {"caller":"head.go:869","component":"tsdb","duration":"37m28.519721249s","first":93172,"last":93188,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T10:24:58.848Z"} {"caller":"head.go:792","component":"tsdb","duration":"11.642761491s","level":"info","msg":"Head GC completed","ts":"2020-06-24T10:27:02.968Z"} {"caller":"head.go:869","component":"tsdb","duration":"38m30.98126742s","first":93189,"last":93205,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T11:05:33.989Z"} {"caller":"head.go:792","component":"tsdb","duration":"11.953396425s","level":"info","msg":"Head GC completed","ts":"2020-06-24T11:07:42.690Z"} {"caller":"head.go:869","component":"tsdb","duration":"40m41.750837742s","first":93206,"last":93222,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T11:48:24.524Z"} {"caller":"head.go:792","component":"tsdb","duration":"11.924025363s","level":"info","msg":"Head GC completed","ts":"2020-06-24T11:50:34.565Z"} {"caller":"head.go:869","component":"tsdb","duration":"42m36.719910209s","first":93223,"last":93239,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T12:33:11.342Z"} {"caller":"head.go:792","component":"tsdb","duration":"11.844043457s","level":"info","msg":"Head GC completed","ts":"2020-06-24T12:35:28.797Z"} {"caller":"head.go:869","component":"tsdb","duration":"43m41.251834633s","first":93240,"last":93258,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T13:19:10.143Z"} {"caller":"head.go:792","component":"tsdb","duration":"11.81029093s","level":"info","msg":"Head GC completed","ts":"2020-06-24T13:21:22.759Z"} {"caller":"head.go:869","component":"tsdb","duration":"45m55.549433015s","first":93259,"last":93277,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T14:07:18.370Z"} {"caller":"head.go:792","component":"tsdb","duration":"11.705922433s","level":"info","msg":"Head GC completed","ts":"2020-06-24T14:09:29.681Z"} {"caller":"head.go:869","component":"tsdb","duration":"46m36.224783951s","first":93278,"last":93298,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T14:56:05.977Z"} {"caller":"head.go:792","component":"tsdb","duration":"11.997460717s","level":"info","msg":"Head GC completed","ts":"2020-06-24T14:58:21.086Z"} {"caller":"head.go:869","component":"tsdb","duration":"48m39.050349407s","first":93299,"last":93319,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T15:47:00.187Z"} {"caller":"head.go:792","component":"tsdb","duration":"11.678145052s","level":"info","msg":"Head GC completed","ts":"2020-06-24T15:49:13.480Z"} {"caller":"head.go:869","component":"tsdb","duration":"50m54.193535944s","first":93320,"last":93342,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T16:40:07.722Z"} {"caller":"head.go:792","component":"tsdb","duration":"12.087928789s","level":"info","msg":"Head GC completed","ts":"2020-06-24T16:42:26.278Z"} {"caller":"head.go:869","component":"tsdb","duration":"52m35.511813543s","first":93343,"last":93365,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T17:35:01.922Z"} {"caller":"head.go:792","component":"tsdb","duration":"12.320304825s","level":"info","msg":"Head GC completed","ts":"2020-06-24T17:37:26.850Z"} {"caller":"head.go:869","component":"tsdb","duration":"55m13.524592427s","first":93366,"last":93389,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T18:32:40.419Z"} {"caller":"head.go:792","component":"tsdb","duration":"12.558102685s","level":"info","msg":"Head GC completed","ts":"2020-06-24T18:35:08.768Z"} {"caller":"head.go:869","component":"tsdb","duration":"57m18.998055501s","first":93390,"last":93414,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T19:32:27.826Z"} {"caller":"head.go:792","component":"tsdb","duration":"12.331336051s","level":"info","msg":"Head GC completed","ts":"2020-06-24T19:34:56.045Z"} {"caller":"head.go:869","component":"tsdb","duration":"59m26.168988542s","first":93415,"last":93439,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T20:34:22.252Z"} {"caller":"head.go:792","component":"tsdb","duration":"11.957175121s","level":"info","msg":"Head GC completed","ts":"2020-06-24T20:36:47.776Z"} {"caller":"head.go:869","component":"tsdb","duration":"1h2m20.433188445s","first":93440,"last":93465,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T21:39:08.314Z"} {"caller":"head.go:792","component":"tsdb","duration":"12.899107698s","level":"info","msg":"Head GC completed","ts":"2020-06-24T21:41:36.530Z"} {"caller":"head.go:869","component":"tsdb","duration":"1h5m3.288078386s","first":93466,"last":93492,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T22:46:39.904Z"} {"caller":"head.go:792","component":"tsdb","duration":"12.25223606s","level":"info","msg":"Head GC completed","ts":"2020-06-24T22:49:04.150Z"} {"caller":"head.go:869","component":"tsdb","duration":"1h7m12.771595983s","first":93493,"last":93519,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T23:56:16.970Z"} {"caller":"head.go:792","component":"tsdb","duration":"12.094735046s","level":"info","msg":"Head GC completed","ts":"2020-06-24T23:59:05.108Z"} {"caller":"head.go:869","component":"tsdb","duration":"1h9m42.283215414s","first":93520,"last":93548,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-25T01:08:47.404Z"} {"caller":"head.go:792","component":"tsdb","duration":"12.847163161s","level":"info","msg":"Head GC completed","ts":"2020-06-25T01:11:21.337Z"} {"caller":"head.go:869","component":"tsdb","duration":"1h13m25.658134849s","first":93549,"last":93577,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-25T02:24:47.072Z"} {"caller":"head.go:792","component":"tsdb","duration":"12.438982163s","level":"info","msg":"Head GC completed","ts":"2020-06-25T02:27:37.593Z"} {"caller":"head.go:869","component":"tsdb","duration":"1h15m12.294907388s","first":93578,"last":93606,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-25T03:42:50.066Z"} {"caller":"head.go:792","component":"tsdb","duration":"12.408087258s","level":"info","msg":"Head GC completed","ts":"2020-06-25T03:45:25.863Z"}

Reply all
Reply to author
Forward
0 new messages