Hi,
Hoping for some insights on whether we're seeing issues due to scaling, or if something is odd.
We've started seeing very long WAL checkpoint durations on our largest Prometheus instance. A few days ago it was 20+ minutes, but tonight I got paged due to runaway disk usage, and when I checked the logs the last checkpoints took well over an hour. The wal/ directory is 70% of total disk usage (154G/215G).
Some instance configuration information:
It is running inside a GKE cluster on a dedicated node, with 8 cores/48G memory. We're setting tsdb retention limits to 7d and 200G. We're running 2.19.1 since a few days, before that 2.18.0.
As mentioned, this is a large instance, scraping ~18k targets and prometheus_tsdb_head_series is ~11.6M. Still, this has been working relatively stable for a few months, until something caused it to destabilize. Maybe we just reached some tipping point?
I've included logs below showing the increased duration over time since the last restart ~24h ago.
Let me know if there's any other info I can supply.
Best,
Calle
Logs:
{"caller":"head.go:645","component":"tsdb","level":"info","msg":"Replaying WAL and on-disk memory mappable chunks if any, this may take a while","ts":"2020-06-24T04:57:27.204Z"}
{"caller":"head.go:682","component":"tsdb","level":"info","msg":"WAL checkpoint loaded","ts":"2020-06-24T05:16:15.391Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93060,"ts":"2020-06-24T05:16:18.986Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93061,"ts":"2020-06-24T05:16:22.684Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93062,"ts":"2020-06-24T05:16:26.556Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93063,"ts":"2020-06-24T05:16:31.028Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93064,"ts":"2020-06-24T05:16:38.358Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93065,"ts":"2020-06-24T05:16:42.451Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93066,"ts":"2020-06-24T05:16:43.950Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93067,"ts":"2020-06-24T05:16:48.091Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93068,"ts":"2020-06-24T05:16:52.752Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93069,"ts":"2020-06-24T05:16:56.866Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93070,"ts":"2020-06-24T05:17:00.974Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93071,"ts":"2020-06-24T05:17:05.708Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93072,"ts":"2020-06-24T05:17:09.800Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93073,"ts":"2020-06-24T05:17:26.553Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93074,"ts":"2020-06-24T05:17:29.828Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93075,"ts":"2020-06-24T05:17:33.221Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93076,"ts":"2020-06-24T05:17:36.731Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93077,"ts":"2020-06-24T05:17:40.160Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93078,"ts":"2020-06-24T05:17:41.211Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93079,"ts":"2020-06-24T05:17:44.711Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93080,"ts":"2020-06-24T05:17:48.304Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93081,"ts":"2020-06-24T05:17:51.804Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93082,"ts":"2020-06-24T05:17:55.257Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93083,"ts":"2020-06-24T05:17:58.633Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93084,"ts":"2020-06-24T05:17:59.922Z"}
{"caller":"head.go:706","component":"tsdb","level":"info","maxSegment":93085,"msg":"WAL segment loaded","segment":93085,"ts":"2020-06-24T05:17:59.922Z"}
{"caller":"head.go:709","component":"tsdb","duration":"20m32.718268711s","level":"info","msg":"WAL replay completed","ts":"2020-06-24T05:17:59.922Z"}
{"caller":"head.go:792","component":"tsdb","duration":"10.702313139s","level":"info","msg":"Head GC completed","ts":"2020-06-24T05:20:02.911Z"}
{"caller":"head.go:869","component":"tsdb","duration":"27m11.480969261s","first":93060,"last":93076,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T05:47:14.438Z"}
{"caller":"head.go:792","component":"tsdb","duration":"10.491417535s","level":"info","msg":"Head GC completed","ts":"2020-06-24T05:49:03.732Z"}
{"caller":"head.go:869","component":"tsdb","duration":"28m5.634599251s","first":93077,"last":93089,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T06:17:09.386Z"}
{"caller":"head.go:792","component":"tsdb","duration":"10.926510097s","level":"info","msg":"Head GC completed","ts":"2020-06-24T06:19:18.329Z"}
{"caller":"head.go:869","component":"tsdb","duration":"28m55.000491785s","first":93090,"last":93102,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T06:48:13.356Z"}
{"caller":"head.go:792","component":"tsdb","duration":"10.849142474s","level":"info","msg":"Head GC completed","ts":"2020-06-24T06:50:06.635Z"}
{"caller":"head.go:869","component":"tsdb","duration":"29m55.379508706s","first":93103,"last":93115,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T07:20:02.033Z"}
{"caller":"head.go:792","component":"tsdb","duration":"11.353482447s","level":"info","msg":"Head GC completed","ts":"2020-06-24T07:22:01.081Z"}
{"caller":"head.go:869","component":"tsdb","duration":"31m47.883179757s","first":93116,"last":93128,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T07:53:48.997Z"}
{"caller":"head.go:792","component":"tsdb","duration":"11.379979457s","level":"info","msg":"Head GC completed","ts":"2020-06-24T07:55:45.663Z"}
{"caller":"head.go:869","component":"tsdb","duration":"33m43.864027371s","first":93129,"last":93141,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T08:29:29.573Z"}
{"caller":"head.go:792","component":"tsdb","duration":"11.842346357s","level":"info","msg":"Head GC completed","ts":"2020-06-24T08:31:55.605Z"}
{"caller":"head.go:869","component":"tsdb","duration":"34m44.030004224s","first":93142,"last":93156,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T09:06:39.650Z"}
{"caller":"head.go:792","component":"tsdb","duration":"11.816391001s","level":"info","msg":"Head GC completed","ts":"2020-06-24T09:08:43.310Z"}
{"caller":"head.go:869","component":"tsdb","duration":"36m18.979807122s","first":93157,"last":93171,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T09:45:02.312Z"}
{"caller":"head.go:792","component":"tsdb","duration":"11.679089866s","level":"info","msg":"Head GC completed","ts":"2020-06-24T09:47:30.274Z"}
{"caller":"head.go:869","component":"tsdb","duration":"37m28.519721249s","first":93172,"last":93188,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T10:24:58.848Z"}
{"caller":"head.go:792","component":"tsdb","duration":"11.642761491s","level":"info","msg":"Head GC completed","ts":"2020-06-24T10:27:02.968Z"}
{"caller":"head.go:869","component":"tsdb","duration":"38m30.98126742s","first":93189,"last":93205,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T11:05:33.989Z"}
{"caller":"head.go:792","component":"tsdb","duration":"11.953396425s","level":"info","msg":"Head GC completed","ts":"2020-06-24T11:07:42.690Z"}
{"caller":"head.go:869","component":"tsdb","duration":"40m41.750837742s","first":93206,"last":93222,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T11:48:24.524Z"}
{"caller":"head.go:792","component":"tsdb","duration":"11.924025363s","level":"info","msg":"Head GC completed","ts":"2020-06-24T11:50:34.565Z"}
{"caller":"head.go:869","component":"tsdb","duration":"42m36.719910209s","first":93223,"last":93239,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T12:33:11.342Z"}
{"caller":"head.go:792","component":"tsdb","duration":"11.844043457s","level":"info","msg":"Head GC completed","ts":"2020-06-24T12:35:28.797Z"}
{"caller":"head.go:869","component":"tsdb","duration":"43m41.251834633s","first":93240,"last":93258,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T13:19:10.143Z"}
{"caller":"head.go:792","component":"tsdb","duration":"11.81029093s","level":"info","msg":"Head GC completed","ts":"2020-06-24T13:21:22.759Z"}
{"caller":"head.go:869","component":"tsdb","duration":"45m55.549433015s","first":93259,"last":93277,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T14:07:18.370Z"}
{"caller":"head.go:792","component":"tsdb","duration":"11.705922433s","level":"info","msg":"Head GC completed","ts":"2020-06-24T14:09:29.681Z"}
{"caller":"head.go:869","component":"tsdb","duration":"46m36.224783951s","first":93278,"last":93298,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T14:56:05.977Z"}
{"caller":"head.go:792","component":"tsdb","duration":"11.997460717s","level":"info","msg":"Head GC completed","ts":"2020-06-24T14:58:21.086Z"}
{"caller":"head.go:869","component":"tsdb","duration":"48m39.050349407s","first":93299,"last":93319,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T15:47:00.187Z"}
{"caller":"head.go:792","component":"tsdb","duration":"11.678145052s","level":"info","msg":"Head GC completed","ts":"2020-06-24T15:49:13.480Z"}
{"caller":"head.go:869","component":"tsdb","duration":"50m54.193535944s","first":93320,"last":93342,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T16:40:07.722Z"}
{"caller":"head.go:792","component":"tsdb","duration":"12.087928789s","level":"info","msg":"Head GC completed","ts":"2020-06-24T16:42:26.278Z"}
{"caller":"head.go:869","component":"tsdb","duration":"52m35.511813543s","first":93343,"last":93365,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T17:35:01.922Z"}
{"caller":"head.go:792","component":"tsdb","duration":"12.320304825s","level":"info","msg":"Head GC completed","ts":"2020-06-24T17:37:26.850Z"}
{"caller":"head.go:869","component":"tsdb","duration":"55m13.524592427s","first":93366,"last":93389,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T18:32:40.419Z"}
{"caller":"head.go:792","component":"tsdb","duration":"12.558102685s","level":"info","msg":"Head GC completed","ts":"2020-06-24T18:35:08.768Z"}
{"caller":"head.go:869","component":"tsdb","duration":"57m18.998055501s","first":93390,"last":93414,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T19:32:27.826Z"}
{"caller":"head.go:792","component":"tsdb","duration":"12.331336051s","level":"info","msg":"Head GC completed","ts":"2020-06-24T19:34:56.045Z"}
{"caller":"head.go:869","component":"tsdb","duration":"59m26.168988542s","first":93415,"last":93439,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T20:34:22.252Z"}
{"caller":"head.go:792","component":"tsdb","duration":"11.957175121s","level":"info","msg":"Head GC completed","ts":"2020-06-24T20:36:47.776Z"}
{"caller":"head.go:869","component":"tsdb","duration":"1h2m20.433188445s","first":93440,"last":93465,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T21:39:08.314Z"}
{"caller":"head.go:792","component":"tsdb","duration":"12.899107698s","level":"info","msg":"Head GC completed","ts":"2020-06-24T21:41:36.530Z"}
{"caller":"head.go:869","component":"tsdb","duration":"1h5m3.288078386s","first":93466,"last":93492,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T22:46:39.904Z"}
{"caller":"head.go:792","component":"tsdb","duration":"12.25223606s","level":"info","msg":"Head GC completed","ts":"2020-06-24T22:49:04.150Z"}
{"caller":"head.go:869","component":"tsdb","duration":"1h7m12.771595983s","first":93493,"last":93519,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-24T23:56:16.970Z"}
{"caller":"head.go:792","component":"tsdb","duration":"12.094735046s","level":"info","msg":"Head GC completed","ts":"2020-06-24T23:59:05.108Z"}
{"caller":"head.go:869","component":"tsdb","duration":"1h9m42.283215414s","first":93520,"last":93548,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-25T01:08:47.404Z"}
{"caller":"head.go:792","component":"tsdb","duration":"12.847163161s","level":"info","msg":"Head GC completed","ts":"2020-06-25T01:11:21.337Z"}
{"caller":"head.go:869","component":"tsdb","duration":"1h13m25.658134849s","first":93549,"last":93577,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-25T02:24:47.072Z"}
{"caller":"head.go:792","component":"tsdb","duration":"12.438982163s","level":"info","msg":"Head GC completed","ts":"2020-06-25T02:27:37.593Z"}
{"caller":"head.go:869","component":"tsdb","duration":"1h15m12.294907388s","first":93578,"last":93606,"level":"info","msg":"WAL checkpoint complete","ts":"2020-06-25T03:42:50.066Z"}
{"caller":"head.go:792","component":"tsdb","duration":"12.408087258s","level":"info","msg":"Head GC completed","ts":"2020-06-25T03:45:25.863Z"}