Hello,
We saw a lot of these errors in our Prometheus servers this morning, after a compaction had already finished around 08:00 and a TSDB reload finished around 08:11:
====================
level=info ts=2019-10-25T07:00:24.094Z caller=compact.go:496 component=tsdb msg="write block" mint=1571976000000 maxt=1571983200000 ulid=01DR......... duration=24.051855065s
level=info ts=2019-10-25T07:00:30.328Z caller=head.go:596 component=tsdb msg="head GC completed" duration=2.1240968s
level=info ts=2019-10-25T07:00:58.919Z caller=head.go:666 component=tsdb msg="WAL checkpoint complete" first=4857 last=4871 duration=28.590403845s
====================
Around the same time, the CPU usage of the Prometheus pods spiked sharply, as you can see in the attached graphs. We run two replicas (on completely different K8s nodes), and both spiked in CPU at the same time. As a result, Grafana was unable to pull any metrics from Prometheus, and we saw errors in Grafana.
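(For reference, the CPU graphs come from a query along these lines, using Prometheus's standard self-instrumentation metric; the `job` label here is just whatever our self-scrape job happens to be called:)

```promql
# Per-pod CPU usage of the Prometheus processes, averaged over 5m
rate(process_cpu_seconds_total{job="prometheus"}[5m])
```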
There is a Thanos sidecar in the same pod doing a "remote read" from the Prometheus instances; could this be related in any way?
I have found one or two other reports of a similar error, but nothing exactly the same, so I am wondering if anyone has seen this before?
Thanks.