We are using Prometheus operator
https://github.com/helm/charts/tree/master/stable/prometheus-operator 0.38.1 and Prometheus 2.18.2.
We are seeing a weird problem with this setup, Intermittently Prometheus's static endpoints like /-/ready, /-/healthy, /-/metrics become unresponsive and they never recover again.
Apparently kubelet checks readiness probes on these endpoints and after some failure, it will send a sigterm to the Prometheus. This is ok but due to these intermittent restarts, we lost metrics for the restarts period for like ~10 mins.
We have run 2 replicas of Prometheus pods and we have seen restarts or sigterm sent on the same time to both pods.
CPU and Memory requested is 8 Core, 24GB. And we are not having much load on the Prometheus in terms of time series(307K only).
Attaching the logs of a sigterm received pod in debug mode and configuration of Prometheus.
Its been a headache for us now, any help is highly appreciated.
Thanks
Ravindra Singh