Hi all,
We have a recurring problem with Prometheus repeatedly getting OOMKilled on startup while replaying the write-ahead log (WAL). I searched through GitHub issues, but as far as I can tell there is no solution or currently open issue covering this.
We are running on Kubernetes in GKE using the prometheus-operator Helm chart, on Google Cloud's Preemptible VMs. These VMs live at most 24 hours, so our Prometheus pods also get killed and are automatically rescheduled by Kubernetes (the data is on a persistent volume, of course). To avoid loss of metrics, we run two identically configured replicas, each with its own storage, scraping the same targets.
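For reference, the HA part of our setup boils down to roughly the following in the chart's values (field names are from the prometheus-operator Prometheus CRD; the storage size here is illustrative, not our exact figure):

```yaml
# Sketch of the relevant prometheus-operator settings (illustrative values):
prometheus:
  prometheusSpec:
    replicas: 2                        # two identical replicas, same scrape targets
    replicaExternalLabelName: replica  # distinguishes the two replicas' series
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 100Gi           # each replica gets its own persistent volume
```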
We monitor numerous GCE VMs that do batch processing, running anywhere from a few minutes to several hours. This workload is bursty, fluctuating between tens and hundreds of VMs active at any time, so the Prometheus wal folder sometimes grows to 10-15 GB. Prometheus usually handles this workload with about half a CPU core and 8 GB of RAM, and if left to its own devices, the wal folder shrinks again when the load decreases.
The problem is that when there is a backlog and Prometheus is restarted (because the preemptible VM goes away), it uses several times more RAM to recover the wal folder. This often exhausts all the available memory on the Kubernetes worker, so Prometheus is killed by the OOM killer over and over again, until I log in and delete the wal folder, losing several hours of metrics. I have already doubled the size of the VMs just to accommodate Prometheus and I am reluctant to do so again. Running non-preemptible VMs would triple the cost of these instances, and Prometheus could still get restarted when we roll out an update -- so that would probably not solve the issue properly either.
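For completeness, the manual intervention I currently resort to looks roughly like this (namespace and pod names are illustrative for our setup; the data directory is the default /prometheus):

```
# Break the CrashLoopBackOff by discarding the WAL -- this loses several
# hours of metrics, which is exactly what I want to avoid.
kubectl -n monitoring exec prometheus-k8s-0 -c prometheus -- rm -rf /prometheus/wal
kubectl -n monitoring delete pod prometheus-k8s-0   # the operator recreates it
```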
I don't know if there is something special about our use case, but I did come across a blog post describing the same high memory usage behaviour on startup.
I feel that unless there is a fix I can apply myself, this would warrant either a bug report or a feature request -- Prometheus should be able to recover without operator intervention or loss of metrics. And for a process running on Kubernetes, we should be able to set memory "request" and "limit" values close to actual expected usage, rather than 3-4 times the steady-state usage just to accommodate the memory requirements of the startup phase.
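Concretely, what I would like to be able to do is size the container like this (the numbers below are our steady-state figures, not a recommendation), instead of setting the memory limit 3-4x higher just to survive WAL replay:

```yaml
# Illustrative container resources, close to steady-state usage:
resources:
  requests:
    cpu: 500m
    memory: 8Gi
  limits:
    memory: 10Gi   # today this has to be 3-4x higher to survive startup
```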
Please let me know what information I should provide, if any. I have some graph screenshots that would be relevant.
Many thanks,
Vik