Prometheus wal folder and memory usage on startup

Viktor Radnai

Jul 1, 2020, 11:06:15 AM
to promethe...@googlegroups.com
Hi all,

We have a recurring problem with Prometheus repeatedly getting OOMKilled on startup while trying to process the write-ahead log (WAL). I looked through GitHub issues, but as far as I could see there was no solution or currently open issue.

We are running on Kubernetes in GKE via the prometheus-operator Helm chart, on Google Cloud's preemptible VMs. These VMs get killed after at most 24 hours, so our Prometheus pods also get killed and automatically migrated by Kubernetes (the data is on a persistent volume, of course). To avoid loss of metrics, we run two identically configured replicas with their own storage, scraping all the same targets.

We monitor numerous GCE VMs that do batch processing, running anywhere between a few minutes and several hours. This workload is bursty, fluctuating between tens and hundreds of VMs active at any time, so sometimes the Prometheus WAL folder grows to between 10 and 15 GB in size. Prometheus usually handles this workload with about half a CPU core and 8 GB of RAM, and if left to its own devices, the WAL folder will shrink again when the load decreases.

The problem is that when there is a backlog and Prometheus is restarted (due to the preemptible VM going away), it will use several times more RAM to recover the WAL folder. This often exhausts all the available memory on the Kubernetes worker, so Prometheus is killed by the OOM killer over and over again, until I log in and delete the WAL folder, losing several hours of metrics. I have already doubled the size of the VMs just to accommodate Prometheus and I am reluctant to do this again. Running non-preemptible VMs would triple the cost of these instances, and Prometheus might still get restarted when we roll out an update -- so this would probably not even solve the issue properly.

I don't know if there is something special about our use case, but I did come across a blog post describing the same high memory usage behaviour on startup.

I feel that unless there is a fix I can apply, this would warrant either a bug report or a feature request -- Prometheus should be able to recover without operator intervention or loss of metrics. And for a process running on Kubernetes, we should be able to set memory "request" and "limit" values that are close to actual expected usage, rather than 3-4 times the steady-state usage just to accommodate the memory requirements of the startup phase.
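
For concreteness, this is the kind of pod spec I mean -- the numbers are purely illustrative, but today the limit has to be sized for the replay spike rather than for steady state:

```yaml
# Illustrative numbers only: steady state is ~8Gi for us, but the
# limit must cover the 3-4x WAL replay spike or the pod crashloops
# on startup.
resources:
  requests:
    memory: 10Gi
  limits:
    memory: 32Gi   # sized for WAL replay; mostly idle headroom
```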

Please let me know what information I should provide, if any. I have some graph screenshots that would be relevant.

Many thanks,
Vik

Matthias Rampke

Jul 1, 2020, 11:27:17 AM
to Viktor Radnai, Prometheus Users
I have been thinking about this problem as well, since we ran into a similar issue yesterday. In our case, Prometheus had already failed to write out a TSDB block for a few hours but kept on piling data into the head block.

Could TSDB write out blocks during WAL recovery? Say, for every two hours' worth of WAL or even more frequently, it could pause recovery, write a block, delete the WAL up to that point, continue recovery. This would put something of a bound on the memory usage during recovery, and alleviate the issue that recovery from out-of-memory takes even more memory.
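
Sketched out (all names invented, nothing like the real TSDB API -- just the shape of the loop I mean):

```go
package main

import "fmt"

// Hypothetical sketch of checkpointed WAL replay: every flushEvery
// segments, cut a block from the in-memory head and truncate the WAL
// up to that point, bounding peak memory during recovery. The names
// and types are made up for illustration; this is not real TSDB code.

type segment struct{ id, samples int }

func replayWithCheckpoints(segments []segment, flushEvery int) (blocks int) {
	inHead := 0 // samples currently held in the in-memory head
	for i, seg := range segments {
		inHead += seg.samples // replay one segment into the head
		if (i+1)%flushEvery == 0 {
			// Persist a block and delete the replayed WAL segments,
			// releasing the corresponding head memory.
			blocks++
			inHead = 0
		}
	}
	if inHead > 0 { // whatever remains stays in the head after startup
		fmt.Printf("head retains %d samples after replay\n", inHead)
	}
	return blocks
}

func main() {
	segs := make([]segment, 10)
	for i := range segs {
		segs[i] = segment{id: i, samples: 1000}
	}
	fmt.Println("blocks written:", replayWithCheckpoints(segs, 4))
}
```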

Would this help in your case?

/MR


Ben Kochie

Jul 1, 2020, 11:32:43 AM
to Viktor Radnai, Prometheus Users
What version of Prometheus do you have deployed? We've made several major improvements to WAL handling and startup in the last couple of releases.

I would recommend upgrading to 2.19.2 if you haven't.

Viktor Radnai

Jul 1, 2020, 11:36:21 AM
to Matthias Rampke, Prometheus Users
Hi Matthias,

Thanks, I think this should definitely help, but I'm not sure it will always solve the problem. If I understand it correctly, the WAL holds 6 hours of data, and in our experience the high-water mark for memory usage seems to be about 3-4 times the WAL size. So while processing 2 hours' worth, you might go higher than normal, but not several times higher.

What would be very nice is if Prometheus observed the rlimit set for the maximum virtual memory size and flushed the WAL when it got close to that. When Prometheus starts up, it already prints the values (if set):
level=info ts=2020-07-01T15:31:50.711Z caller=main.go:341 vm_limits="(soft=unlimited, hard=unlimited)"

I tried setting these with a small Bash wrapper script and ulimit, but this resulted in a Go runtime OOM error and termination instead of the Linux OOM killer and termination :)
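
To illustrate what I mean (a Linux-only sketch using the standard syscall and runtime packages -- not Prometheus code, and the 90% threshold is arbitrary):

```go
package main

import (
	"fmt"
	"runtime"
	"syscall"
)

// nearLimit reports whether heap usage is within 10% of the limit.
func nearLimit(used, limit uint64) bool {
	return float64(used) > 0.9*float64(limit)
}

func main() {
	// RLIMIT_AS is the limit that `ulimit -v` sets.
	var rl syscall.Rlimit
	if err := syscall.Getrlimit(syscall.RLIMIT_AS, &rl); err != nil {
		fmt.Println("getrlimit:", err)
		return
	}

	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)

	if rl.Cur == ^uint64(0) { // RLIM_INFINITY on Linux
		fmt.Println("vm limit: unlimited, nothing to observe")
		return
	}
	fmt.Printf("vm limit %d bytes, heap in use %d bytes\n", rl.Cur, ms.HeapAlloc)
	if nearLimit(ms.HeapAlloc, rl.Cur) {
		fmt.Println("close to the limit: a good point to flush the head")
	}
}
```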

Many thanks,
Vik
--
My other sig is hilarious

Viktor Radnai

Jul 1, 2020, 11:39:40 AM
to Ben Kochie, Prometheus Users
Hi Ben,

We are running 2.18.1 -- I will upgrade to 2.19.2 and see if this solves the problem. I currently have one of the two replicas in production crashlooping so I'll try to roll this out in the next few hours and report back.

Thanks,
Vik

Viktor Radnai

Jul 1, 2020, 1:55:48 PM
to Ben Kochie, Prometheus Users
Hi again Ben,

Unfortunately upgrading to 2.19.2 does not solve the startup issue. Prometheus gets OOMKilled before even starting to parse the last 25 segments, which represent the last 50 minutes' worth of data. Based on this, the estimated memory requirement should be somewhere between 60 and 70 GB, but the worker node only has 52 GB. The other Prometheus pod currently consumes 7.7 GB.

The left of the graph is 2.18.1, the right is 2.19.2. I inadvertently reinstated a previously set 40GB memory limit and updated the replicaset to increase it back to 50GB -- this is the reason for the second Prometheus restart and the slightly higher plateau for the last two OOMs.

Unless there is a way to move some WAL segments out and then restore them later, I'll try deleting the last 50 minutes' worth of segments to get the pod to come up.

Thanks,
Vik
[Attachment: Screenshot from 2020-07-01 18-42-54.png]

Julien Pivotto

Jul 1, 2020, 2:08:09 PM
to Viktor Radnai, Ben Kochie, Prometheus Users
Once 2.19 is actually running, it will create the mmapped head chunks, which will improve this.

I agree that starting 2.19 with a WAL written by 2.18 won't make a difference.

Viktor Radnai

Jul 1, 2020, 2:11:02 PM
to Julien Pivotto, Ben Kochie, Prometheus Users
Hi Julien,

Thanks for clarifying that. In that case I'll see if the issue will recur with 2.19.2 in the next few weeks.

Vik

Viktor Radnai

Jul 8, 2020, 2:21:37 PM
to Julien Pivotto, Ben Kochie, Prometheus Users
Hi Ben, Julien and all,

To follow up on my issue from last week, the OOM loop does occur even with Prometheus 2.19.2.

This time around the instance has just enough memory to complete WAL replay, but it OOMs immediately after that; this could be an improvement or just a coincidence. The WAL folder is about 16 GB and the OOM occurs at around 43 GB (due to the Kubernetes worker running out of memory). Is there anything else I could try?

Thanks,
Vik

Ben Kochie

Jul 8, 2020, 3:06:47 PM
to Viktor Radnai, Julien Pivotto, Prometheus Users
* Get a bigger server
* Reduce the number of metrics you collect
* Shard your server
Probably some combination of all of these.
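
For sharding, the usual pattern is hashmod relabelling, so that each shard keeps a deterministic subset of the same discovered targets. For example, with two shards:

```yaml
scrape_configs:
  - job_name: node
    # ...identical service discovery config on every shard...
    relabel_configs:
      - source_labels: [__address__]
        modulus: 2            # total number of shards
        target_label: __tmp_hash
        action: hashmod
      - source_labels: [__tmp_hash]
        regex: "0"            # this shard's index (0 or 1)
        action: keep
```

Run one Prometheus per shard, changing only the `regex` to that shard's index.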

Viktor Radnai

Jul 8, 2020, 6:14:49 PM
to Ben Kochie, Prometheus Users
Thanks Ben, I think sharding may be the way to go, most likely using Thanos.

I would like to continue offering a single interface for all of production. On the few hundred VMs we scrape (which cause our problem), we only run the node and process exporters. I may stop collecting some metrics, but probably not enough to make a difference. Perhaps there are too many labels on the data (we collect most GCE labels and metadata, and the VMs are relatively short-lived).

I have already doubled the Kubernetes node size, and it would be wasteful to do so again to satisfy a memory spike that only occurs once or twice a week, for up to an hour at a time.

Anyway, if there isn't a feature request open already to limit Prometheus's memory usage on startup, then I would like to open one. What I have in mind is being able to set a limit (similar to Java's -Xmx flag) that closely matches the memory limit set on the Prometheus pod. If Prometheus could stay below this limit during startup then all would be well: it could be scheduled efficiently by Kubernetes, or the user could allocate an appropriately sized VM. If Prometheus exceeds this limit and gets OOMKilled, or exits with an error after startup while collecting metrics, then the limit is too small. This would be the ideal solution, but Matthias's proposal would probably help as well.

I understand that adding more memory would solve the problem, but an infinite OOM loop during startup, because recovering the WAL uses approximately 3-5 times the steady-state memory, is difficult to accommodate. It really leaves two choices: (1) allocating much more RAM than needed, or (2) accepting the loss of the WAL and the last 6 hours of metrics every time this issue occurs. The same problem would also occur if I ran Prometheus inside VMs; I would need a VM that costs 4 times as much as necessary for normal operation. At least on Kubernetes the excess capacity may be used by the rest of the cluster, although we already run a smaller number of larger nodes than would be ideal for our workload, just because of Prometheus.

Would this feature request make sense? Is it even remotely feasible?

Many thanks,
Vik

Ben Kochie

Jul 9, 2020, 2:25:45 AM
to Viktor Radnai, Prometheus Users
There are no plans to add any memory limit flags like this, because it makes no sense. Prometheus memory allocation is already minimized as much as possible. Adding a memory limit would make Prometheus not function.

What we do plan are ways to minimize the memory necessary to do the work. For example, in 2.19 Prometheus added mmap for completed head chunks. For my production setup, it reduced memory needs by quite a lot.

But if you are scraping a huge amount of data, there is nothing we can do about it. For example, Prometheus needs to maintain an in-memory index of all target metrics and labels. You haven't mentioned what your prometheus_tsdb_head_series values are, or any details about the scale of metrics you're working with.