Prometheus memory increases after restarting k8s target

Vu Nguyen

Oct 31, 2023, 11:07:24 AM
to Prometheus Users
We have Prometheus v2.47.1 deployed on k8s; scraping 500k time series from a single target (*)

When we restart the target, the number of time series in the HEAD block jumps to 1M [1], and Prometheus memory increases from an average of 2.5Gi to 3Gi. After leaving Prometheus running for a few WAL truncation cycles, the memory still does not go back to the level before the target restart, even though the number of time series in the HEAD block returns to 500K.

If I trigger another target restart, the memory keeps going up. Here is the graph:

Could you please help us understand why the memory does not fall back to the initial level (*) from before we restart/upgrade the target?

[1] A restarted k8s pod comes up with a new IP and therefore a new instance label value, so a new set of 500K time series is generated.
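
A query like the following (a sketch; it assumes Prometheus scrapes its own metrics under job="prometheus") tracks the head series count and shows the jump from 500K to 1M after the restart:

    # Active series currently in the TSDB head block
    prometheus_tsdb_head_series{job="prometheus"}

    # Rate of new series creation; spikes when the instance label changes on restart
    rate(prometheus_tsdb_head_series_created_total{job="prometheus"}[5m])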
[Attachment: ask_prometheus_user_memory_increase_after_target_restart_upgrade_v1.jpg]

Vu Nguyen

Nov 1, 2023, 12:24:23 AM
to Prometheus Users
After leaving the deployment running for a while after the 3rd restart of the target (6 rounds of WAL truncation), the memory has gone up to 3.7Gi, compared to 2.5Gi before the restarts. I guess there must be something that Prometheus holds on to in this upgrade/restart scenario.

[Attachment: ask_prometheus_user_memory_increase_after_target_restart_upgrade_v2.jpg]

Vu Nguyen

Nov 3, 2023, 7:22:57 AM
to Prometheus Users
If we clean out all data under /wal and then restart Prometheus, the memory comes back to the low point it was at before triggering the target restart. But we don't want to apply that trick, as we could lose a 3h span of data.

Bryan Boreham

Nov 6, 2023, 10:45:32 AM
to Prometheus Users

I didn't follow your description of the symptoms:

> the memory goes up to 3.7Gi compared to 2.5Gi

In your picture I see spikes at over 5Gi. The spikes are every 2 hours, which would tie in with head compactions.
If you state what timezone your charts are in, or better, show them in UTC, we could be more sure.

Note that working set and RSS are Linux's estimate of what the process is using; they are not concrete enough to reason from.
I suggest you add go_memstats_next_gc_bytes to your chart; this is tied to what the program is actually referencing.
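
For example, all three of these on one chart (a sketch; the selectors are assumptions about your setup, and the last one needs cAdvisor/kubelet metrics):

    # Go runtime heap-growth target - tied to what the program actually references
    go_memstats_next_gc_bytes{job="prometheus"}

    # RSS as reported by the Prometheus process itself
    process_resident_memory_bytes{job="prometheus"}

    # Working set as reported by cAdvisor (assuming those metrics are scraped)
    container_memory_working_set_bytes{container="prometheus"}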

A Go heap profile is even more concrete and detailed. See here.

Bryan

Vu Nguyen

Nov 23, 2023, 10:40:52 PM
to Prometheus Users
Hi Bryan,

I managed to reproduce the problem and captured the data as you suggested.

First, here are the graphs in UTC timezone:

[Attachment: prometheus_latest_memory_increase_after_upgrade_v3.png]

and for the heap profiles, please have a look at the attachments.

Thank you for your support.
[Attachment: heap_after.pprof.gz]
[Attachment: heap_before.pprof.gz]

Bryan Boreham

Nov 29, 2023, 1:13:13 PM
to Prometheus Users
Thanks for sending more details and profiles.

'heap_before.pprof' shows 1264MB in use and 'heap_after.pprof' shows 1273MB.
There are no material differences; the 'after' one has more memory used to track series removed after head compaction.
There are about 500,000 series objects in both profiles.

I am confused why nothing shows up as allocated during WAL reading - are you still deleting the WAL?

The memory visible in heap profiles is what remains after garbage collection, while go_memstats_next_gc_bytes is the target after heap growth, which defaults to 100% growth, i.e. that metric should be roughly twice the amount in the profile.
So either you picked very unlucky times to grab the profiles, or something else is inconsistent.
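
As a cross-check (a sketch, assuming the default GOGC=100 and a job="prometheus" self-scrape), the heap-growth target at the time each profile was taken should be roughly double the in-use bytes the profile reports, i.e. around 2.5GiB here:

    # With GOGC=100 this sits at roughly 2x the live heap,
    # i.e. about 2 x 1264MB ~ 2.5GiB for the profiles above
    go_memstats_next_gc_bytes{job="prometheus"}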

So, sorry, but I cannot tie what these profiles say back to the symptom you described.

Regards,

Bryan

Vu Nguyen

Nov 30, 2023, 2:20:20 AM
to Prometheus Users
Thank you very much for your support.

> are you still deleting the WAL?

No, I did not delete the WAL at all. What I did was restart a pod that exposes 500K time series.

> So either you picked very unlucky times to grab the profiles, or something else is inconsistent.
Do you have any suggestion on when the heap profiles should be captured? E.g. one right before restarting the target and a second one 6h after the target restarts?

Do you need any logs, metrics, or anything else that could help spot the issue more easily?

Regards, Vu

Bryan Boreham

Nov 30, 2023, 6:16:48 AM
to Vu Nguyen, Prometheus Users


> On 30 Nov 2023, at 07:20, Vu Nguyen <win...@gmail.com> wrote:
>
> 
> Thank you very much for your support.
>
> > are you still deleting the WAL?
>
> No, I did not delete the WAL at all. What I did was restart a pod that exposes 500K time series.
>

Ah, if any label is different (e.g. the pod name), then that creates new series.
So this would be consistent with the number of series bouncing up to 1,000,000.
Prometheus only clears stale series out of the head after compaction, so you see series and memory go up for a couple of hours and then come down again.
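
You can watch that churn directly with queries like these (a sketch, assuming Prometheus scrapes its own TSDB metrics under job="prometheus"):

    # Active series in the head: doubles after the restart, drops back after compaction
    prometheus_tsdb_head_series{job="prometheus"}

    # Stale series removed from the head when it is truncated
    rate(prometheus_tsdb_head_series_removed_total{job="prometheus"}[10m])

    # Head truncations themselves, roughly every 2 hours
    increase(prometheus_tsdb_head_truncations_total{job="prometheus"}[2h])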

In other words this is standard behaviour.

Bryan


Vu Nguyen

Nov 30, 2023, 8:30:44 AM
to Bryan Boreham, Prometheus Users
I can see the number of time series is truncated to half, from 1M back to 500K, but the memory does not go back to the level before the target got restarted (1).

If I restart the target a second time, the number of time series jumps up to 1M again and eventually drops back to 500K after a few HEAD truncation cycles, but the memory could now be even higher than in case (1) above.

What confuses me here is that Prometheus memory does not return to the level from before restarting the target, even though the number of time series in the HEAD goes back to 500K.

Ben Kochie

Nov 30, 2023, 8:50:04 AM
to Vu Nguyen, Bryan Boreham, Prometheus Users
It is. You can clearly see in your graphs that the Go memstats goes back down to the prior level.

Go is a garbage-collected language; memory use is going to fluctuate over time as Prometheus operates and GC happens. It's not an exact value and never will be.

Memory use also depends on the queries that are run, since each query requires some memory allocation. If your users have dashboards open with auto-refresh, this will increase memory use.

There are GC options like `GOGC` that you can tune, but of course this will impact how often GC is run, which could negatively impact performance.
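
If you do experiment with that, you can watch the effect on collection frequency with the standard Go runtime metrics, e.g. (a sketch; the job selector is an assumption):

    # Garbage collections per second; lowering GOGC makes this go up
    rate(go_gc_duration_seconds_count{job="prometheus"}[5m])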


Vu Nguyen

Dec 4, 2023, 5:14:27 AM
to Prometheus Users
Hi Ben, Bryan

> Go is a garbage-collected language; memory use is going to fluctuate over time as Prometheus operates and GC happens. It's not an exact value and never will be.

We don't expect Prometheus memory to be *exactly* the same as the value before the restart, but we are confused about why the memory keeps going up, e.g. from 1.8G on average to 2.3G, even with the same number of time series in the HEAD block.

In the graphs I shared above, the time series in the HEAD block were truncated back to the original value of 500K around 09:00; the memory did come down from 3G to 2G at that time, but shortly after it went up to ~2.8G even though there was no change in the number of time series being scraped.

I did not set up dashboards with automatic refresh, and I think this issue is always reproducible.

Regards, Vu

Bryan Boreham

Dec 4, 2023, 5:51:45 AM
to Prometheus Users
>  why the memory keeps going up

In your pictures from 24th November, I see mostly flat lines across the chart.
"keeps going up" would be more like a slope.

> e.g. from 1.8G on average to 2.3G

I think you're talking about RSS here? It helps to be very specific - name the metric, give the time.

RSS is Linux's guess of how much memory your process is using. It's not something you can make great deductions from without other information.
The Go heap target is much more concrete.
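
Comparing the two directly (a sketch; the job selector is an assumption) shows how much of the RSS the Go heap target does not account for:

    # RSS minus the Go heap-growth target; the remainder is things like
    # memory-mapped chunks, fragmentation, and memory not yet returned to the OS
    process_resident_memory_bytes{job="prometheus"}
      - go_memstats_next_gc_bytes{job="prometheus"}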

Bryan