High Memory Usage with Prometheus WAL Despite --enable-feature=memory-snapshot-on-shutdown

Bhanu Prakash

Nov 7, 2024, 3:53:51 AM
to Prometheus Users

Hello Prometheus Community,

I’m encountering a memory usage issue with my Prometheus server, particularly during and after startup, and I’m hoping to get some insights on optimizing it.

Problem: Upon startup, my Prometheus instance consumes a large amount of memory, primarily due to WAL (Write-Ahead Log) replay. To address this, I enabled --enable-feature=memory-snapshot-on-shutdown, expecting it to reduce the startup memory spike by eliminating the need for a full WAL replay. However, I'm still seeing memory usage spike to around 5 GB on startup, and once started, Prometheus continues to hold on to that memory without releasing it back to the system.
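
For context, the feature is enabled via a startup flag along these lines (the config path, storage path, and retention value shown here are illustrative):

prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --storage.tsdb.retention.time=15d \
  --enable-feature=memory-snapshot-on-shutdown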

Is there a recommended way to configure Prometheus to release memory post-startup?
Are there additional configurations or optimizations for large WAL files or memory management that could help?

Any guidance or suggestions would be greatly appreciated!

Thank you,
BhanuPrakash.

Ben Kochie

Nov 7, 2024, 3:57:48 AM
to Bhanu Prakash, Prometheus Users
I would recommend getting a heap snapshot and posting it to https://pprof.me.


Also including http://localhost:9090/tsdb-status would help.
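
If it helps, something along these lines should capture a heap profile suitable for uploading (assuming the default listen address, and that the Go toolchain is available if you want a local look first):

# Grab a gzipped heap profile from Prometheus' built-in pprof endpoint
curl -s http://localhost:9090/debug/pprof/heap -o heap.pb.gz

# Optional: inspect the top allocators locally before uploading to pprof.me
go tool pprof -top heap.pb.gz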


Bhanu Prakash

Nov 7, 2024, 4:43:15 AM
to Prometheus Users
Hi Ben,

curl http://10.224.0.122:9090/api/v1/status/tsdb
{"status":"success","data":{
  "headStats":{"numSeries":162587,"numLabelPairs":8046,"chunkCount":199721,"minTime":1730966400000,"maxTime":1730971996605},
  "seriesCountByMetricName":[{"name":"apiserver_request_duration_seconds_bucket","value":6624},{"name":"etcd_request_duration_seconds_bucket","value":5304},{"name":"apiserver_request_sli_duration_seconds_bucket","value":4400},{"name":"container_tasks_state","value":3280},{"name":"kube_pod_status_reason","value":2930},{"name":"kube_pod_status_phase","value":2930},{"name":"container_memory_failures_total","value":2624},{"name":"kubelet_runtime_operations_duration_seconds_bucket","value":2235},{"name":"apiserver_request_body_size_bytes_bucket","value":2208},{"name":"kube_replicaset_status_observed_generation","value":2060}],
  "labelValueCountByLabelName":[{"name":"__name__","value":1250},{"name":"name","value":692},{"name":"id","value":608},{"name":"replicaset","value":517},{"name":"mountpoint","value":450},{"name":"lease","value":310},{"name":"lease_holder","value":309},{"name":"uid","value":284},{"name":"container_id","value":272},{"name":"le","value":252}],
  "memoryInBytesByLabelName":[{"name":"id","value":100200},{"name":"mountpoint","value":52087},{"name":"__name__","value":44903},{"name":"name","value":33040},{"name":"container_id","value":20944},{"name":"replicaset","value":15060},{"name":"lease_holder","value":10996},{"name":"lease","value":10506},{"name":"uid","value":10224},{"name":"image_id","value":9545}],
  "seriesCountByLabelValuePair":[{"name":"app_kubernetes_io_managed_by=Helm","value":78324},{"name":"app_kubernetes_io_instance=loki-stack","value":78324},{"name":"job=kubernetes-service-endpoints","value":78324},{"name":"app_kubernetes_io_component=metrics","value":78324},{"name":"app_kubernetes_io_part_of=kube-state-metrics","value":68200},{"name":"app_kubernetes_io_version=2.8.0","value":68200},{"name":"helm_sh_chart=kube-state-metrics-4.30.0","value":68200},{"name":"app_kubernetes_io_name=kube-state-metrics","value":68200},{"name":"service=loki-stack-kube-state-metrics","value":66492},{"name":"node=aks-grafana-14132910-vmss000025","value":66269}]
}}
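
For readability, the head-block stats alone can be pulled out with jq (assuming jq is available on the host):

# Show just the head block statistics from the TSDB status endpoint
curl -s http://10.224.0.122:9090/api/v1/status/tsdb | jq '.data.headStats'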

Thanks,
Bhanu

Ben Kochie

Nov 7, 2024, 5:43:49 AM
to Bhanu Prakash, Prometheus Users
That only shows an in-use memory of 573 MB.


Can you post a graph of process_resident_memory_bytes{job="prometheus"}?

What metric are you using to report?
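
For a quick point-in-time check of both numbers, queries like these against the HTTP API should work (assuming the default port and that Prometheus scrapes itself under job="prometheus"):

# Resident memory of the Prometheus process as seen by the kernel
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=process_resident_memory_bytes{job="prometheus"}'

# Heap currently in use by the Go runtime (often well below resident memory)
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=go_memstats_heap_inuse_bytes{job="prometheus"}'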

Bhanu Prakash

Nov 7, 2024, 6:48:11 AM
to Prometheus Users

Hi Ben,

I've attached a screenshot of the graph for process_resident_memory_bytes{job="prometheus"}.

Using the top command, I'm seeing 1000 MB memory usage, as shown in the screenshot below.

The graph indicates that Prometheus is using 573 MB, but the pod is showing 1009 MB of memory usage. I would expect memory to be released back after use, but it has been holding steady at around 1000 MB for the past 4 hours without decreasing.

We're using this pod in AKS specifically to store node and pod metrics.

I'm not exactly sure what's causing this behavior inside the pod.
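
For reference, the Go runtime metrics below can show how much heap is sitting idle versus already returned to the operating system, which may account for the gap between the two numbers (assuming the self-scrape job label "prometheus"):

# Heap held by the Go runtime but not currently in use
curl -s -G http://10.224.0.122:9090/api/v1/query \
  --data-urlencode 'query=go_memstats_heap_idle_bytes{job="prometheus"}'

# Portion of that idle heap already released back to the operating system
curl -s -G http://10.224.0.122:9090/api/v1/query \
  --data-urlencode 'query=go_memstats_heap_released_bytes{job="prometheus"}'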

Thanks,

Bhanu.

Screenshot (113).png
Screenshot (112).png