High Memory Usage with Prometheus WAL Despite --enable-feature=memory-snapshot-on-shutdown

Bhanu Prakash

Nov 7, 2024, 3:53:51 AM
to Prometheus Users

Hello Prometheus Community,

I’m encountering a memory usage issue with my Prometheus server, particularly during and after startup, and I’m hoping to get some insights on optimizing it.

Problem: Upon startup, my Prometheus instance consumes a large amount of memory, primarily due to WAL (Write-Ahead Log) replay. To address this, I enabled --enable-feature=memory-snapshot-on-shutdown, expecting it to reduce the startup memory spike by eliminating the need for a full WAL replay. However, I'm still seeing memory usage spike to around 5 GB on startup, and once started, Prometheus continues to hold on to that memory without releasing it back to the system.
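
For context, the feature is enabled via a startup flag along these lines (the config path, storage path, and retention value shown here are illustrative):

prometheus \
  --config.file=/etc/prometheus/prometheus.yml \
  --storage.tsdb.path=/prometheus \
  --storage.tsdb.retention.time=15d \
  --enable-feature=memory-snapshot-on-shutdown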

Is there a recommended way to configure Prometheus to release memory post-startup?
Are there additional configurations or optimizations for large WAL files or memory management that could help?

Any guidance or suggestions would be greatly appreciated!

Thank you,
BhanuPrakash.

Ben Kochie

Nov 7, 2024, 3:57:48 AM
to Bhanu Prakash, Prometheus Users
I would recommend getting a heap snapshot and posting it to https://pprof.me.


Also including http://localhost:9090/tsdb-status would help.
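
If it helps, something along these lines should capture a heap profile suitable for uploading (assuming the default listen address, and that the Go toolchain is available if you want a local look first):

# Grab a gzipped heap profile from Prometheus' built-in pprof endpoint
curl -s http://localhost:9090/debug/pprof/heap -o heap.pb.gz

# Optional: inspect the top allocators locally before uploading to pprof.me
go tool pprof -top heap.pb.gz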


Bhanu Prakash

Nov 7, 2024, 4:43:15 AM
to Prometheus Users
Hi Ben,

curl http://10.224.0.122:9090/api/v1/status/tsdb
{"status":"success","data":{
  "headStats":{"numSeries":162587,"numLabelPairs":8046,"chunkCount":199721,"minTime":1730966400000,"maxTime":1730971996605},
  "seriesCountByMetricName":[{"name":"apiserver_request_duration_seconds_bucket","value":6624},{"name":"etcd_request_duration_seconds_bucket","value":5304},{"name":"apiserver_request_sli_duration_seconds_bucket","value":4400},{"name":"container_tasks_state","value":3280},{"name":"kube_pod_status_reason","value":2930},{"name":"kube_pod_status_phase","value":2930},{"name":"container_memory_failures_total","value":2624},{"name":"kubelet_runtime_operations_duration_seconds_bucket","value":2235},{"name":"apiserver_request_body_size_bytes_bucket","value":2208},{"name":"kube_replicaset_status_observed_generation","value":2060}],
  "labelValueCountByLabelName":[{"name":"__name__","value":1250},{"name":"name","value":692},{"name":"id","value":608},{"name":"replicaset","value":517},{"name":"mountpoint","value":450},{"name":"lease","value":310},{"name":"lease_holder","value":309},{"name":"uid","value":284},{"name":"container_id","value":272},{"name":"le","value":252}],
  "memoryInBytesByLabelName":[{"name":"id","value":100200},{"name":"mountpoint","value":52087},{"name":"__name__","value":44903},{"name":"name","value":33040},{"name":"container_id","value":20944},{"name":"replicaset","value":15060},{"name":"lease_holder","value":10996},{"name":"lease","value":10506},{"name":"uid","value":10224},{"name":"image_id","value":9545}],
  "seriesCountByLabelValuePair":[{"name":"app_kubernetes_io_managed_by=Helm","value":78324},{"name":"app_kubernetes_io_instance=loki-stack","value":78324},{"name":"job=kubernetes-service-endpoints","value":78324},{"name":"app_kubernetes_io_component=metrics","value":78324},{"name":"app_kubernetes_io_part_of=kube-state-metrics","value":68200},{"name":"app_kubernetes_io_version=2.8.0","value":68200},{"name":"helm_sh_chart=kube-state-metrics-4.30.0","value":68200},{"name":"app_kubernetes_io_name=kube-state-metrics","value":68200},{"name":"service=loki-stack-kube-state-metrics","value":66492},{"name":"node=aks-grafana-14132910-vmss000025","value":66269}]
}}
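
For readability, the head-block stats alone can be pulled out with jq (assuming jq is available on the host):

# Show just the head block statistics from the TSDB status endpoint
curl -s http://10.224.0.122:9090/api/v1/status/tsdb | jq '.data.headStats'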

Thanks,
Bhanu

Ben Kochie

Nov 7, 2024, 5:43:49 AM
to Bhanu Prakash, Prometheus Users
That only shows an in-use memory of 573 MB.


Can you post a graph of process_resident_memory_bytes{job="prometheus"}?

What metric are you using to report?
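
For a quick point-in-time check of both numbers, queries like these against the HTTP API should work (assuming the default port and that Prometheus scrapes itself under job="prometheus"):

# Resident memory of the Prometheus process as seen by the kernel
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=process_resident_memory_bytes{job="prometheus"}'

# Heap currently in use by the Go runtime (often well below resident memory)
curl -s -G http://localhost:9090/api/v1/query \
  --data-urlencode 'query=go_memstats_heap_inuse_bytes{job="prometheus"}'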

Bhanu Prakash

Nov 7, 2024, 6:48:11 AM
to Prometheus Users

Hi Ben,

I've attached a screenshot of the graph for process_resident_memory_bytes{job="prometheus"}.

Using the top command, I'm seeing 1000 MB memory usage, as shown in the screenshot below.

The graph indicates that Prometheus is using 573 MB, but the pod is showing 1009 MB of memory usage. I would expect memory to be released back after use, but it has been holding steady at around 1000 MB for the past 4 hours without decreasing.

We're using this pod in AKS specifically to store node and pod metrics.

I'm not exactly sure what's causing this behavior inside the pod.
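
For reference, the Go runtime metrics below can show how much heap is sitting idle versus already returned to the operating system, which may account for the gap between the two numbers (assuming the self-scrape job label "prometheus"):

# Heap held by the Go runtime but not currently in use
curl -s -G http://10.224.0.122:9090/api/v1/query \
  --data-urlencode 'query=go_memstats_heap_idle_bytes{job="prometheus"}'

# Portion of that idle heap already released back to the operating system
curl -s -G http://10.224.0.122:9090/api/v1/query \
  --data-urlencode 'query=go_memstats_heap_released_bytes{job="prometheus"}'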

Thanks,

Bhanu.

Screenshot (113).png
Screenshot (112).png