Prometheus using a large amount of memory when managing storage.

Chad Sesvold

Oct 15, 2021, 10:02:33 AM
to Prometheus Users
We have been running Prometheus for 3 or 4 years now.  In production we have 6 months of retention and in non-production we have 45 days.  In production we are capturing 1.8 million metrics from 2,300 targets; in non-prod we are capturing 800 K metrics from 2,200 targets.  The configuration is the same between the environments.  Both the production and non-prod servers have 4 CPUs and 24 GB of memory.  Production is using 160% CPU and 5.5 GB of memory.  Non-prod is running out of memory even after increasing the server memory to 64 GB.  This seemed to happen after patching non-prod to 2.30.3.  Production is on 2.30.0.  We are using NAS storage: 500 GB in non-prod and 4 TB in production.
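
For reference, retention is set with the usual flag in both environments (values paraphrased here, with 6 months written as 180d):

--storage.tsdb.retention.time=180d   (production)
--storage.tsdb.retention.time=45d    (non-production)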

I have been doing several tests in non-production to isolate the issue and see whether it is caused by the number of targets or by the storage.  I have tried reducing the targets and the retention time.  The results seem to be the same between 2.30.3 and 2.30.0.

prometheus-2.30.3
53 MB  - no targets, clean storage
41 GB  - no targets, existing storage history
5.5 GB - targets, clean storage
42 GB  - targets, existing storage history

prometheus-2.30.0
2 MB   - no targets, clean storage
50 GB  - no targets, existing storage history
4 GB   - targets, clean storage
47 GB  - targets, existing storage history

With less retention than production, non-prod with no targets is using 10x the memory of production, even on the same hardware.  After adding targets, even with no history, memory usage in non-prod increases until the OS kills Prometheus due to out of memory.  I have increased the server from 24 GB to 32 GB to 64 GB and Prometheus memory never stabilizes.  I have tried removing targets and that does seem to help.

There appears to be some sort of memory leak, but it is never aliens until it is aliens.  We are scraping most metrics every 15 seconds in production and have changed non-prod to every 30 seconds with the same results.  We are using Consul for service discovery.  I am not sure what else to look at.  Any suggestions on what to look at next?

This is my first time posting, so I figured I would ask the community rather than submitting a bug on GitHub.


Brian Candler

Oct 15, 2021, 10:50:24 AM
to Prometheus Users
Look at Status > TSDB Status from the web interface of both systems.  In particular, what does the first entry ("Head Stats") show for each system?

Do you have any idea of series churn, i.e. how many new series are being created and deleted per hour?  (Although if you're scraping a subset of the same targets on non-prod, then it shouldn't be any worse)
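
A rough way to measure that from Prometheus's own metrics (assuming the server scrapes itself, which the default config does) is something like:

increase(prometheus_tsdb_head_series_created_total[1h])
increase(prometheus_tsdb_head_series_removed_total[1h])

run against each system and compared.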

Prometheus exposes stats about its internal memory usage (go_memstats_*), can you see any difference between the two systems here?
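
For example (assuming each server scrapes itself under a job called "prometheus"):

go_memstats_alloc_bytes{job="prometheus"}
go_memstats_heap_inuse_bytes{job="prometheus"}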

Are you hitting the non-production system with queries?  If so, can you try not querying it for a while?

Otherwise, you can try replicating the production system *exactly* in the non-production one: same binaries, same configuration, same retention.  If it works differently then it's something about the environment.

I observe that NAS is *not* recommended as a storage backend for prometheus.  See https://prometheus.io/docs/prometheus/latest/storage/#operational-aspects (scroll to the yellow "CAUTION" box)

Chad Sesvold

Oct 15, 2021, 3:51:47 PM
to Prometheus Users
At this point I am the only one running queries.  When I have no targets defined the memory seems to be flat.

When I changed the following in non-prod it seemed to stabilize the memory usage.

--storage.tsdb.max-block-duration 15d
--storage.tsdb.min-block-duration 1h

I will try copying the binaries and configs from prod to non-prod.  

I am planning on looking at Thanos instead of an NFS mount.  That is going to take some time.

I did add some file targets back in non-prod, for a total of 900 checks, and Prometheus leveled out at about 22 GB.

Prod - TSDB Status - Head Stats
Number of Series=2 million
Number of Chunks=11 million
Number of Label Pairs=59k
Current Min Time=2021-10-15T16:00:00.006Z (1634313600006)
Current Max Time=2021-10-15T18:51:07.414Z (1634323867414)

Non-Prod - TSDB Status - Head Stats
Number of Series=82k
Number of Chunks=400k
Number of Label Pairs=2k
Current Min Time=2021-10-15T18:05:27.705Z (1634321127705)
Current Max Time=2021-10-15T18:50:58.200Z (1634323858200)

Prod 
Showing nodes accounting for 3939.82MB, 73.70% of 5345.97MB total
Dropped 292 nodes (cum <= 26.73MB)

Non-prod
Showing nodes accounting for 1.43GB, 91.54% of 1.56GB total
Dropped 133 nodes (cum <= 0.01GB)

Brian Candler

Oct 16, 2021, 4:36:36 AM
to Prometheus Users
Is there a specific reason why you're tweaking the TSDB block durations?  That is, did you observe some problem with the defaults?  Otherwise I'd suggest you just run with defaults.

In any case, if the problem you're debugging is discrepancies between prod and non-prod, you should be running with the same flags in both.

On Friday, 15 October 2021 at 20:51:47 UTC+1 tass...@gmail.com wrote:
I did add some file targets back in non-prod, for a total of 900 checks, and Prometheus leveled out at about 22 GB.

Not sure what you mean by "900 checks" here.  Do you mean targets? Metrics?  Alerting rules?

And how are you determining the total RAM usage? (If you're getting OOM killer messages then you're definitely hitting the RAM limit.  It's worth mentioning that older versions of go tended not to hand back memory to the OS as aggressively, but they did mark the pages as reclaimable and the OS would reclaim these when under memory pressure.  But recent prometheus binaries should be built with a recent version of go - assuming you're using the official release binaries and not ones you've compiled yourself)
 
Prod - TSDB Status - Head Stats
Number of Series=2 million
Number of Chunks=11 million
Number of Label Pairs=59k
Current Min Time=2021-10-15T16:00:00.006Z (1634313600006)
Current Max Time=2021-10-15T18:51:07.414Z (1634323867414)

Non-Prod - TSDB Status - Head Stats
Number of Series=82k
Number of Chunks=400k
Number of Label Pairs=2k
Current Min Time=2021-10-15T18:05:27.705Z (1634321127705)
Current Max Time=2021-10-15T18:50:58.200Z (1634323858200)


That suggests the non-prod should be using a lot less RAM - the number of head chunks in particular.
 
Prod 
Showing nodes accounting for 3939.82MB, 73.70% of 5345.97MB total
Dropped 292 nodes (cum <= 26.73MB)

Non-prod
Showing nodes accounting for 1.43GB, 91.54% of 1.56GB total
Dropped 133 nodes (cum <= 0.01GB)

What do you mean by "nodes" here?  And what are "dropped nodes"?

I'm looking at prometheus 2.29.2 here, so maybe there are some new stats in 2.30 that I can't see.

Ben Kochie

Oct 16, 2021, 5:47:32 AM
to Chad Sesvold, Prometheus Users
On Fri, Oct 15, 2021 at 9:51 PM Chad Sesvold <tass...@gmail.com> wrote:
At this point I am the only one running queries.  When I have no targets defined the memory seems to be flat.

When I changed the following in non-prod it seemed to stabilize the memory usage.

--storage.tsdb.max-block-duration 15d
--storage.tsdb.min-block-duration 1h

These flags will actually make memory use worse. They will generate many more TSDB blocks than normal, which will cause Prometheus to need more memory to manage the indexes. However, that is mostly page cache memory; see my next comment.
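
(From memory, so double-check the docs for your version: the defaults are roughly

--storage.tsdb.min-block-duration=2h
--storage.tsdb.max-block-duration=<10% of the retention time>

and you get them simply by not setting these flags at all.)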
 

I will try copying the binaries and configs from prod to non-prod.  

I am planning on looking at Thanos instead of an NFS mount.  That is going to take some time.

The retention and long-term storage in Prometheus has almost no effect on RSS needed to run Prometheus. Prometheus only needs memory (RSS) to manage the current 2 hours of data. After 2 hours, everything in memory is flushed to disk and mapped-in using a technique called "mmap". This means disk blocks are virtually mapped into memory (VSS). Then the Linux kernel uses page cache to manage what data is loaded. You can have terabytes of data in the TSDB and it only uses a small amount of RSS to manage the mappings.
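
You can see this split for yourself by comparing (assuming the usual self-scrape job name):

process_virtual_memory_bytes{job="prometheus"}
process_resident_memory_bytes{job="prometheus"}

The virtual figure includes all of the mmap'd blocks; the resident figure is what actually counts against your RAM.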

As Brian said, you need to look at go_memstats_alloc_bytes and process_resident_memory_bytes for Prometheus. That will give you a better idea of what is actually being used.

Chad Sesvold

Oct 21, 2021, 11:20:19 AM
to Prometheus Users
So I have been doing a little more testing.  I found that we had some software installed on the non-prod boxes that was causing issues.  We were scraping metrics every 20 seconds.  My guess is that the software was slowing down Prometheus writes and that I had a race condition of some kind: metrics were coming in at a higher rate than they could be written to the file system.  Once we disabled the software things seemed to stabilize, but only after deleting all of the data.

The weird part is that, with the 45 days worth of data, there is still an issue starting Prometheus with no targets.  I am wondering if Prometheus was trying to update or convert the data store after going from 2.30.2 to 2.30.3 (prod is on version 2.30.0).  Then again, I have rolled back and patched so many times that it could have caused issues with the data store.  On top of that, I am using NFS instead of a local file system.

prometheus, version 2.30.3 (branch: HEAD, revision: f29caccc42557f6a8ec30ea9b3c8c089391bd5df)
build user:       root@5cff4265f0e3
build date:       20211005-16:10:52
go version:       go1.17.1
platform:         linux/amd64

There are a couple of quick questions I have before considering this issue resolved.  I know that I can run sort_desc(scrape_duration_seconds) to see how long scrapes are taking, which is helpful for determining scrape intervals.  Is there a metric I can look at to tell if I am having a race condition when Prometheus is writing metrics?  I am thinking there must be an easy way to rule out race conditions.
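
(The kind of thing I have in mind - and I am only guessing that these are the relevant metrics - is watching the TSDB write path, for example:

prometheus_tsdb_wal_fsync_duration_seconds{quantile="0.99"}
rate(prometheus_tsdb_head_samples_appended_total[5m])
rate(prometheus_tsdb_compaction_duration_seconds_sum[1h]) / rate(prometheus_tsdb_compaction_duration_seconds_count[1h])

and seeing whether the fsync and compaction numbers blow up on the slow storage.)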

Chad Sesvold

Oct 21, 2021, 1:14:54 PM
to Prometheus Users

I was just tweaking the TSDB block durations to match prod.  I was reading that we might want to reduce the TSDB block durations to help free up memory.

We are seeing OOM in the system logs.  I am watching memory using the following command.

watch "ps ax -o pcpu,rss,ppid,pid,stime,args | grep 'prometheus/prometheus' | grep -v grep"

I am not sure what is meant by "nodes" either.  I was running the Go profiling tool described in https://source.coveo.com/2021/03/03/prometheus-memory/ and the output came from that tool.

go tool pprof -symbolize=remote -inuse_space https://monitoring.prod.cloud.coveo.com/debug/pprof/heap