Memory usage spike of 20GB, included Grafana snapshots


blakout

Apr 17, 2017, 2:09:04 PM4/17/17
to Prometheus Users
Previously posted: https://groups.google.com/forum/#!topic/prometheus-users/NL8TejT5FH0
Back with questions and hoping for some direction.

A few days ago there was a sudden and permanent increase in memory usage from Prometheus. My initial assumption was that someone had loaded a Grafana dashboard (and left it open in their browser) that has a damn computationally expensive query. However, I figured I'd share the snapshots and see if there are any other possibilities:

- increase of CPU Usage / Load Average
- increase of Memory Usage by 20GB in 10 seconds
- increase of I/O Activity: Page Out

- steady increase of Persistence Urgency (nearing rushed mode)
- steady increase of Chunks to Persist
- increase of Rule Evaluation Duration 
- increase of Bytes Read/Written: xvdf Write (Prometheus /data directory)
- Write increase of Chunk ops: transcode

Some notes:
- I've been working to include Recording Rules for Grafana dashboards (still very much a WIP)
- the difficult part of resolving this: the entire company has access to Grafana, and we have ~a hundred dashboards (which anyone can access/create/edit).

Are there other possible causes, or should I continue to focus on dashboards/recording rules?

Björn Rabenstein

Apr 18, 2017, 11:42:27 AM4/18/17
to blakout, Prometheus Users
On 17 April 2017 at 20:09, blakout <niko...@gmail.com> wrote:
> Some notes:
> - I've been working to include Recording Rules for Grafana dashboards (still
> very much a WIP)
> - difficult part of resolving this: the entire company has access to
> Grafana, and we have ~hundred dashboards (which anyone can
> access/create/edit).
>
> Are there other possible causes, or should I continue to focus on
> dashboards/recording rules?

Incidentally, I have just run into a very similar issue.

The most striking common symptom is that the Go heap size (I assume
that's what you show as `go_memory_bytes` in your dashboard) has much
higher maximums, but the minimums are essentially the same. That means
something is allocating short-lived objects like crazy, taking more
memory than the whole storage layer (which has a baseline of <10GiB in
your case).

This is definitely not a query loading too many chunks into memory.

In my case, the likely culprit is a bunch of very expensive recording
rules that touch an enormous amount of data in an expensive operation
(in particular things like `deriv(...[90d])`). In these operations, a
lot of numbers are put onto the heap. In fact, more data volume is
generated on the heap (as uncompressed floats, which are intermediate
results of the calculation) than the total size of the (compressed)
data storage.

You could test whether my theory also applies to your case by
temporarily removing expensive rules. Or use a longer rule evaluation
interval (e.g. 2m or so) to see if the memory spikes coincide with the
start of an evaluation cycle. (We also start all the rule evaluations
at the same time, which is something that needs fixing. It makes
expensive rule evaluations even worse.)

--
Björn Rabenstein, Engineer
http://soundcloud.com/brabenstein

SoundCloud Ltd. | Rheinsberger Str. 76/77, 10115 Berlin, Germany
Managing Director: Alexander Ljung | Incorporated in England & Wales
with Company No. 6343600 | Local Branch Office | AG Charlottenburg |
HRB 110657B

blakout

Apr 19, 2017, 2:05:21 PM4/19/17
to Prometheus Users, niko...@gmail.com
To test your theory, I temporarily removed all recording/alerting rules from the Prometheus server (thankfully we recently spun up a second server for HA, on which I left the alerting rules active). This was done by modifying `/etc/prometheus/prometheus.yaml` to not read any `rules` files and then reloading the Prometheus service.
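(For reference, the change amounted to something like this in `/etc/prometheus/prometheus.yaml` — the intervals and rule-file paths here are illustrative, not our actual ones:)

```yaml
global:
  scrape_interval: 15s      # illustrative
  evaluation_interval: 15s  # illustrative

# rule_files commented out for the test, so no recording/alerting
# rules are evaluated on this server:
# rule_files:
#   - "/etc/prometheus/rules/*.rules"
```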

This had no noticeable effect on memory usage.

$ ps aux --sort -rss
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
prometh+  8848  261 94.8 62333608 59672924 ?   Ssl  Apr18 2947:09 /usr/local/bin/prometheus -config.file=/etc/prometheus/prometheus.yaml -storage.
root       435  0.0  0.1 147068 97452 ?        Ss    2016  16:24 /lib/systemd/systemd-journald
root     16594  0.0  0.0 925424 25808 ?        Ssl   2016  56:25 dockerd -H unix:///var/run/docker.sock --ip-forward=true --iptables=true --ip-mas
node-ex+ 29419  0.4  0.0 567320 21964 ?        Ssl  Jan23 605:33 /usr/local/bin/node_exporter -collectors.enabled=conntrack,diskstats,entropy,file
consul   10668  0.4  0.0  37396 21344 ?        Ssl   2016 1016:49 /usr/local/bin/consul agent -config-dir /etc/consul
root     16604  0.0  0.0 820316 12040 ?        Ssl   2016   9:43 docker-containerd -l unix:///var/run/docker/libcontainerd/docker-containerd.sock 
root      8537  0.0  0.0  26424 10424 ?        Ssl  Apr18   0:37 /usr/share/filebeat/bin/filebeat -c /etc/filebeat/filebeat.yml -path.home /usr/sh
root      8818  0.0  0.0  17592  7776 ?        Ssl  Apr18   0:45 /bin/blackbox_exporter -config.file=/config.yml

We're still running Prometheus 1.5.2, and I plan to test upgrading to 1.6.0 shortly.


About the `go_memory_bytes` panel from Prometheus Stats: the underlying metric is `go_memstats_alloc_bytes`.
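(A small aside on interpreting that metric: `go_memstats_alloc_bytes` is a gauge of the bytes currently on the heap, while its counter sibling `go_memstats_alloc_bytes_total` accumulates all bytes ever allocated. The short-lived allocation churn Björn describes shows up as a steep counter rate even while the gauge's minimum stays flat. A minimal sketch with made-up sample values:)

```python
# Two hypothetical readings of go_memstats_alloc_bytes_total (a
# cumulative counter), taken 60 seconds apart; their difference over
# the interval is the allocation churn rate.
t0_total = 1_200_000_000_000  # counter reading at time t (made up)
t1_total = 1_230_000_000_000  # counter reading 60s later (made up)
interval_s = 60

churn_bytes_per_s = (t1_total - t0_total) / interval_s
print(f"allocation churn: {churn_bytes_per_s / 1e6:.0f} MB/s")
# → allocation churn: 500 MB/s
```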


Björn Rabenstein

Apr 21, 2017, 6:01:33 PM4/21/17
to blakout, Prometheus Users
On 19 April 2017 at 14:05, blakout <niko...@gmail.com> wrote:
> We're still running Prometheus 1.5.2, and I plan to test upgrading to 1.6.0
> shortly.
>
> Node System Stats (past 1 hour) -
> https://snapshot.raintank.io/dashboard/snapshot/WCLPGSASMBZ8V4fLzzGS8NWwIEA3tCqH
> Prometheus Stats (past 1 hour) -
> https://snapshot.raintank.io/dashboard/snapshot/ebHlOgPksRS56Mfkr44FS6KXZTenpuBn

These stats don't look too weird. 3M time series is already quite a
lot, though. Let's see what 1.6.1 gives you. It could help quite a bit
in your situation.

nsmeds

Apr 24, 2017, 4:04:14 PM4/24/17
to Prometheus Users, niko...@gmail.com
1.6.1 has been running on a couple of instances for the past 4-6 days. You can immediately see a difference in memory usage =) Very happy with the changes. Thank you thank you thank you!

govinda...@gmail.com

Aug 18, 2017, 8:22:51 AM8/18/17
to Prometheus Users, niko...@gmail.com
Hi All,

We are running Prometheus (1.6.2) scraper nodes as Docker containers. These nodes are set to persist data for only 1 hour (storage.local.retention 1h0m0s). A few hours (~6 hrs) after we start a container, memory becomes completely full and the container becomes unresponsive.



CONTAINER           CPU %               MEM USAGE / LIMIT    MEM %               NET I/O             BLOCK I/O           PIDS
d364804446a7        11.59%              11.09 GiB / 12 GiB   92.38%              3.08 GB / 3.83 GB   3.57 GB / 4.2 TB    23

Here are the settings from the /flags endpoint.

Command-Line Flags

alertmanager.notification-queue-capacity                   10000
alertmanager.timeout                                       10s
alertmanager.url
config.file                                                /etc/prometheus/prometheus.yml
log.format                                                 "logger:stderr"
log.level                                                  "info"
query.max-concurrency                                      20
query.staleness-delta                                      5m0s
query.timeout                                              2m0s
storage.local.checkpoint-dirty-series-limit                5000
storage.local.checkpoint-interval                          5m0s
storage.local.chunk-encoding-version                       1
storage.local.dirty                                        false
storage.local.engine                                       persisted
storage.local.index-cache-size.fingerprint-to-metric       10485760
storage.local.index-cache-size.fingerprint-to-timerange    5242880
storage.local.index-cache-size.label-name-to-label-values  10485760
storage.local.index-cache-size.label-pair-to-fingerprints  20971520
storage.local.max-chunks-to-persist                        0
storage.local.memory-chunks                                0
storage.local.num-fingerprint-mutexes                      4096
storage.local.path                                         data
storage.local.pedantic-checks                              false
storage.local.retention                                    1h0m0s
storage.local.series-file-shrink-ratio                     0.1
storage.local.series-sync-strategy                         adaptive
storage.local.target-heap-size                             10737418240
storage.remote.graphite-address
storage.remote.graphite-prefix
storage.remote.graphite-transport
storage.remote.influxdb-url
storage.remote.influxdb.database
storage.remote.influxdb.retention-policy
storage.remote.influxdb.username
storage.remote.opentsdb-url
storage.remote.timeout
version                                                    false
web.console.libraries                                      console_libraries
web.console.templates                                      consoles
web.enable-remote-shutdown                                 false
web.external-url                                           http://d364804446a7:9090/
web.listen-address                                         :9090
web.max-connections                                        512
web.read-timeout                                           30s
web.route-prefix                                           /
web.telemetry-path                                         /metrics
web.user-assets
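One thing I am wondering about: the Prometheus 1.6 storage documentation suggests (if I read it correctly) setting `storage.local.target-heap-size` to roughly 2/3 of available memory, since the process RSS is usually well above the Go heap. Our configured 10 GiB target seems high for a 12 GiB container limit. A quick check (the 2/3 rule of thumb is taken from those docs; please correct me if I misread them):

```python
# Sketch: pick -storage.local.target-heap-size as ~2/3 of the
# container memory limit, leaving headroom for RSS above the Go heap.
GIB = 1024 ** 3
container_limit_bytes = 12 * GIB  # docker memory limit from the stats above
target_heap_bytes = container_limit_bytes * 2 // 3
print(target_heap_bytes)  # vs. the configured 10737418240 (10 GiB)
```

If that rule of thumb applies, lowering the target heap to ~8 GiB might leave enough headroom for the container to stay responsive.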

Any suggestion on how to fix the memory issues? Let me know if any details required. Thanks.

Thanks,
Govind

promethues...@gmail.com

Apr 24, 2018, 10:51:40 PM4/24/18
to Prometheus Users
Have you ever got any answer for this issue?