Memory usage spike of 20GB, included Grafana snapshots


blakout

Apr 17, 2017, 2:09:04 PM4/17/17
to Prometheus Users
Previously posted: https://groups.google.com/forum/#!topic/prometheus-users/NL8TejT5FH0
Back with questions and hoping for some direction.

A few days ago there was a sudden and permanent increase in memory usage from Prometheus. My initial assumption was that someone had loaded a Grafana dashboard (and left it open in their browser) that has a damn computationally expensive query. However, I figured I'd share the snapshots and see if there are any other possibilities:

- increase of CPU Usage / Load Average
- increase of Memory Usage by 20GB in 10 seconds
- increase of I/O Activity: Page Out

- steady increase of Persistence Urgency (nearing rushed mode)
- steady increase of Chunks to Persist
- increase of Rule Evaluation Duration 
- increase of Bytes Read/Written: xvdf Write (Prometheus /data directory)
- Write increase of Chunk ops: transcode

Some notes:
- I've been working to include Recording Rules for Grafana dashboards (still very much a WIP)
- the difficult part of resolving this: the entire company has access to Grafana, and we have ~a hundred dashboards (which anyone can access/create/edit).

Are there other possible causes, or should I continue to focus on dashboards/recording rules?

Björn Rabenstein

Apr 18, 2017, 11:42:27 AM4/18/17
to blakout, Prometheus Users
On 17 April 2017 at 20:09, blakout <niko...@gmail.com> wrote:
> Some notes:
> - I've been working to include Recording Rules for Grafana dashboards (still
> very much a WIP)
> - difficult part of resolving this: the entire company has access to
> Grafana, and we have ~hundred dashboards (which anyone can
> access/create/edit).
>
> Are there other possible causes, or should I continue to focus on
> dashboards/recording rules?

Incidentally, I have just run into a very similar issue.

The most striking common symptom is that the Go heap size (I assume
that's what you show as `go_memory_bytes` in your dashboard) has much
higher maximums, but the minimums are essentially the same. That means
something is allocating short-lived objects like crazy, taking more
memory than the whole storage layer (which has a baseline of <10GiB in
your case).

This is definitely not a query loading too many chunks into memory.

In my case, the likely culprit is a bunch of very expensive recording
rules that touch an enormous amount of data in an expensive operation
(in particular things like `deriv(...[90d])`). In these operations, a
lot of numbers are put onto the heap. In fact, more data volume is
generated on the heap (as uncompressed floats, which are intermediate
results of the calculation) than the total size of the (compressed)
data storage.

You could test whether my theory also applies to your case by
temporarily removing expensive rules. Or use a longer rule evaluation
interval (e.g. 2m or so) to see if the memory spikes coincide with the
start of an evaluation cycle. (We also start all the rule evaluations
at the same time, which is something that needs fixing. It makes
expensive rule evaluations even worse.)

--
Björn Rabenstein, Engineer
http://soundcloud.com/brabenstein

SoundCloud Ltd. | Rheinsberger Str. 76/77, 10115 Berlin, Germany
Managing Director: Alexander Ljung | Incorporated in England & Wales
with Company No. 6343600 | Local Branch Office | AG Charlottenburg |
HRB 110657B

blakout

Apr 19, 2017, 2:05:21 PM4/19/17
to Prometheus Users, niko...@gmail.com
To test your theory, I temporarily removed all recording/alerting rules from the Prometheus server (thankfully we recently spun up a second server for HA, on which I left the alerting rules active). This was done by modifying `/etc/prometheus/prometheus.yaml` to not read any `rules` files and then reloading the Prometheus service.
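(For reference, the change amounted to something like this in `/etc/prometheus/prometheus.yaml` — the intervals and rule-file paths here are illustrative, not our actual ones:)

```yaml
global:
  scrape_interval: 15s      # illustrative
  evaluation_interval: 15s  # illustrative

# rule_files commented out for the test, so no recording/alerting
# rules are evaluated on this server:
# rule_files:
#   - "/etc/prometheus/rules/*.rules"
```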

This had no noticeable effect on memory usage.

$ ps aux --sort -rss
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
prometh+  8848  261 94.8 62333608 59672924 ?   Ssl  Apr18 2947:09 /usr/local/bin/prometheus -config.file=/etc/prometheus/prometheus.yaml -storage.
root       435  0.0  0.1 147068 97452 ?        Ss    2016  16:24 /lib/systemd/systemd-journald
root     16594  0.0  0.0 925424 25808 ?        Ssl   2016  56:25 dockerd -H unix:///var/run/docker.sock --ip-forward=true --iptables=true --ip-mas
node-ex+ 29419  0.4  0.0 567320 21964 ?        Ssl  Jan23 605:33 /usr/local/bin/node_exporter -collectors.enabled=conntrack,diskstats,entropy,file
consul   10668  0.4  0.0  37396 21344 ?        Ssl   2016 1016:49 /usr/local/bin/consul agent -config-dir /etc/consul
root     16604  0.0  0.0 820316 12040 ?        Ssl   2016   9:43 docker-containerd -l unix:///var/run/docker/libcontainerd/docker-containerd.sock 
root      8537  0.0  0.0  26424 10424 ?        Ssl  Apr18   0:37 /usr/share/filebeat/bin/filebeat -c /etc/filebeat/filebeat.yml -path.home /usr/sh
root      8818  0.0  0.0  17592  7776 ?        Ssl  Apr18   0:45 /bin/blackbox_exporter -config.file=/config.yml

We're still running Prometheus 1.5.2, and I plan to test upgrading to 1.6.0 shortly.


About the `go_memory_bytes` panel from Prometheus Stats: the underlying metric is `go_memstats_alloc_bytes`.
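(A small aside on interpreting that metric: `go_memstats_alloc_bytes` is a gauge of the bytes currently on the heap, while its counter sibling `go_memstats_alloc_bytes_total` accumulates all bytes ever allocated. The short-lived allocation churn Björn describes shows up as a steep counter rate even while the gauge's minimum stays flat. A minimal sketch with made-up sample values:)

```python
# Two hypothetical readings of go_memstats_alloc_bytes_total (a
# cumulative counter), taken 60 seconds apart; their difference over
# the interval is the allocation churn rate.
t0_total = 1_200_000_000_000  # counter reading at time t (made up)
t1_total = 1_230_000_000_000  # counter reading 60s later (made up)
interval_s = 60

churn_bytes_per_s = (t1_total - t0_total) / interval_s
print(f"allocation churn: {churn_bytes_per_s / 1e6:.0f} MB/s")
# → allocation churn: 500 MB/s
```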


Björn Rabenstein

Apr 21, 2017, 6:01:33 PM4/21/17
to blakout, Prometheus Users
On 19 April 2017 at 14:05, blakout <niko...@gmail.com> wrote:
> We're still running Prometheus 1.5.2, and I plan to test upgrading to 1.6.0
> shortly.
>
> Node System Stats (past 1 hour) -
> https://snapshot.raintank.io/dashboard/snapshot/WCLPGSASMBZ8V4fLzzGS8NWwIEA3tCqH
> Prometheus Stats (past 1 hour) -
> https://snapshot.raintank.io/dashboard/snapshot/ebHlOgPksRS56Mfkr44FS6KXZTenpuBn

These stats don't look too weird. 3M time series is already quite a
lot, though. Let's see what 1.6.1 gives you. It could help quite a bit
in your situation.

nsmeds

Apr 24, 2017, 4:04:14 PM4/24/17
to Prometheus Users, niko...@gmail.com
1.6.1 has been running on a couple of instances for the past 4-6 days. You can immediately see a difference in memory usage =) Very happy with the changes. Thank you thank you thank you!

govinda...@gmail.com

Aug 18, 2017, 8:22:51 AM8/18/17
to Prometheus Users, niko...@gmail.com
Hi All,

We are running Prometheus (1.6.2) scraper nodes as Docker containers. These nodes are set to persist data for only 1 hour (storage.local.retention 1h0m0s). A few hours (~6 hrs) after we start a container, memory becomes completely full and the container becomes unresponsive.



CONTAINER           CPU %               MEM USAGE / LIMIT    MEM %               NET I/O             BLOCK I/O           PIDS
d364804446a7        11.59%              11.09 GiB / 12 GiB   92.38%              3.08 GB / 3.83 GB   3.57 GB / 4.2 TB    23

Here are the settings from the /flags endpoint.

Command-Line Flags

alertmanager.notification-queue-capacity                   10000
alertmanager.timeout                                       10s
alertmanager.url
config.file                                                /etc/prometheus/prometheus.yml
log.format                                                 "logger:stderr"
log.level                                                  "info"
query.max-concurrency                                      20
query.staleness-delta                                      5m0s
query.timeout                                              2m0s
storage.local.checkpoint-dirty-series-limit                5000
storage.local.checkpoint-interval                          5m0s
storage.local.chunk-encoding-version                       1
storage.local.dirty                                        false
storage.local.engine                                       persisted
storage.local.index-cache-size.fingerprint-to-metric       10485760
storage.local.index-cache-size.fingerprint-to-timerange    5242880
storage.local.index-cache-size.label-name-to-label-values  10485760
storage.local.index-cache-size.label-pair-to-fingerprints  20971520
storage.local.max-chunks-to-persist                        0
storage.local.memory-chunks                                0
storage.local.num-fingerprint-mutexes                      4096
storage.local.path                                         data
storage.local.pedantic-checks                              false
storage.local.retention                                    1h0m0s
storage.local.series-file-shrink-ratio                     0.1
storage.local.series-sync-strategy                         adaptive
storage.local.target-heap-size                             10737418240
storage.remote.graphite-address
storage.remote.graphite-prefix
storage.remote.graphite-transport
storage.remote.influxdb-url
storage.remote.influxdb.database
storage.remote.influxdb.retention-policy
storage.remote.influxdb.username
storage.remote.opentsdb-url
storage.remote.timeout
version                                                    false
web.console.libraries                                      console_libraries
web.console.templates                                      consoles
web.enable-remote-shutdown                                 false
web.external-url                                           http://d364804446a7:9090/
web.listen-address                                         :9090
web.max-connections                                        512
web.read-timeout                                           30s
web.route-prefix                                           /
web.telemetry-path                                         /metrics
web.user-assets
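One thing I am wondering about: the Prometheus 1.6 storage documentation suggests (if I read it correctly) setting `storage.local.target-heap-size` to roughly 2/3 of available memory, since the process RSS is usually well above the Go heap. Our configured 10 GiB target seems high for a 12 GiB container limit. A quick check (the 2/3 rule of thumb is taken from those docs; please correct me if I misread them):

```python
# Sketch: pick -storage.local.target-heap-size as ~2/3 of the
# container memory limit, leaving headroom for RSS above the Go heap.
GIB = 1024 ** 3
container_limit_bytes = 12 * GIB  # docker memory limit from the stats above
target_heap_bytes = container_limit_bytes * 2 // 3
print(target_heap_bytes)  # vs. the configured 10737418240 (10 GiB)
```

If that rule of thumb applies, lowering the target heap to ~8 GiB might leave enough headroom for the container to stay responsive.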

Any suggestion on how to fix the memory issues? Let me know if any details required. Thanks.

Thanks,
Govind

promethues...@gmail.com

Apr 24, 2018, 10:51:40 PM4/24/18
to Prometheus Users
Have you ever got any answer for this issue?