Memory usage grows from 4GB to 10GB


Brett Larson

Dec 18, 2020, 7:03:13 PM
to Prometheus Users
Hello,
I am running a Prometheus 2.21 server and its memory is growing at an untenable rate. The pod starts at around 4GB and over a few days climbs to 10GB, at which point it goes into a CrashLoopBackOff state.

The pod is configured to keep only around 8 hours of data, and this data is stored on an emptyDir, not a persistent file system. We are doing remote write to Postgres for only about 4 metrics.

We do have some no-nos (high-cardinality labels & pod names), but unfortunately these are needed.

I don't understand why, with a retention of only 8 hours, memory would still grow like this, and I'm looking for guidance on how to troubleshoot it.

Please let me know,
Thank you!

Julien Pivotto

Dec 18, 2020, 7:09:13 PM
to Brett Larson, Prometheus Users
Can you share screenshots of the
https://grafana.com/grafana/dashboards/12054 dashboard?

--
Julien Pivotto
@roidelapluie

Brett Larson

Dec 18, 2020, 7:40:56 PM
to Prometheus Users
Here is a link to the "snapshot" of the dashboard. 
https://snapshot.raintank.io/dashboard/snapshot/crxdjU7fhzAhl0x0KWiH1ZHGZXKyhqmF

Julien Pivotto

Dec 18, 2020, 7:50:07 PM
to Brett Larson, Prometheus Users
On 18 Dec 16:40, Brett Larson wrote:
> Here is a link to the "snapshot" of the dashboard.
> https://snapshot.raintank.io/dashboard/snapshot/crxdjU7fhzAhl0x0KWiH1ZHGZXKyhqmF


Thanks; however, this does not seem to show any memory issue.

There might be an issue with your configuration, where you could take
advantage of some tweaks, like reusing the same sd configs plus
relabeling, or using `selectors:` in your kubernetes config.
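
For instance, a single job can reuse one kubernetes_sd_configs block and let the API server pre-filter objects with `selectors:`, then narrow further with relabeling. This is a hypothetical sketch (job name, label, and annotation are illustrative, not taken from your config):

```yaml
scrape_configs:
  - job_name: example-app
    kubernetes_sd_configs:
      - role: pod
        # Let the Kubernetes API filter pods before Prometheus sees them,
        # instead of discovering every pod in the cluster per job.
        selectors:
          - role: pod
            label: "app=example"
    relabel_configs:
      # Keep only pods that opted in via the scrape annotation.
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: "true"
```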

If it is still an issue after that, a memory profile of Prometheus
taken while memory is high would help.


--
Julien Pivotto
@roidelapluie

Brett Larson

Dec 19, 2020, 12:01:29 PM
to Prometheus Users
Julien,
What are the best practices regarding service discovery?

We do have a large number of jobs with a large number of inactive targets - could this negatively impact our memory usage?

For example:
  • 1 (1/728 active targets)
  • 2 (143/1797 active targets)
  • 3 (1/728 active targets)
  • 4 (41/41 active targets)
  • 5 (41/41 active targets)
  • 6 (13/1237 active targets)
  • 7 (9/1243 active targets)
  • 8 (5/1261 active targets)
  • 9 (5/1261 active targets)
  • 10 (1/1261 active targets)
  • 11 (41/41 active targets)
  • 12 (1/1261 active targets)
  • 13 (41/1261 active targets)
  • 14 (5/1261 active targets)
  • 15 (5/1261 active targets)
  • 16 (41/1261 active targets)
  • 17 (3/728 active targets)
  • 18 (3/728 active targets)
Is the best practice now to use labels or regex in the relabel configs?

Here is an example of the config for envoy-stats:

Julien Pivotto

Dec 19, 2020, 2:29:15 PM
to Brett Larson, Prometheus Users
On 19 Dec 09:01, Brett Larson wrote:
> Julien,
> What are the best practices regarding service discovery?

A snippet with your multiple kubernetes_sd_configs would help.

>
> We do have a large amount of jobs with a large number of inactive targets -
> could this negatively impact our memory usage?
>
> For example:
>
> - 1 (1/728 active targets)
> - 2 (143/1797 active targets)
> - 3 (1/728 active targets)
> - 4 (41/41 active targets)
> - 5 (41/41 active targets)
> - 6 (13/1237 active targets)
> - 7 (9/1243 active targets)
> - 8 (5/1261 active targets)
> - 9 (5/1261 active targets)
> - 10 (1/1261 active targets)
> - 11 (41/41 active targets)
> - 12 (1/1261 active targets)
> - 13 (41/1261 active targets)
> - 14 (5/1261 active targets)
> - 15 (5/1261 active targets)
> - 16 (41/1261 active targets)
> - 17 (3/728 active targets)
> - 18 (3/728 active targets)


--
Julien Pivotto
@roidelapluie

Ben Kochie

Dec 19, 2020, 5:51:21 PM
to Brett Larson, Prometheus Users
Retention does not affect memory use. Process memory is only needed for ingestion, not storage retention. Data is automatically flushed to disk as soon as it's ready, at most every 2 hours.

Looking at your dashboard snapshot, everything looks normal. You have a peak of 1.25 million series, and if you divide the peak memory use of 7.18GiB by that series count, that's about 6.2kB per series, which is basically in line with expectations for managing scrapes of that many series.
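
As a back-of-the-envelope check of that figure (a rough estimate, not exact TSDB accounting):

```python
# Per-series memory estimate from the dashboard snapshot numbers.
GIB = 2 ** 30

peak_memory_bytes = 7.18 * GIB   # peak process memory from the snapshot
peak_series = 1.25e6             # peak head series from the snapshot

bytes_per_series = peak_memory_bytes / peak_series
print(f"{bytes_per_series / 1000:.1f} kB per series")  # roughly 6.2 kB
```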

Overall, you're collecting a moderate amount of series for one Prometheus server. The amount of memory you're using is to be expected for this many series. 

I think the best course of action is to allocate a bit more resources to your Prometheus job to avoid OOMing.
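
In Kubernetes terms that means raising the memory request/limit on the Prometheus container, for example (the exact numbers are illustrative; size them above your observed peak):

```yaml
resources:
  requests:
    memory: 8Gi    # at or above typical steady-state usage
  limits:
    memory: 12Gi   # headroom above the observed ~10GB peak to avoid OOM kills
```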

On Sat, Dec 19, 2020 at 6:01 PM Brett Larson <brettpatr...@gmail.com> wrote:
Julien,
What are the best practices regarding service discovery?

We do have a large amount of jobs with a large number of inactive targets - could this negatively impact our memory usage?

No, this doesn't really affect memory use.
 