prometheus-config-reloader container becomes unable to connect

Kota Nikaido

Nov 28, 2019, 6:01:31 AM

to Prometheus Users

Hello.

Currently I use Prometheus on AWS EKS.

Every morning, I install the Helm chart 'stable/prometheus-operator', version 6.8.2.

One day, the "prometheus-config-reloader" container became unable to connect to 127.0.0.1:9090, and the helm install failed.

(The day before, it had succeeded.)

The error log is as follows.

level=error ts=2019-11-24T20:58:52.721302449Z caller=runutil.go:88 msg="function failed. Retrying in next tick" err="trigger reload: reload request failed: Post http://127.0.0.1:9090/-/reload: dial tcp 127.0.0.1:9090: connect: connection refused"

I also confirmed that Prometheus had been OOM-killed.

After I removed the PVC attached to Prometheus, the error went away.

However, the root cause of this case remains unknown.

In a customer environment, I don't want to delete the volume.

How can I resolve this error?

And how can I prevent it from happening again?

Thanks.

Simon Pasquier

Nov 28, 2019, 6:49:45 AM

to Kota Nikaido, Prometheus Users

You need to understand exactly why Prometheus doesn't start; logs would help.
If Prometheus is OOM-killed, try increasing the memory limits if this
is possible.

> --
> You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/fa15ac03-877d-4310-a40f-2b20b9229b89%40googlegroups.com.

Kota Nikaido

Nov 28, 2019, 8:38:58 PM

to Prometheus Users

Thank you for replying.

Unfortunately, I have no logs other than what I presented.

And the memory limit cannot easily be changed, because it affects the overall environment.

I set the Prometheus memory limit to 2400 MiB.

I have several other environments with similar settings, so I checked how memory usage changed over time.

I noticed the following:

- Prometheus memory usage always stayed below 1000 MiB for the first few days.

- Once an environment had been running for more than 30 days, memory usage reached 2400 MiB at startup.

After I removed the PVC attached to Prometheus, memory usage returned to its previous level (always below 1000 MiB).

Does Prometheus' memory usage increase with the amount of data stored on the attached volume?

Simon Pasquier

Nov 29, 2019, 4:14:44 AM

to Kota Nikaido, Prometheus Users

On Fri, Nov 29, 2019 at 2:39 AM Kota Nikaido <tis.ni...@gmail.com> wrote:
> Will Prometheus' memory usage increase with the amount of data stored on the attached volume?

Yes, because of compactions (when Prometheus aggregates data blocks
into bigger blocks) and because queries might return more samples.
You might need to reduce the retention time.

Kota Nikaido

Dec 2, 2019, 8:51:50 PM

to Prometheus Users

I set the Prometheus retention time to 10 days.

In Grafana, I can no longer see Prometheus data older than 10 days.

However, the amount of data on the PV continued to grow, and an OOM kill occurred.

I thought that setting the retention time would automatically reduce the amount of stored data. Is my understanding wrong?

Stuart Clark

Dec 3, 2019, 1:47:12 AM

to promethe...@googlegroups.com

On 03/12/2019 01:51, Kota Nikaido wrote:
> I set Prometheus “retention time” to 10 days.
> From Grafana, I cannot view Prometheus data more than 10 days ago.
> However, the amount of PV data continued to increase and OOM occurred.
> I think that setting the "retention time" automatically reduces the
> amount of data. Is my idea wrong?

Prometheus will periodically remove data older than the retention time,
so storage shouldn't keep increasing forever. To achieve this, as well as
the regular block compaction and WAL processing, a fair amount of RAM is
needed. How many time series are you scraping, and how frequently?
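
One way to answer the "how many time series" question is to ask Prometheus itself. The following is a minimal sketch, assuming the server is reachable at http://127.0.0.1:9090 (for example through a port-forward) and that Prometheus scrapes its own /metrics endpoint, so that the built-in prometheus_tsdb_head_series metric is available. It builds an instant-query URL for the HTTP API and reads the value out of the JSON reply. The helper names and the sample response body are hypothetical and only illustrate the shape of the data.

```python
import json
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """Return the instant-query URL (/api/v1/query) for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def first_sample_value(body: str) -> float:
    """Extract the first sample value from an instant-query JSON response."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        raise ValueError("query returned no samples")
    # Each vector element carries "value": [<unix timestamp>, "<value as string>"].
    return float(result[0]["value"][1])


# Hypothetical response body, shaped like an instant-query reply (made-up value).
SAMPLE_BODY = """
{"status": "success",
 "data": {"resultType": "vector",
          "result": [{"metric": {"__name__": "prometheus_tsdb_head_series"},
                      "value": [1575359232.0, "48213"]}]}}
"""

# Assumed address; adjust to wherever the Prometheus API is exposed.
url = build_query_url("http://127.0.0.1:9090", "prometheus_tsdb_head_series")
print(url)
print(first_sample_value(SAMPLE_BODY))
```

Fetching the printed URL (with curl or a browser) returns a body of this shape. If Prometheus is not scraping itself, the PromQL expression count({__name__=~".+"}) gives a comparable number, at the cost of a heavier query.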

--
Stuart Clark

Kota Nikaido

Dec 3, 2019, 2:53:19 AM

to Prometheus Users

> How many time series are you scraping, and how frequently?

I'm sorry, I don't know how to answer that.

I installed the Helm chart "stable/prometheus-operator" and left the exporter and scrape options at their defaults.

Kota Nikaido

Dec 3, 2019, 3:33:00 AM

to Prometheus Users

Does Prometheus collect metrics automatically, so that the number of time series grows on its own?

How does Prometheus determine the number of time series with the default settings?

How can I see the current number of time series?

Stuart Clark

Dec 3, 2019, 3:38:17 AM

to Kota Nikaido, Prometheus Users

Prometheus will automatically collect metrics based on the configuration file. That defines targets to scrape (often using a service discovery mechanism) and how often to scrape (every few seconds up to every few minutes).

It is then down to the exporter or instrumented application how many metrics are returned and ingested.

A time series is defined by its combination of key/value pairs (the labels, including the metric name), so each scrape may just add data to existing time series rather than creating new ones.
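
The point about series identity can be sketched in a few lines. This is an illustrative model only, not how Prometheus stores data internally: the metric name, label values, and the series_key helper are all hypothetical. Two scrapes are fed in; samples whose full label set is already known extend an existing series, and only a previously unseen label set creates a new one.

```python
def series_key(name: str, labels: dict) -> frozenset:
    """Identify a series by its full label set, with the metric name as __name__."""
    return frozenset({"__name__": name, **labels}.items())


# Two hypothetical scrapes of the same target: (metric name, labels, value).
scrape_1 = [
    ("http_requests_total", {"job": "api", "code": "200"}, 10.0),
    ("http_requests_total", {"job": "api", "code": "500"}, 1.0),
]
scrape_2 = [
    ("http_requests_total", {"job": "api", "code": "200"}, 17.0),
    ("http_requests_total", {"job": "api", "code": "500"}, 1.0),
    ("http_requests_total", {"job": "api", "code": "404"}, 2.0),  # new label value
]

series = {}
for scrape in (scrape_1, scrape_2):
    for name, labels, value in scrape:
        # Same label set: append to the existing series. New label set: new series.
        series.setdefault(series_key(name, labels), []).append(value)

print("time series:", len(series))
print("samples:", sum(len(values) for values in series.values()))
```

Five samples end up in three series here. The series count only grows when a new combination of label values appears, which is why labels with many distinct values (pod names, request IDs) drive memory usage much more than the scrape frequency does.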

Kota Nikaido

Dec 3, 2019, 6:54:06 AM

to Prometheus Users

Thanks for your reply, Stuart.

I left the scrape interval at the default, i.e. 1m.

In each environment, 30 to 40 containers are running on Kubernetes.

In the environment where the OOM error occurred, metrics were collected for three namespaces on two nodes.
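
With the scrape interval and retention time known, the remaining unknown is the series count. As a rough sketch of how the three combine, the arithmetic below estimates how many samples are kept and how much disk they occupy. The series count of 50,000 is a made-up placeholder, not a figure from this thread, and the 2 bytes per sample is an assumption based on the rough 1 to 2 bytes per sample that the Prometheus storage documentation uses for capacity planning. This estimates disk usage on the volume only; memory usage depends on other factors such as the number of active series and the queries being run.

```python
def samples_retained(series_count: int, scrape_interval_s: int, retention_s: int) -> int:
    """Samples kept on disk if every series is scraped once per interval."""
    scrapes_in_window = retention_s // scrape_interval_s
    return series_count * scrapes_in_window


def estimated_disk_bytes(samples: int, bytes_per_sample: float = 2.0) -> float:
    """Rough on-disk size; bytes_per_sample is an assumed average, not a measurement."""
    return samples * bytes_per_sample


SERIES_COUNT = 50_000             # hypothetical; query Prometheus for the real value
SCRAPE_INTERVAL_S = 60            # 1m, as stated above
RETENTION_S = 10 * 24 * 60 * 60   # 10 days, as set earlier in the thread

samples = samples_retained(SERIES_COUNT, SCRAPE_INTERVAL_S, RETENTION_S)
print(f"samples retained: {samples:,}")
print(f"estimated disk usage: {estimated_disk_bytes(samples) / 1e9:.2f} GB")
```

If the volume keeps growing well past such an estimate after the retention window has elapsed, that suggests either a much higher series count than assumed or that old blocks are not being removed.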
