MONITORING AROUND 28K CONTAINERS


Isabel Noronha

Apr 16, 2020, 1:53:39 AM
to Prometheus Users
Hello,
Yeah, the subject might be overwhelming...

Firstly, I'm new to Prometheus.
So far I have configured 3 targets to monitor the containers running on them (around 50 containers in total).
The problem I'm facing is that Grafana hangs as the number of containers increases.

The web application I'm currently working on is used for simulation, so on each server we spawn around 2K containers.
That's 14 servers * 2K containers per host = 28K containers in total.
Now I want to monitor the containers running on each host. The priority is to know when a container is using too much memory or CPU, or is about to go down.

So can Prometheus handle that much load and monitor every container?
Should I rely on Prometheus's local storage, or use InfluxDB?


Please share your suggestions.

Thank you,
Isabel



Brian Candler

Apr 16, 2020, 3:09:24 AM
to Prometheus Users
> So can Prometheus handle so much load and monitor each container?

In short yes, although the important figure is the total number of *metrics* (servers x containers-per-server x metrics-per-container), and this will affect how much resource you need to throw at your prometheus server.  If it's too much, you may choose to filter the metrics you ingest to just the ones of interest, using metric relabelling.
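(To put rough numbers on it: cAdvisor exposes a few dozen series per container by default, so 14 x 2,000 containers could easily mean on the order of a million active series, which is why that filtering is worth doing.)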

Assuming you are using a modern version of prometheus (2.14 or later) then the web interface on port 9090 will tell you the stats you need to know, under Status > Runtime & Build Information.
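Graphing the prometheus_tsdb_head_series metric over time is another quick way to keep an eye on how many active series you are carrying.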

As for grafana "hanging": you probably need to configure your dashboards to select a small enough subset of timeseries up-front, e.g. using dashboard variables.  If you run an initial query which returns thousands of timeseries, it will indeed take an extremely long time to (a) return the results from prometheus, and (b) render them.
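For example, a dashboard variable populated from a query like label_values(container_memory_usage_bytes, instance) lets each panel filter with {instance=~"$instance"} rather than pulling series for all 28K containers at once (the metric and label names here are only illustrative).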

Ben Kochie

Apr 16, 2020, 3:16:10 AM
to Isabel Noronha, Prometheus Users
On Thu, Apr 16, 2020 at 7:53 AM Isabel Noronha <isabeln...@gmail.com> wrote:
Hello,
Yeah, the subject might be overwhelming...

Firstly, I'm new to Prometheus.
So far I have configured 3 targets to monitor the containers running on them (around 50 containers in total).
The problem I'm facing is that Grafana hangs as the number of containers increases.

It depends on your queries; you will likely need recording rules to summarize things.
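
For example, a rule file along these lines (untested; the metric and label names are just what cAdvisor typically exposes) pre-computes a per-container CPU rate, so dashboards and alerts can query the summarized series instead of the raw ones:

    groups:
      - name: container_summaries
        rules:
          - record: instance_name:container_cpu_usage_seconds:rate5m
            expr: sum by (instance, name) (rate(container_cpu_usage_seconds_total[5m]))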
 

The web application I'm currently working on is used for simulation, so on each server we spawn around 2K containers.
That's 14 servers * 2K containers per host = 28K containers in total.
Now I want to monitor the containers running on each host. The priority is to know when a container is using too much memory or CPU, or is about to go down.

Like Brian said, it's about the total number of metrics, not the number of containers. What exactly are you monitoring? cAdvisor? Direct instrumentation?


So can Prometheus handle that much load and monitor every container?
Should I rely on Prometheus's local storage, or use InfluxDB?

InfluxDB is pretty good, but it is slightly less efficient than Prometheus's internal TSDB. It's a general-purpose TSDB and not specifically tuned for the Prometheus use case.
 



Isabel Noronha

Apr 16, 2020, 4:18:52 AM
to Prometheus Users
Yeah, I just checked relabelling. I only need a few metrics and labels, e.g.:
container_memory_usage_bytes
container_cpu_usage_seconds_total

Isabel Noronha

Apr 16, 2020, 4:28:18 AM
to Prometheus Users


Yeah, I just checked relabelling. I only need a few metrics and labels, like the ones below:
container_memory_usage_bytes
container_cpu_usage_seconds_total
That leaves many more metrics to be dropped, and I don't want to list every metric to drop in my prometheus.yml file.
So is there a way to keep only what I need and drop all the other metrics?
An example would help me understand relabelling better.

Yes, I'm monitoring containers per host in Grafana using variables.
What I'm planning is a query that shows only the top 20 containers exceeding a CPU usage threshold and sends an alert.
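
Roughly what I have in mind is something like this (untested; assuming cAdvisor's metric and label names and an arbitrary threshold of 80% of one core):

    groups:
      - name: container_cpu
        rules:
          - alert: ContainerHighCpu
            expr: topk(20, rate(container_cpu_usage_seconds_total{name!=""}[5m])) > 0.8
            for: 10m
            annotations:
              summary: 'Container {{ $labels.name }} CPU usage is high'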

Brian Candler

Apr 16, 2020, 7:11:41 AM
to Prometheus Users
On Thursday, 16 April 2020 09:28:18 UTC+1, Isabel Noronha wrote:
    So is there a way to keep only what I need and drop all the other metrics?
    An example would help me understand relabelling better.

Untested:

    metric_relabel_configs:
      - source_labels: [__name__]
        regex: '(container_memory_usage_bytes|container_cpu_usage_seconds_total)'
        action: keep
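(With action: keep, any sample whose metric name doesn't match the regex is dropped after the scrape, before it reaches storage, so only those two metric families are ingested. Note the targets are still scraped in full; this saves storage and query load rather than scrape traffic.)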

Ben Kochie

Apr 16, 2020, 7:16:04 AM
to Isabel Noronha, Prometheus Users
On Thu, Apr 16, 2020 at 10:28 AM Isabel Noronha <isabeln...@gmail.com> wrote:


Yeah, I just checked relabelling. I only need a few metrics and labels, like the ones below:
container_memory_usage_bytes
container_cpu_usage_seconds_total
That leaves many more metrics to be dropped, and I don't want to list every metric to drop in my prometheus.yml file.
So is there a way to keep only what I need and drop all the other metrics?
An example would help me understand relabelling better.

Yes, I'm monitoring containers per host in Grafana using variables.
What I'm planning is a query that shows only the top 20 containers exceeding a CPU usage threshold and sends an alert.

If this is coming from cAdvisor, you can drop metrics by configuring it to disable some metrics collectors:

https://github.com/google/cadvisor/blob/master/docs/runtime_options.md#metrics
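
For example, starting cAdvisor with something like --disable_metrics=percpu,sched,tcp,udp,process switches off whole groups of per-container series at the source; check that page for the exact collector names your cAdvisor version supports.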

 
    

 



Isabel Noronha

Apr 16, 2020, 10:26:40 AM
to Prometheus Users
Yeah, I'm using cAdvisor for containers and node_exporter for hosts.
Thank you for the link.

Isabel Noronha

Apr 22, 2020, 7:02:40 AM
to Prometheus Users
Thank you, this helped.

Since I'm new to Prometheus, could you suggest any websites/blogs/tutorials to help me learn it better?

Regards,
Isabel

Brian Candler

Apr 22, 2020, 9:00:17 AM
to Prometheus Users

Isabel Noronha

Apr 23, 2020, 1:03:10 AM
to Prometheus Users
Thank you!