Prometheus getting slow on about 400 node_exporter instances

Nur Kholis Majid

Feb 29, 2020, 6:38:39 PM

to Prometheus Users

Hi,

I've tested Prometheus monitoring node_exporter on 400 instances. With the default configuration, in just two months the TSDB size reached about 450 GB and memory usage about 135 GB. Queries have become slow and unusable.

Questions:

1. What is the maximum number of node_exporter instances that Prometheus can handle with acceptable query duration?

2. Is there any special Prometheus configuration for a large number of instances?

Thank you

Julien Pivotto

Feb 29, 2020, 6:44:34 PM

to Nur Kholis Majid, Prometheus Users

On 29 Feb 15:38, Nur Kholis Majid wrote:
> Hi,
>
> I've tested Prometheus monitoring node_exporter on 400 instances. With the
> default configuration, in just two months the TSDB size reached about 450 GB
> and memory usage about 135 GB. Queries have become slow and unusable.
>
> [image: photo_2020-03-01_06-33-51.jpg]
>
> [image: photo_2020-03-01_06-34-00.jpg]

Can you tell us what you mean by default configuration? Is it the default
or the documented one? What are your startup parameters?

How many series do you have?
max_over_time(prometheus_tsdb_head_series[1d])
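
For readers who prefer to run this query outside the web UI, here is a minimal Python sketch that sends it to the Prometheus HTTP API (`/api/v1/query`) and pulls the values out of the response. The base URL `http://localhost:9090` and the helper names are assumptions for illustration; adjust them to your own setup.

```python
import json
import urllib.parse
import urllib.request

QUERY = "max_over_time(prometheus_tsdb_head_series[1d])"


def build_query_url(base_url, expr):
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": expr})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def extract_values(payload):
    """Return the sample values from an instant-vector API response."""
    if payload.get("status") != "success":
        raise ValueError("query failed: %s" % payload.get("error", "unknown error"))
    # Each result entry looks like {"metric": {...}, "value": [<timestamp>, "<value>"]}.
    return [float(item["value"][1]) for item in payload["data"]["result"]]


def fetch_head_series(base_url="http://localhost:9090"):
    """Run QUERY against a live server (the default base URL is an assumption)."""
    with urllib.request.urlopen(build_query_url(base_url, QUERY), timeout=10) as resp:
        return extract_values(json.load(resp))
```

Calling `fetch_head_series()` on the machine that runs Prometheus should return a one-element list with the peak head series count over the last day.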

Do you have lots of different disks/devices per machine? Lots of
network interfaces?

I recommend you read
https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion
to better understand this.

> --
> You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
> To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
> To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/7da6b213-02d0-4beb-83fb-e943701b2422%40googlegroups.com.

--
(o- Julien Pivotto
//\ Open-Source Consultant
V_/_ Inuits - https://www.inuits.eu

Nur Kholis Majid

Feb 29, 2020, 7:40:21 PM

to Prometheus Users

Hi Julien,

On Sunday, March 1, 2020 at 6:44:34 AM UTC+7, Julien Pivotto wrote:

> Can you tell us what you mean by default configuration? Is it the default
> or the documented one? What are your startup parameters?

I mean that I only added a minimal configuration to prometheus.yml:

$ cat prometheus.yml
# my global config
global:
  scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
    - targets: ['10.10.10.1:9100', '10.10.10.2:9100', etc until 400 nodes]

On the node_exporter side, no additional configuration was made.

> How many series do you have?
> max_over_time(prometheus_tsdb_head_series[1d])

771651

> Do you have lots of different disks/devices per machine? Lots of
> network interfaces?

Yes. Each node consists of 2 NICs in bonding mode and 12 disks.
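
To put these numbers in perspective, here is a rough back-of-the-envelope sketch in Python. It uses the figures from this thread (771651 head series, 400 nodes, a 15s scrape interval) together with the rule of thumb from the Prometheus storage documentation that a sample takes roughly 1 to 2 bytes on disk. It assumes that every series receives one sample per scrape, that "two months" means 60 days, and that no data was removed by retention; the function names are illustrative only.

```python
HEAD_SERIES = 771_651    # reported in this thread
NODES = 400              # node_exporter targets
SCRAPE_INTERVAL_S = 15   # from the posted prometheus.yml
DAYS = 60                # assumption: "two months" taken as 60 days


def series_per_node(head_series, nodes):
    """Average number of series per target (slightly high: includes the self-scrape job)."""
    return head_series / nodes


def samples_per_second(head_series, scrape_interval_s):
    """Ingestion rate if every series gets one sample per scrape."""
    return head_series / scrape_interval_s


def disk_bytes(samples_per_s, days, bytes_per_sample):
    """Disk estimate: retention in seconds * samples per second * bytes per sample."""
    return samples_per_s * days * 86_400 * bytes_per_sample


rate = samples_per_second(HEAD_SERIES, SCRAPE_INTERVAL_S)
low = disk_bytes(rate, DAYS, 1.0)
high = disk_bytes(rate, DAYS, 2.0)
print(f"series per node: {series_per_node(HEAD_SERIES, NODES):.0f}")
print(f"ingestion rate: {rate:.0f} samples/s")
print(f"estimated disk for {DAYS} days: {low / 1e9:.0f} to {high / 1e9:.0f} GB")
```

With these inputs the estimate comes to roughly 267 to 533 GB for 60 days, so the reported 450 GB is not by itself out of line for this series count if two months of data are kept. With the default 15-day retention the same formula would give roughly 67 to 133 GB; the thread does not state whether retention was changed.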

Julien Pivotto

Feb 29, 2020, 7:55:30 PM

to Nur Kholis Majid, Prometheus Users

Hi,

Can you tell us what is in your data directory? Are compactions
happening, etc.?

For example, the output of the command
tree data

or ls -Rl data
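
When the listing is too long to read by eye, a short script can summarize it. The Python sketch below is only an illustration: it assumes the usual Prometheus 2.x layout in which each block is a directory containing a `meta.json` with `minTime` and `maxTime` in milliseconds and a `compaction.level` field, and it treats any directory with `.tmp` in its name as a leftover temporary directory. That naming check is an assumption and may differ between versions; `summarize_data_dir` is a hypothetical helper name.

```python
import json
import os


def summarize_data_dir(data_dir):
    """Classify the top-level entries of a Prometheus data directory."""
    blocks, tmp_dirs, other = [], [], []
    for name in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, name)
        meta_path = os.path.join(path, "meta.json")
        if not os.path.isdir(path):
            other.append(name)  # plain files such as a lock file
        elif ".tmp" in name:
            # Assumed to be a leftover temporary directory.
            tmp_dirs.append(name)
        elif os.path.isfile(meta_path):
            with open(meta_path) as fh:
                meta = json.load(fh)
            blocks.append({
                "name": name,
                "level": meta.get("compaction", {}).get("level"),
                # minTime/maxTime are milliseconds; convert the span to hours.
                "hours": (meta["maxTime"] - meta["minTime"]) / 3_600_000,
            })
        else:
            other.append(name)  # for example the wal directory
    return {"blocks": blocks, "tmp": tmp_dirs, "other": other}
```

Running `print(summarize_data_dir("data"))` next to the data directory gives a compact view: many small blocks at a low compaction level, or a long list under `tmp`, would both be worth reporting.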

Thanks

Nur Kholis Majid

Feb 29, 2020, 9:13:52 PM

to Prometheus Users

Hi,

On Sunday, March 1, 2020 at 7:55:30 AM UTC+7, Julien Pivotto wrote:

> Hi,
>
> Can you tell us what is in your data directory? Are compactions
> happening, etc.?
>
> For example, the output of the command
> tree data
>
> or ls -Rl data

It is too long to copy here. Please see https://paste.ee/p/ayBlq

Julien Pivotto

Mar 1, 2020, 4:02:05 AM

to Nur Kholis Majid, Prometheus Users

On 29 Feb 18:13, Nur Kholis Majid wrote:
> It is too long to copy here. Please see https://paste.ee/p/ayBlq

You have had a lot of failed compactions in the past, and there are a
lot of .tmp directories.

What is strange is that compaction does succeed in the end.

I have the following follow-up questions to help you:

- What is your Prometheus version?
- Can you share the Prometheus logs?
- Are you using the node_exporter textfile collector?
- Do you have any metric_relabel_configs?

There are a few known bugs, but none of them explains why the WAL is
compacted correctly in the end.
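
Two of the points above, the running version and whether compactions are failing, can also be read from Prometheus's own `/metrics` endpoint. The Python sketch below is one possible way to do that, not a replacement for sharing the logs. It assumes the server is reachable at `http://localhost:9090` and that exposition lines carry no trailing timestamp; `parse_metrics` and `fetch_metrics` are hypothetical helper names.

```python
import urllib.request

# Metric names exposed by Prometheus about itself.
WANTED = {
    "prometheus_build_info",
    "prometheus_tsdb_compactions_total",
    "prometheus_tsdb_compactions_failed_total",
}


def parse_metrics(text, wanted):
    """Extract samples for the wanted metric names from text exposition output."""
    found = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # The metric name ends at the first '{' (labels) or space (value).
        name = line.split("{", 1)[0].split(" ", 1)[0]
        if name in wanted:
            # Assumption: no trailing timestamp, so the value is the last field.
            series, _, value = line.rpartition(" ")
            found[series] = float(value)
    return found


def fetch_metrics(base_url="http://localhost:9090"):
    """Download and parse /metrics from a live server (the base URL is an assumption)."""
    with urllib.request.urlopen(base_url.rstrip("/") + "/metrics", timeout=10) as resp:
        return parse_metrics(resp.read().decode("utf-8"), WANTED)
```

The `version` label on `prometheus_build_info` answers the first question, and a non-zero `prometheus_tsdb_compactions_failed_total` would confirm that compactions have failed since the last restart.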
