Prometheus getting slow on about 400 node_exporter instances

Nur Kholis Majid

Feb 29, 2020, 6:38:39 PM
to Prometheus Users
Hi,

I've been testing Prometheus to monitor node_exporter on 400 instances. With the default configuration, in just two months the TSDB size reached about 450 GB and memory usage about 135 GB. Queries have become slow and unusable.

photo_2020-03-01_06-33-51.jpg

photo_2020-03-01_06-34-00.jpg



Questions:
1. What is the maximum number of node_exporter instances that Prometheus can handle with acceptable query duration?
2. Is there any special Prometheus configuration for a huge number of instances?

Thank you

Julien Pivotto

Feb 29, 2020, 6:44:34 PM
to Nur Kholis Majid, Prometheus Users
Can we know what you mean by default configuration? Is it the default or
the documented one? What are your startup parameters?

How many series do you have?
max_over_time(prometheus_tsdb_head_series[1d])
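
For reference, the query above can also be run against the Prometheus HTTP API instead of the web UI. This is only a minimal sketch: it assumes the server is reachable at http://localhost:9090 (an assumption, adjust as needed), and the helper names are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# The query quoted above.
QUERY = "max_over_time(prometheus_tsdb_head_series[1d])"


def build_query_url(base_url, query):
    """Build an instant-query URL for the /api/v1/query endpoint."""
    params = urllib.parse.urlencode({"query": query})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def parse_instant_vector(payload):
    """Extract the sample values from an instant-query JSON response."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each result entry carries "value": [<unix timestamp>, "<value as string>"].
    return [float(sample["value"][1]) for sample in payload["data"]["result"]]


def head_series(base_url="http://localhost:9090"):
    """Run QUERY against a Prometheus server (base_url is an assumed address)."""
    with urllib.request.urlopen(build_query_url(base_url, QUERY), timeout=10) as resp:
        return parse_instant_vector(json.load(resp))
```

Calling head_series() returns one value per result series, which is normally a single number when one Prometheus server scrapes itself.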

Do you have lots of different disks/devices per machine? Lots of
network interfaces?

I recommend you read
https://www.robustperception.io/how-much-ram-does-prometheus-2-x-need-for-cardinality-and-ingestion
to better understand this.

--
(o- Julien Pivotto
//\ Open-Source Consultant
V_/_ Inuits - https://www.inuits.eu

Nur Kholis Majid

Feb 29, 2020, 7:40:21 PM
to Prometheus Users
Hi Julien,

On Sunday, March 1, 2020 at 6:44:34 AM UTC+7, Julien Pivotto wrote:
> Can we know what you mean by default configuration? Is it the default or
> the documented one? What are your startup parameters?

I mean I just added a minimal configuration in prometheus.yml:
$ cat prometheus.yml
# my global config
global:
  scrape_interval:     15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
  evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
  # scrape_timeout is set to the global default (10s).

# Alertmanager configuration
alerting:
  alertmanagers:
  - static_configs:
    - targets:
      # - alertmanager:9093

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  # - "first_rules.yml"
  # - "second_rules.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'

    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.

    static_configs:
    - targets: ['localhost:9090']

  - job_name: 'node'
    static_configs:
    - targets: ['10.10.10.1:9100', '10.10.10.2:9100']  # and so on, up to 400 nodes

On the node_exporter side, no additional configuration was made.
 
> How many series do you have?
> max_over_time(prometheus_tsdb_head_series[1d])

771651
 
> Do you have lots of different disks/devices per machine? Lots of
> network interfaces?

Yes. Each node has 2 NICs in bonding mode and 12 disks.
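
As a rough back-of-the-envelope check on these numbers, the sketch below derives series per target, ingestion rate, and a disk estimate from the reported head series count (771651), the 400 targets, and the 15s scrape interval in the config. Several things here are assumptions: every series is assumed to get one sample per scrape interval, 60 days stands in for "two months" (the actual retention setting is not stated in this thread), the 1 to 2 bytes per sample figure is the rule of thumb from the Prometheus storage documentation, and the function name is made up for illustration.

```python
def ingestion_estimate(head_series, targets, scrape_interval_s, retention_days, bytes_per_sample):
    """Rough sizing estimate; assumes one sample per series per scrape interval."""
    samples_per_second = head_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return {
        "series_per_target": head_series / targets,
        "samples_per_second": samples_per_second,
        # disk = retention time * ingestion rate * bytes per sample
        "disk_bytes": retention_seconds * samples_per_second * bytes_per_sample,
    }


# 1 and 2 bytes per sample bracket the rule-of-thumb range (an assumption).
for bytes_per_sample in (1, 2):
    est = ingestion_estimate(
        head_series=771_651,
        targets=400,
        scrape_interval_s=15,
        retention_days=60,  # assumption: "two months" of data
        bytes_per_sample=bytes_per_sample,
    )
    print(
        f"{est['series_per_target']:.0f} series/target, "
        f"{est['samples_per_second']:.0f} samples/s, "
        f"{est['disk_bytes'] / 1e9:.0f} GB at {bytes_per_sample} B/sample"
    )
```

With these inputs it comes out at roughly 1,900 series per target and about 51,000 samples per second. This is only an order-of-magnitude estimate and says nothing about memory usage.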
 


Julien Pivotto

Feb 29, 2020, 7:55:30 PM
to Nur Kholis Majid, Prometheus Users
Hi,

Can you tell us what is in your data directory? Is compaction
happening, etc.?

e.g. the command
tree data

or ls -Rl data

Thanks

Nur Kholis Majid

Feb 29, 2020, 9:13:52 PM
to Prometheus Users
Hi,


On Sunday, March 1, 2020 at 7:55:30 AM UTC+7, Julien Pivotto wrote:
> Hi,
>
> Can you tell us what is in your data directory? Is compaction
> happening, etc.?
>
> e.g. the command
> tree data
>
> or ls -Rl data

It's too long to copy here. Please see https://paste.ee/p/ayBlq

Thanks
 

Julien Pivotto

Mar 1, 2020, 4:02:05 AM
to Nur Kholis Majid, Prometheus Users
You have had a lot of failed compactions in the past, and you have a lot
of .tmp directories.

What is strange is that compaction does happen in the end.

I have the following questions to help you:

- What is your Prometheus version?
- Can you share the Prometheus logs?
- Are you using the node_exporter textfile collector?
- Do you have metric relabel configs?

We have a few bugs out there, but none of them explains why the WAL is
compacted correctly in the end.
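
To get a quick overview of the leftover .tmp directories mentioned above without reading a full directory listing, something like the following sketch could be used. It assumes the data directory is ./data (as in the `tree data` command earlier in the thread) and that leftover directories have names ending in ".tmp"; the helper name is made up for illustration. It only lists directories and never deletes anything.

```python
from pathlib import Path


def find_tmp_dirs(data_dir):
    """Return directories under data_dir whose names end in '.tmp', sorted by path."""
    root = Path(data_dir)
    if not root.is_dir():
        return []
    # rglob matches files and directories alike, so keep directories only.
    return sorted(p for p in root.rglob("*.tmp") if p.is_dir())


if __name__ == "__main__":
    # "data" is an assumed path; point it at the actual Prometheus data directory.
    for path in find_tmp_dirs("data"):
        print(path)
```

Comparing the list against the timestamps in the Prometheus logs may help tell when the failed compactions happened.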