Everything described below applies to Prometheus 2.0 and 2.1.
What did you do?
Prometheus sends samples to our remote storage. At some point the remote storage went down, and we saw this in the logs:
level=warn ts=2018-01-19T08:29:08.516631616Z caller=queue_manager.go:485 component=remote msg="Error sending samples to remote storage" count=1 err="Post <storage_url>: getsockopt: no route to host"
If we then try to reload the Prometheus configuration by sending a POST to <prometheus_url>/-/reload while the remote storage is down, the request never finishes. It looks like Prometheus keeps retrying to send data to the storage before it will reload the configuration.
What did you expect to see?
Prometheus should reload the configuration.
What did you see instead? Under which circumstances?
The configuration was not reloaded, and there was no response to the POST <prometheus_url>/-/reload request.
Environment
System information:
Linux 3.10.0-514.26.2.el7.x86_64 x86_64
Prometheus version:
2.0.0
Prometheus configuration file:
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    monitor: default
...
remote_write:
- url: <remote_storage_url>
  remote_timeout: 30s
  write_relabel_configs:
  - source_labels: [indicatorName]
    separator: ;
    regex: (.+)
    replacement: $1
    action: keep
  queue_config:
    capacity: 100000
    max_shards: 1000
    max_samples_per_send: 100
    batch_send_deadline: 5s
    max_retries: 10
    min_backoff: 30ms
    max_backoff: 100ms
2018/01/19 08:25:22 Redirected: /-/reload
level=info ts=2018-01-19T08:25:22.313752379Z caller=main.go:490 msg="Loading configuration file" filename=/config/prometheus/default.yml
level=info ts=2018-01-19T08:25:22.319062591Z caller=queue_manager.go:253 component=remote msg="Stopping remote storage..."
level=warn ts=2018-01-19T08:25:23.406480826Z caller=queue_manager.go:485 component=remote msg="Error sending samples to remote storage" count=8 err="Post <remote_storage_url>: getsockopt: no route to host"
level=warn ts=2018-01-19T08:25:26.412528632Z caller=queue_manager.go:485 component=remote msg="Error sending samples to remote storage" count=8 err="Post <remote_storage_url>: getsockopt: no route to host"
level=warn ts=2018-01-19T08:25:29.418470232Z caller=queue_manager.go:485 component=remote msg="Error sending samples to remote storage" count=8 err="Post <remote_storage_url>: getsockopt: no route to host"

One more case of low performance with Prometheus 2.*: our Prometheus is deployed in OpenShift and has about 60 targets monitored via the snmp, blackbox and other exporters. When we try to reload the Prometheus configuration from the external network, we get a 504 error (30-second timeout). When we trigger the reload from the internal network, we get a response reporting a successful update after 90-120 seconds.
Prometheus started on a local PC:
We have 2000 targets monitored via SNMP; some of them are reachable, some are not.
Prometheus 2.* reloads its configuration around 100 times more slowly than 1.7.1. How can that be? It looks like a performance bug.
Prometheus 2.*
$ time curl -X POST http://localhost:9090/-/reload
real 0m 10.007s
user 0m 0.004s
sys 0m 0.007s
Prometheus 1.7.1
$ time curl -X POST http://localhost:9090/-/reload
real 0m 0.122s
user 0m 0.005s
sys 0m 0.002s
Prometheus configuration:
global:
  scrape_interval: 60s
  evaluation_interval: 60s
  external_labels:
    monitor: 'codelab-monitor'
scrape_configs:
- job_name: 'prometheus'
  static_configs:
  - targets: ['localhost:9090']
- job_name: 'node'
  scrape_interval: 5s
  static_configs:
  - targets: ['localhost:9100']
- job_name: 'snmpjob'
  metrics_path: /snmp
  params:
    module: [base]
  static_configs:
  - targets:
    - 192.168.56.1:20000
    - 192.168.56.1:20001
    - 192.168.56.1:20002
    - 192.168.56.1:20003
    ...
    - 192.168.56.1:21997
    - 192.168.56.1:21998
    - 192.168.56.1:21999
    - 192.168.56.1:22000
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: 127.0.0.1:9116
Goroutine counts per function, captured during the reload:
github.com/prometheus/prometheus/retrieval.(*scrapeLoop).run N=4702
net/http.(*Transport).getConn.func4 N=508
main.main.func2 N=1
github.com/prometheus/prometheus/retrieval.(*TargetManager).reload.func1 N=3
net/http.(*persistConn).readLoop N=523
net/http.(*conn).serve N=2
internal/singleflight.(*Group).doCall N=3
net/http.(*persistConn).writeLoop N=523
main.main.func4 N=1
runtime.timerproc N=1
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*SegmentWAL).run N=1
runtime/trace.Start.func1 N=1
github.com/prometheus/prometheus/retrieval.(*scrapePool).reload.func1 N=2003
context.WithDeadline.func2 N=31
github.com/prometheus/prometheus/vendor/github.com/cockroachdb/cmux.(*cMux).serve N=1
github.com/prometheus/prometheus/web.(*Handler).Run.func5 N=1
net.(*netFD).connect.func2 N=511
github.com/prometheus/prometheus/discovery.(*TargetSet).updateProviders.func1 N=3
net/http.(*connReader).backgroundRead N=1
github.com/prometheus/prometheus/discovery.(*StaticProvider).Run N=3
N=1136