Everything described below applies to Prometheus 2.0 and 2.1.
What did you do?
Prometheus sends samples to our remote storage. At some point the remote storage went down, and we saw this in the logs:
level=warn ts=2018-01-19T08:29:08.516631616Z caller=queue_manager.go:485 component=remote msg="Error sending samples to remote storage" count=1 err="Post <storage_url>: getsockopt: no route to host"
If we then try to reload the Prometheus configuration by sending a POST to <prometheus_url>/-/reload while the remote storage is down, the request never finishes. It looks like Prometheus keeps retrying to send data to the storage before it will reload the configuration.
What did you expect to see?
Prometheus should reload the configuration.
What did you see instead? Under which circumstances?
The configuration was not reloaded, and there was no response to the POST <prometheus_url>/-/reload request.
Environment
System information:
Linux 3.10.0-514.26.2.el7.x86_64 x86_64
Prometheus version:
2.0.0
Prometheus configuration file:
global:
  scrape_interval: 15s
  scrape_timeout: 10s
  evaluation_interval: 15s
  external_labels:
    monitor: default
...
remote_write:
- url: <remote_storage_url>
  remote_timeout: 30s
  write_relabel_configs:
  - source_labels: [indicatorName]
    separator: ;
    regex: (.+)
    replacement: $1
    action: keep
  queue_config:
    capacity: 100000
    max_shards: 1000
    max_samples_per_send: 100
    batch_send_deadline: 5s
    max_retries: 10
    min_backoff: 30ms
    max_backoff: 100ms
2018/01/19 08:25:22 Redirected: /-/reload
level=info ts=2018-01-19T08:25:22.313752379Z caller=main.go:490 msg="Loading configuration file" filename=/config/prometheus/default.yml
level=info ts=2018-01-19T08:25:22.319062591Z caller=queue_manager.go:253 component=remote msg="Stopping remote storage..."
level=warn ts=2018-01-19T08:25:23.406480826Z caller=queue_manager.go:485 component=remote msg="Error sending samples to remote storage" count=8 err="Post <remote_storage_url>: getsockopt: no route to host"
level=warn ts=2018-01-19T08:25:26.412528632Z caller=queue_manager.go:485 component=remote msg="Error sending samples to remote storage" count=8 err="Post <remote_storage_url>: getsockopt: no route to host"
level=warn ts=2018-01-19T08:25:29.418470232Z caller=queue_manager.go:485 component=remote msg="Error sending samples to remote storage" count=8 err="Post <remote_storage_url>: getsockopt: no route to host"

One more case of low performance with Prometheus 2.*: our Prometheus is deployed in OpenShift and has about 60 targets monitored via the snmp, blackbox and other exporters. When we try to reload the Prometheus configuration from the external network, we get a 504 error (30-second timeout). When we trigger the reload from the internal network, we get a response reporting a successful update after 90-120 seconds.
Prometheus started on a local PC:
We have 2000 targets monitored via SNMP; some of them are reachable, some are not.
Prometheus 2.* reloads its configuration around 100 times more slowly than 1.7.1. How can that be? It looks like a performance bug.
Prometheus 2.*
$ time curl -X POST http://localhost:9090/-/reload
real 0m 10.007s
user 0m 0.004s
sys 0m 0.007s
Prometheus 1.7.1
$ time curl -X POST http://localhost:9090/-/reload
real 0m 0.122s
user 0m 0.005s
sys 0m 0.002s
Prometheus configuration:
global:
  scrape_interval: 60s
  evaluation_interval: 60s
  external_labels:
    monitor: 'codelab-monitor'
scrape_configs:
- job_name: 'prometheus'
  static_configs:
  - targets: ['localhost:9090']
- job_name: 'node'
  scrape_interval: 5s
  static_configs:
  - targets: ['localhost:9100']
- job_name: 'snmpjob'
  metrics_path: /snmp
  params:
    module: [base]
  static_configs:
  - targets:
    - 192.168.56.1:20000
    - 192.168.56.1:20001
    - 192.168.56.1:20002
    - 192.168.56.1:20003
    ...
    - 192.168.56.1:21997
    - 192.168.56.1:21998
    - 192.168.56.1:21999
    - 192.168.56.1:22000
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: 127.0.0.1:9116
Goroutine counts per function, captured during the reload:
github.com/prometheus/prometheus/retrieval.(*scrapeLoop).run N=4702
net/http.(*Transport).getConn.func4 N=508
main.main.func2 N=1
github.com/prometheus/prometheus/retrieval.(*TargetManager).reload.func1 N=3
net/http.(*persistConn).readLoop N=523
net/http.(*conn).serve N=2
internal/singleflight.(*Group).doCall N=3
net/http.(*persistConn).writeLoop N=523
main.main.func4 N=1
runtime.timerproc N=1
github.com/prometheus/prometheus/vendor/github.com/prometheus/tsdb.(*SegmentWAL).run N=1
runtime/trace.Start.func1 N=1
github.com/prometheus/prometheus/retrieval.(*scrapePool).reload.func1 N=2003
context.WithDeadline.func2 N=31
github.com/prometheus/prometheus/vendor/github.com/cockroachdb/cmux.(*cMux).serve N=1
github.com/prometheus/prometheus/web.(*Handler).Run.func5 N=1
net.(*netFD).connect.func2 N=511
github.com/prometheus/prometheus/discovery.(*TargetSet).updateProviders.func1 N=3
net/http.(*connReader).backgroundRead N=1
github.com/prometheus/prometheus/discovery.(*StaticProvider).Run N=3
N=1136