We have set up a Prometheus instance on an EC2 instance of type "m3.xlarge" (4 vCPUs and 15GB of RAM), using an external EBS volume for storage. It has been working for about a month now, slowly ingesting metrics from nodes found through Consul service discovery. It is also scraping a Pushgateway running in a container on the same EC2 host.
I also set the target heap size to approximately 50% of the host's memory.
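For reference, the 50% figure can be checked against the `-storage.local.target-heap-size` value shown in the container config further down. This is only a quick sanity check, assuming the instance's "15GB" of RAM means 15 GiB; the variable names are illustrative.

```python
# Value passed to -storage.local.target-heap-size, in bytes.
target_heap_bytes = 8053063680

# Assumption: the "15GB" of RAM on the m3.xlarge is 15 GiB.
GIB = 1024 ** 3
total_ram_bytes = 15 * GIB

heap_gib = target_heap_bytes / GIB
heap_fraction = target_heap_bytes / total_ram_bytes

print(f"target heap: {heap_gib} GiB ({heap_fraction:.0%} of RAM)")
# prints: target heap: 7.5 GiB (50% of RAM)
```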
Recently it has started using all the CPU on the EC2 host (load is around 5.0 on a 4-core box). The checkpointing times and urgency scores are also slowly creeping up. A few times per hour, the "up" metric returns "0" for about 30 seconds, then goes back to "1".
I have also noticed that it is no longer able to scrape the Pushgateway. The error I'm getting is "context deadline exceeded" and the target is marked as DOWN.
I am attaching some dashboards that I've been using to monitor this, as well as a goroutine dump.
**Environment**
AWS EC2. "m3.xlarge" (4 x vCPU and 15GB of RAM). We are using an external EBS volume for storage.
* System information:
Linux 4.4.0-1013-aws x86_64
* Prometheus version:
prometheus, version 1.6.2 (branch: master, revision: b38e977fd8cc2a0d13f47e7f0e17b82d1a908a9a)
build user: root@c99d9d650cf4
build date: 20170511-12:59:13
go version: go1.8.1
* Container running config:
"/bin/prometheus -config.file=/etc/prometheus/prometheus.yml -storage.local.path=/prometheus -alertmanager.url=http://alertmanager:9093 -storage.local.target-heap-size=8053063680"
* Prometheus configuration file:
```
# my global config
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # By default, scrape targets every 15 seconds.
# Load and evaluate rules in this file every 'evaluation_interval' seconds.
rule_files:
  - "alert.rules"
  # - "first.rules"
  # - "second.rules"
scrape_configs:
  - job_name: 'consul'
    scrape_interval: 4s
    metrics_path: '/__prometheus/pull'
    consul_sd_configs:
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: .*,http,.*
        action: keep
      - source_labels: [__meta_consul_tags]
        regex: '.*,(http),.*'
        replacement: '${1}'
        target_label: instance
  - job_name: 'pushgateway'
    scrape_interval: 4s
    honor_labels: true
    metrics_path: '/metrics'
    static_configs:
      - targets:
          - 'pushgateway:9091'
    metric_relabel_configs:
      - source_labels: [__scheme__]
        target_label: instance
        replacement: 'http'
  - job_name: 'prometheus'
    scrape_interval: 10s
    metrics_path: '/metrics'
    static_configs:
      - targets: ['localhost:9090','cadvisor:8080','node-exporter:9100']
```
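To illustrate what the second relabel rule in the `consul` job does, here is a rough Python sketch. It assumes that Prometheus anchors the relabel regex against the whole label value and that `__meta_consul_tags` holds the tags joined and surrounded by commas. The tag strings and addresses below are made-up examples, not values from our Consul catalog, and `relabel_instance` is a hypothetical helper, not part of Prometheus.

```python
import re

# Regex copied from the 'consul' job's second relabel rule.
RULE_REGEX = re.compile(r".*,(http),.*")


def relabel_instance(consul_tags, default_instance):
    """Return the instance label after the rule is applied (illustrative only)."""
    # fullmatch mimics the assumed anchoring of relabel regexes.
    match = RULE_REGEX.fullmatch(consul_tags)
    if match is None:
        # Assumption: with no match the label is left untouched and
        # 'instance' later defaults to the target address.
        return default_instance
    # Python equivalent of the '${1}' replacement: first capture group.
    return match.expand(r"\g<1>")


# Hypothetical targets: (value of __meta_consul_tags, target address).
targets = [
    (",http,api,", "10.0.0.1:8080"),
    (",web,http,", "10.0.0.2:8080"),
    (",prod,http,v2,", "10.0.0.3:9000"),
]

instances = [relabel_instance(tags, addr) for tags, addr in targets]
print(instances)  # ['http', 'http', 'http']
```

If this reading is right, every target kept by the first rule ends up with `instance="http"`, which matches the label seen in the log lines below and may be why samples from different targets land on the same series.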
* Logs (last 50 lines):
```
time="2017-06-27T16:46:19Z" level=warning msg="Scrape duration sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.103607008 @[1498581979.007] source="scrape.go:589"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.103607008 @[1498581979.007] source="scrape.go:592"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count post-relabeling sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.103607008 @[1498581979.007] source="scrape.go:595"
time="2017-06-27T16:46:19Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=373 source="scrape.go:536"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape health sample discarded" error="sample timestamp out of order" sample=up{instance="http", job="consul"} => 1 @[1498581979.03] source="scrape.go:586"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape duration sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.134071557 @[1498581979.03] source="scrape.go:589"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.134071557 @[1498581979.03] source="scrape.go:592"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count post-relabeling sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.134071557 @[1498581979.03] source="scrape.go:595"
time="2017-06-27T16:46:19Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=642 source="scrape.go:536"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape health sample discarded" error="sample timestamp out of order" sample=up{instance="http", job="consul"} => 1 @[1498581979.03] source="scrape.go:586"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape duration sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.175078637 @[1498581979.03] source="scrape.go:589"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.175078637 @[1498581979.03] source="scrape.go:592"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count post-relabeling sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.175078637 @[1498581979.03] source="scrape.go:595"
time="2017-06-27T16:46:19Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=511 source="scrape.go:536"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape health sample discarded" error="sample timestamp out of order" sample=up{instance="http", job="consul"} => 1 @[1498581979.268] source="scrape.go:586"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape duration sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.143751463 @[1498581979.268] source="scrape.go:589"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.143751463 @[1498581979.268] source="scrape.go:592"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape health sample discarded" error="sample timestamp out of order" sample=up{instance="http", job="consul"} => 1 @[1498581979.37] source="scrape.go:586"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count post-relabeling sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.143751463 @[1498581979.268] source="scrape.go:595"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape duration sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.042173088 @[1498581979.37] source="scrape.go:589"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.042173088 @[1498581979.37] source="scrape.go:592"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count post-relabeling sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.042173088 @[1498581979.37] source="scrape.go:595"
time="2017-06-27T16:46:19Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=1 source="scrape.go:536"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape duration sample discarded" error="sample with repeated timestamp but different value" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.059883943 @[1498581979.371] source="scrape.go:589"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count sample discarded" error="sample with repeated timestamp but different value" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.059883943 @[1498581979.371] source="scrape.go:592"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count post-relabeling sample discarded" error="sample with repeated timestamp but different value" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.059883943 @[1498581979.371] source="scrape.go:595"
time="2017-06-27T16:46:19Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=1 source="scrape.go:536"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape duration sample discarded" error="sample with repeated timestamp but different value" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.060360829 @[1498581979.371] source="scrape.go:589"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count sample discarded" error="sample with repeated timestamp but different value" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.060360829 @[1498581979.371] source="scrape.go:592"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count post-relabeling sample discarded" error="sample with repeated timestamp but different value" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.060360829 @[1498581979.371] source="scrape.go:595"
time="2017-06-27T16:46:19Z" level=warning msg="Error on ingesting out-of-order samples" numDropped=2 source="scrape.go:533"
time="2017-06-27T16:46:19Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=573 source="scrape.go:536"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape health sample discarded" error="sample timestamp out of order" sample=up{instance="http", job="consul"} => 1 @[1498581979.409] source="scrape.go:586"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape duration sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.134828594 @[1498581979.409] source="scrape.go:589"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.134828594 @[1498581979.409] source="scrape.go:592"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count post-relabeling sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.134828594 @[1498581979.409] source="scrape.go:595"
time="2017-06-27T16:46:19Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=1 source="scrape.go:536"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape health sample discarded" error="sample timestamp out of order" sample=up{instance="http", job="consul"} => 0 @[1498581977.641] source="scrape.go:586"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape duration sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 2.010976508 @[1498581977.641] source="scrape.go:589"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 2.010976508 @[1498581977.641] source="scrape.go:592"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count post-relabeling sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 2.010976508 @[1498581977.641] source="scrape.go:595"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape health sample discarded" error="sample timestamp out of order" sample=up{instance="http", job="consul"} => 1 @[1498581979.673] source="scrape.go:586"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape duration sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.0632237 @[1498581979.673] source="scrape.go:589"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.0632237 @[1498581979.673] source="scrape.go:592"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count post-relabeling sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.0632237 @[1498581979.673] source="scrape.go:595"
time="2017-06-27T16:46:20Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=552 source="scrape.go:536"
time="2017-06-27T16:46:20Z" level=warning msg="Scrape health sample discarded" error="sample timestamp out of order" sample=up{instance="http", job="consul"} => 1 @[1498581979.875] source="scrape.go:586"
time="2017-06-27T16:46:20Z" level=warning msg="Scrape duration sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.149382989 @[1498581979.875] source="scrape.go:589"
time="2017-06-27T16:46:20Z" level=warning msg="Scrape sample count sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.149382989 @[1498581979.875] source="scrape.go:592"
time="2017-06-27T16:46:20Z" level=warning msg="Scrape sample count post-relabeling sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.149382989 @[1498581979.875] source="scrape.go:595"
```