Very high CPU usage on Prometheus instance and Push Gateway errors

Khusro Jaleel

Jun 27, 2017, 1:08:38 PM
to Prometheus Users

Hello there,


We have set up a Prometheus instance on an EC2 instance of type "m3.xlarge" (4 vCPUs and 15 GB of RAM), using an external EBS volume for storage. It has been working for about a month now, slowly ingesting metrics via Consul discovery of nodes. It is also scraping a Push Gateway running in a container on the same EC2 host.

I also tweaked it to use approximately 50% of the memory for heap usage (-storage.local.target-heap-size=8053063680, i.e. about 7.5 GiB).

Recently it has been using up all the CPU on the EC2 host (load is around 5.0 on a 4-core box). The checkpointing times and urgency scores are also slowly creeping up. A few times per hour, the "up" metric returns "0" for about 30 seconds or so, then goes back to "1".

I have also noticed that it's no longer able to scrape the push gateway. The error I'm getting is "context deadline exceeded" and the target is marked as DOWN.
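
For context, this is roughly what the pushgateway job would look like with a longer scrape timeout; as far as I understand, the timeout cannot exceed the scrape interval, so both are raised here. The 15s/10s values are only illustrative, not something I have verified fixes this:

```
  - job_name: 'pushgateway'
    # Illustrative values only: with a 4s scrape_interval the effective
    # timeout is also capped at ~4s, which a heavily loaded host can exceed.
    scrape_interval: 15s
    scrape_timeout: 10s
    honor_labels: true
    metrics_path: '/metrics'
    static_configs:
      - targets:
          - 'pushgateway:9091'
```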

I am attaching some dashboards that I've been using to monitor this, as well as a goroutine dump.

**Environment**

AWS EC2, "m3.xlarge" instance type (4 vCPUs and 15 GB of RAM), with an external EBS volume for storage.

* System information:
Linux 4.4.0-1013-aws x86_64

* Prometheus version: 
prometheus, version 1.6.2 (branch: master, revision: b38e977fd8cc2a0d13f47e7f0e17b82d1a908a9a)
  build user:       root@c99d9d650cf4
  build date:       20170511-12:59:13
  go version:       go1.8.1

* Container running config:

 "/bin/prometheus -config.file=/etc/prometheus/prometheus.yml -storage.local.path=/prometheus -alertmanager.url=http://alertmanager:9093 -storage.local.target-heap-size=8053063680"

* Prometheus configuration file:
```
# my global config
global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # By default, evaluate rules every 15 seconds.

# Load and evaluate rules in this file every 'evaluation_interval' seconds.
rule_files:
  - "alert.rules"
  # - "first.rules"
  # - "second.rules"

scrape_configs:
  - job_name: 'consul'
    scrape_interval: 4s
    metrics_path: '/__prometheus/pull'
    consul_sd_configs:
      - server: 'consul.host.name:8500'

    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: .*,http,.*
        action: keep
      - source_labels: [__meta_consul_tags]
        regex: '.*,(http),.*'
        replacement: '${1}'
        target_label: instance

  - job_name: 'pushgateway'

    scrape_interval: 4s
    honor_labels: true
    metrics_path: '/metrics'
    static_configs:
      - targets:
          - 'pushgateway:9091'

    metric_relabel_configs:
      - source_labels: [__scheme__]
        target_label: instance
        replacement: 'http'

  - job_name: 'prometheus'

    scrape_interval: 10s
    metrics_path: '/metrics'
    static_configs:
      - targets: ['localhost:9090','cadvisor:8080','node-exporter:9100']
```
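
One detail that may be related to the "sample timestamp out of order" / "different value but same timestamp" warnings in the logs below: the consul job's second relabel rule rewrites instance to the literal string "http" for every discovered target, so all of those targets end up writing to the same series (e.g. up{instance="http", job="consul"}). Here is a minimal sketch of a relabeling that keeps a distinct instance per target; using __meta_consul_node is just one illustrative choice, not necessarily the labeling we actually want:

```
    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: .*,http,.*
        action: keep
      # Sketch: label each target with its Consul node name instead of a
      # constant, so series from different targets no longer collide.
      # Dropping the rewrite entirely would also leave the default
      # instance=<__address__> in place.
      - source_labels: [__meta_consul_node]
        target_label: instance
```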

* Logs (last 50 lines):
```
time="2017-06-27T16:46:19Z" level=warning msg="Scrape duration sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.103607008 @[1498581979.007] source="scrape.go:589"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.103607008 @[1498581979.007] source="scrape.go:592"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count post-relabeling sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.103607008 @[1498581979.007] source="scrape.go:595"
time="2017-06-27T16:46:19Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=373 source="scrape.go:536"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape health sample discarded" error="sample timestamp out of order" sample=up{instance="http", job="consul"} => 1 @[1498581979.03] source="scrape.go:586"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape duration sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.134071557 @[1498581979.03] source="scrape.go:589"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.134071557 @[1498581979.03] source="scrape.go:592"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count post-relabeling sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.134071557 @[1498581979.03] source="scrape.go:595"
time="2017-06-27T16:46:19Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=642 source="scrape.go:536"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape health sample discarded" error="sample timestamp out of order" sample=up{instance="http", job="consul"} => 1 @[1498581979.03] source="scrape.go:586"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape duration sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.175078637 @[1498581979.03] source="scrape.go:589"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.175078637 @[1498581979.03] source="scrape.go:592"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count post-relabeling sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.175078637 @[1498581979.03] source="scrape.go:595"
time="2017-06-27T16:46:19Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=511 source="scrape.go:536"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape health sample discarded" error="sample timestamp out of order" sample=up{instance="http", job="consul"} => 1 @[1498581979.268] source="scrape.go:586"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape duration sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.143751463 @[1498581979.268] source="scrape.go:589"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.143751463 @[1498581979.268] source="scrape.go:592"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape health sample discarded" error="sample timestamp out of order" sample=up{instance="http", job="consul"} => 1 @[1498581979.37] source="scrape.go:586"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count post-relabeling sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.143751463 @[1498581979.268] source="scrape.go:595"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape duration sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.042173088 @[1498581979.37] source="scrape.go:589"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.042173088 @[1498581979.37] source="scrape.go:592"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count post-relabeling sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.042173088 @[1498581979.37] source="scrape.go:595"
time="2017-06-27T16:46:19Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=1 source="scrape.go:536"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape duration sample discarded" error="sample with repeated timestamp but different value" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.059883943 @[1498581979.371] source="scrape.go:589"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count sample discarded" error="sample with repeated timestamp but different value" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.059883943 @[1498581979.371] source="scrape.go:592"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count post-relabeling sample discarded" error="sample with repeated timestamp but different value" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.059883943 @[1498581979.371] source="scrape.go:595"
time="2017-06-27T16:46:19Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=1 source="scrape.go:536"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape duration sample discarded" error="sample with repeated timestamp but different value" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.060360829 @[1498581979.371] source="scrape.go:589"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count sample discarded" error="sample with repeated timestamp but different value" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.060360829 @[1498581979.371] source="scrape.go:592"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count post-relabeling sample discarded" error="sample with repeated timestamp but different value" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.060360829 @[1498581979.371] source="scrape.go:595"
time="2017-06-27T16:46:19Z" level=warning msg="Error on ingesting out-of-order samples" numDropped=2 source="scrape.go:533"
time="2017-06-27T16:46:19Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=573 source="scrape.go:536"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape health sample discarded" error="sample timestamp out of order" sample=up{instance="http", job="consul"} => 1 @[1498581979.409] source="scrape.go:586"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape duration sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.134828594 @[1498581979.409] source="scrape.go:589"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.134828594 @[1498581979.409] source="scrape.go:592"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count post-relabeling sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.134828594 @[1498581979.409] source="scrape.go:595"
time="2017-06-27T16:46:19Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=1 source="scrape.go:536"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape health sample discarded" error="sample timestamp out of order" sample=up{instance="http", job="consul"} => 0 @[1498581977.641] source="scrape.go:586"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape duration sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 2.010976508 @[1498581977.641] source="scrape.go:589"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 2.010976508 @[1498581977.641] source="scrape.go:592"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count post-relabeling sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 2.010976508 @[1498581977.641] source="scrape.go:595"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape health sample discarded" error="sample timestamp out of order" sample=up{instance="http", job="consul"} => 1 @[1498581979.673] source="scrape.go:586"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape duration sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.0632237 @[1498581979.673] source="scrape.go:589"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.0632237 @[1498581979.673] source="scrape.go:592"
time="2017-06-27T16:46:19Z" level=warning msg="Scrape sample count post-relabeling sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.0632237 @[1498581979.673] source="scrape.go:595"
time="2017-06-27T16:46:20Z" level=warning msg="Error on ingesting samples with different value but same timestamp" numDropped=552 source="scrape.go:536"
time="2017-06-27T16:46:20Z" level=warning msg="Scrape health sample discarded" error="sample timestamp out of order" sample=up{instance="http", job="consul"} => 1 @[1498581979.875] source="scrape.go:586"
time="2017-06-27T16:46:20Z" level=warning msg="Scrape duration sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.149382989 @[1498581979.875] source="scrape.go:589"
time="2017-06-27T16:46:20Z" level=warning msg="Scrape sample count sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.149382989 @[1498581979.875] source="scrape.go:592"
time="2017-06-27T16:46:20Z" level=warning msg="Scrape sample count post-relabeling sample discarded" error="sample timestamp out of order" sample=scrape_duration_seconds{instance="http", job="consul"} => 0.149382989 @[1498581979.875] source="scrape.go:595"
```



Attachment: goroutine.dump.txt

universal...@gmail.com

Feb 14, 2018, 12:16:19 AM
to Prometheus Users
Hey Khusro,

Can you please share the Grafana JSON?
I am very new to Prometheus and could learn from the way you are monitoring it.

Thanks

Khusro Jaleel

Feb 14, 2018, 4:23:43 AM
to Prometheus Users
Hi, 

Those Grafana graphs are from Grafana.com, specifically the "Prometheus Benchmark" ones here: