Prometheus CPU usage on 4-core box

khusro...@holidayextras.com

Jul 3, 2017, 9:40:57 AM
to Prometheus Users

Hi there, 

I am running Prometheus on a 4-core Amazon instance, and its CPU usage during the day is consistently close to 100%, or otherwise very spiky. Is it worth increasing the number of cores in my case? I'm attaching a graph below. I've also noticed that disk utilisation is very spiky.

**Environment**

AWS EC2, m3.xlarge (4 vCPUs, 15 GB RAM). We are using an external EBS volume for storage.

* System information:
Linux 4.4.0-1013-aws x86_64

* Prometheus version: 
prometheus, version 1.6.2 (branch: master, revision: b38e977fd8cc2a0d13f47e7f0e17b82d1a908a9a)
  build user:       root@c99d9d650cf4
  build date:       20170511-12:59:13
  go version:       go1.8.1

* Container running config:

 "/bin/prometheus -config.file=/etc/prometheus/prometheus.yml -storage.local.path=/prometheus -alertmanager.url=http://alertmanager:9093 -storage.local.target-heap-size=8053063680"

* Prometheus configuration file:
```
# my global config
global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # By default, evaluate rules every 15 seconds.

# Load and evaluate rules in this file every 'evaluation_interval' seconds.
rule_files:
  - "alert.rules"
  # - "first.rules"
  # - "second.rules"

scrape_configs:
  - job_name: 'consul'
    scrape_interval: 4s
    metrics_path: '/__prometheus/pull'
    consul_sd_configs:
      - server: 'consul.host.name:8500'

    relabel_configs:
      - source_labels: [__meta_consul_tags]
        regex: .*,http,.*
        action: keep
      - source_labels: [__meta_consul_tags]
        regex: '.*,(http),.*'
        replacement: '${1}'
        target_label: instance

  - job_name: 'pushgateway'

    scrape_interval: 4s
    honor_labels: true
    metrics_path: '/metrics'
    static_configs:
      - targets:
          - 'pushgateway:9091'

    metric_relabel_configs:
      - source_labels: [__scheme__]
        target_label: instance
        replacement: 'http'

  - job_name: 'prometheus'

    scrape_interval: 10s
    metrics_path: '/metrics'
    static_configs:
      - targets: ['localhost:9090','cadvisor:8080','node-exporter:9100']
```

Ben Kochie

Jul 3, 2017, 10:06:32 AM
to khusro...@holidayextras.com, Prometheus Users
Interesting, I usually don't see a major daily cycle on Prometheus servers.  Do you have some kind of auto-scaling system that follows a daily cycle?

Depending on the number of targets, I would probably say yes, time to scale up this Prometheus server's CPU.
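If you want to sanity-check the ingestion load before resizing, the server's own metrics are the quickest read. A rough sketch (the prometheus_local_storage_* names below come from the 1.x local-storage engine; grep your /metrics page for the exact spellings in your release):

```
# Dump the local-storage self-metrics from the server itself.
curl -s http://localhost:9090/metrics | grep '^prometheus_local_storage'

# As a PromQL query, the sustained ingestion rate:
#   rate(prometheus_local_storage_ingested_samples_total[5m])
```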


khusro...@holidayextras.com

Jul 3, 2017, 11:35:14 AM
to Prometheus Users, khusro...@holidayextras.com
Thanks Ben. The targets it is scraping run on Amazon's ECS (containers), which means they do get recycled once in a while, but the set of running containers does not change. So although we see a dip in CPU usage during the night, the containers (and their number) should be fairly static. Our apps do remove their metrics from memory after they have been scraped, but that's just a good thing to do anyway.

I am thinking of increasing the cores too. I have a total of 88 targets at the moment and the number of metrics per target varies.
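To see which of those targets are the heavy ones, something like this should work (scrape_samples_scraped is one of Prometheus's synthetic per-target series; I believe 1.6 records it, but check your version if the query comes back empty):

```
# Rank targets by the number of samples returned per scrape.
topk(10, scrape_samples_scraped)
```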

Ben Kochie

Jul 3, 2017, 11:50:13 AM
to khusro...@holidayextras.com, Prometheus Users
On Mon, Jul 3, 2017 at 5:35 PM, khusro.jaleel via Prometheus Users <promethe...@googlegroups.com> wrote:
> Our apps do remove their metrics from memory after they have been scraped, but that's just a good thing to do anyway.

Removing metrics after a scrape is generally the wrong thing to do; it violates a number of Prometheus design standards. The big one is that it makes high availability impossible: if two Prometheus servers scrape the same target, each only sees whatever the other hasn't already consumed and deleted.
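For reference, the idiomatic pattern with the official Python client is to keep metrics registered for the lifetime of the process and let every scrape read the current state. A minimal sketch (the metric name and port are illustrative, not from your setup):

```python
# Minimal sketch using the official prometheus_client library.
# Metrics stay registered across scrapes; nothing is deleted after a scrape,
# so any number of Prometheus servers can scrape this process and see the
# same state -- which is what makes an HA pair possible.
import random
import time

from prometheus_client import Gauge, start_http_server

QUEUE_DEPTH = Gauge('app_queue_depth', 'Current depth of the work queue')

if __name__ == '__main__':
    start_http_server(8000)  # serve /metrics on :8000
    while True:
        QUEUE_DEPTH.set(random.randint(0, 100))  # stand-in for real app state
        time.sleep(5)
```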

khusro...@holidayextras.com

Jul 3, 2017, 12:09:38 PM
to Prometheus Users, khusro...@holidayextras.com
It violates design standards? Where can I find out more, so I can mention this to the devs? We are currently using federation with a slave instance: the slave pulls the data from the master, which means we don't really have to worry about deleting the metrics from the container instances. Roughly, the slave's scrape job looks like the sketch below.
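```yaml
# Sketch of the slave's federation scrape job; 'prometheus-master' is a
# placeholder for the real host. honor_labels keeps the original job and
# instance labels from the master rather than overwriting them.
scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job=~".+"}'   # which series to pull; narrow this in practice
    static_configs:
      - targets:
          - 'prometheus-master:9090'
```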

Lastly, any ideas about the excessive disk I/O? My understanding is that Prometheus only writes to disk when it checkpoints? Our urgency scores are really low, so it's not as if it should be flushing to disk all the time.

Ben Kochie

Jul 3, 2017, 12:34:55 PM
to khusro...@holidayextras.com, Prometheus Users
There are two sections worth reviewing in our main docs.

The docs on writing client libraries and exporters (I highly recommend using our official libraries if possible):
https://prometheus.io/docs/instrumenting/writing_clientlibs/

There is also a section on practices:

As for the disk IO, I'm not sure.



khusro...@holidayextras.com

Jul 3, 2017, 12:43:33 PM
to Prometheus Users, khusro...@holidayextras.com
OK, thanks. Any ideas on how to figure out what it's trying to do when it hits the disk? I've temporarily disabled the logs, so it's not logging that's causing it.
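In case it helps anyone else, this is what I'm planning to run to narrow it down (standard Linux tooling plus Prometheus's own local-storage self-metrics; the metric names vary a little between 1.x releases, so grep first):

```
# Per-process disk I/O, sampled every 5 seconds (pidstat is in the sysstat
# package); look for the prometheus process in the kB_wr/s column.
pidstat -d 5

# Accumulated per-process I/O since iotop started; needs root.
sudo iotop -o -a

# Local-storage self-metrics worth graphing, e.g. the persistence urgency
# score and the number of chunks waiting to be persisted.
curl -s http://localhost:9090/metrics | grep '^prometheus_local_storage'
```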
