Prometheus server increases CPU usage beyond 200%

Isabel Noronha

May 14, 2020, 4:45:22 AM
to Prometheus Users
Hi,

Configuration of the server where Prometheus is running:
160 CPU cores
500 GB RAM
2 TB hard disk

Prometheus version: 2.18.0
cAdvisor version: 0.36.0

Prometheus is running inside a container.
I have already applied relabeling.
The retention period is 15 days.

I am using cAdvisor to collect metrics from around 4K containers.
I have applied relabeling to the container metrics as well.

The scrape interval is 40s.

I use the top command to check CPU usage.
To my surprise, Prometheus was exceeding 200% CPU usage.
The server where Prometheus is running hosts around 2K containers.

Another target hosts a further 2K containers,
so there are around 4K containers overall.

Could anyone help me understand what might be causing Prometheus to use this much CPU?

prometheus.yml:
# my global config
global:
  scrape_interval:     15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # By default, evaluate rules every 15 seconds.
  # scrape_timeout is set to the global default (10s).

  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
      monitor: 'prometheus-monitor'

# Load and evaluate rules in this file every 'evaluation_interval' seconds.
rule_files:
  # - 'alert.rules'
  # - "first.rules"
  - "alert_rules.yml"

# alert
alerting:
  alertmanagers:
  - scheme: http
    static_configs:
    - targets:
      - "server:9093"

# Scrape configurations: one job each for Prometheus itself,
# cAdvisor, node-exporter, and Docker.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.

  - job_name: 'prometheus'

    # Override the global default and scrape targets from this job every 40 seconds.
    scrape_interval: 40s
    scrape_timeout: 40s

    static_configs:
        - targets: ['localhost:9010']

  - job_name: 'cadvisor'

    # Override the global default and scrape targets from this job every 40 seconds.
    scrape_interval: 40s
    scrape_timeout: 40s

    static_configs:
          - targets: ['server1:8080', 'server2:8080', 'server3:8080']
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: '(container_fs_writes_total|container_fs_reads_total|container_tasks_state|container_cpu_user_seconds_total|container_last_seen|container_memory_usage_bytes|container_cpu_usage_seconds_total|container_network_transmit_bytes_total|container_memory_rss|container_network_receive_bytes_total|container_memory_cache|cadvisor_version_info)'
        action: keep

  - job_name: 'node-exporter'

    # Scrape targets from this job every 15 seconds (same as the global default).
    scrape_interval: 15s
    scrape_timeout: 15s
    static_configs:
          - targets: ['server1:9100', 'server2:9100', 'server3:9100']

    metric_relabel_configs:
      - source_labels: [__name__]
        regex: '(process_start_time_seconds|node_load1|node_exporter_build_info|node_uname_info|node_network_receive_bytes_total|node_network_transmit_bytes_total|node_memory_Buffers_bytes|node_memory_Cached_bytes|node_memory_MemFree_bytes|node_memory_SwapCached_bytes|node_memory_PageTables_bytes|node_memory_VmallocUsed_bytes|node_memory_SwapTotal_bytes|node_memory_Committed_AS_bytes|node_memory_Active_bytes|node_memory_Mapped_bytes|node_memory_Inactive_bytes|node_cpu_seconds_total|node_filesystem_avail_bytes|node_filesystem_size_bytes|node_memory_MemAvailable_bytes|node_memory_MemTotal_bytes|node_filesystem_free_bytes)'
        action: keep
  - job_name: 'docker'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    scrape_interval: 5s

    static_configs:
      - targets: ['172.17.0.1:9999']

Thank you!


Stuart Clark

May 14, 2020, 4:50:22 AM
to Isabel Noronha, Prometheus Users
Memory, CPU and disk usage will be down to a number of different
tasks:

- Scraping (more targets/time series, more resources)
- Recording rules (more rules touching more data, more resources)
- Queries (more queries and more complex queries, more resources)
- WAL processing, compaction and expiry (more time series, more resources)

The costs of these tasks add together. Prometheus exposes various metrics
that show the number of scrapes, time series, queries, etc.
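
As a rough illustration of the kind of metrics meant here, the rule file
below is a sketch only. It assumes Prometheus scrapes itself under
job="prometheus", as in the posted config. The file name, group name and
record names are made up for illustration; the metrics in the expressions
are ones Prometheus exposes about itself or attaches to every scrape.

```yaml
# Hypothetical file, e.g. self_monitoring.yml; it would need to be
# listed under rule_files in prometheus.yml to take effect.
groups:
  - name: prometheus_self_monitoring   # hypothetical group name
    rules:
      # CPU cores used by the Prometheus process (2.0 is about 200% in top).
      - record: selfmon:cpu_cores_used:rate5m
        expr: rate(process_cpu_seconds_total{job="prometheus"}[5m])

      # Number of active time series currently held in the head block.
      - record: selfmon:head_series
        expr: prometheus_tsdb_head_series{job="prometheus"}

      # Samples ingested per second across all scrape jobs.
      - record: selfmon:samples_appended:rate5m
        expr: rate(prometheus_tsdb_head_samples_appended_total{job="prometheus"}[5m])

      # Samples exposed by targets per scrape, before metric relabeling.
      - record: selfmon:samples_scraped:sum_by_job
        expr: sum by (job) (scrape_samples_scraped)

      # Samples kept per scrape after metric_relabel_configs are applied.
      - record: selfmon:samples_kept:sum_by_job
        expr: sum by (job) (scrape_samples_post_metric_relabeling)

      # Total time spent in the most recent evaluation of all rule groups.
      - record: selfmon:rule_eval_seconds:sum
        expr: sum(prometheus_rule_group_last_duration_seconds{job="prometheus"})
```

The same expressions can be run ad hoc in the expression browser instead
of being recorded. Comparing the scraped and kept sample counts per job
shows how much data is still fetched and parsed on every scrape even
though metric relabeling drops it afterwards.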

--
Stuart Clark