2 TB hard disk.
Prometheus is running inside a container.
I have already done relabeling.
Retention period is 15 days.
I am using cAdvisor to collect metrics from around 4K containers.
I have done relabeling for container metrics as well.
I used the top command to check CPU usage.
To my surprise, Prometheus was using more than 200% CPU.
The server where Prometheus is running hosts around 2K of these containers.
Across all servers there are around 4K containers in total.
Could anyone help me understand the possible reasons for Prometheus's high CPU usage?
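One way to narrow this down is to look at how many series and samples Prometheus itself reports handling. The following is only a minimal sketch, not a definitive diagnostic: it assumes the Prometheus HTTP API is reachable at localhost:9010 (the address used in the 'prometheus' scrape job; adjust it if the container maps a different port), and the helper names build_query_url, parse_vector and run_query are made up for illustration.

```python
import json
import urllib.parse
import urllib.request

# Assumption: Prometheus is reachable here (address taken from the 'prometheus' scrape job).
PROMETHEUS_URL = "http://localhost:9010"

# PromQL expressions over Prometheus's own metrics.
QUERIES = {
    "active series in head": "prometheus_tsdb_head_series",
    "samples ingested per second": "rate(prometheus_tsdb_head_samples_appended_total[5m])",
    "series per job": 'count by (job) ({__name__=~".+"})',
}


def build_query_url(base_url, promql):
    """Return the instant-query URL (/api/v1/query) for a PromQL expression."""
    query_string = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + query_string


def parse_vector(body):
    """Turn an instant-query JSON response into a list of (labels, value) pairs."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise RuntimeError("query failed: %s" % payload.get("error"))
    return [(r["metric"], float(r["value"][1])) for r in payload["data"]["result"]]


def run_query(base_url, promql, timeout=30):
    """Run one instant query against Prometheus and return the parsed result."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_vector(resp.read().decode("utf-8"))


if __name__ == "__main__":
    for title, promql in QUERIES.items():
        print(title)
        for labels, value in run_query(PROMETHEUS_URL, promql):
            print("  ", labels, value)
```

A large series count for the cadvisor job, or a high ingestion rate, would point at scrape volume rather than rule evaluation. Note that the "series per job" query touches every series and can itself be expensive on a busy server.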
# my global config
global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  evaluation_interval: 15s # By default, evaluate rules every 15 seconds.
  # scrape_timeout is set to the global default (10s).
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'prometheus-monitor'
# Load and evaluate rules in this file every 'evaluation_interval' seconds.
rule_files:
  # - 'alert.rules'
  # - "first.rules"
  - "alert_rules.yml"
# alert
alerting:
  alertmanagers:
    - scheme: http
      static_configs:
        - targets:
            - "server:9093"
# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any timeseries scraped from this config.
  - job_name: 'prometheus'
    # Override the global default and scrape targets from this job every 40 seconds.
    scrape_interval: 40s
    scrape_timeout: 40s
    static_configs:
      - targets: ['localhost:9010']
  - job_name: 'cadvisor'
    # Override the global default and scrape targets from this job every 40 seconds.
    scrape_interval: 40s
    scrape_timeout: 40s
    static_configs:
      - targets: ['server1:8080', 'server2:8080', 'server3:8080']
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: '(container_fs_writes_total|container_fs_reads_total|container_tasks_state|container_cpu_user_seconds_total|container_last_seen|container_memory_usage_bytes|container_cpu_usage_seconds_total|container_network_transmit_bytes_total|container_memory_rss|container_network_receive_bytes_total|container_memory_cache|cadvisor_version_info)'
        action: keep
  - job_name: 'node-exporter'
    # Scrape targets from this job every 15 seconds (same as the global default).
    scrape_interval: 15s
    scrape_timeout: 15s
    static_configs:
      - targets: ['server1:9100', 'server2:9100', 'server3:9100']
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: '(process_start_time_seconds|node_load1|node_exporter_build_info|node_uname_info|node_network_receive_bytes_total|node_network_transmit_bytes_total|node_memory_Buffers_bytes|node_memory_Cached_bytes|node_memory_MemFree_bytes|node_memory_SwapCached_bytes|node_memory_PageTables_bytes|node_memory_VmallocUsed_bytes|node_memory_SwapTotal_bytes|node_memory_Committed_AS_bytes|node_memory_Active_bytes|node_memory_Mapped_bytes|node_memory_Inactive_bytes|node_cpu_seconds_total|node_filesystem_avail_bytes|node_filesystem_size_bytes|node_memory_MemAvailable_bytes|node_memory_MemTotal_bytes|node_filesystem_free_bytes)'
        action: keep
  - job_name: 'docker'
    # metrics_path defaults to '/metrics'
    # scheme defaults to 'http'.
    scrape_interval: 5s
    static_configs: