Prometheus v1.7.1 - high CPU usage while only scraping itself, no rules and with one dashboard


Danny Kulchinsky

Sep 7, 2017, 2:36:57 PM
to Prometheus Users
Dear friends,

I'm experiencing a very strange situation with Prometheus on two identical servers.

We are running Prometheus v1.7.1 on two VMs (Debian 8.7, 4 cores, 32 GB RAM) with local SSD storage (retention is 168h). Prometheus runs in Docker (official image) without any resource limits.

cmd flags: -storage.local.retention=168h -storage.local.target-heap-size=22906492245
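
For context, the heap-size value above is given in bytes and works out to two thirds of the machine's RAM, assuming "32GB" means 32 GiB. As far as I recall, that matches the rule of thumb in the Prometheus 1.x storage documentation (leave about 50% headroom over the configured heap). A quick sanity check in Python, with purely illustrative variable names:

```python
# Sanity check for -storage.local.target-heap-size (the flag takes bytes).
# Assumption: "32GB RAM" means 32 GiB; the 2/3 ratio is a rule of thumb,
# not a hard limit.
GIB = 2**30
physical_ram_bytes = 32 * GIB

# Integer arithmetic avoids float rounding on large byte counts.
target_heap_bytes = physical_ram_bytes * 2 // 3

print(target_heap_bytes)                  # byte value as passed to the flag
print(round(target_heap_bytes / GIB, 2))  # the same value expressed in GiB
```

This gives 22906492245 bytes (about 21.33 GiB), so the flag value itself looks consistent with the hardware.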


CPU is constantly at 100%. I have tried the following:

1) Removed all targets except Prometheus itself
2) Removed all recording rules
3) There's just one dashboard being accessed (refreshed by me every few minutes) with Prometheus stats
4) Performed a clean shutdown (kill -SIGTERM) and started again - CPU spiked to 100% and stayed there

The only thing I haven't tried is purging the data dir (I'd rather not lose my history).

I can't quite understand what's going on... Based on a comment here, I took a CPU profile of the running instance (svg and pprof attached). I couldn't find any clues in it, but I don't have experience with Go profiling.
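
For anyone wanting to capture a comparable profile: Go's standard net/http/pprof handlers serve a CPU profile at /debug/pprof/profile, and the sketch below assumes the Prometheus server exposes them on its normal web port. The helper names and the localhost:9090 address are hypothetical, so adjust them to your setup.

```python
from urllib.parse import urlencode
from urllib.request import urlopen


def pprof_cpu_url(base_url: str, seconds: int = 30) -> str:
    """Build the URL of the net/http/pprof CPU profile endpoint."""
    return f"{base_url.rstrip('/')}/debug/pprof/profile?{urlencode({'seconds': seconds})}"


def fetch_cpu_profile(base_url: str, out_path: str, seconds: int = 30) -> None:
    """Download a CPU profile; the request blocks for roughly `seconds`."""
    # Allow extra time on top of the sampling window before timing out.
    with urlopen(pprof_cpu_url(base_url, seconds), timeout=seconds + 30) as resp:
        with open(out_path, "wb") as fh:
            fh.write(resp.read())


# Only builds and prints the URL; no network access happens here.
print(pprof_cpu_url("http://localhost:9090"))
```

Calling fetch_cpu_profile("http://localhost:9090", "cpu.pprof") would save the profile to a file that `go tool pprof` can open.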

Any ideas?


Danny
pprof.pb.gz
prof.svg

Danny Kulchinsky

Sep 7, 2017, 2:56:15 PM
to Prometheus Users
So, I went ahead and purged the data dir (just renamed it, actually). CPU is now low and stable.

The data dir size was 17G - is that too much? Is our retention (7 days) too long? Any clues as to what's causing this?

Going to restore the targets & rules gradually...


Danny

lup...@newdevices.com

Sep 11, 2017, 4:17:58 PM
to Prometheus Users
Could you share your Prometheus query?
What does your CPU usage look like per mode (system, idle, user and iowait)?

Provide the results based on this query:
sum by (cpu,mode,instance)(rate(node_cpu{instance="hostname:9100",mode="$mode"}[1m]))

Thanks,
Lp


On Thursday, September 7, 2017 at 2:36:57 PM UTC-4, Danny Kulchinsky wrote:

dan...@tuenti.com

Sep 12, 2017, 9:10:06 AM
to Prometheus Users
Thanks for your reply!

Since I had to purge the data from both Prometheus servers, I no longer have the historical metric series (I should have taken some screenshots). However, the usage was mainly in user mode.

Luckily, we also monitor these nodes using the Diamond collector and store the data in Graphite. Here are the details from there (before and after purging the data).

Prometheus-01




Prometheus-02 (shows same symptoms)



Both servers have now been running for ~5 days and seem to be just fine. We are going to hit the retention period soon (168h = 7 days), so maybe we'll start seeing issues then.


Regards,
Danny

Danny Kulchinsky

Sep 15, 2017, 7:28:11 PM
to Prometheus Users
So, as I suspected, after a week of relatively normal operation the CPU usage has spiked.

The system configuration is kept as static as possible: no new rules, etc.


Here's the query: sum by (mode,instance)(rate(node_cpu{instance="prometheus-01"}[5m]))
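
In case someone wants to reproduce the numbers without a dashboard, the same expression can be sent to the HTTP API's instant-query endpoint (/api/v1/query). This is only a rough sketch: the function names are made up for illustration, and the base URL passed in is an assumption about where Prometheus listens.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

QUERY = 'sum by (mode,instance)(rate(node_cpu{instance="prometheus-01"}[5m]))'


def instant_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL with the PromQL expression URL-encoded."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def run_instant_query(base_url: str, promql: str, timeout: float = 10.0) -> list:
    """Run the query and return the list of series from the JSON response."""
    with urlopen(instant_query_url(base_url, promql), timeout=timeout) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    return payload["data"]["result"]


def cpu_by_mode(result: list) -> dict:
    """Map (instance, mode) to the sample value of an instant-vector result."""
    return {
        (s["metric"].get("instance", ""), s["metric"].get("mode", "")): float(s["value"][1])
        for s in result
    }
```

For example, cpu_by_mode(run_instant_query("http://localhost:9090", QUERY)) would return one entry per CPU mode, with the address being whatever your server actually uses.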

And here's the graph:


Here are the free memory stats for the last week:



I'm not quite sure what's going on, and except for purging the data I have not been able to find a solution.


Danny