Understanding when Prometheus is Overloaded


Peter Zaitsev

Oct 6, 2017, 2:17:27 PM
to Prometheus Developers
Hi, 

I was experimenting to see which metrics in Prometheus can help me understand when Prometheus is overloaded to the point that it can't scrape metrics.

To simulate the overload I ran "stress" to take away CPU resources on the Prometheus box:

root@celenuc1:~# stress -c 64
stress: info: [8773] dispatching hogs: 64 cpu, 0 io, 0 vm, 0 hdd


As I start this process, I see the number of ingested time series go down a lot:

[Inline image: ingested samples graph]


It was my expectation that, because some scrapes are being missed, I should be able to see

prometheus_target_skipped_scrapes_total

grow to a significant value; however, it looks like this variable is NOT being increased at all in this situation.

Is there any other variable I can track to see that there is not enough CPU available to execute all scrapes?

--
Peter Zaitsev, CEO, Percona
Tel: +1 888 401 3401 ext 7360   Skype:  peter_zaitsev



Peter Zaitsev

Oct 6, 2017, 2:32:42 PM
to Prometheus Developers
Hi,

Sorry, it was inconsiderate of me not to include the Prometheus version. This is 1.7.1.

Björn Rabenstein

Oct 9, 2017, 8:36:24 AM
to Peter Zaitsev, Prometheus Developers
On 6 October 2017 at 20:17, Peter Zaitsev <p...@percona.com> wrote:
>
>
> It was my expectation that, because some scrapes are being missed, I should be able to see
>
> prometheus_target_skipped_scrapes_total
>
> grow to a significant value; however, it looks like this variable is NOT being increased at all in this situation.

Yeah, this metric is only used for scrapes skipped because the storage requested throttling. (The HELP text is: “Total number of scrapes that were skipped because the metric storage was throttled.”) The storage usually needs throttling because the underlying storage device cannot keep up with the I/O load. In practice, that usually happens before Prometheus is CPU starved. Most CPU is burned earlier in the stack; notable contributors are protobuf handling and hash calculation. If you systematically take CPU away from your Prometheus server, it will probably take long enough for those steps that the disk stays quite relaxed, as less data even makes it to the disk.
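
As a rough way to see this from the outside (a sketch only, assuming Prometheus scrapes itself under the default job="prometheus" and listens on localhost:9090), the Go client's process_cpu_seconds_total shows roughly how many cores the server is actually burning:

# sketch: CPU cores currently used by the Prometheus process itself
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(process_cpu_seconds_total{job="prometheus"}[5m])'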

> Is there any other variable I can track to see that there is not enough CPU available to execute all scrapes?

Fundamentally, if prometheus_local_storage_persistence_urgency_score is low (much smaller than 1) but you don't see the full scrape rate you expect, you have a problem outside of disk I/O. It could be CPU starvation, but also something else like indexing problems (a different story I could tell, but your case here is about CPU). For a more sophisticated analysis in terms of “expected scrape rate”, you might want to look at prometheus_target_interval_length_seconds, which tells you, for each scrape interval, the distribution of actually observed scrape intervals.
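
As a sketch (assuming Prometheus listens on localhost:9090 and, as in 1.x, exposes this summary with a quantile label), the 99th percentile of the actually observed intervals can be pulled via the HTTP API:

# sketch: 99th percentile of observed scrape interval lengths, per configured interval
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=prometheus_target_interval_length_seconds{quantile="0.99"}'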

Usual Prometheus 2.0 considerations: Disk I/O ops needed for persistence are dramatically reduced with Prometheus 2.0 (by 2+ orders of magnitude), which makes CPU starvation more likely in relative terms, but CPU usage also drops a lot (by less, “only” about one order of magnitude). Prometheus 2.0 doesn't have to hash every single incoming sample anymore, and it avoids protobuf overhead by only using the text format (although a more efficient protobuf library like https://github.com/gogo/protobuf could possibly have accomplished similar effects).

--
Björn Rabenstein, Engineer
http://soundcloud.com/brabenstein

SoundCloud Ltd. | Rheinsberger Str. 76/77, 10115 Berlin, Germany
Managing Director: Alexander Ljung | Incorporated in England & Wales
with Company No. 6343600 | Local Branch Office | AG Charlottenburg |
HRB 110657B

Peter Zaitsev

Oct 9, 2017, 8:49:48 AM
to Björn Rabenstein, Prometheus Developers
Hi Bjorn,

Understood. So that metric is only for storage-related issues. We see a fair amount of CPU-related issues, as we get data frequently, and in some configurations (i.e. a large number of tables) there might be a lot of data coming in from relatively few servers.

The challenge is that folks tend to deploy in the cloud on under-powered instances... or, even worse, on instances which allow you to burst CPU usage for a time but then throttle it.

Is there any way to compute the expected scrape rate versus what is actually happening?

So far it looks like I can only tell users to look at the "Samples Ingested" graph and see if it looks "funny", with the number going randomly up and down.
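
Roughly, the expression behind that graph should be something like the following (a sketch, assuming the Prometheus 1.x local-storage metric name and localhost:9090); a sustained dip here without any configuration change is what looks "funny":

# sketch: samples ingested per second by the Prometheus 1.x local storage
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=rate(prometheus_local_storage_ingested_samples_total[5m])'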


Björn Rabenstein

Oct 10, 2017, 7:52:22 AM
to Peter Zaitsev, Prometheus Developers
On 9 October 2017 at 14:49, Peter Zaitsev <p...@percona.com> wrote:
>
> Is there any way to compute the expected scrape rate versus what is
> actually happening?

The best I can come up with at the moment is what I said before: For a more
sophisticated analysis in terms of “expected scrape rate”, you might
want to look at prometheus_target_interval_length_seconds, which tells
you, for each scrape interval, the distribution of actually observed
scrape intervals.
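
One way to turn that into an expected-versus-actual check is to compare the observed quantile against the configured interval. A sketch only (it assumes a 60s scrape interval, the 1.x quantile label, and Prometheus on localhost:9090), flagging targets whose 99th-percentile observed interval overshoots 60s by more than 10%:

# sketch: targets whose observed 99th-percentile interval exceeds 66s,
# i.e. more than 10% over an assumed 60s configured scrape interval
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=prometheus_target_interval_length_seconds{quantile="0.99"} > 66'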

If somebody has a better idea, please speak up.

Peter Zaitsev

Oct 10, 2017, 8:48:56 AM
to Björn Rabenstein, Prometheus Developers
Bjorn,

What really happens at the low level when there is a lack of CPU resources?

As I would imagine, this means that the next scrape can't be scheduled because, by the time its turn comes, the deadline at which it was supposed to be scheduled has already passed. Would it not be possible to add a counter to track such events?

Björn Rabenstein

Oct 10, 2017, 8:51:57 AM
to Peter Zaitsev, Prometheus Developers
On 10 October 2017 at 14:48, Peter Zaitsev <p...@percona.com> wrote:
>
> What really happens at the low level when there is a lack of CPU resources?
>
> As I would imagine, this means that the next scrape can't be scheduled
> because, by the time its turn comes, the deadline at which it was supposed
> to be scheduled has already passed. Would it not be possible to add a
> counter to track such events?

I haven't touched the scrape layer in a long time. I guess others are
required for an authoritative answer.

co...@freshtracks.io

Oct 10, 2017, 8:58:27 AM
to Prometheus Developers
Peter, can you share a screenshot of your scrape duration graph(s)?

I'm not too familiar with the code base, but I imagine scrape duration would start to show unusual distributions when the box is starved for CPU.

And if that's not the case, then that's pretty interesting too!
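
A rough way to pull that without a dashboard (a sketch, assuming Prometheus attaches the usual scrape_duration_seconds series to every target and listens on localhost:9090):

# sketch: per-target scrape durations as currently observed by Prometheus
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=scrape_duration_seconds'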

ch...@freshtracks.io

Oct 10, 2017, 9:21:21 AM
to Prometheus Developers
That's an interesting idea, and I think you could learn a lot from this number. The question that follows, for me, is one that can't really be answered without a ton of Prometheus experience:

Would it be more appropriate to measure that discrepancy as the difference (actual - expected) or as the ratio (actual / expected)? Which is more coherent with the decay pattern expected in Prometheus 1.7.1?