On 6 October 2017 at 20:17, Peter Zaitsev <p...@percona.com> wrote:
>
>
> It was my expectation that, because some scrapes are being fixed, I should be able to see
>
> prometheus_target_skipped_scrapes_total
>
> grow to a significant value. However, it looks like this variable is NOT being increased at all in this situation.
Yeah, this metric only counts scrapes skipped because the storage
requested throttling. (HELP text is: “Total number of scrapes
that were skipped because the metric storage was throttled.”) The
storage usually needs throttling because the underlying storage device
cannot keep up with the I/O load. In practice, that usually happens
before Prometheus is CPU starved. Most CPU is burned earlier in the
stack; notable contributors are protobuf handling and hash
calculation. If you systematically take CPU away from your Prometheus
server, those steps will probably already take too long, while the
disk is actually quite relaxed because less data even makes it to the
disk.
> Is there any other variable I can track to see that there is not enough CPU available to execute all scrapes?
Fundamentally, if prometheus_local_storage_persistence_urgency_score
is low (much smaller than 1) but you don't see the full scrape rate
you expect, you have a problem outside of disk I/O. It could be CPU
starvation, but also something else like indexing problems (a different
story I could tell, but your case here is about CPU). For a more
sophisticated analysis in terms of “expected scrape rate”, you might
want to look at prometheus_target_interval_length_seconds, which tells
you, for each scrape interval, the distribution of actually observed
scrape intervals.
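
A minimal sketch of that reasoning in Python, in case it helps. It
assumes you have already read the two values from your Prometheus
server yourself: the current
prometheus_local_storage_persistence_urgency_score and an observed
scrape interval taken from prometheus_target_interval_length_seconds
(for example a high quantile for the interval in question). The
function name, the return labels, and both thresholds are made up for
illustration; they are not Prometheus defaults.

```python
def diagnose(urgency_score, observed_interval_s, configured_interval_s,
             urgency_threshold=0.5, slack=1.1):
    """Classify a slow-scrape situation along the lines described above.

    urgency_threshold and slack are illustrative assumptions:
    "much smaller than 1" is approximated as "< 0.5", and scrapes count
    as late once the observed interval exceeds the configured one by 10%.
    """
    scrapes_late = observed_interval_s > configured_interval_s * slack
    if not scrapes_late:
        # Scrapes happen roughly as often as configured.
        return "ok"
    if urgency_score < urgency_threshold:
        # Storage is relaxed, yet scrapes are late: the problem is
        # outside of disk I/O (CPU starvation, indexing, ...).
        return "not_disk"
    # Storage is under pressure; disk I/O may be the bottleneck.
    return "disk"


if __name__ == "__main__":
    # Hypothetical readings: 15s configured, 30s observed, low urgency.
    print(diagnose(0.05, 30.0, 15.0))
```

With those hypothetical readings the result points away from the disk,
which is the CPU-starved case discussed here.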
The usual Prometheus 2.0 considerations: Disk I/O ops needed for
persistence are dramatically reduced with Prometheus 2.0 (by something
like 2+ orders of magnitude), which makes CPU starvation more likely
(in relative terms). CPU usage also drops a lot, but by less (“only”
~one order of magnitude). Prometheus 2.0 doesn't have to hash every
single incoming sample anymore, and it avoids protobuf overhead by
only using the text format (although a more efficient protobuf library
like https://github.com/gogo/protobuf could possibly have accomplished
similar effects).
--
Björn Rabenstein, Engineer
http://soundcloud.com/brabenstein