Graph Tab in Prometheus


kekr...@gmail.com

Aug 17, 2022, 6:21:03 PM
to Prometheus Users
I am currently looking for all CPU alerts using the query ALERTS{alertname="CPUUtilization"}.

I am stepping through the graph time frame one click at a time. 

At the 12h range, I get one entry.  At 1d, I get zero entries.  At 2d, I get 4 entries, but not the one I found at 12h.  I would expect to get everything from 2d to now.

At 1w, I get 8 entries, but at 2w I only get 5 entries.  I would expect to get everything from 2w to now.

Last week I ran this same query and found the alert I was looking for back in April.  Today I ran the same query and I cannot find that alert from April.

I see this behavior in multiple Prometheus environments.

Is this a problem, or is this just the way graphing works in Prometheus?

Prometheus version is 2.29.1
Prometheus retention period is 1y
DB is currently 1.2TB.  There are DBs as large as 5TB in other Prometheus environments.


Brian Candler

Aug 18, 2022, 4:46:40 AM
to Prometheus Users
Presumably you are using the PromQL query browser built into Prometheus (not a third-party tool such as Grafana)?

When you draw a graph from time T1 to T2, you send the Prometheus API a range query, which repeatedly evaluates an instant vector expression over the time range from T1 to T2 with some step S.  The step S is chosen by the client so that a suitable number of points fit in the display, e.g. if it wants 200 data points it could choose step = (T2 - T1) / 200.  In the Prometheus graph view you can see this by moving your mouse left and right over the graph; a pop-up shows you each data point, and you can see it switch from point to point as you move left to right.
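For illustration, you can issue that kind of range query directly against the HTTP API; the host, time range and step value below are just placeholders, assuming a Prometheus listening on localhost:9090:

    curl -G 'http://localhost:9090/api/v1/query_range' \
      --data-urlencode 'query=ALERTS{alertname="CPUUtilization"}' \
      --data-urlencode 'start=2022-08-16T00:00:00Z' \
      --data-urlencode 'end=2022-08-18T00:00:00Z' \
      --data-urlencode 'step=864s'    # 2 days / 200 points = 864 seconds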

Therefore, it's showing the values of the timeseries at the instants T1, T1+S, T1+2S, ..., T2-S, T2.

When evaluating a timeseries at a given instant in time, Prometheus finds the closest value *at or before* that time (up to a maximum lookback interval, which by default is 5 minutes).

Therefore, your graph is showing *samples* of the data in the TSDB.  If you zoom out too far, you may be missing "interesting" values.  For example:

TSDB :  0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0  ...
Graph:       0         0         1         0         0 ...

Counters make this less of a problem: you can get your graph to show how the counter has *increased* between two adjacent points (usually then divided by the step time, to get a rate).
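For example (the metric name here is just an illustration of a typical counter), instead of graphing the raw counter you would graph something like

    rate(node_cpu_seconds_total{mode!="idle"}[5m])

so that each plotted point summarises the increase over the whole window rather than a single raw sample.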

However, the problem with a metric like ALERTS is that it's not a counter, and it doesn't even switch between 0 and 1: the whole timeseries appears and disappears.  (In fact, it creates separate timeseries for when the alert is in the "pending" and "firing" states.)  If your graph step is more than 5 minutes, you may not catch the alert's presence at all.
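You can see those separate series if you include the alertstate label, e.g.

    ALERTS{alertname="CPUUtilization", alertstate="pending"}
    ALERTS{alertname="CPUUtilization", alertstate="firing"}

Each of those has the value 1 while the alert is in that state, and the series simply disappears when it isn't.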

What you could try is a query like this:

max_over_time(ALERTS{alertname="CPUUtilization"}[1h])

The inner query is a range vector: it returns all data points within a 1 hour window, i.e. from 1 hour before the evaluation time up to the evaluation time.  Then, if *any* data points exist in that window, the highest one is returned, forming an instant vector again.  When your graph sweeps this expression over a time period from T1 to T2, each data point will cover one hour.  That should catch the "missing" samples.

Of course, the time window is fixed to 1h in that query, and you may need to adjust it depending on your graph zoom level, to match the time period between adjacent points on the graph.  If you're using Grafana, there's a magic variable $__interval you can use.  I vaguely remember seeing a proposal for PromQL to have a way of referring to "the current step interval" in a range vector expression, but I don't know what happened to that.
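For example, in a Grafana panel you could write the query as

    max_over_time(ALERTS{alertname="CPUUtilization"}[$__interval])

and Grafana will substitute the panel's current step for $__interval before sending the query.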

HTH,

Brian.

Brian Candler

Aug 18, 2022, 5:27:01 AM
to Prometheus Users
BTW, I just did a quick test.  When setting my graph display range to 2w in the Prometheus web interface, I found that adjacent data points were just under 81 minutes apart.  So the query

    max_over_time(ALERTS[81m])

was able to show lots of short-lived alerts, which the plain query

    ALERTS

did not.  Setting it longer, e.g. to [3h], smears those alerts over multiple graph points, as expected.

kekr...@gmail.com

Aug 18, 2022, 5:44:25 PM
to Prometheus Users
Thank you, Brian.  This helps.

Kevin