up query


BHARATH KUMAR

Aug 9, 2022, 8:23:33 AM
to Prometheus Users
Hi all,

First Query :
I want to find the servers which have not been reachable at any point during the last X days. I tried the following query, but it didn't work out.

Query :  max_over_time(up{instance=~"instance"}[Xd]) == 0

The above query gives me servers that were unreachable for at least 1 minute, but I want servers that were unreachable for the whole of the last X days.

Second Query :
I want to find the servers which were only partially reachable over the last X days; it should not include the servers that were totally unreachable for the whole X days.

Any leads?

Thanks & regards,
Bharath Kumar.

Brian Candler

Aug 9, 2022, 11:15:56 AM
to Prometheus Users
Use the PromQL query browser (in the Prometheus web interface) to debug it.  I suggest you first need to look at the inner query:

up{instance=~"instance"}

and graph it, setting the "instance" regexp to match one or more instances of interest. What does it look like? Is it a mixture of 0's and 1's, or all 0's, or all 1's, or is it absent entirely?  If it's absent entirely, then that's a different problem you need to investigate - your scrape job is completely broken.

If it's a mixture of 0's and 1's, then try this query:

max_over_time(up{instance=~"instance"}[2d])

It should show 1 for any instant where the server was up at any time over the previous 48 hours.  Does it not?

If it's all 0's for at least 48 hours, then

max_over_time(up{instance=~"instance"}[2d])

should show 0.

Once you've understood why your query wasn't working as you were expecting, then for partially reachable you can try a query like this:

avg_over_time(up{instance=~"instance"}[2d]) > 0 < 0.9

(setting thresholds as appropriate)
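
If a percentage is easier to read, one variant sketch expresses the same thing as availability over the window:

    100 * avg_over_time(up{instance=~"instance"}[2d])

which gives the percentage of successful scrapes over the 2 days; you can then filter it with > 0 < 90 in the same way.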

BHARATH KUMAR

Aug 15, 2022, 11:29:58 PM
to Prometheus Users
Hello sir,

When I look at the inner query, it shows a mixture of both 1's and 0's.

> max_over_time(up{instance=~"instance"}[2d])
>
> It should show 1 for any instant where the server was up at any time over the previous 48 hours.  Does it not?

Yes, it is not showing a value equal to 1. For the last 48 hours, it shows when the server was in a reachable state and when it was not.

But I want only the servers that were in an unreachable state over a period of time.

How can we achieve that?

Thanks & regards,
Bharath Kumar

Brian Candler

Aug 16, 2022, 3:17:03 AM
to Prometheus Users
If the metric is 0, 1, 0, 1, 1, 0 ...  then max_over_time will be 1, if the time period in question covers those values.
If the metric is 0, 0, 0, 0, 0, 0 ... then max_over_time will be 0.

If you enter an expression like

max_over_time(up{instance=~"some_instance_name"}[2d])

and *draw a graph of it*, then you need to understand what that graph represents.  On the X axis is time; this is the time the expression was evaluated at.  The expression itself looks at the 2 days of data *up to and including that time*: that is, the range vector up[2d] reads all data in the database between T and T-2d.

For example, if there's a point on the graph where the X axis is 15 Aug 12:00, and the Y axis is 1, it means that the max_over_time between 13 Aug 12:00 and 15 Aug 12:00 was 1.  This in turn implies that there was at least one 1 value in that 2d period.  It will only show 0 if *all* the values in that period were 0.

If that doesn't do what you want, then you'll have to describe exactly what you see more clearly, with actual concrete queries and responses, and explain why it is different to what you expect.  Otherwise, only you can see the data in front of you, so it's up to you to understand why your query isn't doing what you expect.

> But I want only the servers that were in an unreachable state over a period of time.

That will be those where max_over_time(...) is zero, and you can filter down to just those servers with an expression like this:

max_over_time(up[2d]) == 0

If you graph this expression, then all the data points will be zeros, but the points will appear and disappear over time.  They will be present at time T only if all the values in the period T-2d to T were 0.  If that's not the case, then the point will not be displayed.
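
If you would rather see a line that is always present (instead of points that come and go), one variant sketch uses the bool modifier:

    max_over_time(up[2d]) == bool 0

This returns 1 at any instant where every sample in the preceding 2 days was 0, and 0 otherwise, so nothing disappears from the graph.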

BHARATH KUMAR

Aug 16, 2022, 10:08:15 AM
to Prometheus Users
hello,

max_over_time(up[2d]) == 0 is giving me servers that were down for even 1 minute in the last two days, which I don't want. I want servers that were completely unreachable for the whole of the last "X" days.

Thanks & regards,
Bharath Kumar.

Stuart Clark

Aug 16, 2022, 10:57:30 AM
to BHARATH KUMAR, Prometheus Users
On 2022-08-16 15:08, BHARATH KUMAR wrote:
> hello,
>
> max_over_time(up[2d]) == 0 is giving me servers that were down for even
> 1 minute in the last two days, which I don't want. I want servers that
> were completely unreachable for the whole of the last "X" days.
>

So you only want a match if every single scrape failed over the past 2 days?

Try sum() instead of max_over_time().

--
Stuart Clark

Brian Candler

Aug 16, 2022, 10:57:44 AM
to Prometheus Users
What you are saying doesn't make sense, so you need to provide some evidence: actual queries, actual data.  Pick a specific instance, which I'll say is "foo".

up{instance="foo"}     # show graph over 1 week
up{instance="foo"}[2d]    # show console (range vectors can't be graphed)
max_over_time(up{instance="foo"}[2d])   # show graph over 1 week

I assert that if up{instance="foo"} is a mixture of 0s and 1s for a given 48 hour period, then the value of max_over_time(up{instance="foo"}[2d]) at the end of that 48 hour period will be 1.

Brian Candler

Aug 16, 2022, 10:58:33 AM
to Prometheus Users
> Try sum() instead of max_over_time().

Do you mean sum_over_time() ?  But it will amount to the same thing, surely?
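
For a metric that only ever takes the values 0 and 1, the two amount to the same test; as a sketch:

    sum_over_time(up{instance=~"instance"}[2d]) == 0
    max_over_time(up{instance=~"instance"}[2d]) == 0

Both match only the series whose every sample in the window was 0.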

BHARATH KUMAR

Aug 16, 2022, 11:36:43 PM
to Prometheus Users

Yeah, I want only the servers that were down for the whole two days. The value should always be zero (0) throughout the last 'X' days.

But max_over_time is flagging servers even if they were down for just one minute during the last 'X' days.

Thanks & regards,
Bharath kumar.

Brian Candler

Aug 17, 2022, 3:25:48 AM
to Prometheus Users
Extraordinary claims require extraordinary evidence.

I don't believe there's a bug in prometheus: I believe there's a bug in how you are using it.  But unless you show the data, there's no way to demonstrate this.

BHARATH KUMAR

Aug 17, 2022, 9:09:42 AM
to Prometheus Users
up.PNG
This is the query I am using, and the above graph is for 30 days; it has been down for the last day. I want the servers that are down for the whole 30 days.

Brian Candler

Aug 17, 2022, 10:12:26 AM
to Prometheus Users
If you want servers that have been down for 30 days, then I thought it should be obvious you need max_over_time(up[30d]) == 0  ... but perhaps it isn't as obvious as I thought.

Let me break that query down into parts:

up[30d]   :   returns a *range vector* containing all data points for the timeseries with metric name "up" from T - 30 days to T (where T is the evaluation time, i.e. the point on the X axis)

By "timeseries" I mean distinct combination of metric name and labels, e.g.
up{instance="foo"}
up{instance="bar"}
are two different timeseries.  They happen to share the same metric name ("up") but they are recording an independent sequence of measurements.

Think of the range vector as a two-dimensional grid: there are N different timeseries, each with M data points over that period. The data collected and stored in the TSDB might look like this:

up{instance="foo"}  v1 . . . v2 . . . v3 . . .
up{instance="bar"}  . . v4 . . . v5 . . . v6 .
                    -------------------------> time

Then:
max_over_time(...)  :  for each timeseries in the range vector, picks the highest value.  This returns an *instant vector*, i.e. a single value for every timeseries, which is the maximum of each.

up{instance="foo"}  v3
up{instance="bar"}  v5

Each of those values is the maximum value of the timeseries, over the 30 day period.

Now, you've chosen to draw a graph of this expression, but it's important to realise that the graph itself doesn't need to be over 30 days.  When you draw a graph of an expression, it will sweep across the evaluation time, evaluating the expression repeatedly at different instants in time over the given period.

Let's say, for example, you set the graph range to be 1 week, but you are graphing max_over_time(up[30d]) == 0

What will you get?  This will be a series of points.  Let's imagine the graph only had one point per day. Considering the position of each point on the time axis:
Aug 17: shows if the server has been down from (Aug 17 - 30 days) to (Aug 17)
Aug 16: shows if the server has been down from (Aug 16 - 30 days) to (Aug 16)
...
Aug 10: shows if the server has been down from (Aug 10 - 30 days) to (Aug 10)

In fact, for your purposes (asking, has the server been down for the *last 30 days*?) you don't need to draw a graph at all!  In which case, if you turn on the "Instant" switch in Grafana it will only ask Prometheus to evaluate the expression for the current instant, which makes the query much faster and cheaper.

This is then an ideal query to use in a dashboard, where you just want to show a list of servers that have been down for the last 30 days.  You don't care, for example, if 2 days ago they were down for the 30 days before that point, do you?  Because that's basically what a graph of that expression will tell you: at each point in time, whether it was down for the previous 30 days.
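
If all you want on such a dashboard is the number of servers that have been down for the last 30 days, rather than the list, a small sketch:

    count(max_over_time(up[30d]) == 0) or vector(0)

The "or vector(0)" part just makes the result 0 instead of empty when no server matches.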

Brian Candler

Aug 17, 2022, 10:53:45 AM
to Prometheus Users
Incidentally, there is another way to slice this, which may or may not be helpful.

If you tell Grafana to query Prometheus for the simple query "up", you can then get Grafana itself to calculate the average, or minimum, or maximum over the time range it has queried:
img1.png
This can be useful in stat panels, where the stat panel dynamically changes for the time period you have selected in Grafana (e.g. if you select a particular 6 hour window, you want to show the average over those 6 hours).  It defaults to "Last", i.e. the most recent value.

But we are now moving into the realm of Grafana, and this is a mailing list for Prometheus.  Grafana has its own community discussion forum, so questions about Grafana are best asked there.

BHARATH KUMAR

Aug 24, 2022, 6:43:15 AM
to Prometheus Users
I saw a blog post on Google that says the following:

If you want to count the time spent in the down state, this becomes more complicated, because you have to detect the switch from 1 to 0 (which counts for 1 min) and the subsequent down state until the first switch back from 0 to 1.

It could be something along the lines of:

(max_over_time(up[60s]) == bool 0) * ((up offset 61s == bool 1) * count(up[60s]) OR vector(1)) ---> query

But the above query threw me an error as below:

bad_data: 1:73: parse error: expected type instant vector in aggregation expression, got range vector


What am I missing here? How can I achieve something like "find the instances that have been completely in a down state for the last X days"?

Thanks & regards,

Bharath Kumar.


Brian Candler

Aug 24, 2022, 7:27:32 AM
to Prometheus Users
On Wednesday, 24 August 2022 at 11:43:15 UTC+1 chembakay...@gmail.com wrote:

> (max_over_time(up[60s]) == bool 0) * ((up offset 61s == bool 1) * count(up[60s]) OR vector(1)) ---> query
>
> But the above query threw me an error as below:
>
> bad_data: 1:73: parse error: expected type instant vector in aggregation expression, got range vector

That expression is junk, and you didn't say where you got it from apart from "some blog".
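
For what it's worth, the parse error comes from count(up[60s]): count() is an aggregation operator and only accepts an instant vector, whereas up[60s] is a range vector. The range-vector form would be something like:

    count_over_time(up[60s])

but even with that change the rest of the expression doesn't do anything sensible.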

> What am I missing here? How can I achieve something like "find the instances that have been completely in a down state for the last X days"?


Can you explain why the answer I gave before is not usable for you?  I have already told you that:

    max_over_time(up[30d]) == 0

will give you a list of all instances which have been down continuously for the last 30 days, and that seems to be what you keep asking for.  I have tested it, it works:

img1.png
That is a table of machines which have been down for 30 days continuously.

Note that this is a query that you should run at a single instant (the current time), not one that you make a graph from.  In Grafana, turn the "instant" toggle on to get this behaviour.

img2.png

You'll just get a set of single data points, which is a list of all the machines that have been down continuously from (now - 30 days) to (now).

You probably want to change the visualisation to a table, or some other panel type. Graph isn't what you want here, since the result only contains data for a single point in time.  That is: those machines which, *at the current time*, have been down for the 30 days before *the current time*.  The reference point is the current time only; you don't want to sweep this query over previous times.

BHARATH KUMAR

Aug 25, 2022, 8:33:44 AM
to Prometheus Users
Thanks, Brian. It really helped me. 

I want to find the downtime of an instance, similar to how we find the uptime of an instance.

Uptime: time() - node_boot_time_seconds{instance=~"$instance"}

Is there any metric in node exporter that we can use to find the downtime of the instance?

Brian Candler

Aug 27, 2022, 8:33:33 AM
to Prometheus Users
That's a different thing.

node_boot_time_seconds is a metric that says when the host itself thinks it booted - which is not necessarily the same as the host has been "up" or "down" from the point of view of Prometheus, which classes "up" as a successful scrape.  For example, the host could have been running fine, but the network was down: you'll get up == 0 during the network outage, but node_boot_time_seconds will not have changed.

Question: are you generating alerts when these machines go down?  If you are, then the answer is easy: there's a metric ALERTS_FOR_STATE where the value is the time that the alert started.

(You could always add alerting rules which send out no alerts: add a label that identifies them as a silent alert, and match that label in your alertmanager routing rules to route them to an empty receiver.)
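
As a rough sketch, assuming you have an alerting rule named "InstanceDown" (a made-up name here) that fires on up == 0, the time a machine has been in that alerting state would be something like:

    time() - ALERTS_FOR_STATE{alertname="InstanceDown"}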

Otherwise, assuming the node is currently down (i.e. up == 0), I think you are looking for either:
* the last time at which up == 1
* the last time at which up changed from 1 to 0

However, getting this answer directly through a prometheus query is not easy. You can graph the transitions from "up" to "down":

    up == 0 and up offset 5m == 1

But you want the timestamp of the last transition. There is a function last_over_time(...) which gets you the last available value, but timestamp(last_over_time(...)) doesn't tell you its timestamp.

To the best of my knowledge, you need a trick like:

    timestamp(up) and up==1

or more simply, since we know up=0 or 1 only:

    time() * up

Then you can sweep this over a range and pick the maximum value, which must be the most recent, since time increases monotonically (and it will give zero if the machine has been down over the whole period):

    max_over_time((time() * up)[24h:])

Note: This is a fairly expensive query, so make sure you only evaluate it at a single instant.  If you're doing this in Prometheus web interface select "Table", not "Graph".  If you're doing this in Grafana, turn on the "Instant" switch.

Want to limit the result to just machines which are down *now*?

    max_over_time((time() * up)[24h:]) unless up == 1

You want to know how long have they been down? Do the same as you did with node_boot_time_seconds:

    time() - max_over_time((time() * up)[24h:]) unless up == 1

This query gets more expensive as you increase the time range covered. If you're not too worried about full accuracy, e.g. if the approximate number of hours that the machine has been down is OK, then you can use a larger evaluation step in the subquery:

    time() - max_over_time((time() * up)[30d:1h]) unless up == 1

Hopefully, this has given some ideas about how flexible and powerful PromQL is.  Here are some links about PromQL I've bookmarked over time, in case they are useful (I haven't tested they all still work):

* <https://prometheus.io/docs/prometheus/latest/querying/basics/>
* <https://github.com/infinityworks/prometheus-example-queries>
* <https://timber.io/blog/promql-for-humans/>
* <https://www.weave.works/blog/promql-queries-for-the-rest-of-us/>
* <https://www.slideshare.net/weaveworks/promql-deep-dive-the-prometheus-query-language>
* <https://medium.com/@valyala/promql-tutorial-for-beginners-9ab455142085>
* <https://www.robustperception.io/common-query-patterns-in-promql>
* <https://www.robustperception.io/booleans-logic-and-math>
* <https://www.robustperception.io/composing-range-vector-functions-in-promql>
* <https://www.robustperception.io/rate-then-sum-never-sum-then-rate>
* <https://www.robustperception.io/using-group_left-to-calculate-label-proportions>
* <https://www.robustperception.io/extracting-raw-samples-from-prometheus>
* <https://www.robustperception.io/prometheus-query-results-as-csv/>
* <https://www.robustperception.io/existential-issues-with-metrics>
* <https://www.robustperception.io/left-joins-in-promql>

Brian Candler

Aug 27, 2022, 8:41:30 AM
to Prometheus Users
On Saturday, 27 August 2022 at 13:33:33 UTC+1 Brian Candler wrote:
> You want to know how long have they been down? Do the same as you did with node_boot_time_seconds:
>
>     time() - max_over_time((time() * up)[24h:]) unless up == 1

On reflection, I think the following is a slightly better version:

    time() - max_over_time((time() * up)[24h:]) and up == 0

The only difference is, if you remove a target from the scrape job, this will also suppress the value.  That is: you'll only be told that the machine has been down for N seconds, if the machine is still being scraped *and* is still down.
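
Putting the earlier pieces together, a sketch of the 30-day version with the coarser subquery step and this filter would be:

    time() - max_over_time((time() * up)[30d:1h]) and up == 0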
