Modifying values in queries based on conditions

Erwin

Nov 17, 2021, 11:42:06 AM
to Prometheus Users
Hello everyone,

I'm trying to build a general availability dashboard for our infrastructure in Grafana, using Prometheus metrics exclusively.

Most of our checks consist of metrics like "up", where the result is either 0 or 1. That makes it very easy to build different panels in Grafana and to apply further operations to get information like "least available component", "general combined availability of infra", etc.

But there are a few situations where I would like a query to return 0 or 1 and cannot get it to.

For example, I might have a cluster where one of the servers can fail and still display an available service (and a result of 1 for my query), but having 2 failed servers would get me a result of "0" for my query.
For another service, I need the result of the query to be above a certain threshold to consider it “healthy” (and a result of 1).

Is it possible for a PromQL query to return 0 or 1 based on conditions specified in the query itself? I think this should be doable, but I can't find out how.

Can someone help me? Or am I looking at this the wrong way?


Regards,

--
Erwin

Brian Candler

Nov 18, 2021, 3:57:37 AM
to Prometheus Users
You're probably looking at it the wrong way, and I expect you should configure Grafana to visualise correctly the response you have.

You can display or not display something in Grafana based on presence/absence of any value.  However usually it's more useful to see the actual failing value, because an indication of just "not healthy" doesn't give you any clue to help debug the problem.  One thing you can do in Grafana is to set thresholds and colours: e.g. display green if the value is between 0 and 5, amber if 5 to 10, red if 10 or higher.  That's often much more useful (except for users with colour blindness who may need additional cues).

However, you can also frig the queries in PromQL if required.  Since you don't give the actual queries, I can only talk in general terms.

foo < 1
# gives you some value for foo, if it's less than 1, and no value if foo >= 1.

(foo < 1) * 0
# always gives you a value of 0 if foo < 1, or no value if foo >= 1

foo < bool 1
# will always give you a value: 1 if foo < 1, 0 if foo >= 1
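
For the threshold case in the original question, the same bool modifier applies. A minimal sketch, assuming a hypothetical counter http_requests_total with a job="api" label and an arbitrary threshold of 10 requests per second (none of these come from this thread):

```promql
# 1 if the API serves more than 10 requests/s over the last 5 minutes, else 0.
# Metric name, label, and threshold are placeholders.
sum(rate(http_requests_total{job="api"}[5m])) > bool 10
```

Note that a bool comparison drops the metric name from the result, and that if the left-hand side returns no series at all, the result is empty rather than 0.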

> For example, I might have a cluster where one of the servers can fail and still display an available service (and a result of 1 for my query), but having 2 failed servers would get me a result of "0" for my query.

I would be inclined to make a query that counts the "number of failed servers", and set a display threshold on that.  Then the dashboard won't say "too many failed servers!", it will say "2 failed servers!"
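
A rough sketch of that count, assuming the cluster's targets are scraped under a hypothetical job="mycluster" label. Subtracting sum from count avoids the empty result that count(up == 0) gives when nothing has failed:

```promql
# Number of failed servers: total targets minus targets that are up.
count(up{job="mycluster"}) - sum(up{job="mycluster"})

# If a 0/1 result is still needed: 1 while fewer than 2 servers are down, else 0.
(count(up{job="mycluster"}) - sum(up{job="mycluster"})) < bool 2
```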

Erwin

Nov 23, 2021, 5:36:33 AM
to Prometheus Users
Hello Brian,

Sorry for the late response.
I already have plenty of dashboards in Grafana for various parts of our infrastructure; alerts and thresholds work well, and having an actual value helps us find the source of our problems, as you say. However, the particular dashboard I'm building is aimed at executives and other partners who demand an availability counter for our infrastructure as a whole.
So for this particular dashboard, the question is not "is something broken and why is it broken?" but just "is everything working and if not, what broke and when?".
I should have made it a bit clearer, sorry.

The few queries you gave me actually helped a lot! I had never used the bool modifier in my queries until you mentioned it.
So now I use home-made recording rules for the various parts of the infrastructure, mainly containing min/max/max_over_time/bool and a few conditions. I get a nice load of 0s and 1s everywhere and it's very easy now to get a global % of availability for a period of time.
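
As an illustration of how such 0/1 series turn into an availability figure, here is a sketch assuming a hypothetical recorded series named service:available:bool, with one 0/1 series per service (the name and the 30d window are placeholders, not the actual rules described above):

```promql
# Combined state: 1 only while every service reports 1.
min(service:available:bool)

# Percentage of the last 30 days each service spent in the "1" state.
avg_over_time(service:available:bool[30d]) * 100
```

Averaging a 0/1 series over a window gives the fraction of samples that were 1, which is why the percentage falls out directly.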
The state timeline panel in Grafana is also very useful.

Thanks for your help Brian :)