A good way to select metrics <= threshold value

chuanjia xing

May 25, 2021, 7:38:27 PM

to Prometheus Users

I am collecting some cpu utilization data for ec2 instances across many k8s clusters. One problem I need to solve is: "find all the clusters with cpu utilization <= 10%". This seems pretty straightforward, but turns out I can't find a good query for it.

The query I am using to find these clusters is as follows:

avg by (cluster_id)(max_over_time(cpu_utilization{}[1200s])) <= 10

This basically looks at the cpu utilization averaged by each cluster, then selects the ones <= 10%. But I got one cluster with cpu utilization as follows:

[Attached image: Screen Shot 2021-05-25 at 4.22.35 PM.png]

Looking at the graph, the average cpu utilization is clearly > 10%. I think the only reason it gets selected is that during some time windows the cpu utilization is < 10%, so Prometheus selects it.

This doesn't make much sense to me. It looks like Prometheus selects a series whenever a single data point is <= 10%; it never selects based on the average value.

I hit this problem multiple times. How can I fix this? Is there a different query I should use for my case? Thanks!

Aliaksandr Valialkin

May 26, 2021, 8:22:21 AM

to chuanjia xing, Prometheus Users

I'm unsure whether this is possible with PromQL, but it should be easy with MetricsQL. Try using the following MetricsQL query:

with (x = avg by (cluster_id) (cpu_utilization)) x if range_max(x) <= 10

It uses the following MetricsQL features:

* WITH templates

* range_max() function, which returns the maximum value for the given metric on the selected time range.

MetricsQL also provides the bottomk_max() function, which can be used for returning the bottom series with the maximum value on the selected time range. For example, the following query returns cpu utilization for the 3 least loaded clusters:

bottomk_max(3, avg by (cluster_id) (cpu_utilization))

See more details about MetricsQL at https://docs.victoriametrics.com/MetricsQL.html .

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/7b253bf5-2b09-4522-ae16-53ce62c1d756n%40googlegroups.com.

--

Best Regards,

Aliaksandr Valialkin, CTO VictoriaMetrics

Julius Volz

May 26, 2021, 4:03:01 PM

to chuanjia xing, Prometheus Users

Hi,

To give an actual PromQL answer to a PromQL question:

If I understand your question correctly, you may want to take a look at the new experimental "@" modifier for selectors + subqueries, which was added in Prometheus 2.25 and can be turned on via the "--enable-feature=promql-at-modifier" command-line flag:

https://prometheus.io/blog/2021/02/18/introducing-the-@-modifier/

It's designed for these kinds of situations where you want to evaluate specific vector selectors or subqueries at a fixed point in time (e.g. the end of the graph range, via "foo @ end()"), rather than relative to each resolution step.
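As a rough illustration of what "relative to each resolution step" means, here is a toy Python model. It is not Prometheus code: the sample values and the `filter_per_step` helper are made up for this sketch. It only shows that a comparison filter in a range query is applied to every step separately, so a series that dips below the threshold for a while still shows up for exactly those steps.

```python
# Toy model of a PromQL range query: the expression is evaluated
# independently at every resolution step, so a comparison filter such
# as `<= 10` keeps or drops individual samples, not whole series.

def filter_per_step(samples, threshold):
    """Return only the (step, value) pairs whose value is <= threshold."""
    return [(step, value) for step, value in samples if value <= threshold]


# Hypothetical per-cluster averages at six graph steps.
cluster_a = [(0, 35.0), (1, 28.0), (2, 9.0), (3, 8.5), (4, 22.0), (5, 31.0)]

kept = filter_per_step(cluster_a, 10)
print(kept)  # [(2, 9.0), (3, 8.5)]
```

The cluster is mostly well above 10%, yet two of its points survive the filter, which matches the behaviour described in the original question.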

So e.g. for a 1-hour graph where you wanted to select the per-cluster-averaged CPU utilizations only for those clusters whose overall average within that graphed last hour was <= 10%, you could write something like:

----------------
avg by(cluster_id) (cpu_utilization)
and
max_over_time(
  avg by(cluster_id) (cpu_utilization)[1h:] @ end()
) <= 10
----------------

Depending on what exactly you want as your filter criterion, you may want to switch max_over_time() for avg_over_time() instead (requiring the max over the graph range to be <= 10% vs. requiring the average over the graph range to be <= 10%).
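The difference between the two criteria can be sketched in plain Python. This is only a model of the selection logic, not of PromQL itself: `select_clusters` and the sample data are hypothetical, and each list stands in for one cluster's averaged utilization over the whole graph range.

```python
from statistics import mean


def select_clusters(series_by_cluster, threshold, criterion):
    """Keep whole series whose range-wide criterion is <= threshold."""
    return {
        cluster: values
        for cluster, values in series_by_cluster.items()
        if criterion(values) <= threshold
    }


# Hypothetical per-cluster averages over the graphed range.
data = {
    "idle": [4.0, 6.0, 5.0, 7.0],      # max 7, mean 5.5
    "bursty": [2.0, 3.0, 30.0, 1.0],   # max 30, mean 9
    "busy": [40.0, 55.0, 8.0, 60.0],   # max 60, mean 40.75
}

by_max = select_clusters(data, 10, max)    # like max_over_time(...) <= 10
by_avg = select_clusters(data, 10, mean)   # like avg_over_time(...) <= 10

print(sorted(by_max))  # ['idle']
print(sorted(by_avg))  # ['bursty', 'idle']
```

With the max criterion a single spike disqualifies a cluster, while the average criterion tolerates short bursts. In both cases the "busy" cluster is dropped entirely, even though one of its samples is below 10.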

And when using Grafana, you can replace the hard-coded "1h" with the "$__range" variable to make this dynamic according to the currently viewed graph range (see https://grafana.com/docs/grafana/latest/variables/variable-types/global-variables/#__range):

----------------
avg by(cluster_id) (cpu_utilization)
and
max_over_time(
  avg by(cluster_id) (cpu_utilization)[$__range:] @ end()
) <= 10
----------------

Regards,

Julius

--

Julius Volz

PromLabs - promlabs.com
