Overlapping values in graphs and queries

stefan...@gmail.com

Dec 24, 2016, 10:09:40 AM
to Prometheus Users
Hi,


I'm trying to compute aggregate system statistics (e.g. CPU utilization, as measured by the node exporter) while the system is under load test. I want to be able to distinguish how different test conditions affect system performance.

To do that, I've created a text file read by the node exporter textfile collector, in which I put the run ID, e.g.:

run{run="126"} 1

Here 126 is the test ID, and the metric is set to 1 during the test run. When the next test runs (e.g. test 127), I overwrite the file and change the run label to run="127".

It seems to work fine; however, if I chart the expression run[10m] as a stacked graph in Prometheus, the series for different labels overlap:


Why does this occur?

Since I always update the file atomically with the new value, no overlap should ever occur; I should always get a value equal to 1. It looks like Prometheus takes a while to notice that the value changed.
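
A minimal sketch of such an atomic update, assuming Python and a hypothetical helper name write_run_metric. Writing to a temporary file and renaming it over the target is a common pattern, not necessarily the one used in this setup:

```python
import os
import tempfile


def write_run_metric(path, run_id):
    """Atomically replace the textfile-collector file at `path` (hypothetical helper)."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file in the same directory so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            f.write('run{run="%d"} 1\n' % run_id)
        # mkstemp creates the file with mode 0600; widen it so another user can read it.
        os.chmod(tmp_path, 0o644)
        # os.replace swaps the file in a single step, so readers never see a partial file.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

Even with an atomic write, each run label produces a separate time series, so atomicity of the file alone does not decide what a query returns.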

The scraping works correctly: if I view the query results in the Console, I can see that the timestamps do not overlap:

 

What am I doing wrong here?


Many thanks for your help and for making this great tool!

Brian Brazil

Dec 24, 2016, 1:02:32 PM
to stefan...@gmail.com, Prometheus Users
On 24 December 2016 at 15:09, <stefan...@gmail.com> wrote:
> It seems to work fine; however, if I chart the expression run[10m] as a stacked graph in Prometheus, the series for different labels overlap:

The current staleness handling means that the time series will still be returned by instant vectors for 5 minutes. I'd suggest putting the run number as the value of a single timeseries.
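
To illustrate the effect, here is a simplified model (not Prometheus's actual implementation) of an instant query that returns each series' most recent sample inside a 5-minute lookback window. The sample data, the 60-second scrape interval, and the function name are assumptions made for illustration:

```python
LOOKBACK_SECONDS = 300  # the 5-minute window mentioned above


def instant_vector(samples, t, lookback=LOOKBACK_SECONDS):
    """Return {series: value} for every series with a sample in (t - lookback, t].

    `samples` maps a series name to a list of (timestamp, value) pairs.
    """
    result = {}
    for series, points in samples.items():
        recent = [(ts, v) for ts, v in points if t - lookback < ts <= t]
        if recent:
            result[series] = max(recent)[1]  # value of the latest sample in the window
    return result


# Run 126 is scraped every 60 s until t=600, then the file switches to run 127.
samples = {
    'run{run="126"}': [(ts, 1) for ts in range(0, 601, 60)],
    'run{run="127"}': [(ts, 1) for ts in range(660, 1201, 60)],
}

# Shortly after the switch both series are still returned, so a stacked graph shows 2.
print(instant_vector(samples, 700))
# Once the last sample of run 126 is more than 5 minutes old, only run 127 remains.
print(instant_vector(samples, 960))
```

With a single time series whose value is the run number, there is only one series to return at any instant, so this overlap cannot appear.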

Brian
 



--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscribe@googlegroups.com.
To post to this group, send email to prometheus-users@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/438dc53b-088a-4c1b-9b02-6384f8e10d8d%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.




stefan...@gmail.com

Dec 29, 2016, 9:51:45 AM
to Prometheus Users, stefan...@gmail.com
Hi Brian,

Thanks for the quick and useful reply!


I managed to rework my setup around your suggestion. I now have a metric called runid that holds the ID of the currently running test, or 0 when no run is in progress. I can then gather the performance metrics I need during a specific run as follows:

avg by (instance, job) (sum by (instance, cpu, job) (irate(node_cpu{mode=~"user|system|nice"}[1m]))) * 100 and runid == $runid

Now, I have two remaining questions:

1) How to compute aggregate stats over a specific run

To understand how a test run performed, I would like a single aggregate value, like a benchmark score: for example, the average CPU utilization during a test run. Is there a way to accomplish this using PromQL?

I have found the <aggr>_over_time functions, but I'm not able to come up with a working query. It works in a simple case like:

avg_over_time(go_goroutines[1m])

However, taking the average of a rate doesn't work, I guess because rate() returns an instant vector:

avg_over_time(rate(node_cpu[1m]))

Is it possible to transform it into a range vector?


2) Distinct values of a metric

I'm using Grafana on top of Prometheus to analyze the results. I have set up a template variable that holds the distinct runid values, so that I can filter and visualize the relevant tests.

Now, how can I get the distinct values of a metric, so that the template variable will show them?


I also wanted to thank you all again for this wonderful project, I'm now able to implement complex derived metrics & ideas in a matter of clicks!

Brian Brazil

Dec 29, 2016, 10:41:23 AM
to stefan...@gmail.com, Prometheus Users
On 29 December 2016 at 14:51, <stefan...@gmail.com> wrote:
> 1) How to compute aggregate stats over a specific run
>
> To understand how a test run performed, I would like a single aggregate value, like a benchmark score: for example, the average CPU utilization during a test run. Is there a way to accomplish this using PromQL?
>
> I have found the <aggr>_over_time functions, but I'm not able to come up with a working query. It works in a simple case like:
>
> avg_over_time(go_goroutines[1m])
>
> However, taking the average of a rate doesn't work, I guess because rate() returns an instant vector:
>
> avg_over_time(rate(node_cpu[1m]))
>
> Is it possible to transform it into a range vector?

That's not really possible with PromQL. I'd suggest having a tool figure out the required time ranges, and then making appropriate requests to Prometheus.
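
A rough sketch of what such a tool might do, assuming the samples have already been fetched (for example with range queries against the HTTP API's /api/v1/query_range endpoint) and decoded into (timestamp, value) pairs. The function names and the sample data are hypothetical, and the sketch assumes a run ID never recurs:

```python
def run_window(runid_samples, run_id):
    """Return the (start, end) timestamps during which runid equalled run_id, or None."""
    times = [ts for ts, value in runid_samples if value == run_id]
    if not times:
        return None
    return min(times), max(times)


def average_over_window(metric_samples, window):
    """Average the metric values whose timestamps fall inside the window."""
    start, end = window
    values = [v for ts, v in metric_samples if start <= ts <= end]
    return sum(values) / len(values) if values else None


# Hypothetical data shaped like decoded range-query results: (timestamp, value).
runid_samples = [(0, 0), (60, 126), (120, 126), (180, 126), (240, 0)]
cpu_percent = [(0, 5.0), (60, 40.0), (120, 60.0), (180, 50.0), (240, 4.0)]

window = run_window(runid_samples, 126)
print(window, average_over_window(cpu_percent, window))
```

A real tool could instead use the window as the start and end of a second request to Prometheus for the CPU expression, rather than averaging locally.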
 


> 2) Distinct values of a metric
>
> I'm using Grafana on top of Prometheus to analyze the results. I have set up a template variable that holds the distinct runid values, so that I can filter and visualize the relevant tests.
>
> Now, how can I get the distinct values of a metric, so that the template variable will show them?

You might be able to hack something with count_values.
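
One possible shape for that hack, as a sketch only: count_values emits one series per distinct sample value and stores that value in a label, so the distinct run IDs can be read back from the labels of the result. The label name "runid", the helper name, and the example response below are assumptions; how the result is fed into the Grafana template variable is left open.

```python
import json

# PromQL for the hack: one output series per distinct sample value,
# with that value exposed in a label (named "runid" here by assumption).
QUERY = 'count_values("runid", runid)'


def distinct_label_values(response_text, label="runid"):
    """Return the distinct values of `label` found in a query response body."""
    payload = json.loads(response_text)
    found = {
        series["metric"][label]
        for series in payload["data"]["result"]
        if label in series["metric"]
    }
    # Label values are strings; sort them numerically for display.
    return sorted(found, key=float)


# Hypothetical response body for a range query over QUERY.
example = json.dumps({
    "status": "success",
    "data": {
        "resultType": "matrix",
        "result": [
            {"metric": {"runid": "127"}, "values": [[1483020000, "1"]]},
            {"metric": {"runid": "0"}, "values": [[1483010000, "1"]]},
            {"metric": {"runid": "126"}, "values": [[1483000000, "1"]]},
        ],
    },
})

print(distinct_label_values(example))
```

An instant query would only report the value held at that moment, which is why the example uses a range query to cover past runs.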

Brian
 

