Counters and non-zero values

Khazhismel Kumykov

Aug 20, 2019, 1:15:21 AM
to Prometheus Users
So, looking at what the Python prometheus_client exports, I noticed that my counters (heck, all my metrics) have a "created" timestamp.

I thought to myself - hey, this might be used to signal that the metric was reset via process restart, so we should treat it as if it restarted from 0!
Except... it doesn't.

So if I start a process and very quickly increment a counter beyond zero, increase()/rate()/etc. still think that my value has increased by "0". Even worse: I launch 20 processes, all of which increase a few metrics by 1 or 2 before the first poll from the Prometheus server; then I increase(), then sum(), and my 25 events are now 0!

Searching the help pages gives no solution to this problem, simply "set the field to 0 before the first query", and other Q&As give bizarre incantations such as `sum(increase(log_message_count{level="error"}[1m])) without (instance) > 0 or ((log_message_count{level="error"} != 0 unless log_message_count{level="error"} offset 1m))` or `sum(max_over_time(workflow_action_executions_count{result="ok"}[1m]) or vector(0)) - sum(max_over_time(workflow_action_executions_count{result="ok"}[1m] offset 1m) or vector(0))`, which sorta-kinda work in special situations but break down past those, and are terrifying to look at anyway.

Is there some way to configure Prometheus to perhaps store this _created field and, if it has changed, assume we restarted from zero? Some config I'm missing?
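
To make the question concrete, here is a minimal sketch of what "if `_created` changed, assume we restarted from zero" could look like. It assumes OpenMetrics-style text in which a counter has a `_total` sample and a `_created` sample; the metric name `my_events`, the sample values, and the helper names are made up for illustration, and nothing here is something Prometheus itself does.

```python
from typing import Dict


def parse_samples(exposition: str) -> Dict[str, float]:
    """Parse 'name value' lines, skipping comments and blanks (no label support)."""
    samples: Dict[str, float] = {}
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        name, value = line.split()[:2]
        samples[name] = float(value)
    return samples


def restarted(prev: Dict[str, float], cur: Dict[str, float], metric: str) -> bool:
    """A changed _created timestamp is taken to mean the process restarted."""
    return cur[f"{metric}_created"] != prev[f"{metric}_created"]


def increase_between(prev: Dict[str, float], cur: Dict[str, float], metric: str) -> float:
    if restarted(prev, cur, metric):
        # New process: everything counted so far happened since the restart.
        return cur[f"{metric}_total"]
    return cur[f"{metric}_total"] - prev[f"{metric}_total"]


# Two hypothetical scrapes. The value went from 2 to 3, so the restart is
# invisible from the value alone, but _created changed in between.
SCRAPE_1 = """\
# TYPE my_events counter
my_events_total 2.0
my_events_created 1566288921.0
"""
SCRAPE_2 = """\
# TYPE my_events counter
my_events_total 3.0
my_events_created 1566289500.0
"""

prev = parse_samples(SCRAPE_1)
cur = parse_samples(SCRAPE_2)
print(increase_between(prev, cur, "my_events"))  # 3.0, not 1.0
```

The case shown is the one where the value alone cannot reveal the restart, because the new process has already counted past the old value by the time it is scraped.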

Brian Brazil

Aug 20, 2019, 4:12:04 AM
to Khazhismel Kumykov, Prometheus Users
There isn't. This series is part of the OpenMetrics draft, and as far as I know only StackDriver has a use for it currently.

Aliaksandr Valialkin

Aug 21, 2019, 6:11:32 AM
to Khazhismel Kumykov, Prometheus Users
`increase(q[d])` is calculated as `vLast - vFirst` for each `q` on the time range `d`, where `vLast` is the last value on the time range and `vFirst` is the first. When Prometheus scrapes the first sample of a new time series, only `vLast` exists; `vFirst` doesn't. So Prometheus cannot calculate `increase` for the first sample of the new time series and returns 0, assuming it missed the previous scrape and the value didn't change. This is a valid assumption, but it leads to invalid calculations for the first sample in a time series, as in your example. A possible fix is to assume that the time series had a zero value on the previous scrape. Such a fix was recently implemented in VictoriaMetrics. It can break if the time series has long gaps because of failed scrapes: in that case it will show values that are too big for the first samples after each gap. Such invalid values can be filtered out with `clamp_max` as a temporary workaround until a better solution appears.


--
Best Regards,

Aliaksandr
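
The `vLast - vFirst` description above can be played through with the numbers from the original question. This is only a simplified model of that description: it ignores the extrapolation and counter-reset handling that real `increase()` performs, and the function name and sample values are made up for illustration.

```python
from typing import List, Sequence


def naive_increase(window: Sequence[float]) -> float:
    """vLast - vFirst over the samples in a window.

    With fewer than two samples there is no earlier value to compare
    against, so the series contributes 0.
    """
    if len(window) < 2:
        return 0.0
    return window[-1] - window[0]


# 20 processes: 15 count one event and 5 count two events,
# all before the first scrape reaches them.
first_values: List[float] = [1.0] * 15 + [2.0] * 5
true_total = sum(first_values)

# Each series is then scraped twice inside the window, with no new events.
windows = [[value, value] for value in first_values]
reported_total = sum(naive_increase(window) for window in windows)

print(true_total, reported_total)  # 25.0 0.0
```

Every series starts at its already-incremented value, so the first sample never has a lower neighbour to be compared with and the 25 events vanish from the sum.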

Brian Brazil

Aug 21, 2019, 6:15:26 AM
to Aliaksandr Valialkin, Khazhismel Kumykov, Prometheus Users
This is generally unsafe: we can't tell the difference between a metric that was just created and one that has existed for years but that Prometheus only started scraping now.

Brian
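
A small sketch of the objection, under the same simplified `vLast - vFirst` model: assuming a zero value before the first sample recovers the events of a freshly started process, but produces a huge spike for a counter that is merely new to the scraper. The function names and values are made up for illustration; the `clamp_max` helper only mimics the idea of capping a value and is not the PromQL implementation.

```python
from typing import Sequence


def increase_assuming_zero(window: Sequence[float]) -> float:
    """vLast - vFirst, except that a lone first sample is measured against 0."""
    if not window:
        return 0.0
    if len(window) == 1:
        return window[0]
    return window[-1] - window[0]


def clamp_max(value: float, limit: float) -> float:
    """Cap a value at an upper limit."""
    return min(value, limit)


new_counter = [2.0]          # process just started and counted 2 events
old_counter = [1_000_000.0]  # has existed for years, scraped for the first time

print(increase_assuming_zero(new_counter))                    # 2.0
print(increase_assuming_zero(old_counter))                    # 1000000.0
print(clamp_max(increase_assuming_zero(old_counter), 100.0))  # 100.0
```

Both windows look identical from the scraper's side (a single sample), which is why the cap can only bound the error rather than remove it.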
 



Aliaksandr Valialkin

Aug 21, 2019, 6:28:11 AM
to Brian Brazil, Khazhismel Kumykov, Prometheus Users
So it looks like the best solution would be to skip results if `vFirst` is missing, since both approaches mentioned above have real-life issues.
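
As a rough sketch of this third option, under the same simplified model: a window without `vFirst` yields no result at all instead of 0 or the raw value, so the caller can see how many series were skipped. The names and values are hypothetical.

```python
from typing import List, Optional, Sequence


def increase_or_skip(window: Sequence[float]) -> Optional[float]:
    """vLast - vFirst, or None when there is no vFirst to compare against."""
    if len(window) < 2:
        return None
    return window[-1] - window[0]


windows: List[List[float]] = [[2.0], [5.0, 8.0], [1_000_000.0]]
results = [increase_or_skip(window) for window in windows]
known = [r for r in results if r is not None]

print(results)                                # [None, 3.0, None]
print(sum(known), len(results) - len(known))  # 3.0 2
```

The early events are still not counted, but "unknown" is now distinguishable from "nothing happened".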
 


Khazhismel Kumykov

Aug 21, 2019, 2:00:27 PM
to Aliaksandr Valialkin, Brian Brazil, Prometheus Users


_created + ntp/timestamps/cleverness + interpolation seems like it'd be a "safe" way to solve this. At least it seems better than data loss, and without necessarily producing spikes. Hence asking if anyone uses this.

Are there any plans for using this part of OpenMetrics in Prometheus?
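
A minimal sketch of the `_created` idea, assuming the target's clock and the scraper's clock are comparable (the "ntp" part) and leaving interpolation out entirely. The function, its parameters, and the timestamps are hypothetical and do not describe anything Prometheus implements.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def increase_with_created(
    window: Sequence[Sample], created_ts: float, window_start: float
) -> Optional[float]:
    """Increase over a window, using the counter's creation time when known."""
    if not window:
        return None
    last_value = window[-1][1]
    if created_ts >= window_start:
        # The counter was born inside the window, so it started from 0 here.
        return last_value
    if len(window) < 2:
        # Older counter with a single sample: growth inside the window is unknown.
        return None
    return last_value - window[0][1]


WINDOW_START = 1000.0

# Freshly started process: created at t=1010, first scraped at t=1015.
print(increase_with_created([(1015.0, 2.0)], 1010.0, WINDOW_START))  # 2.0

# Long-lived counter scraped for the first time: no spike, no guess.
print(increase_with_created([(1015.0, 1_000_000.0)], 0.0, WINDOW_START))  # None

# Long-lived counter with two samples: the usual vLast - vFirst.
print(increase_with_created([(1005.0, 10.0), (1020.0, 14.0)], 0.0, WINDOW_START))  # 4.0
```

The creation time is what separates the two single-sample cases that look identical without it.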

Brian Brazil

Aug 21, 2019, 2:24:46 PM
to Khazhismel Kumykov, Aliaksandr Valialkin, Prometheus Users
Prometheus already supports scraping OpenMetrics. Having rate() use _created is a more complicated question.

Brian
 


Khazhismel Kumykov

Aug 21, 2019, 3:18:37 PM
to Brian Brazil, Aliaksandr Valialkin, Prometheus Users


I guess this more complicated question is the one I'm curious about the most :)
