Confusing "rule evaluation with different value but same timestamp" error

833 views
Skip to first unread message

Alin Sînpălean

unread,
Jul 20, 2018, 7:04:45 AM7/20/18
to Prometheus Users
I'm seeing this warning in the Prometheus logs every time a metric whose existence I'm trying to track appears or disappears. I believe I understand what it means, I just don't understand why it happens.

    level=warn ts=2018-07-20T10:31:24.371484017Z caller=manager.go:384 component="rule manager" group=foo-probe.rules msg="Error on ingesting results from rule evaluation with different value but same timestamp" numDropped=1

Essentially what I'm trying to do is that I have a script that executes periodically, running a number of probes against a frontend (e.g. attempt to log in, go to the products page and record the number of products displayed there, etc.). The result of all these probes is pushed to a Pushgateway. But whenever one of the probes fails (as in it throws an exception) the corresponding metric is not pushed. So if the "products" page stops working, the entity_count{env="prod",job="foo-probe",type="product"} metric will go missing.

Now I want to alert whenever a probe that succeeded any time during the last 24 hours has not been running for the last X minutes. For that purpose, I figured I'd create a rule that produces a probe_ran metric with a value of 1 if said probe is currently being exported by Pushgateway (i.e. it was ran during the last execution) and a value of 0 otherwise. It is mostly running without issue, except that every time a metric appears or disappears, exactly one rule eval fails with the message above and one datapoint goes missing (presumably because both a value of 1 and a value of 0 are recorded for probe_ran). Here are my rules:

groups:

- name: foo-probe.rules
  rules:

  # A metric recording a value of 1 for all probes that have been run during the most recent frontend probe execution.
  - record: probe_ran
    expr: |
      0 * probe{job="foo-probe"}
        + 1

  # Same metric as above, recording a value of 0 for all probes that have not been run during the last frontend
  # probe execution but have been run sometime during the last 24 hours.
  - record: probe_ran
    expr: |
      0 * (max_over_time(probe_ran{job="foo-probe"}[24h]) == 1)
        unless
      probe{job="foo-probe"}

I know they're somewhat convoluted, and I'll likely rewrite them to something simpler (e.g. the probe{job="foo-probe"} is actually a label_replace() call, but the problem also reproduces without it). What is bothering me though, is how could (during the same eval run) both expressions return a value for the same set of labels. One expression is essentially A and the other B unless A, so they should be mutually exclusive.

It may have to do with how stale values are handled; with referencing values from the same eval round vs. previous eval rounds; or with my own stupidity, but I just don't understand what's happening. If I graph the two expressions with second resolution, one will produce a result of 1 until time T and the other a value of 0 from time T+1, there is never any overlap.

Any insight is appreciated.

Cheers,
Alin.

Alin Sînpălean

unread,
Jul 20, 2018, 7:15:02 AM7/20/18
to Prometheus Users
Well, I've rewritten the two rules as a single rule and now everything works just fine, as there's obviously no way of producing two values for the same set of labels:

  - record: probe_ran
    expr: |
      1 + 0 * some_metric{job="nexus-probez"}
        or
      0 * (max_over_time(probe_ran{job="nexus-probez"}[24h]) == 1)

That being said, the question still stands. Why were both rules producing a value within the same eval run earlier?

Cheers,
Alin.

Matthias Rampke

unread,
Jul 23, 2018, 3:41:23 AM7/23/18
to Alin Sînpălean, Prometheus Users
Very much shooting from the hip here, but my guess would be something with staleness? In 1.x this would not have worked at all because the metric would still be at the last value for 5 minutes. I think there's still cases where that can happen.

Is it feasible to make the script emit a count of 0 if the page isn't working? I think Prometheus just isn't very good with metrics where disappearance is significant.

/MR

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To post to this group, send email to promethe...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/981b0150-49a7-4ab0-8fff-955e75914003%40googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Alin Sînpălean

unread,
Jul 23, 2018, 3:59:13 AM7/23/18
to m...@soundcloud.com, Prometheus Users
Like I said, I've managed to fix the issue to my satisfaction by using a single rule definition, where the output is either 0 or 1 (it can't be both, since it's a single expression).

I am just very confused as to how the 2 expressions (as defined initially) could both produce a value _in_the_same_eval_run_. I also suspect it has something to do with staleness (e.g. it may be possible that an arithmetic operation handles a stale NaN one way and a logical unless operation does it differently). Or it might look at the last value before the stale marker in one instance but not the other.

As for it not working in 1.x, I would have been fine with the 5 minute latency (in this particular case). The probes are only executed every 15 minutes and I'm waiting for a couple of them to fail repeatedly before alerting.

Thanks,
Alin.

Brian Brazil

unread,
Jul 23, 2018, 6:37:39 AM7/23/18
to Alin Sînpălean, Prometheus Users
On 20 July 2018 at 12:15, Alin Sînpălean <alin.si...@gmail.com> wrote:
Well, I've rewritten the two rules as a single rule and now everything works just fine, as there's obviously no way of producing two values for the same set of labels:

  - record: probe_ran
    expr: |
      1 + 0 * some_metric{job="nexus-probez"}
        or
      0 * (max_over_time(probe_ran{job="nexus-probez"}[24h]) == 1)

That being said, the question still stands. Why were both rules producing a value within the same eval run earlier?

Staleness for each recording rule is evaluated independently, so the first rule was producing a stale NaN while the second was producing a normal value and they clashed. When you're doing something like this you should use one rule, as you are now doing.

Brian
 

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscribe@googlegroups.com.
To post to this group, send email to prometheus-users@googlegroups.com.



--

Alin Sînpălean

unread,
Jul 23, 2018, 8:28:10 AM7/23/18
to Brian Brazil, Prometheus Users
Oh, so it was a stale NaN, and not a 1, that the first rule produced. That makes sense with hindsight, but it was far from obvious. Maybe it would help if the error message produced the 2 conflicting values, it would have been much easier to figure it out had I seen the NaN.

My error was assuming that a rule would either produce a value or not (and then my 2 expressions would have been mutually exclusive), but there is a third state, where a rule will produce a stale marker, which is not, strictly speaking, a value, but will still conflict with a value. Would it make sense to make an exception if the value being overwritten is a stale marker? (Just wondering, I don't have a strong opinion one way or the other.)

Thanks,
Alin.

To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To post to this group, send email to promethe...@googlegroups.com.



--

Brian Brazil

unread,
Jul 23, 2018, 1:45:06 PM7/23/18
to Alin Sînpălean, Prometheus Users
On 23 July 2018 at 13:27, Alin Sînpălean <alin.si...@gmail.com> wrote:
Oh, so it was a stale NaN, and not a 1, that the first rule produced. That makes sense with hindsight, but it was far from obvious. Maybe it would help if the error message produced the 2 conflicting values, it would have been much easier to figure it out had I seen the NaN.

My error was assuming that a rule would either produce a value or not (and then my 2 expressions would have been mutually exclusive), but there is a third state, where a rule will produce a stale marker, which is not, strictly speaking, a value, but will still conflict with a value. Would it make sense to make an exception if the value being overwritten is a stale marker? (Just wondering, I don't have a strong opinion one way or the other.)

I don't think that'd be a good idea. It'd be a layering violation as the tsdb doesn't know about staleness, and also the database is append-only so changing that could produce even weirder results for what are generally misconfigurations in the first place (your example is the first I've come across which is not a relabelling misconfiguration).

Brian 

Thanks,
Alin.

To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscribe@googlegroups.com.
To post to this group, send email to prometheus-users@googlegroups.com.



--



--
Reply all
Reply to author
Forward
0 new messages