I'm seeing this warning in the Prometheus logs every time a metric whose existence I'm trying to track appears or disappears. I believe I understand what it means, I just don't understand why it happens.
level=warn ts=2018-07-20T10:31:24.371484017Z caller=manager.go:384 component="rule manager" group=foo-probe.rules msg="Error on ingesting results from rule evaluation with different value but same timestamp" numDropped=1
Essentially what I'm trying to do is that I have a script that executes periodically, running a number of probes against a frontend (e.g. attempt to log in, go to the products page and record the number of products displayed there, etc.). The result of all these probes is pushed to a Pushgateway. But whenever one of the probes fails (as in it throws an exception) the corresponding metric is not pushed. So if the "products" page stops working, the entity_count{env="prod",job="foo-probe",type="product"} metric will go missing.
Now I want to alert whenever a probe that succeeded any time during the last 24 hours has not been running for the last X minutes. For that purpose, I figured I'd create a rule that produces a probe_ran metric with a value of 1 if said probe is currently being exported by Pushgateway (i.e. it was ran during the last execution) and a value of 0 otherwise. It is mostly running without issue, except that every time a metric appears or disappears, exactly one rule eval fails with the message above and one datapoint goes missing (presumably because both a value of 1 and a value of 0 are recorded for probe_ran). Here are my rules:
groups:
- name: foo-probe.rules
rules:
# A metric recording a value of 1 for all probes that have been run during the most recent frontend probe execution.
- record: probe_ran
expr: |
0 * probe{job="foo-probe"}
+ 1
# Same metric as above, recording a value of 0 for all probes that have not been run during the last frontend
# probe execution but have been run sometime during the last 24 hours.
- record: probe_ran
expr: |
0 * (max_over_time(probe_ran{job="foo-probe"}[24h]) == 1)
unless
probe{job="foo-probe"}
I know they're somewhat convoluted, and I'll likely rewrite them to something simpler (e.g. the probe{job="foo-probe"} is actually a label_replace() call, but the problem also reproduces without it). What is bothering me though, is how could (during the same eval run) both expressions return a value for the same set of labels. One expression is essentially A and the other B unless A, so they should be mutually exclusive.
It may have to do with how stale values are handled; with referencing values from the same eval round vs. previous eval rounds; or with my own stupidity, but I just don't understand what's happening. If I graph the two expressions with second resolution, one will produce a result of 1 until time T and the other a value of 0 from time T+1, there is never any overlap.
Any insight is appreciated.
Cheers,
Alin.