Monitor number of seconds since metric change as prometheus time series


Weston Greene

Mar 30, 2020, 10:21:01 AM
to Prometheus Users

I have this recording rule pattern:
```yaml
  - record: last_update
    expr: |
      timestamp(changes(metric_name[450s]) > 0)
        or
      last_update
```

However, that doesn't work: the `or last_update` part never returns a value.

I have also tried using an offset, `or (last_update offset 450s)`, to no avail.


My evaluation interval is 5 minutes (the frequency at which Prometheus runs my recording rules). I tried the 7.5-minute (450s) offset because I theorized that the `or` was trying to read `last_update` at the current instant, where it was null; if the `or` instead read the value `last_update` had at its previous evaluation, it should find one. But that returned no value either.


This is what the metric looks like graphed: 

[choppy rather than a complete staircase][1] (I don't have enough reputation to post pictures...)



Thank you in advance for your help.

Why I care:
If a time series plateaus for an extended period, I want to know, as that may mean it has stopped returning accurate data.


  [1]: I think the image link is preventing me from posting

Weston Greene

Mar 30, 2020, 10:23:20 AM
to Prometheus Users
This was already partially answered in https://stackoverflow.com/questions/54148451

But not sufficiently, so I'm asking here and on Stack Overflow: https://stackoverflow.com/questions/60928468

Here is the image of the graph: 

[attached image: Screen Shot 2020-03-30 at 06.18.07.png]

Weston Greene

Apr 1, 2020, 5:41:19 AM
to Prometheus Users
In the Stack Overflow post about this same topic, I was encouraged to shorten my evaluation interval, since `last_update` was likely going stale under the default staleness period of 5 minutes.

Now I can't get past the `vector contains metrics with the same labelset after applying rule labels` error.

I do add labels in the recording rule:
```yaml
                  stat: "true"
                  monitor: "false"
```

I believe this is because `last_update` already has all the labels that `metric_name` has, plus the labels the recording rule adds, so when the `or` kicks in, `last_update` conflicts because it already carries those labels.

How do I get around this? Thank you again for your creativity!

Weston Greene

Apr 3, 2020, 5:01:52 AM
to Prometheus Users
ANSWERED!
From Stack Overflow:

Summing up our discussion: the evaluation interval is too big; after 5 minutes, a metric becomes [stale][1]. This means that when the expression is evaluated, the right hand side of your `OR` expression is no longer considered by Prometheus and thus is always empty.
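As an aside, a minimal sketch of what shortening the interval looks like in the config; the 1m values are purely illustrative and not from this discussion:

```yaml
global:
  scrape_interval: 1m        # illustrative value
  evaluation_interval: 1m    # kept well under the 5m staleness window, so the self-referencing rule still sees its previous sample
```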

Your second issue is that your record rule is adding some labels to the original metric and you get a complaint from Prometheus. This is not because the labels already exist: in [recording rules][3], labels overwrite the existing labels.

The issue is your `OR` expression: it should specify an `ignoring()` [matching clause][2] to ignore the added labels, or you will get series from both sides of the `OR` expression:

> `vector1 or vector2` results in a vector that contains all original elements (label sets + values) of vector1 and additionally all elements of vector2 ***which do not have matching label sets in vector1***.

Since you get both sides of the `OR`, when Prometheus tries to add the labels to the left-hand side, the result conflicts with the right-hand side, which already exists.

Your expression should be something like:
```yaml
    expr: |
      timestamp(changes(metric_name[450s]) > 0)
        or ignoring(stat,monitor)
      last_update
```
Or use an `on(label1,label2,...)` clause with a discriminating label set, which avoids changing the expression whenever you change the rule labels.
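For example, a sketch of the `on()` variant, assuming `job` and `instance` happen to uniquely identify these series (they are placeholders, not labels from the original rule):

```yaml
    expr: |
      timestamp(changes(metric_name[450s]) > 0)
        or on(job, instance)
      last_update
```

With `on()`, matching only considers the listed labels, so adding or renaming rule labels later does not require touching the expression.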


t1hom7as

Sep 9, 2020, 6:41:13 AM
to Prometheus Users
I am actually trying to do something very similar, but I can't really tell if it is the same or not.
Basically, I have a metric whose value reports status: 1 for up and 0 for down.

I would like to find out when the value last went FROM 0 TO 1, and how long ago that was: the time from that change to the current timestamp, so I can measure the uptime of that metric.

Open to ideas, as I can't seem to get this working. Eventually I would like to present this in Grafana so I can show the uptime of that metric.

Weston Greene

Sep 13, 2020, 4:45:37 AM
to Prometheus Users
I feel like the answer above gives you what you need minus one step, so forgive me if I'm misunderstanding. The one step it doesn't spell out is a second rule for `time() - stat__change__timestamp`.
Here is an example directly from my working solution:

```rules.yaml
              - record: stat__change__timestamp
                # timestamp of when the metric last changed
                expr: |
                  timestamp(changes({exported_job=~"visor_.*", alertname="", offset="", original_name!="", original_stat=""}[${SCRAPE_INTERVAL_AND_A_HALF}]) > 0)
                    or ignoring(stat, monitor, original_stat)
                  stat__change__timestamp
                labels:
                  stat: "true"
                  original_stat: stat__change__timestamp  # keeps this series' label set distinct from the original metric

              - record: stat__change__seconds_since
                # number of seconds since the metric value last changed; this highlights a script that has stopped recording correctly or a metric that has gone stagnant
                expr: time() - stat__change__timestamp
                labels:
                  stat: "true"
                  original_stat: stat__change__seconds_since  # keeps this series' label set distinct from the original metric
```
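To close the loop on the original motivation (catching a series that has plateaued), here is a hedged sketch of an alert built on the second rule; the alert name and the one-hour threshold are made up for illustration:

```rules.yaml
              - alert: MetricStagnant
                # fires when a series has not changed for over an hour (threshold is illustrative)
                expr: stat__change__seconds_since > 3600
                for: 10m
                labels:
                  severity: warning
                annotations:
                  summary: "Series has not changed in over an hour; it may no longer be reporting accurate data"
```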

An alternative to `changes()` (pulled from a different Prometheus server I manage, hence the different label criteria):
```rules.yaml
                timestamp(
                  (
                    kafka_consumer_group_lag{topic!~".*verification_id|.*submission_id|.*__leader|.*-changelog|.*_Internal.*", group!="BifrostMonitor_Bifrost_MongoTopicDumper"}
                      -
                    kafka_consumer_group_lag{topic!~".*verification_id|.*submission_id|.*__leader|.*-changelog|.*_Internal.*", group!="BifrostMonitor_Bifrost_MongoTopicDumper"} offset ${SCRAPE_INTERVAL_DOUBLE}
                  ) != 0
                )
```
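For the up/down question above, the same offset pattern can be restricted to upward transitions by using `> 0` instead of `!= 0`. A hedged sketch, where `service_up` is a placeholder name for the 0/1 metric (this only tracks the last 0 -> 1 change; if you also want to blank it while the service is down, you could gate it with `and service_up == 1`):

```rules.yaml
              - record: stat__up_since__timestamp
                # timestamp of the most recent 0 -> 1 transition of the placeholder metric service_up
                expr: |
                  timestamp((service_up - service_up offset ${SCRAPE_INTERVAL_DOUBLE}) > 0)
                    or ignoring(original_stat)
                  stat__up_since__timestamp
                labels:
                  original_stat: stat__up_since__timestamp
```

A second rule computing `time() - stat__up_since__timestamp`, exactly like `stat__change__seconds_since` above, then gives the uptime in seconds for Grafana.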

When I say `SCRAPE_INTERVAL`, I mean 
```prometheus.yaml
  global:
    scrape_interval: ${SCRAPE_INTERVAL} # Default is every minute.
    evaluation_interval: ${EVALUATION_INTERVAL} # default is every minute.
  alerting:
     ...
```

I can't remember why I chose `_AND_A_HALF` for `changes()` and yet `_DOUBLE` for subtracting the offset. Don't think it much matters.
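Either way, the intent is just that the `changes()` range and the `offset` both span at least one full scrape interval (with some slack for scrape jitter), so there is always an earlier sample to compare against. Purely as an illustrative substitution, assuming a 1-minute scrape interval (not necessarily what these servers actually use):

```yaml
# illustrative substitutions only
SCRAPE_INTERVAL: 1m
EVALUATION_INTERVAL: 1m
SCRAPE_INTERVAL_AND_A_HALF: 90s
SCRAPE_INTERVAL_DOUBLE: 2m
```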
