Unit testing

Jimmy the Greek

Oct 23, 2020, 5:41:43 PM10/23/20
to Prometheus Users
I have been experimenting with the unit test capabilities provided by promtool and have run into a few issues/gotchas that I can't seem to understand.

example code:

rule_files:
  - ../nodelocal-cache.yaml

evaluation_interval: 1m

tests:
  - interval: 1m
    external_labels:
      cluster: test
    input_series:
    - series: 'coredns_nodecache_setup_errors_total{pod="unit-test", errortype="configmap"}'
      values: '1 2 3 4 5 6 7 8 9 10'
    - series: 'coredns_dns_response_rcode_count_total{job="nodelocal-dns", rcode="SERVFAIL", zone="."}'
      values: '0 60 120 180 240 300 360 420 480 540'
    - series: 'coredns_dns_response_rcode_count_total{job="nodelocal-dns", rcode="NOERROR", zone="."}'
      values: '0 120 240 360 480 600 720 840 960 1080'

    promql_expr_test:
    - expr: rate(coredns_nodecache_setup_errors_total{}[5m])
      eval_time: 5m
      exp_samples:
        - labels: '{pod="unit-test", errortype="configmap"}'
          value: 1.6666666666666666E-02
    - expr: rate(coredns_dns_response_rcode_count_total{}[5m])
      eval_time: 10m
      exp_samples:
        - labels: '{job="nodelocal-dns", rcode="SERVFAIL", zone="."}'
          value: 1
        - labels: '{job="nodelocal-dns", rcode="NOERROR", zone="."}'
          value: 2

    alert_rule_test:
      - eval_time: 6m
        alertname: NodeLocalDNSSetupErrorsHigh
        exp_alerts:
          - exp_labels:
              severity: critical
              alertname: NodeLocalDNSSetupErrorsHigh
              errortype: configmap
              pod: unit-test
            exp_annotations:
              description: test:unit-test There are configmap errors setting up NodeLocalDNS
              summary: NodeLocalDNS setup errors on test:unit-test
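(For reference, the exp_samples values in the promql_expr_test above can be checked by hand. A minimal sketch of the arithmetic, assuming promtool spaces the input_series samples at the test's interval of 1m:)

```python
# coredns_nodecache_setup_errors_total rises by 1 per minute, so over any
# full 5m window rate() returns 1 increase per minute = 1/60 per second.
setup_errors_rate = 1 / 60
print(setup_errors_rate)  # ~1.6667e-02, the value expected at eval_time: 5m

# SERVFAIL rises by 60 per minute and NOERROR by 120 per minute, giving
# per-second rates of 1 and 2 respectively at eval_time: 10m.
servfail_rate = 60 / 60   # 1.0
noerror_rate = 120 / 60   # 2.0
```

(The test file itself is run with `promtool test rules <file>`.)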

----

groups:
- name: NodeLocalDNS
  rules:
  - alert: NodeLocalDNSSetupErrorsHigh
    labels:
      severity: critical
    for: 5m
    expr: |
      rate(coredns_nodecache_setup_errors_total{}[5m]) > 0
    annotations:
      summary: "NodeLocalDNS setup errors on {{ $externalLabels.cluster }}:{{ $labels.pod }}"
      description: "{{ $externalLabels.cluster }}:{{ $labels.pod }} There are {{ $labels.errortype }} errors setting up NodeLocalDNS"


As you can see, I run the PromQL test rate(coredns_nodecache_setup_errors_total{}[5m]), which evaluates to about 1.67e-02. So when I test NodeLocalDNSSetupErrorsHigh, which should fire when that value stays above 0 for a 5-minute period, the test only passes if I set eval_time to 6m, and fails if I set it to 5m (the alert doesn't fire).

What is the relation between the for time in the alert rule itself and the eval_time in the test?

David Leadbeater

Oct 23, 2020, 6:47:14 PM10/23/20
to Jimmy the Greek, Prometheus Users
On Fri, 23 Oct 2020 at 22:41, Jimmy the Greek <matthe...@gmail.com> wrote:
[...]
> As you can see, I run the PromQL test rate(coredns_nodecache_setup_errors_total{}[5m]), which evaluates to about 1.67e-02. So when I test NodeLocalDNSSetupErrorsHigh, which should fire when that value stays above 0 for a 5-minute period, the test only passes if I set eval_time to 6m, and fails if I set it to 5m (the alert doesn't fire).
>
> What is the relation between the for time in the alert rule itself and the eval_time in the test?

In this case "for: 5m" means that the alert expression has to evaluate
true for 5 minutes. Because rate() needs two samples in order to
calculate a rate, your rate() expression starts returning a value at 1m;
the alert is pending from then, so when your rules are evaluated at 6m
the alert starts firing.
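That timing can be sketched as a quick calculation (assuming, as in the test file above, that the series start at t=0 with 1m sample spacing):

```python
# rate() needs two samples, so the expression first returns a value at 1m.
first_rate_result_min = 1
# The rule's "for: 5m" keeps the alert pending for 5 minutes after that.
for_duration_min = 5
first_firing_min = first_rate_result_min + for_duration_min
print(first_firing_min)  # 6 -> the alert only fires with eval_time: 6m
```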

Aside: with the promtool from 2.22.0 it's now possible to look at the
ALERTS time series, including pending alerts where the "for" threshold
hasn't been reached. I wouldn't recommend actually testing the "for"
threshold in rules in most cases (you're then kind of testing
Prometheus rather than your rules), but it is possible to temporarily
add a test for debugging like:

    - expr: ALERTS{alertstate="pending"}
      eval_time: 5m

Its failure output will tell you that your alert is pending at that
point, e.g.:

expr: "ALERTS{alertstate=\"pending\"}", time: 5m,
exp:"nil"
got:"{__name__=\"ALERTS\",
alertname=\"NodeLocalDNSSetupErrorsHigh\", alertstate=\"pending\",
errortype=\"configmap\", pod=\"unit-test\", severity=\"critical\"}
1E+00"

David