How to debug prometheus_rule_evaluation_failures_total? Prometheus is failing rule evaluations


Evelyn Pereira Souza

Apr 22, 2021, 1:06:30 PM
to Prometheus Users
Hi

We are constantly getting these alerts:

{group="local", instance="localhost:9090", job="prometheus",
rule_group="/etc/config/prometheus-rules.yml;node.rules"}
10
{group="local", instance="localhost:9090", job="prometheus",
rule_group="/etc/config/prometheus-rules.yml;prometheus"}
10

Source:

https://github.com/prometheus-operator/kube-prometheus/blob/0cb0c49186fbf580825746bd1756ebbd32067d81/manifests/prometheus-prometheusRule.yaml#L184-L193

expr: |
  increase(prometheus_rule_evaluation_failures_total{job="prometheus-k8s",namespace="monitoring"}[5m])
  > 0

The runbook is empty:

runbook_url:
https://github.com/prometheus-operator/kube-prometheus/wiki/prometheusrulefailures

How can I debug this alert?

kind regards
Evelyn

Matthias Rampke

Apr 22, 2021, 2:20:31 PM
to Evelyn Pereira Souza, Prometheus Users
Your best starting point is the rules page of the Prometheus UI (:9090/rules). It will show the error. You can also evaluate the rule expression yourself, using the UI, or maybe using PromLens to help debug expression issues.
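The failure metric itself can also narrow things down. As a minimal sketch (nothing here beyond the metric already named in this thread), this query in the expression browser shows which rule groups are currently failing:

```promql
# Per-rule-group failure rate over the last 5 minutes;
# any group that returns a value is the one to inspect on /rules
sum by (rule_group) (rate(prometheus_rule_evaluation_failures_total[5m])) > 0
```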

/MR

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/819b814c-c038-9860-ff26-89de893a0ac1%40disroot.org.

Evelyn Pereira Souza

Apr 23, 2021, 12:08:56 AM
to promethe...@googlegroups.com
On 22.04.21 20:20, Matthias Rampke wrote:
> Your best starting point is the rules page of the Prometheus UI
> (:9090/rules). It will show the error. You can also evaluate the rule
> expression yourself, using the UI, or maybe using PromLens to help debug
> expression issues.
>
> /MR

:9090/rules shows these two errors, both "found duplicate series for the match group":

I think we may have a problem with the federation config.

alert: PrometheusRemoteWriteBehind
expr: |
  (max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds[5m])
    - on(job, instance) group_right()
  max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds[5m]))
  > 120
for: 15m
labels:
  severity: critical
annotations:
  description: Prometheus {{$labels.namespace}}/{{$labels.pod}} remote write
    is {{ printf "%.1f" $value }}s behind for {{ $labels.remote_name}}:{{ $labels.url }}.
  summary: Prometheus remote write is behind.


found duplicate series for the match group
{instance="prometheus.slash-dir-poc-in.kuber.example.org:9090",
job="federate"} on the left hand-side of the operation: [{cluster="poc",
endpoint="web", exported_instance="x.x.x.x:9090",
exported_job="prometheus-k8s",
instance="prometheus.slash-dir-poc-in.kuber.example.org:9090",
job="federate", namespace="monitoring", pod="prometheus-k8s-1",
prometheus="monitoring/k8s", prometheus_replica="prometheus-k8s-0",
service="prometheus-k8s", team="MY-TEAM-NAME"}, {cluster="poc",
endpoint="web", exported_instance="x.x.x.x:9090",
exported_job="prometheus-k8s",
instance="prometheus.slash-dir-poc-in.kuber.example.org:9090",
job="federate", namespace="monitoring", pod="prometheus-k8s-0",
prometheus="monitoring/k8s", prometheus_replica="prometheus-k8s-0",
service="prometheus-k8s", team="MY-TEAM-NAME"}];many-to-many matching
not allowed: matching labels must be unique on one side


and

record: node:node_num_cpu:sum
expr: |
  count by (cluster, node) (
    sum by (node, cpu) (
      node_cpu_seconds_total{job="node-exporter"}
      * on(namespace, pod) group_left(node)
      node_namespace_pod:kube_pod_info:
    )
  )


found duplicate series for the match group {namespace="monitoring",
pod="prometheus-k8s-0"} on the right hand-side of the operation:
[{__name__="node_namespace_pod:kube_pod_info:", cluster="preprod",
instance="prometheus.ep-preprod-in.kuber.example.org:9090",
job="federate", namespace="monitoring",
node="4516e9ed-4917-4792-ad49-2158775dc07e", pod="prometheus-k8s-0",
prometheus="monitoring/k8s", prometheus_replica="prometheus-k8s-1",
team="MY-TEAM-NAME"}, {__name__="node_namespace_pod:kube_pod_info:",
cluster="poc",
instance="prometheus.slash-dir-poc-in.kuber.example.org:9090",
job="federate", namespace="monitoring",
node="602efe91-2eb5-466f-9350-c4c6ce35119a", pod="prometheus-k8s-0",
prometheus="monitoring/k8s", prometheus_replica="prometheus-k8s-0",
team="MY-TEAM-NAME"}];many-to-many matching not allowed: matching labels
must be unique on one side
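Errors like these can be reproduced in the expression browser. As a sketch (using the same metric and match labels as the remote-write alert above), this lists the (job, instance) match groups that have more than one series on the side that must be unique:

```promql
# Match groups with duplicate series on the "one" side of the
# remote-write alert; any result here reproduces the rule error
count by (job, instance) (
  max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds[5m])
) > 1
```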

This alert also fires:

name: PrometheusOutOfOrderTimestamps
expr: rate(prometheus_target_scrapes_sample_out_of_order_total[5m]) > 0

We may have a problem with federation: we have an external Prometheus
which federates from four Kubernetes cluster Prometheus instances.

Config:

- job_name: federate
  scrape_interval: 15s
  scrape_timeout: 15s
  honor_labels: false
  metrics_path: /federate
  scheme: https
  tls_config:
    insecure_skip_verify: true
  params:
    'match[]':
      - '{__name__=~".+"}'
  file_sd_configs:
    - files:
        - k8s.yml
  relabel_configs:
    - source_labels: [__address__]
      regex: (.*)
      replacement: ${1}:9090
      target_label: __address__


- labels:
    cluster: poc
    team: MY-TEAM-NAME
  targets:
    - prometheus.slash-dir-poc-in.kuber.example.org
- labels:
    cluster: devtest
    team: MY-TEAM-NAME
  targets:
    - prometheus.slash-dir-devtest-in.kuber.example.org
- labels:
    cluster: preprod
    team: MY-TEAM-NAME
  targets:
    - prometheus.ep-preprod-in.kuber.example.org
- labels:
    cluster: prod
    team: MY-TEAM-NAME
  targets:
    - prometheus.ep-prod-in.kuber.example.org

kind regards
Evelyn
OpenPGP_0x61776FA8E38403FB.asc
OpenPGP_signature

Matthias Rampke

Apr 23, 2021, 2:35:29 PM
to Evelyn Pereira Souza, Prometheus Users
It seems like you are federating through an ingress or load balancer that balances over multiple Prometheus server replicas. Either federate from each separately, or make sure that you only get responses from one consistently.
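As a sketch of "federate from each separately" (hostnames are hypothetical; the point is one file_sd target per Prometheus replica rather than one per load-balanced ingress):

```yaml
# Hypothetical per-replica hostnames -- adapt to however each replica
# is actually reachable from the external Prometheus
- labels:
    cluster: poc
    team: MY-TEAM-NAME
  targets:
    - prometheus-0.slash-dir-poc-in.kuber.example.org
    - prometheus-1.slash-dir-poc-in.kuber.example.org
```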

As an alternative to global federation, consider Thanos; it scales further and handles this situation out of the box.

/MR


Evelyn Pereira Souza

Apr 24, 2021, 2:58:04 AM
to promethe...@googlegroups.com
On 23.04.21 20:35, Matthias Rampke wrote:
> It seems like you are federating through an ingress or load balancer
> that balances over multiple Prometheus server replicas. Either federate
> from each separately, or make sure that you only get responses from one
> consistently.
>
> As an alternative to the global federation, consider Thanos, it scales
> further and handles this situation out of the box.
>
> /MR

Thank you, I will check this. It sounds right.

The official docs
(https://prometheus.io/docs/prometheus/latest/federation/) show a
different federation config from the one I use:

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'

Do I also need to modify that?

At the moment I use:

- job_name: federate
  scrape_interval: 15s
  scrape_timeout: 15s
  honor_labels: false
  metrics_path: /federate
  scheme: https
  tls_config:
    insecure_skip_verify: true
  params:
    'match[]':
      - '{__name__=~".+"}'

best regards
Evelyn

Matthias Rampke

May 1, 2021, 11:01:49 AM
to Evelyn Pereira Souza, Prometheus Users
That looks good; I think the issue is which target(s) you discover for these jobs.

If you scrape Prometheus directly you may have to change the TLS settings depending on your configuration.
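For instance (paths and server name are hypothetical), verifying the server certificate instead of skipping verification would look something like:

```yaml
# Sketch with hypothetical paths: verify the certificate rather than
# using insecure_skip_verify: true when scraping the replicas directly
tls_config:
  ca_file: /etc/prometheus/ca.crt
  server_name: prometheus.slash-dir-poc-in.kuber.example.org
```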

/MR


pin...@hioscar.com

May 13, 2021, 4:02:04 PM
to Prometheus Users
We are facing an issue where rules fail sporadically from time to time. Are these errors logged somewhere if they cannot be found in the UI? Thanks