How to debug prometheus_rule_evaluation_failures_total? Prometheus is failing rule evaluations


Evelyn Pereira Souza

Apr 22, 2021, 1:06:30 PM
to Prometheus Users
Hi

We are constantly getting these alerts:

{group="local", instance="localhost:9090", job="prometheus",
rule_group="/etc/config/prometheus-rules.yml;node.rules"}
10
{group="local", instance="localhost:9090", job="prometheus",
rule_group="/etc/config/prometheus-rules.yml;prometheus"}
10

Source:

https://github.com/prometheus-operator/kube-prometheus/blob/0cb0c49186fbf580825746bd1756ebbd32067d81/manifests/prometheus-prometheusRule.yaml#L184-L193

expr: |
  increase(prometheus_rule_evaluation_failures_total{job="prometheus-k8s",namespace="monitoring"}[5m])
  > 0

The runbook is empty:

runbook_url:
https://github.com/prometheus-operator/kube-prometheus/wiki/prometheusrulefailures

How can I debug this alert?

kind regards
Evelyn

Matthias Rampke

Apr 22, 2021, 2:20:31 PM
to Evelyn Pereira Souza, Prometheus Users
Your best starting point is the rules page of the Prometheus UI (:9090/rules). It will show the error. You can also evaluate the rule expression yourself, using the UI, or maybe using PromLens to help debug expression issues.
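The failure metric itself can also narrow things down. As a minimal sketch (nothing here beyond the metric already named in this thread), this query in the expression browser shows which rule groups are currently failing:

```promql
# Per-rule-group failure rate over the last 5 minutes;
# any group that returns a value is the one to inspect on /rules
sum by (rule_group) (rate(prometheus_rule_evaluation_failures_total[5m])) > 0
```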

/MR

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-use...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/819b814c-c038-9860-ff26-89de893a0ac1%40disroot.org.

Evelyn Pereira Souza

Apr 23, 2021, 12:08:56 AM
to promethe...@googlegroups.com
On 22.04.21 20:20, Matthias Rampke wrote:
> Your best starting point is the rules page of the Prometheus UI
> (:9090/rules). It will show the error. You can also evaluate the rule
> expression yourself, using the UI, or maybe using PromLens to help debug
> expression issues.
>
> /MR

:9090/rules shows these two errors, both "found duplicate series for the match group":

I think we may have a problem with the federation config.

alert: PrometheusRemoteWriteBehind
expr: |
  (max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds[5m])
    - on(job, instance) group_right()
  max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds[5m]))
  > 120
for: 15m
labels:
  severity: critical
annotations:
  description: Prometheus {{$labels.namespace}}/{{$labels.pod}} remote write
    is {{ printf "%.1f" $value }}s behind for {{ $labels.remote_name}}:{{ $labels.url }}.
  summary: Prometheus remote write is behind.


found duplicate series for the match group
{instance="prometheus.slash-dir-poc-in.kuber.example.org:9090",
job="federate"} on the left hand-side of the operation: [{cluster="poc",
endpoint="web", exported_instance="x.x.x.x:9090",
exported_job="prometheus-k8s",
instance="prometheus.slash-dir-poc-in.kuber.example.org:9090",
job="federate", namespace="monitoring", pod="prometheus-k8s-1",
prometheus="monitoring/k8s", prometheus_replica="prometheus-k8s-0",
service="prometheus-k8s", team="MY-TEAM-NAME"}, {cluster="poc",
endpoint="web", exported_instance="x.x.x.x:9090",
exported_job="prometheus-k8s",
instance="prometheus.slash-dir-poc-in.kuber.example.org:9090",
job="federate", namespace="monitoring", pod="prometheus-k8s-0",
prometheus="monitoring/k8s", prometheus_replica="prometheus-k8s-0",
service="prometheus-k8s", team="MY-TEAM-NAME"}];many-to-many matching
not allowed: matching labels must be unique on one side


and

record: node:node_num_cpu:sum
expr: |
  count by (cluster, node) (
    sum by (node, cpu) (
      node_cpu_seconds_total{job="node-exporter"}
      * on(namespace, pod) group_left(node)
      node_namespace_pod:kube_pod_info:
    )
  )


found duplicate series for the match group {namespace="monitoring",
pod="prometheus-k8s-0"} on the right hand-side of the operation:
[{__name__="node_namespace_pod:kube_pod_info:", cluster="preprod",
instance="prometheus.ep-preprod-in.kuber.example.org:9090",
job="federate", namespace="monitoring",
node="4516e9ed-4917-4792-ad49-2158775dc07e", pod="prometheus-k8s-0",
prometheus="monitoring/k8s", prometheus_replica="prometheus-k8s-1",
team="MY-TEAM-NAME"}, {__name__="node_namespace_pod:kube_pod_info:",
cluster="poc",
instance="prometheus.slash-dir-poc-in.kuber.example.org:9090",
job="federate", namespace="monitoring",
node="602efe91-2eb5-466f-9350-c4c6ce35119a", pod="prometheus-k8s-0",
prometheus="monitoring/k8s", prometheus_replica="prometheus-k8s-0",
team="MY-TEAM-NAME"}];many-to-many matching not allowed: matching labels
must be unique on one side
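Errors like these can be reproduced in the expression browser. As a sketch (using the same metric and match labels as the remote-write alert above), this lists the (job, instance) match groups that have more than one series on the side that must be unique:

```promql
# Match groups with duplicate series on the "one" side of the
# remote-write alert; any result here reproduces the rule error
count by (job, instance) (
  max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds[5m])
) > 1
```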

This alert also fires:

name: PrometheusOutOfOrderTimestamps
expr: rate(prometheus_target_scrapes_sample_out_of_order_total[5m]) > 0

We may have a problem with federation: we have an external Prometheus
which federates from four Kubernetes cluster Prometheus instances.

Config:

- job_name: federate
  scrape_interval: 15s
  scrape_timeout: 15s
  honor_labels: false
  metrics_path: /federate
  scheme: https
  tls_config:
    insecure_skip_verify: true
  params:
    'match[]':
      - '{__name__=~".+"}'
  file_sd_configs:
    - files:
        - k8s.yml
  relabel_configs:
    - source_labels: [__address__]
      regex: (.*)
      replacement: ${1}:9090
      target_label: __address__


- labels:
    cluster: poc
    team: MY-TEAM-NAME
  targets:
    - prometheus.slash-dir-poc-in.kuber.example.org
- labels:
    cluster: devtest
    team: MY-TEAM-NAME
  targets:
    - prometheus.slash-dir-devtest-in.kuber.example.org
- labels:
    cluster: preprod
    team: MY-TEAM-NAME
  targets:
    - prometheus.ep-preprod-in.kuber.example.org
- labels:
    cluster: prod
    team: MY-TEAM-NAME
  targets:
    - prometheus.ep-prod-in.kuber.example.org

kind regards
Evelyn
OpenPGP_0x61776FA8E38403FB.asc
OpenPGP_signature

Matthias Rampke

Apr 23, 2021, 2:35:29 PM
to Evelyn Pereira Souza, Prometheus Users
It seems like you are federating through an ingress or load balancer that balances over multiple Prometheus server replicas. Either federate from each separately, or make sure that you only get responses from one consistently.
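As a sketch of "federate from each separately" (hostnames are hypothetical; the point is one file_sd target per Prometheus replica rather than one per load-balanced ingress):

```yaml
# Hypothetical per-replica hostnames -- adapt to however each replica
# is actually reachable from the external Prometheus
- labels:
    cluster: poc
    team: MY-TEAM-NAME
  targets:
    - prometheus-0.slash-dir-poc-in.kuber.example.org
    - prometheus-1.slash-dir-poc-in.kuber.example.org
```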

As an alternative to global federation, consider Thanos; it scales further and handles this situation out of the box.

/MR


Evelyn Pereira Souza

Apr 24, 2021, 2:58:04 AM
to promethe...@googlegroups.com
On 23.04.21 20:35, Matthias Rampke wrote:
> It seems like you are federating through an ingress or load balancer
> that balances over multiple Prometheus server replicas. Either federate
> from each separately, or make sure that you only get responses from one
> consistently.
>
> As an alternative to the global federation, consider Thanos, it scales
> further and handles this situation out of the box.
>
> /MR

Thank you, I will check this. It sounds right.

The official docs
(https://prometheus.io/docs/prometheus/latest/federation/) show a
different federation config from the one I use:

scrape_configs:
  - job_name: 'federate'
    scrape_interval: 15s
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="prometheus"}'
        - '{__name__=~"job:.*"}'

Do I also need to modify that?

At the moment I use:

- job_name: federate
  scrape_interval: 15s
  scrape_timeout: 15s
  honor_labels: false
  metrics_path: /federate
  scheme: https
  tls_config:
    insecure_skip_verify: true
  params:
    'match[]':
      - '{__name__=~".+"}'

best regards
Evelyn

Matthias Rampke

May 1, 2021, 11:01:49 AM
to Evelyn Pereira Souza, Prometheus Users
That looks good; I think the issue is which target(s) you discover for these jobs.

If you scrape Prometheus directly you may have to change the TLS settings depending on your configuration.
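For instance (paths and server name are hypothetical), verifying the server certificate instead of skipping verification would look something like:

```yaml
# Sketch with hypothetical paths: verify the certificate rather than
# using insecure_skip_verify: true when scraping the replicas directly
tls_config:
  ca_file: /etc/prometheus/ca.crt
  server_name: prometheus.slash-dir-poc-in.kuber.example.org
```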

/MR


pin...@hioscar.com

May 13, 2021, 4:02:04 PM
to Prometheus Users
We are facing an issue where rules fail sporadically from time to time. Are these errors logged somewhere if they cannot be found in the UI? Thanks