My Query never fires an Alarm


Jay P

Aug 21, 2024, 11:19:34 AM
to Prometheus Users
I am not new to Prometheus; however, I wrote the following rule, which never fires. (Alertmanager and all other settings are fine, since I get alerts for other rules, just not this one.)

Attached is the screenshot, and I am copy-pasting the query and results here as well.

confluent_kafka_server_consumer_lag_offsets{job="confluent-cloud"} > 1

Results:
confluent_kafka_server_consumer_lag_offsets{consumer_group_id="XXX", instance="api.telemetry.confluent.cloud:443", job="confluent-cloud", kafka_id="XXX", topic="XXX"}
2
confluent_kafka_server_consumer_lag_offsets{consumer_group_id="XXX", instance="api.telemetry.confluent.cloud:443", job="confluent-cloud", kafka_id="XXX", topic="XXX"}
3

Any help is greatly appreciated. Thank you
Screenshot 2024-08-21 093941.png

Daz Wilkin

Aug 21, 2024, 2:51:12 PM
to Prometheus Users
Please include the rule.

You've shown that the query returns results, which is necessary but not sufficient.
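
For reference, an expression only produces an alert once it is wrapped in a rule group inside a file listed under rule_files. A minimal sketch (the group and alert names here are placeholders, and the for: duration is illustrative, not taken from your setup):

groups:
  - name: example-confluent-rules          # placeholder group name
    rules:
      - alert: ConsumerLagAboveThreshold   # placeholder alert name
        expr: confluent_kafka_server_consumer_lag_offsets{job="confluent-cloud"} > 1
        for: 5m                            # optional: condition must hold across evaluations before firing
        labels:
          severity: critical
        annotations:
          description: "Consumer lag is above the threshold"

With a non-zero for:, the alert sits in the pending state until the condition has held for that long across rule evaluations.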

Jay

Aug 21, 2024, 4:10:45 PM
to Daz Wilkin, Prometheus Users
Greetings Team
Yes, sure. Please see below.

labels:
  severity: critical
annotations:
  description: The consumer lags for Dev client`


global:
  scrape_interval: 1m
  scrape_timeout: 1m
  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4
  evaluation_interval: 5m
alerting:
  alertmanagers:
  - follow_redirects: true
    enable_http2: true
    scheme: https
    timeout: 10s
    api_version: v2
    static_configs:
    - targets:
      - app-promalertmanager2-d.ase-appintd.appserviceenvironment.net:443
rule_files:
- /etc/prometheus/rules/kafka/confluent.yml
scrape_configs:
- job_name: confluent-cloud
  honor_timestamps: true
  track_timestamps_staleness: false
  params:
    resource.kafka.id:
    - lkc-xxx
    - lkc-xxx
    - lkc-xxx
    resource.schema_registry.id:
    - lsrc-xxx
    - lsrc-xxx
    - lsrc-xxx
  scrape_interval: 1m
  scrape_timeout: 1m
  scrape_protocols:
  - OpenMetricsText1.0.0
  - OpenMetricsText0.0.1
  - PrometheusText0.0.4
  metrics_path: /v2/metrics/cloud/export
  scheme: https
  enable_compression: true
  basic_auth:
    username: XXXXXXX
    password: <secret>
  follow_redirects: true
  enable_http2: true
  static_configs:
  - targets:
    - api.telemetry.confluent.cloud

Jay

Aug 21, 2024, 4:13:10 PM
to Daz Wilkin, Prometheus Users
The last one was from the Prometheus server's status page, but here is my actual file.

global:
  scrape_interval: 1m # By default, scrape targets every 1 minute.
  scrape_timeout: 1m
  evaluation_interval: 5m # How frequently to evaluate rules

# Alertmanager configuration
alerting:
  alertmanagers:
    - scheme: https
      static_configs:
        - targets:
            - "${NFM_ALERT_MANAGER_URL}"

# Load rules once and periodically evaluate them according to the global 'evaluation_interval'.
rule_files:
  - "/etc/prometheus/rules/kafka/confluent.yml"
  - "/etc/prometheus/rules/kafka/connect.yml"
  - "/etc/prometheus/rules/kafka/uptime.yml"
  - "/etc/prometheus/rules/observability/uptime.yml"

# A scrape configuration containing exactly one endpoint to scrape:
# Here it's Prometheus itself.
scrape_configs:
  # have prometheus scrape itself to gather internal metrics
  - job_name: 'prometheus'
    scrape_interval: 1m
    scheme: http
    static_configs:
      - targets:
        - 'localhost:9090'

  - job_name: 'alertmanager'
    scrape_interval: 1m
    metrics_path: /metrics
    scheme: https
    static_configs:
      - targets:
        - "${NFM_ALERT_MANAGER_URL}"

Jay

Aug 21, 2024, 4:24:58 PM
to Daz Wilkin, Prometheus Users
Here is the text:

groups:
  - name: confluent-rules
    rules:

    - alert: Dev-NotEqualtoBoolZero
      expr: confluent_kafka_server_consumer_lag_offsets{job="confluent-cloud"} > 100
      labels:
        severity: critical
      annotations:
        description: "The consumer lags for Dev client`"

Brian Candler

Aug 22, 2024, 11:20:17 AM
to Prometheus Users
Your test example in the PromQL browser has:
confluent_kafka_server_consumer_lag_offsets{job="confluent-cloud"} > 1
and the values were 2 and 3, but the alerting expression has
confluent_kafka_server_consumer_lag_offsets{job="confluent-cloud"} > 100
So clearly it's not going to trigger under that condition, when the lags are less than 100.

If that's not the problem, then you need to determine: is the rule not firing, or is Alertmanager not sending the alert?

To do this, check the Prometheus web interface under the Alerts tab. Is there a firing alert there? If yes, focus your investigation on the Alertmanager side (e.g. check the Alertmanager logs). If no, drill further into the expression, although if the same expression shows a non-empty result in the PromQL query interface, it certainly should be able to fire an alert.
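
A further cross-check that works from the same expression browser: Prometheus exposes every pending or firing alert as the built-in ALERTS metric, so a query like

ALERTS{alertname="Dev-NotEqualtoBoolZero"}

returning a series with alertstate="pending" or alertstate="firing" would confirm the rule is being evaluated and is triggering.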

Jay

Aug 22, 2024, 11:53:29 AM
to Brian Candler, Prometheus Users
Brian
   Let me answer you in bullet points:
1. I have tried the expression with both, i.e. > 1 and > 100. Neither fires.
2. I am getting other alerts through Alertmanager, for example instance up/down, so it's not the Alertmanager.

The expression shows non-empty results in the PromQL query interface, but it still doesn't fire.


Brian Candler

Aug 23, 2024, 3:38:10 AM
to Prometheus Users
> 2. I am getting other alerts through Alertmanager, for example instance up/down, so it's not the Alertmanager.

No, that does not necessarily follow (e.g. different alerts can carry different labels and be processed differently by Alertmanager routing rules).

Please determine whether Prometheus is sending alerts to Alertmanager by checking in the Prometheus web interface under the "Alerts" tab.  Then we can focus on either Prometheus or Alertmanager configuration.
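
To illustrate the routing point (a purely hypothetical Alertmanager snippet, not your actual configuration): label-based routing can send different alerts to different receivers, so one class of alert being delivered says nothing about another:

route:
  receiver: default-mail            # hypothetical fallback receiver
  routes:
    - matchers:
        - alertname = "InstanceDown"
      receiver: oncall-pager        # e.g. up/down alerts take this branch
    - matchers:
        - severity = "critical"
      receiver: critical-mail       # other critical alerts take a different branch
receivers:                          # receivers left unconfigured in this sketch
  - name: default-mail
  - name: oncall-pager
  - name: critical-mail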

Jay

Aug 23, 2024, 9:45:48 AM
to Brian Candler, Prometheus Users
Brian
"Please determine whether Prometheus is sending alerts to Alertmanager by checking in the Prometheus web interface under the "Alerts" tab.  Then we can focus on either Prometheus or Alertmanager configuration."
The Prometheus Alerts tab is empty; I never see an alert there for this alert rule. I do see alerts for UP. Also, the labels for ALL rules are the same: critical.

Brian Candler

Aug 24, 2024, 9:35:43 AM
to Prometheus Users
Your alert has an odd name for its purpose ("alert: Dev-NotEqualtoBoolZero"). Is it possible you're using the same name for another alerting rule? Or maybe an alert name containing a dash is problematic, although I don't remember this being a problem.

What version of Prometheus are you running?

If you go to the web interface in the "Alerts" tab, you should be able to view green "Inactive" alerts. Is your alerting rule shown there? If you click on the ">" to expand it, do you see the rule you were expecting?

You could try copy-pasting the alert rule from this view directly into the PromQL browser, just in case some symbol is not what you expect it to be.

You could also try putting the whole expr in single quotes, or using the multi-line form:

    - alert: Dev-NotEqualtoBoolZero
      expr: |
        confluent_kafka_server_consumer_lag_offsets{job="confluent-cloud"} > 100
      labels:
        severity: critical
      annotations:
        description: "The consumer lags for Dev client`"

Those are the only things I can think of.