Prometheus alert tracking

kedar sirshikar

May 27, 2020, 2:53:55 PM
to Prometheus Users
Hi all,

We saw an alert trigger a couple of days ago, and it resolved itself after 5 seconds.

After checking the container logs and grafana screenshots, we found no evidence supporting the alert that was created and then resolved.

The system has not shown any symptom relevant to this particular alert since last week.

So I wanted to know: is there anything that can be tracked on the prometheus side to find out why the alert was generated?

Please let me know if anyone has any input on these kinds of suspicious alerts.

Thanks.

Sally Lehman

May 28, 2020, 2:27:18 AM
to Prometheus Users
Hi Kedar! What evidence do you have that the alert was triggered outside of prometheus, and how far can you follow it back? We could start there.

You can turn on debug logging for both prometheus and alertmanager with --log.level=debug; that may tell you more.
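For example (just a sketch; adjust the binary paths and config file locations to however prometheus and alertmanager are actually started in your environment, the flag itself is the same for both daemons):

# prometheus 2.x
prometheus --config.file=/etc/prometheus/prometheus.yml --log.level=debug

# alertmanager
alertmanager --config.file=/etc/alertmanager/alertmanager.yml --log.level=debug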

Sally 

kedar sirshikar

May 28, 2020, 3:42:34 AM
to Prometheus Users
Thank you, Sally, for your reply. However, turning on debug logging may help with troubleshooting future alert occurrences, but not past ones.
Is there any way to verify whether prometheus may have generated a delayed or incorrect alert?

Brian Candler

May 28, 2020, 5:00:04 PM
to Prometheus Users
Was the alert an E-mail? If so, it will include full details, such as the alert name (which you can cross-reference against your alerting rules) and a link to prometheus where you can inspect the query.

kedar sirshikar

May 29, 2020, 12:12:16 AM
to Prometheus Users
Thank you, Brian, for the reply. The alert was not an email. I tracked down the alert rule and also inspected the specific metric in grafana; however, its value in grafana was never equal to the one referenced in the query.
That is why I am finding it tough to troubleshoot and perform any kind of root cause analysis.
Let me know if you have any suggestions.

Thanks,
Kedar.

Brian Candler

May 29, 2020, 2:22:21 AM
to Prometheus Users
Was this an alert generated by prometheus' alertmanager, or an alert generated by grafana's alerting system?

You said the alert was resolved "in 5 seconds", which sounds dubious. Maybe you have some extremely low interval configured for your alerting rules in prometheus?

Nonetheless, the history is all in prometheus (at least for the TSDB retention period - default 15 days).  You need to work out what expression generated the alert, and use PromQL to explore the data in prometheus.

That's all we can say, unless you show the content of the alert itself *and* the rule which you believe generated the alert *and* the data which backs up your assertion that there was no triggering data in that period.
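As a starting point, queries along these lines against the prometheus HTTP API will show you both the alert's own history and the underlying data. This is only a sketch: substitute your real alert name, expression, time window and prometheus address. The ALERTS series is written by prometheus itself whenever one of its alerting rules is pending or firing, so it records when prometheus considered the alert active.

# When (if ever) did prometheus itself consider the alert active?
curl -sG 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=ALERTS{alertname="YOUR_ALERT_NAME"}' \
  --data-urlencode 'start=2020-05-25T00:00:00Z' \
  --data-urlencode 'end=2020-05-26T00:00:00Z' \
  --data-urlencode 'step=15s'

# What did the underlying expression evaluate to over the same window?
curl -sG 'http://localhost:9090/api/v1/query_range' \
  --data-urlencode 'query=YOUR_ALERT_EXPRESSION' \
  --data-urlencode 'start=2020-05-25T00:00:00Z' \
  --data-urlencode 'end=2020-05-26T00:00:00Z' \
  --data-urlencode 'step=15s'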

kedar sirshikar

Jun 1, 2020, 7:16:17 PM
to Prometheus Users
Hi Brian,
I thought this issue would not occur again; however, we witnessed the alert again on Saturday at 09:12:18 UTC, so I am requesting your guidance.

Alert configuration is as below:

admin@orchestrator[nd2bwa6drm01v]# show running-config alert rule PROCESS_STATE 
alert rule PROCESS_STATE
 expression         "docker_service_up==1 or docker_service_up==3"
 event-host-label   container_name
 message            "{{ $labels.service_name }} instance {{ $labels.module_instance }} of module {{ $labels.module }} is in Aborted state !"
 snmp-facility      application
 snmp-severity      critical
 snmp-clear-message "{{ $labels.service_name }} instance {{ $labels.module_instance }} of module {{ $labels.module }} is moved from Aborted state !"
!
admin@orchestrator[nd2bwa6drm01v]# 


Recent alert details:

NAME           EVENT HOST           STATUS    MESSAGE                                                                             CREATE TIME                    RESOLVE TIME                   UPDATE TIME
PROCESS_STATE  haproxy-common-s101  resolved  haproxy-common instance 101 of module haproxy-common is moved from Aborted state !  2020-05-30T09:12:18.643+00:00  2020-05-30T09:12:33.617+00:00  2020-05-30T09:27:38.659+00:00
PROCESS_STATE  haproxy-common-s103  resolved  haproxy-common instance 103 of module haproxy-common is moved from Aborted state !  2020-05-30T09:12:18.644+00:00  2020-05-30T09:12:33.619+00:00  2020-05-30T09:27:38.66+00:00

Per your last suggestion, I have also verified the output below, but it does not show the 'docker_service_up' metric set to either 1 or 3 (the values for which the alert is configured).

Please let me know if you have any comments or opinions.

Brian Candler

Jun 2, 2020, 2:55:10 AM
to Prometheus Users
On Tuesday, 2 June 2020 00:16:17 UTC+1, kedar sirshikar wrote:
Alert configuration is as below:

admin@orchestrator[nd2bwa6drm01v]# show running-config alert rule PROCESS_STATE 
alert rule PROCESS_STATE
 expression         "docker_service_up==1 or docker_service_up==3"
 event-host-label   container_name
 message            "{{ $labels.service_name }} instance {{ $labels.module_instance }} of module {{ $labels.module }} is in Aborted state !"
 snmp-facility      application
 snmp-severity      critical
 snmp-clear-message "{{ $labels.service_name }} instance {{ $labels.module_instance }} of module {{ $labels.module }} is moved from Aborted state !"
!

Could you explain what software and platform/OS you are running?

This "show running-config" command doesn't look like any flavour of prometheus I'm familiar with.  Is this some version of prometheus embedded in another system?  If so, do you have any way to determine what the underlying version of prometheus is?

Also, regular prometheus doesn't generate events directly.  It generates HTTP calls to alertmanager, which processes those events.
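That CLI layer aside, for comparison here is roughly what your rule would look like written as a native prometheus alerting rule. Note the "for:" clause: stock rules normally carry a hold time so that a single odd sample cannot fire and then resolve an alert within a few seconds. This is only a sketch (the file name and the one-minute hold are illustrative, and I don't know whether your product exposes an equivalent setting):

# Illustrative native equivalent of your PROCESS_STATE rule, not your product's actual config
cat > process_state.rules.yml <<'EOF'
groups:
  - name: process_state
    rules:
      - alert: PROCESS_STATE
        expr: docker_service_up == 1 or docker_service_up == 3
        for: 1m    # the expression must hold for a full minute before the alert fires
        labels:
          severity: critical
        annotations:
          summary: '{{ $labels.service_name }} instance {{ $labels.module_instance }} of module {{ $labels.module }} is in Aborted state'
EOF

# promtool ships with prometheus and can validate the rule file
promtool check rules process_state.rules.yml

If you can extract the rule definitions that the orchestrator actually feeds into prometheus and compare them against something like the above, that should tell you whether any hold time is being applied at all.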

kedar sirshikar

Jun 2, 2020, 4:40:53 PM
to Prometheus Users
Please refer to the details below, captured from the prometheus container, regarding the OS/platform.

root@prometheus-hi-res-s101:/# /prometheus/prometheus --version
prometheus, version 2.3.1 (branch: HEAD, revision: 188ca45bd85ce843071e768d855722a9d9dabe03)
  build user:       root@82ef94f1b8f7
  build date:       20180619-15:56:22
  go version:       go1.10.3
root@prometheus-hi-res-s101:/#

root@prometheus-hi-res-s101:/# cat /etc/os-release
NAME="Ubuntu"
VERSION="16.04.2 LTS (Xenial Xerus)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 16.04.2 LTS"
VERSION_ID="16.04"
VERSION_CODENAME=xenial
UBUNTU_CODENAME=xenial
root@prometheus-hi-res-s101:/#


We have integrated Tail-f ConfD (https://www.tail-f.com/confd-basic/) as the CLI used to configure alert rules and monitor alert status.

As shown below, the alert resolved within 5 seconds on a few occurrences (the alert mentioned in my last email resolved in 15 seconds):

NAME           EVENT HOST           STATUS    MESSAGE                                                                             CREATE TIME                    RESOLVE TIME                   UPDATE TIME
PROCESS_STATE  haproxy-common-s109  resolved  haproxy-common instance 109 of module haproxy-common is moved from Aborted state !  2020-05-24T07:53:54.044+00:00  2020-05-24T07:53:59.057+00:00  2020-05-24T08:08:59.066+00:00
PROCESS_STATE  binding-s122         resolved  binding instance 122 of module binding is moved from Aborted state !               2020-06-01T23:45:43.997+00:00  2020-06-01T23:45:48.881+00:00  2020-06-02T00:00:48.849+00:00
The alert that resolved after 15 seconds can be explained, since we have supporting evidence for it in grafana; but for the alerts that resolved in 5 seconds, there is no proof in either the logs or grafana.
I am not sure whether this has something to do with the duration for which an alert remains active.

In parallel, I am continuing to investigate how our product itself handles alerts. If the details above give you any hint, please let me know.
Thank you.

Brian Candler

Jun 2, 2020, 5:03:37 PM
to Prometheus Users
On Tuesday, 2 June 2020 21:40:53 UTC+1, kedar sirshikar wrote:
Please refer to the details below, captured from the prometheus container, regarding the OS/platform.

root@prometheus-hi-res-s101:/# /prometheus/prometheus --version
prometheus, version 2.3.1 (branch: HEAD, revision: 188ca45bd85ce843071e768d855722a9d9dabe03)

That's almost 2 years old, so I'd suggest updating to 2.18.1 in case there's some relevant bug fix (there are also lots of other performance improvements).

Otherwise sorry - any data used to alert should also have been recorded in the time series database.
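If you want to dig further, two things worth pulling straight out of that embedded prometheus are (a) the alerting rules it has actually loaded and (b) the raw samples, with their true scrape timestamps, for the window in question, rather than a stepped range query. A sketch, assuming the standard HTTP API on port 9090; adjust the label matcher and timestamps to the series and window you care about (and check whether 2.3.1 already exposes the rules endpoint):

# Which alerting rules has this prometheus actually loaded, and what state are they in?
curl -s 'http://localhost:9090/api/v1/rules'

# Raw samples for the hour around the 2020-05-24 event (the one that resolved in 5 seconds),
# with their real scrape timestamps. An instant query over a range vector returns the samples
# as stored, whereas query_range re-evaluates the expression at every step.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=docker_service_up{module="haproxy-common"}[60m]' \
  --data-urlencode 'time=2020-05-24T08:30:00Z'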

kedar sirshikar

Jun 5, 2020, 12:53:17 AM
to Prometheus Users
Thanks. It definitely makes sense to upgrade the prometheus version.

As per the details analyzed so far, the time series DB data fetched from Prometheus via the API (curl 'http://localhost:9090/api/v1/query_range?query=docker_service_up&start=2020-05-24T07:20:00.000Z&end=2020-05-24T08:10:00.000Z&step=1s') does not justify the alert that lasted 5 seconds, whereas the same API does return TSDB records backing the alert that lasted 15 seconds.

This is where I am stuck: I cannot figure out why an alert was generated when the metric values in the TSDB never match the values used in the alert expression.

Thanks,
Kedar.