Alert Rules expression value not getting changed to expected value after evaluation_interval

Aarti Nagdev

Jun 4, 2020, 4:02:24 AM

to Prometheus Users

I am trying to implement scaling with Prometheus Alertmanager. I have configured a webhook receiver in Alertmanager to execute my ScaleUp and ScaleDown actions. I have the following settings in my configuration:
evaluation_interval: 1m
scrape_interval: 1s
There is no grouping in Alertmanager. The following metrics are used:
current_replica is a metric that gives the instantaneous number of running pods.
min_replica and max_replica are metrics that set the limits beyond which replicas must not be scaled down or up, respectively.
The alert rules are as follows:
1. Scaledown
expr: ((sum(rate(workmanager_completed_requests{weblogic_domainName="xface-domain",weblogic_clusterName="EPICluster",name="EPIClusterWorkManager",applicationName="WDTTestEAR"}[30s])) < bool sum(replica_exporter_target_avg_req_cr{cluster_name="EPICluster"})) + (sum(replica_exporter_current_replica_count{cluster_name="EPICluster"}) > bool sum(replica_exporter_min_count_cr{cluster_name="EPICluster"}))) == 2

[ ((tps < bool 5) + (current_replicas > bool min_replicas)) == 2 ]

2. Scaleup
expr: ((sum(rate(workmanager_completed_requests{applicationName="WDTTestEAR",name="EPIClusterWorkManager",weblogic_clusterName="EPICluster",weblogic_domainName="xface-domain"}[30s])) > bool sum(replica_exporter_target_avg_req_cr{cluster_name="EPICluster"})) + (sum(replica_exporter_current_replica_count{cluster_name="EPICluster"}) < bool sum(replica_exporter_max_count_cr{cluster_name="EPICluster"}))) == 2

[ ((tps > bool 5) + (current_replicas < bool max_replicas)) == 2 ]
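
The logic of the two rules above can be modelled outside Prometheus as a minimal sketch. This is plain Python, not PromQL; the function and parameter names are illustrative and do not come from the configuration. Each comparison with the `bool` modifier yields 1 or 0, and a rule matches only when both comparisons yield 1, so their sum is 2.

```python
def bool_cmp(condition: bool) -> int:
    """Mimic PromQL's `bool` modifier: a comparison yields 1 or 0."""
    return 1 if condition else 0


def scaledown_fires(tps: float, target_tps: float, current: int, minimum: int) -> bool:
    """Scaledown: throughput below target AND more replicas than the minimum."""
    return bool_cmp(tps < target_tps) + bool_cmp(current > minimum) == 2


def scaleup_fires(tps: float, target_tps: float, current: int, maximum: int) -> bool:
    """Scaleup: throughput above target AND fewer replicas than the maximum."""
    return bool_cmp(tps > target_tps) + bool_cmp(current < maximum) == 2


if __name__ == "__main__":
    # Hypothetical inputs: target of 5 requests/s, limits of 2 and 4 replicas.
    print(scaledown_fires(tps=0, target_tps=5, current=3, minimum=2))
    print(scaleup_fires(tps=9, target_tps=5, current=4, maximum=4))
```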

Consider the following scenario:
min_replica = 2
max_replica = 4
There is no load on the applications in the pods (tps = 0).
Step 1. When 2 pods are running, the scaledown rule evaluates to false and no alerts are fired - expected behaviour
current_replica = 2
min_replica = 2
Step 2. Now I bring up one more replica, making current_replica 3. The scaledown rule evaluates to true
because tps < 5 and current_replica > min_replica.
The alert goes into the "Firing" state and it brings down one replica.
The alert stays in the "Firing" state for that evaluation cycle - expected behaviour
Step 3. After 1m (at the end of the previous evaluation cycle), the scaledown rule still evaluates to true even though current_replica is now 2, and one more replica is brought down. However, the alert state changes from "Firing" to "Inactive" - unexpected behaviour
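
The values I expect from the scaledown expression in the three steps above can be sketched as follows. This is a plain Python model, not PromQL; the helper name and the step table are illustrative, and the target of 5 is taken from the shorthand form of the rule.

```python
def scaledown_value(tps: float, target_tps: float, current: int, minimum: int) -> int:
    """Sum of the two 0/1 comparisons; the alert should fire only when it is 2."""
    return int(tps < target_tps) + int(current > minimum)


# (step name, current_replica) for the scenario: tps = 0, min_replica = 2.
steps = [("Step 1", 2), ("Step 2", 3), ("Step 3", 2)]

results = {
    name: scaledown_value(tps=0, target_tps=5, current=current, minimum=2)
    for name, current in steps
}

for name, value in results.items():
    state = "fires" if value == 2 else "does not fire"
    print(f"{name}: value={value}, scaledown {state}")
```

Under this model only Step 2 reaches the value 2, so a second scaledown in Step 3 is not expected.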

I don't understand why the alert rule evaluates to true in Step 3, when the current number of running replicas is 2.
To confirm, after the scaledown in Step 2, I checked the alert rule expression in the Prometheus dashboard: its value is not 2, so the expression should return false.
Below are the webhook logs, where I print the rule evaluation value for every cycle (read from "annotations" under the rules). You can see that the metric value is 2 for both cycles.
Webhook logs:
CURRENT_VALUE is ScaleDown Action Metric Value is 2
[webhook] 2020/06/04 06:47:26 finished handling scaledown
CURRENT_VALUE is ScaleDown Action Metric Value is 2
[webhook] 2020/06/04 06:48:26 finished handling scaledown