I am trying to do scaling operation with Prometheus Alert Manager. I have configured a webhook receiver in Alert Manager to execute my ScaleUp and ScaleDown action.I have the following things set in configuration:evaluation_interval: 1mscrape_interval: 1sThere is no grouping in Alert Manager.Below are the metrics used :current_replica is a metric that gives instantaneous value of number of PODS running.min_replica and max_replica are the metrics that have been set to avoid scaling replicas down and up respectively beyond that count.Alert Rule goes like below:1. Scaledownexpr: ((sum(rate(workmanager_completed_requests{weblogic_domainName="xface-domain",weblogic_clusterName="EPICluster",name="EPIClusterWorkManager",applicationName="WDTTestEAR"}[30s])) < bool sum(replica_exporter_target_avg_req_cr{cluster_name="EPICluster"})) + (sum(replica_exporter_current_replica_count{cluster_name="EPICluster"}) > bool sum(replica_exporter_min_count_cr{cluster_name="EPICluster"}))) == 2
[ (tps < bool 5 + current_replicas > bool min_replicas) == 2 ]
2. Scaleup
expr: ((sum(rate(workmanager_completed_requests{applicationName="WDTTestEAR",name="EPIClusterWorkManager",weblogic_clusterName="EPICluster",weblogic_domainName="xface-domain"}[30s])) > bool sum(replica_exporter_target_avg_req_cr{cluster_name="EPICluster"})) + (sum(replica_exporter_current_replica_count{cluster_name="EPICluster"}) < bool sum(replica_exporter_max_count_cr{cluster_name="EPICluster"}))) == 2
[ (tps > bool 5 + current_replicas bool < max_replicas) == 2 ]
Consider the following scenario:
min_replica = 2
max_replica = 4
There is no load on the applications in the POD ( tps = 0 )
Step 1. When 2 Pods running, scaledown rule is evaluated to false and no alerts are fired - expected behaviour
current_replica = 2
min_replica =2
Step 2. Now I bring up one more replica making current_replica to 3, scaledown action gets evaluated to true
because tps < 5 and current_replica > min_replica
Alert goes into "Firing" state and it brings down one replica.
Here alert stays into "Firing" state for that evaluation cycle - expected behaviour
Step 3. After 1m ( at the end of previous evaluation cycle ), scaledown action still gets evaluated to true even though current_replica is 2 now and one more replica is brought down, however alert state changes from "Firing" to "Inactive" - unexpected behaviour
I don't understand why alert rule gets evaluated to true in Step 3, when the current number of running replicas is 2.
To confirm, after scaledown in Step 2, I even checked alert rule expression in Prometheus dashboard, value is not evaluaed to 2, hence the expression should return false.Below are the webhook logs where I have printed the Rule Evalution Value of every cycle ( accessed from "annotations" under rules ), you can see that metric value is 2 for both cycles.Webhook Logs :
CURRENT_VALUE is ScaleDown Action Metric Value is 2
[webhook] 2020/06/04 06:47:26 finished handling scaledownCURRENT_VALUE is ScaleDown Action Metric Value is 2
[webhook] 2020/06/04 06:48:26 finished handling scaledown
Can someone help me here with understanding this behaviour ?