Calculating Availability SLA over multiple VMs


Debashish Ghosh

Mar 16, 2020, 4:21:29 PM
to Prometheus Users
Hi,
I am currently using Spring's Actuator/Micrometer to expose metrics that are scraped by Prometheus.
The framework generates a metric called process_uptime_seconds, which is the number of seconds my app has been running in a VM. I run the app on 2 VMs to provide high availability of 99.95%.

I am using the formula 100-(((30*24*60*60) - increase(process_uptime_seconds{job="Interop-InboundApi"}[30d]))/(30*24*60*60))*100 to calculate the SLA.

30*24*60*60 is the number of seconds in 30 days; subtracting the increase in process_uptime_seconds from it gives the number of seconds the app was down in a VM.
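
As a sanity check on the arithmetic: the formula has the form 100 - ((T - x) / T) * 100, which reduces to x / T * 100, where T is the window length and x is the uptime increase. A minimal sketch of the equivalent query, using only the metric and job label from this post:

```promql
# 30*24*60*60 = 2592000 seconds in the 30-day window.
# Per-instance uptime as a percentage of the window; algebraically equal to
# 100 - ((2592000 - increase(...)) / 2592000) * 100.
increase(process_uptime_seconds{job="Interop-InboundApi"}[30d]) / (30*24*60*60) * 100
```

At a 99.95% target, the downtime budget is 0.05% of 2,592,000 s, which is 1,296 s (21.6 minutes) per 30 days. Note that increase() extrapolates at the edges of the window, so the result is an approximation.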

The problem with this approach is that we periodically have to restart the service to apply patches, and we restart the VMs one at a time so that there is no downtime.

But since the formula above produces one time series per VM instance, the computed SLA drops whenever the servers are restarted one after another.

Is there a way to take this into account and calculate the SLA based only on the time when both servers were down together?

Thanks
Debashish
  

Christian Hoffmann

Mar 16, 2020, 4:44:01 PM
to Debashish Ghosh, Prometheus Users
Hi,

On 3/16/20 9:21 PM, Debashish Ghosh wrote:
>   I am currently using spring's actuator/micrometer to spit out metrics
> that are scraped by prometheus.
> The framework generates a metric called *process_uptime_seconds* which
> is the number of seconds my app is running in a VM . I have *2 VMs*
> where my app is running to provide high availability of 99.95 %.
>
> I am using the formula
> 100-(((30*24*60*60) - increase(process_uptime_seconds{job="Interop-InboundApi"}[30d]))/(30*24*60*60))*100
> to calculate the SLA.
>
> 30*24*60*60 represents the number of seconds in 30 days and the
> difference with the process_uptime_seconds will give the number of
> seconds the app was down in a VM .
>
> But the problem with this approach is that periodically we have to
> *restart* the service to apply patch and while doing so we do it one by
> one so that there is no downtime.
>
> But since the above formula creates one timeseries for each VM instance
> the SLA goes down since both the servers are restarted one after the
> another.
>
> Is there a way to take this into consideration to calculate sla based on
> the time *when both the servers were down together*?

Hrm, can't you just use the up metric to detect whether your application
was available?

That way, you could calculate availability of your service via
max(up{instance=~"server1|server2"}) == 1. I think that would make the
whole thing much easier, wouldn't it?
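
One way to turn that into a percentage over 30 days is a subquery, sketched below. This is only an illustration, not a tested rule: it selects the instances by the job label from the original post rather than by instance name, and the 1m resolution is an arbitrary choice.

```promql
# max(up{...}) is 1 at a given moment if at least one instance was scraped
# successfully, and 0 if all of them were down.
# The subquery samples that value every minute over the last 30 days, and
# avg_over_time turns the samples into the fraction of time the service was up.
avg_over_time(
  (max(up{job="Interop-InboundApi"}))[30d:1m]
) * 100
```

With this, a rolling restart does not lower the result as long as one instance keeps answering scrapes. It assumes Prometheus retains at least 30 days of data, and that a successful scrape is an acceptable proxy for the application being available.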

I can't come up with a solution based on your process_uptime_seconds
approach. It may be possible (maybe using a recording rule which decides
for each evaluation interval whether your servers count as available or
not...?), but it sounds like it would get complicated quickly.


Kind regards,
Christian

Roland V

Mar 16, 2020, 5:01:20 PM
to Prometheus Users
Hi Debashish,

The way we did SLA reporting on our side was:
  • export an '*_up' metric for the VMs giving a value of 1 or 0
  • create silences via Alertmanager for maintenance periods, and ensure they contain matchers that help identify the VMs (we used matchers like 'resource_group' & 'resource_name' as the machines run in Azure)
  • export silences just like machine state via: https://github.com/FXinnovation/alertmanager-silences-exporter
    the exporter will give you a value of 1 in case the silence is active, and 0 for all other states.
  • create a recording rule to check if a VM is in an 'up', 'down' or 'under maintenance' state. We then evaluate the resulting metric over the time range for which we want to calculate the SLA.
  • share results via Grafana to our clients
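
The steps above could be combined along the following lines. This is a rough sketch and not the actual rule: vm_up and vm_maintenance are hypothetical metric names (1 or 0 each), and resource_name is assumed to be a label present on both. Time under maintenance is counted as available.

```promql
# Hypothetical metrics: vm_up (1 = up, 0 = down) and vm_maintenance
# (1 = a silence is active for this VM, 0 otherwise).
# If no maintenance series exists for a VM, "or ... * 0" falls back to 0.
# clamp_max(..., 1) yields 1 when the VM is up or under maintenance.
avg_over_time(
  clamp_max(
      max by (resource_name) (vm_up)
    + (
        max by (resource_name) (vm_maintenance)
        or
        max by (resource_name) (vm_up) * 0
      ),
    1
  )[30d:1m]
) * 100
```

The result is one percentage per resource_name over the last 30 days. The inner expression could equally live in a recording rule, with avg_over_time applied to the recorded series.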
Hope this helps!

Thanks,
Roland