Hi,
On 3/16/20 9:21 PM, Debashish Ghosh wrote:
> I am currently using spring's actuator/micrometer to spit out metrics
> that are scraped by prometheus.
> The framework generates a metric called *process_uptime_seconds* which
> is the number of seconds my app is running in a VM. I have *2 VMs*
> where my app is running to provide high availability of 99.95%.
>
> I am using the formula
>
>   100 - (((30*24*60*60) - increase(process_uptime_seconds{job="Interop-InboundApi"}[30d])) / (30*24*60*60)) * 100
>
> to calculate the SLA.
>
> 30*24*60*60 represents the number of seconds in 30 days and the
> difference with the process_uptime_seconds will give the number of
> seconds the app was down in a VM.
>
> But the problem with this approach is that periodically we have to
> *restart* the service to apply a patch, and while doing so we do it one by
> one so that there is no downtime.
>
> But since the above formula creates one timeseries for each VM instance
> the SLA goes down since both the servers are restarted one after the
> another.
>
> Is there a way to take this into consideration to calculate the SLA based on
> the time *when both the servers were down together*?
Hrm, can't you just use the up metric to detect whether your application
was available?
That way, you could calculate availability of your service via
max(up{instance=~"server1|server2"}) == 1. I think that would make the
whole thing much easier, wouldn't it?
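
A minimal sketch of how that could be turned into a 30-day availability
percentage, assuming both instances are scraped under the job label from
your formula, that your Prometheus supports subqueries (2.7 or later),
and that it retains at least 30 days of data. The 1m resolution is an
assumption as well; something close to your scrape interval should do:

```
# Assumption: both VMs are targets of job "Interop-InboundApi".
# max(up) is 1 as long as at least one instance was scraped successfully.
avg_over_time(
  (max(up{job="Interop-InboundApi"}))[30d:1m]
) * 100
```

The == 1 filter is left out here on purpose: it would drop the samples
where max(up) is 0, and the average would then always come out as 1.
Keep in mind that up only tells you whether the scrape succeeded, which
may not be exactly the same as your app serving requests.
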
I fail to come up with an idea based on your process_uptime_seconds
approach. It may be possible (maybe using a recording rule which decides
for each evaluation interval whether your servers count as available or
not...?), but it sounds like it would get complicated quickly.
Kind regards,
Christian