Writing an alert rule to find interfaces with traffic above the weekly 95th percentile

Cameron Kerr

Mar 24, 2020, 8:15:53 PM
to Prometheus Users
Hi all, I'm trying to work on adding a bunch of monitoring alerts for our VPNs (very topical, huh?) using Prometheus. We have a number of VPNs that we're monitoring; monitoring via Prometheus is new (available data < 1 week) and some of the VPNs are newer (available data < 2 days). The VPNs are broadly similar in terms of SNMP, but the interface naming is a bit different.

Today's aim is to look at a certain set of interfaces and alert when the amount of traffic is unusually high. As I'm still coming up to speed with Prometheus and alerting, I want to ensure I'm following best practice and working towards some reusable patterns I can include in my team's internal training.

I can select ifOutOctets for the various interfaces of interest; that's not a problem:

ifOutOctets{job="cisco_asa_vpn",ifName=~".*(VPN|MAN).*",vpn=~"vpn.*"}

As a result I get 12 series returned, although taking the rate and summing by vpn and ifName may be a little tidier:

sum(rate(ifOutOctets{job="cisco_asa_vpn",ifName=~".*(VPN-INSIDE|MAN).*",vpn=~"vpn.*"}[2m])) by(vpn,ifName)

The result of that, as an instant query, looks like this:

{ifName="INT-MAN",vpn="vpn.example.com"} 53710.8
{ifName="INT-MAN",vpn="vpn2.example.com"} 371.9938001033316
{ifName="INT-VPN-INSIDE-344",vpn="vpn.example.com"} 1334581.4166666667
{ifName="INT-VPN-INSIDE-344",vpn="vpn2.example.com"} 45.7325711238146
{ifName="DMZ-VPN-INSIDE",vpn="vpn5.example.com"} 1554450.8833333333
{ifName="DMZ-VPN-INSIDE",vpn="vpn6.example.com"} 5290491.866666668
{ifName="INT-MAN",vpn="vpn5.example.com"} 93529.56666666667
{ifName="INT-MAN",vpn="vpn6.example.com"} 107974.35000000003

By 'unusually high', I mean above the 95th percentile of the preceding 7 days (well, initially perhaps 2 days). So, trying to get the 95th percentile for the last two days:

quantile(0.95, rate(ifOutOctets{job="cisco_asa_vpn",ifName=~".*(VPN-INSIDE|MAN).*",vpn=~"vpn.*"}[2d])) by(vpn,ifName)

However, this seems quite wrong: the graph looks the same with the 5th percentile as it does with the 95th, which is clearly not useful, even though the data (when rated over a 2m period) is quite variable.

What should the query be to give me a single value for each series {vpn,ifName} that would give the 95th percentile based on the past N days?

Thanks,
Cameron

Brian Candler

Mar 25, 2020, 4:47:29 AM
to Prometheus Users
On Wednesday, 25 March 2020 00:15:53 UTC, Cameron Kerr wrote:
What should the query be to give me a single value for each series {vpn,ifName} that would give the 95th percentile based on the past N days?



quantile_over_time(0.95, ...)

where ... is a range vector, so you'll want to put a subquery in there, e.g. (expr)[7d:5m] if you're thinking about the 95th percentile based on 5-minute samples, which is typically what people want.
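
For context on why the original attempt looked flat: quantile() is an aggregation operator, so it takes the quantile across the series in each group at a single instant, not across time. With by(vpn,ifName) each group holds only one or a few series, so the 5th and 95th percentiles come out nearly identical. quantile_over_time() instead works over the samples of each series within a time range.

Combining the suggestion above with the rate query from the original post, a sketch might look like the following. It is untested against this data, and the 5m rate window, 5m subquery step and 7d range are illustrative choices rather than anything confirmed in the thread:

```promql
# 95th percentile of the per-interface rate over the past 7 days,
# evaluated at 5-minute steps; yields one value per {vpn, ifName}.
quantile_over_time(
  0.95,
  (
    sum by (vpn, ifName) (
      rate(ifOutOctets{job="cisco_asa_vpn",ifName=~".*(VPN-INSIDE|MAN).*",vpn=~"vpn.*"}[5m])
    )
  )[7d:5m]
)

# Possible alert expression: current rate compared with that percentile.
# Both sides carry the same labels (vpn, ifName), so they match one-to-one.
sum by (vpn, ifName) (
  rate(ifOutOctets{job="cisco_asa_vpn",ifName=~".*(VPN-INSIDE|MAN).*",vpn=~"vpn.*"}[5m])
)
>
quantile_over_time(
  0.95,
  (
    sum by (vpn, ifName) (
      rate(ifOutOctets{job="cisco_asa_vpn",ifName=~".*(VPN-INSIDE|MAN).*",vpn=~"vpn.*"}[5m])
    )
  )[7d:5m]
)
```

Note that with less than 7 days of history the subquery simply uses whatever samples exist, and that traffic sits above its own 95th percentile roughly 5% of the time by definition, so a `for:` duration on the alert rule is probably worth considering.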