Alert Policies randomly not working

Benjamin Neff

Aug 13, 2019, 5:04:40 PM
to Google Stackdriver Discussion Forum
Hello,

I created some alert policies in multiple projects and tried to test them. Some of them simply don't work, while the exact same policy works in other projects. Which policies work in which project seems more or less random.

The metric is clearly above the threshold, and the policy is configured to trigger when the value stays above the threshold for 1 minute (so it should trigger almost instantly). The metric did stay above the threshold for that long, but no alerts or incidents were created at all, neither for the first spike nor for the much longer second spike. The exact same policy works in another project.
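
For reference, the behaviour expected here can be modelled in a few lines of Python. This is only a simplified sketch of an "above the threshold for a duration" condition, not Stackdriver's actual evaluation logic. The function name `first_trigger_time`, the sample format (timestamp in seconds, value), and all numbers are assumptions made up for illustration.

```python
from typing import Iterable, Optional, Tuple


def first_trigger_time(
    samples: Iterable[Tuple[int, float]],
    threshold: float,
    duration_s: int,
) -> Optional[int]:
    """Return the timestamp at which the condition would first fire, or None.

    samples: (timestamp_seconds, value) pairs in ascending time order.
    The condition fires once the value has been strictly above `threshold`
    continuously for at least `duration_s` seconds.
    """
    violation_start: Optional[int] = None
    for ts, value in samples:
        if value > threshold:
            if violation_start is None:
                violation_start = ts  # a new violation streak begins here
            if ts - violation_start >= duration_s:
                return ts
        else:
            violation_start = None  # streak broken, start over
    return None


# Hypothetical data: one sample per minute, threshold of 80.
spike = [(0, 50.0), (60, 95.0), (120, 97.0), (180, 60.0)]
print(first_trigger_time(spike, threshold=80.0, duration_s=60))  # prints 120
```

Under this model, a spike that lasts at least one minute should open an incident, which matches the expectation described above.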

I also have uptime checks with policies. Within the same project, some of them work, while others do nothing and don't create an incident even when the uptime check is failing.

Is this a bug? Or am I doing something wrong?

I can't test every combination of every alert policy in every project, so I don't know exactly how many are working and how many are broken. Of those I tested, about two-thirds worked and one-third didn't trigger at all. I can't use Stackdriver for alerting if I don't know whether the alerts will fire when there is a real incident, so I need this to work 100% of the time.

Best Regards,
Benjamin Neff

Rory Petty

Aug 13, 2019, 5:13:01 PM
to Benjamin Neff, Google Stackdriver Discussion Forum
Hi Benjamin,

Can you file a Cloud Support case so that we can investigate the specific projects and time series where this is happening? If you don't have a Cloud Support contract, can you message me directly with the project ID, alert policy IDs, etc.?
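
Alert policies in the Monitoring API are identified by resource names of the form `projects/[PROJECT_ID]/alertPolicies/[ALERT_POLICY_ID]`, so the details requested above can be read straight off that name. A minimal sketch follows; the helper `split_policy_name` and the example IDs are hypothetical.

```python
import re
from typing import Tuple

_POLICY_NAME = re.compile(r"^projects/([^/]+)/alertPolicies/([^/]+)$")


def split_policy_name(name: str) -> Tuple[str, str]:
    """Split an alert policy resource name into (project_id, policy_id)."""
    match = _POLICY_NAME.match(name)
    if match is None:
        raise ValueError(f"not an alert policy resource name: {name!r}")
    return match.group(1), match.group(2)


# Hypothetical IDs, for illustration only.
print(split_policy_name("projects/my-project/alertPolicies/1234567890"))
```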

Thanks,
Rory

--
© 2016 Google Inc. 1600 Amphitheatre Parkway, Mountain View, CA 94043
 
Email preferences: You received this email because you signed up for the Google Stackdriver Discussion Google Group (google-stackdr...@googlegroups.com) to participate in discussions with other members of the GoogleStackdriver community.
---
You received this message because you are subscribed to the Google Groups "Google Stackdriver Discussion Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to google-stackdriver-d...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/google-stackdriver-discussion/f3a1b6c4-e963-4e2b-b9d5-d91106a51ce2%40googlegroups.com.

Benjamin Neff

Aug 15, 2019, 9:49:07 AM
to Google Stackdriver Discussion Forum
Hi Rory,

Thanks for your response. I sent you a direct message since I don't have a support contract, but I haven't received a reply yet.

I just wanted to mention that two of the three alert policies that I know were broken started working the next day, but one is still broken. Everything that was working before still works. Some alerts aren't easy to trigger, so I haven't been able to test everything yet.

I also managed to "fix" a broken policy by deleting and recreating the same policy (before I wrote here), but that didn't work every time; sometimes the new policy was broken again. I'm leaving the last broken policy in place for now, so that you can hopefully see what is wrong with it.
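
One way to confirm that a recreated policy really has the same definition as the original is to compare the two policy objects while ignoring the fields the server assigns. The sketch below assumes both policies are available as JSON-style dicts in the Monitoring API's `AlertPolicy` shape; the helper names, the IDs, and the CPU metric are hypothetical and not taken from the policies discussed here.

```python
import copy
from typing import Any, Dict

# Fields assigned by the server; they differ between otherwise identical policies.
SERVER_ASSIGNED = {"name", "creationRecord", "mutationRecord"}


def normalize(policy: Dict[str, Any]) -> Dict[str, Any]:
    """Drop server-assigned fields from a policy and from each of its conditions."""
    cleaned = {k: v for k, v in policy.items() if k not in SERVER_ASSIGNED}
    cleaned["conditions"] = [
        {k: v for k, v in cond.items() if k != "name"}
        for cond in policy.get("conditions", [])
    ]
    return cleaned


def same_definition(a: Dict[str, Any], b: Dict[str, Any]) -> bool:
    """True if two policies match once server-assigned fields are ignored."""
    return normalize(a) == normalize(b)


# Hypothetical policy, for illustration only.
original = {
    "name": "projects/my-project/alertPolicies/111",
    "displayName": "High CPU",
    "combiner": "OR",
    "conditions": [{
        "name": "projects/my-project/alertPolicies/111/conditions/222",
        "displayName": "CPU above 80%",
        "conditionThreshold": {
            "filter": 'metric.type="compute.googleapis.com/instance/cpu/utilization"',
            "comparison": "COMPARISON_GT",
            "thresholdValue": 0.8,
            "duration": "60s",
        },
    }],
}
recreated = copy.deepcopy(original)
recreated["name"] = "projects/my-project/alertPolicies/333"
recreated["conditions"][0]["name"] = "projects/my-project/alertPolicies/333/conditions/444"

print(same_definition(original, recreated))  # prints True
```

When comparing across projects, notification channel names are project-specific as well, so they would also need to be excluded or mapped before comparing.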

Regards,
Benjamin


Michael Safyan

Aug 15, 2019, 10:31:40 AM
to Benjamin Neff, Google Stackdriver Discussion Forum
How long after creating the policy are you attempting to simulate the scenario that would lead to the policy triggering?

Is it possible that what you are experiencing is the delay between creating/updating a policy and that policy being picked up by the evaluation system?

Michael Safyan

Senior Software Engineer · Stackdriver Monitoring

6425 Penn Ave 7th Floor; Pittsburgh, PA 15206

http://www.michaelsafyan.com | michae...@google.com



Benjamin Neff

Aug 15, 2019, 6:27:32 PM
to Google Stackdriver Discussion Forum
Hi Michael

First I tested about 10 minutes after creation. I thought it might take longer, so I waited a few hours, but nothing changed, and I think the delay between creating a policy and it becoming active shouldn't be longer than an hour. One non-working policy belonged to an uptime check, which was easy to test by pointing it at an invalid URL so that it failed constantly. About 14 hours after creation it started working without any change on my side, and I received an alert email.

As I mentioned, recreating a policy sometimes also fixes the problem. I tested that with another uptime check in a constantly failing state: immediately after the policy was recreated, it started working and I received an alert email (with the usual 1-2 minute delay between the alert triggering and the email being sent, which is OK). So it looks like there usually isn't a long delay between creation and the policy becoming active.

The one policy that still doesn't work has now existed for two days. To me that is clearly broken, because there should never be a delay that long without any indication that the policy isn't active yet.
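
The timings above can be turned into a simple check that flags a policy as suspect when its condition has been violated for longer than some grace period without an incident. This is a rough sketch: the function `is_overdue` and the timestamps are hypothetical, and the one-hour default is only the expectation stated in this post, not a documented guarantee.

```python
from datetime import datetime, timedelta
from typing import Optional


def is_overdue(
    violation_started: datetime,
    now: datetime,
    incident_opened: Optional[datetime],
    grace: timedelta = timedelta(hours=1),
) -> bool:
    """True if the violation has lasted longer than `grace` with no incident."""
    return incident_opened is None and (now - violation_started) > grace


# Hypothetical start time; the offsets mirror the delays described above.
start = datetime(2019, 8, 13, 17, 0)
print(is_overdue(start, start + timedelta(minutes=10), None))  # prints False
print(is_overdue(start, start + timedelta(days=2), None))  # prints True
print(is_overdue(start, start + timedelta(days=2), start + timedelta(hours=14)))  # prints False
```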

Regards,
Benjamin