GCP Operations Suite Monitoring and SLI for custom service metrics

Adam Raszkiewicz

Jun 2, 2021, 6:07:14 PM

to Google Stackdriver Discussion Forum

I have created log-based metrics for requests: `log_based_total_requests` for all requests and `log_based_errors` for all error responses. Then, using the GCP Monitoring API, I tried to set the SLI as follows:

```
{
  "name": null,
  "displayName": "99.9% - Good/Total Ratio - Rolling day",
  "goal": 0.999,
  "rollingPeriod": "86400s",
  "serviceLevelIndicator": {
    "requestBased": {
      "goodTotalRatio": {
        "totalServiceFilter": "metric.type=\"logging.googleapis.com/user/log_based_total_requests\" resource.type=\"gce_instance\"",
        "badServiceFilter": "metric.type=\"logging.googleapis.com/user/log_based_errors\" resource.type=\"gce_instance\""
      }
    }
  }
}
```

The goal is a 99.9% SLI across all incoming requests, so that errors are at most 0.1% of all requests.

After I created that SLI, the SLO in GCP Monitoring went crazy, showing -8752.5%, and when I tried to edit it via the UI it threw this message:

> There was an issue parsing the filter string used in the SLO. Only
> filters that join labels with "AND" are supported. Filtering by
> groupId is not supported. The UI does not yet support any
> request-based SLI based on a ratio between two time series that have
> different metric types. The data in the form was reset.

So my question is: is my SLI definition correct, and if it is, why am I getting that error and why does my SLO show weird data?
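For anyone reproducing the SLO definition in the question, here is a minimal sketch that builds the same request body in Python, so that the braces and the escaped quotes inside the filters are generated rather than typed by hand. The helper name `build_slo_body` and its parameters are hypothetical; the field names are the ones used in the question. The `name` field is left out on the assumption that the server assigns it on creation. Sending the body (for example to the `serviceLevelObjectives` collection of a service in the Monitoring API v3) is not shown.

```python
import json


def build_slo_body(total_metric, bad_metric, goal=0.999,
                   rolling_period="86400s", resource_type="gce_instance"):
    """Build a request-based good/total-ratio SLO body (hypothetical helper).

    total_metric and bad_metric are the names of user-defined log-based
    metrics, e.g. "log_based_total_requests" and "log_based_errors".
    """
    def metric_filter(metric_name):
        # Same filter shape as in the question: metric type plus resource type.
        return (
            f'metric.type="logging.googleapis.com/user/{metric_name}" '
            f'resource.type="{resource_type}"'
        )

    return {
        "displayName": "99.9% - Good/Total Ratio - Rolling day",
        "goal": goal,
        "rollingPeriod": rolling_period,
        "serviceLevelIndicator": {
            "requestBased": {
                "goodTotalRatio": {
                    "totalServiceFilter": metric_filter(total_metric),
                    "badServiceFilter": metric_filter(bad_metric),
                }
            }
        },
    }


if __name__ == "__main__":
    body = build_slo_body("log_based_total_requests", "log_based_errors")
    # json.dumps takes care of escaping the quotes inside the filter strings.
    print(json.dumps(body, indent=2))
```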

Ruxanda Danetiu

Jun 2, 2021, 6:58:32 PM

to Adam Raszkiewicz, Elizabeth Byerly, Nathan Johnson, Patrick Eaton, Google Stackdriver Discussion Forum

+Elizabeth Byerly +Nathan Johnson +Patrick Eaton

Adam Raszkiewicz

Jun 8, 2021, 8:17:03 AM

to Nathan Johnson, Ruxanda Danetiu, Elizabeth Byerly, Patrick Eaton, Google Stackdriver Discussion Forum

Hi Nathan,

First, I want to say that this is my first time with GCP Operations Suite, so maybe I have set something up incorrectly. I was trying to follow the directions in this doc: https://cloud.google.com/stackdriver/docs/solutions/slo-monitoring/sli-metrics/logs-based-metrics#lbm-availability

What I have working so far are the log-based metrics for all requests and for all error responses. I then tried to set a 99.9% availability SLI, still following the doc mentioned above, using the config file from my previous email, submitted via the API.

Here is what I have on a test project:

SLI in the GCP UI shows correct state

But then for the Error budget tab:

I'm not sure why it looks that way, and if I have so many errors against the "budget", why did it not trigger an alert?

If it would be easier to go over this on a phone call, I can schedule something next week if possible.

Thanks for any help,

A

From: Nathan Johnson <na...@google.com>
Date: Friday, June 4, 2021 at 12:16 PM
To: Ruxanda Danetiu <rux...@google.com>
Cc: Adam Raszkiewicz <araszk...@medallies.com>, Elizabeth Byerly <eby...@google.com>, Patrick Eaton <pre...@google.com>, Google Stackdriver Discussion Forum <google-stackdr...@googlegroups.com>
Subject: Re: [google-stackdriver-discussion] GCP Operations Suite Monitoring and SLI for custom service metrics

Hi Adam,

My understanding is that while the API supports the bad/total log metrics use case, the UI doesn't yet, so the only way to edit this SLO is through the API.

The real question here is why it showed up as -8752.5% budget. With 100% budget = 0.1% errors, a remaining budget of -8752.5% means you have used 88.525 times the allowed errors, which would imply that you're getting 8.8525% errors. I'd check the total number of errors over the past day by going to the Metrics Explorer in GCP, filling in the bad/total metrics above, and summing over an alignment period of 1d to see whether that error rate is correct. If it truly has that many errors, then you can tune either the SLO or the metric generation. If it doesn't line up, I would love to hop on a call and try to figure out what's going on.
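The arithmetic behind that estimate can be sketched in a few lines of Python. This assumes the displayed percentage is the remaining error budget, computed as 1 minus the observed error rate divided by the allowed error rate (1 - goal); the function names are illustrative and not part of any GCP library. The second function is the check to run against the daily sums read from Metrics Explorer.

```python
def implied_error_rate(budget_remaining: float, goal: float) -> float:
    """Error rate implied by a remaining-budget fraction (-87.525 means -8752.5%)."""
    allowed = 1.0 - goal  # e.g. 0.001 for a 99.9% goal
    # remaining = 1 - observed / allowed  =>  observed = (1 - remaining) * allowed
    return (1.0 - budget_remaining) * allowed


def error_budget_remaining(bad: float, total: float, goal: float) -> float:
    """Remaining budget fraction from summed bad and total request counts."""
    if total <= 0:
        raise ValueError("total must be positive")
    allowed = 1.0 - goal
    observed = bad / total
    return 1.0 - observed / allowed


if __name__ == "__main__":
    # Budget shown in the UI, expressed as a fraction rather than a percentage.
    rate = implied_error_rate(-87.525, goal=0.999)
    print(f"implied error rate: {rate:.4%}")

    # Placeholder counts; substitute the 1d sums of the two log-based metrics.
    remaining = error_budget_remaining(bad=5, total=10_000, goal=0.999)
    print(f"remaining budget: {remaining:.1%}")
```

If the remaining budget computed from the Metrics Explorer sums is nowhere near the figure in the SLO view, the two metrics are probably not counting what the SLI expects.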