Seeking Advice on Good Metrics for Service Level Objectives (SLOs) in Rundeck


Martin Mohan

Aug 29, 2024, 10:16:18 AM
to rundeck-discuss
Hi,

I'm currently working on defining Service Level Objectives (SLOs) for our use of Rundeck and would appreciate some insights or tips.

Background on SLOs: A Service Level Objective (SLO) represents a target level of reliability or performance for a particular service. In the context of Rundeck, this could relate to job execution success rates, response times, or other performance indicators.

Question on Current Metric Use: While considering potential metrics, I've been looking at rundeck_project_execution_status{status="succeeded"}, which measures the count of jobs that have successfully executed. At first glance, this seems like a solid metric for indicating the reliability of Rundeck jobs. However, I've noticed that this metric can be misleading because it doesn't account for user-induced errors or failures. For example, a job may succeed from a system perspective but fail to achieve its intended outcome due to incorrect inputs or misconfigurations by the user.
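One way to make the concern above concrete is to compute the success ratio twice: once over all executions and once with user-attributed failures left out. The sketch below assumes the per-status counts have already been read from the metrics endpoint into a plain mapping. Only the `succeeded` status comes from the metric quoted above; the `failed` and `user_error` labels and all of the numbers are hypothetical, and classifying a failure as user-induced would need a convention of your own.

```python
from typing import Iterable, Mapping


def success_ratio(counts: Mapping[str, int], excluded: Iterable[str] = ()) -> float:
    """Fraction of executions that succeeded, ignoring statuses in `excluded`."""
    skip = set(excluded)
    considered = {status: n for status, n in counts.items() if status not in skip}
    total = sum(considered.values())
    if total == 0:
        # No eligible executions in the window: treat the objective as met.
        return 1.0
    return considered.get("succeeded", 0) / total


# Hypothetical counts for one reporting window.
# "user_error" is an assumed label, not something the quoted metric provides.
counts = {"succeeded": 950, "failed": 40, "user_error": 10}

raw = success_ratio(counts)                               # all executions
adjusted = success_ratio(counts, excluded=["user_error"])  # user errors left out

print(f"raw SLI: {raw:.4f}, adjusted SLI: {adjusted:.4f}")
```

This only covers failures that can be labelled as user-induced. A job that reports success but does not achieve its intended outcome would still count as a success here.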

Request for Input: Given this observation, I'm interested in learning what others consider good metrics for SLOs in the context of Rundeck. Specifically:

  • Are there metrics that better capture the true success or failure of jobs from both a system and user perspective?
  • How do you account for user-induced errors in your SLO

Chris Gadd

Aug 29, 2024, 6:20:51 PM
to rundeck...@googlegroups.com

Our org doesn’t have any formal SLOs for Rundeck aside from a common platform uptime and fault restoration time. But as far as our users go, their concerns are:

  • Job success/failure. As you say, failures are often user-induced and difficult to account for. It is not just input or misconfiguration: we have a lot of scheduled jobs that monitor other platforms (i.e. health checks), so a job failure (and the subsequent notification) is used to indicate a problem with the remote platform.
  • Job runtime. Again, this can be user-induced, but it can also be a function of system load or scheduling conflicts.
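The two concerns above could be turned into indicators roughly as follows. This is a minimal sketch over hand-written execution records, not a Rundeck API: the `Execution` type, the `is_health_check` flag, the job names, and the 60-second threshold are all assumptions made for illustration. The idea is to leave health-check jobs out of the success ratio, since their failures signal a problem with the remote platform rather than with Rundeck, and to measure runtime as the fraction of executions that finish within a threshold.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Execution:
    job: str
    succeeded: bool
    duration_s: float
    is_health_check: bool = False  # hypothetical tag maintained by the job owner


def success_sli(executions: Sequence[Execution]) -> float:
    """Success ratio over executions that are not health checks."""
    eligible = [e for e in executions if not e.is_health_check]
    if not eligible:
        return 1.0
    return sum(e.succeeded for e in eligible) / len(eligible)


def runtime_sli(executions: Sequence[Execution], threshold_s: float) -> float:
    """Fraction of executions that finished within `threshold_s` seconds."""
    if not executions:
        return 1.0
    return sum(e.duration_s <= threshold_s for e in executions) / len(executions)


# Hypothetical sample data.
executions = [
    Execution("deploy-app", True, 42.0),
    Execution("deploy-app", False, 310.0),
    Execution("rotate-logs", True, 12.5),
    Execution("check-remote-api", False, 3.0, is_health_check=True),
]

print(f"success SLI: {success_sli(executions):.3f}")
print(f"runtime SLI (<= 60 s): {runtime_sli(executions, 60):.3f}")
```

A runtime threshold set per job is likely more useful than one global value, because a slow run may come from the job's own inputs as well as from system load or scheduling conflicts.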

 



Paulo Motta

Aug 30, 2024, 4:25:28 PM
to rundeck-discuss
Not sure what I'm doing qualifies as SLOs, but I've recently started tracking:

1. Jobs with the most failures
2. Steps that fail the most
3. Jobs generating the most ServiceNow incident tickets

The target for all of them is zero, but we will probably set an SLA of less than 5%.
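The first two rankings above can be sketched with a simple counter over execution records. The record layout (`job`, `failed_step`, `succeeded`), the job and step names, and the sample data are hypothetical, and reading the 5% figure as a ceiling on the overall failure rate is an interpretation rather than something stated above.

```python
from collections import Counter


def top_failures(records, key, n=3):
    """Return the n most frequent values of `key` among failed executions."""
    failed = Counter(r[key] for r in records if not r["succeeded"])
    return failed.most_common(n)


def failure_rate(records):
    """Fraction of executions that failed (0.0 when there are no records)."""
    if not records:
        return 0.0
    return sum(not r["succeeded"] for r in records) / len(records)


# Hypothetical sample: 37 successful executions and 3 failures.
records = [
    {"job": "deploy-app", "failed_step": "restart-service", "succeeded": False},
    {"job": "backup-db", "failed_step": "upload", "succeeded": False},
    {"job": "deploy-app", "failed_step": "restart-service", "succeeded": False},
    {"job": "deploy-app", "failed_step": None, "succeeded": True},
] + [{"job": "rotate-logs", "failed_step": None, "succeeded": True}] * 36

worst_jobs = top_failures(records, "job")
worst_steps = top_failures(records, "failed_step")
rate = failure_rate(records)
within_target = rate < 0.05  # assumed reading of the 5% figure

print(worst_jobs, worst_steps, f"{rate:.3f}", within_target)
```

The same counter would work for incident tickets if each record carried a ticket reference instead of a failed step.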