Preventing automation disasters

Paulo Motta

Oct 6, 2023, 9:36:13 AM
to rundeck-discuss
Rundeck makes it very easy to execute jobs on many servers at once. This also means that someone can accidentally execute a job on a large number of servers and cause an operational disaster.

I'm thinking about the following best practices to minimize the risk:

  1. Every job should include a step that queries ServiceNow for an approved change request and runs the job only on the servers listed in the CR.
  2. Node selection in the job definition should require nodes to be selected explicitly.
Does anyone have other ideas?
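
A rough Python sketch of what the check in practice 1 could look like. It assumes the change request has already been fetched from ServiceNow; the `ChangeRequest` class and its `approval` and `servers` fields are hypothetical simplifications, not the real ServiceNow schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeRequest:
    """Hypothetical, simplified view of a ServiceNow change request."""
    number: str
    approval: str        # assumed field: "approved" once the CR is signed off
    servers: frozenset   # assumed field: hostnames listed in the CR


def nodes_allowed_by_cr(cr, requested_nodes):
    """Return the requested nodes only if the CR is approved and lists all of them."""
    if cr.approval != "approved":
        raise PermissionError(f"{cr.number} is not approved")
    # Refuse the whole run if any requested node is missing from the CR.
    outside = [n for n in requested_nodes if n not in cr.servers]
    if outside:
        raise PermissionError(f"{cr.number} does not cover: {', '.join(outside)}")
    return list(requested_nodes)
```

This version rejects the whole run when a node is outside the CR instead of silently dropping that node, so a mismatch between the job and the CR is visible to whoever launched it.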

Thanks

rac...@rundeck.com

Oct 6, 2023, 9:56:01 AM
to rundeck-discuss
Another way could be to add a mandatory option to your jobs that asks for confirmation, so the job runs only if the user explicitly confirms.

Regards.

Paulo Motta

Oct 7, 2023, 10:06:36 AM
to rundeck-discuss
That's a good idea too. 
I'm also thinking the best approach would be to lock down the production environment so that users cannot run jobs there directly. The only way to execute a job would be through a Change Request in ServiceNow which, once approved, triggers the Rundeck job via webhook or API at the scheduled change time. Has anyone implemented that?
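
For the API route, a minimal Python sketch of the call an integration might make once the CR is approved. It only builds the request to Rundeck's run-job endpoint and does not send it. The base URL, job ID, token, option names, and the default API version are placeholders or assumptions; use whatever your server supports.

```python
import json
import urllib.request


def build_run_request(base_url, job_id, token, node_filter, options, api_version=41):
    """Build (but do not send) a POST to Rundeck's run-job endpoint."""
    url = f"{base_url.rstrip('/')}/api/{api_version}/job/{job_id}/run"
    # "filter" restricts the target nodes; "options" carries the job option values.
    body = json.dumps({"filter": node_filter, "options": options}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "X-Rundeck-Auth-Token": token,
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )
```

Passing the result to `urllib.request.urlopen` would start the job. Keeping the API token only on the integration side, and not handing it to users, is what would enforce the lock-down.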

Chris Gadd

Oct 8, 2023, 2:34:59 PM
to rundeck...@googlegroups.com

In our environment we use two of the mitigations you’ve both identified, plus one more:

  1. For jobs that make changes to the environment the nodes must be explicitly selected by the user (for jobs which only query for information they may be pre-selected).
  2. Add a confirmation option which defaults to performing a dry-run of the job – just show what it is about to do. The user can then re-run the job with this flag unset to actually deploy the changes.
  3. Execute jobs using the Node First strategy with a Thread Count of 1, so that only a single node is affected at a time – this means if a step fails the job stops before too much damage is done, and also gives a chance for the job runner to kill the job if something looks amiss. This also means the log file is easy to read, compared with allowing multiple threads.
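
To illustrate why point 3 limits the damage, here is a small Python model of node-first execution with a single thread. It is only a sketch of the behaviour described above, not Rundeck's implementation, and it assumes the job is set to stop on the first failed step.

```python
def run_node_first(nodes, steps):
    """Run every step on one node before moving to the next; stop at the first failure.

    `steps` is a list of callables that take a node name and return True on success.
    Returns the nodes that were touched, including the one that failed.
    """
    touched = []
    for node in nodes:          # thread count 1: one node at a time
        touched.append(node)
        for step in steps:      # node-first: finish all steps here before the next node
            if not step(node):
                return touched  # stop: later nodes are never touched
    return touched
```

With three nodes and a step that fails on the second, only the first two nodes are ever touched, and the output stays in strict node order.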

 



Paulo Motta

Jun 12, 2025, 3:24:30 PM
to rundeck-discuss
How did you implement the dry-run? Is it in the code? I don't see an option in Rundeck.

Chris Gadd

Jun 15, 2025, 5:26:26 PM
to rundeck...@googlegroups.com

It’s a custom thing we do in some jobs:

  1. Create an option called ‘Dry Run’.
  2. In the job add some logic to stop or do nothing. Could use ruleset if you’re on Enterprise, or in some jobs I just have a brief bash statement to stop at this point:
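
The bash statement itself is not shown here. As a rough illustration only, a similar guard in a Python script step might look like the sketch below. It assumes the option is named `dry_run`, which Rundeck exposes to script steps as the environment variable `RD_OPTION_DRY_RUN`; the `deploy` function and its messages are hypothetical.

```python
import os


def is_dry_run(env=os.environ):
    """Treat anything except an explicit 'false' as a dry run, so the safe mode is the default."""
    return env.get("RD_OPTION_DRY_RUN", "true").strip().lower() != "false"


def deploy(nodes, env=os.environ):
    """Return the log lines for a run; only the non-dry-run branch would change anything."""
    if is_dry_run(env):
        return [f"DRY RUN: would deploy to {node}" for node in nodes]
    # Placeholder: the real deployment work would go here.
    return [f"deploying to {node}" for node in nodes]
```

Defaulting to a dry run when the option is missing or misspelled means a mistake produces a report, not a change.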

 

 

