Jira (PUP-11163) Prevent thundering herd via deterministic run scheduling


Reid Vandewiele (Jira)

Jul 7, 2021, 1:01:02 PM
to puppe...@googlegroups.com
Reid Vandewiele created an issue
 
Puppet / Improvement PUP-11163
Prevent thundering herd via deterministic run scheduling
Issue Type: Improvement
Assignee: Unassigned
Created: 2021/07/07 10:00 AM
Priority: Normal
Reporter: Reid Vandewiele

Problem

One of the most-viewed articles in Open Source Puppet Assist between September of 2020 and March of 2021 was "determine a thundering herd condition in Puppet" (OSPA-21).

Thundering herds have been an out-of-box problem in Puppet for more than a decade, and for users "in the know", there is a clear, bulletproof solution: use a combination of splayed and deterministic run start times to evenly distribute agent checkins.

Problematically, this doesn't work with our daemonized agent. Customers must instead disable the agent, and use OS schedulers such as Cron (Posix) and Scheduled Tasks (Windows). See this module for full automation of the configuration.

Desired Solution

The Puppet agent should provide a deterministic run scheduling mode which evenly distributes agent runs automatically, and eliminates the possibility of thundering herds developing. This mode should become the new default mode.

Characteristics:

  • Scheduled run start times should be based on computed wall-clock times, not on variable-time events such as previous run end times.
  • An agent's individual run start time should include a random splay, seeded on the agent's certname setting.
  • The agent should schedule and perform a first-start run quickly (immediately or splayed if splay is configured) on daemon startup, after which subsequent run starts should adhere to the computed wall-clock schedule

On startup, an agent might schedule runs thusly.

  1. Schedule a startup run for
    now() + $splay
  2. Schedule the first regular daemon run for
    now() + ((($run_interval - (now() % $run_interval)) + fqdn_rand($certname)) % $run_interval)
  3. Every subsequent run should be scheduled exactly $run_interval after the previously scheduled run
  4. Optionally, a provision can be made to skip the first scheduled daemon run if it would occur too soon after the startup run
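
The steps above can be sketched in Python. This is only an illustration under stated assumptions, not the agent's implementation: times are integer seconds since the epoch, `splay` is taken to be a delay in seconds, and `certname_offset` is a hypothetical stand-in for `fqdn_rand($certname)` that hashes the certname into a stable offset within the run interval.

```python
import hashlib


def certname_offset(certname: str, run_interval: int) -> int:
    """Hypothetical stand-in for fqdn_rand($certname).

    Hashes the certname into a stable offset in [0, run_interval).
    """
    digest = hashlib.sha256(certname.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % run_interval


def startup_run(now: int, splay: int) -> int:
    """Step 1: the startup run happens `splay` seconds after daemon start."""
    return now + splay


def first_regular_run(now: int, run_interval: int, offset: int) -> int:
    """Step 2: the first regular run, aligned to the wall clock.

    The result always satisfies result % run_interval == offset, so the
    slot depends only on the offset, never on when the daemon started.
    """
    until_boundary = run_interval - (now % run_interval)
    return now + ((until_boundary + offset) % run_interval)


def schedule(now: int, run_interval: int, offset: int, count: int) -> list:
    """Step 3: every later run is exactly run_interval after the previous one."""
    first = first_regular_run(now, run_interval, offset)
    return [first + i * run_interval for i in range(count)]
```

Because the first regular run lands on a slot determined only by the offset, two agents started at different moments keep the same relative spacing for as long as their daemons run.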

 

 

We advise our largest customers to achieve this scheduling algorithm today in order to better distribute load, and to prevent thundering herds. We should eliminate the onerous workaround by incorporating it into our product, and in so doing deliver the same benefits to all Puppet users, as well as eliminate one of our most common troubleshooting exasperations.

This message was sent by Atlassian Jira (v8.13.2#813002-sha1:c495a97)

Reid Vandewiele (Jira)

Jul 7, 2021, 1:07:02 PM
to puppe...@googlegroups.com
Reid Vandewiele updated an issue
Change By: Reid Vandewiele
h2. Problem

One of the most-viewed articles in Open Source Puppet Assist between September of 2020 and March of 2021 was [determine a thundering herd condition in Puppet|https://ospassist.puppet.com/hc/en-us/articles/360040327753-determine-a-thundering-herd-condition-in-puppet-] (OSPA-21).


Thundering herds have been an out-of-box problem in Puppet for more than a decade, and for users "in the know", there is a clear, bulletproof solution: use a combination of splayed and deterministic run start times to evenly distribute agent checkins.

Problematically, this doesn't work with our daemonized agent. Customers must instead disable the agent, and use OS schedulers such as Cron (Posix) and Scheduled Tasks (Windows). See [this module|https://forge.puppet.com/modules/reidmv/puppet_run_scheduler] for full automation of the configuration.
h2. Desired Solution


The Puppet agent should provide a deterministic run scheduling mode which evenly distributes agent runs automatically, and eliminates the possibility of thundering herds developing. This mode should become the new default mode.

Characteristics:
* Scheduled run start times should be based on computed wall clock times. Not on variable-time events such as previous run end times.
* An agent's individual run start time should include a random splay, seeded on the agent's certname setting
* The agent should schedule and perform a first-start run quickly (immediately or splayed if splay is configured) on daemon startup, after which subsequent run starts should adhere to the computed wall-clock schedule


On startup, an agent might schedule runs thusly.
# Schedule a startup run for
{{now() + $splay}}
# Schedule the first regular daemon run for

{{now() + ((($run_interval - (now() % $run_interval)) + fqdn_rand($certname)) % $run_interval)}}
# Every subsequent run should be scheduled exactly {{$run_interval}} after the previously scheduled run
# Optionally, a provision can be made to skip the first scheduled daemon run if it would occur too soon after the startup run


 

 

We advise our largest customers to achieve this scheduling algorithm today in order to better distribute load, and to prevent thundering herds. We should eliminate the onerous workaround by incorporating it into our product, and in so doing deliver the same benefits to all Puppet users, as well as eliminate one of our most common troubleshooting exasperations.

Reid Vandewiele (Jira)

Jul 7, 2021, 1:10:01 PM
to puppe...@googlegroups.com
Reid Vandewiele updated an issue
h2. Problem

One of the most-viewed articles in Open Source Puppet Assist between September of 2020 and March of 2021 was [determine a thundering herd condition in Puppet|https://ospassist.puppet.com/hc/en-us/articles/360040327753-determine-a-thundering-herd-condition-in-puppet-] (OSPA-21).

Thundering herds have been an out-of-box problem in Puppet for more than a decade, and for users "in the know", there is a clear, bulletproof solution: use splayed, deterministic run start times to evenly distribute agent checkins.


Problematically, this doesn't work with our daemonized agent. Customers must instead disable the agent, and use OS schedulers such as Cron (Posix) and Scheduled Tasks (Windows). See [this module|https://forge.puppet.com/modules/reidmv/puppet_run_scheduler] for full automation of the configuration.
h2. Desired Solution

The Puppet agent should provide a deterministic run scheduling mode which evenly distributes agent runs automatically, and eliminates the possibility of thundering herds developing. This mode should become the new default mode.

Characteristics:
* Scheduled run start times should be based on computed wall clock times. Not on variable-time events such as previous run end times.
* An agent's individual run start time should include a random splay, seeded on the agent's certname setting
* The agent should schedule and perform a first-start run quickly (immediately or splayed if splay is configured) on daemon startup, after which subsequent run starts should adhere to the computed wall-clock schedule

On startup, an agent might schedule runs thusly.
# Schedule a startup run for
{{now() + $splay}}
# Schedule the first regular daemon run for
{{now() + ((($run_interval - (now() % $run_interval)) + fqdn_rand($certname)) % $run_interval)}}
# Every subsequent run should be scheduled exactly {{$run_interval}} after the previously scheduled run
# Optionally, a provision can be made to skip the first scheduled daemon run if it would occur too soon after the startup run

 

 

We advise our largest customers to achieve this scheduling algorithm today in order to better distribute load, and to prevent thundering herds. We should eliminate the onerous workaround by incorporating it into our product, and in so doing deliver the same benefits to all Puppet users, as well as eliminate one of our most common troubleshooting exasperations.

Reid Vandewiele (Jira)

Jul 7, 2021, 1:14:01 PM
to puppe...@googlegroups.com
Reid Vandewiele updated an issue
h2. Problem

One of the most-viewed articles in Open Source Puppet Assist between September of 2020 and March of 2021 was [determine a thundering herd condition in Puppet|https://ospassist.puppet.com/hc/en-us/articles/360040327753-determine-a-thundering-herd-condition-in-puppet-] (OSPA-21).

Thundering herds have been an out-of-box problem in Puppet for more than a decade, and for users "in the know", there is a clear, bulletproof solution: use splayed, deterministic run start times to evenly distribute agent checkins.


Problematically, this doesn't work with our daemonized agent. Customers must instead disable the agent, and use OS schedulers such as Cron (Posix) and Scheduled Tasks (Windows). See [this module|https://forge.puppet.com/modules/reidmv/puppet_run_scheduler] for full automation of the configuration.
h2. Desired Solution

The Puppet agent should provide a deterministic run scheduling mode which evenly distributes agent runs automatically, and eliminates the possibility of thundering herds developing. This mode should become the new default mode.

Characteristics:
* Scheduled run start times should be based on computed wall clock times. Not on variable-time events such as previous run end times.
* An agent's individual run start time should include a random splay, seeded on the agent's certname setting
* The agent should schedule and perform a first-start run quickly (immediately or splayed if splay is configured) on daemon startup, after which subsequent run starts should adhere to the computed wall-clock schedule

On startup, an agent might schedule runs thusly.
# Schedule a startup run for
{{now() + $splay}}
# Schedule the first regular daemon run for
{{now() + ((($run_interval - (now() % $run_interval)) + fqdn_rand($certname)) % $run_interval)}}
# Every subsequent run should be scheduled exactly {{$run_interval}} after the previously scheduled run

# Optionally, a provision can be made to skip the first scheduled daemon run if it would occur too soon after the startup run

 

 

We advise our largest customers to achieve this scheduling algorithm today in order to better distribute load, and to prevent thundering herds. We should eliminate the onerous workaround by incorporating it into our product, and in so doing deliver the same benefits to all Puppet users, as well as eliminate one of our most common troubleshooting exasperations.

Reid Vandewiele (Jira)

Jul 7, 2021, 2:23:03 PM
to puppe...@googlegroups.com
Reid Vandewiele updated an issue
h2. Problem

One of the most-viewed articles in Open Source Puppet Assist between September of 2020 and March of 2021 was [determine a thundering herd condition in Puppet|https://ospassist.puppet.com/hc/en-us/articles/360040327753-determine-a-thundering-herd-condition-in-puppet-] (OSPA-21).

Thundering herds have been an out-of-box problem in Puppet for more than a decade, and for users "in the know", there is a clear, bulletproof solution: use splayed, deterministic run start times to evenly distribute agent checkins.

Problematically, this doesn't work with our daemonized agent. Customers must instead disable the agent, and use OS schedulers such as Cron (Posix) and Scheduled Tasks (Windows). See [this module|https://forge.puppet.com/modules/reidmv/puppet_run_scheduler] for full automation of the configuration.
h2. Desired Solution

The Puppet agent should provide a deterministic run scheduling mode which evenly distributes agent runs automatically, and eliminates the possibility of thundering herds developing. This mode should become the new default mode.

Characteristics:
* Scheduled run start times should be based on computed wall clock times. Not on variable-time events such as previous run end times.
* An agent's individual run start time should include a random offset element, seeded on the agent's certname setting
* The agent should schedule and perform a first-start run quickly (immediately, or splayed if splay is configured) on daemon startup, after which subsequent run starts should adhere to the computed wall-clock schedule


On startup, an agent might schedule runs thusly.
# Schedule a startup run for
{{now() + $splay}}
# Schedule the first regular daemon run for
{{now() + ((($run_interval - (now() % $run_interval)) + fqdn_rand($certname)) % $run_interval)}}
# Every subsequent run should be scheduled exactly {{$run_interval}} after the previously scheduled run

# Optionally, a provision can be made to skip the first scheduled daemon run if it would occur too soon after the startup run

 

 

We advise our largest customers to achieve this scheduling algorithm today in order to better distribute load, and to prevent thundering herds. We should eliminate the onerous workaround by incorporating it into our product, and in so doing deliver the same benefits to all Puppet users, as well as eliminate one of our most common troubleshooting exasperations.

Reid Vandewiele (Jira)

Jul 7, 2021, 2:24:02 PM
to puppe...@googlegroups.com
Reid Vandewiele updated an issue
h2. Problem

One of the most-viewed articles in Open Source Puppet Assist between September of 2020 and March of 2021 was [determine a thundering herd condition in Puppet|https://ospassist.puppet.com/hc/en-us/articles/360040327753-determine-a-thundering-herd-condition-in-puppet-] (OSPA-21).

Thundering herds have been an out-of-box problem in Puppet for more than a decade, and for users "in the know", there is a clear, bulletproof solution: use splayed, deterministic run start times to evenly distribute agent checkins.

Problematically, this doesn't work with our daemonized agent. Customers must instead disable the agent, and use OS schedulers such as Cron (Posix) and Scheduled Tasks (Windows). See [this module|https://forge.puppet.com/modules/reidmv/puppet_run_scheduler] for full automation of the configuration.
h2. Desired Solution

The Puppet agent should provide a deterministic run scheduling mode which evenly distributes agent runs automatically, and eliminates the possibility of thundering herds developing. This mode should become the new default mode.

Characteristics:
* Scheduled run start times should be based on computed wall clock times. Not on variable-time events such as previous run end times.
* An agent's individual run start time should include a random offset element, seeded on the agent's certname setting
* The agent should schedule and perform a first-start run quickly (immediately, or splayed if splay is configured) on daemon startup, after which subsequent run starts should adhere to the computed wall-clock schedule


On startup, an agent might schedule runs thusly.
# Schedule a startup run for
{{now() + $splay}}
# Schedule the first regular daemon run for
{{now() + ((($run_interval - (now() % $run_interval)) + fqdn_rand($certname)) % $run_interval)}}
# Every subsequent run should be scheduled exactly {{$run_interval}} after the previously scheduled run
* Optionally, a provision can be made to skip the first scheduled daemon run if it would occur too soon after the startup run
* Optionally, a provision can be made to skip a scheduled run if the previous scheduled run completed within a certain proximity

 

 

We advise our largest customers to achieve this scheduling algorithm today in order to better distribute load, and to prevent thundering herds. We should eliminate the onerous workaround by incorporating it into our product, and in so doing deliver the same benefits to all Puppet users, as well as eliminate one of our most common troubleshooting exasperations.

Reid Vandewiele (Jira)

Jul 7, 2021, 2:25:01 PM
to puppe...@googlegroups.com
Reid Vandewiele updated an issue
h2. Problem

One of the most-viewed articles in Open Source Puppet Assist between September of 2020 and March of 2021 was [determine a thundering herd condition in Puppet|https://ospassist.puppet.com/hc/en-us/articles/360040327753-determine-a-thundering-herd-condition-in-puppet-] (OSPA-21).

Thundering herds have been an out-of-box problem in Puppet for more than a decade, and for users "in the know", there is a clear, bulletproof solution: use splayed, deterministic run start times to evenly distribute agent checkins.

Problematically, this doesn't work with our daemonized agent. Customers must instead disable the agent, and use OS schedulers such as Cron (Posix) and Scheduled Tasks (Windows). See [this module|https://forge.puppet.com/modules/reidmv/puppet_run_scheduler] for full automation of the configuration.
h2. Desired Solution

The Puppet agent should provide a deterministic run scheduling mode which evenly distributes agent runs automatically, and eliminates the possibility of thundering herds developing. This mode should become the new default mode.

Characteristics:
* Scheduled run start times should be based on computed wall clock times. Not on variable-time events such as previous run end times.
* An agent's individual run start time should include a random offset element, seeded on the agent's certname setting
* The agent should schedule and perform a first-start run quickly (immediately, or splayed if splay is configured) on daemon startup, after which subsequent run starts should adhere to the computed wall-clock schedule

On startup, an agent might schedule runs thusly.
# Schedule a startup run for
{{now() + $splay}}
# Schedule the first regular daemon run for
{{now() + ((($run_interval - (now() % $run_interval)) + fqdn_rand($certname)) % $run_interval)}}
# Every subsequent run should be scheduled exactly {{$run_interval}} after the previously scheduled run
* Optionally, a provision can be made to skip the first scheduled daemon run if it would occur too soon after the startup run

* Optionally, a provision can be made to skip a scheduled run if the previous scheduled run completed too soon within a configured proximity


 

 

We advise our largest customers to achieve this scheduling algorithm today in order to better distribute load, and to prevent thundering herds. We should eliminate the onerous workaround by incorporating it into our product, and in so doing deliver the same benefits to all Puppet users, as well as eliminate one of our most common troubleshooting exasperations.

Reid Vandewiele (Jira)

Jul 7, 2021, 2:26:02 PM
to puppe...@googlegroups.com
Reid Vandewiele updated an issue
h2. Problem

One of the most-viewed articles in Open Source Puppet Assist between September of 2020 and March of 2021 was [determine a thundering herd condition in Puppet|https://ospassist.puppet.com/hc/en-us/articles/360040327753-determine-a-thundering-herd-condition-in-puppet-] (OSPA-21).

Thundering herds have been an out-of-box problem in Puppet for more than a decade, and for users "in the know", there is a clear, bulletproof solution: use splayed, deterministic run start times to evenly distribute agent checkins.

Problematically, this doesn't work with our daemonized agent. Customers must instead disable the agent, and use OS schedulers such as Cron (Posix) and Scheduled Tasks (Windows). See [this module|https://forge.puppet.com/modules/reidmv/puppet_run_scheduler] for full automation of the configuration.
h2. Desired Solution

The Puppet agent should provide a deterministic run scheduling mode which evenly distributes agent runs automatically, and eliminates the possibility of thundering herds developing. This mode should become the new default mode.

Characteristics:
* Scheduled run start times should be based on computed wall clock times. Not on variable-time events such as previous run end times.
* An agent's individual run start time should include a random offset element, seeded on the agent's certname setting
* The agent should schedule and perform a first-start run quickly (immediately, or splayed if splay is configured) on daemon startup, after which subsequent run starts should adhere to the computed wall-clock schedule

On startup, an agent might schedule runs thusly.
# Schedule a startup run for
{{now() + $splay}}
# Schedule the first regular daemon run for
{{now() + ((($run_interval - (now() % $run_interval)) + fqdn_rand($certname)) % $run_interval)}}
# Every subsequent run should be scheduled exactly {{$run_interval}} after the previously scheduled run

* Optionally, a provision can be made to skip a scheduled run if the previous scheduled run completed too soon within a configured proximity. This may happen after the special startup run, for example, or if a scheduled run entered into a backoff/retry loop based on server 503 responses

 

 

We advise our largest customers to achieve this scheduling algorithm today in order to better distribute load, and to prevent thundering herds. We should eliminate the onerous workaround by incorporating it into our product, and in so doing deliver the same benefits to all Puppet users, as well as eliminate one of our most common troubleshooting exasperations.

Reid Vandewiele (Jira)

Jul 7, 2021, 2:27:01 PM
to puppe...@googlegroups.com
Reid Vandewiele updated an issue
h2. Problem

One of the most-viewed articles in Open Source Puppet Assist between September of 2020 and March of 2021 was [determine a thundering herd condition in Puppet|https://ospassist.puppet.com/hc/en-us/articles/360040327753-determine-a-thundering-herd-condition-in-puppet-] (OSPA-21).

Thundering herds have been an out-of-box problem in Puppet for more than a decade, and for users "in the know", there is a clear, bulletproof solution: use splayed, deterministic run start times to evenly distribute agent checkins.

Problematically, this doesn't work with our daemonized agent. Customers must instead disable the agent, and use OS schedulers such as Cron (Posix) and Scheduled Tasks (Windows). See [this module|https://forge.puppet.com/modules/reidmv/puppet_run_scheduler] for full automation of the configuration.
h2. Desired Solution

The Puppet agent should provide a deterministic run scheduling mode which evenly distributes agent runs automatically, and eliminates the possibility of thundering herds developing. This mode should become the new default mode.

Characteristics:
* Scheduled run start times should be based on computed wall clock times. Not on variable-time events such as previous run end times.
* An agent's individual run start time should include a random offset element, seeded on the agent's certname setting
* The agent should schedule and perform a first-start run quickly (immediately, or splayed if splay is configured) on daemon startup, after which subsequent run starts should adhere to the computed wall-clock schedule

On startup, an agent might schedule runs thusly.
# Schedule a startup run for
{{now() + $splay}}
# Schedule the first regular daemon run for
{{now() + ((($run_interval - (now() % $run_interval)) + fqdn_rand($certname)) % $run_interval)}}
# Every subsequent run should be scheduled exactly {{$run_interval}} after the previously scheduled run

* Optionally, a provision can be made to skip a scheduled run if the previous scheduled run completed too soon within a configured proximity. This may happen after the special startup run, for example, or if a scheduled run entered into a backoff/retry loop based on server 503 responses and so did not start/complete for much longer than expected

 

 

We advise our largest customers to achieve this scheduling algorithm today in order to better distribute load, and to prevent thundering herds. We should eliminate the onerous workaround by incorporating it into our product, and in so doing deliver the same benefits to all Puppet users, as well as eliminate one of our most common troubleshooting exasperations.

Reid Vandewiele (Jira)

Jul 7, 2021, 2:27:02 PM
to puppe...@googlegroups.com
Reid Vandewiele updated an issue
h2. Problem

One of the most-viewed articles in Open Source Puppet Assist between September of 2020 and March of 2021 was [determine a thundering herd condition in Puppet|https://ospassist.puppet.com/hc/en-us/articles/360040327753-determine-a-thundering-herd-condition-in-puppet-] (OSPA-21).

Thundering herds have been an out-of-box problem in Puppet for more than a decade, and for users "in the know", there is a clear, bulletproof solution: use splayed, deterministic run start times to evenly distribute agent checkins.

Problematically, this doesn't work with our daemonized agent. Customers must instead disable the agent, and use OS schedulers such as Cron (Posix) and Scheduled Tasks (Windows). See [this module|https://forge.puppet.com/modules/reidmv/puppet_run_scheduler] for full automation of the configuration.
h2. Desired Solution

The Puppet agent should provide a deterministic run scheduling mode which evenly distributes agent runs automatically, and eliminates the possibility of thundering herds developing. This mode should become the new default mode.

Characteristics:
* Scheduled run start times should be based on computed wall clock times. Not on variable-time events such as previous run end times.
* An agent's individual run start time should include a random offset element, seeded on the agent's certname setting
* The agent should schedule and perform a first-start run quickly (immediately, or splayed if splay is configured) on daemon startup, after which subsequent run starts should adhere to the computed wall-clock schedule

On startup, an agent might schedule runs thusly.
# Schedule a startup run for
{{now() + $splay}}
# Schedule the first regular daemon run for
{{now() + ((($run_interval - (now() % $run_interval)) + fqdn_rand($certname)) % $run_interval)}}
# Every subsequent run should be scheduled exactly {{$run_interval}} after the previously scheduled run

* Optionally, a provision can be made to skip a scheduled run if the previous scheduled run completed too soon within a configured proximity. This may happen after the special startup run, for example, or if a previously scheduled run entered into a backoff/retry loop based on server 503 responses and so did not start/complete for much longer than expected


 

 

We advise our largest customers to achieve this scheduling algorithm today in order to better distribute load, and to prevent thundering herds. We should eliminate the onerous workaround by incorporating it into our product, and in so doing deliver the same benefits to all Puppet users, as well as eliminate one of our most common troubleshooting exasperations.

Ciprian Badescu (Jira)

Jul 13, 2021, 10:31:04 AM
to puppe...@googlegroups.com

Reid Vandewiele (Jira)

Jul 14, 2021, 8:13:02 PM
to puppe...@googlegroups.com
Reid Vandewiele commented on Improvement PUP-11163
 
Re: Prevent thundering herd via deterministic run scheduling

Beth Glenfield, what information is needed for this ticket? (Asking based on the ticket status change; I may just not understand what that status means in the workflow.)

Beth Glenfield (Jira)

Jul 26, 2021, 8:59:04 AM
to puppe...@googlegroups.com

Hi Reid Vandewiele, I was on PTO for a couple of weeks so I am only getting around to this now. The team assigns tickets to me when they require some Product input, or if we need to gather requirements. When do we advise our largest customers on steps to mitigate this? Is it during initial setup? Also cc'ing Josh Cooper on this for some engineering expertise.

Charlie Sharpsteen (Jira)

Sep 20, 2021, 10:02:03 AM
to puppe...@googlegroups.com
Charlie Sharpsteen assigned an issue to Unassigned
 
Change By: Charlie Sharpsteen
Assignee: Beth Glenfield

Ciprian Badescu (Jira)

Sep 20, 2021, 10:52:02 AM
to puppe...@googlegroups.com

Yasmin Rajabi (Jira)

Sep 20, 2021, 10:52:03 AM
to puppe...@googlegroups.com
Yasmin Rajabi commented on Improvement PUP-11163
 
Re: Prevent thundering herd via deterministic run scheduling

Reid Vandewiele, Charlie Sharpsteen: how often is this coming up? I'm trying to balance this with the rest of the tickets. If you were to guess, what percentage of customers run into it?

Reid Vandewiele (Jira)

Sep 20, 2021, 11:20:05 AM
to puppe...@googlegroups.com

100% of customers run into the problem. A lesser percentage notice it. If the customer is small (say, less than a thousand nodes), the available Puppet compute power will eclipse the total agent workload enough to make the thundering herd issue not a problem. Small herd, small thunder.

The issue becomes a problem when the available compute power starts to become more evenly matched to the workload. Charlie Sharpsteen would have to speak to the percentage of customers that Support needs to advise on enabling our special configuration max-queued-requests option, which is the first tier workaround used to mitigate the issue.

For our largest customers, say >10,000 nodes (at a guess) we pretty universally have to talk about max-queued-requests at a minimum up front, and for most we end up evolving into suggesting the second tier workaround, which is reidmv-puppet_run_scheduler or similar—basically, stop using the service, and use cron/scheduled-tasks to run the agent instead.

That strategy, even as a pre-implemented module, is uncomfortable because all customers would rather run service daemons than cron/scheduled-tasks. The big ones grumpily end up sacrificing that in exchange for stable performance.

Reid Vandewiele (Jira)

Sep 21, 2021, 6:49:03 PM
to puppe...@googlegroups.com
Reid Vandewiele updated an issue
Change By: Reid Vandewiele
h2. Problem

One of the most-viewed articles in Open Source Puppet Assist between September of 2020 and March of 2021 was [determine a thundering herd condition in Puppet|https://ospassist.puppet.com/hc/en-us/articles/360040327753-determine-a-thundering-herd-condition-in-puppet-] (OSPA-21).

Thundering herds have been an out-of-box problem in Puppet for more than a decade, and for users "in the know", there is a clear, bulletproof solution: use splayed, deterministic run start times to evenly distribute agent checkins.

Problematically, this doesn't work with our daemonized agent. Customers must instead disable the agent, and use OS schedulers such as Cron (Posix) and Scheduled Tasks (Windows). See [this module|https://forge.puppet.com/modules/reidmv/puppet_run_scheduler] for full automation of the configuration.
h2. Desired Solution

The Puppet agent should provide a deterministic run scheduling mode which evenly distributes agent runs automatically, and eliminates the possibility of thundering herds developing. This mode should become the new default mode.

Characteristics:
* Scheduled run start times should be based on computed wall-clock times, not on variable-time events such as previous run end times.
* An agent's individual run start time should include a random offset element, seeded on the agent's certname setting _plus a random seed_.†
* The agent should schedule and perform a first-start run quickly (immediately, or splayed if splay is configured) on daemon startup, after which subsequent run starts should adhere to the computed wall-clock schedule

On startup, an agent might schedule runs thusly.
# Schedule a startup run for
{{now() + $splay}}
# Schedule the first regular daemon run for
{{now() + ((($run_interval - (now() % $run_interval)) + fqdn_rand($certname)) % $run_interval)}}
# Every subsequent run should be scheduled exactly {{$run_interval}} after the previously scheduled run

* Optionally, a provision can be made to skip a scheduled run if the previous run completed too recently, within a configured proximity of the scheduled start. This may happen after the special startup run, for example, or if a previously scheduled run entered a backoff/retry loop based on server 503 responses and so did not start or complete until much later than expected.
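
One way the optional skip provision might look, as a hedged Python sketch. The function name `should_skip` and the `min_gap` parameter are hypothetical; the ticket does not name a setting for the configured proximity.

```python
from typing import Optional


def should_skip(scheduled_start: float,
                last_run_completed: Optional[float],
                min_gap: float) -> bool:
    """Return True when a scheduled run falls too close to the previous run.

    scheduled_start    -- wall-clock time of the run about to start
    last_run_completed -- completion time of the previous run, or None
                          if no run has completed yet
    min_gap            -- assumed setting: minimum seconds that must have
                          passed since the previous run completed
    """
    if last_run_completed is None:
        return False
    # A negative difference means the previous run finished after this
    # slot's start time (for example after a long retry loop), so this
    # slot is skipped as well.
    return scheduled_start - last_run_completed < min_gap
```

Skipping a slot does not move the schedule: the following run still starts at its own computed wall-clock time.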

 

 

We advise our largest customers to achieve this scheduling algorithm today in order to better distribute load, and to prevent thundering herds. We should eliminate the onerous workaround by incorporating it into our product, and in so doing deliver the same benefits to all Puppet users, as well as eliminate one of our most common troubleshooting exasperations.

----
† It is not important for a given agent to always select the same wall-clock start times; only for agents to have different, evenly spread start times, and for subsequent runs to be +consistently+ spaced.

Reid Vandewiele (Jira)

Sep 21, 2021, 6:50:03 PM
to puppe...@googlegroups.com
† It is not important for a given agent to always select the same wall-clock start times; only for agents to have different, evenly spread start times, and for subsequent runs to be +consistently+ spaced. Including a random seed element permits the (unusual, not recommended) use of the same certificate by multiple agents.

Reid Vandewiele (Jira)

Sep 21, 2021, 6:51:03 PM
to puppe...@googlegroups.com
† It is not important for a given agent to always select the same wall-clock start times; only for agents to have different, evenly spread start times, and for subsequent runs to be +consistently+ spaced. Including a random seed element permits the (unusual, not recommended) use of the same certificate by multiple agents. Including the certname accounts for the possibility of running on a system which cannot generate true random seeds.
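
A minimal sketch of the seeding idea in this footnote, assuming the offset is chosen once at daemon startup and reused for every later run. The function name `startup_offset` is hypothetical, and SHA-256 is used here only as a convenient way to mix the two inputs; the ticket does not prescribe a hash.

```python
import hashlib
import secrets
from typing import Optional


def startup_offset(certname: str, run_interval: int,
                   random_seed: Optional[bytes] = None) -> int:
    """Pick this daemon's offset in [0, run_interval) at startup.

    The certname keeps agents apart even where good randomness is not
    available; the random seed keeps agents apart even if they share a
    certificate. The result is computed once and then reused, so runs
    stay consistently spaced for the life of the daemon.
    """
    if random_seed is None:
        random_seed = secrets.token_bytes(16)
    material = certname.encode("utf-8") + b"\x00" + random_seed
    digest = hashlib.sha256(material).digest()
    return int.from_bytes(digest[:8], "big") % run_interval
```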

Nick Walker (Jira)

Sep 28, 2021, 5:12:02 PM
to puppe...@googlegroups.com
Nick Walker commented on Improvement PUP-11163
 
Re: Prevent thundering herd via deterministic run scheduling

I think this is basically the same request I made in the past via PUP-4212.

Reid Vandewiele (Jira)

Sep 28, 2021, 5:23:03 PM
to puppe...@googlegroups.com

Possibly. There are definitely similarities. If there's any difference, I think it would be that this ticket is more "smooth workload distribution is something I want; thundering herds shouldn't be possible", while 4212 may have taken the slightly different focus of "fix the acute problem/symptom of thundering herds by breaking them up when they occur".

The tighter you get to workload tolerances at scale, the less tolerable it is to have herds in the first place.

Ciprian Badescu (Jira)

Oct 1, 2021, 8:17:02 AM
to puppe...@googlegroups.com

Reid Vandewiele, what are the benefits of randomizing the first daemon run? Does adding the startup randomization to the first daemon run randomization give better results?

Could implementing something like this be enough?

  1. first puppet-agent run: now() + $splay; if there is a collision, do 3
  2. subsequent puppet-agent runs: each $run_interval; if there is a collision, do 3, otherwise reset backoff_multiplier
  3. collisions handling (based on server 503)
    1. increment backoff_multiplier
    2. sleep for backoff_multiplier * (retry-after || fixed default)
    3. reset run interval to now() and retry daemon run
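
The collision handling in step 3 could be sketched as follows. This is an illustration of the proposal only: the `Backoff` class, `next_run_time`, and the 30-second `DEFAULT_RETRY_DELAY` are assumptions made for the example, and `retry_after` stands for the value of the server's Retry-After header when one is present.

```python
from dataclasses import dataclass
from typing import Optional

DEFAULT_RETRY_DELAY = 30.0  # assumed fixed default, in seconds


@dataclass
class Backoff:
    multiplier: int = 0

    def on_collision(self, retry_after: Optional[float]) -> float:
        """Steps 3.1 and 3.2: bump the multiplier, return the sleep time."""
        self.multiplier += 1
        return self.multiplier * (retry_after or DEFAULT_RETRY_DELAY)

    def on_success(self) -> None:
        """Step 2: a run without a collision resets the multiplier."""
        self.multiplier = 0


def next_run_time(now: float, run_interval: float, status: int,
                  backoff: Backoff,
                  retry_after: Optional[float] = None) -> float:
    """Return when the agent should try again after a run attempt at `now`."""
    if status == 503:
        # Step 3.3: retry after the backoff sleep; the interval is then
        # counted from the time of that retry.
        return now + backoff.on_collision(retry_after)
    backoff.on_success()
    return now + run_interval
```

Note that under this scheme the run times drift with every collision, which is the opposite of the fixed wall-clock schedule described in the ticket.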

OP1: is the value of retry-after fixed? (I assumed this, and that's why I proposed the backoff_multiplier.)

OP2: is there any soft means to detect "Thundering herds" other than HTTP 503 response?

 

 
