Jira (PUP-11163) Prevent thundering herd via deterministic run scheduling

Reid Vandewiele (Jira)

Jul 7, 2021, 1:01:02 PM

to puppe...@googlegroups.com

Reid Vandewiele created an issue

Puppet / PUP-11163 / Prevent thundering herd via deterministic run scheduling

Issue Type:	Improvement
Assignee:	Unassigned
Created:	2021/07/07 10:00 AM
Priority:	Normal
Reporter:	Reid Vandewiele

Problem

One of the most-viewed articles in Open Source Puppet Assist between September 2020 and March 2021 was "Determine a thundering herd condition in Puppet" (OSPA-21).

Thundering herds have been an out-of-the-box problem in Puppet for more than a decade, and for users "in the know" there is a clear, bulletproof solution: use a combination of splayed and deterministic run start times to evenly distribute agent check-ins.

Unfortunately, this does not work with our daemonized agent. Customers must instead disable the agent daemon and use OS schedulers such as cron (POSIX) or Scheduled Tasks (Windows). See this module for full automation of the configuration.

Desired Solution

The Puppet agent should provide a deterministic run scheduling mode which evenly distributes agent runs automatically, and eliminates the possibility of thundering herds developing. This mode should become the new default mode.

Characteristics:

Scheduled run start times should be based on computed wall-clock times, not on variable-time events such as previous run end times.
An agent's individual run start time should include a random splay, seeded on the agent's certname setting.
The agent should schedule and perform a first-start run quickly (immediately, or splayed if splay is configured) on daemon startup, after which subsequent run starts should adhere to the computed wall-clock schedule.
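To illustrate the second characteristic, here is a minimal Python sketch of a certname-seeded offset. The function name `certname_offset` and the use of SHA-256 are assumptions made for illustration only; this is a stand-in for Puppet's `fqdn_rand`, not its actual implementation.

```python
import hashlib


def certname_offset(certname: str, run_interval: int) -> int:
    """Return a stable pseudo-random offset in [0, run_interval) seconds.

    Hypothetical stand-in for a certname-seeded splay: the same certname
    always yields the same offset, while different certnames spread out
    across the interval.
    """
    if run_interval <= 0:
        raise ValueError("run_interval must be positive")
    digest = hashlib.sha256(certname.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned integer seed.
    return int.from_bytes(digest[:8], "big") % run_interval


if __name__ == "__main__":
    for name in ("web01.example.com", "web02.example.com", "db01.example.com"):
        print(name, certname_offset(name, 1800))
```

Because the offset depends only on the certname and the interval, it survives daemon restarts and reboots, which is what keeps the fleet-wide distribution from drifting.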

On startup, an agent might schedule runs as follows:

Schedule a startup run for
    now() + $splay
Schedule the first regular daemon run for
    now() + ((($run_interval - (now() % $run_interval)) + fqdn_rand($certname)) % $run_interval)
Schedule every subsequent run exactly $run_interval after the previously scheduled run
Optionally, skip the first scheduled daemon run if it would occur too soon after the startup run
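The steps above can be sketched in Python as follows. This is only an illustration under stated assumptions: times are integer seconds since the epoch, `offset` stands in for the value of `fqdn_rand($certname)`, and the names `first_regular_run`, `plan_runs`, and `min_gap` are hypothetical, not part of Puppet.

```python
from typing import List, Optional, Tuple


def first_regular_run(now: int, run_interval: int, offset: int) -> int:
    """Next wall-clock time t >= now with t % run_interval == offset % run_interval."""
    delay = ((run_interval - (now % run_interval)) + offset) % run_interval
    return now + delay


def plan_runs(
    now: int,
    run_interval: int,
    splay: int,
    offset: int,
    count: int = 3,
    min_gap: Optional[int] = None,
) -> Tuple[int, List[int]]:
    """Return (startup_run_time, [regular_run_times...]).

    min_gap is a hypothetical knob for the optional rule: when set, the first
    regular run is skipped if it falls less than min_gap seconds after the
    startup run.
    """
    startup = now + splay
    first = first_regular_run(now, run_interval, offset)
    if min_gap is not None and first - startup < min_gap:
        first += run_interval  # skip the run that would come too soon
    # Every later run is exactly run_interval after the previous scheduled run.
    regular = [first + i * run_interval for i in range(count)]
    return startup, regular


if __name__ == "__main__":
    # now=1000, interval=1800s, splay=60s, certname-derived offset=300s
    print(plan_runs(1000, 1800, 60, 300))
    # prints (1060, [2100, 3900, 5700])
```

Each regular run time satisfies `t % run_interval == offset`, so the schedule depends only on the wall clock and the certname-derived offset, not on when the daemon started or how long earlier runs took.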

We advise our largest customers to implement this scheduling algorithm today in order to better distribute load and to prevent thundering herds. We should eliminate the onerous workaround by incorporating it into our product, and in so doing deliver the same benefits to all Puppet users, as well as eliminate one of our most common troubleshooting exasperations.

