Problem

One of the most-viewed articles in Open Source Puppet Assist between September 2020 and March 2021 was the one on how to determine a thundering herd condition in Puppet (OSPA-21). Thundering herds have been an out-of-the-box problem in Puppet for more than a decade, and for users "in the know" there is a clear, bulletproof solution: use a combination of splayed and deterministic run start times to distribute agent check-ins evenly. Unfortunately, this doesn't work with our daemonized agent. Customers must instead disable the agent and use OS schedulers such as cron (POSIX) and Scheduled Tasks (Windows). See this module for full automation of the configuration.

Desired Solution

The Puppet agent should provide a deterministic run scheduling mode that distributes agent runs evenly and automatically, eliminating the possibility of thundering herds developing. This mode should become the new default.

Characteristics:
- Scheduled run start times should be based on computed wall-clock times, not on variable-time events such as previous run end times.
- An agent's individual run start time should include a random splay, seeded on the agent's certname setting.
- The agent should schedule and perform a first-start run quickly (immediately, or splayed if splay is configured) on daemon startup, after which subsequent run starts should adhere to the computed wall-clock schedule.
On startup, an agent might schedule runs as follows:
- Schedule a startup run for now() + $splay
- Schedule the first regular daemon run for now() + ((($run_interval - (now() % $run_interval)) + fqdn_rand($certname)) % $run_interval)
- Every subsequent run should be scheduled exactly $run_interval after the previously scheduled run
- Optionally, a provision can be made to skip the first scheduled daemon run if it would occur too soon after the startup run
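
The scheduling steps above can be sketched in Python as follows. This is an illustrative sketch only, not agent code: the function names, the SHA-256 hash standing in for fqdn_rand, the treatment of splay as an already-chosen number of seconds, and the min_gap parameter for the optional skip are all assumptions made for the example.

```python
import hashlib


def certname_offset(certname: str, run_interval: int) -> int:
    """Deterministic per-agent offset in [0, run_interval).

    Assumption: a SHA-256 hash of the certname stands in for fqdn_rand.
    """
    digest = hashlib.sha256(certname.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % run_interval


def first_regular_run(now: int, run_interval: int, offset: int) -> int:
    """Next wall-clock time t >= now with t % run_interval == offset."""
    until_boundary = run_interval - (now % run_interval)
    return now + ((until_boundary + offset) % run_interval)


def schedule(now, run_interval, certname, splay=0, count=3, min_gap=0):
    """Return (startup_run, [regular_runs]) as epoch seconds.

    splay:   delay before the startup run (assumed already chosen).
    min_gap: hypothetical knob; skip the first regular run if it would
             start less than min_gap seconds after the startup run.
    """
    offset = certname_offset(certname, run_interval)
    startup = now + splay
    first = first_regular_run(now, run_interval, offset)
    if first - startup < min_gap:
        first += run_interval  # skip the run that falls too soon
    # Each later run is exactly run_interval after the previous scheduled one,
    # regardless of how long any individual run takes.
    regular = [first + i * run_interval for i in range(count)]
    return startup, regular


if __name__ == "__main__":
    startup, runs = schedule(
        now=1_700_000_000,
        run_interval=1800,
        certname="agent01.example.com",
        splay=30,
        min_gap=300,
    )
    print("startup run:", startup)
    print("regular runs:", runs)
```

Because each regular run lands at the same offset within every interval, restarting the daemon does not shift an agent's position in the schedule, and agents with different certnames spread across the interval rather than clustering around a shared restart time.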
We advise our largest customers to implement this scheduling algorithm today in order to better distribute load and to prevent thundering herds. We should eliminate the onerous workaround by incorporating it into our product, and in so doing deliver the same benefits to all Puppet users, as well as eliminate one of our most common troubleshooting exasperations.