Jira (PUP-11365) Prevent thundering herd via deterministic run scheduling


Ciprian Badescu (Jira)

Nov 17, 2021, 7:21:03 AM
to puppe...@googlegroups.com
Ciprian Badescu created an issue
 
Puppet / Epic PUP-11365
Prevent thundering herd via deterministic run scheduling
Issue Type: Epic
Assignee: Yasmin Rajabi
Created: 2021/11/17 4:20 AM
Priority: Normal
Reporter: Ciprian Badescu

Problem

One of the most-viewed articles in Open Source Puppet Assist between September 2020 and March 2021 was "determine a thundering herd condition in Puppet" (OSPA-21).

Thundering herds have been an out-of-box problem in Puppet for more than a decade, and for users "in the know", there is a clear, bulletproof solution: use splayed, deterministic run start times to evenly distribute agent checkins.

Problematically, this doesn't work with our daemonized agent. Customers must instead disable the agent and use OS schedulers such as cron (POSIX) and Scheduled Tasks (Windows). See this module for full automation of the configuration.

Desired Solution

The Puppet agent should provide a deterministic run scheduling mode which evenly distributes agent runs automatically, and eliminates the possibility of thundering herds developing. This mode should become the new default mode.

Characteristics:

  • Scheduled run start times should be based on computed wall-clock times, not on variable-time events such as previous run end times.
  • An agent's individual run start time should include a random offset element, seeded on the agent's certname setting + a random seed.†
  • The agent should schedule and perform a first-start run quickly (immediately, or splayed if splay is configured) on daemon startup, after which subsequent run starts should adhere to the computed wall-clock schedule.

On startup, an agent might schedule runs thusly.

  1. Schedule a startup run for
    now() + $splay
  2. Schedule the first regular daemon run for
    now() + ((($run_interval - (now() % $run_interval)) + fqdn_rand($certname)) % $run_interval)
  3. Every subsequent run should be scheduled exactly $run_interval after the previously scheduled run.
  • Optionally, a provision can be made to skip a scheduled run if the previous run completed within a configured proximity of it. This may happen after the special startup run, for example, or if a previously scheduled run entered a backoff/retry loop based on server 503 responses and so did not start or complete until much later than expected.
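
The three scheduling steps above can be sketched in Python. This is only an illustration under stated assumptions, not the agent's implementation: times are integer epoch seconds, seeded_offset is a hypothetical stand-in for fqdn_rand($certname) that hashes the certname together with a seed, and all function names are invented for this example.

```python
import hashlib


def seeded_offset(certname: str, seed: str, run_interval: int) -> int:
    """Hypothetical stand-in for fqdn_rand: a stable offset in [0, run_interval)."""
    digest = hashlib.sha256(f"{certname}:{seed}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % run_interval


def startup_run(now: int, splay: int) -> int:
    """Step 1: the first-start run happens at now() + $splay."""
    return now + splay


def first_regular_run(now: int, run_interval: int, offset: int) -> int:
    """Step 2: the next wall-clock time whose position in the interval equals offset."""
    to_boundary = run_interval - (now % run_interval)
    return now + (to_boundary + offset) % run_interval


def schedule(now: int, run_interval: int, offset: int, count: int) -> list[int]:
    """Step 3: every later run is exactly run_interval after the previous scheduled run."""
    first = first_regular_run(now, run_interval, offset)
    return [first + k * run_interval for k in range(count)]
```

With now = 1000, a 1800-second interval, and an offset of 300, first_regular_run returns 2100, and every scheduled time satisfies t % 1800 == 300. The schedule therefore depends only on the wall clock and the offset, never on when a previous run ended.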

We advise our largest customers to achieve this scheduling algorithm today in order to better distribute load, and to prevent thundering herds. We should eliminate the onerous workaround by incorporating it into our product, and in so doing deliver the same benefits to all Puppet users, as well as eliminate one of our most common troubleshooting exasperations.


† It is not important for a given agent to always select the same wall-clock start times; only for agents to have different, evenly spread start times, and for subsequent runs to be consistently spaced. Including a random seed element permits the (unusual, not recommended) use of the same certificate by multiple agents. Including the certname accounts for the possibility of running on a system which cannot generate true random seeds.

This message was sent by Atlassian Jira (v8.13.2#813002-sha1:c495a97)

Reid Vandewiele (Jira)

Nov 17, 2021, 11:50:02 AM
to puppe...@googlegroups.com
Reid Vandewiele commented on Epic PUP-11365
 
Re: Prevent thundering herd via deterministic run scheduling

Ciprian Badescu if I'm understanding it correctly, the idea of step 3 collision handling above violates the first non-volatility characteristic given in the description: 

  • Scheduled run start times should be based on computed wall-clock times, not on variable-time events such as previous run end times

Receiving a 503 is a variable-time event. It is permissible for receipt of a 503 to delay the start time of a single run event, but it cannot be permitted to shift the schedule. A single run can be delayed up to a point, but if the single run cannot be started within its allocated portion of the schedule, it should be cancelled. The next scheduled run should still occur at the next non-volatile pre-computed run-interval offset.

Permitting the normal schedule to be shifted or changed by variable events is what permits Thundering Herds to form.
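
A minimal Python sketch of this rule, under assumptions of my own: times are integer epoch seconds, ready_at is the time at which the server stops returning 503 for that run, and max_delay is a hypothetical setting for the "allocated portion of the schedule". None of these names come from Puppet.

```python
from typing import Optional, Tuple


def next_slot(now: int, first_run: int, run_interval: int) -> int:
    """Earliest slot at or after now on the fixed grid first_run + k * run_interval."""
    if now <= first_run:
        return first_run
    intervals = -(-(now - first_run) // run_interval)  # ceiling division
    return first_run + intervals * run_interval


def resolve_run(slot: int, ready_at: int, max_delay: int,
                run_interval: int) -> Tuple[Optional[int], int]:
    """Return (actual start time, or None if the run is cancelled; next slot).

    A delay may move the start of this one run, but the next slot is always
    slot + run_interval, so the schedule itself never shifts.
    """
    start = max(slot, ready_at)
    if start - slot > max_delay:
        start = None  # could not start within its allocation: cancel this run
    return start, slot + run_interval
```

For a slot at 2100 with a 1800-second interval and max_delay of 600, a server that becomes ready at 2160 yields a run at 2160, and one that becomes ready at 3000 cancels the run. In both cases the next slot is 3900.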
