I've been load testing Salt for the last couple of days, and have run into some less-than-ideal situations with the way timeouts are handled.
My test setup:
~275 hosts running 6 minions each (1650 minions)
Some number of those hosts (~10) are down, so I never hit the if len(set(ret.keys())) == len(minions) short-circuit in LocalClient.get_returns.
I'm distributing a directory containing a large number of files to the hosts using cp.get_dir (actually a small number to each minion, using the filename templating feature in 10.5). This takes somewhere between 30 and 90 seconds to finish.
Which is to say, if I set the timeout to 30, I don't get any replies (it looks like Salt is handling the get_dir sub-commands 'breadth-first', so no single minion completes in the first 30 seconds). If I set the timeout to 90 seconds they do return, but I have to wait the full 90 seconds since there are down hosts (and in production, we will probably always have a non-zero number of hosts offline for maintenance of some type).
What I'd really _like_ the logic to be is something like:
- PUB the jid to all minions
- The minions send back a status jid_in_progress {jid: 1234..} event every X seconds (configurable, something like jid_event_interval, default 10 seconds-ish).
- Consider the jid done when one of:
-- all minions are accounted for
-- the timeout is hit
-- no minion has sent a jid_in_progress event for my jid in X seconds (configurable, something like jid_event_timeout, default to something like 2x the minion jid_event_interval default)
Of course, this would only work if:
- Some of the minions can reliably send a jid_in_progress event
- These events can be propagated back to the LocalClient and seen in the get_returns and get_iter_returns loops
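To make the three "done" conditions concrete, here is a rough Python sketch of the wait loop I have in mind. It does not use any real Salt API: wait_for_job, next_event, and the event tuples are all hypothetical names, and jid_event_timeout is the proposed (not existing) setting described above. I'm also assuming that an actual return counts as activity just like a jid_in_progress event does.

```python
import time


def wait_for_job(minions, next_event, timeout=90, jid_event_timeout=20,
                 clock=time.monotonic):
    """Collect returns until one of the three proposed 'done' conditions holds.

    next_event(wait) is a hypothetical callable that blocks for up to `wait`
    seconds and returns one of:
        ("return", minion_id, data)
        ("jid_in_progress", minion_id, None)
        None   (nothing arrived within `wait` seconds)

    Returns (ret, reason), where reason is "all_returned", "timeout",
    or "no_progress".
    """
    minions = set(minions)
    ret = {}
    start = clock()
    last_activity = start

    while True:
        # 1. All minions are accounted for.
        if minions.issubset(ret):
            return ret, "all_returned"

        now = clock()
        # 2. The overall timeout is hit.
        if now - start >= timeout:
            return ret, "timeout"
        # 3. Nobody has reported progress on this jid recently.
        if now - last_activity >= jid_event_timeout:
            return ret, "no_progress"

        # Sleep no longer than whichever deadline comes first.
        wait = min(start + timeout - now,
                   last_activity + jid_event_timeout - now)
        event = next_event(wait)
        if event is None:
            continue

        kind, minion_id, data = event
        last_activity = clock()  # assumption: returns count as activity too
        if kind == "return":
            ret[minion_id] = data
```

With something like this, the down hosts would cost me at most jid_event_timeout seconds after the last live minion goes quiet, instead of the full timeout.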
So, my questions are:
- Am I on the right track here? Is this an already solved issue whose solution I'm overlooking? Or are others running into it?
- Can the events be generated via SaltEvent and accessed in the LocalClient? Is that what get_event_iter_returns is for?
- Is there anything else I should be leveraging for such a system? I'm especially wondering if there is anything new in the develop branch, with overstate and reactive, that I should be looking at.
If this is something worth pursuing, I'd like to try to tackle it sometime in December (and hopefully get back to my TLS/x509 support as well, but we'll see).
Thanks,
Ryan