Thoughts on Salt command timeout


Ryan Schneider

Nov 28, 2012, 10:46:48 PM
to salt-...@googlegroups.com

I've been load testing Salt the last couple days, and have run into some less than ideal situations with the way timeouts are handled.

My test setup:

~275 hosts running 6 minions each (1650 minions)

Some number of those hosts (~10) are down, so I never hit the if len(set(ret.keys())) == len(minions) short-circuit in LocalClient.get_returns.

I'm distributing a directory containing a large number of files to the hosts using cp.get_dir (actually a small set of files to each minion, using the filename templating feature in 0.10.5).  This takes somewhere between 30 and 90 seconds to finish.

Which is to say, if I set the timeout to 30 I don't get any replies (it looks like salt handles the get_dir sub-commands 'breadth-first', so no single minion completes within the first 30 seconds).  If I set the timeout to 90 seconds they all return, but I have to wait the full 90 seconds because of the down hosts (and in production, we will probably always have a non-zero number of hosts offline for maintenance of some kind).

What I'd really _like_ the logic to be is something like:

- PUB the jid to all minions
- The minions send back a jid_in_progress {jid: 1234..} event every X seconds (configurable, something like jid_event_interval, defaulting to roughly 10 seconds).
- Consider the jid done when one of:
 -- all minions are accounted for
 -- the timeout is hit
 -- no minion has sent a jid_in_progress event for my jid in X seconds (configurable, something like jid_event_timeout, defaulting to something like 2x the minion's jid_event_interval default)

Of course, this would only work if:

- Some of the minions can reliably send a jid_in_progress event
- These events can be propagated back to the LocalClient and seen in the get_returns and get_iter_returns loops
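The "consider the jid done" rule above could be sketched roughly like this (a minimal illustration of the decision logic only; jid_done, jid_event_timeout, and all parameter names here are hypothetical, not part of Salt):

```python
import time

def jid_done(minions, returned, last_event_time, start_time,
             timeout, jid_event_timeout, now=None):
    """Decide whether to stop waiting on a jid.

    minions: set of minion ids targeted by the pub
    returned: set of minion ids that have sent their return
    last_event_time: timestamp of the latest jid_in_progress event, or None
    """
    now = time.time() if now is None else now
    # 1. all minions are accounted for
    if returned >= minions:
        return True
    # 2. the hard timeout is hit
    if now - start_time >= timeout:
        return True
    # 3. no minion has reported progress for jid_event_timeout seconds
    last = last_event_time if last_event_time is not None else start_time
    if now - last >= jid_event_timeout:
        return True
    return False
```

With a rule like this, a handful of permanently-down hosts only cost jid_event_timeout seconds of extra waiting after the last live minion goes quiet, instead of the full command timeout.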

So, my questions are:

- Am I on the right track here?  Is this an already-solved issue whose solution I'm overlooking?  Or are others running into it?
- Can the events be generated via SaltEvent and accessed in the LocalClient?  Is that what get_event_iter_returns is for?
- Is there anything else I should be leveraging for such a system?  I'm especially wondering whether there is anything new in the develop branch, like overstate or the reactor work, that I should be looking at.

If this is something worth pursuing I'd like to try to tackle it sometime in December (and hopefully get back to my TLS/x509 support as well, but we'll see).

Thanks,
Ryan

Thomas S Hatch

Nov 28, 2012, 11:42:29 PM
to salt-...@googlegroups.com
I am curious, how are you running the salt command? If you are executing from the command line, then it calls the cmd_cli method in the LocalClient class and does get pings back from the minions regarding progress.

Or you could send the command out without waiting, with the verbose option to get the jid, then use the job system to fetch the return data later on.

Or are you running the command via the api?

Ryan Schneider

Nov 29, 2012, 10:40:15 AM
to salt-...@googlegroups.com
Yes, I'm using the API, which is called from our overall provisioning/deployment daemon. I'm currently using get_returns, but will probably want to switch to get_iter_returns to provide status feedback. The daemon runs a series of salt commands in sequence, depending on what we need to provision, targeting minions in groups sized by how many services/hosts we can take down in a batch given current user load.

I'll do some quick tests today to see if running some of the steps from the CLI behaves radically differently.

Thanks,
Ryan

Thomas S Hatch

Nov 29, 2012, 12:10:29 PM
to salt-...@googlegroups.com
Sure thing, take a look at some of the other methods in the LocalClient class, namely cmd_cli and cmd_iter. Note that I just updated cmd_iter in git, and both of these methods are generators.
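As a rough sketch of what consuming such a generator looks like (fake_cmd_iter below is a made-up stand-in, assuming only that cmd_iter-style methods yield one {minion_id: return} dict at a time as results arrive):

```python
def fake_cmd_iter(tgt, fun, arg=None):
    """Stand-in for a generator method like LocalClient.cmd_iter:
    yields per-minion returns as they come back."""
    for minion in ('web1', 'web2', 'web3'):
        yield {minion: {'ret': True}}

# Consume results incrementally instead of blocking on one big dict;
# this is what makes the generator-style methods useful for status feedback.
returned = {}
for chunk in fake_cmd_iter('*', 'test.ping'):
    for minion_id, data in chunk.items():
        returned[minion_id] = data
        print('%s returned' % minion_id)
```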

Ryan Schneider

Nov 29, 2012, 2:57:10 PM
to salt-...@googlegroups.com

So I'm actually seeing some really weird results with different timeouts, on both the command line and the LocalClient API.

For example:

> time ./bin/salt '*' -t 5 -c /my/virtualenv/master/etc/salt cp.get_file salt://test/{{grains.id}}.json /tmp/{{grains.id}}.json gzip=5 template=jinja makedirs=True > /tmp/out

real 1m3.740s
user 0m2.687s
sys 0m0.561s
> wc -l /tmp/out
1640 /tmp/out

> time ./bin/salt '*' -t 90 -c master/etc/salt cp.get_file salt://test/{{grains.id}}.json /tmp/{{grains.id}}.json gzip=5 template=jinja makedirs=True > /tmp/out

real 2m29.166s
user 0m2.693s
sys 0m0.517s
> wc -l /tmp/out
1640 /tmp/out

So, I feel like on the command line the timeout is how long to wait 'idle' (which is what I want).  But the LocalClient API behaves differently:

-- time_test.py --
import time
import salt
import salt.client

c = salt.client.LocalClient('etc/salt/master')

def timing(f):
    def wrap(*args):
        time1 = time.time()
        ret = f(*args)
        time2 = time.time()
        # f.__name__ works on both Python 2 and 3 (f.func_name is Python 2 only)
        print('%s function took %0.3f ms' % (f.__name__, (time2 - time1) * 1000.0))
        return ret
    return wrap

@timing
def get_file(timeout=5):
    ret = c.cmd(
        tgt='*',
        fun='cp.get_file',
        arg=[ 'salt://test/{{grains.id}}.json',  '/tmp/{{grains.id}}.json', 'template=jinja',
        'makedirs=True', 'gzip=5' ],
        timeout=timeout
    )
    print(len(ret))

get_file(5)

--
> . bin/activate #in my virtual env
> ./bin/python -i time_test.py
13
get_file function took 9310.855 ms

Which is to say, it stopped hard after ~9 seconds and reported back only 13 results (the 13 minions that had responded by then).  The master was clearly still getting results back.  If I set the timeout to 90, I get:
1640
get_file function took 94943.794 ms

So, it seems like cmd and cmd_cli treat the timeout semantics very differently.

Thanks,
Ryan

Thomas S Hatch

Nov 30, 2012, 2:42:47 PM
to salt-...@googlegroups.com
Yes, they are, very different.
The timeout for cmd is static: after the last minion replies, it waits out the timeout and then stops listening. The minions still reply, and you can of course use the job system to look up the results after the fact.
cmd_cli sends out additional commands to check the status of the running job, using the timeout as an interval, so it will wait until all of the registered minions finish the executed job.
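The two semantics can be sketched side by side. This is only an illustration of the two waiting rules, not Salt's actual implementation (the poll callable and make_poll helper are made up; real cmd_cli re-publishes job-status checks rather than polling locally):

```python
import time

def wait_static(poll, timeout, expected, tick=0.01):
    """cmd-style: one fixed window; stop when it elapses, finished or not."""
    returns, deadline = {}, time.time() + timeout
    while time.time() < deadline and len(returns) < expected:
        returns.update(poll())
        time.sleep(tick)
    return returns

def wait_idle(poll, timeout, expected, tick=0.01):
    """cmd_cli-style: the timeout is an idle interval, reset by activity."""
    returns, deadline = {}, time.time() + timeout
    while time.time() < deadline and len(returns) < expected:
        new = poll()
        if new:
            returns.update(new)
            deadline = time.time() + timeout  # each new return resets the window
        time.sleep(tick)
    return returns

def make_poll(schedule):
    """Test helper: schedule is a list of (seconds_from_start, {minion: ret})."""
    start, pending = time.time(), list(schedule)
    def poll():
        out, now = {}, time.time() - start
        while pending and pending[0][0] <= now:
            out.update(pending.pop(0)[1])
        return out
    return poll
```

With minions trickling in every 0.2s and a 0.3s timeout, wait_static cuts off after 0.3s with a partial result set, while wait_idle keeps extending the deadline and collects them all, which matches the 9-second hard stop seen with cmd versus the full run seen with cmd_cli.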

Ryan Schneider

Dec 3, 2012, 6:46:56 PM
to salt-...@googlegroups.com

Are they meant to be that divergent, or did they just grow apart organically?  I guess I'm wondering whether I should really just be using cmd_cli in my LocalClient, even though the name makes it feel reserved for the CLI, or whether I should try to put together a pull request that adds the cmd_cli timeout behavior to cmd() and vice versa.

Thanks,
Ryan

Thomas S Hatch

Dec 4, 2012, 3:43:00 AM
to salt-...@googlegroups.com
Yes, it was more organic. I have actually cleaned this up a lot in 0.10.6, so the cmd_iter method now uses the same algorithm as the cmd_cli method but without the printing. I have also added cmd_async, which just returns the job id, allowing the job to be watched manually instead of waited on.
All in all, yes, this needed some cleaning; hopefully these additions have done just that.

Ryan Schneider

Dec 10, 2012, 7:08:31 PM
to salt-...@googlegroups.com

Awesome, I'll test out the develop branch with cmd_iter and see how it performs in my testing scenarios above.

Thanks,
Ryan

Ryan Schneider

Dec 11, 2012, 10:59:08 PM
to salt-...@googlegroups.com

So far cmd_iter is proving to be much better, and it does seem to behave the same as the command line.

That said, in my testing it looks like the real issue is that cp.get_dir just doesn't scale well.  I switched to creating a tarball and then doing cp.get_file plus archive.tar, and things are much more reliable.  I think with a large number of minions (1000+), all the back-and-forth with _file_list/_serve_file for each minion is just too much for my master to keep up with, especially when combined with gzip.

That said, it might make sense to re-write cp.get_dir like so:

- on the master, use shutil.make_archive (2.7+) or tarfile and gzip to create an archive of the dir
- have the minions _serve_file the tarball
- the minions would then use tarfile to extract it (IMO preferable to shelling out to tar)
- delete the .tar.gz when the jid completes

That would be cleaner than what I'm doing right now, and it should scale much better (the number of calls to the master would be O(m) instead of O(mn) for a directory of n files and m minions).

FWIW, to support the way I was using cp.get_dir with archive.tar, I also added the same template argument to the cmd.run* and archive.* commands that I added to cp.get_file and cp.get_dir in 0.10.5.  I'll open a pull request once I add the docs and unit tests.

Thanks,
Ryan

Thomas S Hatch

Dec 12, 2012, 12:38:59 PM
to salt-...@googlegroups.com
Thanks Ryan, I have been contemplating something like this. I think there are a lot of things we can do to the file server in general to beef it up.