Slow salt-call: how to profile?


Peter Waller

Dec 9, 2013, 12:19:09 PM
to salt-...@googlegroups.com
Hi All,

I'm still evaluating whether salt is suitable for our purposes, and a significant concern is how quickly it runs. I ran our toy highstate using salt-call and it was significantly slower than expected, taking a minute when I expected it to take seconds. During this time salt-master and salt-minion were consuming lots of CPU across ~10 processes. Watching the live debug output, things like file.append were completing at a rate of roughly one per second (~1 Hz), which I find surprisingly slow.

It doesn't help that when you run `salt '*' state.highstate` you are blind until it finishes.

I'd like to try to understand what is actually taking the time, so that I can accurately report to the list what I'm experiencing, preferably without having to divulge the full log.

Is there any mechanism to find out what is going slowly? Any advice on that front?

Thanks,

- Peter

Seth House

Dec 9, 2013, 12:25:56 PM
to salt-...@googlegroups.com
If you start the minion in the foreground with logging turned up, you
can see things as they happen right on the console:

salt-minion -l debug
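
If timestamps would help on that output, one option is to pipe it through a
timestamping filter (just a sketch; ts comes from the moreutils package):

# prefix every debug line with a wall-clock timestamp
salt-minion -l debug 2>&1 | ts '%H:%M:%S'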

Mike Place

Dec 9, 2013, 12:27:40 PM
to salt-...@googlegroups.com, Seth House
Out of curiosity, is this slowness happening in a virtualized environment? Can you post some more details about the infrastructure that you’re testing on?

-mp

Peter Waller

Dec 9, 2013, 12:41:16 PM
to salt-...@googlegroups.com
I'm testing on a micro instance on AWS, which may be partly to blame for the slowness.

I did look at the -l debug output. Is it possible to get timestamps on those lines?

I'm seeing stuff like this ('Loaded minion key'), which looks slow and is repeated an awful lot. Is this normal?

[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] Decrypting the current master AES key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[INFO    ] Executing state user.present for asdf
[INFO    ] User asdf is present and up to date
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] Decrypting the current master AES key
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem
[DEBUG   ] Loaded minion key: /etc/salt/pki/minion/minion.pem


It is surprising to me to watch several salt processes compete for CPU for tens of seconds over what seems like it should be trivial work. `salt-call` is even using 30 seconds of user time. Without being able to easily peek inside, it is really hard to understand why it is being so slow. If I could compare one run to another and see that the trivial parts are all slower, then I might more easily be able to blame "the crappy instance". Otherwise I don't know whether it is stalling on the network or something else.

However, the fact that salt-call is using ~30 seconds of user time is pretty damning for what I'm doing. I'd *really* like to be able to see where that time went. My state configuration is simple enough that this concerns me. It's just:

* Install ntp: pkg.installed, service.running, file.managed
* Install some devops packages: pkg.installed ~40 packages
* Configure some users: file.managed, 5*(user.present, file.directory, file.managed, cmd.run, file.append).

And it's taking a minute. I'd really expect it to take maybe as much as ten seconds. How will this scale when I have ten times the configuration?
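
For concreteness, the tree is shaped roughly like this (a sketch with illustrative names and paths, not the real configuration):

# rough shape of the ntp part of the tree (illustrative only)
mkdir -p /srv/salt/ntp
cat > /srv/salt/ntp/init.sls <<'EOF'
ntp:
  pkg.installed:
    - name: ntp
  service.running:
    - require:
      - pkg: ntp

/etc/ntp.conf:
  file.managed:
    - source: salt://ntp/ntp.conf
EOF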

Another question: can I make salt-call do things in parallel? There are a few cmd.runs which do network activity and it would be nice if they were concurrent.

Thanks,

- Peter

David Anderson

Dec 9, 2013, 12:53:21 PM
to salt-...@googlegroups.com
If you're running salt-master and salt-minion in addition to other
services on the micro instance, you are probably swapping. Try watching
'vmstat 1' during your highstate to get an idea of whether this is true.
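
Something along these lines, watched while the highstate runs (a sketch;
the si/so columns are the ones to look at):

# memory/swap statistics once per second; sustained non-zero si/so
# (swap-in/swap-out) means the box is actively swapping
vmstat 1
# a quick before/after look at swap usage also helps
free -m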
--
Dave

Peter Waller

Dec 10, 2013, 3:42:42 AM
to salt-...@googlegroups.com
On 9 December 2013 17:53, David Anderson <da...@dubkat.com> wrote:
If you're running salt-master and salt-minion in addition to other services on the micro instance you are probably swapping. Try watching 'vmstat 1' during your highstate to get an idea if this is true.

It doesn't look that way, no.

Peter Waller

Dec 10, 2013, 5:21:19 AM
to salt-...@googlegroups.com
So I did some profiling of salt-call using pyprof2calltree [1]. It looks like 50% of the wall time is spent in poll and the other 50% in event.fire_master, with the majority of that in __authenticate and sign_in, split evenly between payload.py:send_auto -> send -> zmq_poll and crypt.py:verify_master -> decrypt_aes.

The crypt verify-style functions are being called 30 times for one salt-call run. Is that expected?

I attach an xz'd log file which can be viewed with kcachegrind. A screenshot of this profile is available here [2].
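
For anyone who wants to reproduce this, an invocation along these lines should work (a sketch; the output paths are arbitrary):

# run salt-call under cProfile, then convert the result to callgrind format
sudo python -m cProfile -o /tmp/salt-call.prof $(which salt-call) state.highstate
pyprof2calltree -i /tmp/salt-call.prof -o /tmp/salt-call.log
kcachegrind /tmp/salt-call.log   # or: pyprof2calltree -i /tmp/salt-call.prof -k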

Peter Waller

Dec 10, 2013, 5:22:37 AM
to salt-...@googlegroups.com
Apologies, I managed to not attach it. Here is a link instead:

Peter Waller

Dec 16, 2013, 3:59:45 AM
to salt-...@googlegroups.com
On 10 December 2013 10:21, Peter Waller <pe...@scraperwiki.com> wrote:
So I did some profiling of salt-call using pyprof2calltree [1]. It looks like 50% of the wall time is spent in poll and the other 50% in event.fire_master, with the majority of that in __authenticate and sign_in, split evenly between payload.py:send_auto -> send -> zmq_poll and crypt.py:verify_master -> decrypt_aes.

The crypt verify-style functions are being called 30 times for one salt-call run. Is that expected?

Hi All,

I didn't get a response to this. I'm really interested to know whether this is the expected behaviour or whether I should file an issue for it. My vague understanding is that it's repeatedly burning a lot of CPU on crypto functions that I would intuitively expect not to need to be called repeatedly. Is it doing a setup/teardown of a connection every time?

If it is something which can be fixed, it could potentially have a significant impact on my highstate runtime and the CPU usage of salt. I see 42% of the CPU time of salt-call is spent in 30 calls of __authenticate (1 per salt state, it looks like).

Here are the links to a profile screenshot and the profile data itself (viewable with kcachegrind): <http://pwaller.net/sw/salt-call-screenshot.png>, <http://pwaller.net/sw/salt-call.log.xz>.

All the best,

- Peter

Colton Myers

Dec 17, 2013, 1:10:59 PM
to salt-...@googlegroups.com
Thanks for the bump, Peter.  I'm not familiar enough with these particular sections of the code to know whether this is expected and/or can be fixed.  When Tom gets back in on Thursday, I'll run through this with him and get back to you.

--
Colton Myers



Volker

Dec 18, 2013, 11:23:57 AM
to salt-...@googlegroups.com

Hi,
>
> The crypt verify-style functions are being called 30 times for one
> salt-call run. Is that expected?
>
I can't really answer your questions, but I'd like to jump in on the
decryption/verification/authentication topic.

Currently a salt-minion has a static connection to salt-master:4505 to
receive AES-encrypted commands.

All returns a minion sends, mine data it reports, files it requests
from the master, etc. are sent through new connections to
salt-master:4506. These connections are made dynamically whenever
necessary.

For example, a mine.send call [0] (which sends data from a minion to
the master) creates a salt.crypt.SAuth object [1], which handles
logging into the master. Once the login has succeeded, an SREQ object
is created [2], which does the actual sending of the data.

A brief look at the states code showed that the fileclient is
involved, which basically does the same thing as the mine module
(SAuth, SREQ, etc.).

Looking at my states, which are still _really_ simple, I can tell
that a state.highstate does at least three minion authentications and
various decryptions [3]. Depending on what you use in your states,
there may be many more.
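
A quick way to count this on any minion is to grep the debug output of
a single run (the patterns match the log excerpt Peter posted earlier):

# capture one run's debug output and count the authentication-related lines
salt-call -l debug state.highstate 2>&1 | tee /tmp/highstate-debug.log
grep -c 'Decrypting the current master AES key' /tmp/highstate-debug.log
grep -c 'Loaded minion key' /tmp/highstate-debug.log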

In case you're interested in how salt actually authenticates, there's
an abstract [4] which seems to have vanished from the docs since 0.12.1.

Generally, I'm wondering why the decision was made to use dynamic
connections when returning data to, or requesting data from, the
salt-master. A static connection could be reused at any time without
the authentication/encryption/decryption overhead.

It is, by the way, exactly that overhead that really hurts scalability
in my current setup. I'm running several thousand minions on a single
master, and having them authenticate all the time does not scale well.

I started addressing this with this patch [5] to at least ease the
burden on the master, but it still does not scale as I'd like it to.
Maybe the possibility of using a static connection for returning data
to the master could be added? I'd even do it myself, but that part of
salt is quite complicated.

Let's see what Thomas has to say about this :-)

[0]
https://github.com/saltstack/salt/blob/develop/salt/modules/mine.py#L87
[1] https://github.com/saltstack/salt/blob/develop/salt/crypt.py#L488
[2] https://github.com/saltstack/salt/blob/develop/salt/payload.py#L152
[3] http://pastebin.com/NjLf6fFQ
[4]
http://salt.readthedocs.org/en/v0.12.1/topics/specs/salt_auth_proto_abs.html
[5] https://github.com/saltstack/salt/pull/9235

- volker

Volker

Jan 5, 2014, 7:24:29 AM
to salt-...@googlegroups.com

On 12/18/13 5:23 PM, Volker wrote:
>
> Generally, I'm wondering why the decision was made to use dynamic
> connections when returning data to, or requesting data from, the
> salt-master. A static connection could be reused at any time without
> the authentication/encryption/decryption overhead.
>
> It is, by the way, exactly that overhead that really hurts scalability
> in my current setup. I'm running several thousand minions on a single
> master, and having them authenticate all the time does not scale well.
>
> I started addressing this with this patch [5] to at least ease the
> burden on the master, but it still does not scale as I'd like it to.
> Maybe the possibility of using a static connection for returning data
> to the master could be added? I'd even do it myself, but that part of
> salt is quite complicated.

Bump!

A comment on this would be highly appreciated.

Improving in this area is a must for salt if it wants to be an
enterprise solution.

How and why do I claim the above?

In my current setup I would need 6+ Dell 610s with dual quad-cores to
be able to administer all our servers (~13,000 VMs). That is because
having minions fire events on the master, report data to the mine,
apply states, etc. does not scale beyond 2,000 minions per master,
because of all the connections, authentication and decryption being
done. Dropping the keysize to 2048 helps, but not very much [0].
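
For anyone who wants to try that, the change looks roughly like this on
a minion (a sketch; it assumes the option is named keysize, and the old
keypair has to be regenerated and re-accepted on the master):

# use smaller RSA keys on this minion (assumed option name: keysize)
echo 'keysize: 2048' >> /etc/salt/minion
# remove the existing keypair so a new one is generated at the smaller size
rm -f /etc/salt/pki/minion/minion.pem /etc/salt/pki/minion/minion.pub
service salt-minion restart
# on the master, drop the old key and accept the freshly submitted one
salt-key -d my-minion-id && salt-key -a my-minion-id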

Also, having more than one master implies more requirements in order
to be enterprise-ready:

1) Having many masters requires another logic layer which is (not yet)
provided by salt:

Which minion is on which master?
What master should the next new minion be added to?
What key needs to be accepted/revoked/deleted on which master?
Master1 is broken, which minions can't be reached now?
etc.

The syndic does help in some areas, but not with key management on
various masters, distributing minions across multiple masters
"atomically", etc.

This is just a general thought. Key management requirements differ
from company to company and maybe even from project to project. Salt
itself cannot provide this logic, just APIs to work with.

2) Many hardware masters are not the desired approach economically.
Each and every server running means more power consumption and less
profit in the end, which is basically what the actual decision-makers
in upper management look at. 5,000 minions per hardware master are
much easier to sell than 2,000.

3) Each release has new feature x or feature y. Sometimes I wish
that, instead of adding new features, a release would focus purely on
stability, performance and scalability improvements.


Pretty much all the patches I have submitted so far aim at improving
salt in the scalability area, but this "problem" is in salt's core.
Using the correct, "Hatch-accepted" approach is absolutely required!

So please comment on this.

- volker

[0] https://github.com/saltstack/salt/pull/9235


JC Lawrence

Jan 5, 2014, 1:19:10 PM
to salt-...@googlegroups.com

On 5 Jan 2014, at 04:24, Volker <g...@schwicking.de> wrote:

> Which minion is on which master?

My first thought is to use a hash with the minion's name or IP address or whatever (just pick some known value or combination of values with good entropy) as input. That way the mapping is deterministic, never needs to be stored, and the answers to the following questions become trivial.

> What master should the next new minion be added to?
> What key needs to be accepted/revoked/deleted on which master?
> Master1 is broken, which minions cant be reached now?
> etc.
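
As a toy illustration of the hashing idea (just a sketch, not anything Salt provides):

#!/bin/bash
# deterministically map a minion ID onto one of N masters
minion_id="$1"
masters=(master1.example.com master2.example.com master3.example.com)
# take the first 8 hex digits of SHA-256 of the ID and reduce modulo the
# number of masters; the same ID always lands on the same master, so the
# mapping never needs to be stored anywhere
prefix=$(printf '%s' "$minion_id" | sha256sum | cut -c1-8)
echo "${masters[$(( 16#$prefix % ${#masters[@]} ))]}"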

-- JCL

Peter Waller

Jan 13, 2014, 5:32:54 AM
to salt-...@googlegroups.com, Thomas Hatch, Colton Myers, developers
[Back on-list again. Summary of what was missed: not very much, just a few pings. Colton suspected that state progress events were causing an authentication every time.]

I re-ran the original experiment using the same version as before just now, on the same machine.

salt '*' state.highstate
 - first run: 37 seconds
 - second run: 22 seconds

(compared to previously when it was more than a minute)

So I suspect that when I was running these commands the AWS EC2 hardware may have been under heavy load, making the authentication operation quite expensive.

I switched versions using salt-bootstrap to 'daily' and tried again:

$ salt --versions-report
           Salt: 0.17.4
         Python: 2.7.5+ (default, Sep 19 2013, 13:48:49)
         Jinja2: 2.7
       M2Crypto: 0.21.1
 msgpack-python: 0.3.0
   msgpack-pure: Not Installed
       pycrypto: 2.6
         PyYAML: 3.10
          PyZMQ: 13.1.0
            ZMQ: 3.2.3

$ time sudo salt '*' state.highstate
real    0m12.912s
user    0m0.261s
sys     0m0.126s

That was quite a bit faster (than the original minute), and closer to what I would think is reasonable.

In the future if we run halite and want to see progress and that sort of thing, will that mean paying the ~2x penalty, or is there a plan in the works to avoid the repeated authentication cost?

Thanks again,

- Peter

On 13 January 2014 09:23, Peter Waller <pe...@scraperwiki.com> wrote:
$ salt --versions-report
           Salt: 0.17.0-5095-g1af467e
         Python: 2.7.5+ (default, Sep 19 2013, 13:48:49)
         Jinja2: 2.7
       M2Crypto: 0.21.1
 msgpack-python: 0.3.0
   msgpack-pure: Not Installed
       pycrypto: 2.6
         PyYAML: 3.10
          PyZMQ: 13.1.0
            ZMQ: 3.2.3

On 10 January 2014 20:54, Colton Myers <colton...@gmail.com> wrote:
For a while, on the develop branch, state progress events were turned on by default, and these events would cause an auth for every defined state.  Tom is looking into it, but he thinks that in recent versions of salt, since we turned those events off by default, it shouldn't be authing on every state.
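
For reference, the knob in question appears to be the master's state_events option (a sketch; worth verifying the option name and default against your version):

# check whether per-state progress events are enabled on this master
grep -n 'state_events' /etc/salt/master || echo 'state_events not set (off by default)'
# to disable them explicitly:
echo 'state_events: False' >> /etc/salt/master
service salt-master restart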

Corey Quinn

Jan 13, 2014, 6:25:56 AM
to salt-...@googlegroups.com, Thomas Hatch, Colton Myers, developers


On Jan 13, 2014, at 5:32 AM, Peter Waller <pe...@scraperwiki.com> wrote:

[Back on-list again. Summary of what was missed: not very much, just a few pings. Colton suspected that state progress events were causing an authentication every time.]

I re-ran the original experiment using the same version as before just now, on the same machine.

salt '*' state.highstate
 - first run: 37 seconds
 - second run: 22 seconds

(compared to previously when it was more than a minute)

So I suspect that when I was running these commands the AWS EC2 hardware may have been under heavy load, making the authentication operation quite expensive.


Eep. One of the Joyent guys has a whole series on benchmarking in the cloud. The takeaway really distills down to "lots can change here."

Any chance of repeating this in a known environment?


Peter Waller

Jan 13, 2014, 7:25:11 AM
to Corey Quinn, salt-...@googlegroups.com, Thomas Hatch, Colton Myers, developers
On 13 January 2014 11:25, Corey Quinn <co...@sequestered.net> wrote:
Eep. One of the Joyent guys has a whole series on benchmarking in the cloud. The takeaway really distills down to "lots can change here."

I'm aware that this is somewhat fraught. I'm afraid I don't have a suitable environment to test in, so I can only report what it feels like "in real life", which I think is still useful information for the developers.

However, less anecdotally, you can see the difference from these profiles, which confirms what Colton said. The .log files can be viewed with kcachegrind.


(in the latter, the crypto functions are nowhere to be seen, explaining the ~2x speedup).


Peter Waller

Jan 30, 2014, 6:17:33 AM
to Corey Quinn, salt-...@googlegroups.com, Thomas Hatch, Colton Myers, developers
Bump. This thread was never resolved, IIRC. I'm still interested to hear from Tom et al. what's on the horizon. I'd like to know in particular whether salt will get a "progress" mechanism that will not slow everything down in the manner observed here.

Cheers,

- Peter

Colton Myers

Feb 4, 2014, 2:34:23 PM
to salt-...@googlegroups.com
I replied to Peter directly (in response to a direct e-mail), but I wanted to reply here as well.  The event system was recently fixed so that each new event doesn't have to reauth with the master, but uses the existing session instead.  This means that progress events (and other events) are substantially faster.  These improvements are in the 2014.1.0 release, which should go out this week.  (It's in RC now)

--
Colton Myers



Peter Waller

Mar 10, 2014, 12:45:23 PM
to salt-...@googlegroups.com, Colton Myers, developers
On 4 February 2014 19:34, Colton Myers <colton...@gmail.com> wrote:
The event system was recently fixed so that each new event doesn't have to reauth with the master, but uses the existing session instead.  This means that progress events (and other events) are substantially faster.  These improvements are in the 2014.1.0 release, which should go out this week.  (It's in RC now)

I've not done any profiling, but we've just updated from 0.17.5 to 2014.1.0 and our no-op highstate time has gone from 14s (which we were happy with) to 2 minutes (a >6x slowdown). It's trivially reproducible on several different machines, just by downgrading/upgrading the minion.

The difference is so stark that at first we doubted the old minion could have been working properly. But it was.

We have 218 states, so it's a slowdown of ~500 ms per state, which looks a lot like the amount of slowdown I was seeing before.
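
(That figure is just the extra wall time spread over the states, roughly:)

# ~(120s - 14s) of extra wall time over 218 states, in milliseconds
echo $(( (120 - 14) * 1000 / 218 ))   # ~486 ms of added overhead per state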

Thoughts? Is this likely to improve? I'm a bit disappointed.

Peter Waller

Mar 13, 2014, 5:25:15 AM
to salt-...@googlegroups.com

Volker

Mar 14, 2014, 10:34:38 AM
to salt-...@googlegroups.com

On 2/4/14 8:34 PM, Colton Myers wrote:
> I replied to Peter directly (in response to a direct e-mail), but I
> wanted to reply here as well. The event system was recently fixed
> so that each new event doesn't have to reauth with the master, but
> uses the existing session instead. This means that progress events
> (and other events) are substantially faster. These improvements
> are in the 2014.1.0 release, which should go out this week. (It's
> in RC now)
>

This is great news! I'm still running 0.17.x, but that's a very good
reason to upgrade. Thank you very much for sharing that!

- felskrone


Peter Waller

Mar 14, 2014, 10:51:22 AM
to salt-...@googlegroups.com
Volker, you missed the most recent post where I pointed out that this is not fixed.

See this issue:

https://github.com/saltstack/salt/issues/10798


