Jira (BOLT-1454) Bolt is super slow

Cyril Cordoui (JIRA)

Jul 3, 2019, 10:19:03 AM
to puppe...@googlegroups.com
Cyril Cordoui created an issue
 
Puppet Task Runner / Bug BOLT-1454
Bolt is super slow
Issue Type: Bug
Assignee: Unassigned
Created: 2019/07/03 7:18 AM
Priority: Normal
Reporter: Cyril Cordoui

When running bolt on a couple of servers, tasks are super slow to execute.

With a simple shell (or Python) task that only does an echo, it takes more than 80 seconds to run on 100 servers, whereas other tools are roughly 5 to 14 times faster:

 

# ansible -f100 -oi inventory.ini -a '/bin/echo "{\"hello\":\"Tha world\"}"' srvs > /dev/null
ansible -f100 -oi inventory.ini -a srvs > /dev/null 14.38s user 14.77s system 366% cpu 7.964 total
# ansible -oi inventory.ini -a '/bin/echo "{\"hello\":\"Tha world\"}"' srvs > /dev/null
ansible -oi inventory.ini -a '/bin/echo "{\"hello\":\"Tha world\"}"' srv 11.70s user 9.47s system 144% cpu 14.626 total
 
# pssh -Ph srv_list 'echo {"Hello": "tha world"}' > /dev/null
pssh -Ph srv_list 'echo {"Hello": "tha world"}' > /dev/null 13.71s user 1.58s system 271% cpu 5.643 total
 
# /usr/local/bin/bolt task run xxx::echo2 --nodes=all > /dev/null
sudo /usr/local/bin/bolt task run xxx::echo2 --nodes=all > /dev/null 80.14s user 2.83s system 103% cpu 1:20.52 total

 

The content of echo2.sh:

 

#!/bin/sh
echo '{"hello":"tha world"}'

 

It may be related to the fact that bolt is only using one core where we have 24 threads available on our (physical) server.

 


Nick Lewis (JIRA)

Jul 9, 2019, 7:10:04 PM
to puppe...@googlegroups.com
Nick Lewis commented on Bug BOLT-1454
 
Re: Bolt is super slow

This is really interesting, thanks for the report. Do you notice the same slowness when using bolt command run as well, or is it specific to tasks? I wonder if the difference could be related to having to scp the task file...

I wouldn't be surprised if Bolt were a little slower in this case, but 80 seconds is extremely wrong.

Cyril Cordoui (JIRA)

Jul 10, 2019, 8:09:03 AM
to puppe...@googlegroups.com

No, same issue with command run:

time /usr/local/bin/bolt command run 'echo \"{"hello": "tha world"}\"' --nodes=all > /dev/null
/usr/local/bin/bolt command run 'echo \"{"hello": "tha world"}\"' > 81.92s user 2.22s system 103% cpu 1:21.32 total

Nick Lewis (JIRA)

Jul 10, 2019, 6:37:03 PM
to puppe...@googlegroups.com
Nick Lewis commented on Bug BOLT-1454

I tested on 50 nodes and a simple command run took ~3.3 seconds. The same thing in pssh takes ~0.8 seconds and ~6.6 seconds in ansible. This is with four cores and concurrency set to 50 for all of them.

Is it possible that Bolt is using a different, slower authentication method? I'm not sure how that would account for such a large difference though.

Unfortunately, since it seems to be an environmental issue of some sort, it's hard to say what might be going on...

Cyril Cordoui (JIRA)

Jul 11, 2019, 7:02:02 AM
to puppe...@googlegroups.com

We are on RHEL 7.6 with 128 GB of RAM and 24 threads. Authentication is done through SSH keys (for all three tools used in the benchmark).

When you run the test on your box, is Bolt using multiple cores? From what we observed, that seems to be the bottleneck.

Nick Lewis (JIRA)

Jul 11, 2019, 6:03:03 PM
to puppe...@googlegroups.com
Nick Lewis commented on Bug BOLT-1454

It's also only using one core on my machine. That's a limitation of Ruby, unfortunately. But it doesn't seem to be a computationally intensive task in my case, since it's still running very quickly.

I found a potentially related issue: https://github.com/net-ssh/net-ssh/issues/567

If I cat my known_hosts file into itself a few times, Bolt takes more than twice as long to run. Does the host you're running from have a particularly large known_hosts file?

Weirdly, this seems to even be the case if I run with --no-host-key-check set...
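
The reproduction described above (growing known_hosts by appending it to itself) can be sketched without touching the real file. This is a minimal sketch, not something from the ticket: the function name `inflate_known_hosts` and the paths in the usage note are hypothetical.

```shell
# inflate_known_hosts SRC DST DOUBLINGS
# Hypothetical helper: copy SRC to DST, then double DST's content
# DOUBLINGS times, so the original known_hosts is never modified.
inflate_known_hosts() {
  src=$1
  dst=$2
  doublings=$3
  cp "$src" "$dst"
  i=0
  while [ "$i" -lt "$doublings" ]; do
    # Go through a temporary copy: appending a file to itself
    # directly (cat f >> f) is refused or misbehaves on some systems.
    cp "$dst" "$dst.tmp"
    cat "$dst.tmp" >> "$dst"
    rm -f "$dst.tmp"
    i=$((i + 1))
  done
}
```

For example, `inflate_known_hosts ~/.ssh/known_hosts /tmp/known_hosts.big 4` would produce a copy 16 times the original size. A test run could then point `UserKnownHostsFile` at that copy in the ssh config and time the same bolt command against it.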

Nick Lewis (JIRA)

Jul 11, 2019, 6:47:02 PM
to puppe...@googlegroups.com
Nick Lewis commented on Bug BOLT-1454

I've found a couple things. The first is that net-ssh's parsing of known_hosts is inherently quite slow (~300ms with 20k lines). The other is that net-ssh parses the known_hosts file once for every host being targeted. The first issue is relatively easy to fix, but the latter requires a bigger restructure.
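
To illustrate why parsing once per target is costly, here is a shell analogy. It is not net-ssh's code (net-ssh is Ruby and its implementation is not shown in this thread). Both function names are hypothetical, and the sketch assumes plain, unhashed known_hosts entries with a single host name in the first field.

```shell
# lookup_per_host KNOWN_HOSTS TARGETS
# Analogy for the per-target behavior: one full scan of KNOWN_HOSTS
# for every host name listed in TARGETS (one name per line).
lookup_per_host() {
  while IFS= read -r host; do
    awk -v h="$host" '$1 == h { print $1, $2; exit }' "$1"
  done < "$2"
}

# lookup_once KNOWN_HOSTS TARGETS
# Analogy for the parse-once approach: load the target names into
# memory first, then scan KNOWN_HOSTS a single time.
lookup_once() {
  awk 'NR == FNR { want[$1] = 1; next }
       ($1 in want) { print $1, $2 }' "$2" "$1"
}
```

Both functions print the same host and key-type pairs (possibly in a different order), but the first reads the file once per target. At the ~300 ms per parse quoted above for 20k lines, 100 targets would add roughly 30 seconds if the parses run one after another.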

Nick Lewis (JIRA)

Jul 11, 2019, 8:05:03 PM
to puppe...@googlegroups.com
Nick Lewis commented on Bug BOLT-1454

I filed a PR against net-ssh with some improvements to known_hosts parsing. There's more work to be done to only parse it once, but this is a substantial improvement anyway.

https://github.com/net-ssh/net-ssh/pull/682

Cyril Cordoui (JIRA)

Jul 12, 2019, 7:07:02 AM
to puppe...@googlegroups.com

Nice, you pinned down the issue. Just by adding a temporary UserKnownHostsFile=/dev/null to our ssh config:

time /usr/local/bin/bolt command run 'echo \"{"hello": "tha world"}\"' --nodes=all > /dev/null

/usr/local/bin/bolt command run 'echo \"{"hello": "tha world"}\"' > 8.22s user 0.50s system 76% cpu 11.367 total

We do indeed have tens of thousands of hosts in that file.
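
A quick way to check whether another machine might be affected is to count the entries in its known_hosts file. This is a minimal sketch: the function name `count_known_hosts_entries` is hypothetical, and the default path assumes the usual per-user location.

```shell
# count_known_hosts_entries [FILE]
# Hypothetical helper: print the number of entries in a known_hosts
# file, ignoring blank lines and comment lines.
count_known_hosts_entries() {
  file=${1:-$HOME/.ssh/known_hosts}
  # -v with two -e patterns keeps lines matching neither pattern;
  # -c prints how many such lines there are.
  grep -cv -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$file"
}
```

A count in the tens of thousands, as reported here, would suggest the slowdown discussed in this thread is worth investigating on that machine.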
