SSH slave performance degradation


Dean Yu

Jul 29, 2014, 2:51:42 PM
to jenkin...@googlegroups.com
Hi folks,
  We just upgraded our cluster from 1.509.4 to 1.554.3, and discovered a significant increase in our build times. Builds that typically took ~50 minutes to complete started taking ~90 minutes to finish, sometimes spiking to 2 hours. While researching, we found this JIRA[1] which reported that downgrading the trilead-ssh2 jar solved the performance issues.
  While this ticket talks specifically about artifact downloads, we see that our builds as a whole were slower.
  The trilead-ssh2 dependency version was updated by [2], so it was introduced in 1.536 and would only have made it to LTS with 1.554.1 in April.
  Looking at the trilead-ssh2 repo[3], it looks like there were a small set of changes:
   - changes by ndeloof to merge a newer upstream (build214 to build217)
   - changes by stephenc to fix connection bugs
   - changes by kohsuke to support package window sizes

  Anyone have thoughts on the likely culprit? Given the severity of the performance hit we took, I'm surprised that more people haven't reported this.

  -- Dean



Stephen Connolly

Jul 29, 2014, 5:17:05 PM
to jenkin...@googlegroups.com
* KK's changes to window sizes should have *increased* performance
* My connection bug fixes were surgical IIRC
* Nicolas's merge of upstream seems to include an EOL change, so hard to see what changed there with the Github diff tool: https://github.com/jenkinsci/trilead-ssh2/compare/trilead-ssh2-build214-jenkins-3...trilead-ssh2-build217-jenkins-5

I wonder if something changed upstream...

BTW you are using /dev/./urandom as an entropy source for the JVM?



Dean Yu

Jul 30, 2014, 12:44:17 AM
to jenkin...@googlegroups.com
Obviously, going from 1.509.4 to 1.554.3 is a pretty big jump that included lots and lots of changes. However, the fact that the singular act of downgrading that library got us back to our prior build times is a big smoking gun to me.

> I wonder if something changed upstream...

From the upstream release notes:

build217, 2013-06-03:

- Support for SSH agent based authentication.

build216, 2013-03-04:

- Support of unencrypted entries in the known_hosts file.
- Improved timeout handling.

> BTW you are using /dev/./urandom as an entropy source for the JVM?

Nope. Should we?

  -- Dean

Mark Waite

Jul 30, 2014, 2:34:39 AM
to jenkin...@googlegroups.com
I thought that the common default on Linux was for /dev/random to block if the pool of random data was emptied. Refer to http://en.wikipedia.org/?title=/dev/random for a description.

I thought that /dev/urandom did not block if the pool of random data was emptied.  That same article describes the differences between the two.

I've seen cases with some versions of Java and some Linux variants where Java performance suffered badly when I had emptied the pool of random data.  I think that is why Stephen recommends using /dev/urandom so that your program won't block while waiting for random data.

Thanks!
Mark Waite

Stephen Connolly

Jul 30, 2014, 2:42:52 AM
to jenkin...@googlegroups.com
In my scalability testing I have found you cannot scale out ssh slaves with /dev/random as the entropy source. You need to use /dev/./urandom (JVM bug requires that name btw)

The master on windows is a different story though



Dean Yu

Jul 30, 2014, 9:48:55 AM
to jenkin...@googlegroups.com
This is great info, but at what number of SSH slaves does this become a problem? We have 12. (And again, the problem goes away by downgrading the library.)

Stephen Connolly

Jul 30, 2014, 10:04:34 AM
to jenkin...@googlegroups.com
On an AWS m3.large I could not even get to 10 SSH slaves connected without switching to /dev/./urandom

Mike Chmielewski

Jul 30, 2014, 10:23:54 AM
to jenkin...@googlegroups.com
What's the most straightforward way to add this to my installation: add it to the container args for the master and to the node configuration for the slaves? Is this needed just on the master, or just on the slaves?
--
Mike Chmielewski
 

Stephen Connolly

Jul 30, 2014, 10:51:25 AM
to jenkin...@googlegroups.com
On all Linux machines you can just add `-Djava.security.egd=file:/dev/./urandom` to the JVM startup command.

This is more critical on the Jenkins master than on the slaves, as the slaves typically only have one connection back to the master whereas the master has connections to multiple slaves.

If you have multiple slaves sharing the same machine then you would probably need it for the slaves also.

Windows machines do not have this issue as far as I am aware.

Oh and yes that crazy path is the only way to get it to work... it's a bug/feature of the JVM
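
To confirm that the flag actually reached a given JVM, something like the following could be run with the same startup arguments. This is only a minimal sketch: the class name `EntropySourceCheck` and the helper `describe` are made up for illustration. `java.security.egd` is the system property set by the flag above, and `securerandom.source` is, as far as I know, the matching entry in the JRE's `java.security` file that is used when the system property is not set.

```java
import java.security.Security;

final class EntropySourceCheck {
    // Hypothetical helper: reports which entropy source setting is in effect.
    // Assumption: a non-empty -Djava.security.egd value wins over the
    // securerandom.source entry from the java.security file.
    static String describe(String egdProperty, String securerandomSource) {
        if (egdProperty != null && !egdProperty.isEmpty()) {
            return "java.security.egd=" + egdProperty;
        }
        if (securerandomSource != null && !securerandomSource.isEmpty()) {
            return "securerandom.source=" + securerandomSource;
        }
        return "no entropy source configured";
    }

    public static void main(String[] args) {
        // Read both settings from the running JVM and print the effective one.
        System.out.println(describe(
                System.getProperty("java.security.egd"),
                Security.getProperty("securerandom.source")));
    }
}
```

Running it once with and once without the `-D` flag should show whether the startup script is passing the option through.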

Stephen Connolly

Jul 30, 2014, 10:53:57 AM
to jenkin...@googlegroups.com
On 30 July 2014 14:48, Dean Yu <dea...@gmail.com> wrote:
> the problem goes away by downgrading the library.

It would be great if you could determine whether the new version of the library is selecting a different cipher suite from the old version. It may be a change in the cipher priority that could have impacted performance, or perhaps there was a replay attack that the older version was vulnerable to and the fix may require more entropy than the old version... 
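
To illustrate why a change in cipher priority alone could matter: SSH picks the first algorithm on the client's list that the server also supports (RFC 4253, section 7.1), so reordering the client's list can silently change the negotiated cipher. The sketch below is self-contained and does not use trilead-ssh2; the class name and all three cipher lists are invented for the example and are not the library's real defaults.

```java
import java.util.Arrays;
import java.util.List;

final class CipherNegotiationSketch {
    // Returns the first cipher in the client's preference order that the
    // server also supports, or null if there is no common cipher.
    static String negotiate(List<String> clientPreference, List<String> serverSupported) {
        for (String cipher : clientPreference) {
            if (serverSupported.contains(cipher)) {
                return cipher;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical lists, for illustration only.
        List<String> server = Arrays.asList("aes128-ctr", "aes256-ctr", "3des-cbc");
        List<String> oldClient = Arrays.asList("aes128-ctr", "3des-cbc");
        List<String> newClient = Arrays.asList("aes256-ctr", "aes128-ctr", "3des-cbc");

        System.out.println("old client negotiates: " + negotiate(oldClient, server));
        System.out.println("new client negotiates: " + negotiate(newClient, server));
    }
}
```

With these made-up lists the server is unchanged, yet the two client versions end up on different ciphers, which is the kind of difference worth ruling out between the old and new jar.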

Stephen Connolly

Jul 30, 2014, 11:03:51 AM
to jenkin...@googlegroups.com
Release Notes:
==============

build217, 2013-06-03:

- Support for SSH agent based authentication.

build216, 2013-03-04:

- Support of unencrypted entries in the known_hosts file.
- Improved timeout handling. 

build214, 2011-04-25:

- Project build procedure uses Gradle; project artifacts from now on 
  are available at TMate Software Maven repository at http://maven.tmatesoft.com/

build213, 2008-04-01:

- Added a workaround for servers that violate RFC4253 when sending the
  SSH_MSG_SERVICE_ACCEPT and the SSH_MSG_KEXDH_REPLY messages.
  Thanks to Gordon Brockway.

- Fixed encodings for alien platforms (e.g., EBCDIC based). Use "ISO-8859-1" in
  most places where we used the default platform encoding so far.

- API change: atime and mtime attributes in SFTPv3FileAttributes are now
  of type Long (not Integer). Makes it easier to properly handle values > 2^31.

- Fixed the blowfish-ctr cipher, it could not be instantiated (a typo that
  got in during the move to the trilead namespace). Thanks to Roelof Kemp.

- Still in the queue: SSH server support.

Dean Yu

Jul 30, 2014, 11:08:02 AM
to jenkin...@googlegroups.com
> Oh and yes that crazy path is the only way to get it to work... it's a bug/feature of the JVM

I take it from the lack of mention of a JVM version that this is still an outstanding issue?

Dean Yu

Jul 30, 2014, 11:16:17 AM
to jenkin...@googlegroups.com
Jenkins 1.535 used build 213, so while those changes look interesting, they are part of the old version of the library.

  -- Dean


Stephen Connolly

Jul 30, 2014, 12:24:28 PM
to jenkin...@googlegroups.com
yep

Stephen Connolly

Jul 30, 2014, 12:26:55 PM
to jenkin...@googlegroups.com
IIUC http://bugs.java.com/view_bug.do?bug_id=4705093 assigned special meaning to "/dev/urandom" so to avoid that special meaning you need to add the /./
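
A rough sketch of why the extra /./ helps, assuming the special case is triggered by an exact string match on the configured value (that is my reading, not something confirmed from the JDK source here). The class and method names are hypothetical. The two values differ as strings, but the operating system resolves both to the same device file.

```java
import java.nio.file.Paths;

final class UrandomPathSketch {
    // Assumption for illustration: the JVM special-cases exactly this string.
    static final String SPECIAL_CASED = "file:/dev/urandom";

    // True if the configured value would hit the assumed special case.
    static boolean hitsSpecialCase(String egdValue) {
        return SPECIAL_CASED.equals(egdValue);
    }

    // Strips the "file:" scheme and collapses "." segments, which is what
    // the file system effectively does when the path is opened.
    static String resolvedPath(String egdValue) {
        String path = egdValue.substring("file:".length());
        return Paths.get(path).normalize().toString();
    }

    public static void main(String[] args) {
        for (String value : new String[] {"file:/dev/urandom", "file:/dev/./urandom"}) {
            System.out.println(value
                    + " special-cased=" + hitsSpecialCase(value)
                    + " resolves-to=" + resolvedPath(value));
        }
    }
}
```

On a Linux machine both values resolve to /dev/urandom, but only the first one matches the special-cased string.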

Jesse Glick

Aug 5, 2014, 3:16:19 PM
to jenkin...@googlegroups.com