JenkinsRule tests sometimes time out


Jacob Keller

Apr 9, 2018, 3:29:03 PM
to Jenkins Developers
Hi,

I've been noticing, at least on my test machine and in the Jenkins tests run when I submit pull requests, that the JenkinsRule from jenkins-test-harness can time out when running tests. The tests would be run multiple times; sometimes they would pass and get marked as "flaky", while other times they would just fail. Essentially, something was causing them to take longer than the 180-second timeout. Interestingly, if I run the test cases one at a time (i.e. with -Dtest=<class name>), the tests *always* pass and do not take a long time. It is only when running the tests as a whole that some appear to time out.

I tried extending the Jenkins test timeout system property to 10 minutes, and the failure went away, but the tests took significantly longer to run, so I started digging into the tests that were timing out to see what was going on. The actual content of each test ran incredibly fast, but the Jenkins initialization was taking too long and timing out before completion.
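For readers who want to see what overriding that property amounts to, here is a minimal sketch of reading a timeout from a system property with a 180-second fallback. The property name `jenkins.test.timeout` is an assumption about what the harness reads, and the class `TestTimeoutSketch` is purely illustrative, not code from jenkins-test-harness.

```java
// Sketch only: shows how a timeout can be resolved from a system property.
// The property name is an assumption; the class is hypothetical.
final class TestTimeoutSketch {
    static final String PROPERTY = "jenkins.test.timeout";
    static final int DEFAULT_SECONDS = 180; // the default mentioned in this thread

    /** Returns the configured timeout in seconds, or the default if unset or not a number. */
    static int resolveTimeoutSeconds() {
        return Integer.getInteger(PROPERTY, DEFAULT_SECONDS);
    }

    public static void main(String[] args) {
        System.out.println("Effective timeout: " + resolveTimeoutSeconds() + "s");
    }
}
```

Under that assumption, something like `mvn test -Djenkins.test.timeout=600` would give the 10-minute limit described above.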

Digging further, I discovered that it could take a very long time for the SSHD server to start. It turns out that JenkinsRule defaults to the 1.651.2 jenkins-war, which does not disable the server by default. (This wasn't changed until later.) I do not understand why the server can take so long to start up. I tried running tests with various values of --forkCount and --reuseForks, which did not change the outcome.

Then I dug into the actual JenkinsRule code and tried to figure out if we could just disable the SSHD server (since newer versions of Jenkins do this by default anyway). Updating jenkins-test-harness to target the latest 2.x version did not work, and would probably cause difficulty in testing plugins wishing to support older versions.

Directly calling SSHD.get().setPort() doesn't work, because we need to change the default port value prior to launching the server, as otherwise we still pay the initialization cost.

I settled on adding a PresetData which has the configuration file for the SSHD module in place. This works OK, but requires putting @PresetData on every test. Additionally, I couldn't figure out any way to get @PresetData to work with @ClassRule.

To ease testing, I modified the JenkinsRule to simply force the correct HomeLoader, instead of using an empty directory.

Once I did so, it resolved the issues in the Git plugin, and made the tests more reliable.

I'm wondering if there are any better ideas for how to solve this? It's very annoying to have some tests arbitrarily fail due to this timeout, especially when we don't need the SSHD server for these test cases. It's also troublesome because in some cases it can cause the automated bot that tests GitHub pull requests to report test failures, even though the changes are actually fine and it's just the timeout bug that caused the failure. Sometimes it passes because the tests just get marked as "flaky" after running a few times, but other times all 5 runs fail.

It's even worse when debugging locally, since it's very odd to see test failures when running mvn test but not when running mvn test -Dtest=<testname>.

Other areas I haven't explored:

(1) modifying the jenkins-war to simply excise the sshd-module entirely so that it doesn't even try to load (would cause problems for any test case that actually requires the SSHD to start).
(2) attempting to debug why the sshd-module can sometimes take forever. (It's possible it's platform/OS related?)
(3) modifying the JenkinsRule to not start the timeout countdown until after initialization (would fix the test failure, but ultimately leaves the tests taking a significantly longer time to run)

Thanks,
Jake

More information on the root cause so far is at https://issues.jenkins-ci.org/browse/JENKINS-50642

Jesse Glick

Apr 9, 2018, 4:26:23 PM
to Jenkins Dev
On Mon, Apr 9, 2018 at 2:44 PM, Jacob Keller <jacob....@gmail.com> wrote:
> I discovered that it could take a significantly long
> time for the SSHD server to start. It turns out that JenkinsRule defaults to
> the 1.651.2 jenkins-war which does not disable the server by default.

`JenkinsRule` is _built_ against some older version of Jenkins core.
What you _run_ against is specified by `jenkins.version` in your POM.
Select something newish, and SSHD will not be started by default, and
your problem is solved.

> attempting to debug why the sshd-module can sometimes take forever.
> (It's possible it's platform/OS related?)

Sure, maybe.

> modifying the JenkinsRule to not start the timeout countdown until after
> initialization

No, this would allow tests which genuinely hang in e.g. `@LocalData`
to never terminate.
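To illustrate the point, here is a minimal sketch of a watchdog whose countdown covers setup as well as the test body, so a hang in either phase is caught. It is not how JenkinsRule is implemented; `WholeTestTimeoutSketch` and `runWithTimeout` are hypothetical names used only for this example.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch only: hypothetical helper, not part of jenkins-test-harness.
final class WholeTestTimeoutSketch {
    /** Runs setup then body on a worker thread; returns false if they do not finish in time. */
    static boolean runWithTimeout(Runnable setup, Runnable body, long timeoutMillis) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "test-worker");
            t.setDaemon(true); // a hung worker must not keep the JVM alive
            return t;
        });
        // The clock starts before setup, so a hang during initialization is also caught.
        Future<?> result = executor.submit(() -> {
            setup.run();
            body.run();
        });
        try {
            result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (TimeoutException e) {
            result.cancel(true); // interrupt the worker thread
            return false;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        } finally {
            executor.shutdownNow();
        }
    }
}
```

If the countdown instead started only after `setup` returned, a setup step that never returns would leave the caller waiting indefinitely.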

Jacob Keller

Apr 9, 2018, 5:15:27 PM
to Jenkins Developers


On Monday, April 9, 2018 at 1:26:23 PM UTC-7, Jesse Glick wrote:
> On Mon, Apr 9, 2018 at 2:44 PM, Jacob Keller <jacob....@gmail.com> wrote:
> > I discovered that it could take a significantly long
> > time for the SSHD server to start. It turns out that JenkinsRule defaults to
> > the 1.651.2 jenkins-war which does not disable the server by default.
>
> `JenkinsRule` is _built_ against some older version of Jenkins core.
> What you _run_ against is specified by `jenkins.version` in your POM.
> Select something newish, and SSHD will not be started by default, and
> your problem is solved.


Ok, that fixes the case of testing against newer versions of Jenkins. I think the Git plugin still wants to target older versions, though.
 
> > attempting to debug why the sshd-module can sometimes take forever.
> > (It's possible it's platform/OS related?)
>
> Sure, maybe.

I'm not exactly sure how to do that, but I suppose I could at least add more logging statements to sshd-module and build a version of Jenkins that includes those. Ultimately, it won't resolve the issue if building against older versions (unless we backport such a fix to maintenance branches?)
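As a rough idea of what such logging could look like, here is a minimal sketch that times a single startup step and logs the elapsed time. `StartupTimingSketch` and `timeMillis` are hypothetical names for illustration only; nothing here is taken from sshd-module.

```java
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

// Sketch only: hypothetical timing helper, not code from sshd-module.
final class StartupTimingSketch {
    private static final Logger LOGGER = Logger.getLogger(StartupTimingSketch.class.getName());

    /** Runs the step, logs how long it took, and returns the elapsed time in milliseconds. */
    static long timeMillis(String label, Runnable step) {
        long start = System.nanoTime(); // monotonic clock, unaffected by wall-clock changes
        step.run();
        long elapsedMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        LOGGER.info(label + " took " + elapsedMillis + " ms");
        return elapsedMillis;
    }
}
```

Wrapping each phase of the server start in a call like this would at least show which phase accounts for the delay.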
 
> > modifying the JenkinsRule to not start the timeout countdown until after
> > initialization
>
> No, this would allow tests which genuinely hang in e.g. `@LocalData`
> to never terminate.

Ok, that makes sense to avoid.

Thanks,
Jake