[log4net] dotnet test host process crashes or hangs when run by jenkins docker container

1,264 views
Skip to first unread message

Dominik Psenner

unread,
Feb 3, 2018, 5:51:32 AM2/3/18
to nunit-...@googlegroups.com
The first message was rejected by the mailing list, please see the forward. Please cc the dev at logging.apache.org mailing list.

---------- Forwarded message ----------
From: "Dominik Psenner" <dpse...@apache.org>
Date: 3 Feb 2018 11:47 a.m.
Subject: [log4net] dotnet test host process crashes or hangs when run by jenkins docker container
To: <nunit-...@googlegroups.com>
Cc: <d...@logging.apache.org>

Hi,

I am reaching out to you wearing the hat of the apache log4net community. I'll give you some context first.

We are in the process of automating the builds and tests for our various targeted frameworks. We still struggle with gettibg the netstandard-1.3 tests to work. Some time ago I already had a conversation with Rob Prouse about this. I was able to find a good combination of arguments to get the test results logged into a trx fike that can be consumed by jenkins. The tests are run on a ubuntu machine that starts up a docker container based on microsoft/dotnet:1.1-sdk.

What we observe is that the test suite is run and when a test fails, the tedt process does not terminate and hangs there until jenkins kills the pipeline after the 4 hour timeout. The last lifesign is this message:

    [exec] The active test run was aborted. Reason: Unable to communicate with test host process.
Sending interrupt signal to process
Cancelling nested steps due to timeout
After 10s process did not stop

Unfortunately we are unable to reproduce the issue when running the test locally and we have yet found no way to find out what actually goes wrong or what we could do to avoid the issue. Would you please help us troubleshoot and find a solution to this situation?


The pipeline configuration is in the Jenkinsfile. The test is run via nant. So the build target can be found at the very end of this file: https://github.com/apache/logging-log4net/blob/feature/cd-pipeline/tests/nant.build

And last but not least, the full jenkins build console output can be found here: https://builds.apache.org/job/logging-log4net/job/feature%252Fcd-pipeline/192/console

Best regards,
Dominik

Dominik Psenner

unread,
Feb 9, 2018, 8:23:40 AM2/9/18
to nunit-...@googlegroups.com
*bump*

Has anyone out there an idea how we could troubleshoot the issue or any
ideas what else we could do?

Dominik Psenner

unread,
May 20, 2018, 8:30:46 AM5/20/18
to nunit-...@googlegroups.com
I once more sent the mail from the wrong address. Please see the forwarded message that follows.

---------- Forwarded message ---------
From: Dominik Psenner <dpse...@apache.org>
Date: Sun, 20 May 2018, 14:24
Subject: Re: [log4net] dotnet test host process crashes or hangs when run by jenkins docker container
To: <nunit-...@googlegroups.com>
Cc: <d...@logging.apache.org>


Hi!

I've got an update on this topic. In the past days I had the chance to work on the build pipeline and today I have diagnostical test logs at hand. Unfortunately I've no idea what causes the following output:

```
runtests-netstandard-1.3:

     [exec] Build started, please wait...
     [exec] Build completed.
     [exec] 
     [exec] Test run for /home/jenkins/jenkins-slave/workspace/log4net_feature_cd-pipeline-76KUCPODUF6LCE45226EBUR4GNVLYPMYVC23Z4ITBOMNJT3CA2WA/netstandard/1.3/log4net.tests/bin/Debug/netcoreapp2.0/log4net.tests.dll(.NETCoreApp,Version=v2.0)
     [exec] Microsoft (R) Test Execution Command Line Tool Version 15.7.0
     [exec] Copyright (c) Microsoft Corporation.  All rights reserved.
     [exec] 
     [exec] Starting test execution, please wait...
     [exec] Logging Vstest Diagnostics in file: /home/jenkins/jenkins-slave/workspace/log4net_feature_cd-pipeline-76KUCPODUF6LCE45226EBUR4GNVLYPMYVC23Z4ITBOMNJT3CA2WA/netstandard/1.3/log4net.tests/test.diagnostics.log
     [exec] The active test run was aborted. Reason: 
     [exec] 
     [exec] Test Run Aborted.
     [exec] /home/jenkins/jenkins-slave/workspace/log4net_feature_cd-pipeline-76KUCPODUF6LCE45226EBUR4GNVLYPMYVC23Z4ITBOMNJT3CA2WA/tests/nant.build(869,10):
     [exec] External Program Failed: dotnet (return code was 1)
     [move] 3 files moved.

BUILD SUCCEEDED - 1 non-fatal error(s), 2 warning(s)
```

The following location [1] contains the diagnostical logfiles for the build that produced the output and I would greatly appreciate any help with diagnosing what's going wrong here.

Cheers,
Dominik


--
Dominik Psenner

Mikkel Bundgaard

unread,
May 21, 2018, 4:26:32 PM5/21/18
to NUnit-Discuss
Hi Dominik

I know very little about the technologies in your build-pipeline, but the output your are quoting are from vstest and sounds - somewhat - similar to https://github.com/nunit/nunit/issues/2817. So perhaps the blame flag can also provide you with some more information about the failing test (see also https://github.com/Microsoft/vstest/issues/627).

Kind regards,
Mikkel

Dominik Psenner

unread,
May 22, 2018, 3:45:06 AM5/22/18
to nunit-...@googlegroups.com
Hi Mikkel,

thanks for your valuable input. I already ran across [1] and the
diagnostic logs also mention similar network stream related exceptions.
Unfortunately this does not give any clue to pursue. I don't know
whether [2] is related but I updated the docker image from
microsoft/dotnet:1-sdk to microsoft/dotnet:2-sdk to remedy the symptom
that the tests would run but the test runner got stuck, hanging
indefinitely even after completion. Now, looking into the diagnostic
logs of the most recent build the following gives a rough overview of
what happens:

```

TpTrace Verbose: 0 : 528, 9, 2018/05/20, 11:46:57.915, 49369169062240712, testhost.dll, BaseRunTests.RunTestInternalWithExecutors: Running tests for executor://NUnit3TestExecutor
TpTrace Information: 0 : 528, 9, 2018/05/20, 11:46:57.951, 49369169097807798, testhost.dll, NUnit Adapter 3.10.0.21: Test execution started
TpTrace Information: 0 : 528, 9, 2018/05/20, 11:46:57.971, 49369169117450207, testhost.dll, Running all tests in /home/jenkins/jenkins-slave/workspace/log4net_feature_cd-pipeline-76KUCPODUF6LCE45226EBUR4GNVLYPMYVC23Z4ITBOMNJT3CA2WA/netstandard/1.3/log4net.tests/bin/Debug/netcoreapp2.0/log4net.tests.dll
TpTrace Information: 0 : 528, 9, 2018/05/20, 11:46:57.982, 49369169128929849, testhost.dll, AssemblyResolver: Mono.Cecil: Resolving assembly.
TpTrace Information: 0 : 528, 9, 2018/05/20, 11:46:57.983, 49369169130250657, testhost.dll, AssemblyResolver: Mono.Cecil: Resolved assembly.
TpTrace Information: 0 : 528, 9, 2018/05/20, 11:46:58.203, 49369169349596413, testhost.dll, NUnit3TestExecutor converted 91 of 91 NUnit test cases
TpTrace Warning: 0 : 528, 10, 2018/05/20, 11:46:58.338, 49369169484598006, testhost.dll, InProgressTests is null
TpTrace Warning: 0 : 528, 10, 2018/05/20, 11:46:58.341, 49369169488290160, testhost.dll, TestRunCache: No test found corresponding to testResult 'ReadOnlyToArrayTest Passed' in inProgress list.
TpTrace Warning: 0 : 528, 10, 2018/05/20, 11:46:58.342, 49369169488996148, testhost.dll, InProgressTests is null
TpTrace Warning: 0 : 528, 10, 2018/05/20, 11:46:58.342, 49369169489126577, testhost.dll, TestRunCache: No test found corresponding to testResult 'ToArrayTest Passed' in inProgress list.
TpTrace Warning: 0 : 528, 10, 2018/05/20, 11:46:58.384, 49369169531186928, testhost.dll, InProgressTests is null
TpTrace Warning: 0 : 528, 10, 2018/05/20, 11:46:58.384, 49369169531309420, testhost.dll, TestRunCache: No test found corresponding to testResult 'TestBufferSize5 Passed' in inProgress list.
TpTrace Warning: 0 : 528, 10, 2018/05/20, 11:46:58.385, 49369169531793738, testhost.dll, InProgressTests is null
TpTrace Warning: 0 : 528, 10, 2018/05/20, 11:46:58.385, 49369169531902474, testhost.dll, TestRunCache: No test found corresponding to testResult 'TestSetupAppender Passed' in inProgress list.
TpTrace Warning: 0 : 528, 10, 2018/05/20, 11:46:58.398, 49369169544484411, testhost.dll, InProgressTests is null
TpTrace Warning: 0 : 528, 10, 2018/05/20, 11:46:58.398, 49369169544600458, testhost.dll, TestRunCache: No test found corresponding to testResult 'TestComputeCheckPeriod Passed' in inProgress list.
TpTrace Warning: 0 : 528, 10, 2018/05/20, 11:46:58.493, 49369169639872299, testhost.dll, InProgressTests is null
TpTrace Warning: 0 : 528, 10, 2018/05/20, 11:46:58.493, 49369169640070127, testhost.dll, TestRunCache: No test found corresponding to testResult 'RollingCombinedWithPreserveExtension Passed' in inProgress list.
TpTrace Warning: 0 : 528, 10, 2018/05/20, 11:46:58.498, 49369169645373332, testhost.dll, InProgressTests is null
TpTrace Warning: 0 : 528, 10, 2018/05/20, 11:46:58.499, 49369169645502680, testhost.dll, TestRunCache: No test found corresponding to testResult 'TestDefaultLockingModel Passed' in inProgress list.
TpTrace Warning: 0 : 528, 10, 2018/05/20, 11:46:58.504, 49369169650485916, testhost.dll, InProgressTests is null
TpTrace Warning: 0 : 528, 10, 2018/05/20, 11:46:58.504, 49369169650617259, testhost.dll, TestRunCache: No test found corresponding to testResult 'TestExclusiveLockFails Passed' in inProgress list.
TpTrace Warning: 0 : 528, 10, 2018/05/20, 11:46:58.507, 49369169653633549, testhost.dll, InProgressTests is null
TpTrace Warning: 0 : 528, 10, 2018/05/20, 11:46:58.507, 49369169653765241, testhost.dll, TestRunCache: No test found corresponding to testResult 'TestExclusiveLockLocks Passed' in inProgress list.
TpTrace Information: 0 : 528, 10, 2018/05/20, 11:46:58.511, 49369169657847205, testhost.dll, Sending test run statistics
[...]
TpTrace Warning: 0 : 515, 8, 2018/05/20, 11:46:59.127, 49369170274245452, vstest.console.dll, ProxyOperationManager: Timed out waiting for test host to exit. Will terminate process.
TpTrace Verbose: 0 : 515, 8, 2018/05/20, 11:46:59.134, 49369170281160451, vstest.console.dll, ParallelProxyExecutionManager: HandlePartialRunComplete: Total completed clients = 1, Run complete = True, Run canceled: False.
TpTrace Verbose: 0 : 515, 8, 2018/05/20, 11:46:59.228, 49369170374419855, vstest.console.dll, TestRunRequest:TestRunComplete: Starting. IsAborted:True IsCanceled:False.
TpTrace Verbose: 0 : 515, 8, 2018/05/20, 11:46:59.228, 49369170374872917, vstest.console.dll, TestLoggerManager.HandleTestRunComplete: Ignoring as the object is disposed.
TpTrace Information: 0 : 515, 8, 2018/05/20, 11:46:59.228, 49369170375126414, vstest.console.dll, TestRunRequest:TestRunComplete: Completed.
TpTrace Verbose: 0 : 515, 8, 2018/05/20, 11:46:59.228, 49369170375266139, vstest.console.dll, TcpClientExtensions.MessageLoopAsync: exiting MessageLoopAsync remoteEndPoint: 127.0.0.1:43526 localEndPoint: 127.0.0.1:58259
TpTrace Verbose: 0 : 515, 1, 2018/05/20, 11:46:59.228, 49369170375283146, vstest.console.dll, TestRunRequest.Dispose: Starting.
TpTrace Information: 0 : 515, 1, 2018/05/20, 11:46:59.229, 49369170375621213, vstest.console.dll, TestRunRequest.Dispose: Completed.
TpTrace Information: 0 : 515, 1, 2018/05/20, 11:46:59.229, 49369170375765305, vstest.console.dll, TestRequestManager.RunTests: run tests completed.
TpTrace Information: 0 : 515, 1, 2018/05/20, 11:46:59.229, 49369170375968109, vstest.console.dll, RunTestsArgumentProcessor:Execute: Test run is completed.
TpTrace Verbose: 0 : 515, 1, 2018/05/20, 11:46:59.229, 49369170376159565, vstest.console.dll, Executor.Execute: Exiting with exit code of 1

```

One suspicious thing here is the mention of Mono.Cecil, but that may
also be fine because the implementation of the netstandard-1.3 api may
be provided by Mono in the docker image microsoft/dotnet:2-sdk. The next
suspicious thing is that some communication within the testing process
times out and aborts the test run. As previously mentioned, with the
docker container microsoft/dotnet:1-sdk the tests were run but the test
process got stuck and hanged there until the build timed out (after four
hours). With microsoft/dotnet:2-sdk this behavior changes. Now the test
process detects (possibly the same) problem and times out because the
test host does not terminate. This means to me that the tests are
resolved and run and a similar issue as with microsoft/dotnet:1-sdk
still persists. At first sight, vstest.console.dll (test hosting
process?) does its job while testhost.dll (nunit?) does not terminate
after test completion. Interestingly, exactly the same tests are run
against several other frameworks on multiple operating systems. This
gives me a hint that this is not an issue of the test implementations
but rather something of the framework that runs the tests. As far as I
can tell, nunit is at the latest available version on nuget. Are you
aware of any nunit issue that relates to these symptoms?

Cheers,
Dominik

[1] https://github.com/nunit/nunit/issues/2817
[2] https://github.com/Microsoft/vstest/issues/627
>> <javascript:>>:
Reply all
Reply to author
Forward
0 new messages