[JIRA] (JENKINS-50219) Jenkins agent windows service fails to restart with an unhandled COMException in WinSw's log

3 views
Skip to first unread message

tom.manning@third.com (JIRA)

unread,
Mar 16, 2018, 9:49:02 AM3/16/18
to jenkinsc...@googlegroups.com
Tom Manning created an issue
 
Jenkins / Bug JENKINS-50219
Jenkins agent windows service fails to restart with an unhandled COMException in WinSw's log
Issue Type: Bug Bug
Assignee: Oleg Nenashev
Components: windows-slave-installer-module
Created: 2018-03-16 13:48
Environment: Jenkins 2.9.84
windows-slave-installer 1.9.2
jenkins-slave.exe running on Windows 10 + jre 1.8.0.161 under local admin credentials.
Priority: Minor Minor
Reporter: Tom Manning

We have noticed that our Windows 10 nodes have been intermittently failing to come back online after Jenkins / master restarts.

 

The following entries are shown in WinSw's Jenkins-slave.wrapper.log file at the time of the failures. In particular it appears that the unhandled COMException is preventing the service restarting successfully.

 

2018-03-16 12:26:38,011 DEBUG - Starting ServiceWrapper in the CLI mode
2018-03-16 12:26:38,589 INFO  - Restarting the service with id 'jenkinsslave-c__jenkins'
2018-03-16 12:26:38,682 DEBUG - Completed. Exit code is 0
2018-03-16 12:26:38,870 DEBUG - Starting ServiceWrapper in the CLI mode
2018-03-16 12:26:39,151 INFO  - Restarting the service with id 'jenkinsslave-c__jenkins'
2018-03-16 12:26:39,276 INFO  - Stopping jenkinsslave-c__jenkins
2018-03-16 12:26:39,292 DEBUG - ProcessKill 332
2018-03-16 12:26:39,386 INFO  - Found child process: 420 Name: conhost.exe
2018-03-16 12:26:39,433 INFO  - Stopping process 420
2018-03-16 12:26:39,448 INFO  - Send SIGINT 420
2018-03-16 12:26:39,464 WARN  - SIGINT to 420 failed - Killing as fallback
2018-03-16 12:26:39,464 INFO  - Stopping process 332
2018-03-16 12:26:39,479 INFO  - Send SIGINT 332
2018-03-16 12:26:39,495 WARN  - SIGINT to 332 failed - Killing as fallback
2018-03-16 12:26:39,511 INFO  - Finished jenkinsslave-c__jenkins
2018-03-16 12:26:39,511 DEBUG - Completed. Exit code is 0
2018-03-16 12:26:40,342 FATAL - Unhandled exception
System.Runtime.InteropServices.COMException (0x80040150): Could not read key from registry (Exception from HRESULT: 0x80040150 (REGDB_E_READREGDB))
   at System.Runtime.InteropServices.Marshal.ThrowExceptionForHRInternal(Int32 errorCode, IntPtr errorInfo)
   at System.Management.ManagementObjectSearcher.Get()
   at WMI.WmiRoot.ClassHandler.Invoke(Object proxy, MethodInfo method, Object[] args)
   at winsw.WrapperService.Run(String[] _args, ServiceDescriptor descriptor)
   at winsw.WrapperService.Main(String[] args)

After some inspection of WinSw's source code we have determined that the exception is being thrown from the line "s = svc.Select(d.Id);" in the following snippet (found in the "Run" method in "src/Core/ServiceWrapper/Main.cs lines 687-704). When a secondary WinSw process is run with the "restart!" parameter by Jenkins.

 
if (args[0] == "restart")
{
    Log.Info("Restarting the service with id '" + d.Id + "'");
    if (s == null)
        ThrowNoSuchService();
    if(s.Started)
        s.StopService();
    while (s.Started)
    {
        Thread.Sleep(1000);
        s = svc.Select(d.Id);
    }
    s.StartService();
    return;
}
We are currently able to work around the problem by restarting the service manually on affected nodes.

 

Add Comment Add Comment
 
This message was sent by Atlassian JIRA (v7.3.0#73011-sha1:3c73d0e)
Atlassian logo

tom.manning@third.com (JIRA)

unread,
Mar 16, 2018, 10:50:02 AM3/16/18
to jenkinsc...@googlegroups.com
Tom Manning updated an issue
Change By: Tom Manning
We have noticed that our Windows 10 nodes have been intermittently failing to come back online after Jenkins / master restarts.

 

The following entries are shown in WinSw's Jenkins-slave.wrapper.log file at the time of the failures. In particular it appears that the unhandled COMException is preventing the service restarting successfully.

 
{code:java}

2018-03-16 12:26:38,011 DEBUG - Starting ServiceWrapper in the CLI mode
2018-03-16 12:26:38,589 INFO  - Restarting the service with id 'jenkinsslave-c__jenkins'
2018-03-16 12:26:38,682 DEBUG - Completed. Exit code is 0
2018-03-16 12:26:38,870 DEBUG - Starting ServiceWrapper in the CLI mode
2018-03-16 12:26:39,151 INFO  - Restarting the service with id 'jenkinsslave-c__jenkins'
2018-03-16 12:26:39,276 INFO  - Stopping jenkinsslave-c__jenkins
2018-03-16 12:26:39,292 DEBUG - ProcessKill 332
2018-03-16 12:26:39,386 INFO  - Found child process: 420 Name: conhost.exe
2018-03-16 12:26:39,433 INFO  - Stopping process 420
2018-03-16 12:26:39,448 INFO  - Send SIGINT 420
2018-03-16 12:26:39,464 WARN  - SIGINT to 420 failed - Killing as fallback
2018-03-16 12:26:39,464 INFO  - Stopping process 332
2018-03-16 12:26:39,479 INFO  - Send SIGINT 332
2018-03-16 12:26:39,495 WARN  - SIGINT to 332 failed - Killing as fallback
2018-03-16 12:26:39,511 INFO  - Finished jenkinsslave-c__jenkins
2018-03-16 12:26:39,511 DEBUG - Completed. Exit code is 0
2018-03-16 12:26:40,342 FATAL - Unhandled exception
System.Runtime.InteropServices.COMException (0x80040150): Could not read key from registry (Exception from HRESULT: 0x80040150 (REGDB_E_READREGDB))
   at System.Runtime.InteropServices.Marshal.ThrowExceptionForHRInternal(Int32 errorCode, IntPtr errorInfo)
   at System.Management.ManagementObjectSearcher.Get()
   at WMI.WmiRoot.ClassHandler.Invoke(Object proxy, MethodInfo method, Object[] args)
   at winsw.WrapperService.Run(String[] _args, ServiceDescriptor descriptor)
   at winsw.WrapperService.Main(String[] args)
{code}
After some inspection of WinSw's source code we have determined that the exception is being thrown from the line "s = svc.Select(d.Id);" in the following snippet (found in the "Run" method in "src/Core/ServiceWrapper/Main.cs {color:#333333}lines 687-704){color}. When a secondary WinSw process is run with the "restart!" parameter by Jenkins.

 
{code:java}
if (args[0] == "restart")
\ {
    Log.Info("Restarting the service with id '" + d.Id + "'");
    if (s == null)
        ThrowNoSuchService();
    if(s.Started)
        s.StopService();
    while (s.Started)
   
\
{

        Thread.Sleep(1000);
        s = svc.Select(d.Id);
    }
    s.StartService();
    return;
} {code}
We are currently able to work around the problem by restarting the service manually on affected nodes.

 

o.v.nenashev@gmail.com (JIRA)

unread,
Mar 16, 2018, 10:53:02 AM3/16/18
to jenkinsc...@googlegroups.com

tom.manning@third.com (JIRA)

unread,
May 24, 2018, 6:48:01 AM5/24/18
to jenkinsc...@googlegroups.com

After some further investigation I believe I have found a more permanent work around for the problem. After observing that the problem occurs when the master is restarted after logging out from an interactive session on the "Log On As" account that runs the Jenkins agent service.

The fix is to enable the group policy setting "Local Computer Policy->Administrative Templates->System->User Profiles->Do not forcefully unload the users registry at user logoff".
 
The following articles describe the cause:
https://support.microsoft.com/en-gb/help/2287297/a-com-application-may-stop-working-on-windows-server-2008-when-a-user
https://support.microsoft.com/en-gb/help/3114011/800703fa-illegal-operation-attempted-on-a-registry-key-that-has-been-m

mario.klebsch@ime-actia.de (JIRA)

unread,
Dec 5, 2018, 8:11:01 AM12/5/18
to jenkinsc...@googlegroups.com

I also see this error regularly after server reboots. But I am sure, that there were no interactive sessions on the "Log On As" account of the client failing to restart.

I change the policy setting anyway, I will see, whether that helps or not.

 

But I wonder, whether the restart of the jenkins client really is needed. In jenkins-slave.err.log I see the following entries:

 

Dez 02, 2018 4:56:56 AM hudson.remoting.jnlp.Main$CuiListener status
INFORMATION: Terminated
Dez 02, 2018 4:57:11 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
INFORMATION: Failed to connect to the master. Will try again: java.net.SocketTimeoutException connect timed out
...
Dez 02, 2018 4:58:22 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
INFORMATION: Failed to connect to the master. Will try again: java.net.ConnectException Connection refused: connect
...
Dez 02, 2018 5:03:46 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
INFORMATION: Failed to connect to the master. Will try again: java.net.SocketTimeoutException Read timed out
...
Dez 02, 2018 5:04:56 AM org.jenkinsci.remoting.engine.JnlpAgentEndpointResolver waitForReady
INFORMATION: Master isn't ready to talk to us on http://jenkins01:8080/tcpSlaveAgentListener/. Will try again: response code=503
...
Dez 02, 2018 5:09:20 AM jenkins.slaves.restarter.JnlpSlaveRestarterInstaller$FindEffectiveRestarters$1 onReconnect
INFORMATION: Restarting agent via jenkins.slaves.restarter.WinswSlaveRestarter@63a1d34b

 

None of the errors listed should (IMHO) be a reason to restart the client.

The log does not tell, why the clients decide to restart itself. It may be the duration of the downtime, but the time difference between 5:09:20 and 4:56:56 is 744 seconds. I have 65 "waitForReady" messages and 26 "Will try again: response code=503" messages.

At least the "response code=503" error message prevent the jenkins client reboot, especially since it is a known jenkins problem, that jeninks reboots take ages.

 

Mario

 

 

This message was sent by Atlassian Jira (v7.11.2#711002-sha1:fdc329d)

o.v.nenashev@gmail.com (JIRA)

unread,
Jan 2, 2019, 4:23:02 AM1/2/19
to jenkinsc...@googlegroups.com

o.v.nenashev@gmail.com (JIRA)

unread,
Jan 2, 2019, 4:23:02 AM1/2/19
to jenkinsc...@googlegroups.com

o.v.nenashev@gmail.com (JIRA)

unread,
Jan 2, 2019, 4:24:01 AM1/2/19
to jenkinsc...@googlegroups.com
Oleg Nenashev commented on Bug JENKINS-50219
 
Re: Jenkins agent windows service fails to restart with an unhandled COMException in WinSw's log

I might be able to work on it at some point, but I will unassign it for now so that anybody else can pick it up and work on it

aloria@haworthmedia.com (JIRA)

unread,
Feb 3, 2020, 3:36:04 PM2/3/20
to jenkinsc...@googlegroups.com

Tom Manning added a comment - 2018-05-24 05:47

After some further investigation I believe I have found a more permanent work around for the problem. After observing that the problem occurs when the master is restarted after logging out from an interactive session on the "Log On As" account that runs the Jenkins agent service.

The fix is to enable the group policy setting "Local Computer Policy->Administrative Templates->System->User Profiles->Do not forcefully unload the users registry at user logoff".

FYI this resolved the issue for us on a Windows Server 2019 VM.

This message was sent by Atlassian Jira (v7.13.6#713006-sha1:cc4451f)
Atlassian logo
Reply all
Reply to author
Forward
0 new messages