We are currently using GoCD 17.8.0 and occasionally we see our agents getting stuck in Cancelling state, I managed to capture the logs from the latest instance and I believe applying some configuration to tinker with connection pool settings should mitigate or resolve the issue (this is based on
https://stackoverflow.com/questions/10558791/apache-httpclient-interim-error-nohttpresponseexception and the source documentation from http client).
As an aside, we've had another similar issue overwhelming our repository server which causes Jobs to fail intermittently (This has spawned a separate discussion internally). We can separate these problems into two broad categories: one is recoverable and rerunning the Job is sufficient, the other (this case) the agent does not recover and one must restart the agent service.
My question is, has anyone else encountered this issue where agents become stuck in Cancelling state? How did you overcome it?
Could someone enlighten me about where in the GoCD configuration I can look to follow up on connection pool settings or injecting a retry handler?
Agent logs from the web interface:
11:51:50.107 [go] Job Started: 2017-10-25 11:51:50 PDT
11:51:50.107
11:51:50.107 [go] Start to prepare pipeline-name/8/stage-name/1/job-name on machine-name [/godata/agents/0]
11:51:50.111 [go] Start to update materials.
11:51:50.111
11:51:50.111 [go] Start updating files at revision commit-sha from https://repo-url.git
11:51:50.118 Initialized empty Git repository in /godata/agents/0/pipelines/pipeline-name/.git/
11:52:43.904 STDERR: error: RPC failed; result=22, HTTP code = 500
Agent logs on the machine:
2017-10-25
12:20:23,675 [pingThread] ERROR
thoughtworks.go.agent.AgentHTTPClientController:84 - Error occurred when
agent tried to ping server:
org.springframework.remoting.RemoteAccessException:
Could not access HTTP invoker remote service at
[https://go-server-name:8154/go/remoting/remoteBuildRepository]; nested exception
is org.apache.http.NoHttpResponseException: go-server-name:8154 failed to
respond
at
org.springframework.remoting.httpinvoker.HttpInvokerClientInterceptor.convertHttpInvokerAccessException(HttpInvokerClientInterceptor.java:212)
at org.springframework.remoting.httpinvoker.HttpInvokerClientInterceptor.invoke(HttpInvokerClientInterceptor.java:145)
at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:202)
at com.sun.proxy.$Proxy6.ping(Unknown Source)
at com.thoughtworks.go.agent.AgentHTTPClientController.ping(AgentHTTPClientController.java:80)
at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.springframework.util.MethodInvoker.invoke(MethodInvoker.java:273)
at org.springframework.scheduling.support.MethodInvokingRunnable.run(MethodInvokingRunnable.java:65)
at org.springframework.scheduling.timer.DelegatingTimerTask.run(DelegatingTimerTask.java:70)
at java.util.TimerThread.mainLoop(Timer.java:555)
at java.util.TimerThread.run(Timer.java:505)
Caused by: org.apache.http.NoHttpResponseException: go-server-name:8154 failed to respond
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143)
at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:165)
at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:167)
at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271)
at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at com.thoughtworks.go.agent.common.ssl.GoAgentServerHttpClient.execute(GoAgentServerHttpClient.java:53)
at
com.thoughtworks.go.agent.GoHttpClientHttpInvokerRequestExecutor.doExecuteRequest(GoHttpClientHttpInvokerRequestExecutor.java:63)
at
org.springframework.remoting.httpinvoker.AbstractHttpInvokerRequestExecutor.executeRequest(AbstractHttpInvokerRequestExecutor.java:136)
at
org.springframework.remoting.httpinvoker.HttpInvokerClientInterceptor.executeRequest(HttpInvokerClientInterceptor.java:192)
at
org.springframework.remoting.httpinvoker.HttpInvokerClientInterceptor.executeRequest(HttpInvokerClientInterceptor.java:174)
at org.springframework.remoting.httpinvoker.HttpInvokerClientInterceptor.invoke(HttpInvokerClientInterceptor.java:142)
... 12 more
... This runs in a loop, until I cancelled the service ...
2017-10-30
06:52:56,565 [Thread-6 ] INFO
springframework.context.support.ClassPathXmlApplicationContext:1032 -
Closing
org.springframework.context.support.ClassPathXmlApplicationContext@3d646c37:
startup date [Tue Oct 24 20:13:27 PDT 2017]; root of context hierarchy
2017-10-30
06:52:56,653 [Thread-6 ] INFO
beans.factory.support.DefaultListableBeanFactory:434 - Destroying
singletons in
org.springframework.beans.factory.support.DefaultListableBeanFactory@497470ed:
defining beans
[org.springframework.context.annotation.internalConfigurationAnnotationProcessor,org.springframework.context.annotation.internalAutowiredAnnotationProcessor,org.springframework.context.annotation.internalRequiredAnnotationProcessor,org.springframework.context.annotation.internalCommonAnnotationProcessor,defaultPluginJarChangeListener,defaultPluginJarLocationMonitor,defaultPluginRegistry,goPluginDescriptorBuilder,pluginWriter,pluginValidator,goFileSystem,defaultPluginManager,felixGoPluginOSGiFramework,defaultGoPluginOSGiManifestCreator,pluginRequestProcessorRegistry,pluginsZip,zipUtil,systemEnvironment,propertyConfigurer,buildLoopServer,urlService,httpService,artifactsManipulator,sslInfrastructureService,agentUpgradeService,agentRegistry,subprocessLogger,agentPluginsInitializer,packageRepositoryExtension,scmExtension,taskExtension,httpClient,httpClientBuilder,webSocketClientBuilder,webSocketClientHandler,webSocketSessionHandler,agentControllerFactory,agentController,agentControllerLooper,agentControllerPinger,agentControllerInstructionExecutor,loopThread,pingThread,instructionExecuteThread,org.springframework.context.annotation.ConfigurationClassPostProcessor.importAwareProcessor];
root of factory hierarchy
2017-10-30 06:52:56,667 [Thread-6 ] INFO springframework.scheduling.timer.TimerFactoryBean:165 - Cancelling Timer
2017-10-30 06:52:56,681 [Thread-6 ] INFO springframework.scheduling.timer.TimerFactoryBean:165 - Cancelling Timer
2017-10-30 06:52:56,682 [Thread-6 ] INFO springframework.scheduling.timer.TimerFactoryBean:165 - Cancelling Timer
2017-10-30
06:52:56,775 [Thread-1 ] WARN thoughtworks.go.util.SubprocessLogger:65
- Following processes were alive at shutdown:
PID: 23870
name: git owner: agent0 command: git clone
--branch=branch-name --depth=3
https://repo-url.git
/godata/agents/0/pipelines/pipeline-nameThere were no unusual logs on the Go server or uncharacteristic load during this time. To summarize the Go server's logs, we occasionally redeploy the Go configuration and there was a syntax error which caused it to roll back to the latest good config. There was also the expected logs of an estimated 100-200 Jobs which were scheduled at the time, all of which were successful other than this one.
Quick links of the code involved in the above stacktrace: