Intermittent communication failure between Agents and Controller

424 views
Skip to first unread message

Carl D

unread,
Oct 30, 2017, 1:57:04 PM10/30/17
to go-cd

We are currently using GoCD 17.8.0 and occasionally we see our agents getting stuck in Cancelling state, I managed to capture the logs from the latest instance and I believe applying some configuration to tinker with connection pool settings should mitigate or resolve the issue (this is based on https://stackoverflow.com/questions/10558791/apache-httpclient-interim-error-nohttpresponseexception and the source documentation from http client).

As an aside, we've had another similar issue overwhelming our repository server which causes Jobs to fail intermittently (This has spawned a separate discussion internally). We can separate these problems into two broad categories: one is recoverable and rerunning the Job is sufficient, the other (this case) the agent does not recover and one must restart the agent service.

My question is, has anyone else encountered this issue where agents become stuck in Cancelling state? How did you overcome it?

Could someone enlighten me about where in the GoCD configuration I can look to follow up on connection pool settings or injecting a retry handler?

Agent logs from the web interface:
11:51:50.107 [go] Job Started: 2017-10-25 11:51:50 PDT
11:51:50.107 
11:51:50.107 [go] Start to prepare pipeline-name/8/stage-name/1/job-name on machine-name [/godata/agents/0]
11:51:50.111 [go] Start to update materials.
11:51:50.111 
11:51:50.111 [go] Start updating files at revision commit-sha from https://repo-url.git
11:51:50.118 Initialized empty Git repository in /godata/agents/0/pipelines/pipeline-name/.git/
11:52:43.904 STDERR: error: RPC failed; result=22, HTTP code = 500


Agent logs on the machine:

2017-10-25 12:20:23,675 [pingThread] ERROR thoughtworks.go.agent.AgentHTTPClientController:84 - Error occurred when agent tried to ping server:
org.springframework.remoting.RemoteAccessException: Could not access HTTP invoker remote service at [https://
go-server-name:8154/go/remoting/remoteBuildRepository]; nested exception is org.apache.http.NoHttpResponseException: go-server-name:8154 failed to respond
    at org.springframework.remoting.httpinvoker.HttpInvokerClientInterceptor.convertHttpInvokerAccessException(HttpInvokerClientInterceptor.java:212)
    at org.springframework.remoting.httpinvoker.HttpInvokerClientInterceptor.invoke(HttpInvokerClientInterceptor.java:145)
    at org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:172)
    at org.springframework.aop.framework.JdkDynamicAopProxy.invoke(JdkDynamicAopProxy.java:202)
    at com.sun.proxy.$Proxy6.ping(Unknown Source)
    at com.thoughtworks.go.agent.AgentHTTPClientController.ping(AgentHTTPClientController.java:80)
    at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.springframework.util.MethodInvoker.invoke(MethodInvoker.java:273)
    at org.springframework.scheduling.support.MethodInvokingRunnable.run(MethodInvokingRunnable.java:65)
    at org.springframework.scheduling.timer.DelegatingTimerTask.run(DelegatingTimerTask.java:70)
    at java.util.TimerThread.mainLoop(Timer.java:555)
    at java.util.TimerThread.run(Timer.java:505)
Caused by: org.apache.http.NoHttpResponseException:
go-server-name:8154 failed to respond
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:143)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:57)
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
    at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:165)
    at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:167)
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:271)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:184)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:88)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:184)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
    at com.thoughtworks.go.agent.common.ssl.GoAgentServerHttpClient.execute(GoAgentServerHttpClient.java:53)
    at com.thoughtworks.go.agent.GoHttpClientHttpInvokerRequestExecutor.doExecuteRequest(GoHttpClientHttpInvokerRequestExecutor.java:63)
    at org.springframework.remoting.httpinvoker.AbstractHttpInvokerRequestExecutor.executeRequest(AbstractHttpInvokerRequestExecutor.java:136)
    at org.springframework.remoting.httpinvoker.HttpInvokerClientInterceptor.executeRequest(HttpInvokerClientInterceptor.java:192)
    at org.springframework.remoting.httpinvoker.HttpInvokerClientInterceptor.executeRequest(HttpInvokerClientInterceptor.java:174)
    at org.springframework.remoting.httpinvoker.HttpInvokerClientInterceptor.invoke(HttpInvokerClientInterceptor.java:142)
    ... 12 more

... This runs in a loop, until I cancelled the service ...

2017-10-30 06:52:56,565 [Thread-6 ] INFO  springframework.context.support.ClassPathXmlApplicationContext:1032 - Closing org.springframework.context.support.ClassPathXmlApplicationContext@3d646c37: startup date [Tue Oct 24 20:13:27 PDT 2017]; root of context hierarchy
2017-10-30 06:52:56,653 [Thread-6 ] INFO  beans.factory.support.DefaultListableBeanFactory:434 - Destroying singletons in org.springframework.beans.factory.support.DefaultListableBeanFactory@497470ed: defining beans [org.springframework.context.annotation.internalConfigurationAnnotationProcessor,org.springframework.context.annotation.internalAutowiredAnnotationProcessor,org.springframework.context.annotation.internalRequiredAnnotationProcessor,org.springframework.context.annotation.internalCommonAnnotationProcessor,defaultPluginJarChangeListener,defaultPluginJarLocationMonitor,defaultPluginRegistry,goPluginDescriptorBuilder,pluginWriter,pluginValidator,goFileSystem,defaultPluginManager,felixGoPluginOSGiFramework,defaultGoPluginOSGiManifestCreator,pluginRequestProcessorRegistry,pluginsZip,zipUtil,systemEnvironment,propertyConfigurer,buildLoopServer,urlService,httpService,artifactsManipulator,sslInfrastructureService,agentUpgradeService,agentRegistry,subprocessLogger,agentPluginsInitializer,packageRepositoryExtension,scmExtension,taskExtension,httpClient,httpClientBuilder,webSocketClientBuilder,webSocketClientHandler,webSocketSessionHandler,agentControllerFactory,agentController,agentControllerLooper,agentControllerPinger,agentControllerInstructionExecutor,loopThread,pingThread,instructionExecuteThread,org.springframework.context.annotation.ConfigurationClassPostProcessor.importAwareProcessor]; root of factory hierarchy
2017-10-30 06:52:56,667 [Thread-6 ] INFO  springframework.scheduling.timer.TimerFactoryBean:165 - Cancelling Timer
2017-10-30 06:52:56,681 [Thread-6 ] INFO  springframework.scheduling.timer.TimerFactoryBean:165 - Cancelling Timer
2017-10-30 06:52:56,682 [Thread-6 ] INFO  springframework.scheduling.timer.TimerFactoryBean:165 - Cancelling Timer
2017-10-30 06:52:56,775 [Thread-1 ] WARN  thoughtworks.go.util.SubprocessLogger:65 - Following processes were alive at shutdown:
    PID: 23870    name: git    owner: agent0    command: git clone --branch=branch-name --depth=3 https://repo-url.git /godata/agents/0/pipelines/pipeline-name


There were no unusual logs on the Go server or uncharacteristic load during this time. To summarize the Go server's logs, we occasionally redeploy the Go configuration and there was a syntax error which caused it to roll back to the latest good config. There was also the expected logs of an estimated 100-200 Jobs which were scheduled at the time, all of which were successful other than this one.

Quick links of the code involved in the above stacktrace:

Aravind SV

unread,
Nov 6, 2017, 11:34:26 AM11/6/17
to go...@googlegroups.com
Hello Carl,

GoAgentServerHttpClient and GoAgentServerHttpClientBuilder might interest you. I find it strange that the server is not busy, but says NoHttpResponseException. I would have assumed it's somehow so busy that it cannot service this request.

Of course, since you say it never recovers, you might be right that it's a problem with the connection somehow. If you find out anything, by adding a retry, let us know.

Cheers,
Aravind

Carl D

unread,
Nov 14, 2017, 6:40:06 PM11/14/17
to go-cd
Hi!

So I've resolved this issue, and I'll update future readers about what happened and what they can look into.

When I posted this originally I didn't realize the number of factors working against me that ultimately resulted in our GoCD server being overwhelmed.

Here is what I experienced, with the fixes I applied:
  • Jobs intermittently failing and agents stuck in cancelling state, this error exclusively appeared during material updates in the go agents.
    • This was correlated to a set of fan out pipelines whose jobs effectively DDOSd our Git Stash server by cloning/updating a problematic repo (problematic in its contents, not GoCD related). This was fixed by applying 2 immediate changes:
      1. Using shallowClone.
      2. Reducing the raw number of clones by putting that in an initial stage and passing a trimmed down set of files to subsequent stages (most notably omitting the .git directory) this also requires following stages to not check out materials and rely on fetch artifacts instead: https://docs.gocd.org/17.8.0/configuration/configuration_reference.html#stage and https://docs.gocd.org/17.8.0/configuration/configuration_reference.html#git
      3. Finally, file a future to use git large file store or change the storage mechanism for the problematic files.


  • Jobs intermittently failing and agents recover gracefully. Ultimately the NoHttpResponseException was sort of a canary in this case and its frequency increased steadily finally resulting in an OOM exception on the GoCD server. It appears that this problem had several contributing factors:
    • Increased Go server statistic monitoring. The most notable offender was a monitor for job runtime averages.
      • This was addressed by decreasing both the frequency and shrinking the time window of this statistic.
    • Increased total Jobs executed/required due to more concurrent product development. This resulted in more fanning out in our dependency graph.
      • The solution is the same as the following point, however a word of warning would be that one should closely monitor the GoCD server CPU and memory usage!
    • Increased Go agent power, while giving us 30~% wins in end to end time there was enough additional load from agents polling to finally push our GoCD server over the edge.
      • The brute force solution was to give our Go server more resources, the presence of the following log messages forced my hand in requesting this (I had not seen these log messages initially).
2017-11-14 09:22:45,306  WARN [Scheduler-1678207403] LowResourceMonitor:292 - Low Resources: Low on threads: qtp1793329556{STARTED,20<=300<=300,i=0,q=86}, Low on threads: qtp1793329556{STARTED,20<=300<=300,i=0,q=86}
...
2017-11-14 10:05:51,556  WARN [qtp1793329556-11768] QueuedThreadPool:617 - Unexpected thread death: org.eclipse.jetty.util.thread.QueuedThreadPool$3@24550918 in qtp1793329556{STARTED,20<=300<=300,i=0,q=469}


Cheers,
Reply all
Reply to author
Forward
0 new messages