Agent showing Last Connection

35 views
Skip to first unread message

anil kumar

unread,
Feb 17, 2023, 9:23:30 AM2/17/23
to go-cd
Hello,

GoCD server console showing Lost Contact for some of the agents, even though agents are running without any issue on target nodes.
 
I checked GOCD server logs and Go-agent logs and did not find any clue.
If i restart the go-agent again on target server, it is showing status Idle in GoCD Serve console and after some time (some times after 30min, some times after 2 hours, no fixed time frames) it showing Last Contact.

Could you please help me, if there is any known issue or something needs to be checked ?

Thanks and regards
Anil  

Chad Wilson

unread,
Feb 18, 2023, 3:32:52 AM2/18/23
to go...@googlegroups.com
What type of agents are you using (docker images? linux? windows?)? On which GoCD version?

There should be something in the agent logs (especially go-agent.log), as the agent continually pings the server and should complain if it can't talk to the server, which would then report some kind of error message which might give you the reason for the connectivity failure. You should be able to see logs on the agent some time before the 'lost contact" which might explain what the reason is.

If you have some networking components (e.g a load balancer) or a TLS-terminating reverse proxy/WAF etc in front of your GoCD server you may also want to see what is logged there for a given agent in the 5m - 5m:30s  before the server thinks it lost contact with the agent.

-Chad

--
You received this message because you are subscribed to the Google Groups "go-cd" group.
To unsubscribe from this group and stop receiving emails from it, send an email to go-cd+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/go-cd/7a04d1ba-9ca1-4427-aa6d-5bebc9aefbe7n%40googlegroups.com.

anil kumar

unread,
Mar 23, 2023, 9:27:52 AM3/23/23
to go...@googlegroups.com
Hi,

Actual process shows running state.

image.png


But the API call is getting Bad!, instead of OK.
Below is the error we can see in go-agent logs.

XXXXXXXXXXX:/var/log/go-agent # curl http://localhost:8152/health/v1/isConnectedToServer
Bad!
XXXXXXXXXX:/var/log/go-agent #


image.png

Temporaray workaround, we implemented a cron job based on curl output to restart the go-agent.
Could you please let me know, if anything needs to be adjusted in the go-agent configuration.


Thanks







Chad Wilson

unread,
Mar 23, 2023, 10:13:25 AM3/23/23
to go...@googlegroups.com
That particular error message "connection pool shutdown" only appears temporarily during shutdown of the agent. I can see in the top two lines of logs that the agent is shutting down for some reason - maybe your automated restart(?) - but those exceptions are not the root cause of any issue and can't help us.

To get any real clues, you probably need to look higher in the logs to find the first occurrence of connection related exceptions - as I said in my original message "for a given agent in the 5m - 5m:30s  before the server thinks it lost contact with the agent."

-Chad

anil kumar

unread,
Mar 23, 2023, 10:43:00 AM3/23/23
to go...@googlegroups.com
Hi,

What type of agents are you using (docker images? linux? windows?)? On which GoCD version?
Linux  and  GoCD Version: 22.1.0 
Agent is version: 22.1.0-13913


Apart from the error which provided in screen shot, we did not see any other clue in the logs.

Below is the script, which will check the API call using curl, if it shows not OK agetn will be restarted.
#!/bin/bash

# Set the URL to check and the expected output
URL="http://localhost:8152/health/v1/isConnectedToServer"
EXPECTED_OUTPUT="OK"

# Make the curl request and store the output
OUTPUT=$(curl -s $URL)


# Check if the output contains the expected text
if [[ "$OUTPUT" != *"$EXPECTED_OUTPUT"* ]]; then
    echo "Error: $URL is not responding correctly"
    # Restart the service
    echo "restarting go-agent at at $(date)"
    systemctl restart go-agent.service
fi

Please let me know, if you need any specific logs from go-agent. 

Thanks



Chad Wilson

unread,
Mar 23, 2023, 11:11:39 AM3/23/23
to go...@googlegroups.com
Are there literally no other errors/exceptions/stack traces in the agent log from time of start-up to that error during shutdown?

The agent health status API you are using is based on the results of server pings. If enough consecutive server pings fail or don't happen, it logs big obvious errors like "Error occurred when agent tried to ping server" which you should easily be able to see/find. If somehow the pings are not even running because the agent has somehow lost its registration with the server, there also should be something logged.

I'd suggest you supply a full log, but it can be difficult to redact private information. Alternatively you could try and enable trace logging for an agent (`com.thoughtworks.go` or `com.thoughtworks.go.agent` logger name) to see exactly when it is sending pings, but you'd still need to put more effort into creating a timeline of what you see happening from the server's perspective and correlate to what you see in the agent logs.

-Chad

anil kumar

unread,
Apr 3, 2023, 5:04:53 AM4/3/23
to go...@googlegroups.com
Hi,

I can see the error you mentioned  in your email.  "Error occurred when agent tried to ping server".
4:34 PM  restart was taken place and again 6:35 PM restart go-agent was taken place.
Below are the logs, which logged between that time. 
+++++++++++++++++++++++++++++++++++++++
2023-03-28 16:34:09,390 INFO  [scheduler-1] AgentHTTPClientController:139 - About to get cookie from the server.
2023-03-28 16:34:09,418 INFO  [scheduler-1] AgentHTTPClientController:142 - Got cookie: fae0d356-b4e7-4d6e-afe4-d4e2c27192e1
2023-03-28 17:04:11,142 ERROR [scheduler-2] AgentController:92 - [Agent Loop] Error occurred during loop:
java.lang.NullPointerException: Cannot invoke "org.apache.http.Header.getValue()" because "md5Header" is null
        at com.thoughtworks.go.agent.service.AgentUpgradeService.validateMd5(AgentUpgradeService.java:111)
        at com.thoughtworks.go.agent.service.AgentUpgradeService.checkForUpgradeAndExtraProperties(AgentUpgradeService.java:84)
        at com.thoughtworks.go.agent.service.AgentUpgradeService.checkForUpgradeAndExtraProperties(AgentUpgradeService.java:72)
        at com.thoughtworks.go.agent.AgentController.loop(AgentController.java:84)
        at jdk.internal.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.base/java.lang.reflect.Method.invoke(Unknown Source)
        at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:65)
        at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.base/java.util.concurrent.FutureTask.runAndReset(Unknown Source)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
2023-03-28 17:34:12,637 ERROR [scheduler-1] AgentController:92 - [Agent Loop] Error occurred during loop:
java.lang.NullPointerException: Cannot invoke "org.apache.http.Header.getValue()" because "md5Header" is null
        at com.thoughtworks.go.agent.service.AgentUpgradeService.validateMd5(AgentUpgradeService.java:111)
        at com.thoughtworks.go.agent.service.AgentUpgradeService.checkForUpgradeAndExtraProperties(AgentUpgradeService.java:84)
        at com.thoughtworks.go.agent.service.AgentUpgradeService.checkForUpgradeAndExtraProperties(AgentUpgradeService.java:72)
        at com.thoughtworks.go.agent.AgentController.loop(AgentController.java:84)
        at jdk.internal.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.base/java.lang.reflect.Method.invoke(Unknown Source)
        at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:65)
        at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.base/java.util.concurrent.FutureTask.runAndReset(Unknown Source)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
2023-03-28 18:04:14,155 ERROR [scheduler-2] AgentController:92 - [Agent Loop] Error occurred during loop:
java.lang.NullPointerException: Cannot invoke "org.apache.http.Header.getValue()" because "md5Header" is null
        at com.thoughtworks.go.agent.service.AgentUpgradeService.validateMd5(AgentUpgradeService.java:111)
        at com.thoughtworks.go.agent.service.AgentUpgradeService.checkForUpgradeAndExtraProperties(AgentUpgradeService.java:84)
        at com.thoughtworks.go.agent.service.AgentUpgradeService.checkForUpgradeAndExtraProperties(AgentUpgradeService.java:72)
        at com.thoughtworks.go.agent.AgentController.loop(AgentController.java:84)
        at jdk.internal.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.base/java.lang.reflect.Method.invoke(Unknown Source)
        at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:65)
        at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.base/java.util.concurrent.FutureTask.runAndReset(Unknown Source)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
2023-03-28 18:34:02,821 INFO  [Thread-3] ClassPathXmlApplicationContext:985 - Closing org.springframework.context.support.ClassPathXmlApplicationContext@2d7275fc: startup date [Tue Mar 28 16:34:06 CEST 2023]; root of context hierarchy
2023-03-28 18:34:02,823 INFO  [Thread-3] ThreadPoolTaskScheduler:203 - Shutting down ExecutorService 'scheduler'
2023-03-28 18:34:02,825 INFO  [scheduler-1] RetryExec:97 - I/O exception (java.net.SocketException) caught when processing request to {s}->https://XXXXXXXXXXXXXXX:443: Socket closed
2023-03-28 18:34:02,825 INFO  [scheduler-1] RetryExec:113 - Retrying request to {s}->https:/XXXXXXXXXXXXXXXXXt:443
2023-03-28 18:34:02,825 ERROR [scheduler-1] AgentHTTPClientController:114 - Error occurred when agent tried to ping server:
java.lang.IllegalStateException: Connection pool shut down
 at org.apache.http.util.Asserts.check(Asserts.java:34)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.requestConnection(PoolingHttpClientConnectionManager.java:269)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:176)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
        at com.thoughtworks.go.agent.common.ssl.GoAgentServerHttpClient.execute(GoAgentServerHttpClient.java:45)
        at com.thoughtworks.go.agent.RemotingClient.post(RemotingClient.java:110)
        at com.thoughtworks.go.agent.RemotingClient.ping(RemotingClient.java:74)
        at com.thoughtworks.go.agent.AgentHTTPClientController.ping(AgentHTTPClientController.java:109)
        at jdk.internal.reflect.GeneratedMethodAccessor14.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.base/java.lang.reflect.Method.invoke(Unknown Source)
        at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:65)
        at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.base/java.util.concurrent.FutureTask.runAndReset(Unknown Source)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
2023-03-28 18:34:02,827 INFO  [scheduler-3] RetryExec:97 - I/O exception (java.net.SocketException) caught when processing request to {s}->https://XXXXXXXXXXXcorpintra.net:443: Socket closed
2023-03-28 18:34:02,827 INFO  [scheduler-3] RetryExec:113 - Retrying request to {s}->https://fxxxxxxxxxxxxxxxxx:443
2023-03-28 18:34:02,827 ERROR [scheduler-3] AgentController:92 - [Agent Loop] Error occurred during loop:
java.lang.IllegalStateException: Connection pool shut down
        at org.apache.http.util.Asserts.check(Asserts.java:34)
        at org.apache.http.impl.conn.PoolingHttpClientConnectionManager.requestConnection(PoolingHttpClientConnectionManager.java:269)
        at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:176)
        at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:186)
        at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
        at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110)
        at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83)
        at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108)
        at com.thoughtworks.go.agent.common.ssl.GoAgentServerHttpClient.execute(GoAgentServerHttpClient.java:45)
        at com.thoughtworks.go.agent.RemotingClient.post(RemotingClient.java:110)
        at com.thoughtworks.go.agent.RemotingClient.getWork(RemotingClient.java:79)
        at com.thoughtworks.go.agent.AgentHTTPClientController.retrieveWork(AgentHTTPClientController.java:153)
        at com.thoughtworks.go.agent.AgentHTTPClientController.work(AgentHTTPClientController.java:130)
        at com.thoughtworks.go.agent.AgentController.loop(AgentController.java:86)
        at jdk.internal.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
        at java.base/java.lang.reflect.Method.invoke(Unknown Source)
        at org.springframework.scheduling.support.ScheduledMethodRunnable.run(ScheduledMethodRunnable.java:65)
        at org.springframework.scheduling.support.DelegatingErrorHandlingRunnable.run(DelegatingErrorHandlingRunnable.java:54)
        at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
        at java.base/java.util.concurrent.FutureTask.runAndReset(Unknown Source)
        at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
        at java.base/java.lang.Thread.run(Unknown Source)
2023-03-28 18:35:06,210 INFO  [main] ClassPathXmlApplicationContext:583 - Refreshing org.springframework.context.support.ClassPathXmlApplicationContext@2d7275fc: startup date [Tue Mar 28 18:35:06 CEST 2023]; root of context hierarchy
2023-03-28 18:35:06,239 INFO  [main] XmlBeanDefinitionReader:317 - Loading XML bean definitions from class path resource [applicationContext.xml]
2023-03-28 18:35:06,397 INFO  [main] XmlBeanDefinitionReader:317 - Loading XML bean definitions from class path resource [applicationContext-plugin-infra.xml]
++++++++++++++++++++++++++++++++++++++++


Thanks 





You received this message because you are subscribed to a topic in the Google Groups "go-cd" group.
To unsubscribe from this topic, visit https://groups.google.com/d/topic/go-cd/wAQvIHI-DvU/unsubscribe.
To unsubscribe from this group and all its topics, send an email to go-cd+un...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/go-cd/CAA1RwH-V4%2BPuZoSqjnndhxkweE8n%2BmYi1ydNNTmiwNgqOuuEBg%40mail.gmail.com.

Chad Wilson

unread,
Apr 3, 2023, 11:54:02 PM4/3/23
to go...@googlegroups.com
Was your agent actually listed as "lost contact" on the server during this specific time period?

The reason I ask is that there's nothing major there that indicates permanent loss of connectivity. The NullPointerExceptions listed there are temporary connectivity issues which should fix themselves on the next agent work loop iteration (every 10s by default). A proper fix is in GoCD 23.1.0 (see this and this). As I said earlier, the "Connection pool shut down" is presumably because your script triggered an agent restart.

Depending on how often your automated "check health and restart" is running, you may be restarting the agent prematurely before it has time to automatically recover unless you wait for multiple consecutive failures before deciding to restart. You can likely remove some of this "NullPointerException noise" by upgrading to 23.1.0 where this "NullPointerException every 30 mins" issue is fixed, however if your agent is losing contact for more than a minute or so, the below log might not be the example of that specific problem - would need to find an example where you can confirm the server actually saw the agent as "lost contact" and had not recovered. I suspect you would see a lot more errors in such a real case.

Perhaps you should create a test agent that no builds get assigned to and does not get automatically restarted to more easily look at the root cause, without the mitigations you've put in place.

-Chad

anil kumar

unread,
Apr 4, 2023, 11:37:31 AM4/4/23
to go...@googlegroups.com
Hello Chad,

Thank you for getting back to me.

Was your agent actually listed as "lost contact" on the server during this specific time period?
There is no specific time frame interval.
If we don't do the agent restart, that curl output shows Bad! Until the next restart.
I implemented that workaround on some of the agents only.

Some agents are stable, where 
Some agents Lost Contact for a very long time, where I have not implemented a restart. They were still in lost Contact for many days.
Where agents show Lost contact, pipeline has not been assigned by agents and which are getting failed.

Where the agents are still showing Lost Contact, we have seen only the provided errors in last emails.
You are suggesting us to update the GoCD server and agent with 23.1.0 and it will fix the issue correct ?

Thanks and Regards
Anil


Chad Wilson

unread,
Apr 4, 2023, 12:19:42 PM4/4/23
to go...@googlegroups.com
OK, my suggestion is to supply a log from an agent that had lost contact for a long time and not recovered and confirm specific dates/times of what you observe and when so people don't need to guess e.g.
  • Agent A observed to be "Lost Contact" from time X
  • Agent A observed to still be "Lost Contact" at time Y
  • <logs from agent A up until time period Y>
Without this you are supplying subsets of random logs but not enough specific context to debug. Also, what you are supplying is not clean - it includes your restart workaround, for which I cannot see the logic as not controlled by GoCD. This makes it harder for anyone to see what the root cause might be because they have to guess at your manual interventions.

And no, I'm not saying that updating to 23.1.0 will fix your main issue. I said it will remove the NullPointerException "noise" and make it easier to diagnose, but those exceptions are unlikely to be the main problem you have. It is also unlikely to make anything worse though, so you can try.

I gave a number of suggestions earlier which you don't appear to have tried - not much else I can suggest.

-Chad

anil kumar

unread,
Apr 5, 2023, 3:35:01 AM4/5/23
to go...@googlegroups.com
Hello Chad,

Logs may contain some private data.
Where can I provide the logs only to you, any specific share where you and myself can view the data ?

May I know, Thougtworks is providing a premium subscription for their customers who use GOCD ?
If yes, where can I find the details? 


Thanks and Regards
Anil





Chad Wilson

unread,
Apr 5, 2023, 6:31:49 AM4/5/23
to go...@googlegroups.com
Yeah, I understand that redacting logs can be challenging, but I wouldn't want to receive any proprietary information anyway, and would always suggest redacting them first. If logs are just at standard INFO level, you could probably use grep --invert-match "CommandLine" to remove any of the console output logs from individual jobs and replace the gocd instance/pipeline/stage/job names or agent names with placeholders.

If you're still concerned that you might not have redacted sufficiently you could GPG encrypt with my public key and put it somewhere or send encrypted email.

I am not an official spokesperson here (this is just in my personal volunteer capacity as a GoCD maintainer who happens to still be a Thoughtworker), but Thoughtworks hasn't provided commercial subscription-based support for GoCD for the last 2+ years. You can read more about the history and current status here.

-Chad

Reply all
Reply to author
Forward
0 new messages