Hi Ash,
Forgive me, but can you tell me where I can find the agent process logs? When an agent dies and a new one is spun up, I can't access the dead agent's status page; the page only exists for the new agent that spun up.
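(In case it helps anyone reading along: if these agents are Kubernetes elastic agent pods, one place the dead agent's output may still survive is the pod itself. This is a sketch, assuming a hypothetical pod name and namespace; substitute your own. It only works while the pod object still exists.)

```shell
# Placeholder names: namespace "gocd", pod "gocd-agent-abc123".
# Logs from the currently running container:
kubectl logs -n gocd gocd-agent-abc123

# If the container crashed and restarted IN PLACE, logs from the
# previous container instance are still retrievable:
kubectl logs -n gocd gocd-agent-abc123 --previous
```

Note that if the pod was deleted outright rather than restarted, `--previous` won't help; in that case only a cluster-level log aggregator (or node-level container logs) would still have the old agent's output.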
In the server logs I see the entry pasted at the bottom of this message. It doesn't give any reason why the job hung and why a new pod was ultimately started. I want to stress that this is random: I have 30 pipelines, and all of them use the same agent profile I have defined for each stage.
New agents keep spawning at various stages, with no predictable pattern. When I check the cluster, both memory and CPU usage are low.
I'm using c6i.2xlarge instances in a 5-node cluster. On r4.xlarge I didn't see this issue, but I can't say for certain whether any of these issues are attributable to the instance change, which I made a few weeks ago.
2022-04-12 13:57:05,540 WARN [ThreadPoolTaskScheduler-1] ScheduleService:611 - Found hung job[id=JobInstance{stageId=184, name='PLATFORM_DEPLOY_JOB', state=Building, result=Unknown, agentUuid='3ff453d7-6a6b-413f-a845-728d96eec351', stateTransitions=[], scheduledDate=Tue Apr 12 13:46:05 UTC 2022, timeProvider=com.thoughtworks.go.util.TimeProvider@715bdc39, ignored=false, identifier=JobIdentifier[Platform-Deploy, 2, 2, platform_deploy_qa, 5, PLATFORM_DEPLOY_JOB, 353], plan=null, runOnAllAgents=false, runMultipleInstance=false}], rescheduling it
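(Again assuming Kubernetes elastic agents: the server log above only says the job hung, not why the pod went away. Kubernetes itself usually records the termination reason. A sketch, with the same hypothetical pod/namespace names as placeholders:)

```shell
# Show pod events and container state history (look for OOMKilled,
# Evicted, node pressure, liveness-probe failures, etc.):
kubectl describe pod -n gocd gocd-agent-abc123

# Or pull just the last termination reason of the first container:
kubectl get pod -n gocd gocd-agent-abc123 \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
```

If memory and CPU look low at the cluster level, it can still be worth checking per-pod limits this way, since a pod can be OOM-killed against its own container limit even when the node as a whole has headroom.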