Hello -- Curious to know if anyone has seen this behavior before -
GoCD version v20.2.0
Kubernetes Elastic Agent Plugin v3.2.0-187
We noticed in the console.log file for an actively running job that, midway through the run, there is a request to create a new pod for the job, and the job starts over again:
[go] Received request to create a pod for job JobIdentifier{pipelineName='REDACTED', pipelineCounter=412, pipelineLabel='412', stageName='deployStage', stageCounter='1', jobName='deploy', jobId=182,399} in cluster https://REDACTED.gr7.us-east-1.eks.amazonaws.com at 2021-10-08 19:02:25 +00:00
Looking through the go-server log, I do see an entry suggesting it thinks the job is hung:
2021-10-08 19:02:22,074 WARN [ThreadPoolTaskScheduler-1] ScheduleService:611 - Found hung job[id=JobInstance{stageId=169360, name='deploy', state=Building, result=Unknown, agentUuid='1cdec805-11a3-43aa-aa00-301004541e87', stateTransitions=[], scheduledDate=Fri Oct 08 18:49:13 UTC 2021, timeProvider=com.thoughtworks.go.util.TimeProvider@237ff2e, ignored=false, identifier=JobIdentifier[REDACTED, 412, 412, deployStage, 1, deploy, 182391], plan=null, runOnAllAgents=false, runMultipleInstance=false}], rescheduling it
2021-10-08 19:02:22,074 WARN [ThreadPoolTaskScheduler-1] ScheduleService:660 - [Job Reschedule] Rescheduling and marking old job as ignored: JobInstance{stageId=169360, name='deploy', state=Building, result=Unknown, agentUuid='1cdec805-11a3-43aa-aa00-301004541e87', stateTransitions=[], scheduledDate=Fri Oct 08 18:49:13 UTC 2021, timeProvider=com.thoughtworks.go.util.TimeProvider@237ff2e, ignored=false, identifier=JobIdentifier[REDACTED, 412, 412, deployStage, 1, deploy, 182391], plan=null, runOnAllAgents=false, runMultipleInstance=false}
2021-10-08 19:02:22,082 INFO [ThreadPoolTaskScheduler-1] ScheduleService:678 - [Job Reschedule] Scheduled new job: JobIdentifier[REDACTED, 412, 412, deployStage, 1, deploy, 182399]. Replacing old job: JobIdentifier[REDACTED, 412, 412, deployStage, 1, deploy, 182391]
There's nothing in the job's console output to indicate the job was hung. Turning on timestamps shows there was a console log entry only about 2 minutes before the entry for the new pod creation request.
However, I did notice that at the same time GoCD detected this job as hung, it also detected another hung job, which in all likelihood really was hung, or at least was slow to report any console output (we need to tune that job better, as it's a bit of a resource hog).
What *seems* to have happened is that GoCD detected a legitimately hung job and rescheduled it, but in the same pass *also* flagged this other job as hung, even though it was likely NOT actually hung, and ended up rescheduling it as well.
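To make that hypothesis concrete, here's a minimal sketch of the mental model I have: a single periodic pass over all building jobs that reschedules anything past an inactivity threshold. The names, the threshold, and the heartbeat-based check are all my assumptions, not GoCD's actual ScheduleService code:

import java.time.Duration;
import java.time.Instant;
import java.util.List;

// Hypothetical model of a periodic hung-job sweep. All names and the
// threshold are illustrative assumptions, not GoCD's implementation.
public class HungJobSweep {

    record BuildingJob(String name, Instant lastHeardFrom) {}

    private final Duration hungThreshold;

    HungJobSweep(Duration hungThreshold) {
        this.hungThreshold = hungThreshold;
    }

    // One pass of the sweep: every job whose agent has been silent longer
    // than the threshold is rescheduled in the same pass, regardless of
    // why it went quiet.
    void sweep(List<BuildingJob> buildingJobs, Instant now) {
        for (BuildingJob job : buildingJobs) {
            Duration silence = Duration.between(job.lastHeardFrom(), now);
            if (silence.compareTo(hungThreshold) > 0) {
                System.out.printf("Found hung job %s (silent %ds), rescheduling%n",
                        job.name(), silence.toSeconds());
                // markOldJobIgnored(job); scheduleReplacement(job);  // hypothetical
            }
        }
    }

    public static void main(String[] args) {
        Instant now = Instant.now();
        // One genuinely hung job and one that was merely quiet a bit too long:
        List<BuildingJob> jobs = List.of(
                new BuildingJob("resourceHog", now.minus(Duration.ofMinutes(20))),
                new BuildingJob("deploy", now.minus(Duration.ofMinutes(11))));
        new HungJobSweep(Duration.ofMinutes(10)).sweep(jobs, now);
    }
}

If the check is based on something like agent heartbeat rather than console output, a job whose agent report is slightly delayed could cross the threshold in the same sweep as a genuinely hung job, even while it's still producing output, which would match what we saw.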
Is this a plausible scenario? Has anyone experienced anything like this before?
I was looking through the release notes for versions beyond what we currently have deployed, for both GoCD and the Kubernetes elastic agent plugin, and I'm not seeing any bug fixes that might address this.
In the meantime, I'll see if I can dig through our go-server log history and find other occurrences of this scenario.
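In case it helps anyone reproduce the search, this is roughly what I'll run; the log directory and file naming are guesses, so adjust for your install:

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Stream;

// Hypothetical log-mining helper: print every "Found hung job" warning so
// clusters of reschedules at the same timestamp stand out. The default
// path is an assumption -- point it at wherever go-server writes its logs.
public class FindHungJobs {
    public static void main(String[] args) throws IOException {
        Path logDir = Path.of(args.length > 0 ? args[0] : "/var/log/go-server");
        try (Stream<Path> files = Files.list(logDir)) {
            List<Path> logs = files
                    .filter(p -> p.getFileName().toString().startsWith("go-server"))
                    .sorted()
                    .toList();
            for (Path log : logs) {
                try (Stream<String> lines = Files.lines(log)) {
                    lines.filter(l -> l.contains("Found hung job"))
                         .forEach(l -> System.out.println(log.getFileName() + ": " + l));
                }
            }
        }
    }
}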
Thanks.
Doug