Is it possible a hung job causes other running jobs to reschedule as well?


dle...@gmail.com

Oct 12, 2021, 1:17:26 PM
to go-cd
Hello -- Curious to know if anyone has seen this behavior before -

GoCD version v20.2.0
Kubernetes Elastic Agent Plugin v3.2.0-187

We noticed in the console.log file for an actively running job that, midway through the run, there is a request to create a new pod for the job and the job starts over again:

[go] Received request to create a pod for job JobIdentifier{pipelineName='REDACTED', pipelineCounter=412, pipelineLabel='412', stageName='deployStage', stageCounter='1', jobName='deploy', jobId=182,399} in cluster https://REDACTED.gr7.us-east-1.eks.amazonaws.com at 2021-10-08 19:02:25 +00:00

Looking through the go-server log, I do see entries suggesting it thought the job was hung:

2021-10-08 19:02:22,074 WARN [ThreadPoolTaskScheduler-1] ScheduleService:611 - Found hung job[id=JobInstance{stageId=169360, name='deploy', state=Building, result=Unknown, agentUuid='1cdec805-11a3-43aa-aa00-301004541e87', stateTransitions=[], scheduledDate=Fri Oct 08 18:49:13 UTC 2021, timeProvider=com.thoughtworks.go.util.TimeProvider@237ff2e, ignored=false, identifier=JobIdentifier[REDACTED, 412, 412, deployStage, 1, deploy, 182391], plan=null, runOnAllAgents=false, runMultipleInstance=false}], rescheduling it

2021-10-08 19:02:22,074 WARN [ThreadPoolTaskScheduler-1] ScheduleService:660 - [Job Reschedule] Rescheduling and marking old job as ignored: JobInstance{stageId=169360, name='deploy', state=Building, result=Unknown, agentUuid='1cdec805-11a3-43aa-aa00-301004541e87', stateTransitions=[], scheduledDate=Fri Oct 08 18:49:13 UTC 2021, timeProvider=com.thoughtworks.go.util.TimeProvider@237ff2e, ignored=false, identifier=JobIdentifier[REDACTED, 412, 412, deployStage, 1, deploy, 182391], plan=null, runOnAllAgents=false, runMultipleInstance=false}

2021-10-08 19:02:22,082 INFO [ThreadPoolTaskScheduler-1] ScheduleService:678 - [Job Reschedule] Scheduled new job: JobIdentifier[REDACTED, 412, 412, deployStage, 1, deploy, 182399]. Replacing old job: JobIdentifier[REDACTED, 412, 412, deployStage, 1, deploy, 182391]

There's nothing in the job's console to indicate the job was hung. Turning on timestamps shows there was console output about 2 minutes before the entry for the new pod creation request.

However, I did notice that at the same time GoCD detected that hung job, it detected another hung job, which in all likelihood really was hung, or at least was slow to report any console output (we have to tune this job better as it's a bit of a resource hog).

What *seems* to have happened is that GoCD detects a legitimately hung job and reschedules it, but in the process *also* reports detecting another hung job that is likely NOT actually hung, and ends up rescheduling that one as well.

Is this a plausible scenario? Has anyone experienced anything like this before?

I was looking through the release notes for versions beyond what we currently have deployed for GoCD and the Kubernetes Elastic Agent plugin, and I'm not seeing any bug fixes that might address this.

In the meantime, I'll see if I can mine our go-server log history for other cases of this scenario.

Thanks.

Doug

Ashwanth Kumar

Oct 13, 2021, 7:37:06 PM
to go...@googlegroups.com
Scenario(s) that I've personally experienced in GoCD:
  1. The job is actually hung, or it doesn't write anything to stdout/stderr for a long time, which is GoCD's definition of a hung job -- even then, GoCD doesn't kill the job or reschedule it. If you've configured a job hung timeout, it fails the job but never reschedules it automatically (see the config sketch below). The reason is that GoCD as a system can't decide on its own whether you want to rerun a hung job. I've had some jobs hang for months, and we generally notice only after we run out of agents for our pipelines or when manually triggering our CD pipelines for some deployment.
  2. GoCD rescheduling jobs onto a different agent. This happens automatically if, while a job is running, the agent goes missing or stops returning heartbeats to the server. Since GoCD has detected that an agent running an active job has been killed or gone non-responsive, it automatically reschedules the job.
Usually these two scenarios are not related to each other.
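
For reference, a minimal sketch of what a job-level hung timeout looks like in cruise-config.xml (the job/command here are made up, not taken from this thread). The timeout is in minutes of console inactivity; 0 means never time out, and if it's omitted the server-wide default job timeout applies:

  <job name="deploy" timeout="15">
    <tasks>
      <exec command="./deploy.sh" />
    </tasks>
  </job>

Reaching this timeout only fails the job; it does not trigger the automatic reschedule described in scenario 2.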

I've personally not used the k8s plugin, but the behaviour I've described above is specific to the GoCD server, so it shouldn't matter whether the agents are running on k8s or not. As next steps, I would suggest you start by looking at the logs of the agent the job was initially running on (if that's possible), or check go-server.log for agent disconnection messages/errors, which might also explain the behaviour.
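
As a rough starting point, something like this can narrow down the timeline in go-server.log (the agent UUID is the one from the log excerpt above; the exact wording of agent-loss messages varies by GoCD version, so treat that last pattern as a guess):

  # Everything ScheduleService logged around the reschedule
  grep -n "ScheduleService" go-server.log | grep "2021-10-08 19:0"

  # Any activity involving the agent that was running the original job
  grep -n "1cdec805-11a3-43aa-aa00-301004541e87" go-server.log

  # Broad sweep for agent loss / disconnection warnings (message text is a guess)
  grep -inE "agent.*(lost|missing|disconnect)" go-server.log

If the agent dropped off right around 19:02, that would line up with scenario 2 rather than a genuinely hung job.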


Thanks,



--

Ashwanth Kumar / ashwanthkumar.in
