[JIRA] (JENKINS-53926) Builds hang occasionally on resume after Jenkins restart

mszpak@wp.pl (JIRA)

Oct 5, 2018, 12:47:02 PM
to jenkinsc...@googlegroups.com
Marcin Zajączkowski created an issue
 
Jenkins / Bug JENKINS-53926
Builds hang occasionally on resume after Jenkins restart
Issue Type: Bug
Assignee: Carlos Sanchez
Components: kubernetes-plugin
Created: 2018-10-05 16:46
Environment: Jenkins 2.138.1, Kubernetes plugin 1.12.6, Kubernetes 1.11
Priority: Minor
Reporter: Marcin Zajączkowski

Occasionally, we observe an issue with resuming jobs after a Jenkins Master restart.

Our configuration: Jenkins 2.138.1 (some older versions had the same problem) with init.groovy.d configuration and persistent jobs, running on top of a Kubernetes cluster. Slaves/executors are set up with the Kubernetes plugin 1.12.6 on Kubernetes 1.11.

After a Jenkins master restart, a simple job (a git clone + a shell script) sometimes resumes properly:

15:19:18 Commit message: "Quick commit"
15:19:18  > git rev-list --no-walk aaaaaa9bd4a093e0364df0f52f5447c15f3785f0 # timeout=10
[Pipeline] }
[Pipeline] // dir
[Pipeline] }
[Pipeline] // stage
[Pipeline] stage
[Pipeline] { (Execute my fancy shell script)
[Pipeline] sh
Resuming build at Wed Oct 01 13:19:59 GMT 2018 after Jenkins restart
Waiting to resume part of xxxxx-xxxxx #40: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
15:19:19 [xxxxx-xxxxx] Running shell script
15:19:22 + vault login '-method=aws' 'role=jenkins'
Waiting to resume part of xxxxx-xxxxx #40: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
Waiting to resume part of xxxxx-xxxxx #40: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
Waiting to resume part of xxxxx-xxxxx #40
Ready to run at Wed Oct 01 13:20:13 GMT 2018
15:20:13 Agent jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx is provisioned from template Kubernetes Pod Template
15:20:13 Agent specification [Kubernetes Pod Template] (jenkins-slave-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx): 
15:20:14 (script execution performed properly)
(...)
Finished: SUCCESS

However, usually it hangs at:

15:31:50 Running in /home/jenkins/workspace/xxxxx-xxxxx/playbooks
[Pipeline] {
[Pipeline] git
[Pipeline] }
[Pipeline] // dir
Resuming build at Wed Oct 01 13:32:39 GMT 2018 after Jenkins restart
Waiting to resume part of xxxxx-xxxxx #42: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
Waiting to resume part of xxxxx-xxxxx #42: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
Waiting to resume part of xxxxx-xxxxx #42: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
Waiting to resume part of xxxxx-xxxxx #42: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
Waiting to resume part of xxxxx-xxxxx #42: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
Waiting to resume part of xxxxx-xxxxx #42: ‘jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx’ is offline
Ready to run at Wed Oct 01 13:33:07 GMT 2018
15:33:07 Agent jenkins-slave-xxxxx-xxxxx-xxxxx-xxxxx is provisioned from template Kubernetes Pod Template
15:33:07 Agent specification [Kubernetes Pod Template] (jenkins-slave-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxx): 
15:33:07 

and it needs to be aborted.

From the Jenkins logs of the failed resume I see:

Oct 01, 2018 1:32:13 PM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher launch
Agent has already been launched, activating: {}
Oct 01, 2018 1:32:13 PM INFO org.csanchez.jenkins.plugins.kubernetes.KubernetesLauncher launch
Agent has already been launched, activating: {}
Oct 01, 2018 1:33:03 PM FINE org.csanchez.jenkins.plugins.kubernetes.KubernetesComputer
 Computer KubernetesComputer name: jenkins-slave-ansible-update-xxxx-xxxx slave: KubernetesSlave name: jenkins-slave-ansible-update-xxxxx-xxxxx taskAccepted
Oct 01, 2018 1:33:07 PM FINE org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerStepExecution
onResume
Oct 01, 2018 1:33:07 PM FINE org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud
Building connection to Kubernetes kubernetes URL  namespace default
Oct 01, 2018 1:33:07 PM FINE org.csanchez.jenkins.plugins.kubernetes.KubernetesFactoryAdapter
Autoconfiguring Kubernetes client
Oct 01, 2018 1:33:07 PM FINE org.csanchez.jenkins.plugins.kubernetes.KubernetesFactoryAdapter
Creating Kubernetes client: KubernetesFactoryAdapter [serviceAddress=, namespace=default, caCertData=null, credentials=null, skipTlsVerify=true, connectTimeout=0, readTimeout=0]
Oct 01, 2018 1:33:08 PM FINE org.csanchez.jenkins.plugins.kubernetes.KubernetesCloud
Connected to Kubernetes kubernetes URL 

The next lines in the log of the successful resume (they are not present on failure) are:

Oct 01, 2018 1:20:15 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
Launch proc with environment: [AGENT_WORKDIR=/home/jenkins/agent, ...]
Oct 01, 2018 1:20:16 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
Executing shell script inside container [myContainer] of pod [jenkins-slave-ansible-update-adfs-mgmt-lfm45-02g27]
Oct 01, 2018 1:20:16 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
onOpen : java.util.concurrent.CountDownLatch@7b82ed2d[Count = 0]
Oct 01, 2018 1:20:17 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
Launching with env vars: [...]
Oct 01, 2018 1:20:19 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
Executing command: "... my shell command ..." 
Oct 01, 2018 1:20:19 PM INFO org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator$1 doLaunch
Created process inside pod: [jenkins-slave-ansible-update-xxxxx-xxxxx], container: [myContainer] with pid:[-1]
Oct 01, 2018 1:20:44 PM FINEST org.csanchez.jenkins.plugins.kubernetes.pipeline.ContainerExecDecorator
onClose : java.util.concurrent.CountDownLatch@7b82ed2d[Count = 0]
Oct 01, 2018 1:20:45 PM FINE org.csanchez.jenkins.plugins.kubernetes.KubernetesComputer
 Computer KubernetesComputer name: jenkins-slave-ansible-update-xxxxx-xxxxx slave: KubernetesSlave name: jenkins-slave-ansible-update-xxxxx-xxxxx taskCompleted
...

Looking at the plugin source code didn't help. We don't know why the job hangs and why Jenkins is not able to carry on with the executor.

Do you have any suggestions on how it could be fixed, or at least investigated more deeply?
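
For reference, the FINE and FINEST records quoted above only appear once the level of the plugin's loggers is raised. Jenkins logs through java.util.logging, and inside Jenkins this is normally done with a log recorder under Manage Jenkins > System Log. The standalone sketch below only illustrates the underlying mechanism. The class name PluginLogLevel and the helper enableFinest are hypothetical; the logger names are the package and class that appear in the log excerpts.

```java
import java.util.logging.ConsoleHandler;
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper class, not part of Jenkins or the Kubernetes plugin.
public class PluginLogLevel {
    // Package prefix taken from the log excerpts in this report.
    static final String PLUGIN_PACKAGE = "org.csanchez.jenkins.plugins.kubernetes";

    // Keep a strong reference: the LogManager holds loggers weakly, so a
    // configured logger could otherwise be garbage collected and lose its level.
    private static final Logger PLUGIN_LOGGER = Logger.getLogger(PLUGIN_PACKAGE);

    static void enableFinest() {
        // Child loggers without an explicit level inherit this one.
        PLUGIN_LOGGER.setLevel(Level.FINEST);
        // Handlers filter by their own level (ConsoleHandler defaults to INFO),
        // so attach one that lets FINEST records through.
        Handler handler = new ConsoleHandler();
        handler.setLevel(Level.FINEST);
        PLUGIN_LOGGER.addHandler(handler);
    }

    public static void main(String[] args) {
        enableFinest();
        Logger child = Logger.getLogger(PLUGIN_PACKAGE + ".pipeline.ContainerExecDecorator");
        child.finest("FINEST records from child loggers are now emitted");
    }
}
```

Setting the level on the package-level logger rather than on individual classes captures KubernetesLauncher, KubernetesCloud and ContainerExecDecorator together, which makes it easier to see where the failed resume stops producing records.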

This message was sent by Atlassian Jira (v7.11.2#711002-sha1:fdc329d)

svenstaro@gmail.com (JIRA)

Jan 29, 2019, 3:30:02 PM
to jenkinsc...@googlegroups.com
Sven-Hendrik Haase commented on Bug JENKINS-53926
 
Re: Builds hang occasionally on resume after Jenkins restart

Have you by any chance made any progress on this?

jglick@cloudbees.com (JIRA)

Jul 16, 2019, 3:44:07 PM
to jenkinsc...@googlegroups.com
Jesse Glick assigned an issue to Unassigned
 
Change By: Jesse Glick
Assignee: Carlos Sanchez (removed; now Unassigned)

tamerlaha@gmail.com (JIRA)

Mar 16, 2020, 10:32:03 PM
to jenkinsc...@googlegroups.com
ipleten commented on Bug JENKINS-53926
 
Re: Builds hang occasionally on resume after Jenkins restart

We are affected by this as well.

This message was sent by Atlassian Jira (v7.13.12#713012-sha1:6e07c38)