[JIRA] (JENKINS-60667) Jobs hanging indefinitely on ec2 slaves

h35gao@edu.uwaterloo.ca (JIRA)

Jan 6, 2020, 6:50:02 PM
to jenkinsc...@googlegroups.com
Handi Gao created an issue
 
Jenkins / Bug JENKINS-60667
Jobs hanging indefinitely on ec2 slaves
Issue Type: Bug
Assignee: FABRIZIO MANFREDI
Attachments: ec2_slave_dump.txt, master_dump.txt
Components: clone-workspace-scm-plugin, core, ec2-plugin, htmlpublisher-plugin
Created: 2020-01-06 23:49
Priority: Major
Reporter: Handi Gao

We have jobs hanging indefinitely on EC2 slaves when they try to clone the workspace from master using the clone-workspace-scm plugin, or publish HTML reports to master using the htmlpublisher plugin. I don't think the issue is related to these two plugins, though.

 

We also noticed that this issue mostly happens after the EC2 slave has been up for a few days. That is, if we take down the current slave and create a new instance, the jobs run successfully at the beginning but start hanging after a few days. So I suspect that something gets clogged over time.

 

Some issues that may be related:

https://issues.jenkins-ci.org/browse/JENKINS-5977

https://issues.jenkins-ci.org/browse/JENKINS-57119

This message was sent by Atlassian Jira (v7.13.6#713006-sha1:cc4451f)

h35gao@edu.uwaterloo.ca (JIRA)

Jan 6, 2020, 7:02:02 PM
to jenkinsc...@googlegroups.com
Handi Gao updated an issue
Change By: Handi Gao
Any help would be much appreciated.

h35gao@edu.uwaterloo.ca (JIRA)

Jan 7, 2020, 2:21:05 PM
to jenkinsc...@googlegroups.com
Handi Gao updated an issue
Change By: Handi Gao
Component/s: maven-plugin

h35gao@edu.uwaterloo.ca (JIRA)

Jan 7, 2020, 2:26:02 PM
to jenkinsc...@googlegroups.com
Handi Gao updated an issue
Change By: Handi Gao
Environment:
Jenkins: 2.204.1 (base image: jenkinsci/blueocean:1.21.0)
ec2 plugin: 1.47
maven plugin: 3.4
clone-workspace-scm plugin: 0.6
htmlpublisher plugin: 1.21

h35gao@edu.uwaterloo.ca (JIRA)

Jan 17, 2020, 2:18:03 PM
to jenkinsc...@googlegroups.com
Handi Gao updated an issue

h35gao@edu.uwaterloo.ca (JIRA)

Jan 17, 2020, 2:18:04 PM
to jenkinsc...@googlegroups.com
Handi Gao updated an issue

h35gao@edu.uwaterloo.ca (JIRA)

Jan 17, 2020, 2:20:03 PM
to jenkinsc...@googlegroups.com
Handi Gao updated an issue
We have test jobs hanging indefinitely on EC2 slaves when they try to clone the workspace from master using the clone-workspace-scm plugin, or publish HTML reports to master using the htmlpublisher plugin. I don't think the issue is related to these two plugins, though.

 

We also noticed that this issue mostly happens after the EC2 slave has been up for a few days. That is, if we take down the current slave and create a new instance, the jobs run successfully at the beginning but start hanging after a few days. So I suspect that something gets clogged over time.

 

Our test jobs all follow the same pipeline: clone workspace -> run Maven Surefire tests -> publish test results using the htmlpublisher plugin.

 

Some issues that may be related:

https://issues.jenkins-ci.org/browse/JENKINS-5977

https://issues.jenkins-ci.org/browse/JENKINS-57119

 

Any help would be much appreciated.

 

Update:

We have now shortened the idle termination time so that we get new instances more often, and we found a pattern in the hanging behaviour.

It appears to happen every day around 11:00-11:30 AM UTC. We originally had two test jobs scheduled around 11:00 AM UTC. For testing purposes, we rescheduled some other test jobs to 11:00 AM as well. Our conclusion is that ANY job that runs at that time in the EC2 cloud will time out (the job timeout is set to 2 hours) after the Maven Surefire tests and hang on the HTML publish step. Once those jobs time out, any new test job scheduled after that hangs on cloning the workspace. If we move the jobs scheduled around 11:00 AM UTC to master (also in AWS), none of them hang. Also, if we run these test jobs at a different time in the EC2 cloud, they finish successfully.

 

Logs from a hanging job: (we are in EST so 5 hours behind UTC)
{code:java}
06:48:29 Please refer to /data/jenkins/workspace/Care Management/SP Azure/Care Bears - CM_POINT_OF_CARE (SSD)/target/surefire-reports for the individual test results.
08:06:19 Build timed out (after 120 minutes). Marking the build as failed.
08:06:19 Build was aborted
08:06:19 [htmlpublisher] Archiving HTML reports...
08:06:19 [htmlpublisher] Archiving at PROJECT level /data/jenkins/workspace/Care Management/SP Azure/Care Bears - CM_POINT_OF_CARE (SSD)/target/surefire-reports/html to /var/jenkins_home/jobs/Care Management/jobs/SP Azure/jobs/Care Bears - CM_POINT_OF_CARE (SSD)/htmlreports/HTML_20Report{code}
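The stall is easy to miss in the excerpt above because the lines look contiguous. As a minimal sketch (not part of the original report), the Python below scans console lines for the largest pause between consecutive timestamps. It assumes each line starts with `HH:MM:SS` and that the log does not cross midnight. The function name `largest_gap` is hypothetical, and the sample messages are shortened from the hanging log, with the workspace paths left out.

```python
from datetime import datetime, timedelta


def largest_gap(log_lines):
    """Return (gap, line_before, line_after) for the longest pause between
    consecutive timestamped lines.

    Assumption: every relevant line starts with HH:MM:SS and the log stays
    within a single day.
    """
    stamped = []
    for line in log_lines:
        try:
            ts = datetime.strptime(line[:8], "%H:%M:%S")
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        stamped.append((ts, line))

    best = (timedelta(0), None, None)
    for (t0, l0), (t1, l1) in zip(stamped, stamped[1:]):
        if t1 - t0 > best[0]:
            best = (t1 - t0, l0, l1)
    return best


# Shortened copies of the hanging-job lines (paths omitted for brevity).
hanging_log = [
    "06:48:29 Please refer to target/surefire-reports for the individual test results.",
    "08:06:19 Build timed out (after 120 minutes). Marking the build as failed.",
    "08:06:19 Build was aborted",
    "08:06:19 [htmlpublisher] Archiving HTML reports...",
]

gap, before, after = largest_gap(hanging_log)
print(gap)     # 1:17:50
print(before)  # the Surefire summary line
print(after)   # the timeout line
```

On the hanging log this reports a pause of 1 hour 17 minutes 50 seconds between the Surefire summary line and the build timeout, with no console output in between.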
Logs from a working job:
{code:java}
06:08:46 Please refer to /data/jenkins/workspace/Care Management/SP Azure/Care Bears - CM_POINT_OF_CARE (SSD)/target/surefire-reports for the individual test results.
06:08:46 [JENKINS] Recording test results
06:08:50 [WARNING] Attempt to (de-)serialize anonymous class org.jfrog.hudson.maven2.MavenDependenciesRecorder$1; see: https://jenkins.io/redirect/serialization-of-anonymous-classes/
06:08:50 [INFO] ------------------------------------------------------------------------
06:08:50 [INFO] BUILD SUCCESS
06:08:50 [INFO] ------------------------------------------------------------------------
06:08:50 [INFO] Total time: 35:32 min
06:08:50 [INFO] Finished at: 2020-01-16T11:08:50+00:00
06:08:50 [INFO] Final Memory: 30M/746M
06:08:50 [INFO] ------------------------------------------------------------------------
06:08:50 Waiting for Jenkins to finish collecting data
06:08:53 [JENKINS] Archiving /data/jenkins/workspace/Care Management/SP Azure/Care Bears - CM_POINT_OF_CARE (SSD)/pom.xml to com.pointclickcare.automation/pcc_quality_automation/4.1.0-SNAPSHOT/pcc_quality_automation-4.1.0-SNAPSHOT.pom
06:08:53 channel stopped
06:08:53 [htmlpublisher] Archiving HTML reports...
06:08:53 [htmlpublisher] Archiving at PROJECT level /data/jenkins/workspace/Care Management/SP Azure/Care Bears - CM_POINT_OF_CARE (SSD)/target/surefire-reports/html to /var/jenkins_home/jobs/Care Management/jobs/SP Azure/jobs/Care Bears - CM_POINT_OF_CARE (SSD)/htmlreports/HTML_20Report
06:08:53 TestNG Reports Processing: START
06:08:53 Looking for TestNG results report in workspace using pattern: **/testng-results.xml
06:08:54 Saving reports...
06:08:54 Processing '/var/jenkins_home/jobs/Care Management/jobs/SP Azure/jobs/Care Bears - CM_POINT_OF_CARE (SSD)/builds/2/testng/testng-results.xml'
06:08:54 100.000000% of tests were skipped, which exceeded threshold of 0%. Marking build as FAILURE
06:08:54 TestNG Reports Processing: FINISH
06:08:54 Build step 'Publish TestNG Results' changed build result to FAILURE
06:08:58 [WS-CLEANUP] Deleting project workspace...
06:08:58 [WS-CLEANUP] Deferred wipeout is used...
06:08:58 [WS-CLEANUP] done
{code}
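The EST note can be cross-checked against the working log, where Maven prints its own UTC finish time. The sketch below assumes a fixed UTC-5 offset (EST, no daylight saving) and uses hypothetical helper names `to_utc` and `in_window`. It converts a console timestamp to UTC and tests it against the 11:00-11:30 window. The report does not give the date of the hanging build, so the date used for it here is an assumption for illustration, as is the idea that the 120-minute timeout is counted from build start.

```python
from datetime import datetime, time, timedelta, timezone

# Assumption: console timestamps are EST, a fixed 5 hours behind UTC.
EST = timezone(timedelta(hours=-5), "EST")


def to_utc(date_str, time_str):
    """Interpret a console timestamp as EST and convert it to UTC."""
    local = datetime.strptime(f"{date_str} {time_str}", "%Y-%m-%d %H:%M:%S")
    return local.replace(tzinfo=EST).astimezone(timezone.utc)


def in_window(utc_dt, start=time(11, 0), end=time(11, 30)):
    """True if the UTC time of day falls inside the reported problem window."""
    return start <= utc_dt.time() <= end


# Working job: console shows 06:08:50, Maven reports 2020-01-16T11:08:50+00:00.
finished = to_utc("2020-01-16", "06:08:50")
print(finished.isoformat())  # 2020-01-16T11:08:50+00:00

# Hanging job: timed out at 08:06:19 after 120 minutes.
# The date is assumed; the report does not state it.
timed_out = to_utc("2020-01-16", "08:06:19")
estimated_start = timed_out - timedelta(minutes=120)
print(estimated_start.time(), in_window(estimated_start))  # 11:06:19 True
```

Under these assumptions the converted console time matches Maven's "Finished at" value, and the hanging build's estimated start lands inside the 11:00-11:30 AM UTC window.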

h35gao@edu.uwaterloo.ca (JIRA)

Jan 17, 2020, 2:49:03 PM
to jenkinsc...@googlegroups.com
Handi Gao updated an issue
Change By: Handi Gao
Environment:
Jenkins: 2.204.1 (base image: jenkinsci/blueocean:1.21.0)
ec2 plugin: 1.47
maven plugin: 3.4
clone-workspace-scm plugin: 0.6
htmlpublisher plugin: 1.21

surefire: 2.17