[JIRA] [gearman-plugin] (JENKINS-32963) FATAL: no longer a configured node for XXXX

hashar@free.fr (JIRA)

Feb 15, 2016, 4:55:04 PM
to jenkinsc...@googlegroups.com
Antoine Musso created an issue
 
Jenkins / Bug JENKINS-32963
FATAL: no longer a configured node for XXXX
Issue Type: Bug
Assignee: Unassigned
Attachments: npmjob.xml
Components: gearman-plugin
Created: 15/Feb/16 9:54 PM
Environment: Jenkins 1.625.3
Gearman plugin 0.1.3 + https://review.openstack.org/#/c/271543/ "Update to Jenkins LTS 1.625.3 and fix function registration"
Priority: Major
Reporter: Antoine Musso

Today I added a new job that runs a test suite. On build completion it runs a few publishers:

  • Archive the artifacts ( logs/* ). Note that the build produces no logs, but the archiver is set to not fail.
  • PostBuild, to trigger another project (named castor-save).

The archiver fails because the node went offline while it was executing:
{{
✓ retrieve en.wp main page via mobile-sections (364ms)
✓ retrieve lead section of en.wp main page via mobile-sections-lead (306ms)
FATAL: no longer a configured node for ci-jessie-wikimedia-33866
java.lang.IllegalStateException: no longer a configured node for ci-jessie-wikimedia-33866
at hudson.model.AbstractBuild$AbstractBuildExecution.getCurrentNode(AbstractBuild.java:456)
at hudson.model.AbstractBuild$AbstractBuildExecution.reportBrokenChannel(AbstractBuild.java:813)
at hudson.model.AbstractBuild$AbstractBuildExecution.perform(AbstractBuild.java:788)
at hudson.model.Build$BuildExecution.build(Build.java:205)
at hudson.model.Build$BuildExecution.doRun(Build.java:162)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:537)
at hudson.model.Run.execute(Run.java:1741)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:98)
at hudson.model.Executor.run(Executor.java:408)
ERROR: Step ‘Archive the artifacts’ failed: no workspace for mobileapps-deploy-npm-node-4.3 #1
[PostBuildScript] - Execution post build scripts.
[PostBuildScript] Build is not success : do not execute script
Finished: FAILURE
}}

I am apparently not the only one impacted. From a recent IRC log at http://eavesdrop.openstack.org/irclogs/%23openstack-infra/%23openstack-infra.2016-01-18.log.html :

> Thelo greghaynes: once in a while I get this error : FATAL: no longer a configured node for d-p-c-local_01-769 in my job's console

JENKINS-26665 "Complete lack of correct synchronization or concern for thread safety in mansion cloud plugin" has a similar stack trace.

Job page: https://integration.wikimedia.org/ci/job/mobileapps-deploy-npm-node-4.3/1/ (hopefully Jenkins will keep it). I have attached the XML configuration. It ran on node ci-jessie-wikimedia-33866.

The job failure occurred on Feb 15th 2016 at 17:39:02

In my case two different jobs ran on the same node, one after the other. The Nodepool log shows the first one:

{{
2016-02-15 17:31:37,287 INFO nodepool.NodeLauncher: Node id: 33866 is ready
2016-02-15 17:31:41,056 INFO nodepool.NodeLauncher: Node id: 33866 added to jenkins
2016-02-15 17:37:21,325 DEBUG nodepool.NodeUpdateListener: Received: onStarted {"name":"integration-config-tox-py27-jessie" ... "node_name":"ci-jessie-wikimedia-33866"
2016-02-15 17:38:01,808 DEBUG nodepool.NodeUpdateListener: Received: onFinalized {"name":"integration-config-tox-py27-jessie" ... "node_name":"ci-jessie-wikimedia-33866"
}}

Half a minute later, a different job is assigned to the same node:
{{
2016-02-15 17:38:33,867 DEBUG nodepool.NodeUpdateListener: Received: onStarted {"name":"mobileapps-deploy-npm-node-4.3" ... "node_name":"ci-jessie-wikimedia-33866"
2016-02-15 17:38:33,871 INFO nodepool.NodeUpdateListener: Setting node id: 33866 to USED
2016-02-15 17:39:01,875 DEBUG nodepool.NodePool: Deleting node id: 33866 which has been in used state for 0.00802109248108 hours
2016-02-15 17:39:02,942 DEBUG nodepool.NodeUpdateListener: Received: onCompleted {"name":"mobileapps-deploy-npm-node-4.3" ... "node_name":"ci-jessie-wikimedia-33866" FAILURE a0ab290726d747608dcac63b1f1a33b5","ZUUL_VOTING":"1"}
,"node_name":"ci-jessie-wikimedia-33866" FAILURE
2016-02-15 17:39:06,763 INFO nodepool.NodePool: Deleted jenkins node id: 33866
}}
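
The sequence in the logs above can be checked mechanically. The following is only a rough sketch, not part of Nodepool: the function name `builds_running_at_delete` is made up for illustration, the log lines are abbreviated from the excerpts above, and it assumes that the numeric node id is the suffix of the Jenkins node name (as with id 33866 and ci-jessie-wikimedia-33866). It reports every build that had started but not yet completed when Nodepool began deleting its node.

```python
import re

# Lines abbreviated from the Nodepool debug log quoted above.
LOG = """\
2016-02-15 17:37:21,325 DEBUG nodepool.NodeUpdateListener: Received: onStarted {"name":"integration-config-tox-py27-jessie" ... "node_name":"ci-jessie-wikimedia-33866"
2016-02-15 17:38:01,808 DEBUG nodepool.NodeUpdateListener: Received: onFinalized {"name":"integration-config-tox-py27-jessie" ... "node_name":"ci-jessie-wikimedia-33866"
2016-02-15 17:38:33,867 DEBUG nodepool.NodeUpdateListener: Received: onStarted {"name":"mobileapps-deploy-npm-node-4.3" ... "node_name":"ci-jessie-wikimedia-33866"
2016-02-15 17:39:01,875 DEBUG nodepool.NodePool: Deleting node id: 33866 which has been in used state for 0.00802109248108 hours
2016-02-15 17:39:02,942 DEBUG nodepool.NodeUpdateListener: Received: onCompleted {"name":"mobileapps-deploy-npm-node-4.3" ... "node_name":"ci-jessie-wikimedia-33866"
"""

EVENT = re.compile(r'Received: (on\w+) \{"name":"([^"]+)".*"node_name":"([^"]+)"')
DELETING = re.compile(r"Deleting node id: (\d+)")


def builds_running_at_delete(log):
    """Return (node_name, job_name) pairs still running when deletion started."""
    running = {}  # node name -> set of job names currently building
    hits = []
    for line in log.splitlines():
        match = EVENT.search(line)
        if match:
            event, job, node = match.groups()
            jobs = running.setdefault(node, set())
            if event == "onStarted":
                jobs.add(job)
            else:
                # onCompleted / onFinalized: the build is over
                jobs.discard(job)
            continue
        match = DELETING.search(line)
        if match:
            node_id = match.group(1)
            for node, jobs in running.items():
                # Assumption: the node id is the suffix of the node name.
                if node.endswith("-" + node_id):
                    hits.extend((node, job) for job in sorted(jobs))
    return hits


print(builds_running_at_delete(LOG))
```

On these lines the only build reported is mobileapps-deploy-npm-node-4.3 on ci-jessie-wikimedia-33866. The 0.00802109248108 hours in the "Deleting node" line is roughly 29 seconds, so the node was removed almost immediately after the second build started.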

hashar@free.fr (JIRA)

Feb 15, 2016, 5:03:01 PM
to jenkinsc...@googlegroups.com
Antoine Musso commented on Bug JENKINS-32963
 
Re: FATAL: no longer a configured node for XXXX

Most probably one job does not set OFFLINE_NODE_WHEN_COMPLETE, and when another job then starts on the same node it ends up dying because Nodepool garbage collects the node.

Looking at the Zuul debug logs:

2016-02-15 17:37:21,293 DEBUG zuul.Gearman: Custom parameter function used for job integration-config-tox-py27-jessie

That function does not set OFFLINE_NODE_WHEN_COMPLETE. So a second build got scheduled on the node and ran into trouble when Nodepool deleted it.
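
For reference, here is a minimal sketch of a custom parameter function that does set the flag. It assumes the Zuul v2 `parameter-function` convention of a Python function taking `(item, job, params)` and mutating `params` in place. The function name `set_offline_when_complete` is hypothetical, and the real function used for integration-config-tox-py27-jessie is not shown in this report.

```python
def set_offline_when_complete(item, job, params):
    """Hypothetical Zuul parameter function (name chosen for illustration).

    Whatever other parameters a custom function sets for the job, it
    should also ask for the node to be put offline once the build
    completes, so that no second build lands on a node Nodepool is
    about to delete.
    """
    params['OFFLINE_NODE_WHEN_COMPLETE'] = '1'


# Quick check with placeholder arguments; item and job are unused here.
params = {'ZUUL_VOTING': '1'}
set_offline_when_complete(None, None, params)
print(params)
```

Existing parameters are left untouched; only the OFFLINE_NODE_WHEN_COMPLETE key is added.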

I have filed this task merely as a reference for other people.

hashar@free.fr (JIRA)

Feb 15, 2016, 5:04:03 PM
to jenkinsc...@googlegroups.com
Antoine Musso resolved as Not A Defect
 

Make sure OFFLINE_NODE_WHEN_COMPLETE is set, or Nodepool will get rid of the node while it might be running a second build.

Change By: Antoine Musso
Status: Open → Resolved
Resolution: Not A Defect