automatic behavior after job fails on a node


Jay Berkenbilt

Dec 14, 2015, 12:08:12 PM
to jenkin...@googlegroups.com
Executive Summary: I'm looking to build/modify/extend a plugin to give
me something like a global post-build hook for the purpose of mitigating
a problem we have with sick nodes gobbling up jobs from the build queue.
I have some ideas but am looking for advice before I put a lot of work
into solving this.

This is my first post to jenkinsci-dev. I have never developed a Jenkins
plugin, but I've written lots of groovy code against the APIs that I run
as post build hooks, system groovy steps, or just from the script
console or CLI. I am aware of
https://wiki.jenkins-ci.org/display/JENKINS/Offline+Node+On+Failure+Plugin
and I am a heavy user of
https://wiki.jenkins-ci.org/display/JENKINS/Groovy+Postbuild+Plugin.

Our Jenkins installation scales up to several hundred slaves and has
tens of thousands of jobs. Once in a while, one of the slaves goes rogue
and gets itself in a state where jobs that hit that slave fail fast.
Sometimes it looks like a problem on Jenkins or the network environment
(disconnection, Jenkins deciding the slave is no longer configured,
failure to archive artifacts, loss of connection to the slave agent, or
other problems) and sometimes it's a problem on the slave itself (full
disk, git reference repository corruption, etc.), but
either way, when the slave goes bad, it basically just vacuums jobs off
the build queue, fails them, and then goes on to attack its next victim.
It's great for keeping the build queue short but not so great for
developer productivity and morale.

We have some mitigation in the form of a post-build hook that looks at
the log, tries to detect whether the failure was one of a certain list
of well-known failures, and if so, marks the build as "not built",
changes the labels on the node to prevent it from picking up new jobs,
and then requeues the build with the same parameters. This approach
captures a decent fraction of the problems but definitely not all, and
sometimes the node gets into a state where jobs fail too fast or in a
way that prevents the post-build hook from running.
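
The decision logic of the mitigation described above can be sketched
roughly as follows. This is a self-contained model, not the actual hook:
the class name, the failure names, the log patterns, and the Action enum
are all hypothetical, and a real hook would carry out the actions through
the Jenkins APIs, which are deliberately not shown here.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Pattern;

/** Hypothetical sketch: decides whether a failed build's log points at a sick node. */
class SickNodeClassifier {

    // Illustrative signatures only (assumptions); a real list would come from observed logs.
    static final Map<String, Pattern> KNOWN_FAILURES = new LinkedHashMap<>();
    static {
        KNOWN_FAILURES.put("disk-full",
                Pattern.compile("No space left on device"));
        KNOWN_FAILURES.put("archive-failed",
                Pattern.compile("failed to archive artifacts", Pattern.CASE_INSENSITIVE));
        KNOWN_FAILURES.put("agent-disconnected",
                Pattern.compile("connection (lost|reset|closed)", Pattern.CASE_INSENSITIVE));
    }

    /** Returns the name of the first known failure whose pattern occurs in the log. */
    static Optional<String> classify(String buildLog) {
        for (Map.Entry<String, Pattern> entry : KNOWN_FAILURES.entrySet()) {
            if (entry.getValue().matcher(buildLog).find()) {
                return Optional.of(entry.getKey());
            }
        }
        return Optional.empty();
    }

    /** The three steps the post mentions, as abstract actions. */
    enum Action { MARK_NOT_BUILT, RELABEL_NODE, REQUEUE_SAME_PARAMETERS }

    /** A recognized node failure triggers all three actions; anything else triggers none. */
    static List<Action> actionsFor(String buildLog) {
        if (!classify(buildLog).isPresent()) {
            return Collections.emptyList();
        }
        return Arrays.asList(Action.values());
    }
}
```

An ordinary test failure produces an empty action list, so the build keeps
its real result. Only logs matching a known signature cause the node to be
relabeled and the build requeued, which is also why failures outside the
list slip through.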

Given the size of our Jenkins installation, iterative plugin development
is going to be tricky because of the need to restart, so I'm looking for
a solution where the behavior can be changed at runtime in as flexible
as possible a fashion.

Given all this, what I *think* I want is something that, upon failure of
any job, runs some groovy code that can operate in the same way the
groovy post-build publisher operates. Maybe something like the ability
to globally configure additional post-build hooks that run
conditionally. In my ideal world, I could, as an administrator, go into
global configuration and set up a post-build hook to run automatically
after every failed job. Then I could take my existing groovy code and
improve it to be able to recognize a wider range of failures and to work
automatically on all jobs regardless of their configuration including
whether they already have a post-build hook. It seems like something
could be built using either the groovy-post-build plugin or the
offlinefailure plugin or some combination of the two.
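
To make the idea concrete, here is a minimal sketch of what "globally
configured hooks that run conditionally" could look like as a data
structure. Everything in it is an assumption for illustration: the
GlobalPostBuildHooks and BuildInfo names, the string results, and the
onBuildFinished entry point are hypothetical and do not correspond to any
existing Jenkins or plugin API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

/** Hypothetical model of globally configured post-build hooks; not a Jenkins API. */
class GlobalPostBuildHooks {

    /** Minimal stand-in for a finished build. */
    static final class BuildInfo {
        final String jobName;
        final String nodeName;
        final String result; // e.g. "SUCCESS", "FAILURE", "ABORTED"

        BuildInfo(String jobName, String nodeName, String result) {
            this.jobName = jobName;
            this.nodeName = nodeName;
            this.result = result;
        }
    }

    /** A condition paired with the action to run when the condition holds. */
    private static final class Hook {
        final Predicate<BuildInfo> condition;
        final Consumer<BuildInfo> action;

        Hook(Predicate<BuildInfo> condition, Consumer<BuildInfo> action) {
            this.condition = condition;
            this.action = action;
        }
    }

    private final List<Hook> hooks = new ArrayList<>();

    /** Registered once by an administrator; applies to every job, whatever its own config. */
    void register(Predicate<BuildInfo> condition, Consumer<BuildInfo> action) {
        hooks.add(new Hook(condition, action));
    }

    /** Called once per finished build; returns how many hooks actually ran. */
    int onBuildFinished(BuildInfo build) {
        int ran = 0;
        for (Hook hook : hooks) {
            if (hook.condition.test(build)) {
                hook.action.accept(build);
                ran++;
            }
        }
        return ran;
    }
}
```

Registering a hook with a condition such as "result is FAILURE" means
successful builds pay only for the predicate check, which matters given
the performance concern about hooks running in the master's context.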

I am a little concerned about performance though. We've noticed that
expensive post-build hooks can have a hugely detrimental effect on the
performance of our Jenkins master since they all run in master's context
but can run in parallel at a scale equal to the number of slaves. (Or
something approximating that.)

Does this seem like a reasonable thing to try? Am I looking in the right
place? Has someone already done this and I've overlooked it? Any tips?
Thanks!

--Jay Berkenbilt

oliver gondža

Dec 14, 2015, 1:17:57 PM
to jenkin...@googlegroups.com, Jay Berkenbilt
https://wiki.jenkins-ci.org/display/JENKINS/Adaptive+disconnector+plugin
is here to prevent some of that by putting a slave temporarily offline when
one of the monitors is triggered, most notably for a full workspace or temp dir.

If you implement your check as a Jenkins monitor, it will integrate nicely.

--
oliver

Robert Sandell

Dec 15, 2015, 4:36:35 AM
to jenkin...@googlegroups.com, Jay Berkenbilt
There is a todo in the Build Failure Analyzer plugin to add steps to take when a specific failure cause is found, but we never got around to implementing it.
If you are interested, you're welcome to take a crack at it.

https://wiki.jenkins-ci.org/display/JENKINS/Build+Failure+Analyzer

/B






--
Robert Sandell
Software Engineer
CloudBees Inc.

Jay Berkenbilt

Dec 23, 2015, 12:43:40 PM
to jenkin...@googlegroups.com
Ultimately I got this working by using https://wiki.jenkins-ci.org/display/JENKINS/Global+Post+Script+Plugin with a one-line change, which will be in 1.1.0, to allow the hook to run on aborted jobs as well as others. These other plugins were very helpful. I was mostly through writing a new plugin that triggered in the right place when, while looking for other examples, I bumped into the Global Post Script plugin. Thanks for all the responses.