Jenkins plugin for HPC job systems

James Hetherington

Jun 11, 2014, 10:05:58 AM
to jenkin...@googlegroups.com, Dr James Hetherington
I am considering writing a Jenkins plugin to support HPC job scheduler systems such as SGE, LoadLeveler (LL) and PBS.

Jobs currently queued in the batch system would show as pending jobs in the Jenkins UI.
Jobs would be submitted with qsub when a build is requested, and each job would transition to "currently running" in Jenkins once it starts running on the scheduler.
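
To make the state mapping concrete, here is a rough sketch in Python (for brevity; a real plugin would be Java). It assumes the single-letter job states that Torque/PBS qstat reports (Q queued, H held, W waiting, R running, E exiting, C completed). The `JenkinsState` enum and the `jenkins_state` function are hypothetical names, and other schedulers use different codes.

```python
from enum import Enum


class JenkinsState(Enum):
    """Hypothetical states the Jenkins UI would show for a batch job."""
    PENDING = "pending"
    RUNNING = "running"
    FINISHED = "finished"


# Assumed mapping from Torque/PBS qstat state letters to Jenkins states.
_PBS_STATE_MAP = {
    "Q": JenkinsState.PENDING,   # queued
    "H": JenkinsState.PENDING,   # held
    "W": JenkinsState.PENDING,   # waiting for its start time
    "R": JenkinsState.RUNNING,   # running
    "E": JenkinsState.RUNNING,   # exiting after having run
    "C": JenkinsState.FINISHED,  # completed
}


def jenkins_state(pbs_state: str) -> JenkinsState:
    """Translate a qstat state letter into the state shown in Jenkins."""
    key = pbs_state.strip().upper()
    if key not in _PBS_STATE_MAP:
        raise ValueError(f"unrecognised scheduler state: {pbs_state!r}")
    return _PBS_STATE_MAP[key]
```

Unknown letters raise an error rather than being guessed, since a wrong guess would show a misleading status in the UI.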

Configuration for the scheduler-based slave would specify the commands to be used for job submission, termination and status monitoring, with standard presets for common queue systems, as well as SSH connection details for the login node of an HPC system.
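
As an illustration only, such a configuration might look something like the following Python sketch. The names `SchedulerCommands`, `PRESETS`, `ssh_wrap` and `submit_argv` are hypothetical, the host and user in the usage note are placeholders, and the command names are simply the usual CLI tools for each scheduler.

```python
from __future__ import annotations

import shlex
from dataclasses import dataclass


@dataclass(frozen=True)
class SchedulerCommands:
    """Commands a batch-system node would run on the login node."""
    submit: str  # submits a job script
    cancel: str  # terminates a job by ID
    status: str  # reports queue / job status


# Presets for common queue systems (usual CLI tool names).
PRESETS = {
    "sge": SchedulerCommands(submit="qsub", cancel="qdel", status="qstat"),
    "pbs": SchedulerCommands(submit="qsub", cancel="qdel", status="qstat"),
    "loadleveler": SchedulerCommands(
        submit="llsubmit", cancel="llcancel", status="llq"
    ),
}


def ssh_wrap(user: str, host: str, command: list[str]) -> list[str]:
    """Build the argv that runs `command` on the login node over SSH."""
    remote = " ".join(shlex.quote(part) for part in command)
    return ["ssh", f"{user}@{host}", remote]


def submit_argv(scheduler: str, user: str, host: str, script: str) -> list[str]:
    """Build the argv that submits `script` using the chosen preset."""
    cmds = PRESETS[scheduler]
    return ssh_wrap(user, host, [cmds.submit, script])
```

For example, `submit_argv("pbs", "jenkins", "login.example.org", "build.sh")` yields an `ssh` invocation that runs `qsub build.sh` on the login node. A site with unusual wrappers could supply its own `SchedulerCommands` instead of a preset.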

I have two questions:

1. Is anyone aware of a plugin which already does this?
2. Would anyone be interested in collaborating on such a project?

--

                              Dr James Hetherington,

Team Leader,                                   Honorary Lecturer
Research Software Development    Department of Computer Science
Research IT Services                       
Information Services Division           Faculty of Engineering

                            University College London

Tel: 02035495164 (Int. 65164)
Mobile: 07946868834
Skype: ucgajhe
Twitter: @uclrcsoftdev
Blog: blogs.ucl.ac.uk/research-software-development/
Site: development.rc.ucl.ac.uk

Jesse Glick

Jun 11, 2014, 12:11:54 PM
to jenkin...@googlegroups.com
On Wed, Jun 11, 2014 at 10:05 AM, James Hetherington <jame...@gmail.com> wrote:
> Is anyone aware of a plugin which already does this?

No, but I have heard of someone interested in SGE support. My idea was
to implement the durable-task-plugin API, at which point any client of
that plugin can use the system (and Jenkins does not need to be
continuously running while the scheduling system runs your batch job).
Jenkins Enterprise by CloudBees has one such client, which looks and
feels like a freestyle project; the upcoming Workflow plugin suite has
another caller, which is used routinely for running forked commands
like shell scripts.

Bruno P. Kinoshita

Jun 11, 2014, 8:29:56 PM
to jenkin...@googlegroups.com
Hi Jesse,

I'm waiting for the workflow plugin but hadn't heard about the durable tasks plugin. 

The current implementation of the pbs plugin [1] was a proof of concept that submits jobs via the qsub command to a PBS Torque server. In our tests, a Freestyle project triggered a PBS job on the server via an SSH jump box in a university cluster. It works and creates new jobs.

However, if Jenkins goes offline, the build does indeed stop running, even though the PBS job might still be running. I thought about re-using the monitor-external-job plug-in, but in some places qsub might be the only way to submit jobs.
 
I will experiment with the durable task plugin. Any advice on how to add items created in the cluster to the build queue in Jenkins? 

The current implementation creates a PBSSlaveComputer that represents a PBS Server. A Widget is created to retrieve the list of queues and their jobs from the server.

Thanks!
Bruno



From: Jesse Glick <jgl...@cloudbees.com>
To: jenkin...@googlegroups.com
Sent: Wednesday, June 11, 2014 1:11 PM
Subject: Re: Jenkins plugin for HPC job systems


--
You received this message because you are subscribed to the Google Groups "Jenkins Developers" group.
To unsubscribe from this group and stop receiving emails from it, send an email to jenkinsci-dev+unsub...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


James Hetherington

Jun 12, 2014, 8:38:18 AM
to jenkin...@googlegroups.com
I had a look at Bruno's plugin -- it looks like a great starting point for what we need.

I think generalising Bruno's PBS Java API to support other similar systems with qsub/qstat type commands should work well: it seems that a PBSSlaveComputer could be generalised into a BatchSystemSlave.

Regarding the problem of Jenkins going offline, my feeling would be to mitigate this risk by: (1) Warning and doing a qdel on all Jenkins-submitted jobs when Jenkins shuts down cleanly. (2) Warning and doing a qdel on all lingering Jenkins jobs when Jenkins comes back up after an unclean shutdown.
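
A rough sketch of how that cleanup could work, in Python for brevity: keep a persistent record of the job IDs Jenkins has submitted, so that the same list can drive qdel both on clean shutdown and on the next startup after a crash. The `SubmittedJobs` class, its JSON file and the job ID format are all hypothetical.

```python
import json
from pathlib import Path


class SubmittedJobs:
    """Persistent record of scheduler job IDs submitted by Jenkins."""

    def __init__(self, path: Path):
        self.path = path
        # Reload IDs left over from a previous (possibly crashed) run.
        self.ids = set(json.loads(path.read_text())) if path.exists() else set()

    def _save(self) -> None:
        self.path.write_text(json.dumps(sorted(self.ids)))

    def add(self, job_id: str) -> None:
        """Record a job right after it has been submitted."""
        self.ids.add(job_id)
        self._save()

    def discard(self, job_id: str) -> None:
        """Forget a job once it has finished or been deleted."""
        self.ids.discard(job_id)
        self._save()

    def qdel_commands(self) -> list:
        """One qdel argv per job that is still on record."""
        return [["qdel", job_id] for job_id in sorted(self.ids)]
```

Because the record is written on every change, whatever it holds at startup is by definition the set of lingering jobs from the previous run.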

I don't think I understand the durable-task solution, as to me that sounds like one would not be able to move existing freestyle jobs back and forth between conventional nodes and batch system nodes... It seems to me that the right solution would be a new kind of slave node, rather than a new kind of job, though I'm happy to be corrected...



Jesse Glick

Jun 12, 2014, 9:59:38 AM
to jenkin...@googlegroups.com
On Wed, Jun 11, 2014 at 8:29 PM, 'Bruno P. Kinoshita' via Jenkins
Developers <jenkin...@googlegroups.com> wrote:
> I thought about re-using the monitor-external-job plug-in

It makes for pretty weak integration since it does not let Jenkins
*initiate* the build. Cf.

http://developer-blog.cloudbees.com/2014/03/support-for-long-running-builds-in.html

> Any advice on how to add items created in the cluster to the build queue in Jenkins?

I guess you could create a dummy FlyweightTask (tied to the master
label) whose Executable does nothing but sleep, and schedule it when
the cluster item is created, but have it claim to be “blocked” (with a
new cause you define) until the cluster actually starts running it.

I am not sure if that is really intuitive, or if you would better
define a new Widget showing cluster status in a specialized manner.

James Hetherington wrote:
> I don't think I understand the durable-task solution, as to me that sounds like one would not be able to move existing freestyle jobs
> back and forth between conventional nodes and batch system nodes

Correct; freestyle projects are incapable of surviving Jenkins
restarts, and I do not think that can be changed. Thus in Jenkins
Enterprise we added a new project type superficially similar to
freestyle but using durable-task-plugin for the main build step and
which does survive restarts. Analogously, workflow builds (using a
very different UI) can survive restarts and can also use
durable-task-plugin to manage running your external build process
during this time.

Bruno P. Kinoshita

Jun 13, 2014, 1:36:13 AM
to jenkin...@googlegroups.com
Hi Jesse,

I will read the FlyweightTask Javadoc and source, and will experiment with it.

But that special project type could work too. Any chance to see an Open Source version of it this year? :-) 

Thanks!
Bruno


Sent: Thursday, June 12, 2014 10:59 AM

Subject: Re: Jenkins plugin for HPC job systems

Bruno P. Kinoshita

Jun 13, 2014, 1:55:08 AM
to jenkin...@googlegroups.com
Hello James

>Regarding the problem of Jenkins going offline, my feeling would be to mitigate this risk by: (1) Warning and doing a qdel on all Jenkins-submitted jobs when Jenkins shuts down cleanly. (2) Warning and doing a qdel on all lingering Jenkins jobs when Jenkins comes back up after an unclean shutdown.

That's the beauty of durable tasks and similar approaches: we won't have to qdel PBS jobs. Jenkins will reconnect automagically with the cluster and will continue monitoring the job till it is over.
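
To illustrate the general idea (this is not the durable-task-plugin API, just a sketch in Python under my own assumptions): persist the scheduler job ID as soon as the job is submitted, and after a restart pick that ID up again instead of submitting a new job. The function names, the state file and the "finished" state string are hypothetical.

```python
import json
import time
from pathlib import Path
from typing import Callable


def start_or_resume(state_file: Path, submit: Callable[[], str]) -> str:
    """Return the job ID for this build, submitting only if none is on record."""
    if state_file.exists():
        # Jenkins was restarted: reattach to the job that is already out there.
        return json.loads(state_file.read_text())["job_id"]
    job_id = submit()
    state_file.write_text(json.dumps({"job_id": job_id}))
    return job_id


def wait_for_completion(
    job_id: str,
    query_state: Callable[[str], str],
    poll_seconds: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Poll the scheduler until it reports the job as finished."""
    while query_state(job_id) != "finished":
        sleep(poll_seconds)
```

The `submit` and `query_state` callables stand in for whatever runs qsub and qstat; passing them in keeps the sketch independent of any particular scheduler.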

>I don't think I understand the durable-task solution, as to me that sounds like one would not be able to move existing freestyle jobs back and forth between conventional nodes and batch system nodes... It seems to me that the right solution would be a new kind of slave node, rather than a new kind of job, though I'm happy to be corrected...

I'm still learning about the durable-task solution (and keeping track of what I understand about it [1]). But IIUC we could still use the slave node to represent the PBS cluster. The problem lies in the way we trigger PBS jobs and keep track of what's running. 

At the moment you can use a Freestyle project in Jenkins to qsub a job to the cluster. But if anything happens to Jenkins while the script is running in the cluster, you may lose that build and maybe have to connect to the cluster or look for it in the PBS queues in Jenkins.

Using the durable-task solution, or a similar approach, Jenkins would revive the job for us after a restart. Yesterday a researcher submitted a pull request [2] to make the Freestyle project run for as long as the PBS job is running in the cluster.

I will try to write a prototype using some Jenkins+PBS durable solution, together with what he proposed, and cut a new release by next weekend (though with the World Cup that could take a few more days :-)

Bruno



From: James Hetherington <jame...@gmail.com>
To: jenkin...@googlegroups.com
Sent: Thursday, June 12, 2014 9:38 AM

John McGehee

Jan 2, 2016, 1:48:42 PM
to Jenkins Developers, j.hethe...@ucl.ac.uk
I just got the Jenkins SGE Cloud plugin working in production at my company.

For LSF, you can use the Jenkins LSF Cloud plugin.  My SGE Cloud plugin is based on this.