[slurm-dev] spank plugin roll out on a cluster


Hendryk Bockelmann

Sep 13, 2016, 7:35:48 AM
to slurm-dev
Hello,

I was wondering what the best/correct way is to roll out a new version
of a SPANK plugin (i.e., the *.so file) across the cluster while keeping
production running.
We used to submit, as root, one high-priority job per node that copies
the files from a global Lustre directory to the node-local /etc/slurm.
Unfortunately, this causes the job to get stuck in the running state
until it finally hits the time limit. In slurmd.log one can see
messages like

error: _step_connect: connect() failed dir /var/log/slurm/spool_slurmd/ node m10037 job 3849566 step -2 Connection refused
_handle_stray_script: Purging vestigial job script /var/log/slurm/spool_slurmd//job3849566/slurm_script

So this does not seem to be a good idea.
Any suggestions for a better approach?

Thanks, Hendryk

Jeffrey Frey

Sep 13, 2016, 10:05:09 AM
to slurm-dev
How about installing an init/systemd service that starts before the Slurm daemons do, and having your job simply reboot the node to trigger it? The service could remove itself once it has run.
::::::::::::::::::::::::::::::::::::::::::::::::::::::
Jeffrey T. Frey, Ph.D.
Systems Programmer V / HPC Management
Network & Systems Services / College of Engineering
University of Delaware, Newark DE 19716
Office: (302) 831-6034 Mobile: (302) 419-4976
::::::::::::::::::::::::::::::::::::::::::::::::::::::
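
[Editor's sketch, not part of the original messages] The copy step that such a boot-time service would run could look roughly like the following. This is a minimal sketch under stated assumptions: the function name install_plugin and the example paths are hypothetical, and nothing in the thread confirms that this is how either poster implemented it. It copies the plugin to a temporary file inside the destination directory and then renames it into place, so the target path always holds either the complete old file or the complete new one.

```python
import os
import shutil
import tempfile
from pathlib import Path


def install_plugin(src: Path, dest_dir: Path, mode: int = 0o755) -> Path:
    """Copy src into dest_dir under the same file name, swapping it in atomically.

    Hypothetical helper for illustration; not taken from Slurm or from the thread.
    """
    dest = dest_dir / src.name
    # Create the temporary file inside the destination directory so the final
    # rename stays on one filesystem, which is what makes os.replace atomic.
    fd, tmp_name = tempfile.mkstemp(dir=dest_dir, prefix=f".{src.name}.")
    try:
        with os.fdopen(fd, "wb") as tmp, open(src, "rb") as source:
            shutil.copyfileobj(source, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the data to disk before the rename
        os.chmod(tmp_name, mode)
        os.replace(tmp_name, dest)  # atomic swap of the directory entry
    except BaseException:
        # Remove the partial copy; the previously installed file is untouched.
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return dest


# Example call with assumed paths (adjust to the real staging directory):
# install_plugin(Path("/lustre/admin/spank/myplugin.so"), Path("/etc/slurm"))
```

Because the rename only swaps the directory entry, a process that already has the old file open keeps using the old copy until it reopens the path. Whether that is sufficient for a live slurmd is not something this thread establishes, which is why the suggestion above pairs the copy with a reboot.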



