Send slurm-users mailing list submissions to
To subscribe or unsubscribe via the World Wide Web, visit
or, via email, send a message with subject or body 'help' to
You can reach the person managing the list at
When replying, please edit your Subject line so it is more specific
than "Re: Contents of slurm-users digest..."
Today's Topics:
1. Re: Ubuntu Cluster with Slurm (Renfro, Michael)
2. Re: sacct returns nothing after reboot (Roger Mason)
3. Re: Reset TMPDIR for All Jobs (Ellestad, Erik)
4. Re: additional jobs killed by scancel. (Alastair Neil)
----------------------------------------------------------------------
Message: 1
Date: Wed, 13 May 2020 14:05:21 +0000
Subject: Re: [slurm-users] Ubuntu Cluster with Slurm
Content-Type: text/plain; charset="utf-8"
I'd compare the RealMemory part of "scontrol show node abhi-HP-EliteBook-840-G2" to the RealMemory part of your slurm.conf:
> Nodes which register to the system with less than the configured resources (e.g. too little memory), will be placed in the "DOWN" state to avoid scheduling jobs on them.
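A quick way to see what slurmd will actually register is to ask it directly (a sketch; run on each compute node):

    # prints the node's real hardware as a ready-made slurm.conf line,
    # including the RealMemory value slurmd will report at registration
    slurmd -C

    # once slurm.conf matches (or its RealMemory is lowered to match),
    # clear the DOWN state from the controller:
    scontrol update NodeName=abhi-Lenovo-ideapad-330-15IKB State=RESUME

If slurmd -C reports less than the 12000/14000 you configured, lower the configured values accordingly.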
As far as GPUs go, it looks like you have Intel graphics on the Lenovo and a Radeon R7 on the HP? If so, then nothing is CUDA-compatible, but you might be able to make something work with OpenCL. No idea if that would give performance improvements over the CPUs, though.
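If you still want Slurm to track those GPUs as schedulable resources, the usual route is GRES. A minimal sketch (the device path below is an assumption; check /dev/dri on the node):

    # slurm.conf additions
    GresTypes=gpu
    NodeName=abhi-HP-EliteBook-840-G2 RealMemory=14000 CPUs=2 Gres=gpu:1

    # gres.conf on that node (hypothetical device file)
    Name=gpu File=/dev/dri/card0

Jobs would then request it with something like "sbatch --gres=gpu:1". Note this only handles the bookkeeping: your Java/Python code still has to drive the GPU through OpenCL itself.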
--
Mike Renfro, PhD / HPC Systems Administrator, Information Technology Services
>
> Dear All,
>
> Preamble
> ----------
> I want to form a simple cluster with three laptops:
> abhi-Latitude-E6430 //This serves as the controller
> abhi-Lenovo-ideapad-330-15IKB //Compute Node
> abhi-HP-EliteBook-840-G2 //Compute Node
>
>
> Aim
> -------------
> I want to make use of the CPU, GPU, and RAM on all the machines when I execute Java or Python programs.
>
>
> Implementation
> ------------------------
> Now let us look at the slurm.conf
>
> On Machine abhi-Latitude-E6430
>
> ClusterName=linux
> ControlMachine=abhi-Latitude-E6430
> SlurmUser=abhi
> SlurmctldPort=6817
> SlurmdPort=6818
> AuthType=auth/munge
> SwitchType=switch/none
> StateSaveLocation=/tmp
> MpiDefault=none
> ProctrackType=proctrack/pgid
> NodeName=abhi-Lenovo-ideapad-330-15IKB RealMemory=12000 CPUs=2
> NodeName=abhi-HP-EliteBook-840-G2 RealMemory=14000 CPUs=2
> PartitionName=debug Nodes=ALL Default=YES MaxTime=INFINITE State=UP
>
> The same slurm.conf is copied to all the machines.
>
>
> Observations
> --------------------------------------
> Now when I run "systemctl status" for slurmd on the two compute nodes and
> for slurmctld on the controller, followed by "sinfo", I see:
> ● slurmd.service - Slurm node daemon
> Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
> Active: active (running) since Wed 2020-05-13 18:50:01 IST; 1min 49s ago
> Docs: man:slurmd(8)
> Process: 98235 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
> Main PID: 98253 (slurmd)
> Tasks: 2
> Memory: 2.2M
> CGroup: /system.slice/slurmd.service
> └─98253 /usr/sbin/slurmd
>
> ● slurmd.service - Slurm node daemon
> Loaded: loaded (/lib/systemd/system/slurmd.service; enabled; vendor preset: enabled)
> Active: active (running) since Wed 2020-05-13 18:50:20 IST; 8s ago
> Docs: man:slurmd(8)
> Process: 71709 ExecStart=/usr/sbin/slurmd $SLURMD_OPTIONS (code=exited, status=0/SUCCESS)
> Main PID: 71734 (slurmd)
> Tasks: 2
> Memory: 2.0M
> CGroup: /system.slice/slurmd.service
> └─71734 /usr/sbin/slurmd
>
> ● slurmctld.service - Slurm controller daemon
> Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
> Active: active (running) since Wed 2020-05-13 18:48:58 IST; 4min 56s ago
> Docs: man:slurmctld(8)
> Process: 97114 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
> Main PID: 97116 (slurmctld)
> Tasks: 7
> Memory: 2.6M
> CGroup: /system.slice/slurmctld.service
> └─97116 /usr/sbin/slurmctld
>
>
> PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
> debug* up infinite 1 down* abhi-Lenovo-ideapad-330-15IKB
>
>
> Advice needed
> ------------------------
> Please let me know why I am seeing only one node.
> Also, how is the total memory calculated? And can Slurm make use of GPU
> processing power as well?
> Please let me know if I have missed something in the configuration or the
> explanation.
>
> Thank you all
>
> Best Regards,
>
>
------------------------------
Message: 2
Date: Wed, 13 May 2020 12:20:11 -0230
Subject: Re: [slurm-users] sacct returns nothing after reboot
Content-Type: text/plain
Hello,
> the default time window starts at 00:00:00 of the current day:
> -S, --starttime
> Select jobs in any state after the specified time. Default
> is 00:00:00 of the current day, unless the '-s' or '-j'
> options are used. If the '-s' option is used, then the
> default is 'now'. If states are given with the '-s' option
> then only jobs in this state at this time will be returned.
> If the '-j' option is used, then the default time is Unix
> Epoch 0. See the DEFAULT TIME WINDOW for more details.
Thank you! Obviously I did not read far enough down the man page.
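For the archives: widening the window does bring the pre-reboot records back, e.g.

    # start the window at the Unix epoch instead of 00:00 today
    sacct -S 1970-01-01 -E now

or simply pass -j with explicit job IDs, which (per the excerpt above) defaults the start time to epoch 0.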
Roger
------------------------------
Message: 3
Date: Wed, 13 May 2020 15:18:09 +0000
Subject: Re: [slurm-users] Reset TMPDIR for All Jobs
Message-ID:
Content-Type: text/plain; charset="utf-8"
Woo!
Thanks Marcus, that works.
Though, ahem, SLURM/SchedMD, if you're listening, would it hurt to cover this in the documentation regarding prolog/epilog, and maybe give an example?
Just a thought,
Erik
--
Erik Ellestad
Wynton Cluster SysAdmin
UCSF
-----Original Message-----
Sent: Tuesday, May 12, 2020 10:08 PM
Subject: Re: [slurm-users] Reset TMPDIR for All Jobs
Hi Erik,
the output of task-prolog is sourced/evaluated (not really sure how) in
the job environment.
Thus you don't have to export a variable in task-prolog; instead, echo the
export, e.g.
echo export TMPDIR=/scratch/$SLURM_JOB_ID
The variable will then be set in the job environment.
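Putting it together, a minimal sketch (the echo mechanism is the documented TaskProlog behaviour; creating and owning the directory in the root-run Prolog is an assumption to fit your setup):

    #!/bin/bash
    # TaskProlog: lines printed to stdout in the form "export NAME=value"
    # are applied to the environment of the task being spawned
    echo "export TMPDIR=/scratch/$SLURM_JOB_ID"

with the directory created beforehand, e.g. in the Prolog:

    #!/bin/bash
    # Prolog runs as root on the node before the job starts;
    # SLURM_JOB_ID and SLURM_JOB_UID are set in its environment
    mkdir -p "/scratch/$SLURM_JOB_ID"
    chown "$SLURM_JOB_UID" "/scratch/$SLURM_JOB_ID"

(plus an Epilog to remove the directory when the job ends).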
Best
Marcus
Am 12.05.2020 um 17:40 schrieb Ellestad, Erik:
> I wanted to set TMPDIR from /tmp to a per-job directory I create in
> local /scratch/$SLURM_JOB_ID (for example)
>
> This bug suggests I should be able to do this in a task-prolog.
>
>
> However adding the following to task-prolog doesn't seem to affect the
> variables the job script is running with.
>
> unset TMPDIR
>
> export TMPDIR=/scratch/$SLURM_JOB_ID
>
> It does work if it is done in the job script, rather than the task-prolog.
>
> Am I missing something?
>
> Erik
>
> --
>
> Erik Ellestad
>
> Wynton Cluster SysAdmin
>
> UCSF
>
------------------------------
Message: 4
Date: Wed, 13 May 2020 17:08:55 -0400
Subject: Re: [slurm-users] additional jobs killed by scancel.
Message-ID:
Content-Type: text/plain; charset="utf-8"
invalid field requested: "reason"
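Presumably this sacct is too old to know a "reason" field. Swapping in fields it does know:

    sacct -o jobid,elapsed,state,exitcode -j 533900,533902

should show the same FAILED / 0:9 pairs that the show-job dumps below report.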
> What do you get from
>
> sacct -o jobid,elapsed,reason,exit -j 533900,533902
>
> wrote:
> >
> > The log is continuous and has all the messages logged by slurmd on the
> node for all the jobs mentioned, below are the entries from the slurmctld
> log:
> >
> >> [2020-05-10T00:26:03.097] _slurm_rpc_kill_job: REQUEST_KILL_JOB
> JobId=533898 uid 1224431221
> >>
> Slurm Job_id=533898 Name=r18-relu-ent Ended, Run time 04:36:17, CANCELLED,
> ExitCode 0
> >> [2020-05-10T00:26:03.098] job_signal: 9 of running JobId=533898
> successful 0x8004
> >> [2020-05-10T00:26:05.204] _job_complete: JobId=533902 WTERMSIG 9
> Slurm Job_id=533902 Name=r18-soft-ent Failed, Run time 04:30:39, FAILED
> >> [2020-05-10T00:26:05.205] _job_complete: JobId=533902 done
> >> [2020-05-10T00:26:05.210] _job_complete: JobId=533900 WTERMSIG 9
> Slurm Job_id=533900 Name=r18-soft Failed, Run time 04:32:51, FAILED
> >> [2020-05-10T00:26:05.215] _job_complete: JobId=533900 done
> >
> >
> > it is curious that all the jobs were running on the same processor;
> perhaps this is a cgroup-related failure?
> >
> >>
> >> I see one job cancelled and two jobs failed.
> >> Your slurmd log is incomplete -- it doesn't show the two failed jobs
> >> exiting/failing, so the real error is not here.
> >>
> >> It might also be helpful to look through slurmctld's log starting from
> >> when the first job was canceled, looking at any messages mentioning
> >> the node or the two failed jobs.
> >>
> >> I've had nodes do strange things on job cancel. The last one I tracked
> >> down to the job epilog failing because it was NFS-mounted and NFS was
> >> being slower than Slurm liked, so it took the node offline and killed
> >> everything on it.
> >>
> wrote:
> >> >
> >> > Hi there,
> >> >
> >> > We are using Slurm 18.08 and had a weird occurrence over the
> weekend. A user canceled one of his jobs using scancel, and two additional
> jobs of the user running on the same node were killed concurrently. The
> jobs had no dependency, but they were all allocated 1 gpu. I am curious to
> know why this happened, and if this is a known bug, is there a workaround
> to prevent it from happening? Any suggestions gratefully received.
> >> >
> >> > -Alastair
> >> >
> >> > FYI
> >> > The cancelled job (533898) has this at the end of the .err file:
> >> >
> >> >> slurmstepd: error: *** JOB 533898 ON NODE056 CANCELLED AT
> 2020-05-10T00:26:03 ***
> >> >
> >> >
> >> > both of the killed jobs (533900 and 533902) have this:
> >> >
> >> >> slurmstepd: error: get_exit_code task 0 died by signal
> >> >
> >> >
> >> > here is the slurmd log from the node and the show-job output for each
> job:
> >> >
> >> >> [2020-05-09T19:49:46.735] _run_prolog: run job script took usec=4
> >> >> [2020-05-09T19:49:46.735] _run_prolog: prolog with lock for job
> 533898 ran for 0 seconds
> >> >> [2020-05-09T19:49:46.754] ====================
> >> >> [2020-05-09T19:49:46.754] batch_job:533898 job_mem:10240MB
> >> >> [2020-05-09T19:49:46.754] JobNode[0] CPU[0] Job alloc
> >> >> [2020-05-09T19:49:46.755] JobNode[0] CPU[1] Job alloc
> >> >> [2020-05-09T19:49:46.756] JobNode[0] CPU[2] Job alloc
> >> >> [2020-05-09T19:49:46.757] JobNode[0] CPU[3] Job alloc
> >> >> [2020-05-09T19:49:46.758] ====================
> >> >> [2020-05-09T19:49:46.758] Launching batch job 533898 for UID
> 1224431221
> >> >> [2020-05-09T19:53:14.060] _run_prolog: run job script took usec=3
> >> >> [2020-05-09T19:53:14.060] _run_prolog: prolog with lock for job
> 533900 ran for 0 seconds
> >> >> [2020-05-09T19:53:14.080] ====================
> >> >> [2020-05-09T19:53:14.080] batch_job:533900 job_mem:10240MB
> >> >> [2020-05-09T19:53:14.081] JobNode[0] CPU[4] Job alloc
> >> >> [2020-05-09T19:53:14.082] JobNode[0] CPU[5] Job alloc
> >> >> [2020-05-09T19:53:14.083] JobNode[0] CPU[6] Job alloc
> >> >> [2020-05-09T19:53:14.083] JobNode[0] CPU[7] Job alloc
> >> >> [2020-05-09T19:53:14.084] ====================
> >> >> [2020-05-09T19:53:14.085] Launching batch job 533900 for UID
> 1224431221
> >> >> [2020-05-09T19:55:26.283] _run_prolog: run job script took usec=21
> >> >> [2020-05-09T19:55:26.284] _run_prolog: prolog with lock for job
> 533902 ran for 0 seconds
> >> >> [2020-05-09T19:55:26.304] ====================
> >> >> [2020-05-09T19:55:26.304] batch_job:533902 job_mem:10240MB
> >> >> [2020-05-09T19:55:26.304] JobNode[0] CPU[8] Job alloc
> >> >> [2020-05-09T19:55:26.305] JobNode[0] CPU[9] Job alloc
> >> >> [2020-05-09T19:55:26.306] JobNode[0] CPU[10] Job alloc
> >> >> [2020-05-09T19:55:26.306] JobNode[0] CPU[11] Job alloc
> >> >> [2020-05-09T19:55:26.307] ====================
> >> >> [2020-05-09T19:55:26.307] Launching batch job 533902 for UID
> 1224431221
> >> >> [2020-05-10T00:26:03.127] [533898.extern] done with job
> >> >> [2020-05-10T00:26:03.975] [533898.batch] error: *** JOB 533898 ON
> NODE056 CANCELLED AT 2020-05-10T00:26:03 ***
> >> >> [2020-05-10T00:26:04.425] [533898.batch] sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 15
> >> >> [2020-05-10T00:26:04.428] [533898.batch] done with job
> >> >> [2020-05-10T00:26:05.202] [533900.batch] error: get_exit_code task 0
> died by signal
> >> >> [2020-05-10T00:26:05.202] [533902.batch] error: get_exit_code task 0
> died by signal
> >> >> [2020-05-10T00:26:05.202] [533900.batch] sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
> >> >> [2020-05-10T00:26:05.202] [533902.batch] sending
> REQUEST_COMPLETE_BATCH_SCRIPT, error:0 status 9
> >> >> [2020-05-10T00:26:05.211] [533902.batch] done with job
> >> >> [2020-05-10T00:26:05.216] [533900.batch] done with job
> >> >> [2020-05-10T00:26:05.234] [533902.extern] done with job
> >> >> [2020-05-10T00:26:05.235] [533900.extern] done with job
> >> >
> >> >
> >> >> [root@node056 2020-05-10]# cat 533{898,900,902}/show-job.txt
> >> >> JobId=533898 JobName=r18-relu-ent
> >> >> UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
> >> >> Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
> >> >> JobState=CANCELLED Reason=None Dependency=(null)
> >> >> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:15
> >> >> RunTime=04:36:17 TimeLimit=5-00:00:00 TimeMin=N/A
> >> >> SubmitTime=2020-05-09T19:49:45 EligibleTime=2020-05-09T19:49:45
> >> >> AccrueTime=2020-05-09T19:49:45
> >> >> StartTime=2020-05-09T19:49:46 EndTime=2020-05-10T00:26:03
> Deadline=N/A
> >> >> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> >> >> LastSchedEval=2020-05-09T19:49:46
> >> >> Partition=gpuq AllocNode:Sid=ARGO-2:7221
> >> >> ReqNodeList=(null) ExcNodeList=(null)
> >> >> NodeList=NODE056
> >> >> BatchHost=NODE056
> >> >> NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
> >> >> TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
> >> >> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> >> >> MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
> >> >> Features=(null) DelayBoot=00:00:00
> >> >> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> >> >>
> Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_relu_ent.slurm
> >> >> WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
> >> >>
> StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-relu-ent-533898.err
> >> >> StdIn=/dev/null
> >> >>
> StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-relu-ent-533898.out
> >> >> Power=
> >> >> TresPerNode=gpu:1
> >> >>
> >> >> JobId=533900 JobName=r18-soft
> >> >> UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
> >> >> Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
> >> >> JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
> >> >> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:9
> >> >> RunTime=04:32:51 TimeLimit=5-00:00:00 TimeMin=N/A
> >> >> SubmitTime=2020-05-09T19:53:13 EligibleTime=2020-05-09T19:53:13
> >> >> AccrueTime=2020-05-09T19:53:13
> >> >> StartTime=2020-05-09T19:53:14 EndTime=2020-05-10T00:26:05
> Deadline=N/A
> >> >> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> >> >> LastSchedEval=2020-05-09T19:53:14
> >> >> Partition=gpuq AllocNode:Sid=ARGO-2:7221
> >> >> ReqNodeList=(null) ExcNodeList=(null)
> >> >> NodeList=NODE056
> >> >> BatchHost=NODE056
> >> >> NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
> >> >> TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
> >> >> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> >> >> MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
> >> >> Features=(null) DelayBoot=00:00:00
> >> >> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> >> >>
> Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_soft.slurm
> >> >> WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
> >> >>
> StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-533900.err
> >> >> StdIn=/dev/null
> >> >>
> StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-533900.out
> >> >> Power=
> >> >> TresPerNode=gpu:1
> >> >>
> >> >> JobId=533902 JobName=r18-soft-ent
> >> >> UserId=sshres2(1224431221) GroupId=users(100) MCS_label=N/A
> >> >> Priority=19375 Nice=0 Account=csjkosecka QOS=csqos
> >> >> JobState=FAILED Reason=JobLaunchFailure Dependency=(null)
> >> >> Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:9
> >> >> RunTime=04:30:39 TimeLimit=5-00:00:00 TimeMin=N/A
> >> >> SubmitTime=2020-05-09T19:55:26 EligibleTime=2020-05-09T19:55:26
> >> >> AccrueTime=2020-05-09T19:55:26
> >> >> StartTime=2020-05-09T19:55:26 EndTime=2020-05-10T00:26:05
> Deadline=N/A
> >> >> PreemptTime=None SuspendTime=None SecsPreSuspend=0
> >> >> LastSchedEval=2020-05-09T19:55:26
> >> >> Partition=gpuq AllocNode:Sid=ARGO-2:7221
> >> >> ReqNodeList=(null) ExcNodeList=(null)
> >> >> NodeList=NODE056
> >> >> BatchHost=NODE056
> >> >> NumNodes=1 NumCPUs=4 NumTasks=0 CPUs/Task=4 ReqB:S:C:T=0:0:*:*
> >> >> TRES=cpu=4,mem=10G,node=1,billing=4,gres/gpu=1
> >> >> Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
> >> >> MinCPUsNode=4 MinMemoryNode=10G MinTmpDiskNode=0
> >> >> Features=(null) DelayBoot=00:00:00
> >> >> OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
> >> >>
> Command=/scratch/sshres2/workspace-scratch/cs747-project/command_resnet18_soft_ent.slurm
> >> >> WorkDir=/scratch/sshres2/workspace-scratch/cs747-project
> >> >>
> StdErr=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-ent-533902.err
> >> >> StdIn=/dev/null
> >> >>
> StdOut=/scratch/sshres2/workspace-scratch/cs747-project/logs_slurm/r18-soft-ent-533902.out
> >> >> Power=
> >> >> TresPerNode=gpu:1
> >> >
> >> >
> >> >
> >>
>
>
End of slurm-users Digest, Vol 31, Issue 50
*******************************************