I'm setting up Slurm from scratch for the first time ever. Using
22.05.8 since I haven't had a chance to upgrade our DB server to
23.02 yet. When I try to use salloc to get a shell on a compute
node (ranger-s22-07), I end up with a shell on the login node
(ranger):
[pbisbal@ranger ~]$ salloc -n 1 -t 00:10:00 --mem=1G
salloc: Granted job allocation 23
salloc: Waiting for resource configuration
salloc: Nodes ranger-s22-07 are ready for job
[pbisbal@ranger ~]$
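(What I'm expecting is for that last prompt to end up on the compute node, i.e. something like [pbisbal@ranger-s22-07 ~]$ rather than back on the login node.)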
Any ideas what's going wrong here? I have the following line in
my slurm.conf:
LaunchParameters=user_interactive_step
When I run salloc with -vvvvv, here's what I see:
[pbisbal@ranger ~]$ salloc -vvvvv -n 1 -t 00:10:00 --mem=1G
salloc: defined options
salloc: -------------------- --------------------
salloc: mem : 1G
salloc: ntasks : 1
salloc: time : 00:10:00
salloc: verbose : 5
salloc: -------------------- --------------------
salloc: end of defined options
salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_cons_res.so
salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Consumable Resources (CR) Node Selection plugin type:select/cons_res version:0x160508
salloc: select/cons_res: common_init: select/cons_res loaded
salloc: debug3: Success.
salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_cons_tres.so
salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Trackable RESources (TRES) Selection plugin type:select/cons_tres version:0x160508
salloc: select/cons_tres: common_init: select/cons_tres loaded
salloc: debug3: Success.
salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_cray_aries.so
salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Cray/Aries node selection plugin type:select/cray_aries version:0x160508
salloc: select/cray_aries: init: Cray/Aries node selection plugin loaded
salloc: debug3: Success.
salloc: debug3: Trying to load plugin /usr/lib64/slurm/select_linear.so
salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Linear node selection plugin type:select/linear version:0x160508
salloc: select/linear: init: Linear node selection plugin loaded with argument 20
salloc: debug3: Success.
salloc: debug: Entering slurm_allocation_msg_thr_create()
salloc: debug: port from net_stream_listen is 43881
salloc: debug: Entering _msg_thr_internal
salloc: debug4: eio: handling events for 1 objects
salloc: debug3: eio_message_socket_readable: shutdown 0 fd 6
salloc: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:Munge authentication plugin type:auth/munge version:0x160508
salloc: debug: auth/munge: init: Munge authentication plugin loaded
salloc: debug3: Success.
salloc: debug3: Trying to load plugin /usr/lib64/slurm/hash_k12.so
salloc: debug3: plugin_load_from_file->_verify_syms: found Slurm plugin name:KangarooTwelve hash plugin type:hash/k12 version:0x160508
salloc: debug: hash/k12: init: init: KangarooTwelve hash plugin loaded
salloc: debug3: Success.
salloc: Granted job allocation 24
salloc: Waiting for resource configuration
salloc: Nodes ranger-s22-07 are ready for job
salloc: debug: laying out the 1 tasks on 1 hosts ranger-s22-07 dist 8192
[pbisbal@ranger ~]$
This is all I see in /var/log/slurm/slurmd.log on the compute
node:
[2023-05-19T10:21:36.898] [24.extern] task/cgroup: _memcg_initialize: job: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
[2023-05-19T10:21:36.899] [24.extern] task/cgroup: _memcg_initialize: step: alloc=1024MB mem.limit=1024MB memsw.limit=unlimited
And this is all I see in /var/log/slurm/slurmctld.log on the controller:
[2023-05-19T10:18:16.815] sched: _slurm_rpc_allocate_resources JobId=23 NodeList=ranger-s22-07 usec=1136
[2023-05-19T10:18:22.423] Time limit exhausted for JobId=22
[2023-05-19T10:21:36.861] sched: _slurm_rpc_allocate_resources JobId=24 NodeList=ranger-s22-07 usec=1039
Here's my slurm.conf file:
# grep -v ^# /etc/slurm/slurm.conf | grep -v ^$
ClusterName=ranger
SlurmctldHost=ranger-master
EnforcePartLimits=ALL
JobSubmitPlugins=lua,require_timelimit
LaunchParameters=user_interactive_step
MaxStepCount=2500
MaxTasksPerNode=32
MpiDefault=none
ProctrackType=proctrack/cgroup
PrologFlags=contain
ReturnToService=0
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/affinity,task/cgroup
TopologyPlugin=topology/tree
CompleteWait=32
InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
DefMemPerCPU=5000
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory
PriorityType=priority/multifactor
PriorityDecayHalfLife=15-0
PriorityCalcPeriod=15
PriorityFavorSmall=NO
PriorityMaxAge=180-0
PriorityWeightAge=5000
PriorityWeightFairshare=5000
PriorityWeightJobSize=5000
AccountingStorageEnforce=all
AccountingStorageHost=slurm.pppl.gov
AccountingStorageType=accounting_storage/slurmdbd
AccountingStoreFlags=job_script
JobCompType=jobcomp/none
JobAcctGatherFrequency=30
JobAcctGatherParams=UsePss
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm/slurmd.log
NodeName=ranger-s22-07 CPUs=72 Boards=1 SocketsPerBoard=4 CoresPerSocket=18 ThreadsPerCore=1 RealMemory=384880 State=UNKNOWN
PartitionName=all Nodes=ALL Default=YES GraceTime=300 MaxTime=24:00:00 State=UP
-- Prentice
On May 19, 2023, at 10:35, Prentice Bisbal <pbi...@pppl.gov> wrote:
Defaulting to a shell for salloc is a newer feature.
For your version, you should:
srun -n 1 -t 00:10:00 --mem=1G --pty bash
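Or, if you want to keep salloc in the picture, something along these lines should also work on your version: grab the allocation first, then srun into it from the shell salloc gives you on the login node:

salloc -n 1 -t 00:10:00 --mem=1G
srun --pty bash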
Brian Andrus
Brian,
Thanks for the reply, and I was hoping that would be the fix, but
that doesn't seem to be the case. I'm using 22.05.8, which isn't
that old. I double-checked the documentation archive for version
22.05.8, and setting
LaunchParameters=use_interactive_step
should be valid here. From https://slurm.schedmd.com/archive/slurm-22.05.8/slurm.conf.html:
- use_interactive_step
- Have salloc use the Interactive Step to launch a shell on an allocated compute node rather than locally to wherever salloc was invoked. This is accomplished by launching the srun command with InteractiveStepOptions as options.
This does not affect salloc called with a command as an argument. These jobs will continue to be executed as the calling user on the calling host.
and
- InteractiveStepOptions
- When LaunchParameters=use_interactive_step is enabled, launching salloc will automatically start an srun process with InteractiveStepOptions to launch a terminal on a node in the job allocation. The default value is "--interactive --preserve-env --pty $SHELL". The "--interactive" option is intentionally not documented in the srun man page. It is meant only to be used in InteractiveStepOptions in order to create an "interactive step" that will not consume resources so that other steps may run in parallel with the interactive step.
According to that, setting LaunchParameters=use_interactive_step
should be enough, since "--interactive --preserve-env --pty
$SHELL" is the default.
A colleague pointed out that my slurm.conf was setting
LaunchParameters to "user_interactive_step" when it should be
"use_interactive_step", but changing that didn't fix my problem,
just changed it. Now when I try to start an interactive shell, it
just hangs and eventually returns an error:
[pbisbal@ranger ~]$ salloc -n 1 -t 00:10:00 --mem=1G
salloc: Granted job allocation 29
salloc: Waiting for resource configuration
salloc: Nodes ranger-s22-07 are ready for job
srun: error: timeout waiting for task launch, started 0 of 1 tasks
srun: launch/slurm: launch_p_step_launch: StepId=29.interactive
aborted before step completely launched.
srun: Job step aborted: Waiting up to 32 seconds for job step to
finish.
srun: error: Timed out waiting for job step to complete
salloc: Relinquishing job allocation 29
[pbisbal@ranger ~]$
This is fixed. I was a little overzealous with my iptables rules on
the login host and was blocking traffic from the compute node back
to the login node.
Thanks to Ryan and Brian for the quick replies offering
suggestions.
Prentice