[slurm-users] Users logged out when a job dies or completes


Andrea Carotti

Jul 9, 2021, 6:51:16 AM
to slurm...@schedmd.com
Dear all,

I've installed an OpenHPC 2.3 cluster on CentOS 8.4, running Slurm
20.11.7 (mostly following this guide:
https://github.com/openhpc/ohpc/releases/download/v2.3.GA/Install_guide-CentOS8-Warewulf-SLURM-2.3-aarch64.pdf).

I have a master node and hybrid nodes that act both as GPU/CPU execution
hosts and as login nodes running X11 (workstations used by the users). I
left users the ability to ssh to compute nodes even when they are not
running jobs there (I created an ssh-allowed group following page 51 of
https://software.intel.com/content/dam/www/public/us/en/documents/guides/installguide-openhpc2-centos82-6feb21.pdf,
and did not run the command 'echo "account required pam_slurm.so" >>
$CHROOT/etc/pam.d/sshd'). We are only a few people using the cluster, so
it's not a big deal.

The GPUs have persistence mode off and are in "Default" compute mode.
SELinux is disabled and there is no firewall.

I'm having a strange problem of "connection closed by remote host":

1) When a job submitted by user1 runs locally under Slurm (say on
hybrid-0-1, where user1 is logged in and working in X11) and the job
finishes (or dies, or is cancelled), the user is logged out and the GDM
login window appears.

2) When user1 (logged in and working in X11 on hybrid-0-2) runs a job
under Slurm on a remote host (e.g. hybrid-0-1) and the job finishes (or
dies, or is cancelled), user1 is logged out of hybrid-0-1. I can verify
this by connecting via ssh from hybrid-0-2 to hybrid-0-1 and seeing that
the terminal is disconnected at the end of the job. It happens with both
srun and sbatch.
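
Just to give an idea, an illustrative command that triggers it (using the
allcpu partition defined below; the exact command doesn't matter) would be
something like:

srun --partition=allcpu --nodelist=hybrid-0-1 sleep 10

and when the sleep ends, the login session on hybrid-0-1 is closed.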

I think the problem is related to the Slurm configuration rather than the
GPU configuration, because both CPU and GPU jobs lead to the logout
problem.

Here are the sbatch test script, the slurm.conf and the gres.conf:


############## sbatch.test #####

#!/bin/bash
#SBATCH --job-name=test   # Job name
#SBATCH --ntasks=1                    # Run on a single CPU
#SBATCH --cpus-per-task=1
#SBATCH --partition=allcpu
#SBATCH --nodelist=hybrid-0-1
#SBATCH --output=serial_test_%j.log   # Standard output and error log
# Usage of this script:
#sbatch job-test.sbatch

# Jobname below is set automatically when using "qsub job-orca.sh -N
# jobname". Can alternatively be set manually here. Should be the name of
# the input file without extension (.inp or whatever).
export job=$SLURM_JOB_NAME
JOB_NAME="$SLURM_JOB_NAME"
JOB_ID="$SLURM_JOB_ID"

# Communication protocol used for remote shells

export RSH_COMMAND="/usr/bin/ssh -x"

#######SERIAL COMMANDS HERE

echo "HELLO WORLD"
sleep 10
echo "done"
#########################################

########## slurm.conf ##################

#
# Example slurm.conf file. Please run configurator.html
# (in doc/html) to build a configuration file customized
# for your environment.
#
#
# slurm.conf file generated by configurator.html.
#
# See the slurm.conf man page for more information.
#
ClusterName=linux
ControlMachine=orthrus
#ControlAddr=
#BackupController=
#BackupAddr=
#
SlurmUser=slurm
#SlurmdUser=root
SlurmctldPort=6817
SlurmdPort=6818
AuthType=auth/munge
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
StateSaveLocation=/var/spool/slurm/ctld
SlurmdSpoolDir=/var/spool/slurm/d
SwitchType=switch/none
MpiDefault=none
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
ProctrackType=proctrack/pgid
#PluginDir=
#FirstJobId=
#MaxJobCount=
#PlugStackConfig=
#PropagatePrioProcess=
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
#Prolog=
#Epilog=
#SrunProlog=
#SrunEpilog=
#TaskProlog=
#TaskEpilog=
#TaskPlugin=
#TrackWCKey=no
#TreeWidth=50
#TmpFS=
#UsePAM=
#
# TIMERS
SlurmctldTimeout=300
SlurmdTimeout=300
InactiveLimit=0
MinJobAge=300
KillWait=30
Waittime=0
#
# SCHEDULING
SchedulerType=sched/backfill
#SchedulerAuth=
SelectType=select/cons_tres
SelectTypeParameters=CR_Core
#PriorityType=priority/multifactor
#PriorityDecayHalfLife=14-0
#PriorityUsageResetPeriod=14-0
#PriorityWeightFairshare=100000
#PriorityWeightAge=1000
#PriorityWeightPartition=10000
#PriorityWeightJobSize=1000
#PriorityMaxAge=1-0
#
# LOGGING
SlurmctldDebug=info
SlurmctldLogFile=/var/log/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurmd.log
#JobCompLoc=
#
# ACCOUNTING
#JobAcctGatherType=jobacct_gather/linux
#JobAcctGatherFrequency=30
#
#AccountingStorageType=accounting_storage/slurmdbd
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStorageUser=
#
# COMPUTE NODES
# OpenHPC default configuration
TaskPlugin=task/affinity
PropagateResourceLimitsExcept=MEMLOCK
JobCompType=jobcomp/filetxt
Epilog=/etc/slurm/slurm.epilog.clean
GresTypes=gpu
NodeName=hybrid-0-1 Sockets=1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
NodeName=hybrid-0-2 Sockets=1 Gres=gpu:titanxp:1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
NodeName=hybrid-0-3 Sockets=1 Gres=gpu:titanxp:1,gpu:gtx1080:1 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
NodeName=hybrid-0-4 Sockets=1 Gres=gpu:gtx980:1 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
NodeName=hybrid-0-5 Sockets=1 Gres=gpu:gtx980:1 CoresPerSocket=4 ThreadsPerCore=2 State=UNKNOWN
NodeName=hybrid-0-7 Sockets=1 Gres=gpu:titanxp:1 CoresPerSocket=4 ThreadsPerCore=1 State=UNKNOWN
PartitionName=gpu Nodes=hybrid-0-[2-5,7] Default=YES MaxTime=INFINITE State=UP Oversubscribe=NO
PartitionName=allcpu Nodes=hybrid-0-[1-5,7] Default=YES MaxTime=INFINITE State=UP Oversubscribe=NO
PartitionName=fastcpu Nodes=hybrid-0-[3-5,7] Default=YES MaxTime=INFINITE State=UP Oversubscribe=NO
PartitionName=fastqm Nodes=hybrid-0-5 Default=YES MaxTime=INFINITE State=UP Oversubscribe=NO
SlurmctldParameters=enable_configless
ReturnToService=1

#################################################

########### gres.conf ####################

NodeName=hybrid-0-[2,3,7] Name=gpu Type=titanxp File=/dev/nvidia0 COREs=0
NodeName=hybrid-0-3 Name=gpu Type=gtx1080 File=/dev/nvidia1 COREs=1
NodeName=hybrid-0-[4-5] Name=gpu Type=gtx980 File=/dev/nvidia0 COREs=0


###############
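
For completeness, a typical GPU request against this gres.conf would be an
sbatch fragment along these lines (illustrative only, not one of our
production scripts):

#SBATCH --partition=gpu
#SBATCH --gres=gpu:titanxp:1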


Thanks and sorry for the looong message

Andrea



--




¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯¯
Andrea Carotti
Dipartimento di Scienze Farmaceutiche
Università di Perugia
Via del Liceo, 1
06123 Perugia, Italy
phone: +39 075 585 5121
fax: +39 075 585 5161
mail: andrea....@unipg.it


Sidhu, Khushwant

Jul 9, 2021, 6:56:03 AM
to Slurm User Community List, slurm...@schedmd.com
CentOS 8 is probably not a good idea, as support terminates at the end of this year.



Khushwant Sidhu |Systems Admin, Principal Consultant | Technology Solutions | NTT DATA UK
4020 Lakeside, Birmingham Business Park, Solihull, B37 7YN, United Kingdom

M: +44 (0) 7767111776 | Learn more at nttdata.com/uk


Christopher Samuel

Jul 10, 2021, 4:54:57 PM
to slurm...@lists.schedmd.com
Hi Andrea,

On 7/9/21 3:50 am, Andrea Carotti wrote:

> ProctrackType=proctrack/pgid

I suspect this is the cause of your problems; my bet is that it is
incorrectly identifying the user's login processes as being part of the
job and thinking it needs to tidy them up in addition to any processes
left over from the job. It also seems to be intended more for BSD systems
than Linux.

At the very least you'd want:

ProctrackType=proctrack/linuxproc

Though I'd strongly suggest looking at cgroups for this, see:

https://slurm.schedmd.com/slurm.conf.html#OPT_ProctrackType

and:

https://slurm.schedmd.com/cgroups.html

Best of luck!
Chris
--
Chris Samuel : http://www.csamuel.org/ : Berkeley, CA, USA

Andrea Carotti

Jul 12, 2021, 6:11:40 AM
to slurm...@lists.schedmd.com
Dear Chris,

thanks for the suggestions. I'm running CentOS Stream 8.4.

I've done a couple of tests:

1) As suggested, I changed the line to
ProctrackType=proctrack/linuxproc, then restarted slurmctld and the
nodes' slurmd (I hope that's enough), but it didn't change the behaviour.

2) I've tried the cgroup configuration like this:

##############Lines added/changed to slurm.conf ###############

ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity
JobAcctGatherType=jobacct_gather/cgroup #optional for gathering metrics
PrologFlags=Contain                     #X11 flag is also suggested
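
If we later want Slurm's built-in X11 forwarding, I understand the flag
line would become something like the following (not tested here):

PrologFlags=Contain,X11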

###########Lines of cgroup.conf#################

###
# Slurm cgroup support configuration file.
###
CgroupAutomount=yes
CgroupMountpoint=/sys/fs/cgroup
ConstrainCores=yes
ConstrainDevices=yes
ConstrainKmemSpace=no         #avoid known Kernel issues
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes
TaskAffinity=no          #use task/affinity plugin instead

I restarted slurmctld and the nodes' slurmd (again hoping that's enough),
but again no luck...
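
Concretely, what I ran was roughly the following (commands are illustrative,
assuming systemd units and our node names; as far as I understand, plugin
changes such as ProctrackType generally need a daemon restart rather than
just "scontrol reconfigure"):

# on the master node (orthrus), as root
systemctl restart slurmctld
# on every compute node, e.g. via pdsh, as root
pdsh -w hybrid-0-[1-5,7] systemctl restart slurmd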

Do I need a complete restart?

What else can I check/change/try?

Hope someone can help, thanks

Andrea

Andrea Carotti

Jul 13, 2021, 10:39:04 AM
to slurm...@lists.schedmd.com
Dear all,

thanks for the suggestion. Indeed there is a file
/etc/slurm/slurm.epilog.clean, used by the line
Epilog=/etc/slurm/slurm.epilog.clean in slurm.conf.

At the moment I'm using the cgroup configuration, and to solve the logout
problem I commented out this line near the end of
/etc/slurm/slurm.epilog.clean:

#pkill -KILL -U $SLURM_UID

I hope this is the right solution; if you gurus can suggest something
better, I'll be happy to improve it.
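
For example, instead of dropping the pkill entirely, I imagine something
like this near the end of slurm.epilog.clean might be safer (an untested
sketch: it skips the purge while the job's user still has a login session
on the node):

# untested sketch: keep the user's processes only while they still have
# a console/X or ssh login on this node, otherwise purge as before
SLURM_USER=$(id -un "$SLURM_UID")
if who | awk '{print $1}' | grep -qx "$SLURM_USER"; then
    exit 0
fi
pkill -KILL -U "$SLURM_UID"
exit 0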

Many thanks again