Hi,
What do you mean by 'properly configured'? Ultimately you will want to
submit a test job to the nodes and run something like 'nvidia-smi'
within it to see whether the GPUs are actually being used.
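For example, a minimal test job along these lines should show the
allocated GPU in the 'nvidia-smi' output (the partition name and GPU
count are just placeholders for your setup):

    #!/bin/bash
    #SBATCH --job-name=gpu-test
    #SBATCH --partition=gpu      # placeholder - use your GPU partition
    #SBATCH --gres=gpu:1         # request one GPU
    #SBATCH --time=00:05:00

    # Print the GPUs Slurm has made visible to this job
    nvidia-smi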
> 2) I want to reserve a few CPU cores, and a few gigs of memory for use by non slurm related tasks. According to the documentation, I am to use
> CoreSpecCount and MemSpecLimit to achieve this. The documentation for CoreSpecCount says "the Slurm daemon slurmd may either be confined to these
> resources (the default) or prevented from using these resources", how do I change this default behaviour to have the config specify the cores reserved for non
> slurm stuff instead of specifying how many cores slurm can use?
I am not aware that this is possible.
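For reference, the reservation itself is just set per node in
slurm.conf, roughly like the following sketch (the node names and
sizes are made up; MemSpecLimit is given in megabytes):

    # Reserve 2 cores and 4 GB of RAM on each node for non-Slurm use
    NodeName=node[01-03] CPUs=32 RealMemory=128000 \
        CoreSpecCount=2 MemSpecLimit=4096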
> 3) While looking up examples online on how to run Python scripts inside a conda env, I have seen that the line 'module load conda' should be run before
> running 'conda activate myEnv' in the sbatch submission script. The command 'module' did not exist until I installed the apt package 'environment-modules',
> but now I see that conda is not listed as a module that can be loaded when I check using the command 'module avail'. How do I fix this?
Environment modules and Conda are somewhat orthogonal to each other.
Environment modules is a mechanism for manipulating environment
variables such as PATH and LD_LIBRARY_PATH. It allows you to provide
easy access for all users to software which has been centrally installed
in non-standard paths. It is not used to provide access to software
installed via 'apt'.
Conda is another approach to providing non-standard software, but is
usually used by individual users to install programs in their own home
directories.
You can use environment modules to allow access to a different version
of Conda than the one you get via 'apt', but there is no necessity to do
that.
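In particular, a batch script does not need 'module load conda' at
all; sourcing the Conda initialisation script directly is enough. A
sketch, assuming a Miniconda installation under the user's home
directory (adjust the path to the actual install):

    #!/bin/bash
    #SBATCH --job-name=conda-test

    # Make the 'conda' shell function available (path is an assumption)
    source "$HOME/miniconda3/etc/profile.d/conda.sh"
    conda activate myEnv

    python3 someScript.py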
> 4) A very broad question: while managing the resources being used by a program, slurm might happen to split the resources across multiple computers that
> might not necessarily have the files required by this program to run. For example, a python script that requires the package 'numpy' to function but that
> package was not installed on all of the computers. How are such things dealt with? Is the module approach meant to fix this problem? In my previous
> question, if I had a python script that users usually run just by running a command like 'python3 someScript.py' instead of running it within a conda
> environment, how should I enable slurm to manage the resources required by this script? Would I have to install all the packages required by this script on all
> the computers that are in the cluster?
In general a distributed or cluster file system, such as NFS, Ceph or
Lustre, is used to make the same files visible on all nodes. /home
would be on such a file system, as would a large part of the software.
You can use something like EasyBuild, which will install software and
generate the relevant module files.
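As a rough sketch of that workflow (the installation prefix and the
easyconfig name are only examples):

    # Install software into a prefix on the shared file system
    eb --prefix=/shared/easybuild --robot Python-3.11.3-GCCcore-12.3.0.eb

    # Make the generated module files visible and load the module
    module use /shared/easybuild/modules/all
    module load Python/3.11.3-GCCcore-12.3.0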
> 5) Related to the previous question: I have set up my 3 nodes in such a way that all the users' home directories are stored on a ceph cluster created using the
> hard drives from all the 3 nodes, which essentially means that a user's home directory is mounted at the same location on all 3 computers - making a user's
> data visible to all 3 nodes. Does this make the process of managing the dependencies of a program as described in the previous question easier? I realise that
> programs having to read and write to files on the hard drives of a ceph cluster is not really the fastest so I am planning on having users use the /tmp/ directory
> for speed critical reading and writing, as the OSs have been installed
> on NVME drives.
Depending on the IO patterns created by a piece of software, using the
distributed file system might be fine, or a local disk might be needed.
Note that you might experience problems with /tmp filling up, so it may
be better to have a separate /localscratch. In general you probably
also want people to use as much RAM as possible, in order to avoid
filesystem IO altogether, if this is feasible.
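A common pattern is to stage data onto the node-local disk at the
start of the job and copy the results back at the end, along these
lines (the /localscratch path and the file names are just examples):

    #!/bin/bash
    #SBATCH --job-name=scratch-demo

    # Per-job directory on the node-local NVMe disk
    WORKDIR=/localscratch/$SLURM_JOB_ID
    mkdir -p "$WORKDIR"

    # Stage in from the Ceph-backed home directory, compute, stage out
    cp "$HOME/input.dat" "$WORKDIR/"
    cd "$WORKDIR"
    python3 someScript.py input.dat > output.dat
    cp output.dat "$HOME/"

    # Clean up the local disk
    rm -rf "$WORKDIR"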
HTH
Loris
--
Dr. Loris Bennett (Herr/Mr)
FUB-IT (ex-ZEDAT), Freie Universität Berlin
--
slurm-users mailing list -- slurm...@lists.schedmd.com
To unsubscribe send an email to slurm-us...@lists.schedmd.com