I am trying to run a Docker image on a Slurm cluster, so I modified the "custom-compute-install" script to install Docker and set it up to run.
In particular, I added my Slurm user (that is, the user printed by 'whoami' on the compute machine) to the "docker" group so that I can run containers.
On the Slurm login machine and on the template Slurm image this works: I can run commands with the installed Docker image, like:
<prompt># docker run <my-image> <my-command>
In fact, if I run "groups" on the login/compute-image instances, it correctly prints that the user belongs to the "docker" group.
However, when I schedule a job with sbatch, it seems that the user is not in the "docker" group, and I get the error:
"got permission denied while trying to connect to the Docker daemon socket".
This is my "custom-compute-install" script:
******************************************************
#!/bin/bash
# install docker (assumes get-docker.sh is already present in the working directory)
sh get-docker.sh
# add docker to startup and start it
systemctl enable docker
systemctl restart docker
# add cluster user to the docker group
usermod -aG docker <my-cluster-user>
# pull the image that will be run on the cluster instances
docker pull <my-docker-image>
******************************************************
This is my sbatch script (compute.sh) that invokes the computation:
******************************************************
#!/bin/bash
#
#SBATCH --job-name=compute
#SBATCH --output=out_%j.txt
#SBATCH --nodes=1
srun docker run <my-image> <my-command>
******************************************************
And this is how the job is scheduled:
******************************************************
<prompt># sbatch compute.sh
******************************************************
The job does run, but it fails with the permission denied error I mentioned before.
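To narrow this down, a diagnostic job along the following lines could show which user and groups the job really runs with, for comparison with the output of "groups" in an interactive shell. This is only a sketch: has_group is a hypothetical helper, the job name and output file name are placeholders, and /var/run/docker.sock assumes Docker's default socket path.

```shell
#!/bin/bash
#SBATCH --job-name=check-groups
#SBATCH --output=groups_%j.txt
#SBATCH --nodes=1

# Hypothetical helper (not part of Slurm or Docker): succeeds if the first
# argument equals one of the remaining arguments.
has_group() {
    local wanted="$1"
    shift
    local name
    for name in "$@"; do
        if [ "$name" = "$wanted" ]; then
            return 0
        fi
    done
    return 1
}

# Print the identity the job actually runs with.
echo "user:   $(id -un)"
echo "groups: $(id -nG)"

# Intentionally unquoted so the group list splits into separate words
# (assumes group names contain no spaces).
if has_group docker $(id -nG); then
    echo "docker group: present"
else
    echo "docker group: missing"
fi

# Show owner, group, and mode of the Docker socket, if it exists at the
# assumed default path.
if [ -S /var/run/docker.sock ]; then
    stat -c '%U:%G %a' /var/run/docker.sock
fi
```

Saved under a placeholder name such as check-groups.sh and submitted with sbatch, the resulting output file could be compared with the same commands run by hand on the login machine.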
What am I doing wrong?