Dear Slurm User list,
I would like to start up all ~idle (idle and powered-down) nodes and programmatically check whether all of them came up as expected. For context: this is for a program that sets up Slurm clusters with on-demand cloud scheduling.
In the simplest fashion, this could be a command like srun FORALL hostname, which would return the node names on success and an error message otherwise. However, there is no such input value as FORALL, as far as I am aware. One could use -N{total node count}, since all nodes are ~idle when this executes, but I don't know an easy way to get the total number of nodes.
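One way to get that number would be to count the unique node names that sinfo reports in node-oriented mode. A sketch, untested against a live cluster: the sinfo/srun flags below are real, but the node names are simulated with printf so the snippet runs without a cluster; the commented lines show the real-cluster variant.

```shell
# Simulated output of `sinfo -N -h -o '%n'` (-N: one line per node,
# -h: no header, %n: node hostname). A node can appear once per
# partition it belongs to, hence the sort -u.
count=$(printf 'node01\nnode02\nnode03\nnode02\n' | sort -u | wc -l)
echo "$count"

# On a real cluster:
#   count=$(sinfo -N -h -o '%n' | sort -u | wc -l)
#   srun -N"$count" hostname
```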
Best regards,
Xaver
Hi Ole,
thank you for your answer!
I apologize for the unclear wording. We have already implemented
the on-demand scheduling.
However, we have not provided a HealthCheckProgram yet (which
simply means that the nodes start without a health check). I will
look into it regardless of my question.
Back to the question: I am aware that sinfo contains the
information, but I am basically looking for a method like sinfo
that produces more machine-friendly output, as I want to verify
the correct start of all nodes programmatically. ClusterShell is
also on my list of software to try out in general.
More Context
We are maintaining a tool that creates Slurm clusters in OpenStack from configuration files, and we would like to write integration tests for it. That is, we would like to test (in CI/CD) whether the Slurm cluster behaves as expected given certain configurations of our program. Of course this includes checking whether the nodes power up.
Best regards,
Xaver
Maybe to expand on this even further:
I would like to run something that waits and returns 0 when all workers have been powered up (the resume script ran without issue) and returns a non-zero code (for example 1) otherwise. Then I could start other routines to complete the integration test.
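A rough sketch of such a wait loop, under the assumption that `sinfo -h -t POWERED_DOWN -o '%n' | wc -l` reports the number of still powered-down nodes. Here get_count is simulated so the script runs anywhere (it reports 2 powered-down nodes for two polls, then 0); on a real cluster it would call sinfo as shown in the comment.

```shell
#!/bin/sh
# Poll until no node is reported as powered down, or give up after
# a deadline. Exits 0 on success, 1 on timeout.
polls=0
get_count() {
    # Real cluster: count=$(sinfo -h -t POWERED_DOWN -o '%n' | wc -l)
    polls=$((polls + 1))
    if [ "$polls" -ge 3 ]; then count=0; else count=2; fi
}
deadline=$(( $(date +%s) + 900 ))
while get_count; [ "$count" -gt 0 ]; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
        echo "timed out waiting for nodes" >&2
        exit 1
    fi
    sleep 1
done
echo "all nodes up"
```

The condition list `while get_count; [ "$count" -gt 0 ]` runs get_count in the current shell (a `$(...)` substitution would lose the counter in a subshell), and the final test decides whether to keep looping.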
And my personal idea was to use something like:
scontrol show nodes | awk '/NodeName=/ {print $1}' | sed 's/NodeName=//' | sort -u | xargs -Inodenames srun -w nodenames hostname
to execute the hostname command on all instances, which forces
them to power up. However, that feels a bit clunky, and the output
is definitely not perfect as it needs parsing.
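For what it's worth, the awk and sed steps can be collapsed into a single awk call by splitting fields on both '=' and space. The input below simulates two lines of `scontrol show nodes` output so the parsing can be tried without a cluster:

```shell
# Simulated `scontrol show nodes` output; on a real cluster, pipe
# `scontrol show nodes` into the awk call instead of printf.
nodes=$(printf 'NodeName=node01 Arch=x86_64\nNodeName=node02 Arch=x86_64\n' \
    | awk -F'[= ]' '/NodeName=/ {print $2}' \
    | sort -u)
echo "$nodes"
```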
Best regards,
Xaver
My approach would be something along this:
sinfo -t POWERED_DOWN -o %n -h | parallel -i -j 10 --timeout 900 srun -w {} hostname
sinfo lists all powered-down nodes and the output gets piped into parallel. parallel will then run 10 (or however many you want) srun instances simultaneously, with a timeout of 900 seconds to give the hosts enough time to power up. If everything works, parallel exits with 0; otherwise its exit status reflects the number of failed jobs.
Works for me like a charm; the only downside is that parallel usually needs to be installed first. But it's useful for other cases as well.
Regards,
Gerald Schneider
--
Gerald Schneider
Technical staff
IT-SC
Fraunhofer-Institut für Graphische Datenverarbeitung IGD
Joachim-Jungius-Str. 11 | 18059 Rostock | Germany
Phone +49 6151 155-309 | +49 381 4024-193 | Fax +49 381 4024-199
That looks very promising. We will have to increase the timeout due to an extensive Ansible setup run on each node before it is ready, but the idea should work nonetheless. I will try it out.
Thank you!
Xaver