[slurm-users] Error running jobs with srun


Elisabetta Falivene

Nov 8, 2017, 5:20:04 PM
to Slurm User Community List
I'm getting this message any time I try to execute any job on my cluster.
(node01 is the first of my eight nodes and it is up and running.)

Trying a simple Python script:
root@mycluster:/tmp# srun python test.py 
slurmd[node01]: error: task/cgroup: unable to build job physical cores
/usr/bin/python: can't open file 'test.py': [Errno 2] No such file or directory
srun: error: node01: task 0: Exited with exit code 2

Trying to run a simple command:
root@mycluster:/tmp# srun nano testbet.txt
slurmd[node01]: error: task/cgroup: unable to build job physical cores
It then remains stuck indefinitely. If I press Ctrl-C twice it stops the process: nano appears for a second, then closes, and I'm back at the terminal.

Have you ever seen this message, or do you know what it means?
Thank you

Lachlan Musicman

Nov 8, 2017, 5:28:29 PM
to Slurm User Community List
On 9 November 2017 at 09:19, Elisabetta Falivene <e.fal...@ilabroma.com> wrote:
I'm getting this message any time I try to execute any job on my cluster.
(node01 is the first of my eight nodes and it is up and running.)

Trying a simple Python script:
root@mycluster:/tmp# srun python test.py 
slurmd[node01]: error: task/cgroup: unable to build job physical cores
/usr/bin/python: can't open file 'test.py': [Errno 2] No such file or directory
srun: error: node01: task 0: Exited with exit code 2


This error - which I've seen too many times to mention - is because the file isn't visible to the node.

E.g.: if all the cluster nodes share /opt and /home/ but not /root, and you run "srun python test.py" from /root, then node01 can't find it (because on node01, /root/test.py doesn't exist)
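A quick way to confirm this (a rough sketch; "someuser" and the exact paths are just placeholders, adjust for your cluster):

root@mycluster:~# srun ls /root/test.py              # fails if /root exists only on the master
root@mycluster:~# cp /root/test.py /home/someuser/   # put the script somewhere every node can see
root@mycluster:~# cd /home/someuser && srun python test.py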
 
Cheers
L.


------
"The antidote to apocalypticism is apocalyptic civics. Apocalyptic civics is the insistence that we cannot ignore the truth, nor should we panic about it. It is a shared consciousness that our institutions have failed and our ecosystem is collapsing, yet we are still here — and we are creative agents who can shape our destinies. Apocalyptic civics is the conviction that the only way out is through, and the only way through is together. "

Greg Bloom @greggish https://twitter.com/greggish/status/873177525903609857

Elisabetta Falivene

Nov 8, 2017, 6:36:59 PM
to Slurm User Community List
Wow, thank you. Is there a way to check which directories the master and the nodes share?

Lachlan Musicman

Nov 8, 2017, 6:49:36 PM
to Slurm User Community List
On 9 November 2017 at 10:35, Elisabetta Falivene <e.fal...@ilabroma.com> wrote:
Wow, thank you. Is there a way to check which directories the master and the nodes share?

There's no explicit way.
1. Check the cluster documentation written by the cluster admins
2. Ask the cluster admins
3. Run "mount" or "cat /etc/mtab" or "df -H" on the master node and check against the same commands on a worker node (by getting an interactive terminal: "srun --pty bash")
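For example (a rough sketch, assuming all eight nodes are up; srun's -l flag just labels each output line with its task number):

root@mycluster:~# df -H                # filesystems mounted on the master
root@mycluster:~# srun -N8 -l df -H    # the same command on every node, output printed back here

Any filesystem that shows up on the master but not in the srun output is local to the master and invisible to the nodes.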

Elisabetta Falivene

Nov 8, 2017, 6:55:17 PM
to Slurm User Community List
I am the admin and I have no documentation :D I'll try the third option. Thank you very much

Lachlan Musicman

Nov 8, 2017, 7:08:29 PM
to Slurm User Community List
On 9 November 2017 at 10:54, Elisabetta Falivene <e.fal...@ilabroma.com> wrote:
I am the admin and I have no documentation :D I'll try the third option. Thank you very much

Ah. Yes. Well, you will need some sort of drive shared between all the nodes so that they can read and write from a common space.
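NFS is one common way to do that. A minimal sketch, purely an assumption on my part since the thread doesn't say what the cluster uses (adjust the network range, hostname and paths to your setup):

# on the master (NFS server), in /etc/exports:
/home  192.168.1.0/24(rw,sync,no_subtree_check)

# on each node, in /etc/fstab:
mycluster:/home  /home  nfs  defaults  0  0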

Also, I recommend documentation ;)

Elisabetta Falivene

Nov 9, 2017, 6:03:27 AM
to Slurm User Community List
I'll certainly produce documentation as soon as I understand how the whole cluster works. (It was something like "Here's the root password and the key to the room. You don't need anything else, do you?" :) )

Thanks to your helpful suggestions I was able to work out that the common shared space for the master and the nodes is the 'home' partition.
Running srun on a script that is in home removed the 'file not found' problem and the script now executes, but the error

slurmd[node01]: error: task/cgroup: unable to build job physical cores

is still raised before the job runs. What does it mean?
Thank you, thank you, thank you!
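For anyone who hits the same cgroup error: two things that may be worth checking (a hedged guess, not something confirmed in this thread) are whether the hardware slurmd actually detects matches the node's definition in slurm.conf, and whether cgroup.conf is present when TaskPlugin=task/cgroup is set. The /etc/slurm-llnl/ path below is a Debian-style assumption; adjust for your install.

root@node01:~# slurmd -C                                  # CPUs/sockets/cores/threads slurmd detects on this node
root@mycluster:~# grep -i "NodeName=node01" /etc/slurm-llnl/slurm.conf
root@mycluster:~# cat /etc/slurm-llnl/cgroup.conf         # e.g. CgroupAutomount=yes, ConstrainCores=yes

If the detected hardware and the slurm.conf definition disagree, or cgroup.conf is missing, task/cgroup can fail while trying to map the job onto physical cores.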