[slurm-users] Job step aborted


Mahmood Naderan

May 17, 2018, 5:25:47 AM
to Slurm User Community List
Hi,
For an interactive job started with srun, I see that the session is
terminated automatically shortly after I open the GUI, which is unexpected.

[mahmood@rocks7 ansys_test]$ srun --x11 -A y8 -p RUBY --ntasks=10 --mem=8GB --pty bash
[mahmood@compute-0-6 ansys_test]$ /state/partition1/scfd/sc -t10
srun: First task exited 60s ago
srun: step:292.0 task 0: running
srun: step:292.0 tasks 1-9: exited
srun: Terminating job step 292.0
srun: Job step aborted: Waiting up to 62 seconds for job step to finish.
srun: error: compute-0-6: task 0: Killed

What does that mean?

Regards,
Mahmood

Mahmood Naderan

May 17, 2018, 12:51:32 PM
to Slurm User Community List
I have opened a bug ticket at https://bugs.schedmd.com/show_bug.cgi?id=5182
It is annoying...

Regards,
Mahmood

Matthieu Hautreux

May 17, 2018, 3:12:05 PM
to Slurm User Community List
It means what it says: your job step was terminated because 9 of its 10 tasks exited more than 60 seconds earlier.

The logic behind the 60 seconds (configurable) is described in the srun man page. You should look at it closely. 

You should also look at the FAQ here https://slurm.schedmd.com/faq.html.

You should set --ntasks=1, if I guess your goal correctly.

HTH
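
A minimal sketch of the single-task form suggested above, reusing the account, partition, and memory values from the original post. The wrapper name `start_session` is hypothetical and only exists so the command can be kept in a shell profile. With one task, the interactive `bash` shell is the only task in the step, so there are no sibling tasks that can exit early and start the timer.

```shell
# Hypothetical wrapper; the srun options are the ones from the original post,
# with --ntasks=10 replaced by --ntasks=1.
start_session() {
    srun --x11 -A y8 -p RUBY --ntasks=1 --mem=8GB --pty bash
}
```

Calling `start_session` on the login node would then open the interactive shell on a compute node, and the application can be started from there as before.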


Mahmood Naderan

May 18, 2018, 2:06:47 AM
to Slurm User Community List
OK, I understand that. However, there is an issue with ntasks=1.
Assume a user wants to launch an application and pass the number of
cores as a command-line argument. Keeping in mind that the CPU limit for
the partition is 20 cores, the following example

[mahmood@rocks7 ~]$ srun --x11 -A y8 -p RUBY --mem=8GB --pty bash
[mahmood@compute-0-6 ~]$ /state/partition1/scfd/sc -t10

raises two problems:
1- Slurm assumes that the user's job is using only one core. That means
a user can create 20 interactive sessions and, in each session,
launch the program with 10 threads, bypassing the core limit I set
before.

2- A user can start the session with ntasks=1 (or without specifying
it) and then cheat the system by launching the program with more
threads than the CPU limit (specifying -t50).

Any idea?



Regards,
Mahmood
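
One possible way to approach both problems, sketched under assumptions: request the cores explicitly with `--cpus-per-task` so that Slurm counts them against the limit, and derive the thread count from the allocation instead of typing it by hand. The value `--cpus-per-task=10` and the helper name `allowed_threads` are illustrative and not from the thread. The helper is only advisory; actually preventing a process from using more cores than it was allocated depends on how the site has configured core confinement (for example a cgroup-based task plugin), which is an administrator setting.

```shell
# Start the session with the cores the application will actually use
# (values copied from the thread; --cpus-per-task=10 is an assumption):
#   srun --x11 -A y8 -p RUBY --ntasks=1 --cpus-per-task=10 --mem=8GB --pty bash

# Hypothetical helper: cap a requested thread count at the allocation size.
# SLURM_CPUS_PER_TASK is set by Slurm when --cpus-per-task is given;
# outside such an allocation the helper falls back to 1.
allowed_threads() {
    requested=$1
    limit=${SLURM_CPUS_PER_TASK:-1}
    if [ "$requested" -gt "$limit" ]; then
        echo "$limit"
    else
        echo "$requested"
    fi
}

# Inside the session, for example:
#   /state/partition1/scfd/sc -t"$(allowed_threads 50)"
```

With this form, 20 sessions of 10 threads each would show up to Slurm as 200 requested cores rather than 20, so the partition limit applies to what is really being used.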

Mahmood Naderan

May 19, 2018, 8:07:22 AM
to Slurm User Community List
Excuse me, how can I tell slurm not to terminate until all steps
(tasks) are finished?

Regards,
Mahmood
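
For reference, the srun man page documents `-W, --wait=<seconds>`: how long srun waits after the first task terminates before terminating the remaining tasks. A value of 0 means an unlimited wait (the "First task exited 60s ago" warning is still printed), and the default comes from the `WaitTime` parameter in slurm.conf. A minimal sketch, assuming the same options as the original command; the wrapper name `start_session_nowait` is hypothetical:

```shell
# Hypothetical wrapper: the original 10-task command plus --wait=0,
# so the step is not terminated when some tasks exit before the others.
start_session_nowait() {
    srun --wait=0 --x11 -A y8 -p RUBY --ntasks=10 --mem=8GB --pty bash
}
```

Whether this is preferable to a single task with several CPUs depends on whether the application really consists of 10 separate tasks or of one process with 10 threads.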