rerun command not executed on failed job restart


Bharani Reddy

Jan 28, 2016, 12:49:40 PM
to schedulix
Hello,

When I rerun a failed job, the rerun command is not executed. I noticed that the rerun command line is set to none. A screenshot is attached.
Can you please let me know what I am missing here?

Thanks,
Bharani
rerun program.PNG

Ronald Jeninga

Jan 28, 2016, 6:55:33 PM
to schedulix
Hi Bharani,

Actually, it works as designed, but let me explain.

The idea behind the rerun program is that if a program is executed again, it might be necessary to inform that program about it (e.g. different cmdline options), or you might want to execute a small cleanup routine.
Hence you define something like

run program = 'FullGraphical4DHelloWorld'

and a rerun program like

rerun program = 'sh -c "CleanupFirst; FullGraphical4DHelloWorld"'

This is probably what you expected.


In your case, the situation is slightly different.
The jobserver responsible for executing the command line tries to execvp() your executable "echo1", which doesn't exist (2/No such file or directory).
This means that echo1 wasn't executed at all! That in itself means that the system is in exactly the same state as it was before the job was submitted.
Ergo: it would be a mistake to run the rerun program.
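
To make the distinction between "could not be started" and "ran and failed" concrete, here is a minimal Python sketch. It is not schedulix code; the function name classify_launch and the returned labels are made up for illustration. It only shows that a failed exec (such as errno 2, ENOENT) is reported differently from a program that ran and returned a non-zero exit code.

```python
import errno
import subprocess
import sys


def classify_launch(argv):
    """Hypothetical helper: tell a failed launch apart from a failed run.

    Returns ("not_started", errno) if the executable could not be launched,
    or ("exited", returncode) if it ran and terminated.
    """
    try:
        completed = subprocess.run(argv)
    except OSError as exc:
        # The exec step itself failed (e.g. ENOENT), so the program never ran.
        return ("not_started", exc.errno)
    return ("exited", completed.returncode)
```

Calling classify_launch(["echo1"]) on a machine without an echo1 executable yields the "not_started" case with errno.ENOENT, whereas a command such as [sys.executable, "-c", "raise SystemExit(1)"] yields the "exited" case with exit code 1.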

If you change your run program to '/bin/false', or 'sh -c "exit 1"', or even better simply '1', you can observe that after a "failed" execution the rerun program will be used.
(Actually, my first two suggestions would execute without any error, but they return exit code 1.)
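
The rule described above can be summarised in a few lines of Python. This is only a sketch of the behaviour as explained in this thread, not the actual server logic. The function name pick_command and its parameters are hypothetical, and falling back to the run program when no rerun program is defined is an assumption.

```python
def pick_command(run_program, rerun_program, previous_attempt_started):
    """Hypothetical sketch: choose the command line for a job restart.

    previous_attempt_started is True only if the executable was actually
    launched on the previous attempt (so it may have left something behind).
    """
    if previous_attempt_started and rerun_program is not None:
        # The program really ran and failed: use the rerun program.
        return rerun_program
    # Nothing was executed (e.g. execvp() failed), or no rerun program
    # is defined (assumption): use the normal run program again.
    return run_program
```

With run program 'echo1' the previous attempt never started, so the run program is chosen again. With run program '/bin/false' the previous attempt did start, so the rerun program is chosen.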

Two words on my last suggestion, run program = '1'.
It is often the case that for technical reasons a job is required (e.g. because you want to allocate resources), but there's nothing to do.
This led to a lot of "/bin/true" jobs in the past. Now if you think about a job and its overhead, such a /bin/true is very expensive.
So we came up with a solution: if the run program, after variable substitution (which is done at the time a suitable jobserver does a "GET NEXT JOB;"), is a numerical value, then the job is not handed over to the requesting jobserver. Instead, the job is set to state FINISHED, as if it had run and terminated with an exit code equal to the evaluated run program.
If you happen to have an executable called "4711" or so, you'll have to call it like './4711', or 'sh -c 4711'. I have never seen a program with a number as a name in my life, though.
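
The numeric shortcut can be sketched like this in Python. Again, this is an illustration and not schedulix code: the function name resolve_run_program and the result labels are hypothetical, and treating "numerical value" as a plain string of decimal digits is an assumption.

```python
import re


def resolve_run_program(run_program):
    """Hypothetical sketch: decide what happens to a job at dispatch time.

    run_program is the command line after variable substitution.
    Returns ("finished", exit_code) for a purely numeric run program,
    otherwise ("dispatch", run_program).
    """
    text = run_program.strip()
    # Assumption: "numerical" means one or more decimal digits and nothing else.
    if re.fullmatch(r"[0-9]+", text):
        # No process is started; the number is used as the exit code.
        return ("finished", int(text))
    # Anything else is handed to the jobserver for execution.
    return ("dispatch", run_program)
```

So resolve_run_program("1") finishes the job with exit code 1 without starting a process, while resolve_run_program("./4711") is still dispatched, because the run program is no longer a bare number.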

Regards,

Ronald

Bharani Reddy

Jan 29, 2016, 9:54:32 AM
to schedulix
Thank you for the clarification, Ronald. It makes sense now.