Cancel job after some time


Hector Barranco

Oct 29, 2020, 4:50:52 AM
to schedulix
Hello,

Sometimes a job doesn't finish because of some error. Is there any way to kill a job when it runs longer than a certain time?

I tried setting the Expected Runtime parameter, but it didn't work.



Any idea?

Kind regards,

Ronald Jeninga

Oct 29, 2020, 8:56:12 AM
to schedulix
Hi Hector,

the expected runtime parameter is just an informational property which isn't used within the server.
It is possible to retrieve the value using the predefined parameter $EXPRUNTIME.
I guess you've been able to set the parameter (that worked), but it didn't have the effect you were hoping for.

There's a far more important property: the "KILL PROGRAM".
If you enter something like "kill -KILL -$PID" here and you submit the job, you'll find that an extra button in the button bar is active.
It's the button that looks a bit like cross hairs.
If you press that button, the system will execute the defined kill program. (It will be executed by the jobserver that runs the job you'd like to kill).
In my example it will execute the command line "kill -KILL -1234", if the PID of the process to kill happens to be 1234.
(And I usually use the negative value because that will kill the entire process group, which is important if the top level process that has to be killed has started a bunch of child processes. This is often the case if it is a shell script).
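
To illustrate what the negative PID does, here is a minimal Python sketch (not part of schedulix). It assumes a POSIX system and that the top-level process is the leader of its own process group; whether that holds for your jobs depends on how they are started. The function names are made up for the example.

```python
import os
import signal
import subprocess


def kill_process_group(pid):
    """Rough equivalent of `kill -KILL -PID`: SIGKILL the whole group led by pid."""
    os.killpg(pid, signal.SIGKILL)


def demo():
    # Start a shell in its own session, so its PID is also its process group id.
    # The shell spawns a background child, like a script that starts subprocesses.
    proc = subprocess.Popen(
        ["/bin/sh", "-c", "sleep 60 & wait"],
        start_new_session=True,
    )
    kill_process_group(proc.pid)
    # A negative return code means the process was terminated by that signal.
    return proc.wait()


if __name__ == "__main__":
    print(demo())
```

Killing only the shell's PID would leave the `sleep` child running; signalling the group takes both down.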

We've made this configurable for a bunch of reasons.
First of all it isn't guaranteed that a kill -KILL (which doesn't allow the killed process to take any action in response to the signal) is the best way to get rid of a process.
Maybe it is a process that interacts with a database system and it'll be important to interrupt the current running query and to terminate the connection first.
Secondly the way to kill processes (i.e. remove them from the process list) is operating system dependent. Again an indication that the kill program must be configurable.
Thirdly a configurable kill program offers a bit of extra functionality.
For example, it is very common to send a SIGHUP to daemon processes to instruct them to re-read their configuration.
This is something that can be mimicked with the kill program.
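
As an example of a gentler kill program, the following Python sketch sends SIGTERM first and only falls back to SIGKILL if the process group is still around after a grace period. This is just one possible approach under the assumption of a POSIX system; the function names and the 5 second grace period are arbitrary choices for the illustration, not anything schedulix prescribes.

```python
import os
import signal
import subprocess
import threading
import time


def terminate_group(pgid, grace_seconds=5.0):
    """Send SIGTERM to a process group, then SIGKILL if it is still alive."""
    os.killpg(pgid, signal.SIGTERM)
    deadline = time.monotonic() + grace_seconds
    while time.monotonic() < deadline:
        try:
            os.killpg(pgid, 0)  # signal 0 only checks whether the group exists
        except ProcessLookupError:
            return  # group is gone, nothing left to do
        time.sleep(0.1)
    try:
        os.killpg(pgid, signal.SIGKILL)
    except ProcessLookupError:
        pass  # it exited just before the fallback


def demo():
    proc = subprocess.Popen(["sleep", "60"], start_new_session=True)
    # Reap the child in the background, the way a parent process normally would.
    reaper = threading.Thread(target=proc.wait)
    reaper.start()
    terminate_group(proc.pid)
    reaper.join()
    return proc.returncode


if __name__ == "__main__":
    print(demo())
```

A process that handles SIGTERM (for instance to cancel a running query and close its database connection) gets the chance to do so before the hard kill.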

Hence after specifying the kill program, the system now knows how to kill that job.
The next step is to automate the killing.

Obviously we'll need a kind of trigger to achieve this.
But since there's no event that can fire the trigger, we'll have to regularly check the situation.
The UNTIL_FINISHED and UNTIL_FINAL types of triggers do exactly this.
In the definition you specify a condition and if this condition evaluates to true (at run time), the trigger will fire.

Let's think about the condition first, and afterwards we'll think about what to do if the trigger fires.

As you've discovered already, there is this property Expected Runtime.
And as I've told you above, this value can be accessed using the EXPRUNTIME parameter.
There are two other interesting standard parameters: STARTTIME and SYSDATE.
The STARTTIME parameter is set to the time a job is actually started. SYSDATE just returns the current date and time.
Thus if "$STARTTIME + $EXPRUNTIME < $SYSDATE", we have the situation that the job runs longer than expected.
In order to guarantee that the condition is indeed evaluated numerically, it'll make sense to do some explicit type conversions:

int($STARTTIME) + int($EXPRUNTIME) < int($SYSDATE)

Furthermore, a nice feature of the UNTIL_* triggers is that the condition is evaluated at least once.
If a job finishes, the condition is evaluated and the trigger is fired if the evaluation yields true.
But even if the expected runtime has been exceeded at that time, it won't make sense to attempt to kill a process that no longer exists.
In other words, we'll have to slightly modify the condition, e.g.:

($STATE == "RUNNING" or $STATE == "TO_KILL" or $STATE == "KILLED") and (int($STARTTIME) + int($EXPRUNTIME) < int($SYSDATE))
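
To make the logic of that condition easier to check, here is a small Python model of it. It is not how the server evaluates the expression; it only mirrors the boolean structure, and it assumes that all three values are plain integers in the same unit (seconds in this sketch). The function name is made up.

```python
# States in which a kill attempt still makes sense, taken from the condition above.
ACTIVE_STATES = {"RUNNING", "TO_KILL", "KILLED"}


def should_fire(state, starttime, expruntime, sysdate):
    """Return True if the job is still active and has exceeded its expected runtime."""
    still_active = state in ACTIVE_STATES
    overdue = int(starttime) + int(expruntime) < int(sysdate)
    return still_active and overdue


if __name__ == "__main__":
    # Started at t=1000 with an expected runtime of 60: overdue once "now" passes 1060.
    print(should_fire("RUNNING", 1000, 60, 1061))
    print(should_fire("RUNNING", 1000, 60, 1060))
    # A job that is no longer active never fires, however late it is.
    print(should_fire("FINISHED", 1000, 60, 5000))
```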

After having figured out what the condition (more or less) looks like, we'll have to think about the action to take.
Another nice feature (I'm using release 2.9) is that jobs are allowed to commit suicide.
Hence if a job connects to the system and issues an "ALTER JOB WITH KILL;", the kill program will be run.

This means that we can create a generic job that kills jobs on their behalf.
We'll need some information about the job to kill: the JOBID and the KEY.
So let me assume that a job that wants to be killed defines two parameters: MYJOBID and MYKEY with the respective values "$JOBID" and "$KEY".
The triggered job that wants to kill the triggering job will run as a child of that triggering job and is able to access those parameters.
If it defines a run program like:

/bin/bash -c "echo 'alter job with kill;' | sdmsh -h $SDMSHOST -p $SDMSPORT -j $MYJOBID -k $MYKEY"

we're done :-)
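
If the killer job is written as a script rather than a one-liner, the same call could be wrapped roughly like this in Python. The sdmsh options (-h, -p, -j, -k) and the statement are the ones from the command line above; the function names, and the host, port, job id and key values in the usage part, are placeholders invented for the example.

```python
import subprocess

KILL_STATEMENT = "alter job with kill;\n"


def build_kill_command(host, port, jobid, key):
    """Build the sdmsh argument list that connects as the job to be killed."""
    return ["sdmsh", "-h", str(host), "-p", str(port), "-j", str(jobid), "-k", str(key)]


def run_kill(host, port, jobid, key):
    """Pipe the kill statement into sdmsh, like `echo '...' | sdmsh ...`."""
    argv = build_kill_command(host, port, jobid, key)
    # Requires sdmsh on the PATH; returns its exit code.
    return subprocess.run(argv, input=KILL_STATEMENT, text=True).returncode


if __name__ == "__main__":
    # Placeholder values; a real job would take them from its parameters
    # (SDMSHOST, SDMSPORT, MYJOBID, MYKEY).
    print(build_kill_command("localhost", 2506, "4711", "secret"))
```

Passing the arguments as a list avoids quoting problems if a key ever contains characters the shell would interpret.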

I think it is obvious that killing jobs is just one kind of action that can be taken.
It'll be just as easy to open tickets in a ticketing system, to send e-mails or maybe even to start the coffee machine (it's going to be a looong night ;).

I hope this answers your question.
If I managed to confuse you entirely or if you get stuck somewhere, don't hesitate to ask.

Best regards,

Ronald