[Condor-users] Job rescheduling


Janito Ferreira Filho

Aug 14, 2009, 10:20:46 AM
to condor...@cs.wisc.edu
Hi,

I've looked further into the rescheduling of jobs after an execute node has died, and although it appears to be working, it's taking too long. If I shut down an execute node with a job running on it, and then restart it, it takes two hours for Condor to remove the failed job (until that point Condor thinks it's still running) and reschedule it (sometimes to run on the same node, which has been unclaimed since the restart). I searched the manual, but I can't find where to configure this two-hour delay. Can someone please point me in the right direction? Thank you,

JVFF



Matthew Farrellee

Aug 17, 2009, 10:48:23 AM
to Condor-Users Mail List
Janito Ferreira Filho wrote:
> Hi,
>
> I've looked further into the rescheduling of jobs after an execute node has died, and although it appears to be working, it's taking too long. If I shut down an execute node with a job running on it, and then restart it, it takes two hours for Condor to remove the failed job (until that point Condor thinks it's still running) and reschedule it (sometimes to run on the same node, which has been unclaimed since the restart). I searched the manual, but I can't find where to configure this two-hour delay. Can someone please point me in the right direction? Thank you,
>
> JVFF

Have a look at ...

http://www.google.com/search?q=site%3Awww.cs.wisc.edu%2Fcondor%2Fmanual%2Fv7.3+claim+alive

Specifically around MAX_CLAIM_ALIVES_MISSED and ALIVE_INTERVAL.

If you're seeing a 2 hour timeout that sounds fairly familiar. I believe Todd answered it previously. I'd assume his answer was to reverse the direction on the alive messages. I'll ping him to include details.

Best,


matt

Todd Tannenbaum

Aug 17, 2009, 10:55:52 AM
to Condor-Users Mail List
Matthew Farrellee wrote:

>
> If you're seeing a 2 hour timeout that sounds fairly familiar. I
> believe Todd answered it previously. I'd assume his answer was to
> reverse the direction on the alive messages. I'll ping him to include
> details.
>

Here is what we think can be done with the current Condor binaries to
address the problem:

Set in the condor_config on **both** the submit machines (running the
condor_schedds) AND the execute machines (running the condor_startds)
the following setting:

STARTD_SENDS_ALIVES = True

Then do a condor_reconfig as usual to both submit and execute machines
(or a condor_reconfig -all). Note that the default setting for this
parameter is False, so if it is not specified in the config it is False.
Unfortunately, Condor will not (yet) gracefully handle the situation
where the value is different on the submit -vs- execute machines.

Upon doing the above, your job ClassAds will contain an attribute
"LastJobLeaseRenewal", an integer holding the epoch time (number of
seconds since 1/1/1970) at which the startd on the execute machine was
last heard from.

So in your job submit description file (which you give to
condor_submit), you could add the following:

PeriodicHold = JobLeaseDuration =!= UNDEFINED && \
((JobLeaseDuration - (CurrentTime - LastJobLeaseRenewal)) <= 0 )
PeriodicRelease = PeriodicHold =?= True

The above says that if the job has a job lease, and the lease has
expired, put the job on hold, thereby moving it from Running state to
Hold state. Then the periodic release expression says that if the lease
has expired (ergo the PeriodicHold expression is true), release the job
from Hold state back to Idle state, at which point it will be
rescheduled someplace else. Note you can use SUBMIT_EXPRS (see Manual)
to have condor_submit automatically add the above policy to every job
submitted.
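
To make the lease arithmetic concrete, here is a small Python sketch of the
same test the PeriodicHold expression performs. It is only an illustration,
not Condor code: the function name lease_expired and the sample numbers are
made up, and None stands in for a ClassAd UNDEFINED value.

```python
from typing import Optional


def lease_expired(job_lease_duration: Optional[int],
                  last_job_lease_renewal: int,
                  current_time: int) -> bool:
    """Mirror of the PeriodicHold test (hypothetical helper, not a Condor API)."""
    # JobLeaseDuration =!= UNDEFINED: a job without a lease is never held.
    if job_lease_duration is None:
        return False
    # Seconds of lease remaining since the last renewal.
    remaining = job_lease_duration - (current_time - last_job_lease_renewal)
    return remaining <= 0


# Made-up sample values: a 1200-second lease last renewed at t = 1250000000.
renewed = 1_250_000_000
print(lease_expired(1200, renewed, renewed + 600))     # False: 600 s left
print(lease_expired(1200, renewed, renewed + 1200))    # True: lease used up
print(lease_expired(None, renewed, renewed + 99_999))  # False: no lease
```

While the function returns False the job stays Running. Once it returns True,
the hold fires and, because PeriodicRelease tests the same condition, the job
is released back to Idle.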

Let us know how the above suggestions go.

In a future release of Condor, we wish to do the following:
a) make STARTD_SENDS_ALIVES default to True
b) have the schedd automatically move a job with an expired lease
from Running back to Idle the moment the lease expires, without
requiring the user to utilize the periodic hold/release expressions, and
without the polling delay that these expressions introduce (the schedd
only periodically evaluates the periodic expressions).
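
The polling delay mentioned in (b) can be estimated with simple arithmetic.
The Python sketch below is only illustrative: the function name is made up,
and the 1200-second lease and 300-second evaluation period are assumed sample
values, not settings taken from this thread.

```python
from typing import Tuple


def hold_delay_bounds(lease_duration: int, poll_interval: int) -> Tuple[int, int]:
    """Return (best, worst) seconds from the last lease renewal until the
    periodic hold expression is next evaluated after the lease has expired.

    Hypothetical helper for illustration; both arguments are in seconds.
    """
    # Best case: an evaluation happens right as the lease expires.
    best = lease_duration
    # Worst case: an evaluation happened just before expiry, so the
    # expired lease is only noticed one full polling period later.
    worst = lease_duration + poll_interval
    return best, worst


# Assumed sample values: 1200 s lease, expressions evaluated every 300 s.
print(hold_delay_bounds(1200, 300))  # (1200, 1500)
```

With the schedd acting at the moment of expiry, as proposed in (b), the worst
case would collapse to the best case.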


--
Todd Tannenbaum University of Wisconsin-Madison
Condor Project Research Department of Computer Sciences
tann...@cs.wisc.edu 1210 W. Dayton St. Rm #4257
Phone: (608) 263-7132 Madison, WI 53706-1685
