submitted condor jobs sometimes start after a short delay


Tobias

Aug 2, 2010, 1:26:33 PM
to acis.p2p.users
Hi,
I'm trying to do some performance tests on a local grid appliance
infrastructure. I have:

- one ga server
- one ga client
- four ga workers
- (one IPOP server)

After the client and all workers are connected to the condor pool
(verified with condor_status), I submit several jobs. Sometimes the
jobs are executed immediately, sometimes there is a delay of 10
seconds. Sometimes there is also a delay of 10 seconds between the
execution of 2 jobs.

Does anybody know the reason for this delay and how I can change that?


Greetings,
Tobias

Kristof overdulve

Aug 2, 2010, 1:51:21 PM
to acisp2...@googlegroups.com
Hi Tobias,

Condor's matchmaker works in so-called negotiation cycles. In each cycle, the condor_negotiator (on the ga server) queries every condor_schedd daemon (on the ga client) for the jobs in its job queue. Negotiation cycles occur every minute by default, so under low load a job is matched somewhere between 0 and 60 seconds after submission.

The reason two jobs are not matched immediately, even though workers might be available, is probably the fairness policy employed by the condor negotiator. It hands out pie slices that indicate how many machines each client may use. Once one of your jobs has been matched and your pie slice is exceeded, the negotiator moves on to prompt other potential job submitters. When no other nodes submit jobs, the negotiator comes back to you and asks for your next job. It is a bit curious, but this process could take about ten seconds. Another possible reason is that the negotiator decides you have already used too many machines and will only grant you another machine in the next negotiation cycle. It is best to check the output logs to see which of these applies.
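
To make the 0-60 second range concrete, here is a small Python sketch of the timing. It is a deliberately simplified model, not Condor's actual scheduling logic: it assumes cycles start at fixed multiples of the interval and ignores fair-share effects and anything else that might trigger an earlier cycle. The function name `seconds_until_next_cycle` is made up for this illustration.

```python
def seconds_until_next_cycle(submit_time_s, interval_s=60):
    """Seconds a job waits for the next negotiation cycle.

    Simplified model (an assumption, not Condor's real behaviour):
    cycles start at t = 0, interval_s, 2 * interval_s, ... and a job
    submitted at submit_time_s is picked up by the first cycle that
    starts at or after that moment.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    # Python's % always returns a value in [0, interval_s) for a positive divisor.
    return (-submit_time_s) % interval_s


if __name__ == "__main__":
    for t in (0, 12, 59):
        print(f"submitted at t={t:2d}s -> waits {seconds_until_next_cycle(t)}s")
    # With a 30-second interval the worst-case wait is halved.
    print(f"t=12s, interval=30s -> waits {seconds_until_next_cycle(12, 30)}s")
```

In this model the average wait is half the interval, which is why shortening the interval helps on a small, lightly loaded pool.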

For other questions about problems like this one, it is best to consult the Condor manual, paying close attention to the administrator part. The delay between consecutive negotiation cycles can be changed in the configuration files; the manual describes exactly how this is done. Given the small size of your grid, you could lower the delay to 30 seconds or so.
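
As a rough sketch of what such a change might look like: the setting that controls the time between negotiation cycles is, to my knowledge, NEGOTIATOR_INTERVAL (in seconds). The file it belongs in is an assumption here, since the grid appliance may keep its Condor configuration elsewhere, so please verify both against the manual for your Condor version.

```
# Local Condor configuration on the central manager (the ga server).
# Which file this lives in on the grid appliance is an assumption.

# Seconds between the start of consecutive negotiation cycles.
NEGOTIATOR_INTERVAL = 30
```

After editing, `condor_reconfig` on the ga server should make the daemons re-read the configuration, and `condor_config_val NEGOTIATOR_INTERVAL` shows the value currently in effect. The manual also documents a NEGOTIATOR_CYCLE_DELAY setting (a minimum gap between cycles), which may be worth checking if short delays remain.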

With kind regards,
Kristof.


--
You received this message because you are subscribed to the Google Groups "acis.p2p.users" group.
To post to this group, send email to acisp2...@googlegroups.com.
To unsubscribe from this group, send email to acisp2pusers...@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/acisp2pusers?hl=en.




--

With kind regards,
Kristof Overdulve.

Tobias

Aug 10, 2010, 10:30:36 AM
to acis.p2p.users
Hi Kristof,
thank you for your answer. I'm looking forward to finding my problem ;-)

I have one more question. On my ga client, where I submitted my jobs,
I found the following log entries in SchedLog (for job IDs 105, 106,
and 107):

8/8 11:40:13 (pid:3030) Started shadow for job 105.0 on C145163224.ipop <9.145.163.224:32991> for grid...@C218097024.ipop, (shadow pid = 7363)
8/8 11:40:13 (pid:3030) Called reschedule_negotiator()
8/8 11:40:13 (pid:3030) Sent REQUEST_CLAIM to startd C168093093.ipop <9.168.93.93:58251> for grid...@C218097024.ipop
8/8 11:40:13 (pid:3030) Starting add_shadow_birthdate(106.0)
8/8 11:40:13 (pid:3030) Started shadow for job 106.0 on C168093093.ipop <9.168.93.93:58251> for grid...@C218097024.ipop, (shadow pid = 7373)
8/8 11:40:17 (pid:3030) Sent ad to central manager for grid...@C218097024.ipop
8/8 11:40:17 (pid:3030) Sent ad to 1 collectors for grid...@C218097024.ipop
8/8 11:40:33 (pid:3030) Activity on stashed negotiator socket
8/8 11:40:33 (pid:3030) Negotiating for owner: grid...@C218097024.ipop
8/8 11:40:33 (pid:3030) Checking consistency running and runnable jobs
8/8 11:40:33 (pid:3030) Tables are consistent
8/8 11:40:33 (pid:3030) Rebuilt prioritized runnable job list in 0.000s.
8/8 11:40:33 (pid:3030) Out of jobs - 1 jobs matched, 0 jobs idle, flock level = 0
8/8 11:40:33 (pid:3030) Sent ad to central manager for grid...@C218097024.ipop
8/8 11:40:33 (pid:3030) Sent ad to 1 collectors for grid...@C218097024.ipop
8/8 11:40:33 (pid:3030) Sent REQUEST_CLAIM to startd C217238175.ipop <9.217.238.175:41401> for grid...@C218097024.ipop
8/8 11:40:33 (pid:3030) Starting add_shadow_birthdate(107.0)
8/8 11:40:33 (pid:3030) Started shadow for job 107.0 on C217238175.ipop <9.217.238.175:41401> for grid...@C218097024.ipop, (shadow pid = 7383)


So job 107 started 20 seconds after jobs 105 and 106, even though I
submitted all 3 jobs within 2 seconds (see the output from my script
below). This is the delay I don't understand.
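
For anyone who wants to measure this on a longer SchedLog, the Python sketch below pulls the "Started shadow for job" entries out of the log text and reports how long after the first one each shadow started. The embedded sample is an abbreviated copy of the entries quoted above; the function names and the `year` parameter are made up for this illustration, and the regular expressions only assume the line layout visible in this excerpt.

```python
import re
from datetime import datetime

# Abbreviated copy of the SchedLog entries quoted above. Only the
# timestamped first line of each wrapped entry is needed.
SCHED_LOG = """\
8/8 11:40:13 (pid:3030) Started shadow for job 105.0 on
8/8 11:40:13 (pid:3030) Called reschedule_negotiator()
8/8 11:40:13 (pid:3030) Started shadow for job 106.0 on
8/8 11:40:33 (pid:3030) Activity on stashed negotiator socket
8/8 11:40:33 (pid:3030) Started shadow for job 107.0 on
"""

ENTRY_RE = re.compile(r"^(\d{1,2}/\d{1,2} \d{2}:\d{2}:\d{2}) \(pid:\d+\) (.*)$")
SHADOW_RE = re.compile(r"Started shadow for job (\d+\.\d+)")


def shadow_start_times(log_text, year=2010):
    """Map job ID to the time its shadow was started, according to the log."""
    starts = {}
    for line in log_text.splitlines():
        entry = ENTRY_RE.match(line)
        if entry is None:
            continue  # continuation of a wrapped entry, or unrelated text
        job = SHADOW_RE.search(entry.group(2))
        if job:
            # The log omits the year, so the caller has to supply it.
            stamp = datetime.strptime(
                f"{year}/{entry.group(1)}", "%Y/%m/%d %H:%M:%S"
            )
            starts[job.group(1)] = stamp
    return starts


def delays_after_first(starts):
    """Seconds between the earliest shadow start and each job's shadow start."""
    first = min(starts.values())
    return {job: (stamp - first).total_seconds() for job, stamp in starts.items()}


if __name__ == "__main__":
    delays = delays_after_first(shadow_start_times(SCHED_LOG))
    for job, delay in delays.items():
        print(f"job {job}: shadow started {delay:.0f} s after the first one")
```

On the excerpt above this reports 0 s for jobs 105.0 and 106.0 and 20 s for job 107.0, matching the gap between the 11:40:13 and 11:40:33 entries.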

Sun Aug 8 11:40:12 UTC 2010
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 105.
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 106.
Submitting job(s).
Logging submit event(s).
1 job(s) submitted to cluster 107.
Sun Aug 8 11:40:14 UTC 2010

-- Submitter: C218097024.ipop : <9.218.97.24:9501> : C218097024.ipop
 ID      OWNER      SUBMITTED     RUN_TIME ST PRI SIZE CMD
 105.0   griduser   8/8 11:40   0+00:00:01 R  0   0.0  script
 106.0   griduser   8/8 11:40   0+00:00:00 R  0   0.0  script
 107.0   griduser   8/8 11:40   0+00:00:00 I  0   0.0  script

Can you explain this delay?

Thank you very much,
Tobias

