Machines currently online, job dispatch CPU usage, and "Quick start" tutorial questions


Nate

Sep 17, 2009, 2:43:13 PM
to Archer User's Group
condor_status reports 71 machines in the Archer pool (55 Intel and 16
X86_64). The Archer wiki says "Archer currently has a pool of
approximately 240 CPUs" but doesn't say how many machines there are.
Is 71 the expected number to see?

I am running the Grid Appliance on a fairly slow host. During
submission of jobs the virtual (and the physical) CPU grinds to a
halt. Is this expected on slow hosts, and if so, is there a way to
reduce CPU usage during job submission? If there are over a couple
hundred jobs the virtual CPU usage appears to shut out the condor
system and condor_status shows an error for an hour or so, during
which times all jobs report "idle". After this time the connection
returns and the jobs go out for execution. I can provide more details
if this is not previously known behavior.

Finally (and this is more Grid Appliance related) the "Quick start"
tutorial for VirtualBox says "Instructions on how to set up a host-
only network for the Grid Appliance's eth1 NIC will be provided in the
near future." but was last updated in January 2008. With the latest
version of VirtualBox, setting up a host-only network is very easy and
self-contained. I can provide the necessary instructions for whomever
is in charge of that page. More generally, the VirtualBox
instructions and screenshots are slightly off for both the new Grid
Appliance and the new VirtualBox; I can provide updates if that would
be useful.

Thanks,
Nathan

rjo...@gmail.com

Sep 17, 2009, 2:57:22 PM
to Archer User's Group
Nate, see comments below:

On Sep 17, 2:43 pm, Nate <nbly...@gmail.com> wrote:
> condor_status reports 71 machines in the Archer pool (55 Intel and 16
> X86_64).  The Archer wiki says "Archer currently has a pool of
> approximately 240 CPUs" but doesn't say how many machines there are.
> Is 71 the expected number to see?

We're in a transition period this week, migrating to the newer version
of the appliance, so some appliances are still running in the old pool.
The good news is that a new cluster is coming online at UMN, so we
should have about 400 cores by next week.

>
> I am running the Grid Appliance on a fairly slow host.  During
> submission of jobs the virtual (and the physical) CPU grinds to a
> halt.  Is this expected on slow hosts, and if so, is there a way to
> reduce CPU usage during job submission?  If there are over a couple
> hundred jobs the virtual CPU usage appears to shut out the condor
> system and condor_status shows an error for an hour or so, during
> which times all jobs report "idle".  After this time the connection
> returns and the jobs go out for execution.  I can provide more details
> if this is not previously known behavior.
>

One idea is to stagger the submission. Condor tries to transfer
multiple files in parallel, and this can slow down your machine.
There's also a parameter in Condor that limits the maximum number of
jobs submitted concurrently; I'll have to look it up.
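
For anyone who wants to try the staggering idea, here is a minimal sketch in Python. It assumes each job has its own submit description file and that condor_submit is on the PATH; the function names, the batch size of 20, and the 60-second pause are made up for illustration and are not tuned for any particular host.

```python
import subprocess
import time
from typing import Callable, Iterable, List


def chunked(items: List[str], size: int) -> Iterable[List[str]]:
    """Yield successive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def condor_submit(submit_file: str) -> None:
    """Hand one submit description file to the condor_submit command."""
    subprocess.run(["condor_submit", submit_file], check=True)


def submit_staggered(
    submit_files: List[str],
    batch_size: int = 20,           # illustrative value, not a recommendation
    pause_seconds: float = 60.0,    # illustrative value, not a recommendation
    submit: Callable[[str], None] = condor_submit,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Submit files in batches, pausing between batches.

    Returns the number of batches that were submitted.
    """
    batches = 0
    for batch in chunked(submit_files, batch_size):
        if batches > 0:
            sleep(pause_seconds)  # give earlier submissions time to settle
        for path in batch:
            submit(path)
        batches += 1
    return batches
```

Calling submit_staggered(["job0.sub", "job1.sub", ...]) would then trickle the jobs in rather than handing them all to Condor at once. The submit and sleep arguments are only there so the loop can be tried out without a Condor installation.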


> Finally (and this is more Grid Appliance related) the "Quick start"
> tutorial for VirtualBox says "Instructions on how to set up a host-
> only network for the Grid Appliance's eth1 NIC will be provided in the
> near future." but was last updated in January 2008.  With the latest
> version of VirtualBox, setting up a host-only network is very easy and
> self-contained.  I can provide the necessary instructions for whomever
> is in charge of that page.  More generally, the VirtualBox
> instructions and screenshots are slightly off for both the new Grid
> Appliance and the new VirtualBox; I can provide updates if that would
> be useful.

Actually, we've been meaning to move much of the grid appliance
documentation currently on Joomla to the Wiki so others can
contribute; Nate, if you know how to edit the Wiki, would you mind
creating an entry for VirtualBox, say
http://www.grid-appliance.org/wiki/index.php/Archer:VirtualBoxHowTo
and I can link to it from the Grid appliance and Archer pages.

--rf

>
> Thanks,
>  Nathan

Nathan Blythe

Sep 17, 2009, 3:30:25 PM
to archer-us...@googlegroups.com
Thanks for the feedback.

I think the parameter you're referring to is
"CONCURRENCY_LIMIT_DEFAULT" (or the other concurrency features). I
don't mind how many jobs are running simultaneously (the
Condor-suggested use case for that parameter is access to
license-restricted resources), but I'd like to stagger the times at
which they are submitted to Condor. I see some other commands that
would be useful for this, though.

Also, I'm currently running in the Vanilla universe. If I can find
Condor support for Haskell I will recompile for the Standard universe
and enable caching (as I'm providing the same small file to every
job), which should help.

I'll see if I can put some information on the Wiki about VirtualBox.

- Nathan

Nathan Blythe

Sep 17, 2009, 3:42:34 PM
to archer-us...@googlegroups.com
I am going to put the wiki material in the general wiki and not under
the Archer namespace as it relates to the Grid Appliance in general
(as per http://www.grid-appliance.org/wiki/index.php/Using_The_Wiki).
If that isn't right let me know and I'll move it.

- Nathan

Renato Figueiredo

Sep 17, 2009, 3:42:49 PM
to archer-us...@googlegroups.com
Another idea: if you put your binaries on your appliance's local NFS
server, you may get better performance, and you also get the benefit of
caching; see http://www.grid-appliance.org/wiki/index.php/Archer:NFS_HOWTO
(I'd guess you probably have more bytes being transferred for the
binaries than for the input files).

You can also try the following two parameters:

MAX_CONCURRENT_UPLOADS
This specifies the maximum number of simultaneous transfers of input
files from the submit machine to execute machines. The limit applies to
all jobs submitted from the same condor_schedd. The default is 10. A
setting of 0 means unlimited transfers. This limit currently does not
apply to grid universe jobs or standard universe jobs. When the limit
is reached, additional transfers will queue up and wait before
proceeding.

MAX_JOBS_RUNNING
This macro limits the number of processes spawned by a given
condor_schedd, for all job universes except the grid universe. See
section 2.4.1. This includes, but is not limited to, condor_shadow
processes and scheduler universe processes, including condor_dagman.
The actual number of condor_shadows may be less if you have reached
your $(RESERVED_SWAP) limit. This macro has a default value of 200.
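
As a rough illustration of how these two limits might be lowered on a slow submit host, the Python sketch below renders them as "NAME = value" lines of the kind Condor configuration files use. The helper name and the values 2 and 50 are assumptions chosen only to be below the defaults quoted above; where the lines belong (for example a local configuration file) and how Condor is told to re-read them depends on the installation.

```python
from typing import Dict


def render_condor_config(settings: Dict[str, int]) -> str:
    """Render settings as 'NAME = value' lines, one per macro."""
    return "\n".join(f"{name} = {value}" for name, value in settings.items()) + "\n"


# Illustrative values only: lower than the defaults quoted above (10 and 200).
slow_host_settings = {
    "MAX_CONCURRENT_UPLOADS": 2,
    "MAX_JOBS_RUNNING": 50,
}

print(render_condor_config(slow_host_settings), end="")
```

The printed lines can be compared against the current settings before changing anything; suitable values would have to be found by experiment on the host in question.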

--
Dr. Renato J. Figueiredo
Associate Professor
ACIS Lab - ECE - University of Florida
UF Site Director, Center for Autonomic Computing
http://byron.acis.ufl.edu
ph: 352-392-6430

Renato Figueiredo

Sep 23, 2009, 8:19:18 AM
to archer-us...@googlegroups.com
On Thu, Sep 17, 2009 at 3:42 PM, Nathan Blythe <nbl...@gmail.com> wrote:

> I am going to put the wiki material in the general wiki and not under
> the Archer namespace as it relates to the Grid Appliance in general
> (as per http://www.grid-appliance.org/wiki/index.php/Using_The_Wiki).
> If that isn't right let me know and I'll move it.


Good point, Nathan. If you have had a chance to work on a VirtualBox howto, please let me know and we'll link to it and ask other VirtualBox users to check/improve it.

--rf

 

Nathan Blythe

Sep 23, 2009, 11:21:40 AM
to archer-us...@googlegroups.com
I wrote enough content to subsume the old non-wiki page, so linking it
is probably safe. The instructions are for the new(er) versions of
VirtualBox and include the host-only networking instructions.

http://www.grid-appliance.org/wiki/index.php/VirtualBox_How_To

Renato Figueiredo

Sep 23, 2009, 1:10:29 PM
to archer-us...@googlegroups.com
Very good, thanks, Nathan! The VirtualBox links now point to your entry.
--rf

Nathan Blythe

Sep 23, 2009, 2:46:42 PM
to archer-us...@googlegroups.com
Thanks, glad to help :)

Going back to my earlier questions about Archer: I changed my
application to do more work in each process. I'm noticing now that my
host is marked as "Owner" in the results of condor_status and that the
CPU usage is at 100% (entirely the mono process). I verified with
condor_q -run that my processes are not executing on my host.

How can I figure out what is executing on my machine? If my host is
marked "owner" then it's not running anyone else's job, correct? And
if my own jobs show up in condor_q with hostnames not matching my own,
then they're not executing locally.

- Nathan

rjo...@gmail.com

Sep 23, 2009, 8:26:20 PM
to Archer User's Group


On Sep 23, 2:46 pm, Nathan Blythe <nbly...@gmail.com> wrote:
> Thanks, glad to help :)
>
> Going back to my earlier questions about Archer: I changed my
> application to do more work in each process.  I'm noticing now that my
> host is marked as "Owner" in the results of condor_status and that the
> CPU usage is at 100% (entirely the mono process).  I verified with
> condor_q -run that my processes are not executing on my host.

Owner means that either there is mouse/keyboard activity on your
machine, or the load is above a threshold (0.3, I think). When a
machine is in the Owner state, it does not run Condor jobs.
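
To make that rule concrete, here is a small Python sketch of the decision as described above. It is only a model of the explanation, not Condor's actual policy code: the function name and the 15-minute idle window are hypothetical, the 0.3 threshold is the value recalled above ("I think"), and the real behavior is whatever the pool's configuration says.

```python
def is_owner_state(
    keyboard_idle_seconds: float,
    non_condor_load: float,
    load_threshold: float = 0.3,        # value recalled in the reply, not verified
    min_idle_seconds: float = 15 * 60,  # hypothetical idle window
) -> bool:
    """Return True if the machine should be treated as in use by its owner."""
    recent_input = keyboard_idle_seconds < min_idle_seconds
    busy = non_condor_load > load_threshold
    return recent_input or busy
```

Under this model, a host whose CPU is pinned by a non-Condor process (load near 1.0) counts as Owner even if nobody has touched the keyboard for an hour, which matches what condor_status is showing here.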

>
> How can I figure out what is executing on my machine?  If my host is
> marked "owner" then it's not running anyone else's job, correct?  And
> if my own jobs show up in condor_q with hostnames not matching my own,
> then they're not executing locally.

Correct, owner means it's not running anyone's job. Seems like mono
and IPOP are taking most of your virtual cpu cycles; this is probably
because you may still be transferring many binaries concurrently for
your runs. You might be able to still bump up the work per job some
more.
