Sending multiple bulk jobs to a PBS resource


Tobias Mahlmann

Mar 15, 2011, 8:43:34 PM
to migrid
Hi list,
I'm trying to configure MiG so that it sends multiple bulk
jobs (preferably from different users) to a PBS resource as long as it
can supply the required job resources. I set "bulk" as the standard
job type in the MiGServer.conf, the resource config and the job
description. As far as I can tell from reverse engineering MiG (is there
any documentation on how it works apart from the user guideline, resource
owner docs and coding guidelines? I haven't found any), the problem is
that the resource script only pulls one job at a time.
Am I doing something completely wrong here, or how is the concept of
"bulk" jobs supposed to work in MiG?

Any help appreciated!

Best,
Tobias
--
Tobias Mahlmann
PhD Student
IT University of Copenhagen
Center for Computer Games Research
Rued Langgaards Vej 7 4B11
DK-2300 KBH S
E-mail: tm...@itu.dk
Phone: +45 72 18 52 98
Web: game.itu.dk

Jonas Bardino

Mar 16, 2011, 4:34:55 AM
to migrid
Hi Tobias

Bulk jobs in the form of packing multiple jobs inside a single job
slot on the resources is an experimental feature and it never really
gained traction in production, so I don't know for sure if it still
works. The MiG design relies on a single job only per resource exe
node, so the bulk job implementation makes a number of hacks to work
around that assumption. The proper way to implement it would be a
redesign to remove the single job limitation and let each resource exe
node take compatible jobs until it is full, but we haven't had the
resources to work on that so far. Please also refer to the "Python-
only resources" ticket in our issue tracker:
http://code.google.com/p/migrid/issues/detail?id=61

For LRMS resources (e.g. those with PBS), a similar but more tested
behavior can be achieved by creating multiple exe nodes for the
resource. In that way you can decide how many concurrent jobs you want
to allow while leaving the actual concurrency management to PBS. If
two jobs fit at the same time PBS will run them concurrently, and
otherwise the MiG scheduler will decide if one should get locally
queued based on the queue-delay mechanism. This is of course
suboptimal compared to bulk fill by MiG but in real life it works well
enough. With the MiG queue-delay feature it is still possible to
compensate somewhat for queue load on the PBS resource. You can find
an example queue-delay script for PBS in the mig/resource/pbs-scripts
folder.
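
To illustrate the idea behind queue-delay, here is a minimal Python sketch. It is not the script shipped in mig/resource/pbs-scripts; the function name, the per-job wait figure and the sample lines are hypothetical, and it assumes the job state sits in the fifth column of a qstat-style listing, which you should verify on your own PBS installation.

```python
def estimate_queue_delay(qstat_lines, seconds_per_queued_job=600, state_column=4):
    """Return a rough delay estimate in seconds from qstat-style job lines.

    Assumption: fields are whitespace-separated and the job state
    ('Q' for queued, 'R' for running) is found at index state_column.
    """
    queued = 0
    for line in qstat_lines:
        fields = line.split()
        # Only jobs still waiting in the queue add to the expected delay.
        if len(fields) > state_column and fields[state_column] == "Q":
            queued += 1
    return queued * seconds_per_queued_job


# Hypothetical sample listing: one running job and two queued jobs.
sample = [
    "101.pbs  sim_a  mig1  00:10:02  R  batch",
    "102.pbs  sim_b  mig2  0         Q  batch",
    "103.pbs  sim_c  mig2  0         Q  batch",
]
print(estimate_queue_delay(sample))
```

A larger estimate would tell the scheduler that a job sent to this exe node is likely to wait in the local PBS queue for a while.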

If you want to support multiple concurrent jobs even from different
MiG users you can create multiple local users on the PBS resource and
assign them to individual exe nodes. In that way you get a more
general multi-job concurrency than the MiG bulk setup where only jobs
from the same user can be bulk-executed.

We do not have much documentation on the server internals apart from
the Wiki (http://code.google.com/p/migrid/wiki/DeveloperTopics), the
README (http://code.google.com/p/migrid/source/browse/trunk/README)
and inline comments in the code, but feel free to ask here!

Cheers, Jonas

Tobias Mahlmann

Mar 17, 2011, 8:55:04 AM
to migrid
Thanks Jonas, that was most helpful. Creating several exe nodes and
users works fine for me.

But I've got a little follow-up question:
Whenever a user cancels a job, the scheduler restarts the exe resource
(via "kill" by default in the resource config), but the job stays in
the PBS queue. I looked into the exe node masterscript to see if I can
call it with a "remove job then die" parameter instead. I saw that the
LRMS stop_job script from the pbs-scripts directory is referenced there
but never called. Is there a more advanced version of the script
available, should I use another one, or should I alter the template
myself? (Or did I misconfigure something and this should work
out of the box?) I'm a bit unsure when LRMSREMOVECOMMAND is supposed to
be called and by whom.

Thanks

Tobias

Jonas Bardino

Mar 19, 2011, 11:38:15 AM
to migrid
You're welcome!

It seems LRMSREMOVECOMMAND is only used during stop when the resource
uses either of the *-execution-leader LRMS setups (look for
remove_job_command in mig/resource/dummy_node_script.sh). Timeout
restart includes a stop call, so that too should work in the leader
case.
We should definitely use the same procedure for resources without
-execution-leader, but as you can see in mig/shared/resource.py we
still use a raw kill command for the non-leader case. The
master_node_script does not currently support action handlers like
dummy_node_script, so supporting it will require some structural
changes.
I'll add a bug report for now, but I think you should be able to work
around it and limit the resource load by using the more efficient
-execution-leader LRMS choice.
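
As a rough illustration of the difference between the two stop paths, here is a small Python sketch. It is not the actual code from mig/shared/resource.py or dummy_node_script.sh; the function name and arguments are hypothetical, and it assumes that qdel is the LRMS remove command on a PBS resource.

```python
def build_stop_commands(pid, lrms_job_id=None, remove_command="qdel"):
    """Return the commands to run, in order, when a job is cancelled.

    Hypothetical helper: with an LRMS job id the job is first removed
    from the queue, otherwise only the raw kill is issued.
    """
    commands = []
    if lrms_job_id:
        # Ask the LRMS to drop the job so it does not linger in the queue.
        commands.append([remove_command, lrms_job_id])
    # Then stop the local process that was supervising the job.
    commands.append(["kill", str(pid)])
    return commands


# Non-leader behaviour as described above: raw kill only.
print(build_stop_commands(4242))
# Desired behaviour: remove the job from the PBS queue, then kill.
print(build_stop_commands(4242, lrms_job_id="103.pbs"))
```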

Have a nice weekend!

Cheers, Jonas

Tobias Mahlmann

Mar 28, 2011, 11:10:53 AM
to migrid
>On 19 Mar., 17:38, Jonas Bardino <jonas.bard...@gmail.com> wrote:
>[..] I think you should be able to work
>around it and limit the resource load by using the more efficient
>-execution-leader LRMS choice.

Sorry for the late reply, I had to wait until our system became idle
again to play around with its configuration.

It turned out that I had it in batch-execution-leader mode already,
but there was no "execution-leader" node in my configuration. I was
never aware that there has to be such a virtual node, nor did MiG print
any kind of warning. It seems that the resource configuration
interface only adds one if you create a *-execution-leader
configuration from scratch (not if you change an existing resource).
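
The kind of warning I was missing could look roughly like the Python sketch below. This is only an illustration and not part of MiG; the function name, its arguments and the warning text are hypothetical, and it assumes the virtual node is simply called "execution-leader".

```python
def check_leader_config(lrms_type, node_names, leader_name="execution-leader"):
    """Return a list of warnings for a resource configuration.

    Hypothetical check: a *-execution-leader LRMS type needs a virtual
    leader node among the configured exe nodes.
    """
    warnings = []
    if lrms_type.endswith("-execution-leader") and leader_name not in node_names:
        warnings.append(
            "LRMS type '%s' requires a '%s' node, but none is configured"
            % (lrms_type, leader_name)
        )
    return warnings


# A setup like mine before the fix: leader mode but no leader node.
for message in check_leader_config("batch-execution-leader", ["exe0", "exe1"]):
    print("WARNING: " + message)
```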

Anyway, jobs now get correctly removed from the PBS queue when
cancelled through MiG and all my problems are solved (for now) ;-)


Cheers,
Tobias

Jonas Bardino

Mar 28, 2011, 2:45:32 PM
to migrid
Good!
The missing leader thing sounds like a bug. Thanks for reporting it,
I'll put it on the list.

Cheers, Jonas

Jonas Bardino

Mar 30, 2011, 12:20:56 PM
to migrid
Hi Tobias

I haven't been able to reproduce the missing execution-leader problem
with either the 'Batch' or the 'Native' LRMS choice during creation. I
tried them with one and eight exe nodes respectively and got an execution
leader in both cases when I used the resource editor to switch to the
corresponding -execution-leader LRMS setting afterwards.

Can you please provide some more info or, better, a step-by-step guide
for triggering the problem? I tested with svn trunk, so if you
encountered it with e.g. an older official release, that would also be
a useful piece of information.

Cheers, Jonas