sandbox on non-shared filesystem

Joosep Pata

Mar 31, 2014, 8:33:05 AM
to grid-c...@googlegroups.com
Hi,

We’re evaluating grid-control at the CMS group in NICPB (Tallinn) and are finding the tool extremely useful. A few questions have come up, though, and we’d be glad if you could find the time to answer them.

Is it possible to specify a sandbox directory for local cluster jobs that lives only on the worker node (where the task runs)? The issue is that a shared filesystem can be brought down by thousands of simultaneous reads and writes in the sandbox directory while the tasks run. It would be better to put the sandbox on e.g. /tmp or /scratch on the workers.

Also, how is the code licensed? We’ve made some improvements to the code, e.g. a new slurm backend supporting the sbatch and sacct commands, but are unsure about the rights to the original code. Can we make a fork under some acceptable OSS license?

Thanks for the nice software and for making it publicly available.

Cheers,
Joosep Pata

Fred-Markus Stober

Mar 31, 2014, 11:09:13 AM
to Joosep Pata, grid-c...@googlegroups.com
Hi Joosep!

On 03/31/2014 02:33 PM, Joosep Pata wrote:
> Hi,
>
> We're evaluating grid-control at the CMS group in NICPB (Tallinn).
> We are finding this tool extremely useful. However, a few questions
> have arisen, we're glad if you'd find some opportunity to enlighten us.

Nice to hear this - I'm always open to suggestions if you find some room
for improvement. Unfortunately a lot of useful features are a bit hidden,
so I always suggest looking at the examples and the "documentation.conf"
file.

> Is it possible to specify a sandbox directory for local cluster jobs that
> lives only on the worker node (where the task runs)? The issue is that
> a shared filesystem can be brought down by thousands of simultaneous
> read/writes in the sandbox directory by the tasks as they run, it would be
> better to put on e.g. /tmp or /scratch on the workers.

Most local backends (SGE, PBS, ...) are currently written with a shared
space in mind. Only the grid-based backends (glite, Condor) are decoupled
from a shared space. Unifying this to make it completely configurable has
very low priority on the todo list; the condor backend is a bit of a
playground for this. However, the amount of activity on the shared space
can be configured a bit.

* First of all there is the scratch space where the job is executed.
You can specify it with the "[backend] scratch path" option. The default is
"TMPDIR /tmp" - meaning it will first try to use the TMPDIR variable and
then the "/tmp" path. If neither exists it will use the sandbox. I'm not
sure if this works in your case... Maybe you can send the output of a test
job?

* The second thing that can create a lot of activity on the shared space is
the stdout/stderr stream. The SGE and PBS backends have the option
"[backend] delay output" to write stdout / stderr first to the scratch
space (which should be on the WN) while the job is active. It sets a
non-zero value for the GC_DELAY_OUTPUT environment variable, and the same
could be done for your SLURM backend, I guess.
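
The fallback order described in the first bullet can be sketched in Python. This is only an illustration, not grid-control's actual code: the function name, its parameters, and the rule "entries starting with / are paths, everything else is an environment variable name" are assumptions made for the example.

```python
import os


def resolve_scratch(candidates, sandbox, environ=None, isdir=os.path.isdir):
    """Return the first usable scratch directory, or the sandbox as fallback.

    `candidates` is a whitespace-separated list such as "TMPDIR /tmp".
    Assumption: entries starting with "/" are literal paths, all other
    entries are names of environment variables holding a path.
    """
    environ = os.environ if environ is None else environ
    for entry in candidates.split():
        if entry.startswith("/"):
            path = entry
        else:
            path = environ.get(entry, "")
        if path and isdir(path):
            return path
    # Neither the variable nor the path is usable: stay in the sandbox.
    return sandbox
```

On a worker node where TMPDIR points to a local disk, resolve_scratch("TMPDIR /tmp", sandbox) would return that local directory, so the job never runs inside the shared sandbox.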

If scratch is on the worker node and the logs are written there as well, the
shared space activity is usually very low.
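
The delayed-output idea from the second bullet might look roughly like the following Python sketch. It is not how grid-control or its job wrapper implements it: run_job, the job.stdout / job.stderr file names, and the way GC_DELAY_OUTPUT is interpreted here are all assumptions for illustration.

```python
import os
import shutil
import subprocess


def run_job(cmd, sandbox, scratch, environ=None):
    """Run `cmd`, buffering its stdout/stderr in `scratch` when requested.

    Assumption: any value of GC_DELAY_OUTPUT other than "" or "0" means
    "delay output". The log file names are hypothetical.
    """
    environ = os.environ if environ is None else environ
    delay = environ.get("GC_DELAY_OUTPUT", "0") not in ("", "0")
    log_dir = scratch if delay else sandbox
    out_path = os.path.join(log_dir, "job.stdout")
    err_path = os.path.join(log_dir, "job.stderr")
    with open(out_path, "w") as out, open(err_path, "w") as err:
        returncode = subprocess.call(cmd, stdout=out, stderr=err)
    if delay:
        # One transfer per stream after the job has finished, instead of
        # many small writes to the shared space while it is running.
        for path in (out_path, err_path):
            shutil.move(path, os.path.join(sandbox, os.path.basename(path)))
    return returncode
```

With the variable set, the shared sandbox sees only two file transfers per job, which is the effect described above.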

> Also, how is the code licensed? We've made some improvements to the
> code, e.g. a new slurm backend supporting the sbatch, sacct commands,
> but are unsure as to the rights of the original code. Can we make a fork
> with some acceptable OSS license?

Hm - I wanted to select the final license when I have time to write the
manual for grid-control. On pages where I had to specify it
(e.g. https://www.ohloh.net/p/grid-control-wms) I used LGPL for now.
(However, licensing under the Apache license was also an option.)

If you want to contribute code, you are welcome to do so (if you are happy
with LGPL or Apache...) - I can give you an account so you can directly
commit changes - as long as it's trivial or contained changes that don't
need to be discussed this allows

> Thanks for the nice software and for making it publicly available.
I should also mention this: I guess you are currently using the stable
release.

In general, development is done in trunk (some people are using
git svn to contribute...). However I'm currently in the process of
finishing the overhaul of the configuration infrastructure -
which caused changes in every single source code file ... This
pretty deep change also had some side effects which I'm trying to
identify and fix at the moment.
So if you want to contribute you should be aware of this.

Cheers,
Fred

Joosep Pata

Apr 9, 2014, 5:49:01 AM
to Fred-Markus Stober, grid-c...@googlegroups.com
Hi,

Thanks for the feedback! This e-mail is a quasi pull-request to follow up (see attached). I’ve also added some comments inline.

OK, seems to be working, thanks!


>
> * The second thing that can create a lot of activity on the shared space is
> the stdout/err stream. The SGE and PBS backends have the option "[backend]
> delay output" to write stdout / stderr first to the scratch space (should
> be on
> the WN) while the job is active. It sets a non-zero value for the
> GC_DELAY_OUTPUT
> environment variable - and could be done as well for your SLURM backend
> i guess.
>
> If scratch is on the worker node and the logs are written there as well, the
> shared space activity is usually very low.
>
>> Also, how is the code licensed? We've made some improvements to the
>> code, e.g. a new slurm backend supporting the sbatch, sacct commands,
>> but are unsure as to the rights of the original code. Can we make a fork
>> with some acceptable OSS license?
>
> Hm - I wanted to select the final license when I have time to write the
> manual for grid-control. On pages where I had to specify it
> (eg. https://www.ohloh.net/p/grid-control-wms) I used LGPL for now.
> (However licensing under the Apache license was also an option)
>
> If you want to contribute code, you are welcome to do so (if you are happy
> with LGPL or Apache...) - I can give you an account so you can directly
> commit
> changes - as long as its trivial or contained changes that don't need to be
> discussed this allows

Thanks, that would be useful! Also, in the era of GitHub and friends, developing without being able to do pull requests is starting to feel awkward. It would be nice if gc were hosted on some public service supporting that ;) I’m pretty sure that this software has wider applicability.


>
>> Thanks for the nice software and for making it publicly available.
> I should also mention this: I guess you are currently using the stable
> release.
Yes, I was using stable. I’ve now made the commits to trunk using git svn and also attached a git diff.


>
> In general, development is done in trunk (some people are using
> git svn to contribute...). However I'm currently in the process of
> finishing the overhaul of the configuration infrastructure -
> which caused changes in every single source code file ... This
> pretty deep change also had some side effects which I'm trying to
> identify and fix at the moment.
> So if you want to contribute you should be aware of this.

Thanks and cheers,
Joosep

>
> Cheer,
> Fred
>

slurm.diff