There is some rather lengthy background information below for those who
are interested. Read on for more detail, or if you believe that you need
a higher limit.
Background:
Recently we have seen a number of cases where a "large number" of batch
jobs in execution for a user has caused an AFS server to become very
sluggish and finally almost totally unresponsive for all interactive and
batch users of that server. Experimentally, that "large number" appears
to be in the range of 125-150 jobs. Initially, while we investigated the
cause, we worked around the problem by limiting the number of batch jobs
for the users we observed precipitating it.
We were finally able to determine that the problem occurs when processes
on many machines are trying to write to the same AFS directory at the same
time. Whenever an update is made to an AFS directory, all other machines
with open files in that directory must be notified to update their AFS
cache. There is a limit on the number of these notifications that the AFS
server can handle simultaneously; when that limit is exceeded, the
notifications queue up and response becomes sluggish.
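To see why the effect grows so quickly with the number of jobs, here is a
minimal back-of-the-envelope sketch (Python, purely illustrative). It
assumes a simplified model of the behavior described above, not the
actual AFS callback implementation: every write to a shared directory
triggers one notification to each of the other machines caching that
directory.

    # Simplified model (an assumption for illustration): each write to a
    # shared directory notifies every other machine caching that directory.
    def notifications_per_round(writers: int) -> int:
        """Notifications sent when each writer updates the directory once."""
        return writers * (writers - 1)

    for n in (10, 50, 100, 150):
        print(f"{n:3d} concurrent writers -> {notifications_per_round(n):,} notifications")

Under this model, 10 concurrent writers generate about 90 notifications
per round of updates, while 150 generate over 22,000; the traffic grows
quadratically, which is consistent with servers coping well with modest
job counts and bogging down somewhere in the 125-150 range.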
The major cause of this in batch jobs had to do with the way batch job
input and output files were handled by LSF during job execution. We have
modified the LSF configuration to handle these files differently, so the
primary cause for batch jobs has been removed. But one problem remains
that can't easily be bypassed: if processes on many machines are writing
to the same AFS directory at the same time, the server can become clogged.
This is most likely to happen in batch jobs but it can happen in other
situations as well, such as the OPR processing in BaBar. There is very
little we can do on the AFS server to remove this potential bottleneck.
The solution has to rely on users being careful to avoid this situation.
With the change to the LSF configuration mentioned above, the problem
should only occur in batch jobs if a user specifies an output file
location in AFS (the bsub -o option) or if the job itself writes to an
AFS directory. Note that it is the final output file location that is
critical, not any AFS links that may have to be traversed to get to that
location. We have not seen the problem when the final output location for
batch jobs is in an NFS directory.
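For example, a job submission that sends its output to NFS rather than
AFS might look like the following (the path and job name here are
hypothetical; substitute your own):

    bsub -o /nfs/farm/mygroup/logs/myjob.out myjob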
In order to control this situation for batch jobs we would have to be able
to determine which jobs are going to write to the same AFS directory and
limit the number in execution at the same time. This is almost impossible
to do, so we have instead imposed a general limit on the number of jobs
that any user can have in execution at once, in an attempt to protect our
AFS servers from degraded response.
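For those curious about the mechanics, per-user limits of this kind are
typically expressed in LSF's lsb.users configuration file, along these
lines (a sketch only; the exact syntax varies with the LSF version, and
the user name shown is hypothetical):

    Begin User
    USER_NAME    MAX_JOBS
    someuser     100
    End User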
In most cases, the current limit of 100-125 batch jobs in execution at
once for any one user should not impose a major hardship, but we
recognize that there are users for whom it may be a problem. We can raise
the limit for those users who *need* to run more jobs, but only with the
guarantee that the jobs will not write to an AFS directory. If any
user believes that it is imperative to be able to run more than 100-125
batch jobs simultaneously, a request can be made for an exception to this
policy. BaBar users should send email to Fabrizio Bianchi (userid
bianchi) for such requests. Other users should send email to unix-admin.
Please note that this problem is not exclusive to batch jobs, although it
is more likely to occur there. Whenever many machines are running
processes that write to the same AFS directory, the problem can occur.
For example, OPR could run into this situation, as could anything else
that runs on multiple machines.
Another possible approach for batch jobs would be to disallow output
files from being written to AFS directories altogether. But that could be
an undue hardship on users who run only a few jobs at once and who may
not have other alternatives, so we have imposed the job limit instead.
We will continue to explore other alternatives, but for now the limit
seems like the wisest solution. User comments regarding the limit, or
alternative suggestions for addressing the problem, are welcome.