bjs problems


Daniel Gruner

Dec 24, 2008, 6:15:19 PM12/24/08
to xc...@googlegroups.com
Hi All,

I've just run into a problem with bjs, which on the surface seems like
some silly magic number issue: I have a cluster with 42 nodes, and
when I start bjs it thinks it only has 32. I have played with
bjs.conf, and I am pretty sure the problem is not there. This was
never an issue under bproc, so the "limit" of 32 nodes must be some
hard-coded number in the xcpu-specific modifications to the original.

There are also a few other issues, such as bjs not building "out of
the box" for version 755. There is some leftover garbage in bjs.c,
which I had to comment out.

Another issue is that it is not quite integrated into sxcpu yet, so
the Makefile needs to be modified, the rc script must be modified to
reflect the location of the executables, etc. I will submit these
changes later, but in the meantime I am looking for the magic "32".

Happy holidays to all,
Daniel

Daniel Gruner

Dec 29, 2008, 8:36:51 PM12/29/08
to xc...@googlegroups.com
Further info on the bjs bug(s):

The "magic number" does not seem to be that, but rather a problem in
parsing a node list from the /etc/xcpu/bjs.conf file. Here is my
sample bjs.conf:

# Sample BJS configuration file
#
# $Id: bjs.conf,v 1.10 2003/11/10 19:40:22 mkdist Exp $

spooldir /var/spool/bjs
policypath /usr/local/lib64/bjs:/usr/local/lib/bjs
socketpath /tmp/.bjs
#acctlog /tmp/acct.log
statfsaddr localhost!20003

pool default
policy filler
# nodes n00[00-41]
nodes n0000,n0001,n0002,n0003,n0004,n0005,n0006,n0007,n0008,n0009,n0010,n0011,n0012,n0013,n0014,n0015,n0016,n0017,n0018,n0019,n0020,n0021,n0022,n0023,n0024,n0025,n0026,n0027,n0028,n0029,n0030,n0031,n0032,n0033,n0034,n0035,n0036,n0037,n0038,n0039,n0040,n0041
maxsecs 20000000


If I specify the nodes line as a comma-separated list of nodenames, it
works. However, the alternative syntax (like the commented nodes
line) that specifies the nodes as a range is not parsed correctly, and
produces a bunch of different errors.

Another bug:
- Attempting to submit a job with "bjssub" when the user either does
not have an id_rsa.pub key, or the system cannot read it (e.g. when
the home directory is on an NFS filesystem that does not allow root
access), causes bjs to crash. When bjs is restarted, it attempts to
run the job, but then xrx cannot run it for lack of permissions, since
no user was authenticated on the nodes.

I think there should be more stringent checks on the user before the
job is accepted by bjs.

Yet another one:
- when a node in the bjs.conf list is down at the time bjs is started,
bjsstat correctly shows that there is a node missing. However, when
the node comes back up, and xstat shows it is up, bjs does not "pick
it up". Furthermore, trying to get bjs to reload the config file
(e.g. with killall -HUP bjs), causes it to crash. One must completely
restart bjs in order to get the number of nodes right. This is not
the expected behaviour (the bproc version always worked correctly,
reflecting the actual number of available nodes). Should bjs be
polling statfs more often?

Regards,
Daniel
