perceus and xcpu [was onesis]

Roger Mason

Dec 18, 2008, 12:38:35 PM
to xc...@googlegroups.com

Hello,

"Abhishek Kulkarni" <abby...@gmail.com> writes:

> Did you try the version (sxcpu r715) already included in Perceus? Does
> it not work for you?

I have built perceus-1.4.4 using

a) the tarball and configure, make, make install
b) a Gentoo ebuild (which does a) automatically)

With both methods, when I get to this bit in Daniel's instructions:

xgroupset ...

my system cannot find xgroupset (find / -iname '*xgroupset*' returns
nothing).

There must be something I missed.

Roger

Daniel Gruner

Dec 18, 2008, 4:07:39 PM
to xc...@googlegroups.com
These are utilities that come with xcpu. If you build sxcpu after
getting it with svn, and then install, you will get all these utilities
in /usr/local/bin (and things like statfs and xcpufs in /usr/local/sbin).
I guess that if you have the sxcpu tarball in the "proper" location for
perceus it might build the utilities too, but I don't know what it does
about installation.
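
A quicker check than searching the whole filesystem is to ask the shell which of the utilities it can find on the current PATH. The sketch below assumes nothing about xcpu itself; `missing_tools` is a made-up helper name, and the tool names in the usage line are the ones mentioned in this thread.

```shell
# Print each named command that cannot be found on the current PATH.
# missing_tools is a hypothetical helper, not part of xcpu or perceus.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}
```

For example, `missing_tools xgroupset xuserset xrx xstat statfs xcpufs` lists whichever of these are not reachable. Keep in mind that /usr/local/sbin may not be on the PATH of a non-root user.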

Daniel

Roger Mason

Dec 19, 2008, 1:18:42 PM
to xc...@googlegroups.com
Hello Daniel,

"Daniel Gruner" <dgr...@gmail.com> writes:

> These are utilities that come with xcpu. If you build sxcpu after
> getting it with svn,
> and then install, you will get all these utilities in /usr/local/bin
> (and things like statfs and xcpufs in /usr/local/sbin). I guess that
> if you have the sxcpu tarball in the "proper" location for perceus it
> might build the utilities too, but I don't know what it does about
> installation.

I installed sxcpu 755 without disturbing the version that came with
perceus. I now have xgroupset. However, when I do

xgroupset add -a -u

I am told

Error: Connection refused:127.0.0.1

Cheers,
Roger

Daniel Gruner

Dec 19, 2008, 6:21:55 PM
to xc...@googlegroups.com
Ok, this is different. Can you explain your setup? I assume that you
have by now booted at least one node using perceus, and it is running
xcpufs, right?

Do you have statfs running on your master node? It needs to be there,
and its contents need to match your nodes. You can even run xcpufs on
your master node and add it to /etc/xcpu/statfs.conf together with the
other nodes, but I don't recommend you do that (unless you are just
testing).

I would also start by testing xgroupset on a single node and with a
single group, which doesn't require that statfs be running or be
correctly configured. For example, you can try:

xgroupset add n0000 root 0

and then you can add the root user:

xuserset add n0000 root 0 root ~root/.ssh/id_rsa.pub

(I hope I had the correct syntax).

If these commands work, then you can do "xrx n0000 date", for example,
and it should work.
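
The single-node sequence above could be wrapped in a small shell function that stops at the first step that fails. This is only a sketch: `setup_node` is a hypothetical name, and the three command lines are copied from this message as they stand (group root with gid 0, user root with uid 0, root's public key).

```shell
# setup_node NODE: run the three single-node steps from this message in
# order, stopping at the first one that fails.
# Hypothetical wrapper for illustration; not part of xcpu.
setup_node() {
    node=$1
    xgroupset add "$node" root 0 || return 1
    xuserset add "$node" root 0 root ~root/.ssh/id_rsa.pub || return 1
    xrx "$node" date
}
```

Running `setup_node n0000` then shows which of the three steps is the first to fail.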

Daniel

Roger Mason

Dec 20, 2008, 7:54:59 AM
to xc...@googlegroups.com
"Daniel Gruner" <dgr...@gmail.com> writes:

> Ok, this is different. Can you explain your setup? I assume that you
> have by now booted at least one node using perceus, and it is running
> xcpufs, right?

I have one node running, and I assume xcpufs is running on it: there
were no error messages on the console when I booted the node.

> Do you have statfs running on your master node? It needs to be there,
> and its contents need to match your nodes. You can even run xcpufs on
> your master node and add it to /etc/xcpu/statfs.conf together with the
> other nodes, but I don't recommend you do that (unless you are just
> testing).

I started it from the command line. I read the man page yesterday and,
from memory, it says that statfs daemonises.

> I would also start by testing xgroupset on a single node and with a
> single group, which doesn't require that statfs be running or be
> correctly configured. For example, you can try:
>
> xgroupset add n0000 root 0

Ah, progress: that worked.

> and then you can add the root user:
>
> xuserset add n0000 root 0 root ~root/.ssh/id_rsa.pub

More progress: that worked after I generated a key for root with ssh-keygen.

> If these commands work, then you can do "xrx n0000 date", for example,
> and it should work.


Almost there but not quite:

lowalbite man4 # xrx n0000 date
Error: /root/.ssh/id_rsa: error:0906B072:lib(9):func(107):reason(114)

Thanks for your help.

Roger

Daniel Gruner

Dec 21, 2008, 12:29:46 AM
to xc...@googlegroups.com
OK!!

Now you may be running into an issue with versions. The version of
sxcpu that comes with perceus is a bit dated (this stuff moves
along...), and that is possibly why the xgroupset add -a -u command
didn't work. You may want to upgrade the xcpufs in the perceus tree
(/usr/local/var/lib/perceus/modules/xcpu/xcpufs
or something like that) with the statically compiled one that you can
make in the xcpufs subdirectory of your sxcpu source tree, using the
LINKSTATIC script in there (as in my first set of instructions).

The funny error you got about the id_rsa is beyond me...
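
For what it is worth, the number in that message has the layout of a packed OpenSSL error code. OpenSSL releases before 3.0 pack a library number, a function number and a reason number into one 32-bit value, and the lib(9), func(107) and reason(114) fields are its three pieces. A minimal sketch of the decoding, with `decode_openssl_err` as a made-up helper name:

```shell
# decode_openssl_err HEX: split a packed (pre-3.0 layout) OpenSSL error
# code into its library, function and reason fields.
# Hypothetical helper for illustration only.
decode_openssl_err() {
    code=$((0x$1))
    lib=$(( (code >> 24) & 0xff ))     # top 8 bits
    func=$(( (code >> 12) & 0xfff ))   # middle 12 bits
    reason=$(( code & 0xfff ))         # low 12 bits
    printf 'lib(%d) func(%d) reason(%d)\n' "$lib" "$func" "$reason"
}

decode_openssl_err 0906B072   # prints: lib(9) func(107) reason(114)
```

Library 9 is OpenSSL's PEM code, which suggests the failure is in reading the private key file itself. On a machine with the OpenSSL command-line tool, `openssl errstr 0906B072` should print the text form of the error, at least with a pre-3.0 OpenSSL.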

Daniel

Roger Mason

Dec 23, 2008, 9:36:52 AM
to xc...@googlegroups.com
Hi Daniel,

"Daniel Gruner" <dgr...@gmail.com> writes:

> Now you may be running into an issue with versions. The version of
> sxcpu that comes with perceus is a bit dated (this stuff moves
> along...), and that is possibly why the xgroupset add -a -u command
> didn't work. You may want to upgrade the xcpufs in the perceus tree
> (/usr/local/var/lib/perceus/modules/xcpu/xcpufs
> or something like that) with the statically compiled one that you can
> make in the xcpufs subdirectory of your sxcpu source tree, using the
> LINKSTATIC script in there (as in my first set of instructions).
>
> The funny error you got about the id_rsa is beyond me...

I built sxcpu and made the static version of xcpufs, which I copied to
perceus' modules directory. I rebooted the master and a node. Here
is a transcript of what I did afterwards:

lowalbite ~ $ su -
Password:
lowalbite ~ # ps auwx | grep perceus
root 6112 0.0 0.3 1572 384 ? Ss 10:29 0:00 /usr/libexec/perceus/perceus-xget -o -x -p 988 /var/lib/lib/perceus
nobody 6115 0.0 0.4 1808 576 ? S 10:29 0:00 /usr/libexec/perceus/perceus-dnsmasq --dhcp-leasefile=/var/lib/lib/perceus/dhcpd.leases --conf-file=/etc/perceus/dnsmasq.conf
root 6117 0.0 5.4 11864 6824 ? S 10:29 0:00 perceusd: master perceus server
root 6118 0.0 5.0 9240 6252 ? S 10:29 0:00 perceusd: node connection handler process 1
root 6119 0.0 5.0 9240 6252 ? S 10:29 0:00 perceusd: node connection handler process 2
root 6120 0.0 5.0 9240 6252 ? S 10:29 0:00 perceusd: node connection handler process 3
root 6123 0.0 5.0 9240 6252 ? S 10:29 0:00 perceusd: node connection handler process 4
root 6423 0.0 0.4 1556 592 pts/2 S+ 10:53 0:00 grep --colour=auto perceus
lowalbite ~ # stat
stat statfs
lowalbite ~ # statfs
lowalbite ~ # ps auwx | grep statfs
root 6427 0.0 0.3 1552 492 pts/2 S+ 10:53 0:00 grep --colour=auto statfs
lowalbite ~ # ps auwx | grep stat
nobody 5754 0.0 0.5 1588 652 ? Ss 10:29 0:00 /sbin/rpc.statd -p 32765 -o 32766
root 6429 0.0 0.4 1556 588 pts/2 S+ 10:53 0:00 grep --colour=auto stat
lowalbite ~ # perceus module activate xcpu
ERROR: This module is already set active at 'init/all'!
Perceus Module 'xcpu' was not enabled in any additional provisionary states!
lowalbite ~ # perceus module activate ipaddr
ERROR: This module is already set active at 'init/all'!
Perceus Module 'ipaddr' was not enabled in any additional provisionary states!
lowalbite ~ # xgroupset add -a -u
Error: Connection refused:127.0.0.1
lowalbite ~ # xuserset add -a -u
Error: Connection refused:127.0.0.1
lowalbite ~ # xrx -a date


Error: /root/.ssh/id_rsa: error:0906B072:lib(9):func(107):reason(114)

lowalbite ~ # xgroupset add n0000 root 0
lowalbite ~ # xuserset add n0000 root 0 root ~root/.ssh/id_rsa.pub
lowalbite ~ # xrx n0000 date


Error: /root/.ssh/id_rsa: error:0906B072:lib(9):func(107):reason(114)

Cheers,
Roger

Daniel Gruner

Dec 23, 2008, 6:48:55 PM
to xc...@googlegroups.com
Beats me!

There are a couple of things you need not (or should not) do. Once
you have configured perceus, it should simply restart on reboot.
After that you should not run the "perceus module activate ..."
commands. You say the node booted up. Have you actually looked at
the node's console when it boots? I assume it does work, since you
can do the xgroupset and xuserset stuff. After that I don't know.

What does xstat return?

What are the contents of the /etc/xcpu directory?

The xcpu gurus should be able to help...

Daniel

Roger Mason

Dec 24, 2008, 1:54:16 PM
to xc...@googlegroups.com
"Daniel Gruner" <dgr...@gmail.com> writes:

> There are a couple of things you need not (or should not) do. Once
> you have configured perceus, it should simply restart on reboot.
> After that you should not run the "perceus activate module..."
> commands.

OK.

> You say the node booted up. Have you actually looked at
> the node's console when it boots? I assume it does work, since you
> can do the xgroupset and xuserset stuff. After that I don't know.

Yes, there is a console on the node. I have not tried to do much with
it but simple things like 'ls' certainly work.

> What does xstat return?

lowalbite ~ # xstat
Error: could not obtain node list from statfs: Connection refused:127.0.0.1: 111

> What are the contents of the /etc/xcpu directory?

lowalbite ~ # ls /etc/xcpu/
admin_key admin_key.pub statfs.conf statfs.conf~

lowalbite ~ # cat /etc/xcpu/statfs.conf
#/etc/xcpu/statfs.conf
n0000=tcp!192.168.0.100!6667
n0001=tcp!192.168.0.101!6667


Thanks and best wishes,
Roger

Daniel Gruner

Dec 24, 2008, 4:02:53 PM
to xc...@googlegroups.com
Hi Roger,

On Wed, Dec 24, 2008 at 1:54 PM, Roger Mason <rma...@esd.mun.ca> wrote:
>
> "Daniel Gruner" <dgr...@gmail.com> writes:
>
>> There are a couple of things you need not (or should not) do. Once
>> you have configured perceus, it should simply restart on reboot.
>> After that you should not run the "perceus activate module..."
>> commands.
>
> OK.
>
>> You say the node booted up. Have you actually looked at
>> the node's console when it boots? I assume it does work, since you
>> can do the xgroupset and xuserset stuff. After that I don't know.
>
> Yes, there is a console on the node. I have not tried to do much with
> it but simple things like 'ls' certainly work.
>
>> What does xstat return?
>
> lowalbite ~ # xstat
> Error: could not obtain node list from statfs: Connection refused:127.0.0.1: 111
>

statfs is NOT running... :-) I suspect this is the source of all your problems.
You must start statfs on the master node in order to be able to use
the "-a" option to most commands, as this is the daemon that monitors
which nodes are up, their load, etc.
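
A side note on checking this: `ps auwx | grep statfs` always matches the grep process itself, and a looser pattern such as `stat` also picks up rpc.statd, so the output is easy to misread. A stricter check can compare exact command names. In the sketch below `has_process` is a hypothetical helper that reads one command name per line on standard input.

```shell
# has_process NAME: succeed if NAME appears as an exact, whole line on stdin.
# Meant to be fed by `ps -e -o comm=`, which prints bare command names.
# Hypothetical helper for illustration only.
has_process() {
    grep -qxF -- "$1"
}
```

Usage on the master node would be something like `ps -e -o comm= | has_process statfs && echo "statfs is running"`.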

>> What are the contents of the /etc/xcpu directory?
>
> lowalbite ~ # ls /etc/xcpu/
> admin_key admin_key.pub statfs.conf statfs.conf~
>
> lowalbite ~ # cat /etc/xcpu/statfs.conf
> #/etc/xcpu/statfs.conf
> n0000=tcp!192.168.0.100!6667
> n0001=tcp!192.168.0.101!6667
>

The two lines defining the nodes look ok. I don't know if you can
have comment lines like the first line in your statfs.conf. What
messages do you get when you try to start statfs?
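
The node lines above follow a `name=tcp!address!port` pattern. A small sketch that splits such lines into fields can help when comparing the file against the nodes that actually booted. It skips lines starting with `#` purely as a choice of this sketch; whether statfs itself accepts comment lines is not established here. `parse_statfs_conf` is a hypothetical name.

```shell
# parse_statfs_conf: read statfs.conf-style lines on stdin and print
# "name address port" for each line of the form name=tcp!address!port.
# Hypothetical helper; skipping '#' lines is this sketch's own choice.
parse_statfs_conf() {
    awk -F'[=!]' '
        /^[ \t]*#/  { next }                  # comment line
        $2 == "tcp" { print $1, $3, $4 }      # name, address, port
    '
}
```

For example, `parse_statfs_conf < /etc/xcpu/statfs.conf` prints one line per node, which can then be checked against the addresses the nodes were given.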

>
> Thanks and best wishes,
> Roger
>

Same to you! Happy holidays.

Happy holidays to all on the list too! I am happy to report that I am
about to go into production with my 42-node xcpu cluster, with bjs as
the scheduler. Now it is only MPI that is still giving me trouble.
Next year...

Daniel
