[Rocks-Discuss] Rocks reinstall on compute node, after disk rebuild, fails


peter....@okstate.edu

Jan 12, 2012, 6:33:10 PM
to npaci-rocks...@sdsc.edu
I still need help. Sorry.
I'm using Rocks 5.4, i386, a CentOS base build.
I've tried everything you all have suggested. Nothing works. I can only
assume at this point I have damaged something. I am not able to get a
re-install to complete, as it continues to stop on compute node 0-0
after reaching the point where the install is "looking for kickstart
keys". The screen then goes blank for a few seconds, and then a window
pops up saying:
"https://10.1.1.1//install/sbin/public/kickstart.cgi?arch=i386&np=8"
"HTTP IO ERROR"
I can clearly communicate with the FE at least briefly or the install
would not begin... correct?

I went through my logs on the FE and it says there is a lock on my
insert-ethers, i.e.:

File "opt/rocks/sbin/insert-ethers", line 1745, in ?
app.run()
File "opt/rocks/sbin/insert-ethers", line 1712, in run
raise-("error - lock file %s exists") % self.lockfile
error - lock file /var/lock/insert-ethers exists

Sure enough there is a zero-bytes file in the lock folder called
insert-ethers. This is really the only clue I have.
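
Before deleting a lock file like this one, it is worth confirming that no insert-ethers process is still running. What follows is only a minimal sketch: it assumes the lock path shown in the traceback above and that `pgrep` is installed on the frontend, and the function name `lock_is_stale` is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed (return 0) when a lock file exists but no
# process matching the given pattern is running.
# $1 = lock file path, $2 = pattern to search for in running command lines.
lock_is_stale() {
    lock="$1"
    pattern="$2"
    # No lock file at all: nothing to clean up.
    [ -e "$lock" ] || return 1
    # A matching process is alive: the lock is probably legitimate.
    if pgrep -f "$pattern" >/dev/null 2>&1; then
        return 1
    fi
    return 0
}

# Example (as root on the frontend):
#   if lock_is_stale /var/lock/insert-ethers insert-ethers; then
#       rm /var/lock/insert-ethers
#   fi
```

The removal is left commented out on purpose, so the check can be run first and the result inspected before anything is deleted.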

I can issue a "rocks sync config" but not a "rocks sync host network
compute-0-0" because I cannot ssh to compute-0-0 either. It reports one
or more of these errors:

"connect to host compute-0-0 port 20, no route to host"
"Close failed: Error 32 Broken Pipe"
"ssh connect to host compute node 0-0 port 22 connection refused"

So I have the correct distro disks (I even rebuilt the distro on the FE),
but compute-0-0 says they aren't correct.
The rebuilt disk on compute-0-0 is happy and green lights are
everywhere. The five SCSI disks are in a RAID-5 array that looks
content. Yet no matter what folder on the FE I point my install to,
compute-0-0 says it can't find the files I know are there.

Should I just blank the disks on that node and try again? Should I try
to reinstall ALL compute nodes with SGE? (I'm afraid of this one because
there are already some locked files on the FE). Thanks for any advice.

Pete

Version of Rocks?

-P


On Thu, Dec 22, 2011 at 9:38 AM, peter.r.hoyt at okstate.edu <
peter.r.hoyt at okstate.edu> wrote:

> Thanks to those who replied. I'm still fighting this compute node (it is
> the 0-0 node)
>
> I tried:
>
>> To change compute-0-5 to re-install at next boot-up, do
>>
>> # rocks set host boot compute-0-5 action=install
>>
>> Then make sure the node PXE boots.
>>
> And indeed this started to work... then stopped at the same install point,
> i.e. after "looking for kickstart keys" the computer puts up a screen saying:
> Could not get file
> https://10.1.1.1//install/sbin/public/kickstart.cgi?arch=i386&np=8
> HTTP IO error
> If I try to point to the files on the head node in different directories,
> or using different paths (e.g. typing the path directly rather than using the
> "export" link), I get essentially the same result. If I use the external IP
> of the server, it replies that it "could find the files on the server"
>
> 10.1.1.1 is the correct internal IP, and I do not always get the IO error.
> Sometimes I get an error saying "could not retrieve
> https://10.1.1.1//install/sbin/public/kickstart.cgi?arch=i386&np=8".
>
> Note that in the "background" of the window, I can see the following:
> "(network.c.358) can't bind to port: 80 Address is already in use". Is this
> a clue to the IO error, and is there a fix?
>
> If I hit cancel for each image the node is looking to find, I eventually
> get back to the screen asking for an IP address and directory of the distro
> files. Nothing seems to work. I also tried the suggestion:
>
>> First try rebuilding the distribution on the frontend (make sure you have
>> enough disk space/verify that none of your frontend partitions are full)
>>
>> # cd /export/rocks/install
>> # rocks create distro
>>
>> Then rebuild/reinstall your compute node
>>
>
> This did indeed seem to build the distro on the head node. But again, using
>
> "# rocks set host boot compute-0-5 action=install"
>
> caused the compute node to start an install, but it stopped at the same point.
> I was also unable to point the compute node to the distro on the head node
> (unless it was in a new directory I couldn't find).
>
> I need another suggestion or two. This system does not have DVD drives, so
> I'll need CDs to reinstall the compute node. Any idea why I wasn't able to
> use my previous CDs to reinstall the compute node?
>
> Thanks,
>
> --
> Peter R. Hoyt, Ph. D.
> Director, Bioinformatics Certification Graduate Program,
> Oklahoma State University
> Department of Biochemistry and Molecular Biology
> 110FB Henry Bellmon Research Center
> (Shipping address: 246 Noble Research Center)
> Stillwater, OK 74078
> (405) 744-6206
> FAX (405) 744-7799
>
>


--
Philip Papadopoulos, PhD
University of California, San Diego
858-822-3628 (Ofc)
619-331-2990 (Fax)
--
Peter R. Hoyt

Luca Clementi

Jan 12, 2012, 8:36:42 PM
to Discussion of Rocks Clusters
Hey Peter,
what is the output of
# wget --no-check-certificate
'https://10.1.1.1//install/sbin/public/kickstart.cgi?arch=i386&np=8'
if you run it from the frontend?

Does
rocks list host profile <nameoftheproblematicnode>
return a proper XML file?

What do you see in the apache access log?
/var/log/httpd/ssl_access and /var/log/httpd/ssl_error

Sincerely,
Luca

peter....@okstate.edu

Jan 13, 2012, 11:22:33 AM
to npaci-rocks...@sdsc.edu
Hi Luca,

> what is the output of
> # wget --no-check-certificate
> 'https://10.1.1.1//install/sbin/public/kickstart.cgi?arch=i386&np=8'
> if you run it from the frontend?

--2012-01-13 09:49:53--
https://10.1.1.1//install/sbin/public/kickstart.cgi?arch=i386&np=8
Connecting to 10.1.1.1:443... connected.
WARNING: cannot verify 10.1.1.1's certificate, issued by `/O=Oklahoma State University/OU=YENKO454-CA/emailAddress=peter....@okstate.edu/L=Stillwater/ST=OK/C=US/CN=yenko.hbrc.okstate.edu':
Self-signed certificate encountered.
WARNING: certificate common name `yenko.hbrc.okstate.edu' doesn't match requested host name `10.1.1.1'.
HTTP request sent, awaiting response... 404 Not Found

> Does the
> rocks list host profile <nameoftheproblematicnode>
> returns a proper xml file?

Looks like it does: the head of the file is:
<?xml version="1.0" standalone="no"?>
<profile lang="kickstart">
<section name="kickstart">
<![CDATA[
#
# Node Traversal Order
#
# ./nodes/partitions-save.xml (base)
# ./nodes/python-development.xml (base)
# ./nodes/firewall.xml (base)

> What do you see in the apache access log?
> /var/log/httpd/ssl_access and /var/log/httpd/ssl_error

ssl_access.log tail results:

10.1.255.254 - - [12/Jan/2012:15:41:30 -0600] "GET /install/sbin/public/kickstart.cgi?arch=i386&np=8 HTTP/1.0" 404 305
10.1.255.254 - - [12/Jan/2012:15:41:31 -0600] "GET /install/sbin/public/kickstart.cgi?arch=i386&np=8 HTTP/1.0" 404 305
10.1.255.254 - - [12/Jan/2012:15:41:32 -0600] "GET /install/sbin/public/kickstart.cgi?arch=i386&np=8 HTTP/1.0" 404 305
10.1.255.254 - - [12/Jan/2012:15:41:33 -0600] "GET /install/sbin/public/kickstart.cgi?arch=i386&np=8 HTTP/1.0" 404 305
10.1.1.1 - - [13/Jan/2012:09:46:54 -0600] "GET //install.sbin/public/kickstart.cgi?arch=i386&np=8 HTTP/1.0" 404 305
10.1.1.1 - - [13/Jan/2012:09:49:53 -0600] "GET //install/sbin/public/kickstart.cgi?arch=i386&np=8 HTTP/1.0" 404 305


ssl_error, tail results:

[Thu Jan 12 15:41:31 2012] [error] [client 10.1.255.254] File does not exist: /var/www/html/install
[Thu Jan 12 15:41:32 2012] [error] [client 10.1.255.254] File does not exist: /var/www/html/install
[Thu Jan 12 15:41:33 2012] [error] [client 10.1.255.254] File does not exist: /var/www/html/install
[Fri Jan 13 09:46:54 2012] [error] [client 10.1.1.1] File does not exist: /var/www/html/install.sbin
[Fri Jan 13 09:49:53 2012] [error] [client 10.1.1.1] File does not exist: /var/www/html/install
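
When the error log is long, it can help to collapse the repeated "File does not exist" entries into one line per missing path. A rough sketch, assuming each log entry sits on a single line in the real file (the wrapped lines in an email are a mail artifact); `missing_paths` is a hypothetical name, not a Rocks or Apache tool:

```shell
#!/bin/sh
# Hypothetical helper: list every path Apache reported as missing,
# prefixed by how many times it was requested, most frequent first.
# $1 = path to an Apache error log, e.g. /var/log/httpd/ssl_error_log
missing_paths() {
    sed -n 's/.*File does not exist: //p' "$1" | sort | uniq -c | sort -rn
}

# Example:
#   missing_paths /var/log/httpd/ssl_error_log
```

For the entries above, this would put /var/www/html/install at the top of the list, which points at the document root rather than at the kickstart CGI itself.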


The dates are odd because I was working on the system yesterday (Jan 12)
and no errors show up.


Luca Clementi

Jan 13, 2012, 12:13:19 PM
to Discussion of Rocks Clusters
Hey Peter,
....

That's the problem. It's in the apache error log.
You should have a symbolic link in /var/www/html

# ls -l /var/www/html/
total 44
drwxr-xr-x 7 root root 4096 Jan 5 16:31 ganglia
drwxr-xr-x 2 root root 4096 Jan 5 16:30 images
-rw-r--r-- 1 root apache 199 Jan 3 18:18 index.html
lrwxrwxrwx 1 root root 21 Jan 3 18:12 install -> /export/rocks/install
drwxr-xr-x 2 root root 4096 Jan 3 18:04 misc

And
# ls -l /export/rocks/install/
total 304
drwxr-xr-x 2 root root 282624 Jan 12 19:07 cachedir
drwxr-xr-x 3 root root 4096 Jan 3 18:12 contrib
drwxr-xr-x 3 root root 4096 Jan 12 19:06 rocks-dist
drwxr-xr-x 13 root root 4096 Jan 12 19:06 rolls
drwxr-xr-x 2 root root 4096 Jan 3 18:12 sbin
drwxr-xr-x 3 root root 4096 Jan 3 18:05 site-profiles
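
If the link is missing, it can be recreated by hand. The sketch below is only one cautious way to do it, assuming the two paths shown in the listings above; `ensure_link` is a hypothetical helper name, and it deliberately refuses to overwrite anything that already exists.

```shell
#!/bin/sh
# Hypothetical helper: create $2 as a symlink to directory $1 when nothing
# exists at $2 yet. An existing symlink is reported and left untouched.
ensure_link() {
    target="$1"
    link="$2"
    if [ ! -d "$target" ]; then
        echo "missing target directory: $target" >&2
        return 1
    fi
    if [ -L "$link" ]; then
        echo "$link -> $(readlink "$link")"
        return 0
    fi
    if [ -e "$link" ]; then
        echo "$link exists and is not a symlink" >&2
        return 1
    fi
    ln -s "$target" "$link" && echo "created $link -> $target"
}

# Example (as root on the frontend):
#   ensure_link /export/rocks/install /var/www/html/install
```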

Sincerely,
Luca

peter....@okstate.edu

Jan 13, 2012, 3:24:57 PM
to npaci-rocks...@sdsc.edu
Luca,
Thank you so much for your help. I was missing the symbolic link. I put
the link in and after a PXE boot I got one step farther in the process.
Instead of stopping at the window after looking for kickstart keys, the
install opened a window named "retrieving" and said it was "retrieving
/install/sbin/public/kickstart.cgi?arch=i386&np=8"

But this fails, and the error is that it can't find the file on the
server. The ssl_error log has new entries looking for a key(?):

[Thu Jan 12 15:41:33 2012] [error] [client 10.1.255.254] File does not exist: /var/www/html/install
[Fri Jan 13 09:46:54 2012] [error] [client 10.1.1.1] File does not exist: /var/www/html/install.sbin
[Fri Jan 13 09:49:53 2012] [error] [client 10.1.1.1] File does not exist: /var/www/html/install

[Fri Jan 13 13:36:06 2012] [error] [client 10.1.255.254] cat: /etc/security/ca/new-certs/cert.e15471/key: No such file or directory
[Fri Jan 13 13:36:06 2012] [error] [client 10.1.255.254] cat: /etc/security/ca/new-certs/cert.s15480/key: No such file or directory

Am I missing another symbolic link? The last error is looking at
10.1.255.254 which is (I believe) the internal IP of the compute node
with the problem. Should I start over with insert-ethers? Is there a
special way to remove the lockfile? (Remember, I'm just a biologist).

Thanks again,
Peter


Luca Clementi

Jan 13, 2012, 5:33:25 PM
to Discussion of Rocks Clusters
Peter,
What's the output of
# ls -l /export/rocks/install
and of
# ls -l /export/rocks/install/sbin

Sincerely,
Luca

PS: your emails have a problem. Their subject should start with
"Re:", like this one: "Re: [Rocks-Discuss] Rocks reinstall....", but
somehow you or your mail client keeps removing the "Re: " part, and
this breaks the mailing list threading. Just hitting the
reply button without modifying the subject should work fine.

peter....@okstate.edu

Jan 17, 2012, 9:44:10 AM
to Discussion of Rocks Clusters
Hi Luca,

Yes, I was having email problems, and finally had to delete my RSS feed
and sign up for individual emails in order to reply properly.

Outputs:
> -bash-3.2$ ls -l /export/rocks/install
> total 432
> drwxr-xr-x 2 root root 413696 Aug 22 11:58 cachedir
> drwxr-xr-x 3 root root 4096 Aug 22 12:20 contrib
> drwxr-xr-x 2 root root 4096 Aug 22 12:54 images
> drwxr-xr-x 3 root root 4096 Dec 21 17:35 rocks-dist
> drwxr-xr-x 11 root root 4096 Aug 22 12:00 rolls
> drwxr-xr-x 2 root root 4096 Aug 22 12:21 sbin
> drwxr-xr-x 3 root root 4096 Aug 22 12:11 site-profiles
> -bash-3.2$

> -bash-3.2$ ls -l /export/rocks/install/sbin
> total 60
> -rwxr-xr-x 1 root root 36219 Nov 2 2010 kickstart.cgi
> lrwxrwxrwx 1 root root 1 Aug 22 12:21 public -> .
> -rwxr-xr-x 1 root root 6152 Nov 2 2010 setDbPartitions.cgi
> -rwxr-xr-x 1 root root 4496 Nov 2 2010 setPxeboot.cgi
> -rwxr-xr-x 1 root root 4724 Nov 2 2010 setPxeMode.cgi
> -bash-3.2$

Anything interesting? (The link looks unusual to me.)
Thanks again. At least I can reply now.

Peter
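
On the `public -> .` entry in the listing: a symlink to `.` simply makes `sbin/public/` another name for `sbin/` itself, so the `/install/sbin/public/kickstart.cgi` path in the URL lands on `sbin/kickstart.cgi`. A throwaway sketch in a temporary directory (nothing here touches the real tree; the file names just mirror the listing):

```shell
#!/bin/bash
# Build a scratch copy of the layout: sbin/kickstart.cgi plus public -> .
demo=$(mktemp -d)
mkdir "$demo/sbin"
echo "placeholder" > "$demo/sbin/kickstart.cgi"
ln -s . "$demo/sbin/public"

# Both paths name the same file; -ef compares device and inode numbers.
if [ "$demo/sbin/public/kickstart.cgi" -ef "$demo/sbin/kickstart.cgi" ]; then
    same_file=yes
else
    same_file=no
fi
echo "same file: $same_file"

# Clean up the scratch directory.
rm -rf "$demo"
```

So the link by itself is not a sign of damage; it only provides the `public/` component that the installer's URL expects.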


peter....@okstate.edu

Jan 17, 2012, 4:28:09 PM
to Discussion of Rocks Clusters
Hi Luca,

I removed the lock on insert-ethers, but when insert-ethers runs, it
is attempting to use a new internal network IP (10.1.0.0 vs 10.1.1.1).
Is there any way to change this back?

Unfortunately, I've made things worse by issuing:
insert-ethers --replace compute-0-0

Now, compute-0-0 is gone from my network, and I can't add it back if
insert-ethers wants to use a new IP.

Thanks and I apologize for my errors.

Pete


Luca Clementi

Jan 17, 2012, 9:53:54 PM
to Discussion of Rocks Clusters
Hey Peter,
I'm a little lost now :-(

So the initial problem was to install compute-0-5. Since you made the
symbolic link in /var/www/html, what is the problem?
What changed after that?
If you connect a monitor to the node during the installation, what is
the error that you see after you re-created the link?
What happens now if you run the wget as I told you before?


On Tue, Jan 17, 2012 at 1:28 PM, peter....@okstate.edu
<peter....@okstate.edu> wrote:
> Hi Luca,
>
> I removed the lock on the insert-ethers, but when insert-ethers runs, it is
> attempting to use a new internal network IP (10.1.0.0 vs 10.1.1.1). Is there
> anyway to change this back?

that is normal....

>
> Unfortunately, I've made things worse by issuing:
>    insert-ethers --replace compute-0-0
>
> Now, compute0-0 is gone from my network, and I can't add it back if
> insert-ethers wants to use a new IP.

To force a particular hostname you can use
insert-ethers --basename compute --cabinet 0 --rank 0

Luca

peter....@okstate.edu

Jan 18, 2012, 12:41:03 PM
to Discussion of Rocks Clusters
Sorry Luca I'll try to explain.
Initial problem was a compute node (compute-0-0) that had a disk
failure. I tried to hotswap the disk. Upon "rebuilding" the disk, the
compute node was visible to the head node, but wanted to reinstall
ROCKS. Unfortunately I was getting an HTTP IO error after looking for
the kickstart keys.

After you helped me find and fix the missing symbolic link, the compute
node still wants to re-install ROCKS, but the installation proceeded a
little farther, trying to "retrieve" the kickstart keys. The install
fails at this point saying that it could not find the files on the FE
(10.1.1.1).

All these errors are from a monitor attached to the compute-0-0 directly.

However, I stupidly removed the lockfile on insert-ethers so that I
could try to run insert-ethers, and add the node back to my cluster. I
then sent the following commands:
insert-ethers --replace compute-0-0

Insert-ethers does run, but now it is trying to set up a cluster using
10.1.0.0 instead of 10.1.1.1. Insert-ethers does NOT see my other nodes
(which I did not restart). Will I now have to add back all my nodes to
use a new 10.1.0.0 internal IP?

If I do wget --no-check-certificate
'https://10.1.1.1//install/sbin/public/kickstart.cgi?arch=i386&np=8'
from the FE now I get:
> # --2012-01-18 11:37:26--
> https://10.1.1.1//install/sbin/public/kickstart.cgi?arch=i386&np=8
> Connecting to 10.1.1.1:443... connected.
> WARNING: cannot verify 10.1.1.1's certificate, issued by `/O=Oklahoma State University/OU=YENKO454-CA/emailAddress=peter....@okstate.edu/L=Stillwater/ST=OK/C=US/CN=yenko.hbrc.okstate.edu':
> Self-signed certificate encountered.
> WARNING: certificate common name `yenko.hbrc.okstate.edu' doesn't match requested host name `10.1.1.1'.
> HTTP request sent, awaiting response... 200 OK
> Length: 320990 (313K) [application/octet-stream]
> Saving to: `kickstart.cgi?arch=i386&np=8'
>
> 100%[===================================================================================================>] 320,990 --.-K/s in 0.04s
>
> 2012-01-18 11:37:33 (8.69 MB/s) - `kickstart.cgi?arch=i386&np=8' saved [320990/320990]
> #

I'm still lost. Thanks for your help.

Peter

Luca Clementi

Jan 19, 2012, 12:56:25 PM
to Discussion of Rocks Clusters
Hey Peter,

On Wed, Jan 18, 2012 at 9:41 AM, peter....@okstate.edu
<peter....@okstate.edu> wrote:
> Sorry Luca I'll try to explain.
> Initial problem was a compute node (compute-0-0) that had a disk failure. I
> tried to hotswap the disk. Upon "rebuilding" the disk, the compute node was
> visible to the head node, but wanted to reinstall ROCKS. Unfortunately I was
> getting an HTTP IO error after looking for the kickstart keys.
>
> After you helped me find and fix the missing symbolic link, the compute node
> still wants to re-install ROCKS, but the installation proceeded a little
> farther, trying to "retrieve" the kickstart keys. The install fails at this
> point saying that it could not find the files on the FE (10.1.1.1).

So the "retrieve kickstart keys" step comes before the HTTP step; that is
what I'm not getting. It's like we took a step backward.

Can you please start the installation of compute-0-0 again and then,
at the node's console, press Ctrl+Alt+F3.
You should then see the logs of what is going on.
What are the last 5 lines you see there before the machine gets stuck?

>
> All these errors are from a monitor attached to the compute-0-0 directly.
>
> However, I stupidly removed the lockfile on insert-ethers so that I could
> try to run insert-ethers, and add the node back to my cluster. I then sent
> the following commands:
>    insert-ethers --replace compute-0-0
>
> Insert-ethers does run, but now it is trying to set up a cluster using
> 10.1.0.0 instead of 10.1.1.1.  Insert-ethers does NOT see my other nodes
> (which I did not restart). Will I now have to add back all my nodes to use a
> new 10.1.0.0 internal IP?

That is not the problem.
10.1.0.0 is the subnetwork that insert-ethers is using; it means the
network containing every IP of the form 10.1.X.X, which is fine in your
case.
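
To make the relationship concrete: the network address is the bitwise AND of a host address and the netmask. The sketch below assumes a netmask of 255.255.0.0, which is what "10.1.X.X" implies but is not stated anywhere in this thread; `network_of` is a hypothetical helper written for this illustration.

```shell
#!/bin/bash
# Hypothetical helper: print the network address for an IPv4 address
# and a dotted-quad netmask.
network_of() {
    local ip="$1" mask="$2"
    local i1 i2 i3 i4 m1 m2 m3 m4
    IFS=. read -r i1 i2 i3 i4 <<< "$ip"
    IFS=. read -r m1 m2 m3 m4 <<< "$mask"
    # Bitwise AND of each address octet with the matching netmask octet.
    echo "$((i1 & m1)).$((i2 & m2)).$((i3 & m3)).$((i4 & m4))"
}

network_of 10.1.1.1     255.255.0.0   # frontend address         -> 10.1.0.0
network_of 10.1.255.254 255.255.0.0   # client in the Apache log -> 10.1.0.0
```

Both addresses fall in the same 10.1.0.0 network, so seeing 10.1.0.0 in insert-ethers does not mean the frontend's address has changed.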
insert-ethers displays only newly discovered hosts (hosts that were
already discovered are not shown in the list).

If you want to see the list of "your" hosts (i.e. already discovered hosts),
you should run rocks list host (can you post the output?).


>
> If I do wget --no-check-certificate
> 'https://10.1.1.1//install/sbin/public/kickstart.cgi?arch=i386&np=8' from
> the FE now I get:
>>
>> # --2012-01-18 11:37:26--
>>  https://10.1.1.1//install/sbin/public/kickstart.cgi?arch=i386&np=8
>>
>> Connecting to 10.1.1.1:443... connected.
>> WARNING: cannot verify 10.1.1.1's certificate, issued by `/O=Oklahoma
>> State
>> University/OU=YENKO454-CA/emailAddress=peter....@okstate.edu/L=Stillwater/ST=OK/C=US/CN=yenko.hbrc.okstate.edu':
>> Self-signed certificate encountered.
>> WARNING: certificate common name `yenko.hbrc.okstate.edu' doesn't match
>> requested host name `10.1.1.1'.
>> HTTP request sent, awaiting response... 200 OK
>> Length: 320990 (313K) [application/octet-stream]
>> Saving to: `kickstart.cgi?arch=i386&np=8'
>>
>> 100%[===================================================================================================>] 320,990     --.-K/s   in 0.04s
>>
>> 2012-01-18 11:37:33 (8.69 MB/s) - `kickstart.cgi?arch=i386&np=8' saved [320990/320990]
>> #
>

What's the content of this file?

What are the last lines of
/var/log/httpd/ssl_access_log
and
/var/log/httpd/ssl_error_log
after you try to install compute-0-0 and it fails?


Sincerely,
Luca

Luca Clementi

Jan 19, 2012, 2:03:14 PM
to Discussion of Rocks Clusters
Hey Peter,
please make sure that also the following processes are running
properly before trying to install:
http://www.rocksclusters.org/roll-documentation/base/5.4.3/faq-installation.html#COMPUTE-KICKSTART-FILE


Sincerely,
Luca

peter....@okstate.edu

Jan 19, 2012, 2:25:35 PM
to Discussion of Rocks Clusters
Hi Luca,

Result of "rocks list host":

> -bash-3.2$ rocks list host
> HOST         MEMBERSHIP CPUS RACK RANK RUNACTION INSTALLACTION
> yenko:       Frontend   8    0    0    os        install
> compute-0-4: Compute    8    0    4    os        install
> compute-0-1: Compute    8    0    1    os        install
> compute-0-3: Compute    8    0    3    os        install

So, compute-0-0 now thinks it is compute-0-4. I saw this happening when
I last used insert-ethers.

> So the retrieve kickstart keys step is before the http step, that is
> what Im not getting, it's like we did a step backward.

Earlier the window said "looking for kickstart keys", went blank, and
then had an HTTP IO-error. After putting in the symbolic link, the
"looking for kickstart keys", goes blank, and is followed by "retrieving
kickstart keys", then gave a "could not find keys" error.

> What are the last 5 lines you see there before the machine gets stuck?

There are 7 lines that say:

> INIT cannot execute "/sbin/mingetty"
> INIT cannot execute "/sbin/mingetty"
> INIT cannot execute "/sbin/mingetty"
> INIT cannot execute "/sbin/mingetty"
> INIT cannot execute "/sbin/mingetty"
> INIT cannot execute "/sbin/mingetty"
> INIT Id "1" respawning too fast: disabled for 5 min.

I did hang around for 5 minutes, and the same set of lines repeated themselves exactly.

Thanks Luca, maybe this will give you a hint.

I don't want to complicate things, but BEHIND the window that says
"could not find keys" error, there is a message in the background (off
centered/out of phase) that says:"2012-01-19 19:12:26: (network.c.358)
can't bind to port: 80 address already in use"

Peter


peter....@okstate.edu

Jan 19, 2012, 3:54:01 PM
to Discussion of Rocks Clusters
Luca,

All those processes appear to be running. There are eight instances of
apache running httpd, but otherwise everything looked normal.

Peter

Luca Clementi

Jan 20, 2012, 4:56:26 PM
to Discussion of Rocks Clusters
On Thu, Jan 19, 2012 at 11:25 AM, peter....@okstate.edu
<peter....@okstate.edu> wrote:
> Hi Luca,
>

> Result of "rocks list host":
>
>> -bash-3.2$ rocks list host
>> HOST         MEMBERSHIP CPUS RACK RANK RUNACTION INSTALLACTION
>> yenko:       Frontend   8    0    0    os        install
>> compute-0-4: Compute    8    0    4    os        install
>> compute-0-1: Compute    8    0    1    os        install
>> compute-0-3: Compute    8    0    3    os        install
>
>
> So, compute-0-0 now thinks it is compute-0-4. I saw this happening when I
> last used insert-ethers.

For this problem you can use
insert-ethers --basename compute --cabinet 0 --rank 0

>
>
>> So the retrieve kickstart keys step is before the http step, that is
>> what Im not getting, it's like we did a step backward.
>
> Earlier the window said "looking for kickstart keys", went blank, and then
> had an HTTP IO-error. After putting in the symbolic link, the "looking for
> kickstart keys", goes blank, and is followed by "retrieving kickstart keys",
> then gave a "could not find keys" error.
>
>

Peter, maybe you have a hardware problem with your system.
The keys the Rocks installer is looking for are on the local ramdisk (if
I'm not wrong).
The missing symbolic link, the keys not there...
Can you check the disk of your frontend?
First run
dmesg
and see if there is any I/O error in the output.
Then you can reboot your frontend with:
shutdown -rF now
and at the next boot you can watch the disk check (it's hidden behind the
CentOS splash screen; you have to click on "show details").
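
Scanning dmesg by eye is easy to get wrong, so a filter can help. This is only a sketch: the two patterns are a guess at common kernel wording for disk trouble and are certainly not exhaustive, and `disk_errors` is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: keep only kernel log lines that hint at disk trouble.
# Reads kernel messages on stdin and prints the matching lines.
disk_errors() {
    grep -i -E 'i/o error|medium error'
}

# Example:
#   dmesg | disk_errors
```

No output from the filter is not proof of a healthy disk; it only means none of the assumed patterns appeared in the current kernel ring buffer.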


Sincerely,
Luca

Edward Walter

Jan 21, 2012, 4:36:58 PM
to Discussion of Rocks Clusters
We saw problems similar to this on a cluster where we reinstalled the frontend. The cause ended up being a stray /.rocks-release file which prevented the filesystem from being formatted. If I remember correctly, this caused a certificate mismatch between the reinstalled frontend and the compute node we were attempting to reinstall. Removing the .rocks-release file solved our problems.

Hope that helps.

-Ed

peter....@okstate.edu

Jan 23, 2012, 11:57:18 AM
to Discussion of Rocks Clusters
Hi Luca,

No obvious hardware problems. However when I rebooted the front end, I
now get

> ]# rocks list host boot
> HOST         ACTION
> yenko:       -------
> compute-0-4: install
> compute-0-1: os
> compute-0-3: os
> [root@yenko ~]#

> ]# rocks list host
> HOST         MEMBERSHIP CPUS RACK RANK RUNACTION INSTALLACTION
> yenko:       Frontend   8    0    0    os        install
> compute-0-4: Compute    8    0    4    os        install
> compute-0-1: Compute    8    0    1    os        install
> compute-0-3: Compute    8    0    3    os        install

Shouldn't "rocks list host boot" say "os"?

Also, the Gnome desktop seems to have changed. I'll have to go to the
server room and check it out.

Thanks,

pete

peter....@okstate.edu

Jan 23, 2012, 3:15:32 PM
to Discussion of Rocks Clusters
This is getting too frustrating.

When I run "insert-ethers --basename compute --cabinet 0 --rank 0"

the insert-ethers window starts and I don't know where to proceed from there. Do I reboot the troublesome node and get it recognized? I would really like to rename compute-0-4 back to compute-0-0 because that is the compute node listed in the .ssh file. I don't think I can communicate with that node until it is back to being compute-0-0.

Also, when I rebooted my FE, as you instructed:


> And then you can reboot your frontend with:
> shutdown -rF now
> and then at the next boot you can see the disk check (it's hidden in
> centos slpash screen, you have to click on show details).

I did not see the disk-check you are referring to. Is there a file I can access? The other nodes are fine, it's just this one node. If I could just get it to re-install. I am willing to reformat and re-initialize the disks in this node if that's the next option. Is there any info I could get from booting this node using the CentOS "Live" CD?

Thanks,

Peter



Bart Brashers

Jan 23, 2012, 4:36:27 PM
to Discussion of Rocks Clusters
I think one key tidbit of information would help you here. You only run insert-ethers when you are trying to install a NEW appliance (e.g. compute node). If you just want to re-install an existing node, you do not use insert-ethers.

Insert-ethers is a tool to intercept PXE boots, detect MAC addresses, and create entries in the rocks database for new nodes. It does have a few (leftover) features like removing database entries, but those have been largely moved to "/opt/rocks/bin/rocks" commands.

So in your case, to rename compute-0-4 to compute-0-0, you can do one of two things:

Version one:

1. Run "insert-ethers --replace compute-0-4"
2. Pick "compute" from list.
3. Make that node PXE boot.
4. When the "( )" turns into a "(*)", indicating that the node has received a kickstart file, exit.

Version two:

1. Run "rocks remove host compute-0-4"
2. Run "rocks sync config; rocks sync users"
3. Run "insert-ethers --cabinet 0 --rank 0"
4. Pick "compute" from list.
5. Make that node PXE boot.
6. When the "( )" turns into a "(*)", indicating that the node has received a kickstart file, exit.

To re-install a node that already has an entry in the rocks database (i.e. a "known" node):

1. Run "rocks set host boot compute-0-0 action=install"
2. Make that node PXE boot.

When the node is either installed or re-installed, the OS is completely new. This includes things like the node's SSH ID (values you would put in ~/.ssh/known_hosts or /etc/ssh/ssh_known_hosts).

But you can still currently do a "ssh compute-0-4", and after it's been installed as compute-0-0, you can "ssh compute-0-0".

I think you can detect a hardware problem by reading `dmesg` and/or looking in the logs. No need to do a filesystem check (fsck) as suggested by Luca. If you really want to be sure, you can boot your FE from a LiveCD, and run "fsck" on each partition of your FE. Keep in mind that if you have some large userdata section, it could take a really, really long time to run. For now, run fsck only on the /, /var, /boot, etc. partitions. I can't give you an exact list, because I don't know how you partitioned your FE. But you get the idea: check the local OS partitions.

I suspect that most of your problems stem from running insert-ethers too many times when you should have used "rocks set host boot compute-0-0 action=install".

Bart


Luca Clementi

Jan 24, 2012, 3:13:47 PM
to Discussion of Rocks Clusters
Dear Peter,

On Mon, Jan 23, 2012 at 12:15 PM, peter....@okstate.edu
<peter....@okstate.edu> wrote:
> This is getting too frustrating.
>
> When I run "insert-ethers --basename compute --cabinet 0 --rank 0"
>
> the insert-ethers window starts and I don't know where to proceed from
> there. Do I reboot the troublesome node and get it recognized? I would
> really like to rename compute-0-4 back to compute-0-0 because that is the
> compute node listed in the .ssh file. I don't think I can communicate with
> that node until it is back to being compute-0-0.
>

The name difference should not cause any problem.
If you want to change it, you have to remove compute-0-4 with
rocks remove host compute-0-4
and then you can rediscover the host using insert-ethers with the
--cabinet and --rank options, and then reboot your host.


> Also, when I rebooted my FE, as you instructed:
>
>> And then you can reboot your frontend with:
>> shutdown -rF now
>> and then at the next boot you can see the disk check (it's hidden in
>> centos splash screen, you have to click on show details).
>
> I did not see the disk-check you are referring to. Is there a file I can
> access?

Not that I'm aware of.

(It's hidden behind the CentOS splash screen; you have to click on
"show details" during the boot.)

> The other nodes are fine, it's just this one node. If I could just
> get it to re-install. I am willing to reformat and re-initialize the disks
> in this node if that's the next option. Is there any info I could get from
> booting this node using the CentOS "Live" CD?


To wipe out the disk data you can use gparted:
http://gparted.sourceforge.net/
Boot from the gparted CD, delete all the partitions you have on the
disk, and recreate one big partition
that takes up the whole disk; that should wipe all the data away from the disk.
Then try to reinstall the node with insert-ethers as indicated above.

Sincerely,
Luca

peter....@okstate.edu

Jan 24, 2012, 8:08:27 PM
to Discussion of Rocks Clusters
Hi Luca, and also Hi Ed. I appreciate your assistance.

Using Ed's advice to convert compute-0-4 to compute-0-0, with some
obvious problems, I got the compute node renamed:
1. Run "insert-ethers --replace compute-0-4" ---> no problem
2. Pick "compute" from list. ---> no problem
3. Make that node PXE boot. ---> no problem


4. When the "( )" turns into a "(*)", indicating that the node has
received a kickstart file, exit. ---PROBLEM!
At this step, the instant that the insert-ethers window said it had
recognized a new appliance, the insert-ethers window CRASHED and
disappeared. I tried to restart insert-ethers quickly, but after a few
seconds the window filled up with a few lines of unrecognizable
characters. I could not "F8" or "F9" to get out of insert-ethers, but
clicking on the upper right "X" did exit the window.
a. Running "rocks list host interface" showed that compute-0-4 was now
gone. So with compute-0-4 gone, I tried:


> 2. Run "rocks sync config; rocks sync users"
> 3. Run "insert-ethers --cabinet 0 --rank 0"
> 4. Pick "compute" from list.
> 5. Make that node PXE boot.
> 6. When the "( )" turns into a "(*)", indicating that the node has
> received a kickstart file, exit.

Which worked fine. I now had a compute-0-0 but the crash of
insert-ethers was disturbing.
I ran: "rocks set host boot compute-0-0 action=install"
Upon rebooting the node using PXE, the same error: "failed to connect to
HTTP server" AFTER the "retrieving kickstart files" window.

b. I checked all my paths, and checked all my symbolic links. All good.
So I tried something different. Instead of connecting to "10.1.1.1" I
tried to connect to the head node using the external IP address. This
actually started to work!
"Running anaconda..."
"Probing for video card..."
"formatting file system..."
"formatting /var file system.."
then STOP! error says:
"Unable to read package metadata. This may be due to a missing repo data
directory. Please ensure that your install tree has been correctly
generated. Cannot retrieve repository metadata (repomd.xml) for
repository anaconda-base-201101061733.i386."
"Please verify its path and try again"
<abort> or <continue>

I hit continue, and another error said:
"Unable to read group info from repositories. This is a problem with the
generation of your install tree."

I was unable to switch my video back to the front end after this. So
rather than reboot my front end blindly, I first rebooted my compute-0-0
and then got another error: "exception bug..." Which I was able to save
to the FE.

The exception-bug file is a long file. Most of it looks like the
installation is going okay. The critical errors seemed to start at this
point:
> 22:59:25 INFO : Running kickstart %%pre script(s)
> 22:59:27 INFO : All kickstart %%pre script(s) have been run
> 22:59:31 WARNING : step installtype does not exist
> 22:59:31 WARNING : step complete does not exist
> 22:59:31 WARNING : step complete does not exist
> 22:59:31 WARNING : step complete does not exist
> 22:59:31 WARNING : step complete does not exist
> 22:59:31 WARNING : step complete does not exist
> 22:59:33 CRITICAL: IOError 14 occurred getting
> http://127.0.0.1/install/rocks-dist/i386/RELEASE-NOTES-en_US.UTF-8.html:
> HTTP Error 404: Not Found
> 22:59:33 CRITICAL: IOError 14 occurred getting
> http://127.0.0.1/install/rocks-dist/i386/RELEASE-NOTES.en_US.UTF-8:
> HTTP Error 404: Not Found
> 22:59:33 CRITICAL: IOError 14 occurred getting
> http://127.0.0.1/install/rocks-dist/i386/RELEASE-NOTES-en_US.html:
> HTTP Error 404: Not Found
> 22:59:33 CRITICAL: IOError 14 occurred getting
> http://127.0.0.1/install/rocks-dist/i386/RELEASE-NOTES.en_US: HTTP
> Error 404: Not Found
> 22:59:33 CRITICAL: IOError 14 occurred getting
> http://127.0.0.1/install/rocks-dist/i386/RELEASE-NOTES-en.html: HTTP
> Error 404: Not Found
> 22:59:33 CRITICAL: IOError 14 occurred getting
> http://127.0.0.1/install/rocks-dist/i386/RELEASE-NOTES.en: HTTP Error
> 404: Not Found
> 22:59:33 CRITICAL: IOError 14 occurred getting
> http://127.0.0.1/install/rocks-dist/i386/RELEASE-NOTES-C.html: HTTP
> Error 404: Not Found
> 22:59:33 CRITICAL: IOError 14 occurred getting
> http://127.0.0.1/install/rocks-dist/i386/RELEASE-NOTES.C: HTTP Error
> 404: Not Found
> 22:59:33 INFO : moving (1) to step partitionobjinit

later the next big set of errors seemed to start here:

> 23:00:09 INFO : moving (1) to step downloadrolls
> 23:00:09 INFO : moving (1) to step reposetup
> 23:00:09 DEBUG : repo time: 0.002
> 23:00:09 DEBUG : Setting up Package Sacks
> 23:00:09 ERROR : reading package metadata: Cannot retrieve
> repository metadata (repomd.xml) for repository:
> anaconda-base-201101061733.i386. Please verify its path and try again
> 23:08:47 INFO : no groups missing
> 23:08:47 INFO : moving (1) to step basepkgsel
> 23:08:47 DEBUG : Setting up Package Sacks
> 23:08:47 DEBUG : Checking for virtual provide or file-provide for
> ClusterTools_gnu
> 23:08:47 DEBUG : Setting up Package Sacks
> 23:08:47 DEBUG : rpmdb time: 0.000
> 23:08:47 DEBUG : no package matching ClusterTools_gnu
> 23:08:47 DEBUG : Setting up Package Sacks
> 23:08:47 DEBUG : Checking for virtual provide or file-provide for EMBOSS
> 23:08:47 DEBUG : Setting up Package Sacks
> 23:08:47 DEBUG : no package matching EMBOSS
> 23:08:47 DEBUG : Setting up Package Sacks
> 23:08:47 DEBUG : Checking for virtual provide or file-provide for
> OpenIPMI
> 23:08:47 DEBUG : Setting up Package Sacks
> 23:08:47 DEBUG : no package matching OpenIPMI
After about 100 "no package matching <packagename>" repetitions, the
exception bug ended abruptly with:
> 23:08:47 DEBUG : Setting up Package Sacks
> 23:08:47 DEBUG : no package matching yum
> 23:08:47 DEBUG : no such group Core
> 23:08:47 DEBUG : no such group Base
> 23:08:47 INFO : moving (1) to step postselection
> 23:08:47 DEBUG : Setting up Package Sacks
>
>
> /tmp/lvmout:
> Wiping cache of LVM-capable devices
> Wiping internal VG cache
> Finding all volume groups
> Reading all physical volumes. This may take a while...
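
A long installer log like the one quoted above is easier to read once only the serious entries are pulled out. The following is a minimal sketch, not part of Rocks or anaconda: it assumes the "HH:MM:SS LEVEL : message" layout seen in the excerpt, tolerates the mail-quote prefix "> ", and glues wrapped continuation lines back onto the entry they belong to. The function name `serious_entries` is made up for illustration.

```python
import re

# Matches lines such as "22:59:33 CRITICAL: IOError 14 occurred getting"
# (layout assumed from the log excerpt in this thread).
LINE = re.compile(
    r"^(\d{2}:\d{2}:\d{2})\s+(DEBUG|INFO|WARNING|ERROR|CRITICAL)\s*:\s*(.*)$"
)


def serious_entries(log_text, levels=("ERROR", "CRITICAL")):
    """Return (time, level, message) tuples for the requested levels."""
    entries = []
    current = None  # the most recent log entry, selected or not
    for raw in log_text.splitlines():
        line = raw.lstrip("> ").strip()  # drop mail-quote markers
        if not line:
            current = None  # a blank line ends any wrapped entry
            continue
        match = LINE.match(line)
        if match:
            current = list(match.groups())
            if current[1] in levels:
                entries.append(current)
        elif current is not None:
            # Wrapped text: append it to the entry it continues.
            current[2] += " " + line
    return [tuple(entry) for entry in entries]
```

Run against the excerpt above, this would reduce the log to the 404 errors on the RELEASE-NOTES files and the repomd.xml failure, which are the lines worth quoting when asking for help.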

Bart Brashers

Jan 25, 2012, 12:35:55 PM
to Discussion of Rocks Clusters
> Using Ed's advice to convert compute-0-4 to compute-0-0, with some
> obvious problems, I got the compute node renamed:
> 1. Run "insert-ethers --replace compute-0-4" ---> no problem
> 2. Pick "compute" from list. ---> no problem
> 3. Make that node PXE boot. ---> no problem
> 4. When the "( )" turns into a "(*)", indicating that the node has
> received a kickstart file, exit. ---PROBLEM!
> At this step, the instant that the insert-ethers window said it had
> recognized a new appliance, the insert-ethers window CRASHED and
> disappeared. I tried to restart insert-ethers quickly, but after a few
> seconds the window filled up with a few lines of unrecognizable
> characters. I could not "F8" or "F9" to get out of insert-ethers, but
> clicking on the upper right "X" did exit the window.

Did you remember to remove the lock file for insert-ethers after the crash? A new instance of insert-ethers cannot pick up where the previous one left off, so there was no point in quickly re-starting it. It's not a service, like sshd.

> a. Running "rocks list host interface" showed that compute-0-4 was now
> gone. So with Compute-0-4 gone, I tried
> > 2. Run "rocks sync config; rocks sync users"
> > 3. Run "insert-ethers --cabinet 0 --rank 0"
> > 4. Pick "compute" from list.
> > 5. Make that node PXE boot.
> > 6. When the "( )" turns into a "(*)", indicating that the node has
> > received a kickstart file, exit.
> Which worked fine. I now had a compute-0-0 but the crash of
> insert-ethers was disturbing.

So at that point, did compute-0-0 install correctly? It takes a while, something like 30 minutes to install (depending on your hardware, interconnect, etc.). Did you hook up a screen to compute-0-0 and verify that the install finished?

> I ran :"rocks set host boot compute-0-0 action=install"
> Upon rebooting the node using PXE, the same error: "failed to connect to
> HTTP server" AFTER the "retrieving kickstart files" window.

It's very odd that an initial install using insert-ethers would work, but a subsequent re-install would not work. Perhaps you interrupted it in the middle of an install, and left it in a half-hosed state.

> b. I checked all my paths, and checked all my symbolic links. All good.
> So i tried something different. Instead of connecting to "10.1.1.1" I
> tried to connect to head node using the external IP address.

Sorry, this too was pointless, as you eventually saw. Rocks is set up to work using the private IP range for PXE-based installs of compute nodes. If that doesn't work, then you should fix the problem rather than trying a work-around. It has worked for many, many other clusters, so something must be wrong with your install.

> This actually started to work!
> "Running anaconda..."

...


> "Unable to read group info from repositories. This is a problem with the
> generation of your install tree."
>
> I was unable to switch my video back to the front end after this. So
> rather than reboot my front end blindly, I first rebooted my compute-0-0
> and then got another error: "exception bug..." Which I was able to save
> to the FE.

...


> /tmp/lvmout:
> Wiping cache of LVM-capable devices
> Wiping internal VG cache
> Finding all volume groups
> Reading all physical volumes. This may take a while...

I'm not sure, but I thought that anaconda (the Redhat installer that Rocks uses) does not support LVM. Did this node initially have some LVM partitions on it? Perhaps you could boot compute-0-0 with a LiveCD and wipe all the partitions, so it's a "blank" disk before starting the re-install.

A few other things come to mind, suggestions for you to think about and/or try:

1. What are the two IP ranges and netmasks for your local public LAN and Rocks' private IP range? From your comments about "10.1.1.1" and connecting to the head node's public IP, I hope you don't mean that your public LAN uses 10.1.X.Y/24 and your private side uses 10.1.X.Y/16. Normally, a compute node has no route to the local public LAN except for being NATted by the headnode. So I don't see how it's possible for it, during the anaconda-based install, to have a route to your FE's public NIC.

2. Which rolls did you select? In one of my recent install attempts of Rocks 5.4.3, I selected some rolls from the Triton 5.4 tree, and some from the Stanford collection of rolls (some built for 5.4.3, some built for 5.4). Although the FE installed correctly, it would not insert compute nodes. The web page/wordpress was also not working, if that's a clue. What's the output on your FE of "rocks list roll"?
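
The worry in point 1, a public LAN whose range sits inside the private range, can be checked with a few lines of Python. This is only a sketch using the standard `ipaddress` module; the network values in the usage note are examples made up to match the 10.1.X.Y/24 versus 10.1.X.Y/16 scenario, not anyone's real configuration, and `networks_overlap` is a name chosen for illustration.

```python
import ipaddress


def networks_overlap(net_a, net_b):
    """Return True if the two CIDR ranges share any addresses."""
    # strict=False lets a host address such as 10.1.1.1/16 stand in
    # for the network that contains it.
    a = ipaddress.ip_network(net_a, strict=False)
    b = ipaddress.ip_network(net_b, strict=False)
    return a.overlaps(b)
```

For instance, `networks_overlap("10.1.1.0/24", "10.1.0.0/16")` reports an overlap, which is the problematic case, while a public range outside 10.1.0.0/16 does not overlap with it.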

Bart Brashers

peter....@okstate.edu

Jan 26, 2012, 10:26:59 AM
to Discussion of Rocks Clusters
1. The lockfile was removed earlier, and has not appeared again. Here
are the contents of my /var/lock/ directory:

> drwxrwxr-x 6 root lock 4096 Jan 26 04:02 .
> drwxr-xr-x 28 root root 4096 Aug 22 12:45 ..
> -rw-r--r-- 1 root root 0 Jan 23 10:27 after-rc
> -rw-r--r-- 1 root root 0 Jan 23 10:26 before-rc
> drwxr-xr-x 2 root root 4096 Mar 31 2010 dmraid
> -rw-r--r-- 1 root root 0 Jan 23 10:26 irqbalance
> drwxr-xr-x 2 root root 4096 Jan 23 10:26 iscsi
> drwx------ 2 root root 4096 Aug 22 12:14 lvm
> drwxr-xr-x 2 root root 4096 Jan 26 04:02 subsys

2. You asked: "So at that point, did compute-0-0 install correctly?"
ANSWER: No. And I did hook up a monitor... because this is a small
cluster I have a KVM switch. The install stopped with the "can't
locate kickstart file" error.

3. You commented: "It's very odd that an initial install using
insert-ethers would work, but a subsequent re-install would not work.
Perhaps you interrupted it in the middle of an install, and left it in a
half-hosed state."

ANSWER: Because I have a KVM setup, I tried to be careful this did
not happen. But something happened, and "half-hosed" seems to describe
it well.

4. You commented: "Sorry, this too was pointless, as you eventually
saw. Rocks is set up to work using the private IP range for PXE-based
installs of compute nodes."

ANSWER: Yes, sorry. Just a Biologist as stated previously.

5. You commented: "I'm not sure, but I thought that anaconda (the Redhat
installer that Rocks uses) does not support LVM. Did this node
initially have some LVM partitions on it? Perhaps you could boot
compute-0-0 with a LiveCD and wipe all the partitions, so it's a "blank"
disk before starting the re-install."

ANSWER: I have to admit, I'm not sure about this but I think not.
The Biologist in me says the other nodes are okay (as was this one until
the disk crashed), and they were not treated differently. I have a
directory "/var/lock/lvm" but it is empty (on the head node, 5 SCSI
disks). If I issue the df command I get:
> # df
> Filesystem           1K-blocks      Used Available Use% Mounted on
> /dev/sda1             15872604   4314572  10738720  29% /
> /dev/sda5             47719976  32671108  12585704  73% /state/partition1
> /dev/sda2              3968124    307960   3455336   9% /var
> tmpfs                 12219672         0  12219672   0% /dev/shm
> tmpfs                   886632      3456    883176   1% /var/lib/ganglia/rrds
if I issue fdisk:
> # fdisk -l
>
> Disk /dev/sda: 72.4 GB, 72469184512 bytes
> 255 heads, 63 sectors/track, 8810 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
>    Device Boot      Start         End      Blocks   Id  System
> /dev/sda1   *           1        2040    16386268+  83  Linux
> /dev/sda2            2041        2550     4096575   83  Linux
> /dev/sda3            2551        2677     1020127+  82  Linux swap / Solaris
> /dev/sda4            2678        8810    49263322+   5  Extended
> /dev/sda5            2678        8810    49263291   83  Linux
if I issue: lvdisplay:
> # lvdisplay -v
> Finding all logical volumes
> #
You (and Luca) are suggesting I start over... at least on compute-0-0. I
believe that is the best option. Dang.
My public IP, located on the University network, is a University IP,
nothing close to 10.1... anything. I was surprised I could access it.

Finally, the results of "rocks list roll"
> # rocks list roll
> NAME VERSION ARCH ENABLED
> os: 5.4 i386 yes
> web-server: 5.4 i386 yes
> ganglia: 5.4 i386 yes
> base: 5.4 i386 yes
> hpc: 5.4 i386 yes
> bio: 5.4 i386 yes
> service-pack: 5.4.2 i386 yes
> kernel: 5.4 i386 yes
> area51: 5.4 i386 yes


Thanks again!

-p

Bart Brashers

Jan 26, 2012, 1:24:30 PM
to Discussion of Rocks Clusters
> 1. The lockfile was removed earlier, and has not appeared again.

Lock files are a way for a program to prevent other copies of itself from being launched. Let's say I have a program that opens and manipulates a database. I don't want two copies of the program making changes at the same time, because it would cause database corruption. So I make the program check for the existence of a little file somewhere (e.g. /var/lock) when it starts, and exit immediately if the file exists. I also make the program remove the lock file when it exits normally. But if the program crashes in the middle, it won't remove the lock file -- crash means it can't do anything else, it's dead.

So you normally won't see any lock file for insert-ethers, unless it crashed -- which you said it did. If you could immediately launch another instance of insert-ethers after the first one "crashed", then I suspect either (a) something is very wrong with your install, or (b) it didn't really crash, it exited normally. But I can't explain why the 2nd one you started wrote all that junk to the screen.
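
To make the lock-file idea concrete, here is a minimal sketch of the pattern in Python. It is not the insert-ethers code; the names `acquire_lock`, `release_lock`, `run_exclusive`, and `LockExists` are invented for illustration, and the lock path is whatever the caller passes in.

```python
import os


class LockExists(Exception):
    """Raised when a lock file is already present."""


def acquire_lock(path):
    """Create the lock file, failing if it already exists."""
    try:
        # O_CREAT | O_EXCL makes "check and create" a single atomic step.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise LockExists("error - lock file %s exists" % path)
    os.close(fd)  # an empty (zero-byte) file is all that is needed


def release_lock(path):
    """Remove the lock file on the way out."""
    os.remove(path)


def run_exclusive(path, work):
    """Run work() while holding the lock; release it on normal exit."""
    acquire_lock(path)
    try:
        return work()
    finally:
        release_lock(path)
```

If the process is killed outright between `acquire_lock` and `release_lock`, the zero-byte file stays behind, and the next `run_exclusive` call on the same path raises `LockExists` until someone deletes the file by hand. That matches the zero-byte /var/lock/insert-ethers file reported at the start of this thread.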

> 2. you asked "So at that point, did compute-0-0 install correctly?"
> ANSWER: No. And I did hook up a monitor... because this is a small
> cluster I have a KVM switch. The install was stopped with the can't
> locate kickstart file error

OK.

> 3. You commented: It's very odd that an initial install using
> insert-ethers would work, but a subsequent re-install would not work.
> Perhaps you interrupted it in the middle of an install, and left it in a
> half-hosed state.
> ANSWER: Because I have a KVM setup, I tried to be careful this did
> not happen. But something happened, and "half-hosed" seems to describe
> it well.

Well, now that we know it never installed correctly, I'd say the compute node is "blank" rather than "half hosed". You should be able to install on it, were your FE functioning correctly.

> 4. You commented: "Sorry, this too was pointless, as you eventually
> saw. Rocks is set up to work using the private IP range for PXE-based
> installs of compute nodes."
> ANSWER: Yes, sorry. Just a Biologist as stated previously.

OK.

> 5: You Commented: "I'm not sure, but I thought that anaconda (the Redhat
> installer that Rocks uses) does not support LVM. Did this node
> initially have some LVM partitions on it? Perhaps you could boot
> compute-0-0 with a LiveCD and wipe all the partitions, so it's a "blank"
> disk before starting the re-install."
> ANSWER: I have to admit, I'm not sure about this but I think not.
> The Biologist in me says the other nodes are okay (as was this one until
> the disk crashed), and they were not treated differently. I have a
> directory "/var/lock/lvm" but it is empty (on the head node, 5 SCSI
> disks). if I issue the df command I get:

OK, this just shows that there are no LVM partitions on the frontend. BTW, where are the other 4 SCSI disks on your FE mounted? Or are they not being used (yet)?

I was wondering if there were some leftover LVM partitions on compute-0-0. Can you do your calls to df, fdisk, and lvdisplay on compute-0-0, after booting from a LiveCD?

I suspect you won't find any LVM partitions there either, so this may be a waste of time.

> You (and Luca) are suggesting I start over... at least on compute-0-0. I
> believe that is the best option. Dang.

I think you misunderstand. We don't suggest you start over just on compute-0-0, we suggest you start over by re-installing your frontend. Start fresh with a new install of your cluster. You might consider taking the opportunity to upgrade to 5.4.3 while you're at it.

Speaking as an Atmospheric Scientist giving my $0.02 worth of advice to a Biologist, you can spend much more time trying to figure out what is wrong with your current Rocks install than it takes to install a new one. I've been there, recently. I wasted nearly a week trying to figure out why my install with a mix of Rocks, Triton, and Stanford rolls was slightly hosed (it was mostly working, but certain key bits were busted). When I bit the bullet and started fresh with just a basic set of rolls, it was up and running correctly in a few hours.

If you can, I suggest you NOT make a restore roll, so you don't inherit any problems. Look over the restore roll internals, and read the wiki entry
https://wiki.rocksclusters.org/wiki/index.php/Tips_and_tricks#Q._What_files_should_survive_a_frontend_re-install.2Fupgrade.3F,
and make some spot on a different disk (e.g. /dev/sdb) to copy these files (/etc/passwd, /etc/group, /etc/shadow, /etc/auto.home, and so on). Don't forget your extend*.xml files.

During the install, be sure to NOT FORMAT your disks with user data. Just format the /dev/sda disk with the Rocks install. You can even leave the /dev/sda5 partition containing your /state/partition1 (aka /export) alone, not re-format.

Then after the fresh install, you can merge the user data from the old versions of these files into the new versions -- they are just text files, so you can edit them by hand.
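
Hand-merging is fine for a handful of users, but the idea can also be sketched in code. The example below is an illustration only: it assumes the standard seven-field /etc/passwd layout, treats UIDs of 500 and above as regular users (the usual CentOS 5 convention), and keeps the new system's entry whenever a name exists in both files. The function name `merge_passwd` is made up, and /etc/group and /etc/shadow would need the same kind of care with their own field layouts.

```python
def merge_passwd(old_text, new_text, min_uid=500):
    """Append regular users from the old passwd text to the new one."""
    new_lines = [line for line in new_text.splitlines() if line.strip()]
    known = {line.split(":")[0] for line in new_lines}
    merged = list(new_lines)
    for line in old_text.splitlines():
        fields = line.split(":")
        if len(fields) != 7:
            continue  # skip blank or malformed lines
        name, uid = fields[0], fields[2]
        if not uid.isdigit():
            continue
        # Copy only regular users that the fresh install lacks.
        if int(uid) >= min_uid and name not in known:
            merged.append(line)
            known.add(name)
    return "\n".join(merged) + "\n"
```

Write the result to a scratch file and review it before copying it into place. After changing the real files on the frontend, "rocks sync users" pushes the accounts out to the nodes.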

> My public IP, located on the University network, is a University IP,
> nothing close to 10.1... anything. I was surprised I could access it.

I'm really surprised too. Do you have your compute nodes on a separate switch, or at least a VLAN? If you log into one of your functioning nodes and type "traceroute www.redhat.com" does it show that the first step is your FE?

In particular, you must protect your compute nodes from any other DHCP servers that might be on your local public LAN. That's why we generally have a private switch or VLAN for them.

> Finally, the results of "rocks list roll"
> > # rocks list roll
> > NAME VERSION ARCH ENABLED
> > os: 5.4 i386 yes
> > web-server: 5.4 i386 yes
> > ganglia: 5.4 i386 yes
> > base: 5.4 i386 yes
> > hpc: 5.4 i386 yes
> > bio: 5.4 i386 yes
> > service-pack: 5.4.2 i386 yes
> > kernel: 5.4 i386 yes
> > area51: 5.4 i386 yes

These look pretty vanilla to me. That rules out the "incompatible rolls" theory for why you can't install compute-0-0.

I see that you've re-created the rocks distro, but maybe you need to clean it first? You could try the following:

# cd /export/rocks/install
# /bin/rm -rf rocks-dist # removes the old distro
# rocks create distro

Then verify compute-0-0 is set to re-install upon next boot:

# rocks list host boot

Next, verify that Rocks can generate a valid kickstart file:

# rocks list host profile compute-0-0 > /tmp/c0.ks
# less /tmp/c0.ks # scan for errors/truncation, mine is about 117K in size.

Then PXE-boot your node.
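
The "scan for errors/truncation" step can be partly automated. The sketch below is a rough heuristic, not a Rocks tool: it assumes a healthy kickstart file is reasonably large, contains `%packages` and `%post` sections, and has no Python traceback embedded in it. The 10,000-byte floor and the function name `check_kickstart` are arbitrary choices for illustration.

```python
def check_kickstart(text, min_bytes=10_000):
    """Return a list of human-readable problems found in the text."""
    problems = []
    size = len(text.encode("utf-8"))
    if size < min_bytes:
        problems.append("file is only %d bytes, possibly truncated" % size)
    lines = [line.strip() for line in text.splitlines()]
    for section in ("%packages", "%post"):
        if not any(line.startswith(section) for line in lines):
            problems.append("no %s section found" % section)
    if "Traceback (most recent call last)" in text:
        problems.append("contains a Python Traceback, generation likely failed")
    return problems
```

A usage sketch: read /tmp/c0.ks with `open("/tmp/c0.ks").read()`, pass the text to `check_kickstart`, and look over whatever it returns. An empty list only means none of these crude checks fired; it does not prove the file is valid.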

peter....@okstate.edu

Jan 26, 2012, 3:31:49 PM
to Discussion of Rocks Clusters
Bart,

Thanks very much for your detailed help. I am hesitant to re-install the
entire cluster. Mostly now because I am no good at setting up disks. I
better learn how to format and mount disks before I upgrade/re-install.
My SysAdmin left and this has become my responsibility. I enjoy the
learning, but have had enough in this round. Part of the "tree" problem
could be a couple of "home" directories that are acting like they are
hard-linked directories (???), but have different inode numbers (took
most of yesterday to learn those terms).

If you still have some patience would you give me a little guidance on
how to format using "points" on the disks? Or a concise tutorial.
Specifically you said: "make some spot on a different disk (e.g.
/dev/sdb) to copy these files (/etc/passwd, /etc/group, /etc/shadow,
/etc/auto.home, and so on). Don't forget your extend*.xml files." More
specifically when you ask me "BTW, where are the other 4 SCSI disks on
your FE mounted?" I really don't have a clue. I assumed they were all
being used. My original Linux "mentors" told me that the physical disks
would all just be one "big logical disk". Compute-0-0 is a raid5 array
and I believe the other nodes are the same (but have not
double-checked). I do understand the concepts of logical volumes vs
physical volumes. I have worked with partitions on disks, but these were
mostly in Windows computers. The mount points of Linux disks are still
mysterious regarding moving them around or otherwise assigning them...
much less doing it appropriately for organization and system
security/stability.

The FE has 5 SCSI disks (old 18GB U320 Cheetahs). The arrangement on the
head node is:


# df
Filesystem           1K-blocks      Used Available Use% Mounted on
/dev/sda1             15872604   4314572  10738720  29% /
/dev/sda5             47719976  32671108  12585704  73% /state/partition1
/dev/sda2              3968124    307968   3455328   9% /var
tmpfs                 12219672         0  12219672   0% /dev/shm
tmpfs                   886632      3456    883176   1% /var/lib/ganglia/rrds

But I didn't do anything on purpose. I just let ROCKS do its thing.
It's a mystery to me why /dev/sda5 appears to have 32 GB of data already.
I expected the OS to be fairly lean. Also, no sda3 or sda4 (or sdb)...
hidden as part of the raid5 setup? Because I'm starting over, I could
put five 148GB SCSIs on the head node, if it is going to "fill-up"
faster. For user data storage, the FE will have an external SATA-array
of several terabytes.

Off to read the CentOS manual...

Peter

--

"Hung-Sheng Tsao (Lao Tsao 老曹) Ph.D."

Jan 26, 2012, 3:47:15 PM
to Discussion of Rocks Clusters, peter....@okstate.edu

Hi,
Could I ask:
Do your servers' CPUs support only 32-bit?
How old are your servers?
What HDDs do they have, and is there hardware RAID?
Which NICs are in the FE and the compute nodes?
What are your main apps?
Yes, for Rocks, re-installing everything is a good approach for a beginner. You also need
a test environment, e.g. VirtualBox on a desktop, so you can try things out.
Regards

--
Hung-Sheng Tsao Ph D.
Founder& Principal
HopBit GridComputing LLC
cell: 9734950840

http://laotsao.blogspot.com/
http://laotsao.wordpress.com/
http://blogs.oracle.com/hstsao/
