I went through my logs on the FE, and they show a lock on
insert-ethers, i.e.:
"File "/opt/rocks/sbin/insert-ethers", line 1745, in ?
app.run()
File "/opt/rocks/sbin/insert-ethers", line 1712, in run
raise-("error - lock file %s exists") % self.lockfile
error - lock file /var/lock/insert-ethers exists"
Sure enough there is a zero-bytes file in the lock folder called
insert-ethers. This is really the only clue I have.
I can issue a "rocks sync config" but not a "rocks sync host network
compute-0-0", because I cannot ssh to compute-0-0 either. It reports one
or more of these errors:
"connect to host compute-0-0 port 20, no route to host"
"Close failed: Error 32 Broken Pipe"
"ssh connect to host compute node 0-0 port 22 connection refused"
I have the correct distro disks (I even rebuilt the distro on the FE),
but compute-0-0 does not recognize them as correct.
The rebuilt disk on compute-0-0 is happy and green lights are
everywhere. The five SCSI disks are in a raid-5 array that looks
content. Yet no matter what folder on the FE I point my install to,
compute-0-0 says it can't find the files I know are there.
Should I just blank the disks on that node and try again? Should I try
to reinstall ALL compute nodes with SGE? (I'm afraid of this one because
there are already some locked files on the FE). Thanks for any advice.
Pete
Version of Rocks?
-P
On Thu, Dec 22, 2011 at 9:38 AM, peter.r.hoyt at okstate.edu <
peter.r.hoyt at okstate.edu> wrote:
> Thanks to those who replied. I'm still fighting this compute node (it is
> the 0-0 node).
>
> I tried:
>
>> To change compute-0-5 to re-install at next boot-up, do
>>
>> # rocks set host boot compute-0-5 action=install
>>
>> Then make sure the node PXE boots.
>>
> And indeed this started to work... then stopped at the same install
> point, i.e. after "looking for kickstart keys" the computer puts up a
> screen saying:
> Could not get file
> https://10.1.1.1//install/sbin/public/kickstart.cgi?arch=i386&np=8
> HTTP IO error
> If I try to point to the files on the head node in different directories,
> or using different paths (e.g. typing the path directly rather than using
> the "export" link), I get essentially the same result. If I use the
> external IP of the server, it replies that it "could find the files on
> the server"
>
> 10.1.1.1 is the correct internal IP, and I do not always get the IO
> error. Sometimes I get an error saying "could not retrieve
> https://10.1.1.1//install/sbin/public/kickstart.cgi?arch=i386&np=8".
>
> Note that in the "background" of the window, I can see the following:
> "(network.c.358) can't bind to port: 80 Address is already in use". Is
> this a clue to the IO error, and is there a fix?
>
> If I hit cancel for each image the node is looking to find, I eventually
> get back to the screen asking for an IP address and directory of the
> distro files. Nothing seems to work. I also tried the suggestion:
>
>> First try rebuilding the distribution on the frontend (make sure you
>> have enough disk space/verify that none of your frontend partitions
>> are full)
>>
>> # cd /export/rocks/install
>> # rocks create distro
>>
>> Then rebuild/reinstall your compute node
>>
>
> This did indeed seem to build the distro on the head node. But again,
> using
>
> "# rocks set host boot compute-0-5 action=install"
>
> caused the compute node to start an install, but it stopped at the same
> point. I was also unable to point the compute node to the distro on the
> head node (unless it was in a new directory I couldn't find).
>
> I need another suggestion or two. This system does not have DVD drives,
> so I'll need CDs to reinstall the compute node. Any idea why I wasn't
> able to use my previous CDs to reinstall the compute node?
>
> Thanks,
>
> --
> Peter R. Hoyt, Ph. D.
> Oklahoma State University
--
Philip Papadopoulos, PhD
University of California, San Diego
858-822-3628 (Ofc)
619-331-2990 (Fax)
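Hey Peter,
what is the output of
# wget --no-check-certificate
'https://10.1.1.1//install/sbin/public/kickstart.cgi?arch=i386&np=8'
if you run it from the frontend?
Does
rocks list host profile <nameoftheproblematicnode>
return a proper XML file?
What do you see in the Apache access and error logs?
/var/log/httpd/ssl_access and /var/log/httpd/ssl_error
Sincerely,
Luca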
--2012-01-13 09:49:53--
https://10.1.1.1//install/sbin/public/kickstart.cgi?arch=i386&np=8
Connecting to 10.1.1.1:443... connected.
WARNING: cannot verify 10.1.1.1's certificate, issued by `/O=Oklahoma
State
University/OU=YENKO454-CA/emailAddress=peter....@okstate.edu/L=Stillwater/ST=OK/C=US/CN=yenko.hbrc.okstate.edu':
Self-signed certificate encountered.
WARNING: certificate common name `yenko.hbrc.okstate.edu' doesn't match
requested host name `10.1.1.1'.
HTTP request sent, awaiting response... 404 Not Found
> Does the
> rocks list host profile <nameoftheproblematicnode>
> returns a proper xml file?
Looks like it does: the head of the file is:
<?xml version="1.0" standalone="no"?>
<profile lang="kickstart">
<section name="kickstart">
<![CDATA[
#
# Node Traversal Order
#
# ./nodes/partitions-save.xml (base)
# ./nodes/python-development.xml (base)
# ./nodes/firewall.xml (base)
> What do you see in the apache access log?
> /var/log/httpd/ssl_access and /var/log/httpd/ssl_error
ssl_access.log tail results:
10.1.255.254 - - [12/Jan/2012:15:41:30 -0600] "GET
/install/sbin/public/kickstart.cgi?arch=i386&np=8 HTTP/1.0" 404 305
10.1.255.254 - - [12/Jan/2012:15:41:31 -0600] "GET
/install/sbin/public/kickstart.cgi?arch=i386&np=8 HTTP/1.0" 404 305
10.1.255.254 - - [12/Jan/2012:15:41:32 -0600] "GET
/install/sbin/public/kickstart.cgi?arch=i386&np=8 HTTP/1.0" 404 305
10.1.255.254 - - [12/Jan/2012:15:41:33 -0600] "GET
/install/sbin/public/kickstart.cgi?arch=i386&np=8 HTTP/1.0" 404 305
10.1.1.1 - - [13/Jan/2012:09:46:54 -0600] "GET
//install.sbin/public/kickstart.cgi?arch=i386&np=8 HTTP/1.0" 404 305
10.1.1.1 - - [13/Jan/2012:09:49:53 -0600] "GET
//install/sbin/public/kickstart.cgi?arch=i386&np=8 HTTP/1.0" 404 305
ssl_error, tail results:
[Thu Jan 12 15:41:31 2012] [error] [client 10.1.255.254] File does not
exist: /var/www/html/install
[Thu Jan 12 15:41:32 2012] [error] [client 10.1.255.254] File does not
exist: /var/www/html/install
[Thu Jan 12 15:41:33 2012] [error] [client 10.1.255.254] File does not
exist: /var/www/html/install
[Fri Jan 13 09:46:54 2012] [error] [client 10.1.1.1] File does not
exist: /var/www/html/install.sbin
[Fri Jan 13 09:49:53 2012] [error] [client 10.1.1.1] File does not
exist: /var/www/html/install
The dates are odd because I was working on the system yesterday (Jan 12)
and no errors show up.
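The repeated ssl_error entries above come down to a small set of missing paths. A minimal sketch for condensing them, assuming each entry sits on a single line in the real log file (the lines above are wrapped by the mail client); the function name missing_paths is made up for illustration:

```shell
# Print each distinct path that Apache reported as missing.
# Reads error-log lines on stdin; lines without the phrase are ignored.
missing_paths() {
    sed -n 's/.*File does not exist: //p' | sort -u
}

# Example (log path as named elsewhere in this thread; adjust as needed):
# missing_paths < /var/log/httpd/ssl_error_log
```

Fed the entries quoted above, this would list /var/www/html/install and /var/www/html/install.sbin once each, which makes it easier to see that the whole install tree is unreachable rather than a single file.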
That's the problem, and it's in the Apache error log: /var/www/html/install
does not exist. You should have a symbolic link in /var/www/html:
# ls -l /var/www/html/
total 44
drwxr-xr-x 7 root root 4096 Jan 5 16:31 ganglia
drwxr-xr-x 2 root root 4096 Jan 5 16:30 images
-rw-r--r-- 1 root apache 199 Jan 3 18:18 index.html
lrwxrwxrwx 1 root root 21 Jan 3 18:12 install -> /export/rocks/install
drwxr-xr-x 2 root root 4096 Jan 3 18:04 misc
And
# ls -l /export/rocks/install/
total 304
drwxr-xr-x 2 root root 282624 Jan 12 19:07 cachedir
drwxr-xr-x 3 root root 4096 Jan 3 18:12 contrib
drwxr-xr-x 3 root root 4096 Jan 12 19:06 rocks-dist
drwxr-xr-x 13 root root 4096 Jan 12 19:06 rolls
drwxr-xr-x 2 root root 4096 Jan 3 18:12 sbin
drwxr-xr-x 3 root root 4096 Jan 3 18:05 site-profiles
Sincerely,
Luca
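The check Luca describes can be scripted. This is only a sketch: the two paths are taken from his listing, and the helper name check_link is invented here, not part of Rocks.

```shell
# Succeed only if LINK is a symbolic link whose target is exactly TARGET.
check_link() {
    link=$1
    target=$2
    [ -L "$link" ] && [ "$(readlink "$link")" = "$target" ]
}

# Paths as shown in the listing above; run this on the frontend.
if check_link /var/www/html/install /export/rocks/install; then
    echo "install link OK"
else
    echo "install link missing or wrong"
fi
```

If the link is reported missing, recreating it with the standard `ln -s /export/rocks/install /var/www/html/install` should match the layout shown in the listing.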
But this fails, and the error is that it can't find the file on the
server. The ssl_error log has new entries looking for a key(?):
[Thu Jan 12 15:41:33 2012] [error] [client 10.1.255.254] File does not
exist: /var/www/html/install
[Fri Jan 13 09:46:54 2012] [error] [client 10.1.1.1] File does not
exist: /var/www/html/install.sbin
[Fri Jan 13 09:49:53 2012] [error] [client 10.1.1.1] File does not
exist: /var/www/html/install
[Fri Jan 13 13:36:06 2012] [error] [client 10.1.255.254] cat:
/etc/security/ca/new-certs/cert.e15471/key: No such file or directory
[Fri Jan 13 13:36:06 2012] [error] [client 10.1.255.254] cat:
/etc/security/ca/new-certs/cert.s15480/key: No such file or directory
Am I missing another symbolic link? The last error is looking at
10.1.255.254 which is (I believe) the internal IP of the compute node
with the problem. Should I start over with insert-ethers? Is there a
special way to remove the lockfile? (Remember, I'm just a biologist).
Thanks again,
Peter
Sincerely,
Luca
PS: your emails have a problem. Their subject should start with
"Re:", like this one: "Re: [Rocks-Discuss] Rocks reinstall....", but
somehow you or your mail client keeps removing the "Re: " part, and
this messes up the mailing list threading. Just hitting the reply
button and not modifying the subject should work fine.
Yes, I was having email problems, and finally had to delete my RSS feed
and sign up for individual emails in order to reply properly.
Outputs:
> -bash-3.2$ ls -l /export/rocks/install
> total 432
> drwxr-xr-x 2 root root 413696 Aug 22 11:58 cachedir
> drwxr-xr-x 3 root root 4096 Aug 22 12:20 contrib
> drwxr-xr-x 2 root root 4096 Aug 22 12:54 images
> drwxr-xr-x 3 root root 4096 Dec 21 17:35 rocks-dist
> drwxr-xr-x 11 root root 4096 Aug 22 12:00 rolls
> drwxr-xr-x 2 root root 4096 Aug 22 12:21 sbin
> drwxr-xr-x 3 root root 4096 Aug 22 12:11 site-profiles
> -bash-3.2$
> -bash-3.2$ ls -l /export/rocks/install/sbin
> total 60
> -rwxr-xr-x 1 root root 36219 Nov 2 2010 kickstart.cgi
> lrwxrwxrwx 1 root root 1 Aug 22 12:21 public -> .
> -rwxr-xr-x 1 root root 6152 Nov 2 2010 setDbPartitions.cgi
> -rwxr-xr-x 1 root root 4496 Nov 2 2010 setPxeboot.cgi
> -rwxr-xr-x 1 root root 4724 Nov 2 2010 setPxeMode.cgi
> -bash-3.2$
Anything interesting? (The "public" link looks unusual to me.)
Thanks again. At least I can reply now.
Peter
--
I removed the lock on insert-ethers, but when insert-ethers runs, it
is attempting to use a new internal network IP (10.1.0.0 vs 10.1.1.1).
Is there any way to change this back?
Unfortunately, I've made things worse by issuing:
insert-ethers --replace compute-0-0
Now compute-0-0 is gone from my network, and I can't add it back if
insert-ethers wants to use a new IP.
Thanks and I apologize for my errors.
Pete
So the initial problem was installing compute-0-5. Now that you have made
the symbolic link in /var/www/html, what is the problem?
What changed after that?
If you connect a monitor to the node during the installation, what
error do you see after you re-created the link?
What happens now if you run the wget command I gave you before?
On Tue, Jan 17, 2012 at 1:28 PM, peter....@okstate.edu
<peter....@okstate.edu> wrote:
> Hi Luca,
>
> I removed the lock on the insert-ethers, but when insert-ethers runs, it is
> attempting to use a new internal network IP (10.1.0.0 vs 10.1.1.1). Is there
> anyway to change this back?
that is normal....
>
> Unfortunately, I've made things worse by issuing:
> insert-ethers --replace compute-0-0
>
> Now, compute0-0 is gone from my network, and I can't add it back if
> insert-ethers wants to use a new IP.
To force a particular hostname you can use
insert-ethers --basename compute --cabinet 0 --rank 0
Luca
After you helped me find and fix the missing symbolic link, the compute
node still wants to re-install ROCKS, but the installation proceeded a
little farther, trying to "retrieve" the kickstart keys. The install
fails at this point saying that it could not find the files on the FE
(10.1.1.1).
All these errors are from a monitor attached to the compute-0-0 directly.
However, I stupidly removed the lockfile on insert-ethers so that I
could try to run insert-ethers, and add the node back to my cluster. I
then sent the following commands:
insert-ethers --replace compute-0-0
Insert-ethers does run, but now it is trying to set up a cluster using
10.1.0.0 instead of 10.1.1.1. Insert-ethers does NOT see my other nodes
(which I did not restart). Will I now have to add back all my nodes to
use a new 10.1.0.0 internal IP?
If I do wget --no-check-certificate
'https://10.1.1.1//install/sbin/public/kickstart.cgi?arch=i386&np=8'
from the FE now I get:
> # --2012-01-18 11:37:26--
> https://10.1.1.1//install/sbin/public/kickstart.cgi?arch=i386&np=8
> Connecting to 10.1.1.1:443... connected.
> WARNING: cannot verify 10.1.1.1's certificate, issued by `/O=Oklahoma
> State
> University/OU=YENKO454-CA/emailAddress=peter....@okstate.edu/L=Stillwater/ST=OK/C=US/CN=yenko.hbrc.okstate.edu':
> Self-signed certificate encountered.
> WARNING: certificate common name `yenko.hbrc.okstate.edu' doesn't
> match requested host name `10.1.1.1'.
> HTTP request sent, awaiting response... 200 OK
> Length: 320990 (313K) [application/octet-stream]
> Saving to: `kickstart.cgi?arch=i386&np=8'
>
> 100%[===================================================================================================>]
> 320,990 --.-K/s in 0.04s
>
> 2012-01-18 11:37:33 (8.69 MB/s) - `kickstart.cgi?arch=i386&np=8' saved
> [320990/320990]
> #
I'm still lost. Thanks for your help.
Peter
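The useful difference between this wget run and the earlier one is the HTTP status: 404 before the symbolic link was restored, 200 now. A small sketch for pulling just the status out of wget's console output; wget_status is a made-up helper name, and it assumes the "awaiting response..." wording shown in the transcripts above.

```shell
# Print the HTTP status code(s) found in wget console output read on stdin.
wget_status() {
    sed -n 's/.*awaiting response\.\.\. \([0-9][0-9][0-9]\).*/\1/p'
}

# Example, run on the frontend (wget writes its progress to stderr):
# wget --no-check-certificate -O /dev/null \
#   'https://10.1.1.1//install/sbin/public/kickstart.cgi?arch=i386&np=8' 2>&1 | wget_status
```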
On Wed, Jan 18, 2012 at 9:41 AM, peter....@okstate.edu
<peter....@okstate.edu> wrote:
> Sorry Luca I'll try to explain.
> Initial problem was a compute node (compute-0-0) that had a disk failure. I
> tried to hotswap the disk. Upon "rebuilding" the disk, the compute node was
> visible to the head node, but wanted to reinstall ROCKS. Unfortunately I was
> getting an HTTP IO error after looking for the kickstart keys.
>
> After you helped me find and fix the missing symbolic link, the compute node
> still wants to re-install ROCKS, but the installation proceeded a little
> farther, trying to "retrieve" the kickstart keys. The install fails at this
> point saying that it could not find the files on the FE (10.1.1.1).
So the "retrieve kickstart keys" step comes before the HTTP step. That is
what I'm not getting; it's as if we took a step backward.
Can you please start the installation of compute-0-0 again,
connect a monitor to the node, and press Ctrl+Alt+F3?
You should then see the logs of what is going on.
What are the last 5 lines you see there before the machine gets stuck?
>
> All these errors are from a monitor attached to the compute-0-0 directly.
>
> However, I stupidly removed the lockfile on insert-ethers so that I could
> try to run insert-ethers, and add the node back to my cluster. I then sent
> the following commands:
> insert-ethers --replace compute-0-0
>
> Insert-ethers does run, but now it is trying to set up a cluster using
> 10.1.0.0 instead of 10.1.1.1. Insert-ethers does NOT see my other nodes
> (which I did not restart). Will I now have to add back all my nodes to use a
> new 10.1.0.0 internal IP?
That is not the problem.
10.1.0.0 is the subnetwork that insert-ethers is using. It means the
network containing all IPs of the form 10.1.X.X, which is fine in your
case.
insert-ethers displays only newly discovered hosts (so hosts that were
already discovered are not shown in the list).
If you want to see the list of "your" hosts (i.e. already discovered hosts),
run "rocks list host" (can you post the output?).
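To make the subnet point concrete: 10.1.0.0 is what you get when the frontend address 10.1.1.1 is combined with the netmask. The sketch below assumes a netmask of 255.255.0.0, which is inferred from the "10.1.X.X" description and not confirmed anywhere in this thread; network_of is a made-up helper name.

```shell
# Print the network address for a dotted-quad IP and netmask.
network_of() {
    ip=$1
    mask=$2
    old_ifs=$IFS
    IFS=.
    set -- $ip $mask    # split both values into eight octets: $1..$4 and $5..$8
    IFS=$old_ifs
    # Bitwise AND of each IP octet with the matching netmask octet.
    echo "$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
}

network_of 10.1.1.1 255.255.0.0       # frontend address from this thread
network_of 10.1.255.254 255.255.0.0   # client address seen in the Apache logs
```

Both calls print 10.1.0.0, so the frontend and the node are on the same network that insert-ethers reports.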
>
> If I do wget --no-check-certificate
> 'https://10.1.1.1//install/sbin/public/kickstart.cgi?arch=i386&np=8' from
> the FE now I get:
>>
>> # --2012-01-18 11:37:26--
>> https://10.1.1.1//install/sbin/public/kickstart.cgi?arch=i386&np=8
>>
>> Connecting to 10.1.1.1:443... connected.
>> WARNING: cannot verify 10.1.1.1's certificate, issued by `/O=Oklahoma
>> State
>> University/OU=YENKO454-CA/emailAddress=peter....@okstate.edu/L=Stillwater/ST=OK/C=US/CN=yenko.hbrc.okstate.edu':
>> Self-signed certificate encountered.
>> WARNING: certificate common name `yenko.hbrc.okstate.edu' doesn't match
>> requested host name `10.1.1.1'.
>> HTTP request sent, awaiting response... 200 OK
>> Length: 320990 (313K) [application/octet-stream]
>> Saving to: `kickstart.cgi?arch=i386&np=8'
>>
>>
>> 100%[===================================================================================================>]
>> 320,990 --.-K/s in 0.04s
>>
>> 2012-01-18
>> 11:37:33 (8.69 MB/s) - `kickstart.cgi?arch=i386&np=8' saved [320990/320990]
>> #
>
What's the content of this file?
What do the last lines of
/var/log/httpd/ssl_access_log
and
/var/log/httpd/ssl_error_log
look like after you try to install compute-0-0 and it fails?
Sincerely,
Luca
Result of "rocks list host":
> -bash-3.2$ rocks list host
> HOST MEMBERSHIP CPUS RACK RANK RUNACTION INSTALLACTION
> yenko: Frontend 8 0 0 os install
> compute-0-4: Compute 8 0 4 os install
> compute-0-1: Compute 8 0 1 os install
> compute-0-3: Compute 8 0 3 os install
So, compute-0-0 now thinks it is compute-0-4. I saw this happening when
I last used insert-ethers.
> So the retrieve kickstart keys step is before the http step, that is
> what Im not getting, it's like we did a step backward.
Earlier the window said "looking for kickstart keys", went blank, and
then had an HTTP IO-error. After putting in the symbolic link, the
"looking for kickstart keys", goes blank, and is followed by "retrieving
kickstart keys", then gave a "could not find keys" error.
> What are the last 5 lines you see there before the machine gets stuck?
There are 7 lines that say:
> INIT cannot execute "/sbin/mingetty"
> INIT cannot execute "/sbin/mingetty"
> INIT cannot execute "/sbin/mingetty"
> INIT cannot execute "/sbin/mingetty"
> INIT cannot execute "/sbin/mingetty"
> INIT cannot execute "/sbin/mingetty"
> INIT Id "1" respawning too fast: disabled for 5 min.
I did hang around for 5 minutes, and the same set of lines repeated themselves exactly.
Thanks Luca, maybe this will give you a hint.
I don't want to complicate things, but BEHIND the window that says
"could not find keys" error, there is a message in the background (off
centered/out of phase) that says:"2012-01-19 19:12:26: (network.c.358)
can't bind to port: 80 address already in use"
Peter
On 1/19/2012 11:56 AM, Luca Clementi wrote:
> Hey Peter,
>
> On Wed, Jan 18, 2012 at 9:41 AM, peter....@okstate.edu
> <peter....@okstate.edu> wrote:
>> Sorry Luca I'll try to explain.
>> Initial problem was a compute node (compute-0-0) that had a disk failure. I
>> tried to hotswap the disk. Upon "rebuilding" the disk, the compute node was
>> visible to the head node, but wanted to reinstall ROCKS. Unfortunately I was
>> getting an HTTP IO error after looking for the kickstart keys.
>>
>> After you helped me find and fix the missing symbolic link, the compute node
>> still wants to re-install ROCKS, but the installation proceeded a little
>> farther, trying to "retrieve" the kickstart keys. The install fails at this
>> point saying that it could not find the files on the FE (10.1.1.1).
> So the retrieve kickstart keys step is before the http step, that is
> what Im not getting, it's like we did a step backward.
>
> Can you please try to start the installation of the node compute-0-0
> and then connect to a node and hit ctrl+Alt and F3 key at the same
> time. Then you should see the logs of what is going on.
>
All those processes appear to be running. There are eight instances of
apache running httpd, but otherwise everything looked normal.
Peter
For this problem you can use
insert-ethers --basename compute --cabinet 0 --rank 0
>
>
>> So the retrieve kickstart keys step is before the http step, that is
>> what Im not getting, it's like we did a step backward.
>
> Earlier the window said "looking for kickstart keys", went blank, and then
> had an HTTP IO-error. After putting in the symbolic link, the "looking for
> kickstart keys", goes blank, and is followed by "retrieving kickstart keys",
> then gave a "could not find keys" error.
>
>
Peter, maybe you have a hardware problem with your system.
The keys the Rocks installer is looking for are on the local ramdisk (if
I'm not wrong).
The missing symbolic link, the keys not being there...
Can you check the disk of your frontend?
First run
dmesg
and see if there are any I/O errors in the output.
Then you can reboot your frontend with:
shutdown -rF now
and at the next boot you can watch the disk check (it's hidden behind the
CentOS splash screen; you have to click on "show details").
Sincerely,
Luca
Hope that helps.
-Ed
No obvious hardware problems. However when I rebooted the front end, I
now get
> ]# rocks list host boot
> HOST ACTION
> yenko: -------
> compute-0-4: install
> compute-0-1: os
> compute-0-3: os
> [root@yenko ~]#
> ]# rocks list host
> HOST MEMBERSHIP CPUS RACK RANK RUNACTION INSTALLACTION
> yenko: Frontend 8 0 0 os install
> compute-0-4: Compute 8 0 4 os install
> compute-0-1: Compute 8 0 1 os install
> compute-0-3: Compute 8 0 3 os install
Shouldn't "rocks list host boot" say "os"?
Also, the Gnome desktop seems to have changed. I'll have to go to the
server room and check it out.
Thanks,
pete
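For a quick view of which nodes are flagged to reinstall at their next PXE boot, the "rocks list host boot" output can be filtered by its ACTION column. A sketch, assuming the two-column layout shown above; hosts_with_action is an invented helper, not a Rocks command.

```shell
# Read "rocks list host boot" output on stdin and print the hosts whose
# ACTION column equals the first argument (header line is skipped).
hosts_with_action() {
    awk -v want="$1" 'NR > 1 && $2 == want { sub(/:$/, "", $1); print $1 }'
}

# Example, run on the frontend:
# rocks list host boot | hosts_with_action install
```

With the listing above, filtering on "install" would print only compute-0-4.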
When I run "insert-ethers --basename compute --cabinet 0 --rank 0"
the insert-ethers window starts and I don't know where to proceed from there. Do I reboot the troublesome node and get it recognized? I would really like to rename compute-0-4 back to compute-0-0 because that is the compute node listed in the .ssh file. I don't think I can communicate with that node until it is back to being compute-0-0.
Also, when I rebooted my FE, as you instructed:
> And then you can reboot your frontend with:
> shutdown -rF now
> and then at the next boot you can see the disk check (it's hidden in
> centos slpash screen, you have to click on show details).
I did not see the disk-check you are referring to. Is there a file I can access? The other nodes are fine, it's just this one node. If I could just get it to re-install. I am willing to reformat and re-initialize the disks in this node if that's the next option. Is there any info I could get from booting this node using the CentOS "Live" CD?
Thanks,
Peter
Insert-ethers is a tool to intercept PXE boots, detect MAC addresses, and create entries in the rocks database for new nodes. It does have a few (leftover) features like removing database entries, but those have been largely moved to "/opt/rocks/bin/rocks" commands.
So in your case, to rename compute-0-4 to compute-0-0, you can do one of two things:
Version one:
1. Run "insert-ethers --replace compute-0-4"
2. Pick "compute" from list.
3. Make that node PXE boot.
4. When the "( )" turns into a "(*)", indicating that the node has received a kickstart file, exit.
Version two:
1. Run "rocks remove host compute-0-4"
2. Run "rocks sync config; rocks sync users"
3. Run "insert-ethers --cabinet 0 --rank 0"
4. Pick "compute" from list.
5. Make that node PXE boot.
6. When the "( )" turns into a "(*)", indicating that the node has received a kickstart file, exit.
To re-install a node that already has an entry in the rocks database (i.e. a "known" node):
1. Run "rocks set host boot compute-0-0 action=install"
2. Make that node PXE boot.
When the node is either installed or re-installed, the OS is completely new. This includes things like the node's SSH ID (values you would put in ~/.ssh/known_hosts or /etc/ssh/ssh_known_hosts).
But you can still currently do a "ssh compute-0-4", and after it's been installed as compute-0-0, you can "ssh compute-0-0".
I think you can detect a hardware problem by reading `dmesg` and/or looking in the logs. No need to do a filesystem check (fsck) as suggested by Luca. If you really want to be sure, you can boot your FE from a LiveCD, and run "fsck" on each partition of your FE. Keep in mind that if you have some large userdata section, it could take a really, really long time to run. For now, run fsck only on the /, /var, /boot, etc. partitions. I can't give you an exact list, because I don't know how you partitioned your FE. But you get the idea: check the local OS partitions.
I suspect that most of your problems stem from running insert-ethers too many times when you should have used "rocks set host boot compute-0-0 action=install".
Bart
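The distinction drawn above (known node: "rocks set host boot"; unknown node: insert-ethers) can be turned into a small check against the "rocks list host" output. This is a sketch only; host_known and suggest_command are hypothetical helper names, and the column layout is assumed to match the listings posted earlier in the thread.

```shell
# Read "rocks list host" output on stdin; succeed if host $1 is listed.
host_known() {
    awk -v h="$1:" 'NR > 1 && $1 == h { found = 1 } END { exit !found }'
}

# Read "rocks list host" output on stdin and print which command applies
# to host $1: a reinstall for a known node, discovery for an unknown one.
suggest_command() {
    if host_known "$1"; then
        echo "rocks set host boot $1 action=install"
    else
        echo "insert-ethers"
    fi
}

# Example, run on the frontend:
# rocks list host | suggest_command compute-0-0
```

Checking first avoids running insert-ethers against a node that already has a database entry.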
On Mon, Jan 23, 2012 at 12:15 PM, peter....@okstate.edu
<peter....@okstate.edu> wrote:
> This is getting too frustrating.
>
> When I run "insert-ethers --basename compute --cabinet 0 --rank 0"
>
> the insert-ethers window starts and I don't know where to proceed from
> there. Do I reboot the troublesome node and get it recognized? I would
> really like to rename compute-0-4 back to compute-0-0 because that is the
> compute node listed in the .ssh file. I don't think I can communicate with
> that node until it is back to being compute-0-0.
>
The name difference should not cause any problem.
If you want to change it, you have to remove compute-0-4 with
rocks remove host compute-0-4
and then rediscover the host using insert-ethers with the --cabinet
and --rank options, and then reboot your host.
> Also, when I rebooted my FE, as you instructed:
>
>> And then you can reboot your frontend with:
>> shutdown -rF now
>> and then at the next boot you can see the disk check (it's hidden in
>> centos slpash screen, you have to click on show details).
>
> I did not see the disk-check you are referring to. Is there a file I can
> access?
Not that I'm aware of.
(It's hidden behind the CentOS splash screen; you have to click on "show
details" during the boot.)
> The other nodes are fine, it's just this one node. If I could just
> get it to re-install. I am willing to reformat and re-initialize the disks
> in this node if that's the next option. Is there any info I could get from
> booting this node using the CentOS "Live" CD?
To wipe out the disk data you can use gparted:
http://gparted.sourceforge.net/
Boot from the gparted CD, delete all the partitions you have on the
disk, and recreate one big partition
that takes up the whole disk; that should wipe all the data from the disk.
Then try to reinstall the node with insert-ethers as indicated above.
Sincerely,
Luca
Using Ed's advice to convert compute-0-4 to compute-0-0, with some
obvious problems, I got the compute node renamed:
1. Run "insert-ethers --replace compute-0-4" ---> no problem
2. Pick "compute" from list. ---> no problem
3. Make that node PXE boot. ---> no problem
4. When the "( )" turns into a "(*)", indicating that the node has
received a kickstart file, exit. ---PROBLEM!
At this step, the instant that the insert-ethers window said it had
recognized a new appliance, the insert-ethers window CRASHED and
disappeared. I tried to restart insert-ethers quickly, but after a few
seconds the window filled up with a few lines of unrecognizable
characters. I could not use "F8" or "F9" to get out of insert-ethers, but
clicking on the upper right "X" did exit the window.
a. Running "rocks list host interface" showed that compute-0-4 was now
gone. So with Compute-0-4 gone, I tried
> 2. Run "rocks sync config; rocks sync users"
> 3. Run "insert-ethers --cabinet 0 --rank 0"
> 4. Pick "compute" from list.
> 5. Make that node PXE boot.
> 6. When the "( )" turns into a "(*)", indicating that the node has
> received a kickstart file, exit.
Which worked fine. I now had a compute-0-0 but the crash of
insert-ethers was disturbing.
I ran :"rocks set host boot compute-0-0 action=install"
Upon rebooting the node using PXE, the same error: "failed to connect to
HTTP server" AFTER the "retrieving kickstart files" window.
b. I checked all my paths, and checked all my symbolic links. All good.
So I tried something different. Instead of connecting to "10.1.1.1", I
tried to connect to the head node using the external IP address. This
actually started to work!
"Running anaconda..."
"Probing for video card..."
"formatting file system..."
"formatting /var file system.."
then STOP! error says:
"Unable to read package metadata. This may be due to a missing repo data
directory. Please ensure that your install tree has been correctly
generated. Cannot retrieve repository metadata (repomd.xml) for
repository anaconda-base-201101061733.i386."
"Please verify its path and try again"
<abort> or <continue>
I hit continue, and another error said:
"Unable to read group info from repositories. This is a problem with the
generation of your install tree."
I was unable to switch my video back to the front end after this. So
rather than reboot my front end blindly, I first rebooted my compute-0-0
and then got another error: "exception bug..." Which I was able to save
to the FE.
The exception-bug file is a long file. Most of it looks like the
installation is going okay. The critical errors seemed to start at this
point:
> 22:59:25 INFO : Running kickstart %%pre script(s)
> 22:59:27 INFO : All kickstart %%pre script(s) have been run
> 22:59:31 WARNING : step installtype does not exist
> 22:59:31 WARNING : step complete does not exist
> 22:59:31 WARNING : step complete does not exist
> 22:59:31 WARNING : step complete does not exist
> 22:59:31 WARNING : step complete does not exist
> 22:59:31 WARNING : step complete does not exist
> 22:59:33 CRITICAL: IOError 14 occurred getting
> http://127.0.0.1/install/rocks-dist/i386/RELEASE-NOTES-en_US.UTF-8.html:
> HTTP Error 404: Not Found
> 22:59:33 CRITICAL: IOError 14 occurred getting
> http://127.0.0.1/install/rocks-dist/i386/RELEASE-NOTES.en_US.UTF-8:
> HTTP Error 404: Not Found
> 22:59:33 CRITICAL: IOError 14 occurred getting
> http://127.0.0.1/install/rocks-dist/i386/RELEASE-NOTES-en_US.html:
> HTTP Error 404: Not Found
> 22:59:33 CRITICAL: IOError 14 occurred getting
> http://127.0.0.1/install/rocks-dist/i386/RELEASE-NOTES.en_US: HTTP
> Error 404: Not Found
> 22:59:33 CRITICAL: IOError 14 occurred getting
> http://127.0.0.1/install/rocks-dist/i386/RELEASE-NOTES-en.html: HTTP
> Error 404: Not Found
> 22:59:33 CRITICAL: IOError 14 occurred getting
> http://127.0.0.1/install/rocks-dist/i386/RELEASE-NOTES.en: HTTP Error
> 404: Not Found
> 22:59:33 CRITICAL: IOError 14 occurred getting
> http://127.0.0.1/install/rocks-dist/i386/RELEASE-NOTES-C.html: HTTP
> Error 404: Not Found
> 22:59:33 CRITICAL: IOError 14 occurred getting
> http://127.0.0.1/install/rocks-dist/i386/RELEASE-NOTES.C: HTTP Error
> 404: Not Found
> 22:59:33 INFO : moving (1) to step partitionobjinit
Later, the next big set of errors seemed to start here:
> 23:00:09 INFO : moving (1) to step downloadrolls
> 23:00:09 INFO : moving (1) to step reposetup
> 23:00:09 DEBUG : repo time: 0.002
> 23:00:09 DEBUG : Setting up Package Sacks
> 23:00:09 ERROR : reading package metadata: Cannot retrieve
> repository metadata (repomd.xml) for repository:
> anaconda-base-201101061733.i386. Please verify its path and try again
> 23:08:47 INFO : no groups missing
> 23:08:47 INFO : moving (1) to step basepkgsel
> 23:08:47 DEBUG : Setting up Package Sacks
> 23:08:47 DEBUG : Checking for virtual provide or file-provide for
> ClusterTools_gnu
> 23:08:47 DEBUG : Setting up Package Sacks
> 23:08:47 DEBUG : rpmdb time: 0.000
> 23:08:47 DEBUG : no package matching ClusterTools_gnu
> 23:08:47 DEBUG : Setting up Package Sacks
> 23:08:47 DEBUG : Checking for virtual provide or file-provide for EMBOSS
> 23:08:47 DEBUG : Setting up Package Sacks
> 23:08:47 DEBUG : no package matching EMBOSS
> 23:08:47 DEBUG : Setting up Package Sacks
> 23:08:47 DEBUG : Checking for virtual provide or file-provide for
> OpenIPMI
> 23:08:47 DEBUG : Setting up Package Sacks
> 23:08:47 DEBUG : no package matching OpenIPMI
After about 100 "no package matching <packagename>" repetitions, the
exception bug ended abruptly with:
> 23:08:47 DEBUG : Setting up Package Sacks
> 23:08:47 DEBUG : no package matching yum
> 23:08:47 DEBUG : no such group Core
> 23:08:47 DEBUG : no such group Base
> 23:08:47 INFO : moving (1) to step postselection
> 23:08:47 DEBUG : Setting up Package Sacks
>
>
> /tmp/lvmout:
> Wiping cache of LVM-capable devices
> Wiping internal VG cache
> Finding all volume groups
> Reading all physical volumes. This may take a while...
Did you remember to remove the lock file for insert-ethers after the crash? insert-ethers is not a service like sshd, and a new instance cannot pick up where the last one left off, so quickly restarting it was pointless.
> a. Running "rocks list host interface" showed that compute-0-4 was now
> gone. So with Compute-0-4 gone, I tried
> > 2. Run "rocks sync config; rocks sync users"
> > 3. Run "insert-ethers --cabinet 0 --rank 0"
> > 4. Pick "compute" from list.
> > 5. Make that node PXE boot.
> > 6. When the "( )" turns into a "(*)", indicating that the node has
> > received a kickstart file, exit.
> Which worked fine. I now had a compute-0-0 but the crash of
> insert-ethers was disturbing.
So at that point, did compute-0-0 install correctly? It takes a while, something like 30 minutes to install (depending on your hardware, interconnect, etc.). Did you hook up a screen to compute-0-0 and verify that the install finished?
> I ran :"rocks set host boot compute-0-0 action=install"
> Upon rebooting the node using PXE, the same error: "failed to connect to
> HTTP server" AFTER the "retrieving kickstart files" window.
It's very odd that an initial install using insert-ethers would work, but a subsequent re-install would not work. Perhaps you interrupted it in the middle of an install, and left it in a half-hosed state.
> b. I checked all my paths, and checked all my symbolic links. All good.
> So I tried something different. Instead of connecting to "10.1.1.1" I
> tried to connect to head node using the external IP address.
Sorry, this too was pointless, as you eventually saw. Rocks is set up to work using the private IP range for PXE-based installs of compute nodes. If that doesn't work, then you should fix the problem rather than trying a work-around. It has worked for many, many other clusters, so something must be wrong with your install.
> This actually started to work!
> "Running anaconda..."
...
> "Unable to read group info from repositories. This is a problem with the
> generation of your install tree."
>
> I was unable to switch my video back to the front end after this. So
> rather than reboot my front end blindly, I first rebooted my compute-0-0
> and then got another error: "exception bug..." Which I was able to save
> to the FE.
...
> /tmp/lvmout:
> Wiping cache of LVM-capable devices
> Wiping internal VG cache
> Finding all volume groups
> Reading all physical volumes. This may take a while...
I'm not sure, but I thought that anaconda (the Redhat installer that Rocks uses) does not support LVM. Did this node initially have some LVM partitions on it? Perhaps you could boot compute-0-0 with a LiveCD and wipe all the partitions, so it's a "blank" disk before starting the re-install.
A few other things come to mind, suggestions for you to think about and/or try:
1. What are the two IP ranges and netmasks for your local public LAN and Rocks' private IP range? From your comments about "10.1.1.1" and connecting to the head node's public IP, I hope you don't mean that your public LAN uses 10.1.X.Y/24 and your private side uses 10.1.X.Y/16. Normally, a compute node has no route to the local public LAN except for being NATted by the headnode. So I don't see how it's possible for it, during the anaconda-based install, to have a route to your FE's public NIC.
2. Which rolls did you select? In one of my recent install attempts of Rocks 5.4.3, I selected some rolls from the Triton 5.4 tree, and some from the Stanford collection of rolls (some built for 5.4.3, some built for 5.4). Although the FE installed correctly, it would not insert compute nodes. The web page/wordpress was also not working, if that's a clue. What's the output on your FE of "rocks list roll"?
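The overlap concern in point 1 can be checked with a little arithmetic: an address belongs to a network when masking it with the netmask yields the network address. The sketch below uses hypothetical helpers (ip_to_int and in_subnet are names made up here), and the addresses are only the examples from this thread, not a known configuration.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to a 32-bit integer.
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split the address into its four octets
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# Return 0 if address $1 lies inside network $2 (CIDR form, e.g. 10.1.0.0/16).
in_subnet() {
    ip=$(ip_to_int "$1")
    net=$(ip_to_int "${2%/*}")
    bits=${2#*/}
    mask=$(( (4294967295 << (32 - $bits)) & 4294967295 ))
    [ $(( $ip & $mask )) -eq $(( $net & $mask )) ]
}

# 10.1.1.1 falls inside both ranges, so a /24 public LAN and a /16
# private network numbered this way would overlap.
in_subnet 10.1.1.1 10.1.1.0/24 && echo "10.1.1.1 is in 10.1.1.0/24"
in_subnet 10.1.1.1 10.1.0.0/16 && echo "10.1.1.1 is in 10.1.0.0/16"
in_subnet 192.168.5.7 10.1.0.0/16 || echo "192.168.5.7 is outside 10.1.0.0/16"
```

If the public address of the FE fails this test against the private range, the two networks are numbered separately and the overlap theory can be dropped.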
Bart Brashers
> drwxrwxr-x 6 root lock 4096 Jan 26 04:02 .
> drwxr-xr-x 28 root root 4096 Aug 22 12:45 ..
> -rw-r--r-- 1 root root 0 Jan 23 10:27 after-rc
> -rw-r--r-- 1 root root 0 Jan 23 10:26 before-rc
> drwxr-xr-x 2 root root 4096 Mar 31 2010 dmraid
> -rw-r--r-- 1 root root 0 Jan 23 10:26 irqbalance
> drwxr-xr-x 2 root root 4096 Jan 23 10:26 iscsi
> drwx------ 2 root root 4096 Aug 22 12:14 lvm
> drwxr-xr-x 2 root root 4096 Jan 26 04:02 subsys
2. you asked "So at that point, did compute-0-0 install correctly?"
ANSWER: No. And I did hook up a monitor... because this is a small
cluster I have a KVM switch. The install was stopped with the can't
locate kickstart file error
3. You commented: It's very odd that an initial install using
insert-ethers would work, but a subsequent re-install would not work.
Perhaps you interrupted it in the middle of an install, and left it in a
half-hosed state.
ANSWER: Because I have a KVM setup, I tried to be careful this did
not happen. But something happened, and "half-hosed" seems to describe
it well.
4. You commented: "Sorry, this too was pointless, as you eventually
saw. Rocks is set up to work using the private IP range for PXE-based
installs of compute nodes."
ANSWER: Yes, sorry. Just a Biologist as stated previously.
5: You Commented: "I'm not sure, but I thought that anaconda (the Redhat
installer that Rocks uses) does not support LVM. Did this node
initially have some LVM partitions on it? Perhaps you could boot
compute-0-0 with a LiveCD and wipe all the partitions, so it's a "blank"
disk before starting the re-install."
ANSWER: I have to admit, I'm not sure about this but I think not.
The Biologist in me says the other nodes are okay (as was this one until
the disk crashed), and they were not treated differently. I have a
directory "/var/lock/lvm" but it is empty (on the head node, 5 SCSI
disks). If I issue the df command I get:
> # df
> Filesystem     1K-blocks     Used Available Use% Mounted on
> /dev/sda1       15872604  4314572  10738720  29% /
> /dev/sda5       47719976 32671108  12585704  73% /state/partition1
> /dev/sda2        3968124   307960   3455336   9% /var
> tmpfs           12219672        0  12219672   0% /dev/shm
> tmpfs             886632     3456    883176   1% /var/lib/ganglia/rrds
If I issue fdisk:
> # fdisk -l
>
> Disk /dev/sda: 72.4 GB, 72469184512 bytes
> 255 heads, 63 sectors/track, 8810 cylinders
> Units = cylinders of 16065 * 512 = 8225280 bytes
>
>    Device Boot      Start         End      Blocks   Id  System
> /dev/sda1   *           1        2040    16386268+  83  Linux
> /dev/sda2            2041        2550     4096575   83  Linux
> /dev/sda3            2551        2677     1020127+  82  Linux swap / Solaris
> /dev/sda4            2678        8810    49263322+   5  Extended
> /dev/sda5            2678        8810    49263291   83  Linux
If I issue lvdisplay:
> # lvdisplay -v
> Finding all logical volumes
> #
You (and Luca) are suggesting I start over... at least on compute-0-0. I
believe that is the best option. Dang.
My public IP, located on the University network, is a University IP,
nothing close to 10.1... anything. I was surprised I could access it.
Finally, the results of "rocks list roll"
> # rocks list roll
> NAME VERSION ARCH ENABLED
> os: 5.4 i386 yes
> web-server: 5.4 i386 yes
> ganglia: 5.4 i386 yes
> base: 5.4 i386 yes
> hpc: 5.4 i386 yes
> bio: 5.4 i386 yes
> service-pack: 5.4.2 i386 yes
> kernel: 5.4 i386 yes
> area51: 5.4 i386 yes
Thanks again!
-p
Lock files are a way for a program to prevent other copies of itself from being launched. Let's say I have a program that opens and manipulates a database. I don't want two copies of the program making changes at the same time, because that would cause database corruption. So I make the program check for the existence of a little file somewhere (e.g. in /var/lock) when it starts, and exit immediately if the file exists. If the file does not exist, the program creates it and carries on. I also make the program remove the lock file when it exits normally. But if the program crashes in the middle, it won't remove the lock file -- a crash means it can't do anything else, it's dead.
So you normally won't see any lock file for insert-ethers, unless it crashed -- which you said it did. If you could immediately launch another instance of insert-ethers after the first one "crashed", then I suspect either (a) something is very wrong with your install, or (b) it didn't really crash, it exited normally. But I can't explain why the 2nd one you started wrote all that junk to the screen.
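The pattern described above can be sketched in a few lines of shell. This is a generic illustration, not how insert-ethers itself is implemented; the lock path and the function names acquire_lock and release_lock are made up for the example.

```shell
#!/bin/sh
# Hypothetical lock path for the demo; a real program would use something
# like /var/lock/<program-name>.
LOCKFILE="${TMPDIR:-/tmp}/demo-lockfile.$$"

# Try to take the lock. With noclobber (set -C) the redirection fails if
# the file already exists, so "check" and "create" happen in one step.
acquire_lock() {
    ( set -C; : > "$LOCKFILE" ) 2>/dev/null
}

# Remove the lock on a normal exit. A crashed program never gets here,
# which is why a stale lock file is left behind.
release_lock() {
    rm -f "$LOCKFILE"
}

if acquire_lock; then
    echo "first instance: got the lock"
fi
if ! acquire_lock; then
    echo "second instance: lock file $LOCKFILE exists, exiting"
fi
release_lock
```

If the first instance had crashed before calling release_lock, every later start would hit the "exists" branch until somebody removed the file by hand, which is the situation with /var/lock/insert-ethers.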
> 2. you asked "So at that point, did compute-0-0 install correctly?"
> ANSWER: No. And I did hook up a monitor... because this is a small
> cluster I have a KVM switch. The install was stopped with the can't
> locate kickstart file error
OK.
> 3. You commented: It's very odd that an initial install using
> insert-ethers would work, but a subsequent re-install would not work.
> Perhaps you interrupted it in the middle of an install, and left it in a
> half-hosed state.
> ANSWER: Because I have a KVM setup, I tried to be careful this did
> not happen. But something happened, and "half-hosed" seems to describe
> it well.
Well, now that we know it never installed correctly, I'd say the compute node is "blank" rather than "half hosed". You should be able to install on it, were your FE functioning correctly.
> 4. You commented: "Sorry, this too was pointless, as you eventually
> saw. Rocks is set up to work using the private IP range for PXE-based
> installs of compute nodes."
> ANSWER: Yes, sorry. Just a Biologist as stated previously.
OK.
> 5: You Commented: "I'm not sure, but I thought that anaconda (the Redhat
> installer that Rocks uses) does not support LVM. Did this node
> initially have some LVM partitions on it? Perhaps you could boot
> compute-0-0 with a LiveCD and wipe all the partitions, so it's a "blank"
> disk before starting the re-install."
> ANSWER: I have to admit, I'm not sure about this but I think not.
> The Biologist in me says the other nodes are okay (as was this one until
> the disk crashed), and they were not treated differently. I have a
> directory "/var/lock/lvm" but it is empty (on the head node, 5 SCSI
> disks). if I issue the df command I get:
OK, this just shows that there are no LVM partitions on the frontend. BTW, where are the other 4 SCSI disks on your FE mounted? Or are they not being used (yet)?
I was wondering if there were some leftover LVM partitions on compute-0-0. Can you do your calls to df, fdisk, and lvdisplay on compute-0-0, after booting from a LiveCD?
I suspect you won't find any LVM partitions there either, so this may be a waste of time.
> You (and Luca) are suggesting I start over... at least on compute-0-0. I
> believe that is the best option. Dang.
I think you misunderstand. We don't suggest you start over just on compute-0-0, we suggest you start over by re-installing your frontend. Start fresh with a new install of your cluster. You might consider taking the opportunity to upgrade to 5.4.3 while you're at it.
Speaking as an Atmospheric Scientist giving my $0.02 worth of advice to a Biologist, you can spend much more time trying to figure out what is wrong with your current Rocks install than it takes to install a new one. I've been there, recently. I wasted nearly a week trying to figure out why my install with a mix of Rocks, Triton, and Stanford rolls was slightly hosed (it was mostly working, but certain key bits were busted). When I bit the bullet and started fresh with just a basic set of rolls, it was up and running correctly in a few hours.
If you can, I suggest you NOT make a restore roll, so you don't inherit any problems. Look over the restore roll internals, and read the wiki entry
https://wiki.rocksclusters.org/wiki/index.php/Tips_and_tricks#Q._What_files_should_survive_a_frontend_re-install.2Fupgrade.3F,
and make some spot on a different disk (e.g. /dev/sdb) to copy these files (/etc/passwd, /etc/group, /etc/shadow, /etc/auto.home, and so on). Don't forget your extend*.xml files.
During the install, be sure to NOT FORMAT your disks with user data. Just format the /dev/sda disk with the Rocks install. You can even leave the /dev/sda5 partition containing your /state/partition1 (aka /export) alone, not re-format.
Then after the fresh install, you can merge the user data from the old versions of these files into the new versions -- they are just text files, so you can edit them by hand.
> My public IP, located on the University network, is a University IP,
> nothing close to 10.1... anything. I was surprised I could access it.
I'm really surprised too. Do you have your compute nodes on a separate switch, or at least a VLAN? If you log into one of your functioning nodes and type "traceroute www.redhat.com" does it show that the first step is your FE?
In particular, you must protect your compute nodes from any other DHCP servers that might be on your local public LAN. That's why we generally have a private switch or VLAN for them.
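One way to script that traceroute check is to pull the first hop out of the output and compare it with the frontend's private address. A minimal sketch, assuming "traceroute -n" prints one numbered hop per line; the first_hop helper and the sample output are illustrative, not captured from this cluster.

```shell
#!/bin/sh
# Print the address of hop 1 from `traceroute -n` output read on stdin.
first_hop() {
    awk '$1 == "1" { print $2; exit }'
}

# Illustrative sample only; on a node you would run something like:
#   traceroute -n www.redhat.com | first_hop
sample='traceroute to www.redhat.com (203.0.113.10), 30 hops max, 60 byte packets
 1  10.1.1.1  0.210 ms  0.190 ms  0.180 ms
 2  198.51.100.1  1.020 ms  0.980 ms  1.100 ms'

hop=$(printf '%s\n' "$sample" | first_hop)
if [ "$hop" = "10.1.1.1" ]; then
    echo "first hop is the frontend's private address"
else
    echo "first hop is $hop, not the frontend"
fi
```

Any first hop other than the FE's private address would suggest the node is reaching the outside world by some path that bypasses the frontend.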
> Finally, the results of "rocks list roll"
> > # rocks list roll
> > NAME VERSION ARCH ENABLED
> > os: 5.4 i386 yes
> > web-server: 5.4 i386 yes
> > ganglia: 5.4 i386 yes
> > base: 5.4 i386 yes
> > hpc: 5.4 i386 yes
> > bio: 5.4 i386 yes
> > service-pack: 5.4.2 i386 yes
> > kernel: 5.4 i386 yes
> > area51: 5.4 i386 yes
These look pretty vanilla to me. That rules out the "incompatible rolls" theory for why you can't install compute-0-0.
I see that you've re-created the rocks distro, but maybe you need to clean it first? You could try the following:
# cd /export/rocks/install
# /bin/rm -rf rocks-dist # removes the old distro
# rocks create distro
Then verify compute-0-0 is set to re-install upon next boot:
# rocks list host boot
Next, verify that Rocks can generate a valid kickstart file:
# rocks list host profile compute-0-0 > /tmp/c0.ks
# less /tmp/c0.ks # scan for errors/truncation, mine is about 117K in size.
Then PXE-boot your node.
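The checks above can be wrapped in a small script so they are easy to repeat before each PXE boot. This is a sketch under assumptions: the helper names are made up, the 10 KB size threshold is arbitrary, and the repodata/repomd.xml location under the distro tree is inferred from the anaconda error message rather than confirmed for this Rocks version.

```shell
#!/bin/sh
# Succeed if the tree at $1 has the yum metadata file anaconda asks for.
has_repomd() {
    [ -s "$1/repodata/repomd.xml" ]
}

# Succeed if the kickstart file $1 exists and is at least $2 bytes long
# (a very short file suggests an error message or a truncated profile).
ks_big_enough() {
    [ -f "$1" ] && [ $(wc -c < "$1") -ge "$2" ]
}

# Assumed paths, taken from the commands and log messages in this thread.
DISTRO=/export/rocks/install/rocks-dist/i386
KS=/tmp/c0.ks

if has_repomd "$DISTRO"; then
    echo "repomd.xml found"
else
    echo "repomd.xml missing under $DISTRO"
fi

if ks_big_enough "$KS" 10240; then
    echo "kickstart looks plausible"
else
    echo "kickstart missing or too small"
fi
```

A size check does not replace reading the file, but it catches the common case where the profile command printed only an error.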
Thanks very much for your detailed help. I am hesitant to re-install the
entire cluster. Mostly now because I am no good at setting up disks. I
better learn how to format and mount disks before I upgrade/re-install.
My SysAdmin left and this has become my responsibility. I enjoy the
learning, but have had enough in this round. Part of the "tree" problem
could be a couple of "home" directories that are acting like they are
hard-linked directories (???), but have different inode numbers (took
most of yesterday to learn those terms).
If you still have some patience would you give me a little guidance on
how to format using "points" on the disks? Or a concise tutorial.
Specifically you said: "make some spot on a different disk (e.g.
/dev/sdb) to copy these files (/etc/passwd, /etc/group, /etc/shadow,
/etc/auto.home, and so on). Don't forget your extend*.xml files." More
specifically when you ask me "BTW, where are the other 4 SCSI disks on
your FE mounted?" I really don't have a clue. I assumed they were all
being used. My original Linux "mentors" told me that the physical disks
would all just be one "big logical disk". Compute-0-0 is a raid5 array
and I believe the other nodes are the same (but have not
double-checked). I do understand the concepts of logical volumes vs
physical volumes. I have worked with partitions on disks, but these were
mostly in Windows computers. The mount points of Linux disks are still
mysterious regarding moving them around or otherwise assigning them...
much less doing it appropriately for organization and system
security/stability.
The FE has 5 SCSI disks (old 18GB U320 Cheetahs). The arrangement on the
head node is:
# df
Filesystem     1K-blocks     Used Available Use% Mounted on
/dev/sda1       15872604  4314572  10738720  29% /
/dev/sda5       47719976 32671108  12585704  73% /state/partition1
/dev/sda2        3968124   307968   3455328   9% /var
tmpfs           12219672        0  12219672   0% /dev/shm
tmpfs             886632     3456    883176   1% /var/lib/ganglia/rrds
But I didn't do anything on purpose. I just let ROCKS do its thing.
It's a mystery to me why /dev/sda5 appears to have 32 GB of data already.
I expected the OS to be fairly lean. Also, no sda3 or sda4 (or sdb)...
hidden as part of the raid5 setup? Because I'm starting over, I could
put five 148GB SCSIs on the head node, if it is going to "fill-up"
faster. For user data storage, the FE will have an external SATA-array
of several terabytes.
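A rough arithmetic check may explain why only /dev/sda shows up. Assuming the five 18 GB disks sit behind a hardware RAID-5 controller (an assumption; nothing above confirms it), the controller would present one logical disk with the capacity of four members, since one disk's worth of space goes to parity. The size fdisk reports can be tested against that:

```shell
#!/bin/sh
# Size of /dev/sda as reported by fdisk above, in bytes.
total_bytes=72469184512
disks=5

# RAID-5 usable capacity is that of (N - 1) members; the rest holds parity.
data_members=$((disks - 1))
per_disk=$((total_bytes / data_members))

echo "bytes per member disk: $per_disk"
echo "approx GB per member:  $((per_disk / 1000000000))"
```

That works out to roughly 18 GB per member, which matches the drive size, so the figures are at least consistent with a single RAID-5 volume appearing as /dev/sda.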
Off to read the CentOS manual...
Peter
--
Hung-Sheng Tsao Ph D.
Founder& Principal
HopBit GridComputing LLC
cell: 9734950840
http://laotsao.blogspot.com/
http://laotsao.wordpress.com/
http://blogs.oracle.com/hstsao/