[Rocks-Discuss] Nodes Fail to Kickstart


Juan M Vanegas

Aug 6, 2008, 3:22:53 PM
to npaci-rocks...@sdsc.edu
Hi,

We have recently set up a local Rocks 5.0 cluster and have been having
problems reinstalling the nodes. After a power failure that brought down
several of the nodes, all of the nodes but one were able to kickstart
correctly from the frontend. The one node would stop at the "select
language" screen. Since then, other nodes have needed to be
restarted, but they no longer kickstart correctly. Kickstarting
works if I boot the nodes from the install CD or if I remove a node
completely and add it back with insert-ethers. I have tried to read all
the posts from other users having problems with kickstarting, but I cannot
find the problem. The second console on the failed installation node says

ROCKS:rocksNetworkUp:no network devices in choose network device!
got to setupCdrom without a CD device


which may indicate that it cannot find a network device to use (these
nodes use the forcedeth driver), but it says it loaded the modules
correctly before it gave the error.

Things that I have tried:

- Removed the rocks-dist folder and recreated the distribution
- Verified that there are no customized xml files

The frontend node is a 64-bit AMD machine and the compute nodes are
64-bit AMD as well. The installed rolls are the ones included in the
jumbo DVD (base, bio, ganglia, hpc, java, kernel, os, sge, web-server,
xen).

Any help or suggestions would be greatly appreciated.

Sincerely,


Juan M. Vanegas
Biophysics Graduate Group
University of California, Davis

Anoop Rajendra

Aug 7, 2008, 1:20:26 PM
to Discussion of Rocks Clusters
On the frontend,

what is the output of

# rocks list host pxeboot

-a
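[For readers of the archive: the output of this command looks roughly
like the sketch below. The hostnames and column layout are illustrative,
not copied from this cluster; the key point, confirmed later in the
thread, is the per-host action of "install" vs. "os".]

```shell
# Sketch of what "rocks list host pxeboot" reports (illustrative output).
# An action of "install" makes the node kickstart on its next PXE boot;
# "os" makes it boot from its local disk instead.
rocks list host pxeboot
# HOST         ACTION
# compute-0-0: os
# compute-0-1: install
```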

Jon Forrest

Aug 7, 2008, 1:37:30 PM
to Discussion of Rocks Clusters
Juan M Vanegas wrote:
> Hi,
>
> We have recently setup a local Rocks 5.0 cluster and have been having
> problems reinstalling the nodes. After a power failure that brought down
> several of the nodes, all of the nodes but one was able to kickstart
> correctly from the frontend. The one node would stop at the "select
> language" screen.

I've seen this happen when 'insert-ethers' dies. In my case it
died because I didn't run 'rocks sync config' after removing
a node from the database (using rocks remove host).

I don't fully understand this either.

Cordially,
--
Jon Forrest
Research Computing Support
College of Chemistry
173 Tan Hall
University of California Berkeley
Berkeley, CA
94720-1460
510-643-1032
jlfo...@berkeley.edu

Hamilton, Scott L.

Aug 7, 2008, 1:49:57 PM
to Discussion of Rocks Clusters
I had this happen to me several times during testing. I think it has to
do with a cached host ID or SSH key. If you run fdisk on the system that
is failing to reinstall, it will then reinstall successfully. It has
something to do with rejecting the host ID of the SSL server that is
distributing the packages. If you look at /tmp/anaconda.log during the
failing installation you will see what I mean: it tries to make a secure
connection to the head node and fails. A new system (or an existing one
with no data on the drive) will start an initial session without
requiring the host keys to match.

Thanks,

Scott

Philip Papadopoulos

Aug 7, 2008, 1:56:47 PM
to Discussion of Rocks Clusters
On Thu, Aug 7, 2008 at 10:37 AM, Jon Forrest <jlfo...@berkeley.edu> wrote:

> Juan M Vanegas wrote:
>
>> Hi,
>>
>> We have recently setup a local Rocks 5.0 cluster and have been having
>> problems reinstalling the nodes. After a power failure that brought down
>> several of the nodes, all of the nodes but one was able to kickstart
>> correctly from the frontend. The one node would stop at the "select
>> language" screen.
>>
>
> I've seen this happen when 'insert-ethers' dies. In my case it
> died because I didn't run 'rocks sync config' after removing
> a node from the database (using rocks remove host).

That's a bug in insert-ethers and has been fixed in CVS for the next
release. The problem happens because of an inconsistency between the DB
and the config files on disk. The fix for now is to run "rocks sync
config" before you run insert-ethers -- OR to run "rocks sync config"
after you add/remove/modify nodes using rocks commands.
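[The workaround Phil describes can be sketched as a short shell session.
The command names come from the thread itself; the node name is
illustrative.]

```shell
# Keep the on-disk config files consistent with the cluster database.
# Run "rocks sync config" after any rocks command that changes hosts,
# and before starting insert-ethers.
rocks remove host compute-0-3   # example node; name is illustrative
rocks sync config               # regenerate config files from the DB
insert-ethers                   # now safe to discover/replace nodes
```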

-P


>
> I don't fully understand this either.
>
> Cordially,


--
Philip Papadopoulos, PhD
University of California, San Diego
858-822-3628

Juan M Vanegas

Aug 7, 2008, 4:08:52 PM
to Discussion of Rocks Clusters
Thanks for the prompt response. Sure enough, the pxeboot action was set
to "os" rather than "install" for all the compute nodes. I changed them
all to "install" and now the nodes kickstart correctly. However, after
they reinstall, the pxeboot action gets changed back to "os". Is this
normal? I haven't modified the configuration of the compute nodes, so
they should reinstall after they go down. Thanks,

Juan
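[For the archive: the change Juan describes can be made from the
frontend roughly as below. The exact "set" subcommand name has varied
between Rocks releases, so treat this as a hedged sketch rather than a
verbatim recipe.]

```shell
# Check the current PXE action for every host.
rocks list host pxeboot

# Tell the compute nodes to kickstart on their next PXE boot.
# (On some Rocks releases the subcommand is "rocks set host boot".)
rocks set host pxeboot compute action=install

# Then reboot or power-cycle the nodes so they PXE boot and reinstall.
```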

Lloyd Brown

Aug 7, 2008, 4:38:08 PM
to Discussion of Rocks Clusters
Juan M Vanegas wrote:
> Thanks for the prompt response. Sure enough, the pxeboot action was
> set to "os" and not "install" for all the compute nodes. I changed
> them all to 'install' and now the nodes kickstart correctly. However,
> after they reinstall the pxeboot action gets changed back to 'os'. Is
> this normal? I haven't modified the configuration of the compute
> nodes, so they should reinstall after they go down. Thanks,
>
> Juan
>
>


As far as I know, this is normal behavior. After installing, the system
assumes you want to boot from the local hard drive. However, the PXE
boot action isn't the only way the nodes can install. All the "os"
action does is tell the node to boot from its local hard drive. If the
node was shut down uncleanly, that local hard drive will still have the
appropriate kernel and GRUB configuration in place to start the install
process. However, if the hard drive's partition table, etc., is messed
up and the disk is no longer bootable, then you'd need to do a purely
PXE install to get the node up again. It really depends on the damage
that the outage caused.
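[A related note for the archive: when a node's disk is still bootable,
Rocks can also trigger a reinstall without touching the PXE action, by
rewriting the node's GRUB configuration. The script path below is from
memory of Rocks 5 and may differ on other releases; treat it as an
assumption to verify on your own cluster.]

```shell
# On a running compute node: force a kickstart reinstall on next boot.
# This rewrites the GRUB config to boot the install kernel and reboots.
/boot/kickstart/cluster-kickstart

# From the frontend, the same thing for a single node over ssh:
ssh compute-0-0 /boot/kickstart/cluster-kickstart
```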

--


Lloyd Brown
Systems Administrator
Fulton Supercomputing Lab
Brigham Young University
http://marylou.byu.edu

