Main challenges I see are:
The new server will have the same IP as the previous head node, so
only one can be connected to the external network at a time.
The new server will need all the data from the old /export moved across,
preferably keeping file permissions and ownership. I also intend to use
XFS for the new /export file system, while the old server uses ext3. If they
both have the same IP address, are there any suggestions for the best way to
copy across the network? I was thinking of temporarily creating a secondary
IP on both servers in the 192.168 private range so they can communicate over that.
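For what it's worth, a secondary address like that can be added without touching the main configuration. A minimal sketch, assuming eth0 as the private interface and an unused 192.168 subnet (run as root on each machine; addresses and interface name are assumptions):

```shell
# On the old head node (assumed interface eth0, unused subnet 192.168.100.0/24):
ip addr add 192.168.100.1/24 dev eth0

# On the new head node:
ip addr add 192.168.100.2/24 dev eth0

# Verify they can reach each other before starting the copy:
ping -c 3 192.168.100.1
```

The addresses go away on reboot (or with `ip addr del`), so nothing permanent needs to be undone afterwards.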
I plan to create a Restore Roll on the old server to transfer the cluster
settings across to the new server, but I am unsure whether this will have
any problems with MAC address changes.
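For reference, the Restore Roll is built on the frontend itself; the path below is the Rocks 5.x layout, so double-check it against your version's documentation:

```shell
# Build the Restore Roll on the old frontend (Rocks 5.x directory layout assumed):
cd /export/site-roll/rocks/src/roll/restore
make roll

# The resulting ISO (named after your cluster) lands in the same directory;
# burn it to DVD or feed it to the new frontend at install time.
ls *.iso
```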
Any other suggestions, or problems that may crop up?
Thanks,
James
James Rudd
http://jrudd.org/
---------------------
HPC Cluster Administrator
Centre of Excellence for Silicon Photovoltaics and Photonics
University of New South Wales
Sydney NSW 2052
AUSTRALIA
Step 1. Create a restore roll and burn to DVD.
Step 2. Reconfigure your old frontend's private IP Address to 10.1.1.2 (or
something else non-conflicting)
Step 3. Turn off dhcpd on your old frontend. Turn off gmond/gmetad.
Step 4. Disconnect the public network (remove cable from old frontend)
Step 5. Build your new frontend with the restore roll
Step 6. Copy via the local private network from 10.1.1.2 (old frontend) to
10.1.1.1 (new frontend). Something like:
ssh 10.1.1.2 'cd /export/home; tar cf - *' | (cd /export/home; tar xvfBp -)
Step 7. Rebuild your nodes.
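The tar-pipe in step 6 is what preserves permissions and ownership. Here is a minimal local demonstration of the same idiom, with temporary directories standing in for the two frontends and the ssh leg omitted:

```shell
# Stand-ins for the old and new /export/home:
src=$(mktemp -d); dst=$(mktemp -d)
echo "hello" > "$src/file.txt"
chmod 640 "$src/file.txt"

# Same pipe shape as step 6: a writing tar on one side,
# an extracting tar (-p keeps modes) on the other.
(cd "$src" && tar cf - .) | (cd "$dst" && tar xpf -)

# The copy keeps its permissions:
stat -c '%a' "$dst/file.txt"    # prints 640
```

Run the real thing as root (and with tar's `p` on the extract side, as in step 6) so ownership is preserved as well, not just modes.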
-P
--
Philip Papadopoulos, PhD
University of California, San Diego
858-822-3628 (Ofc)
619-331-2990 (Fax)
A side note: I was planning on using the CentOS 5.7 DVD instead of the OS
roll. Has anyone experienced any problems using CentOS 5.7 with Rocks
5.4.3?
Mailing list archives show 5.6 appeared to work OK, so I'm hoping
there are no new problems with 5.7.
Thanks,
James
My latest cluster is 5.4.3. I started with 5.6 and then patched it to
whatever is the most current patch set from Red Hat.
I have one problem, but I'm not sure if it is a 5.7 issue or a scalability
issue. When I add more nodes, I end up with a number of "rocks" zombie
processes (one for each node I add).
[root@pic-admin01 install]# rocks list host | wc
1291 9045 100698
So there are over 1200 entries (half of these are IPMI or PDU type
entries) in the database and at some point along the way, nodes didn't
install right off the bat because of a lag in updating the dhcpd process.
I haven't bothered to debug what is going on yet.
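A quick way to see whether those zombies are still accumulating (portable ps options; prints nothing when there are none):

```shell
# List defunct (zombie) processes grouped by command name.
# On an affected frontend this would show one "rocks" entry per added node.
ps -eo stat,comm | awk '$1 ~ /^Z/ {print $2}' | sort | uniq -c
```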
Otherwise no issues, but of course YMMV.
Tim
If it's working out of the box, leave it alone! Don't fix what ain't
broken :-)
I of course also could be completely and utterly wrong and this is no
longer the case with Rocks, but this is what has been passed down as
sound advice :)
Let's see, how can I put this subtly. Oh yeah.. That is pretty much
complete BS. :) I've been patching dozens of Rocks clusters since 2.x and
while I have had issues here and there, I have never really "broken" a
cluster.
Tim
--
-------------------------------------------
Tim Carlson, PhD
Senior Research Scientist
Environmental Molecular Sciences Laboratory
> On Thu, 13 Oct 2011, Russell Jones wrote:
>
> Let's see, how can I put this subtly. Oh yeah.. That is pretty much
> complete BS. :) I've been patching dozens of Rocks clusters since 2.x and
> while I have had issues here and there, I have never really "broken" a
> cluster.
>
I would add that Tim also has a reasonable "skip list" of things that he
does NOT update.
And, unless I've misunderstood, he points only to the appropriate
RHEL/CentOS repos for the OS, no external repos that can really cause havoc.
Each release of Rocks gets better in terms of isolation from dependence on
particular system versions, so updating most OS-supplied packages is just
fine in practice.
Most damage (unrecoverable) comes about when people do something like "I
enabled the EPEL repo and now everything is broken". Yep, broken alright.
-P
--
Philip Papadopoulos, PhD
University of California, San Diego
858-822-3628 (Ofc)
619-331-2990 (Fax)
At least I can attest that not upgrading it via yum has prevented me
from even having issues here and there :)
Thanks,
Joe
--
Doing beautiful things is its own reward. - Teller of "Penn and Teller"
Two issues I found after the reinstall:

The /etc/profile.d/ssh-key.sh script appears to have a bug in it
related to checking the shell level ($SHLVL). If you are running locally it
is fine, but if you are using SSH it quits any subshells you start.
E.g. from an SSH session, type "bash" and it immediately exits. The more
common occurrence I found was that as soon as I started 'screen' from
within a bash window, it would exit with a [screen is terminating]
message.
I commented out the "exit $?" line to fix it, but there should be a
better way to solve it.
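I don't have the stock script in front of me, but one alternative to deleting the exit line outright would be to guard the setup so it only runs in the top-level login shell; subshells and screen then pass through untouched. A runnable sketch of the idea (the file name and guard are mine, not the script's actual contents):

```shell
# Write a stand-in profile.d snippet that only acts in the top-level shell.
cat > /tmp/ssh-key-guard.sh <<'EOF'
if [ "${SHLVL:-1}" -le 1 ]; then
    # ... the ssh-keygen first-login setup would go here ...
    :
fi
EOF

# Sourcing it from a subshell no longer kills the shell:
bash -c '. /tmp/ssh-key-guard.sh; echo subshell-survived'   # prints subshell-survived
```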
I had a problem with DHCP for the power and remote management appliances. I
found that although they were registered in the DB and appear in rocks list
host, none of my power or management hosts were added to dhcpd.conf or the
named files. I had to go through with an "insert-ethers --replace
manager-0-?" for each server before they were added to dhcpd.conf.
These existed on the previous head node and came across with the DB, but it
seems that they don't get added to DHCP until they are re-inserted by
insert-ethers. This may be related to the appliance attribute managed =
false on both appliance types, but I'm not sure.
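If the managed attribute really is the culprit, something along these lines might avoid the per-host --replace dance. This is a sketch from memory of the Rocks 5.x CLI; verify the exact command names and appliance names on a test box before trusting it:

```shell
# Inspect the attributes on the two appliance types in question:
rocks list appliance attr power
rocks list appliance attr manager

# If managed is false, try flipping it and regenerating dhcpd.conf/named
# through the normal path:
rocks set appliance attr power managed true
rocks set appliance attr manager managed true
rocks sync config
```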
Thanks,
James
On Fri, Oct 14, 2011 at 10:02 AM, Philip Papadopoulos
<philip.pa...@gmail.com> wrote: