[Rocks-Discuss] Upgrading the compute node kernel on a Rocks 4.0 system


James Gladden

Jun 13, 2008, 12:52:03 AM
to npaci-rocks...@sdsc.edu
I recently set about the task of adding some Dell PE1950 nodes to an
existing cluster running Rocks 4.0. These servers require LSI Logic
Fusion SAS drivers for the SAS hard drives and Broadcom NetXtreme II
drivers for the NICs, neither of which was part of the 2.6.9-11 kernel
package used in this distro of Rocks.

I have managed to upgrade the initrd.img file used by the kickstart
process to include these drivers, so I am now able to successfully
complete the kickstart process. The documented method of rebuilding the
rocks-boot package against a newer kernel didn't quite work (the
resulting initrd.img did not contain all the needed kernel objects), but
I managed to overcome that obstacle by manually inserting the needed
driver kernel objects and updating the related pci tables in initrd.img.

However, I am still having difficulty getting Rocks and Kickstart to
build a system on the PE1950 that works correctly. I started by simply
adding the source code for the needed Fusion SAS drivers to the
distribution at /usr/src/linux-2.6.9-11.EL on my head node, adding the
necessary lines to the Makefile and Kconfig files, and then rebuilding
with "make rpm". This was successful. Once I moved the new kernel rpm
to /home/install/contrib.../ and did the "rocks-dist dist", the PE1950
would kickstart and successfully boot the resulting OS. However, since
it still lacked an ethernet driver, it was not a very useful compute node.

I was less successful adding the bnx2 ethernet driver source to the
2.6.9-11 source. I'll skip the details, but I hacked around the
problem by adding code to the extend-compute.xml <post> section that
brute-force copied a pre-built bnx2.ko to the appropriate place in the
/lib/modules tree on the compute node. This kludge resulted in a
functional node that boots up and talks to the head node, etc. However,
it still has various minor problems, such as hanging during shutdown
because of a missing acpi module.

So at this point I thought I should simply try building a current version
of the 2.6.9 kernel (one that already has the needed drivers), and then
installing it in /home/install/contrib.../ so that it would be the
kernel installed by Kickstart on the compute nodes. This has not worked
as expected, and at this point I am somewhat mystified as to how Rocks
and Kickstart together determine what kernel really gets installed on a
node. Here is what I observe:

The Rocks 4.0 distro seems to have come with two kernel rpms (one
smp, one not). If I look at what gets linked into
/home/install/rocks-dist/lan/x86_64/RedHat/RPMS this is what I see:

[root@clustertest RPMS]# ls -l kernel*

lrwxrwxrwx 1 root root 132 Jun 12 16:04 *kernel-2.6.9-11.EL.x86_64.rpm*
->
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-4.0.0/rocks-dist/rolls/os/4.0.0/x86_64/RedHat/RPMS/kernel-2.6.9-11.EL.x86_64.rpm

lrwxrwxrwx 1 root root 138 Jun 12 16:04
kernel-devel-2.6.9-11.EL.x86_64.rpm ->
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-4.0.0/rocks-dist/rolls/os/4.0.0/x86_64/RedHat/RPMS/kernel-devel-2.6.9-11.EL.x86_64.rpm

lrwxrwxrwx 1 root root 136 Jun 12 16:04
kernel-doc-2.6.9-11.EL.noarch.rpm ->
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-4.0.0/rocks-dist/rolls/os/4.0.0/x86_64/RedHat/RPMS/kernel-doc-2.6.9-11.EL.noarch.rpm

lrwxrwxrwx 1 root root 136 Jun 12 16:04
*kernel-smp-2.6.9-11.EL.x86_64.rpm* ->
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-4.0.0/rocks-dist/rolls/os/4.0.0/x86_64/RedHat/RPMS/kernel-smp-2.6.9-11.EL.x86_64.rpm

lrwxrwxrwx 1 root root 142 Jun 12 16:04
kernel-smp-devel-2.6.9-11.EL.x86_64.rpm ->
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-4.0.0/rocks-dist/rolls/os/4.0.0/x86_64/RedHat/RPMS/kernel-smp-devel-2.6.9-11.EL.x86_64.rpm

lrwxrwxrwx 1 root root 143 Jun 12 16:04
kernel-sourcecode-2.6.9-11.EL.noarch.rpm ->
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-4.0.0/rocks-dist/rolls/os/4.0.0/x86_64/RedHat/RPMS/kernel-sourcecode-2.6.9-11.EL.noarch.rpm

lrwxrwxrwx 1 root root 138 Jun 12 16:04
kernel-utils-2.4-13.1.66.x86_64.rpm ->
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-4.0.0/rocks-dist/rolls/os/4.0.0/x86_64/RedHat/RPMS/kernel-utils-2.4-13.1.66.x86_64.rpm

[root@clustertest RPMS]#

Examining the contents of the
/home/install/rocks-dist/lan/x86_64/RedHat/base/hdlist file confirms
that these two kernel rpm's are included in the rpm database that
rocks-dist prepares for Kickstart:

[root@clustertest base]# /root/rdhdlist.py hdlist | grep kernel

0:kernel-*2.6.9-11.EL.x86_64* 1

0:kernel-devel-2.6.9-11.EL.x86_64 1

0:kernel-doc-2.6.9-11.EL.noarch 1

0:kernel-*smp-2.6.9-11.EL.x86_64* 1

0:kernel-smp-devel-2.6.9-11.EL.x86_64 1

0:kernel-sourcecode-2.6.9-11.EL.noarch 1

1:kernel-utils-2.4-13.1.66.x86_64 1

0:roll-kernel-kickstart-4.0.0-0.noarch 1

[root@clustertest base]#

When I kickstart one of my old compute nodes from this configuration,
the contents of the /boot directory on the resulting compute node look
like this.

[root@compute-0-1 boot]# ls -l

-rw-r--r-- 1 root root 41829 May 20 2005 config-2.6.9-11.EL

-rw-r--r-- 1 root root 41266 May 20 2005 config-2.6.9-11.ELsmp

drwxr-xr-x 2 root root 4096 Jun 12 16:19 grub

-rw-r--r-- 1 root root 956201 Jun 12 16:16 *initrd-2.6.9-11.EL.img*

-rw-r--r-- 1 root root 944430 Jun 12 16:16 *initrd-2.6.9-11.ELsmp.img*

drwxr-xr-x 3 root root 4096 Jun 12 16:14 kickstart

-r--r--r-- 1 root root 6 Jun 12 16:17 message

-rw-r--r-- 1 root root 21282 Dec 2 2004 message.ja

drwxr-xr-x 2 root root 4096 Jun 12 16:17 RCS

-rw-r--r-- 1 root root 851856 May 20 2005 System.map-2.6.9-11.EL

-rw-r--r-- 1 root root 868068 May 20 2005 System.map-2.6.9-11.ELsmp

-rw-r--r-- 1 root root 1731599 May 20 2005 *vmlinuz-2.6.9-11.EL*

-rw-r--r-- 1 root root 1609863 May 20 2005 *vmlinuz-2.6.9-11.ELsmp*

[root@compute-0-1 boot]#

Note that "initrd" and "vmlinuz" files are included for
both kernels. However, if I add my modified 2.6.9-11 kernel to the
/home/install/contrib.../ directory the behavior changes. The resulting
..../lan.../RPMS directory links look like this:

[root@clustertest RPMS]# ls -l kernel*

lrwxrwxrwx 1 root root 73 Jun 12 16:27
*kernel-2.6.911.ELsmp-1.x86_64.rpm* ->
/home/install/contrib/4.0.0/x86_64/RPMS/kernel-2.6.911.ELsmp-1.x86_64.rpm

lrwxrwxrwx 1 root root 138 Jun 12 16:27
kernel-devel-2.6.9-11.EL.x86_64.rpm ->
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-4.0.0/rocks-dist/rolls/os/4.0.0/x86_64/RedHat/RPMS/kernel-devel-2.6.9-11.EL.x86_64.rpm

lrwxrwxrwx 1 root root 136 Jun 12 16:27
kernel-doc-2.6.9-11.EL.noarch.rpm ->
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-4.0.0/rocks-dist/rolls/os/4.0.0/x86_64/RedHat/RPMS/kernel-doc-2.6.9-11.EL.noarch.rpm

lrwxrwxrwx 1 root root 136 Jun 12 16:27
*kernel-smp-2.6.9-11.EL.x86_64.rpm* ->
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-4.0.0/rocks-dist/rolls/os/4.0.0/x86_64/RedHat/RPMS/kernel-smp-2.6.9-11.EL.x86_64.rpm

lrwxrwxrwx 1 root root 142 Jun 12 16:27
kernel-smp-devel-2.6.9-11.EL.x86_64.rpm ->
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-4.0.0/rocks-dist/rolls/os/4.0.0/x86_64/RedHat/RPMS/kernel-smp-devel-2.6.9-11.EL.x86_64.rpm

lrwxrwxrwx 1 root root 143 Jun 12 16:27
kernel-sourcecode-2.6.9-11.EL.noarch.rpm ->
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-4.0.0/rocks-dist/rolls/os/4.0.0/x86_64/RedHat/RPMS/kernel-sourcecode-2.6.9-11.EL.noarch.rpm

lrwxrwxrwx 1 root root 138 Jun 12 16:27
kernel-utils-2.4-13.1.66.x86_64.rpm ->
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-4.0.0/rocks-dist/rolls/os/4.0.0/x86_64/RedHat/RPMS/kernel-utils-2.4-13.1.66.x86_64.rpm

[root@clustertest RPMS]#

Notice that rocks-dist has included my modified kernel in the links
(kernel-2.6.911.ELsmp-1.x86_64.rpm) but has decided that it no longer
needs to include the non-smp kernel for the original OS roll. The
contents of ..../base/hdlist show the same thing. Why it does this is a
little mysterious to me - perhaps it just goes with the "best" two
kernel rpm's it finds? Kickstarting a compute node with this setup
results in a compute node with a /boot directory that looks like this:

-rw-r--r-- 1 root root 41266 May 20 2005 config-2.6.9-11.ELsmp

drwxr-xr-x 2 root root 4096 Jun 12 16:40 grub

-rw-r--r-- 1 root root 942707 Jun 12 16:37 *initrd-2.6.9-11.ELsmp.img*

drwxr-xr-x 3 root root 4096 Jun 12 16:35 kickstart

-r--r--r-- 1 root root 6 Jun 12 16:38 message

-rw-r--r-- 1 root root 21282 Dec 2 2004 message.ja

drwxr-xr-x 2 root root 4096 Jun 12 16:38 RCS

-rw-r--r-- 1 root root 868068 May 20 2005 System.map-2.6.9-11.ELsmp

-rw-r--r-- 1 root root 1609863 May 20 2005 *vmlinuz-2.6.9-11.ELsmp*

Now there is just one "vmlinuz" and one initrd. Judging from the names
that lack the "-1" suffix (my modified kernel rpm was named
kernel-2.6.911.ELsmp-1.x86_64.rpm) I would conclude that Kickstart chose
not to use my modified kernel. However, this actually does boot on a
PE1950 and the resulting /lib/modules tree really does include the
Fusion SAS kernel modules that I added to the kernel source. So I'm not
sure what Kickstart actually did here.
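
One way to read the listings above is to split each RPM filename into
its name, version, release and arch fields, following the usual
name-version-release.arch.rpm convention. The sketch below does only
that; the function names are made up for illustration, and it does not
reproduce whatever selection logic rocks-dist actually applies.

```python
from collections import defaultdict


def split_rpm_filename(filename):
    """Split 'name-version-release.arch.rpm' into its four fields."""
    if not filename.endswith(".rpm"):
        raise ValueError("not an RPM filename: %r" % filename)
    stem = filename[:-len(".rpm")]
    # The arch is everything after the last dot of the stem.
    nvr, arch = stem.rsplit(".", 1)
    # Version and release are the last two dash-separated fields;
    # the package name itself may contain dashes (e.g. kernel-smp).
    name, version, release = nvr.rsplit("-", 2)
    return name, version, release, arch


def group_by_name(filenames):
    """Map each package name to the (version, release, filename) builds found."""
    groups = defaultdict(list)
    for filename in filenames:
        name, version, release, _arch = split_rpm_filename(filename)
        groups[name].append((version, release, filename))
    return dict(groups)


# Filenames taken from the listings in this post.
candidates = [
    "kernel-2.6.9-11.EL.x86_64.rpm",       # stock uniprocessor kernel (OS roll)
    "kernel-smp-2.6.9-11.EL.x86_64.rpm",   # stock SMP kernel (OS roll)
    "kernel-2.6.911.ELsmp-1.x86_64.rpm",   # modified kernel placed in contrib
]

for name, builds in sorted(group_by_name(candidates).items()):
    print(name, ["%s-%s" % (version, release) for version, release, _ in builds])
```

Read this way, the modified rpm carries the package name "kernel", the
same name as the stock uniprocessor kernel, while "kernel-smp" remains a
separate package. If rocks-dist keeps only one build per package name
(an assumption, not something confirmed in this thread), that would
account for the stock uniprocessor kernel dropping out of the links.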

Finally, I built the 2.6.9-67 distribution from source, created the file
kernel-2.6.967.ELsmp1-4.x86_64.rpm, and moved it to the
/home/install/contrib.../ folder. The resulting .../lan/...RPMS
directory shows these links:

[root@clustertest RPMS]# ls -l kernel*

lrwxrwxrwx 1 root root 74 Jun 12 16:54
*kernel-2.6.967.ELsmp1-4.x86_64.rpm* ->
/home/install/contrib/4.0.0/x86_64/RPMS/kernel-2.6.967.ELsmp1-4.x86_64.rpm

lrwxrwxrwx 1 root root 138 Jun 12 16:54
kernel-devel-2.6.9-11.EL.x86_64.rpm ->
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-4.0.0/rocks-dist/rolls/os/4.0.0/x86_64/RedHat/RPMS/kernel-devel-2.6.9-11.EL.x86_64.rpm

lrwxrwxrwx 1 root root 136 Jun 12 16:54
kernel-doc-2.6.9-11.EL.noarch.rpm ->
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-4.0.0/rocks-dist/rolls/os/4.0.0/x86_64/RedHat/RPMS/kernel-doc-2.6.9-11.EL.noarch.rpm

lrwxrwxrwx 1 root root 136 Jun 12 16:54
*kernel-smp-2.6.9-11.EL.x86_64.rpm* ->
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-4.0.0/rocks-dist/rolls/os/4.0.0/x86_64/RedHat/RPMS/kernel-smp-2.6.9-11.EL.x86_64.rpm

lrwxrwxrwx 1 root root 142 Jun 12 16:54
kernel-smp-devel-2.6.9-11.EL.x86_64.rpm ->
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-4.0.0/rocks-dist/rolls/os/4.0.0/x86_64/RedHat/RPMS/kernel-smp-devel-2.6.9-11.EL.x86_64.rpm

lrwxrwxrwx 1 root root 143 Jun 12 16:54
kernel-sourcecode-2.6.9-11.EL.noarch.rpm ->
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-4.0.0/rocks-dist/rolls/os/4.0.0/x86_64/RedHat/RPMS/kernel-sourcecode-2.6.9-11.EL.noarch.rpm

lrwxrwxrwx 1 root root 138 Jun 12 16:54
kernel-utils-2.4-13.1.66.x86_64.rpm ->
/home/install/ftp.rocksclusters.org/pub/rocks/rocks-4.0.0/rocks-dist/rolls/os/4.0.0/x86_64/RedHat/RPMS/kernel-utils-2.4-13.1.66.x86_64.rpm

[root@clustertest RPMS]#

rocks-dist once again chose to include two kernels, the first of them
being my newly built 2.6.9-67 rpm. However, this setup will not boot on
a PE1950. Booting on an older compute node results in a /boot directory
that looks like this:

[root@compute-0-1 boot]# ls -l

total 5996

-rw-r--r-- 1 root root 41266 May 20 2005 config-2.6.9-11.ELsmp

-rw-r--r-- 1 root root 42881 Jun 11 22:11 config-2.6.967.ELsmp-1

drwxr-xr-x 2 root root 4096 Jun 12 17:07 grub

-rw-r--r-- 1 root root 943549 Jun 12 17:04 *initrd-2.6.9-11.ELsmp.img*

drwxr-xr-x 3 root root 4096 Jun 12 17:02 kickstart

-r--r--r-- 1 root root 6 Jun 12 17:05 message

-rw-r--r-- 1 root root 21282 Dec 2 2004 message.ja

drwxr-xr-x 2 root root 4096 Jun 12 17:05 RCS

-rw-r--r-- 1 root root 868068 May 20 2005 System.map-2.6.9-11.ELsmp

-rw-r--r-- 1 root root 898207 Jun 11 22:11 System.map-2.6.967.ELsmp-1

-rw-r--r-- 1 root root 1609863 May 20 2005 *vmlinuz-2.6.9-11.ELsmp*

-rw-r--r-- 1 root root 1656448 Jun 11 22:11 *vmlinuz-2.6.967.ELsmp-1*

[root@compute-0-1 boot]#

Note that both "vmlinuz" files are included, but only one "initrd" is
generated and it is named to reflect the kernel rpm from the original
rocks distro. However, looking at the /lib/modules path shows that
Kickstart really did load the modules from both kernel rpm's:
[root@compute-0-1 ~]# ls -l /lib/modules
total 12
drwxr-xr-x 3 root root 4096 Jun 12 17:07 2.6.9-11.ELsmp
drwxr-xr-x 3 root root 4096 Jun 12 17:01 2.6.967.ELsmp-1
drwxr-xr-x 2 root root 4096 May 20 2005 kabi-4.0-0smp
[root@compute-0-1 ~]#

It's just that the "initrd" that Kickstart generated does not load the
SAS modules out of the 2.6.967.ELsmp-1 tree, so the boot fails.
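
The mismatch between installed kernels and generated initrd images can
be spotted mechanically. The following is a minimal sketch, assuming
the vmlinuz-<version> and initrd-<version>.img naming seen in the
listing above; the function name is hypothetical and the file list is
copied from that listing.

```python
def kernels_without_initrd(boot_files):
    """Return kernel versions that have a vmlinuz-* file but no initrd-*.img."""
    kernels = set()
    initrds = set()
    for name in boot_files:
        if name.startswith("vmlinuz-"):
            kernels.add(name[len("vmlinuz-"):])
        elif name.startswith("initrd-") and name.endswith(".img"):
            initrds.add(name[len("initrd-"):-len(".img")])
    # Versions with a kernel image but no matching initrd image.
    return sorted(kernels - initrds)


# Contents of /boot on compute-0-1 after the 2.6.9-67 kickstart (from the post).
boot_files = [
    "config-2.6.9-11.ELsmp", "config-2.6.967.ELsmp-1", "grub",
    "initrd-2.6.9-11.ELsmp.img", "kickstart", "message", "message.ja", "RCS",
    "System.map-2.6.9-11.ELsmp", "System.map-2.6.967.ELsmp-1",
    "vmlinuz-2.6.9-11.ELsmp", "vmlinuz-2.6.967.ELsmp-1",
]

print(kernels_without_initrd(boot_files))
```

On a live node the same function could be fed os.listdir("/boot").
For the listing above it reports only 2.6.967.ELsmp-1, which matches
the observation that no initrd was generated for the new kernel.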

So it appears that Kickstart for some reason prefers to generate a boot
configuration based on the original OS roll smp kernel, rather than the
new one I am offering, despite the fact that both rpm's get installed
on the compute node. Any insight into why this is happening would be
greatly appreciated.


Greg Bruno

Jun 13, 2008, 12:41:35 PM
to Discussion of Rocks Clusters
you could try:

- get the latest prebuilt kernel RPMs for CentOS 4:

# cd /home/install/contrib/4.3/x86_64/RPMS
# wget ftp://mirrors.usc.edu/pub/linux/distributions/centos/4/updates/x86_64/RPMS/kernel*2.6.9-67.0.15*rpm

- then rebuild and install the rocks-boot package. the following
procedure should work on your rocks 4.0 system -- be sure to change
the references to 'rocks-5.0', and start at step 8:

http://www.rocksclusters.org/roll-documentation/base/5.0/customization-driver.html

- gb
