Linuxbios / Supermicro problem


Ed

May 26, 2009, 3:06:37 PM
to Linux Networx Users Group
So I inherited a small cluster of Evolocity2 nodes running SLES9 and
clusterworx. Until recently I hadn't had many issues I was forced to
resolve. But a recent hardware failure has got me scratching my head
looking for a solution.

We lost one of the evo2's motherboards. A spare with the custom
config was quickly located and I replaced the board, moved the CPUs
over, etc. Quick hack on the dhcp config and we were back in business.
Until I noticed the new node was only doing about 25% of the work it
had been before. Dual CPU / dual core, so it seems that it is only
seeing one of the four cores. Swapped the CPUs thinking maybe one had
died. Same issue...
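
A quick way to confirm how many cores the kernel actually sees is to count the `processor` entries in `/proc/cpuinfo`. This is only a minimal sketch; the function name is made up for illustration, and the expected count of 4 assumes two dual-core CPUs per node as described above.

```python
def count_processors(cpuinfo_text):
    """Count 'processor : N' entries in /proc/cpuinfo-style text."""
    count = 0
    for line in cpuinfo_text.splitlines():
        key, sep, _value = line.partition(":")
        # Only lines whose key is exactly "processor" start a new logical CPU.
        if sep and key.strip() == "processor":
            count += 1
    return count
```

Feeding it the contents of `/proc/cpuinfo` from each node (for example `count_processors(open("/proc/cpuinfo").read())`) should give 4 on a healthy node; a result of 1 would match the roughly 25% throughput seen here.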

Ok, so the system uses linuxbios, so I went to compare the loaded
software, using lbflash, to the other nodes thinking something was
wonky with HT or something similar. But I don't have a /dev/mtd* on
the backend anywhere. 0.o (Granted my linuxbios knowledge is getting
old, but I thought tyans and supermicros of that age were covered.)
Anyway.... Using cmos_util I was able to enable dual_core, but it
seems to have no effect on the problem.
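
Besides looking for `/dev/mtd*`, it may be worth checking `/proc/mtd`, which the kernel provides when MTD support is built in and lists any registered flash partitions. The sketch below parses that file's text; the function name and the partition name in the example are hypothetical, and it says nothing about what lbflash itself needs.

```python
import re

# A /proc/mtd entry looks like: mtd0: 00100000 00010000 "name"
MTD_LINE = re.compile(r'^(mtd\d+):\s+([0-9a-fA-F]+)\s+([0-9a-fA-F]+)\s+"(.*)"$')


def parse_proc_mtd(text):
    """Return a list of (device, size_bytes, erase_size_bytes, name) tuples."""
    parts = []
    for line in text.splitlines():
        m = MTD_LINE.match(line.strip())
        if m:  # the "dev: size erasesize name" header line does not match
            dev, size, erase, name = m.groups()
            parts.append((dev, int(size, 16), int(erase, 16), name))
    return parts
```

An empty list (or a missing `/proc/mtd`) would be consistent with the running kernel having no MTD driver for the flash part, which fits the absence of `/dev/mtd*` described above.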

So I'm kind of stumped. One thing I see is that the replacement board
has an older LinuxBIOS version, and I have no way to flash it:

affected node : LinuxBIOS Version 1.1.80pre1Normal
rest of nodes : LinuxBIOS Version 1.1.8.1pre4Normal
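
If the version banner is collected from every node, the odd one out can be picked mechanically. A minimal sketch, assuming the banners have already been gathered into a dictionary; the node names in the usage note are hypothetical.

```python
from collections import Counter


def odd_ones_out(versions):
    """Return {node: version} for nodes that differ from the most common version."""
    if not versions:
        return {}
    # The most frequent version string is treated as the cluster baseline.
    majority, _count = Counter(versions.values()).most_common(1)[0]
    return {node: v for node, v in versions.items() if v != majority}
```

For example, `odd_ones_out({"n1": "1.1.8.1pre4Normal", "n2": "1.1.8.1pre4Normal", "n3": "1.1.80pre1Normal"})` flags only `n3`.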



And during the boot phase I get this message :

Initializing devices...
Root Device init
APIC_CLUSTER: 0 init
Initializing CPU #0
CPU: vendor AMD device 20f12
Enabling cache

Setting fixed MTRRs(0-88) type: UC
Setting fixed MTRRs(0-16) Type: WB, RdMEM, WrMEM
Setting fixed MTRRs(24-88) Type: WB, RdMEM, WrMEM
DONE fixed MTRRs
Setting variable MTRR 0, base: 0MB, range: 2048MB, type WB
Setting variable MTRR 1, base: 2048MB, range: 1024MB, type WB
Setting variable MTRR 2, base: 3072MB, range: 512MB, type WB
Setting variable MTRR 3, base: 3584MB, range: 256MB, type WB
Setting variable MTRR 4, base: 3840MB, range: 128MB, type WB
Setting variable MTRR 5, base: 3968MB, range: 64MB, type WB
DONE variable MTRRs
Clear out the extra MTRR's

MTRR check
Fixed MTRRs : Enabled
Variable MTRRs: Enabled

Setting up local apic... apic_id: 0 done.
Clearing memory 0K - 2097152K: ------------------------------- done
CPU #0 Initialized
APIC delivery error (4).
CPU 1 would not start!
CPU 1 did not initialize!
All AP CPUs stopped
PCI: 00:18.0 init
PCI: 01:01.0 init
set power on after power fail
RTC Init
PNP: 002e.0 init
PNP: 002e.2 init
PNP: 002e.5 init
PNP: 002e.b init
PCI: 01:02.1 init
PCI: 01:06.0 init
IDE1 IDE0
PCI: 01:07.0 init
SATA S SATA P
PCI: 01:08.0 init
SATA S SATA P
PCI: 01:09.0 init
dev_root mem base = 0x00fe000000
[0x50] <-- 0xfe000000
PCI: 01:0a.0 init
PCI: 01:0d.0 init
PCI: 01:0e.0 init
PCI: 05:01.0 init
PCI: 05:07.0 init
PCI: 05:08.0 init
PCI: 05:0a.0 init
PCI: 05:0d.0 init
PCI: 05:0e.0 init
PCI: 00:18.1 init
PCI: 00:18.2 init
PCI: 00:18.3 init
NB: Function 3 Misc Control.. done.
PCI: 00:19.0 init
PCI: 00:19.1 init
PCI: 00:19.2 init
PCI: 00:19.3 init
NB: Function 3 Misc Control.. done.
PCI: 04:00.0 init
Devices initialized


It appears rather different from all other nodes (bios revision?) :

Initializing devices...
Root Device init
APIC_CLUSTER: 0 init
Initializing CPU #0
CPU: vendor AMD device 20f12
Enabling cache


***Not doing amd_setup_mtrrs again on cpu 0***



MTRR check
Fixed MTRRs : Enabled
Variable MTRRs: Enabled

Setting up local apic... apic_id: 16 done.
X0: core: 0 node: 0
X1: siblings: 1
X2: core: 0 node: 0
core: 0 node: 0
Clearing memory 1024K - 2097152K: ------------------------------- done
1 Sibling Cores found
CPU: APIC: 20 enabled
CPU #0 Initialized
Initializing CPU #1
CPU: vendor AMD device 20f12
Enabling cache

Setting fixed MTRRs(0-88) type: UC
Setting fixed MTRRs(0-16) Type: WB, RdMEM, WrMEM
Setting fixed MTRRs(24-88) Type: WB, RdMEM, WrMEM
DONE fixed MTRRs
Setting variable MTRR 0, base: 0MB, range: 4096MB, type WB
Setting variable MTRR 1, base: 4096MB, range: 256MB, type WB
Setting variable MTRR 2, base: 3840MB, range: 256MB, type UC
DONE variable MTRRs
Clear out the extra MTRR's

MTRR check
Fixed MTRRs : Enabled
Variable MTRRs: Enabled

Setting up local apic... apic_id: 17 done.
X0: core: 0 node: 2
X1: siblings: 1
X2: core: 0 node: 1
core: 0 node: 1
Clearing memory 2097152K - 4456448K: --------------------------------++
++ done
1 Sibling Cores found
CPU: APIC: 21 enabled
CPU #1 Initialized
Initializing CPU #2
CPU: vendor AMD device 20f12
Enabling cache

Setting fixed MTRRs(0-88) type: UC
Setting fixed MTRRs(0-16) Type: WB, RdMEM, WrMEM
Setting fixed MTRRs(24-88) Type: WB, RdMEM, WrMEM
DONE fixed MTRRs
Setting variable MTRR 0, base: 0MB, range: 4096MB, type WB
Setting variable MTRR 1, base: 4096MB, range: 256MB, type WB
Setting variable MTRR 2, base: 3840MB, range: 256MB, type UC
DONE variable MTRRs
Clear out the extra MTRR's

MTRR check
Fixed MTRRs : Enabled
Variable MTRRs: Enabled

Setting up local apic... apic_id: 32 done.
X0: core: 0 node: 1
X1: siblings: 1
X2: core: 1 node: 0
core: 1 node: 0
1 Sibling Cores found
CPU #2 Initialized
Initializing CPU #3
Waiting for 1 CPUS to stop
CPU: vendor AMD device 20f12
Enabling cache

Setting fixed MTRRs(0-88) type: UC
Setting fixed MTRRs(0-16) Type: WB, RdMEM, WrMEM
Setting fixed MTRRs(24-88) Type: WB, RdMEM, WrMEM
DONE fixed MTRRs
Setting variable MTRR 0, base: 0MB, range: 4096MB, type WB
Setting variable MTRR 1, base: 4096MB, range: 256MB, type WB
Setting variable MTRR 2, base: 3840MB, range: 256MB, type UC
DONE variable MTRRs
Clear out the extra MTRR's

MTRR check
Fixed MTRRs : Enabled
Variable MTRRs: Enabled

Setting up local apic... apic_id: 33 done.
X0: core: 0 node: 3
X1: siblings: 1
X2: core: 1 node: 1
core: 1 node: 1
1 Sibling Cores found
CPU #3 Initialized
All AP CPUs stopped
PCI: 00:18.0 init
PCI: 01:01.0 init
set power on after power fail
RTC Init
PNP: 002e.0 init
PNP: 002e.2 init
PNP: 002e.5 init
PNP: 002e.8 init
PNP: 002e.9 init
PNP: 002e.b init
base = 0x0295, reg = 0x40, value = 0x81
base = 0x0295, reg = 0x48, value = 0x2a
base = 0x0295, reg = 0x4a, value = 0x21
base = 0x0295, reg = 0x4e, value = 0x80
base = 0x0295, reg = 0x43, value = 0xff
base = 0x0295, reg = 0x44, value = 0x3f
base = 0x0295, reg = 0x4c, value = 0x18
base = 0x0295, reg = 0x4d, value = 0x95
PCI: 01:02.1 init
PCI: 01:06.0 init
IDE1 IDE0
PCI: 01:07.0 init
SATA S SATA P
PCI: 01:08.0 init
SATA S SATA P
PCI: 01:09.0 init
dev_root mem base = 0x00fe000000
[0x50] <-- 0xfe000000
PCI: 01:0a.0 init
PCI: 01:0d.0 init
PCI: 01:0e.0 init
PCI: 05:01.0 init
PCI: 05:07.0 init
SATA device found
Bad SATA status
PCI: 05:08.0 init
SATA device found
Bad SATA status
PCI: 05:0a.0 init
PCI: 05:0d.0 init
PCI: 05:0e.0 init
PCI: 00:18.1 init
PCI: 00:18.2 init
PCI: 00:18.3 init
NB: Function 3 Misc Control.. done.
PCI: 00:19.0 init
PCI: 00:19.1 init
PCI: 00:19.2 init
PCI: 00:19.3 init
NB: Function 3 Misc Control.. done.
PCI: 04:00.0 init
Devices initialized
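
The CPU-related difference between the two logs can be pulled out mechanically. This is a rough sketch that assumes the console output was saved as text and that the messages keep the exact wording shown above; the function name is made up for illustration.

```python
import re


def cpu_summary(log_text):
    """Return (initialized, failed) sets of CPU numbers from a LinuxBIOS boot log."""
    initialized = set()
    failed = set()
    for line in log_text.splitlines():
        line = line.strip()
        m = re.match(r"CPU #(\d+) Initialized$", line)
        if m:
            initialized.add(int(m.group(1)))
            continue
        # Both failure messages name the CPU without the '#'.
        m = re.match(r"CPU (\d+) (?:would not start|did not initialize)!$", line)
        if m:
            failed.add(int(m.group(1)))
    return initialized, failed
```

On the affected node's log this yields CPU 0 as initialized and CPU 1 as failed, with CPUs 2 and 3 never mentioned at all; on the other nodes all four CPUs show up as initialized.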

Any thoughts on how to resolve this? Even if CPU1 is really dead, why
can't I get my other core to come up? Is there a repository of
BIOSes anywhere (maybe sgi?) now that the old ftp site is down?

Ed

Joshua McDowell

May 27, 2009, 11:19:35 AM
to lnx...@googlegroups.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ed,

Can you be a little more specific?
What is a "custom config"? Are we talking about LinuxBIOS settings? If
so, are you able to post your config for both nodes?
You said that your versions were 1.1.80pre1 on the affected node and
1.1.8.1pre4 on the normally performing nodes. Are the motherboards the
same? Is there a valid reason they are running different versions of
LinuxBIOS? If you are able to and have a PLCC programmer, I would dump
and reprogram the affected node and see if that fixes the problem. Again,
this is assuming that the motherboards are the same.
The more information that is supplied, the easier it is to nail down a
problem. It is certainly possible, based on the symptoms you are
experiencing, that you either have a bad motherboard or an incorrect
version of LinuxBIOS on that board.

Joshua McDowell
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkodWgQACgkQDiqOyViXQA6QywCfUkkQZAYBU2tmTaeL5OrfMGd2
BloAnjKyNaJEgUqzeRea/s6YMZBm5Ctt
=z602
-----END PGP SIGNATURE-----

Ed

May 27, 2009, 2:09:11 PM
to Linux Networx Users Group
Sorry, let me be a touch more specific. The vendors I spoke with
(Supermicro and SGI) label the motherboard as custom in some way
('h8dce custom lnxi' or 'H8DCE Modified SM Dual Opt'; the sticker
on it said CUSTOM and another was labeled P4DPE-LNXI or something
similar). Basically the 'standard' H8DCE had slightly different ports
plus parallel; the evo2's didn't, I guess?

Is the plcc removable? I can't recall seeing one I could pop and
program. Have to pull that node and look. Wondering if I could swap
it into another working node to move the issue and narrow it down.
With spares being this hard to come by, I'd REALLY like to narrow down
why I have one core instead of 4.

Is there a valid reason? No not really. Just happened to come
flashed with that version and I wasn't informed enough when I put
power to it to realise there could be a problem. I mean to figure
out how to download and re-flash the one I have. I don't have access
to a plcc burner (too bad I can't use the one I use for pinball
machine roms, maybe I can make one) but a bit more digging tells me I
ought to be able to get the newer version from all the other nodes
saved and re-burn it via "lbflash" once I hack up a custom kernel to
create a working /dev/mtd[x] device (and figure out how to provision
the new kernel). This is my first foray into anything at this level
on the cluster we have. It has, until recently, actually run
flawlessly minus a problem with the PDU I worked around. I have had
next to zero exposure to the LNXI software (cwx, etherboot, custom
bioses, etc) until now so I'm way behind the curve on some issues like
this. I'm sort of dreading an OS upgrade I really need to make
eventually, due to lack of knowledge about the workings of this
setup. I hate to nuke everything and use rocks, as I really like the
icebox setup and parts of cwx quite a bit.
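
Before re-burning anything, it would be sensible to confirm that the images saved from the working nodes really are identical. A minimal sketch, assuming the images have already been dumped to files by whatever means turns out to work; the function names are hypothetical and the dumping step itself is not shown.

```python
import hashlib
from collections import defaultdict


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def group_by_digest(paths):
    """Group image files by digest; identical images end up in the same list."""
    groups = defaultdict(list)
    for p in paths:
        groups[sha256_of(p)].append(p)
    return dict(groups)
```

If every image from the known-good nodes lands in a single group, that image is a reasonable candidate for flashing; more than one group would be worth understanding first.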

Basically out here looking for suggestions, or to see if anyone else
had ever seen an issue like this before.

Ed


On May 27, 11:19 am, Joshua McDowell <jmcdow...@issisolutions.com>
wrote:

Joshua McDowell

May 27, 2009, 3:10:52 PM
to lnx...@googlegroups.com
Ed,
The company I work for sells (mostly refurbished, some new) H8DCE-LNXI motherboards. The only difference is that the sound and printer ports are removed. I think we have the other mobo you mentioned, but I'm not sure, as that's not something I handle. I am the software person where I work, and one of the things I am trying to do is convince most people to dump the LNXI software stack and put the stock factory BIOS on their nodes, then move on to something that is much more supportable, like the stack I am working with, although there are several out there.


Joshua McDowell
- Lead integrator

Sent from my Blackberry

-----Original Message-----
From: Ed <ed....@gmail.com>

Date: Wed, 27 May 2009 11:09:11
To: Linux Networx Users Group<lnx...@googlegroups.com>
Subject: [lnxiug] Re: Linuxbios / Supermicro problem

Joshua McDowell

May 27, 2009, 3:40:24 PM
to lnx...@googlegroups.com
Ed,
I apologize, as I forgot to mention that, based on your information, I would still guess a bad CPU or motherboard. Keep in mind that just because that motherboard supports that CPU doesn't mean that the version of LinuxBIOS on it does.

Thanks,

Joshua McDowell
Joshua McDowell
- Lead integrator

Sent from my Blackberry

-----Original Message-----
From: Ed <ed....@gmail.com>

Date: Wed, 27 May 2009 11:09:11
To: Linux Networx Users Group<lnx...@googlegroups.com>
Subject: [lnxiug] Re: Linuxbios / Supermicro problem



Ed

May 29, 2009, 2:04:41 PM
to Linux Networx Users Group
Went and pulled that node to look at the BIOS. Swapped the old PLCC
off the dead mobo onto it and, lo and behold, life was good again.
Thanks for the help Joshua. Saved me a bunch of time compiling
kernels, playing with the provisioning system and/or kexec to replace
and burn the BIOS.

Ed


On May 27, 3:40 pm, "Joshua McDowell" <JMcDow...@ISSISolutions.com>
wrote:

Joshua McDowell

May 29, 2009, 4:28:46 PM
to lnx...@googlegroups.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ed,

Provided the motherboards are identical, then there should be no
problem. If the motherboards aren't identical, I wouldn't trust the
node. Linux NetworX compiled linux BIOS for each and every node, so that
version of linux BIOS should only work on that motherboard make/model.

Joshua
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkogRXkACgkQDiqOyViXQA5IJwCfa/lRRLMfmnAyHSscYANAXA3y
sNAAnA4Rf/1gu2TJkL+Zd6nzdVx4+nH0
=f11w
-----END PGP SIGNATURE-----