Re: Problems booting SC5832

Lawrence Stewart

Mar 22, 2011, 1:33:49 PM
to sicorte...@googlegroups.com, Lawrence Stewart
I suspect the communications fabric has not come up properly.

In a normal boot, the "poolmask" message should be followed in the console logs
by things like:

[Tue Jan 18 14:01:35 2011][4294712.560000] scfabric do_link_maint:1268: TX link 0 entered up state.
[4294712.560000] scfabric scfab_fsw_link_state_change:711: FswBlockReset <- 0x06fff000
[4294712.560000] scfabric do_link_maint:1290: RX link 0 entered up state.
[4294712.560000] scfabric scfab_fsw_link_state_change:711: FswBlockReset <- 0x06ff6000
[4294712.560000] scfabric do_link_maint:1268: TX link 1 entered up state.
[4294712.560000] scfabric scfab_fsw_link_state_change:711: FswBlockReset <- 0x04ff4000
[4294712.560000] scfabric do_link_maint:1290: RX link 1 entered up state.
[4294712.560000] scfabric scfab_fsw_link_state_change:711: FswBlockReset <- 0x04f44000
[4294712.560000] scfabric do_link_maint:1268: TX link 2 entered up state.
[4294712.560000] scfabric scfab_fsw_link_state_change:711: FswBlockReset <- 0x00f00000
[4294712.560000] scfabric do_link_maint:1290: RX link 2 entered up state.
[4294712.560000] scfabric scfab_fsw_link_state_change:711: FswBlockReset <- 0x00000000
[4294712.560000] scfabric do_link_maint:1312: Link states: TX: 3 3 3 RX: 3 3 3
all links are up
[4294712.618000] scfabric scfab_fabric_ioctl:1262: Downloaded master route table.
fabric ioctl master routes returns status 0
Calling iface_hook: '/sbin/fabricd_iface_hook ifup'
Loading scio/sceth modules


So somewhere a node has a failing link, or didn't boot. To disable it, you need to find it.
The fabric bringup is coordinated by the master fabric daemon (mfd) on the ssp. Its log
file is /var/log/scx/mfd.log

Here is a segment of a good boot, from an SC072:

[2011-03-18 11:01:35,951] MFD started at Fri Mar 18 11:01:35 2011
[2011-03-18 11:01:35,969] using port 6170
[2011-03-18 11:01:35,969] Waiting for nodes to check in
[2011-03-18 11:03:39,183] Starting boot for partition sca
[2011-03-18 11:03:39,184] Required nodes = 12
[2011-03-18 11:03:49,821] all 12 nodes checked in, begin polling for links up
[2011-03-18 11:03:49,824] Polling 1
[2011-03-18 11:03:49,824] 0/12 nodes have links up
[2011-03-18 11:03:54,830] Polling 2
[2011-03-18 11:03:54,830] 12/12 nodes have links up
[2011-03-18 11:03:54,830] All nodes finished booting for sca.


You will likely find that not all nodes have checked in. If it isn't obvious who is missing,
you can run mfd in a more verbose mode, run mfd_watcher, or just examine the ends of the console
logs to see who is different.
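
As a rough illustration of reading mfd.log for this, here is a minimal Python sketch that reports how far the last boot attempt got. The patterns are inferred only from the SC072 sample above, so they are assumptions: other mfd versions, or output with -g DEBUG, may word these lines differently. The function names are hypothetical and not part of any SiCortex tool.

```python
import re

# Patterns inferred from the sample mfd.log segment only (assumption).
REQUIRED_RE = re.compile(r"Required nodes = (\d+)")
CHECKED_IN_RE = re.compile(r"all (\d+) nodes checked in")
LINKS_RE = re.compile(r"(\d+)/(\d+) nodes have links up")
FINISHED_RE = re.compile(r"All nodes finished booting")


def _new_state():
    return {"required": None, "checked_in": False,
            "links_up": None, "finished": False}


def summarize_mfd_log(lines):
    """Summarize the most recent boot attempt found in an iterable of log lines."""
    state = _new_state()
    for line in lines:
        m = REQUIRED_RE.search(line)
        if m:
            # A "Required nodes" line marks a new boot attempt: start over.
            state = _new_state()
            state["required"] = int(m.group(1))
            continue
        if CHECKED_IN_RE.search(line):
            state["checked_in"] = True
            continue
        m = LINKS_RE.search(line)
        if m:
            # Keep only the latest poll result as (nodes_up, nodes_total).
            state["links_up"] = (int(m.group(1)), int(m.group(2)))
            continue
        if FINISHED_RE.search(line):
            state["finished"] = True
    return state


def summarize_mfd_file(path="/var/log/scx/mfd.log"):
    """Convenience wrapper that reads the log file from disk."""
    with open(path, errors="replace") as f:
        return summarize_mfd_log(f)
```

If "checked_in" is still False, some node never reported to mfd; if it is True but "finished" is False, the last "links_up" pair shows how many nodes are still waiting on links.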

See the man pages for mfd and mfd_watcher. To run mfd in verbose mode, it is probably easiest to
patch scboot to change mfd's command line (adding -g DEBUG).


I suspect one or more nodes haven't checked in, or some link has failed bringup, and mfd has
not finished.

Finding a bad node by grepping the console logs can also be done with clever use of Linux
utilities. I will try to find notes in my old email.
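
In the meantime, the console-log comparison might look something like the Python sketch below. It relies on the observation above that the "poolmask" message should be followed by "all links are up" in a normal boot. The log directory, the one-file-per-node layout, and the function name are all assumptions for illustration; point it at wherever your console logs actually live.

```python
from pathlib import Path

POOLMASK = "poolmask set to"      # last line seen on the stuck nodes
LINKS_UP = "all links are up"     # printed once fabric links come up


def classify_console_logs(log_dir, pattern="*"):
    """Map log file name -> reason, for every node log that looks suspect.

    Assumes (hypothetically) one console log file per node in log_dir.
    Only the text after the last poolmask message is checked, so output
    from earlier, successful boots in the same file does not mask a failure.
    """
    suspects = {}
    for path in sorted(Path(log_dir).glob(pattern)):
        if not path.is_file():
            continue
        text = path.read_text(errors="replace")
        idx = text.rfind(POOLMASK)
        if idx == -1:
            suspects[path.name] = "never reached poolmask"
        elif LINKS_UP not in text[idx:]:
            suspects[path.name] = "links not up after poolmask"
    return suspects
```

When the fabric is stuck, most or all nodes may show up as "links not up", since one bad node can hold up the rest. The interesting ones are the nodes that differ from the majority, for example a node that never reached the poolmask message at all.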

-Larry

On Mar 22, 2011, at 1:05 PM, Patrick Finnegan wrote:

> Hi all,
>
> We have an SC5832 which now fails to boot. We've tried a power cycle
> of the system, and it doesn't seem to have helped.
>
> The LED pattern on the machine seems to be normal. Upon doing an
> "scboot -p scx", scboot stops at "Waiting for NBD nodes".
>
> The console messages on the NBD nodes all end approximately like this
> (this is from m4n6):
>
>>>> Found MGTnet interface
> ifconfig eth0 172.31.152.100 mtu 9000 broadcast 172.31.152.255 netmask
> 255.255.255.0
> [4294688.395000] PM: Writing back config space on device 0000:02:04.0
> at offset 1 (was 2b00000, writing 2b00006)
> [4294689.754000] PM: Writing back config space on device 0000:02:04.0
> at offset 1 (was 2b00000, writing 2b00006)
> route add -net 172.31.150.0 netmask 255.255.255.0 dev eth0
> Bringing up eth1...
> [4294691.261000] PM: Writing back config space on device 0000:02:04.1
> at offset 1 (was 2b00000, writing 2b00006)
> [4294692.618000] tg3: eth1: No firmware running.
> udhcpc (v1.10.3) started
> Leaving interface eth1 unconfigured.
> Sending discover...
> Sending discover...
> Sending discover...
> Leaving interface eth1 unconfigured.
> No lease, failing
> Loading scfab module
> [4294702.194000] scfabric detect_cluster_type:1705: I am Blizzard
> module 4 node 6
> [4294702.207000] scfabric scfab_fsw_init:344: Initialized fabric
> switch.
> [4294702.207000] [cpu 4] slow irq startup : Fabric Link (21)
> [4294702.208000] scfabric scfab_fl_init:266: zcalib_tx: 80 77 77 82 76
> 75 77
> [4294702.208000] scfabric scfab_fl_init:270: zcalib_rx: 80 78 78 82 76
> 75 77
> [4294702.209000] scfabric scfab_fl_init:324: TX and RX links
> calibrated.
> [4294702.209000] scfabric scfab_fl_init:325: Success.
> [4294702.209000] scfabric load_microcode:1531: Loading microcode...:
> 25:11 2011]
> [4294702.209000] scfabric load_microcode:1559: Microcode revision
> 3.42486 loaded.
> [4294702.209000] scfabric scfab_open_dma_dev:262: opened DMA context
> 0, pid = 0
> [4294702.210000] scfabric scfab_allocate_buffers:604: allocated
> buffers for DMA context 0
> [4294702.210000] scfabric scfab_setup_interrupt_handler:1423: set up
> interrupt handler for context 0
> [4294702.211000] scfabric scfab_ctx0_init:185: success.
> [4294702.211000] scfab: initialized 14 DMA devices
> Running fabricd...
> rootfs(NFSNBD): Mounting rootfs image... (Sat Jan 1 00:00:26 UTC
> 2000)
> rootfs(NFSNBD): Rootfs image mount done (Sat Jan 1 00:00:26 UTC 2000)
> [4294703.058000] scfabric scfab_fsw_set_vc_decrement_mask:510:
> FswIbMode[0] <- 0x00000001
> [4294703.058000] scfabric scfab_fsw_set_vc_decrement_mask:510:
> FswIbMode[1] <- 0x00000001
> [4294703.058000] scfabric scfab_fsw_set_vc_decrement_mask:510:
> FswIbMode[2] <- 0x00000001
> [4294703.058000] scfabric scfab_fsw_set_vc_pool_mask:456:
> FswPoolMask[0] <- 0x0000ff00
> [4294703.058000] scfabric scfab_fsw_set_vc_pool_mask:456:
> FswPoolMask[1] <- 0x0000ff00
> [4294703.058000] scfabric scfab_fsw_set_vc_pool_mask:456:
> FswPoolMask[2] <- 0x0000ff00
> fabricd: poolmask set to ff00
>
> Does anyone have any suggestions at what else to look at? The scboot
> process otherwise seems to be working normally.
>
> Thanks for any help,
>
> --
> Patrick Finnegan
> Purdue University ITSO/Research Systems
>

Patrick Finnegan

Mar 22, 2011, 1:05:26 PM
to SiCortex Users

Patrick Finnegan

Mar 22, 2011, 4:06:57 PM
to sicorte...@googlegroups.com, Lawrence Stewart
Thanks for the help!

Adding -g DEBUG to mfd's commandline helped to find a node that was bad
and spewing ECC errors to its console.

Once I disabled the node, the (rest of the) system came up properly.

Pat


--
Purdue University Research Computing -- http://www.rcac.purdue.edu

Dave McGuire

Mar 22, 2011, 4:29:01 PM
to sicorte...@googlegroups.com
On 3/22/11 4:06 PM, Patrick Finnegan wrote:
> Thanks for the help!
>
> Adding -g DEBUG to mfd's commandline helped to find a node that was bad
> and spewing ECC errors to its console.
>
> Once I disabled the node, the (rest of the) system came up properly.

No, it's hopeless, it will never run right again. Pack it up and
ship it to me, and I'll "dispose" of it for you. ;)

Hi Pat! ;)

-Dave

--
Dave McGuire
Port Charlotte, FL

Patrick Finnegan

Mar 22, 2011, 4:45:19 PM
to sicorte...@googlegroups.com

Hmm, I think I've heard that phrase before...

At some point I ought to get back to working on my SC072 at home. Too
many other projects though...

Pat
