SiCortex 1458 scboot issues


Richard Żak

Apr 18, 2012, 2:39:06 PM
to SiCortex Users
I have a SiCortex 1458. We moved to a new facility, which meant
changing IPs from 172.16.x.x to 10.10.x.x. I changed the IP on the
head node, and the external compute nodes used DHCP. I ran sinfo:

PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
sci up infinite 243 down* sci-m0n[0-26],sci-m1n[0-26],sci-m2n[0-26],sci-m3n[0-26],sci-m4n[0-26],sci-m5n[0-26],sci-m6n[0-26],sci-m7n[0-26],sci-m8n[0-26]
sci-comp up infinite 239 down* sci-m0n[0,2-5,7-26],sci-m1n[0,2-26],sci-m2n[0-26],sci-m3n[0-26],sci-m4n[0-26],sci-m5n[0-26],sci-m6n[0-26],sci-m7n[0-26],sci-m8n[0-5,7-26]
sci-ok up infinite 186 down* sci-m0n[0,2-5,7-26],sci-m2n[0-26],sci-m3n[0-26],sci-m4n[0-26],sci-m6n[0-26],sci-m7n[0-26],sci-m8n[0-5,7-26]

When I try to run scboot, I get a lot of errors that all seem to point
at blade 8 (sci-msp8). Any thoughts? I have looked through the SiCortex
documentation on a DVD image I pulled from http://mirror.anl.gov/pub/sicortex/isos/V3.1/.

sicortex-ssp ~ # scboot
/var/state/route_info.sci checks out OK!

Booting partition: sci

Checking Module Service Processors
unrecognized num Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect()
poll failed (msp0 (sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))'], rev Diagcomm failure: ['MSP', 0, -1,
'diagcomm_connect() poll failed (msp0 (sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Warning: inconsistent board speeds!
sci-msp0: 2101-03, rev 06: B1-633MHz-capable (2)
sci-msp1: 2101-03, rev 06: B1-633MHz-capable (2)
sci-msp2: 2101-03, rev 06: B1-633MHz-capable (2)
sci-msp3: 2101-03, rev 06: B1-633MHz-capable (2)
sci-msp4: 2101-03, rev 06: B1-633MHz-capable (2)
sci-msp5: 2101-03, rev 06: B1-633MHz-capable (2)
sci-msp6: 2101-03, rev 06: B1-633MHz-capable (2)
sci-msp7: 2101-03, rev 06: B1-633MHz-capable (2)
sci-msp8: Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll
failed (msp0 (sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))'], rev Diagcomm failure: ['MSP', 0, -1,
'diagcomm_connect() poll failed (msp0 (sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']: unrecognized num Diagcomm failure: ['MSP', 0, -1,
'diagcomm_connect() poll failed (msp0 (sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))'], rev Diagcomm failure: ['MSP', 0, -1,
'diagcomm_connect() poll failed (msp0 (sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))']
Diagcomm failure: ['MSP', 0, -1, 'diagcomm_connect() poll failed (msp0
(sci-msp8:1235))'] (1)
Reverting to lowest common denominator (1)

Creating boot configuration

Halting all nodes
scand unresponsive, try 0/5
scand unresponsive, try 1/5
scand unresponsive, try 2/5
scand unresponsive, try 3/5
scand unresponsive, try 4/5
scand connection failed
Halt of nodes on sci-msp8 failed
Caught signal, cleaning up.

sicortex-ssp ~ #

Lawrence Stewart

Apr 18, 2012, 9:01:19 PM
to sicorte...@googlegroups.com, Lawrence Stewart
So it sounds like one of the following:

* Your module 8 did not power up correctly
* Your module 8 doesn't have control network ethernet connectivity
* Your module 8 doesn't have the correct IP on the control network

Unplug it, wait 2 minutes, and plug it in again.
Module 8 is the one on the right.

Check the dnsmasq log files on the ssp to see if it requested a DHCP address and got a TFTP download
of its uclinux image. (This is the module service processor we're talking about, not the node processors.)

(Compare log entries for the other modules to see if they follow the same pattern)
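A rough sketch of that log comparison in Python, for anyone who would rather not eyeball it. The two regexes assume the usual dnsmasq syslog wording ("DHCPACK(iface) ip mac [hostname]" and "sent <file> to <ip>"); the dnsmasq version on the ssp may log differently, so treat the patterns, and the function name summarize_dnsmasq, as assumptions to adjust rather than anything taken from the SiCortex tools.

```python
import re
from collections import defaultdict

# Assumed line shapes, based on typical dnsmasq syslog output:
#   ... DHCPACK(eth1) 10.10.1.8 00:11:22:33:44:55 sci-msp8
#   ... sent /some/image to 10.10.1.8
DHCPACK_RE = re.compile(
    r"DHCPACK\((?P<iface>[^)]+)\)\s+(?P<ip>\S+)\s+(?P<mac>\S+)(?:\s+(?P<host>\S+))?"
)
TFTP_RE = re.compile(r"sent\s+(?P<path>\S+)\s+to\s+(?P<ip>\S+)")


def summarize_dnsmasq(lines):
    """Group DHCP acknowledgements and TFTP transfers by client IP.

    Returns {ip: {"host": name or None, "dhcp_acks": int, "tftp_files": [paths]}}.
    """
    summary = defaultdict(lambda: {"host": None, "dhcp_acks": 0, "tftp_files": []})
    for line in lines:
        ack = DHCPACK_RE.search(line)
        if ack:
            entry = summary[ack.group("ip")]
            entry["dhcp_acks"] += 1
            # Keep the last hostname seen; some ACK lines carry none.
            entry["host"] = ack.group("host") or entry["host"]
            continue
        tftp = TFTP_RE.search(line)
        if tftp:
            summary[tftp.group("ip")]["tftp_files"].append(tftp.group("path"))
    return dict(summary)
```

Feed it the open log file (for example `summarize_dnsmasq(open(path))`, with `path` being wherever dnsmasq logs on your ssp) and look for a module whose entry has no DHCP ACK or no TFTP transfer while the other eight have both.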

Check the link light on the control network ethernet cable.
Check the blinky LEDs on module 8 to see if they match the other modules.

Unless the msp correctly boots uclinux and starts its various services, it won't be listening for
the port 1235 connections from the boot process.

You can also try "telnet msp0 1236". There is an ash shell there that can be used for some poking around,
but it is unlikely that the shell is up if the port 1235 process is not.
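If you want to check all the modules in one go instead of telnetting to each, something like the following Python sketch would do. It only attempts a plain TCP connect to ports 1235 and 1236 and reports what happened; it does not speak the diagcomm protocol. The helper names probe and probe_msps are made up for this example, and the hostnames sci-msp0 through sci-msp8 are taken from the scboot output above, so adjust them if your control network names differ.

```python
import socket


def probe(host, port, timeout=3.0):
    """Try a TCP connect; return "open" or a short description of the failure."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.gaierror as exc:
        # The name did not resolve, which would point at DNS/hosts, not the MSP.
        return f"name lookup failed: {exc}"
    except socket.timeout:
        # No answer at all: MSP down, unplugged, or on the wrong subnet.
        return "timed out"
    except OSError as exc:
        # Typically "connection refused": host is up but nothing is listening.
        return f"failed: {exc}"


def probe_msps(hosts, ports=(1235, 1236), timeout=3.0):
    """Probe every (host, port) pair and return {(host, port): result}."""
    return {(h, p): probe(h, p, timeout) for h in hosts for p in ports}
```

For example, `probe_msps([f"sci-msp{i}" for i in range(9)])` covers all nine modules. A timeout on sci-msp8 alone suggests power, cabling, or addressing; a refused connection suggests the MSP is reachable but its services never started.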

-Larry

