The machine appeared hung; I couldn't get to the console on the nodes.
After a failed scboot attempt, I did a power cycle of the whole
machine, and now scboot aborts here:
Loading and booting linux
bamf: Loading vmlinux
bamf: Loading bootk
diagcomm_start: error from MSP, msp 7, node -1: RPC error: attempt to
load data timed out: nodemask = 0x200
Failed loading linux
Caught signal, cleaning up.
Any suggestions for what to look at?
Pat
> > diagcomm_start: error from MSP, msp 7, node -1: RPC error: attempt to
> > load data timed out: nodemask = 0x200
1. nodemask 0x200 implicates node 9 in particular
2. I _think_ the fact that it dies at the load data stage might
implicate the DRAM on that node
If that's so, diags will definitely pinpoint it.
And in the meantime, you can likely work around it by disabling that node
and trying again.
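
The bit arithmetic behind point 1 can be checked with a few lines of Python. This is only a sketch, assuming bit i of the nodemask corresponds to node i (which is how 0x200 comes out as node 9); the helper name nodes_in_mask is made up for illustration and is not part of any SiCortex tool.

```python
def nodes_in_mask(nodemask: int) -> list:
    """Return the indices of the bits set in nodemask, lowest first.

    Assumption: bit i set means node i is implicated.
    """
    nodes = []
    bit = 0
    while nodemask:
        if nodemask & 1:          # lowest remaining bit is set
            nodes.append(bit)
        nodemask >>= 1            # move on to the next bit
        bit += 1
    return nodes


# 0x200 == 1 << 9, so only bit 9 is set.
print(nodes_in_mask(0x200))
```

If the mask ever has more than one bit set, the same helper would list every implicated node, which may be handy when deciding which ones to disable.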
--
Bobby Woods-Corwin
m...@alum.mit.edu
On Monday, October 10, 2011 10:15 AM, "Narayan Desai"
The usual course of action is to
* reboot after disabling that specific node in sicortex-system.conf
or
* run the diags on that node (at least) and see what they say.
or
* just try it again
If the trouble is a bad DIMM, you can fix it yourselves (if you have spares!); otherwise,
just mark the node disabled and press on with however many nodes remain.
If you need the diagnostics user manual or system admin guide, I've got copies.
-Larry