Problems booting SC5832

Patrick Finnegan

Oct 10, 2011, 11:06:00 AM
to sicorte...@googlegroups.com, s...@purdue.edu
Hi, new problems with the SC5832 today.

The machine appeared hung, and I couldn't get to the console on the nodes.
After a failed scboot attempt, I did a power cycle of the whole
machine, and now scboot aborts here:


Loading and booting linux
bamf: Loading vmlinux
bamf: Loading bootk
diagcomm_start: error from MSP, msp 7, node -1: RPC error: attempt to
load data timed out: nodemask = 0x200

Failed loading linux
Caught signal, cleaning up.

Any suggestions for what to look at?

Pat

Narayan Desai

Oct 10, 2011, 11:15:00 AM
to sicorte...@googlegroups.com, s...@purdue.edu
I think that we saw something similar when we had a module go dead
earlier this year. Do diags show anything useful?
-nld

Bobby Woods-Corwin

Oct 10, 2011, 11:25:25 AM
to sicorte...@googlegroups.com, s...@purdue.edu
Larry would have more detailed recollections, but I will mention two
points:

> > diagcomm_start: error from MSP, msp 7, node -1: RPC error: attempt to
> > load data timed out: nodemask = 0x200

1. nodemask 0x200 implicates node 9 in particular
2. I _think_ the fact that it dies at the load data stage might
implicate the DRAM on that node

If that's so, diags will definitely pinpoint it.

And in the meantime, you can likely work around it by disabling that node
and trying again.

--
Bobby Woods-Corwin
m...@alum.mit.edu

Lawrence Stewart

Oct 10, 2011, 3:29:20 PM
to sicorte...@googlegroups.com, Lawrence Stewart, s...@purdue.edu
As Bobby says (Hi Bobby!), that sort of error indicates a failure in a specific node,
rather than a whole module or the system.

The usual course of action is to

* reboot after disabling that specific node in sicortex-system.conf

or

* run the diags on that node (at least) and see what they say.

or

* just try it again

If the trouble is a bad DIMM, you can fix it yourselves (if you have spares!). Otherwise,
just mark it disabled and press on with the other however many nodes.

If you need the diagnostics user manual or system admin guide, I've got copies.

-Larry
