
The day the Octane died.


Jonathan C. Patschke

Nov 18, 2001, 11:21:47 AM
Hello, world,

A couple of months ago, I posted here asking about upgrading my
trusty old Indigo2 KI to an Octane SI. Well, did it. I've got a nice
setup with 512MB of RAM, 168GB of disc, and SI graphics. For
development, I really can't complain (although dual CPUs would be
uber-sweet), and I thank everyone for pointing me towards that decision.

However (you knew this was coming), I think I may have a flaky
system. I've emailed the person I purchased the machine from, but I
thought I'd ask the general opinion on here, too. Sadly, this is
roughly 5 days after the warranty expired. :(

The first time I booted up the system, I actually left it running
for a month. Didn't need to power it off ever. A reboot to install new
software was required a few times, but I never needed to power the
system down. According to IRIX, I had an uptime of over 11000 days,
thanks to a flat tod clock battery. The last time I rebooted was at
least three weeks ago--the system's been wonderfully stable.
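
As a rough check on that uptime figure, here is a minimal sketch assuming the dead clock battery left the boot time stamped near the Unix epoch (1 January 1970) while the current time was later set correctly. That assumption is not confirmed anywhere in this thread, and the names `UNIX_EPOCH`, `POST_DATE` and `days_between` are only illustrative.

```python
from datetime import date

# Assumption: with a flat battery the clock fell back to about the Unix epoch.
UNIX_EPOCH = date(1970, 1, 1)
# Date of this post, taken from the message header.
POST_DATE = date(2001, 11, 18)


def days_between(start, end):
    """Return the whole number of days from start to end."""
    return (end - start).days


# Apparent uptime if the boot time was recorded as the epoch.
apparent_uptime_days = days_between(UNIX_EPOCH, POST_DATE)
```

Under that assumption the apparent uptime comes out at a little over 11600 days, which is in line with the "over 11000 days" reported above.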

Earlier this week, we had some -really- bad weather tear through my
part of Texas, so I opted to unplug everything in my office, just in
case the line conditioner failed (trust me... the lightning from this
storm was unbelievable--5 second long forks--you'd have done the same).

So, I start my Octane up a day later and am presented with a
blinking red LED on the lightbar. Yay. Okay, loose DIMM. Maybe one of
the other admins moved my machine in rigging up the new UPS systems.
Remove, eraser-clean, reseat, reboot. Solid red LED. Okay, do the same
to the SI card (except using canned air on the compression connector,
instead of an eraser). The system works.

Later that day, we're about to put the new UPS system online so
that we don't lose power the next time that the weather decides to get
-really- nasty and take out TXU for half a day. I power down again, if
for nothing else, to see if the Octane really -is- fixed.

I give it power again, and get a solid red LED. Okay, time to
break out the serial cable, as there is no possible way that the XIO
card carrier, or the SI card on it have moved. Here's what I get upon
bootup:


.*
.$$2%33: 08&&&&&&&&"&6000"0, 80%#4%$: 080000000040000000, .%#%)6%$:
080000000000


(yes, that's at the correct serial settings). Pressing the NMI
button yields the following on the serial console:


8#%04)/.> <6%#4/2=
22 .: 08&&&&&&&&9&#16074
.: 08&&&&&&&&&&&&&&&&, .!$ !$$2: 083$&&&0082$71$462, .:
08&&&&&&&&"&#00"00
.!53%: 08"0008074, 4!453: 0834010080
. : 080004000000000000, .!53%: 080000000000000000
)3#: 08000000000000000000000000, "53
%- .!4!: 08&&&&&&&&&&&&&&&#04008, .!$
.##: 080000000000000000
)$'%4 22 .-$: 0800000000, )$'%4 22 .$$2: 0800000000 0800000000
)$'%4 22 90%: 0800000000, )$'%4 22 .$$2: 0800000000.0800000000
.)$'%4 . .%!$ .)-%/54 .$$2: 0800000000
"/7 ..: 0800000000
.)$'%4 22 .-$: 0800000000, .)$'%4 22 .$$2: 0800000000.0800000000
.2)$'% ..: 0800000040
.)$'%4 22 .-$: 08&&&&&&&&, .)$'%4 22 .$$2: 080000&&&&.080000&&&&
.58 22 .-$: 0819000105, .%30 .5& 22 .$$2: 0800000000.0800420178
.. / 22 .$$2: 0800000000.0800420178
.3 ...: 0882800002
.. 22 .$$2: 0800000000.0804000050


Which, while not readable, looks a lot more like a memory trace than
line noise to me (lots of 8s and 0s, with what looks like deliberate
formatting). Perhaps the previous IRIX install was done with a
non-English PROM file? The two messages are bit-for-bit consistent, no
matter what I do. I've tried swapping out the first bank of DIMMs with
another bank, in case the system couldn't get enough low memory to POST
properly, but that gives the same error message.
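
One possible reading of the garbled output, offered as a guess and not confirmed anywhere in this thread: many of the odd characters are exactly what lower-case ASCII letters turn into when bit 6 (0x40) is cleared. The sketch below sets that bit again on a few fragments copied from the dump. `set_bit6` is an illustrative helper, and the result is ambiguous, because any digit or punctuation mark that arrived intact gets altered as well.

```python
def set_bit6(text):
    """Set bit 6 (0x40) on every character, undoing a hypothetical stuck-low bit."""
    return "".join(chr(ord(ch) | 0x40) for ch in text)


# Fragments copied from the two serial dumps above.
FRAGMENTS = ["8#%04)/.", "6%#4/2", "!$$2", "$$2%33", "08&&&&&&&&"]

# Map each garbled fragment to its candidate reading.
DECODED = {fragment: set_bit6(fragment) for fragment in FRAGMENTS}
```

Under this assumption the fragments read as `xception`, `vector`, `addr` and `ddress`. The last one comes out as `pxffffffff`, which suggests the leading `0` was a real zero and the value was `0xffffffff`. Upper-case letters would lose bit 6 too and turn into control characters, which may be why the first letter of each word is missing or shown as a dot.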

I even completely disassembled the system, cleaned out every nook
and cranny with canned air, and made sure that everything socketed (or
otherwise user-removable) was properly seated. Still a solid red LED.
This isn't a vibration problem, methinks.

So, here I am, back on my trusty Indigo2 (missing the snappiness of
the Octane), asking advice again. Please, can anyone help me? I've
sunk a (relative) lot of money into this workstation (I'm above the $1k
mark that another poster was whining was a prerequisite to getting help
:), and, up until now, it's worked excellently. These machines (well,
when working--I'm assuming that they're not all this touchy) truly are a
monument to what SGI can build in a desktop workstation.

Two additional, but less urgent, things: 1) My time-of-day clock is
flat, and I've had no luck finding a reliable source of DS1687-5 chips.
Where are they all? 2) There's an empty DIP socket on my Octane's
IP30 board, beneath where the second processor would live. What would
go there?

--
Jonathan

Remove "theobvious." to email me, but please post replies to the
newsgroup so that all may benefit.

Ralf Beyer

Nov 18, 2001, 1:10:29 PM
"Jonathan C. Patschke" wrote:


From earlier postings:

- I removed the system boards and reseated the memory, cleaned the
dust off the connections and it booted up right away.

- In some cases reseating the system module did the trick. May be
necessary to repeat the procedure several times.

However, some system modules shut off and could not be brought back
by this method.

- Remove the power cord, ground yourself, remove the system module,
put it on a flat surface with the components facing you, and press
the module with the black heat sink in the center of the board
firmly against the surface. Reassemble and restart the machine.

This succeeded in 5 of 5 cases with 5 different Octanes but after
some time the defect showed up again.

The cause may be a bad main board which must be replaced. If it
is version -003, replace it with a system board with version number
-004 or better -005.

The version number is shown on the IP30 board but can also be
read under IRIX by

'hinv -mv'

Look for a line similar to:

Location: /hw/node/xtalk/15
IP30 Board: barcode HHL594 part 030-0887-003 rev C
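
For anyone checking several machines, the board version can be pulled out of saved `hinv -mv` output with a short script. This is only a sketch: the sample text is the two lines shown above, the regular expression assumes the line always has that shape, and `ip30_part` is a made-up helper name.

```python
import re

# Sample copied from the post above; real `hinv -mv` output may differ.
SAMPLE = """Location: /hw/node/xtalk/15
IP30 Board: barcode HHL594 part 030-0887-003 rev C
"""

# Assumed line shape: "IP30 Board: ... part NNN-NNNN-NNN rev X".
PART_RE = re.compile(
    r"IP30 Board:.*\bpart\s+(\d{3})-(\d{4})-(\d{3})\s+rev\s+(\S+)"
)


def ip30_part(hinv_text):
    """Return (part number, dash version, revision) for the IP30, or None."""
    for line in hinv_text.splitlines():
        match = PART_RE.search(line)
        if match:
            prefix, base, dash, rev = match.groups()
            return f"{prefix}-{base}", dash, rev
    return None
```

Calling `ip30_part(SAMPLE)` gives the part number `030-0887`, the version `003` and the revision `C`, so a `003` in the middle field marks the board version discussed above.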

- Removed the system module and installed it into another Octane: no
success, only fan and light on. Installed the system module back
into the original Octane: no success, only fan and light on. Then
unplugged the power cable, depressed the power button and while it
was depressed I plugged the power cable in again: the machine came
up as usual.

- Let the Octane cool down long enough and turn it on again.

- Look for a sticking plastic power switch lever. Remove the front
panel, start the machine with the chassis button, and then replace
the front panel. This helped more than once.

- Keep the air intakes clean, cooling is critical.

- Ingo Fellner <i.fe...@puz.de> once wrote:

Try taking out the mainboard (and possibly the power supply), clean
the dust off it and its connectors (with compressed air), put it all
together and try again - we have had many strange errors with Octanes
that could be solved like this.

- Alexis Cousein <a...@brussels.sgi.com> once wrote:

DO *NOT* CLEAN CPOP CONNECTORS WITH COMPRESSED AIR.
At least not if you don't want to ruin them. There are special cans
with just the right kind of pressure (and inert gases) to clean
these, but normal air coming at much higher pressure out of a
compressor is *not* a good idea.

- Simon Pigot <si...@dpiwe.tas.gov.au> once wrote:

I have 0887-003 (large central black heat sink) and 0887-005
(smaller central silver heat sink) IP30s and the problems occur/have
occurred with both, i.e. power supply fan starts when machine is
connected to wall socket but nothing else happens - no LED, no
response to power button. Don't know about the 0887-004 boards or
the later 1467-00x boards.

The only hints I can get are that it's the HEART and SPIDER chips under
the central heat sink on the IP30 which cause most of the problems
(see Ralf Beyer's earlier messages about pressing on the heat sink -
a friend just emailed me with success by doing this but it hasn't
worked on my 005 board so maybe it has something else wrong :-() and
that you'll definitely get this problem if you don't keep the air
intakes clean so it looks like cooling is critical too (not terribly
surprising). Lastly, as mentioned earlier, it looks like
transporting them doesn't help either as they seem to like being
dead on arrival - I don't think I'd buy one without a warranty!

What is the status of the LEDs? Ben Drago <bdr...@sgi.com> once wrote:

The LEDs are simply link status lights. There are actually seven
LEDs:

BaseIO X
    QA X   X PCI Expansion
    QD X   X QB
    QC X   X Heart

The BaseIO and Heart are connected internally and are part of the
IP30, so these will always be lit. The QA-QD refer to the quad
module, which is labeled like this when facing the rear of the
Octane:

QA | QB
-------
QC | QD

QA should always be lit as the first graphics card is installed in
Quadrant A. The other LEDs will be lit depending on what XIO options
are installed.

The PCI Expansion LED will be lit if there is a PCI shoebox
installed.
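
The rules above can be written down as a small lookup, which may help when comparing what the LEDs show against what is actually installed. This is a sketch built only from the description in this post. The names `ALWAYS_LIT`, `QUADRANTS` and `expected_leds` are invented for illustration and do not come from any SGI tool.

```python
# LED names are taken from the description above.
ALWAYS_LIT = {"BaseIO", "Heart"}  # internal to the IP30, so always lit
QUADRANTS = {"QA", "QB", "QC", "QD"}  # XIO quad module slots


def expected_leds(populated_quadrants, pci_shoebox=False):
    """Return the set of LEDs that should be lit for a given configuration."""
    populated = set(populated_quadrants)
    unknown = populated - QUADRANTS
    if unknown:
        raise ValueError(f"unknown quadrant(s): {sorted(unknown)}")
    lit = ALWAYS_LIT | populated
    if pci_shoebox:
        lit.add("PCI Expansion")
    return lit
```

For a machine with a single graphics card in Quadrant A and no PCI shoebox, `expected_leds({"QA"})` returns the three LEDs BaseIO, Heart and QA. Any other lit or unlit LED would then be worth a closer look.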

The 030-1467-001 is the 'new' IP30 board.

Please report how you resolved the case.

Regards
Ralf Beyer
--
beyer.bra...@freenet.de

Jonathan C. Patschke

Nov 19, 2001, 10:52:41 AM
Ralf Beyer was once known to say:

> - Remove the power cord, ground yourself, remove the system module,
> put it on a flat surface with the components facing you, and press
> the module with the black heat sink in the center of the board
> firmly against the surface. Reassemble and restart the machine.
>
> This succeeded in 5 of 5 cases with 5 different Octanes but after
> some time the defect showed-up again.

That did it. Many, many thanks. If you've a PayPal account, email the
details and the proper amount to me so that I may buy you a pint of your
favorite stout.

I would guess that the recurring problem would be due to marginal
connections on the BGA package holding the two chips under the
heatsink. From what I've read (in the last day) elsewhere, it seems
that the heart and spider chip are there. I was under the impression
that the heart was the big chip in the frontplane. Just for my
edification, what's that chip?

> The cause may be a bad main board which must be replaced. If it
> is version -003 replace it by a system board with version number
> -004 or better -005.

It is, indeed, an 003 board. I assume that this manufacturing defect
was corrected in the later revisions of the IP30?

> The LEDs are simply link status lights. There is actually seven
> LEDs:

I'd wondered what purpose those lights served. After reading your post
in full, it seemed that something was Not Right, as the lightbar code
reported no XIO/gfx card, but the appropriate XIO LED was illuminated.
This makes perfect sense if the heart module has marginal connections to
other parts of the system.

I'm guessing that the 15-pin D-shell connector next to the LED array is
some sort of voodoo diagnosis port?

Again, many many thanks for helping me resolve this issue. Also, many
thanks to Tim over at NovaStar (from whom I purchased my Octane). His
firm stands behind their customers and their products, and he's
consistently been willing to help me resolve the other issues I've had
(almost all of which were caused by UPS).

Wolfgang Szoecs

Nov 19, 2001, 3:54:21 PM
In article <3BF92AC9...@celestrion.theobvious.net>,

"Jonathan C. Patschke" <j...@celestrion.theobvious.net> writes:

> From what I've read (in the last day) elsewhere, it seems
> that the heart and spider chip are there. I was under the impression
> that the heart was the big chip in the frontplane. Just for my
> edification, what's that chip?

that's the crossbar - also known as XBOW.
HEART is directly on the IP30.

wolfgang

Ralf Beyer

Nov 19, 2001, 4:53:21 PM
"Jonathan C. Patschke" wrote:

>
> Ralf Beyer was once known to say:
> > - Remove the power cord, ground yourself, remove the system module,
> > put it on a flat surface with the components facing you, and press
> > the module with the black heat sink in the center of the board
> > firmly against the surface. Reassemble and restart the machine.
> >
> > This succeeded in 5 of 5 cases with 5 different Octanes but after
> > some time the defect showed-up again.
>
> That did it. Many, many thanks. If you've a PayPal account, email the
> details and the proper amount to me so that I may buy you a pint of your
> favorite stout.

Glad you resolved the case - at least temporarily.

> It is, indeed, an 003 board. I assume that this manufacturing defect
> was corrected in the later revisions of the IP30?

SGI developed a new version (not only revision) for that board ...

Thanks for the feedback and for your offer.

Best regards
Ralf Beyer
--
beyer.bra...@freenet.de
