Why is this system failing.....

ScottK

May 20, 1998, 3:00:00 AM

Hi all, I have a serious problem at a client and I am at my wits' end. They
have experienced two major system failures in the last year that "should"
not have happened.

They are currently running SCO Unix 3.2.4.2 on a Pentium 133 with 64 MB RAM, a DPT
SmartCache IV controller, a 500 MB IDE boot drive, and 2 Seagate Cheetah drives
mirrored as their data drives. A SCSI TR4 drive is their backup.

3 Weeks ago one of their Cheetah drives failed, at the end of the day of
course. Since the drives were mirrored I pulled the bad drive, fixed the
termination and the system was running again. I rebuilt most of their lost
data and off they went to re-enter what they had lost. Now, last night the
second Cheetah failed.

This wouldn't seem like such a big deal if this system hadn't done the exact
same thing to two Micropolis drives 9 months ago.

My question is: Is it possible that the controller could work for 9 months
without any problems and then all of a sudden kill two drives? Does anyone
know the long-term effects of having a main power panel within 10 feet of a
server? Is this just bad luck? I am specifically looking for anyone who
has had reliability issues with the DPT controllers in the past.

Any insight would be greatly appreciated.

s_c_...@kerfoot.com (Take out the _)


Warren Beaudry

May 20, 1998, 3:00:00 AM

I'd be suspicious of:
DPT controller memory.
Power Supply.


ScottK <the...@kerfoot.com> wrote in article
<6jutc8$okj$1...@argentina.it.earthlink.net>...

ScottK

May 20, 1998, 3:00:00 AM

I'm in the process of replacing the case and power supply.


Warren Beaudry wrote in message
<01bd842b$a5610120$fa0d...@beaudryw.v-wave.com>...

Kevin Smith

May 21, 1998, 3:00:00 AM

In article <6jutc8$okj$1...@argentina.it.earthlink.net> "ScottK" <the...@kerfoot.com> writes:
>Hi all, I have a serious problem at a client and I am at my wits' end. They
>have experienced two major system failures in the last year that "should"
>not have happened.
>
>They are currently running SCO Unix 3.2.4.2 on a Pentium 133 with 64 MB RAM, a DPT
>SmartCache IV controller, a 500 MB IDE boot drive, and 2 Seagate Cheetah drives
>mirrored as their data drives. A SCSI TR4 drive is their backup.
>
>3 Weeks ago one of their Cheetah drives failed, at the end of the day of
>course. Since the drives were mirrored I pulled the bad drive, fixed the
>termination and the system was running again. I rebuilt most of their lost
>data and off they went to re-enter what they had lost. Now, last night the
>second Cheetah failed.

If they were mirrored, why did you lose anything?

> ...
--
Do two rights make | Kevin Smith, ShadeTree Software, Philadelphia, PA, USA
a libertarian | 001-215-487-3811 shady.com,kevin bbs.cpcn.com,sysop

Jeff Liebermann

May 22, 1998, 3:00:00 AM

On Wed, 20 May 1998 08:41:28 -0700, "ScottK"
<the...@kerfoot.com> wrote:

>My question is: Is it possible that the controller could work for 9 months
>without any problems and then all of a sudden kill two drives? Does anyone
>know the long-term effects of having a main power panel within 10 feet of a
>server? Is this just bad luck? I am specifically looking for anyone who
>has had reliability issues with the DPT controllers in the past.

Neato. A hardware detective problem. Time for more guesswork.

If the drives were mirrored, why did you have to do any data
restoring? It should have just kept plunking along.

I wish you had given a better description of the failure.
For example, it would have been nice to know the
symptoms of the drive failure:
Did it complain about bad sectors or blocks?
Did it fail suddenly or gradually? (Very important)
Did the DPT RAID monitor software spew messages on the screen?
Did fire and smog spew forth from the drives?
Did you test the dead drives after removal?
Did BOTH the Micropolis drives eventually die or just one?
Are you using DPT's ECC RAM on the SCSI adapter?
Was any attempt made to re-mirror the drives?
An amazing amount of insight can be gained from error messages.

Others mentioned overheating. I agree. Most of my big servers
have multiple fans. One ALR QSMP box has 5 or 6 big noisy 5" dia
fans. If it were on casters, I think it would roll powered by
the exhaust air. I've added multiple fans to smaller systems if
the exhaust temperature is more than about 20C above ambient.

My *GUESS* is that the drives did not fail. They merely
scrambled the data. I deduce this from the fact that you had to
restore data. This means that something caused the machine to
trash data on both drives, but more so on one drive. The fact
that it happened previously on a different set of drives
indicates that there is probably a single point of failure. This
points to something shared by the drives but not by the rest of
the system as there have been no other common failures (parity
checks, panics, hangs, etc).

I have about 8ea DPT 2122 (EISA) and 31?? (PCI) cards in service.
No SmartCache IV yet. Most use the non-ECC memory. I've been
told this is risky but have never experienced any problems.
However, I do test my subsystems with the usual routine: write a
big file, do a checksum, copy the file, do another checksum, erase
the file, copy again, another checksum, ad nauseam. If there's any
flaky memory, this will show it. On systems where paranoia was
a concern, I disabled the write caching. I would say the DPT is
fairly safe.
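
For illustration only, the write/checksum/copy/checksum loop described above could be sketched in Python roughly as follows. This is not the original test procedure: the function names, the file names, the SHA-256 checksum, and the default size and round count are all assumptions chosen for the example.

```python
import hashlib
import os
import shutil


def sha256_of(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_checksum_test(directory, size_bytes=64 * 1024 * 1024, rounds=5):
    """Write a random file, then repeatedly copy it and compare checksums.

    Returns the list of round numbers whose copy did not match the original.
    File names and defaults are hypothetical, chosen only for this sketch.
    """
    original = os.path.join(directory, "stress_original.bin")
    copy = os.path.join(directory, "stress_copy.bin")
    mismatches = []
    try:
        # Write the big file in blocks of random data and force it to disk.
        with open(original, "wb") as f:
            remaining = size_bytes
            while remaining > 0:
                block = os.urandom(min(1 << 20, remaining))
                f.write(block)
                remaining -= len(block)
            f.flush()
            os.fsync(f.fileno())
        expected = sha256_of(original)

        for round_number in range(1, rounds + 1):
            shutil.copyfile(original, copy)
            if sha256_of(copy) != expected:
                mismatches.append(round_number)
            os.remove(copy)  # erase the copy, then copy again next round
    finally:
        # Clean up whatever is left, even if a write or copy failed.
        for path in (original, copy):
            if os.path.exists(path):
                os.remove(path)
    return mismatches
```

Point it at a directory on the suspect array, for example `copy_checksum_test("/mnt/data")` (a hypothetical mount point), and treat any non-empty result as a sign of corruption somewhere in the path. For the test to exercise the disks and controller rather than the operating system's cache, `size_bytes` should be larger than system RAM plus the controller cache.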

Since you had to move terminators around, I'll assume that you
don't have an overpriced hot-swap drive carrier. I'll also
assume that you have decent ribbon cables, active terminators,
short cables, no mix of external/internal SCSI devices,
terminator power at the end of the cable run, and some clue how
all this works. If not, start looking at the setup carefully.
If my *GUESS* that you are trashing data unequally is correct,
then the only thing that could cause that in a mirrored
arrangement is the ribbon cable and terminators. It is possible
for the DPT controller to do the same thing, but it would
probably also fail hardware diagnostics. Ask DPT.

The infrequency of the problem leads me to suspect that either
the system is susceptible to outside influence or that DPT is
doing a heroic job of correcting a high error rate on the SCSI
bus and dropping the ball occasionally. Dunno. An autopsy of
the allegedly dead drives would be helpful.

For outside influence and other causes of occasional failures,
check the following:
- Nearby radio transmitters. Security guards?
- Static electricity from carpeting, chairs, and plastic cloths.
- Static blast when changing backup tapes.
- Flaky AC power.
- Multiple AC power strips in series.
- Rotten connection on the box AC on/off switch.
- Zinc plating peeling off the case (why me?).
- Rotten solder connections on power supply boards.
- Bad crimps on power connector leads.
- Ground induced glitches via unipolar devices (RS-232 ports).
- Junk power from external SCSI devices.
- Excessively long or unshielded external SCSI cables. If you can
  easily bend the SCSI cable, it's junk.

Good luck.

[x]email [x]news [ ]mailing list

--
Jeff Liebermann 150 Felker St #D Santa Cruz CA 95060
(408)699-0483 pgr (408)426-1240 fax (408)336-2558 home
http://www.cruzio.com/~jeffl WB6SSY
je...@comix.santa-cruz.ca.us je...@cruzio.com
