
SCO 5.0.2 SCSI Problems


Pat Pushor

Jan 18, 2002, 1:36:35 PM
Hi all.

I am having a problem with the SCSI bus losing its mind under SCO
Openserver v5.0.2. First, the hardware:

Pentium 100 MHz workstations with an NCR 53c810 single-channel SCSI
adapter built onto the motherboard. These are custom systems.. :-(
One ST34371N Seagate Barracuda HD as ID=0 (boot), one as ID=1, and a
SCSI CDROM as ID=5. The vendor of the hardware states that the SCSI
chain is "actively terminated".

SCO Openserver v5.0.2 running the slha driver as shipped, with no
hardware supplement patches applied. The slha driver under the
hardware supplement is identical to the shipped version, regardless.

In the syslog we frequently see:

Jan 14 16:44:32 b0100042 NOTICE: Sdsk: Unrecoverable error writing SCSI disk 0 dev 1/42 (ha=0 id=0 lun=0) block=572342
Jan 14 16:44:32 b0100042
Jan 14 16:44:32 b0100042 WARNING: pendq timeToDie=26023 for rp=C0154C00 on ha=0 id=0 lun=0 tag=08
Jan 14 16:44:32 b0100042 CDB= 0A 06 AF F1 04 00 00 00 00 00
Jan 14 16:44:32 b0100042
Jan 14 16:44:32 b0100042 NOTICE: Sdsk: Unrecoverable error writing SCSI disk 0 dev 1/42 (ha=0 id=0 lun=0) block=137394
Jan 14 16:44:32 b0100042
Jan 14 16:44:32 b0100042 NOTICE: Sdsk: Unrecoverable error writing SCSI disk 0 dev 1/42 (ha=0 id=0 lun=0) block=133874
Jan 14 16:44:32 b0100042 Command aborted: Command sent before previous one was completed
Jan 14 16:44:32 b0100042 Check Condition: Sense Key=0B, ASC/ASCQ=4E 00 on ha=0 id=0 lun=0
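
For anyone reading the CDB and sense lines in the log above, here is a rough Python sketch of how they decode under the standard SCSI-2 layouts. Opcode 0A is WRITE(6), and only the first six bytes of the dump belong to the command. The function names and the two small lookup tables are made up for illustration, and the tables hold only the codes that appear in this log.

```python
# Sketch only: decode the WRITE(6) CDB and the sense data seen in the log.
# Helper names are hypothetical; tables cover just the values in this thread.

SENSE_KEYS = {0x0B: "ABORTED COMMAND"}
ASC_ASCQ = {(0x4E, 0x00): "OVERLAPPED COMMANDS ATTEMPTED"}


def decode_write6(cdb):
    """Decode a 6-byte WRITE(6) command descriptor block."""
    if len(cdb) < 6 or cdb[0] != 0x0A:
        raise ValueError("not a WRITE(6) CDB")
    lun = cdb[1] >> 5                      # SCSI-2 keeps the LUN in the top 3 bits
    lba = ((cdb[1] & 0x1F) << 16) | (cdb[2] << 8) | cdb[3]   # 21-bit block address
    blocks = cdb[4] or 256                 # a length byte of 0 means 256 blocks
    return {"lun": lun, "lba": lba, "blocks": blocks}


def describe_sense(key, asc, ascq):
    """Return a readable string for a sense key and ASC/ASCQ pair."""
    key_text = SENSE_KEYS.get(key, "sense key 0x%02X" % key)
    asc_text = ASC_ASCQ.get((asc, ascq), "ASC/ASCQ %02X %02X" % (asc, ascq))
    return key_text + " / " + asc_text


if __name__ == "__main__":
    print(decode_write6(bytes.fromhex("0A 06 AF F1 04 00")))
    print(describe_sense(0x0B, 0x4E, 0x00))
```

Sense key 0B with ASC/ASCQ 4E 00 is the drive itself reporting overlapped commands, which matches the driver's "Command sent before previous one was completed" message.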

The system locks up tighter than a popcorn fart after enough of these
messages fill the syslog.

Another instance of the problem is a kernel panic that can't dump
because of the scsi bus confusion. The output of the screen, though,
looks like:

IS_SIP & LSS_SGE & ?? intcode=A8 on ha=0 istat=02 sist0=0000000C sist1=00 sstat0=00
UDC ssid=00, start ID=83, current ID=FF
UDC sdid=00

WARNING: issuing BDR Msg to ha=0 id=0 due to 'CFRun: Unexpected Disconnect'
WARNING: Attempting to send_BDR to ha=0 id=0
Unexpected trap in Kernel mode:

cr0 0x8001001B cr2 0x00000095 cr3 0x00002000 tlb 0x00000000
ss 0x0000621C uesp 0xC0158000 efl 0x00010202 ipl 0x00000007
cs 0x00000158 eip 0xF0095F4A err 0x00000002 trap 0x0000000E
eax 0x00000080 ecx 0x02009600 edx 0x0200FC49 ebx 0x00000000
esp 0xE0000ACC ebp 0xE0000B1C esi 0xC0156000 edi 0xC0158000
ds 0x00000660 es 0x00000160 fs 0x00000000 gs 0x00000000
cpu 0x00000001

PANIC: K_trap - Kernel mode trap type 0x0000000E
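
As a reading aid for the dump above: on x86, trap type 0x0E is a page fault, the err field carries the page-fault error code, and cr2 holds the faulting address. A minimal Python sketch of decoding the low three error-code bits follows. The function name is invented for illustration and only the three architecturally defined low bits are handled.

```python
# Sketch only: interpret the low bits of an x86 page-fault error code.
# decode_page_fault is a hypothetical helper name.

def decode_page_fault(err):
    """Decode bits 0-2 of the error code pushed for trap 0x0E."""
    return {
        "cause": "protection violation" if err & 0x1 else "page not present",
        "access": "write" if err & 0x2 else "read",
        "mode": "user" if err & 0x4 else "kernel",
    }


if __name__ == "__main__":
    # err 0x00000002 is the value shown in the panic screen
    print(decode_page_fault(0x02))
```

With err=0x02 and cr2=0x00000095, the panic reads as a kernel-mode write to an unmapped address just above zero, which would be consistent with the driver following a null or stale pointer after the bus reset attempt.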

===

I have confirmed that the bus is configured correctly, no multiple
ID's...I believe termination to be done properly...

Any ideas on where to start? These boxes are completely custom and
ripping one apart is an option to investigate further, but are there
any software related points that immediately come to mind?

Thanks very much!

Bill Vermillion

Jan 18, 2002, 2:57:50 PM
In article <d5848dca.02011...@posting.google.com>,
Pat Pushor <pa...@crossthread.com> wrote:

>I am having a problem with the SCSI bus losing its mind under SCO
>Openserver v5.0.2. First, the hardware:

[details deleted - wjv]

>The system locks up tighter than a popcorn fart after enough of these
>messages fill the syslog.

:-) I'll remember that phrase.

>Another instance of the problem is a kernel panic that can't dump
>because of the scsi bus confusion. The output of the screen, though,
>looks like:

....

>I have confirmed that the bus is configured correctly, no multiple
>ID's...I believe termination to be done properly...

>Any ideas on where to start? These boxes are completely custom and
>ripping one apart is an option to investigate further, but are there
>any software related points that immediately come to mind?

Where to start? With an answer to this question.

The systems are running 5.0.2 which implies they have been running
a very long time.

So it's actually a few questions.

1) Am I correct in assuming they have been running well up until now?

2) Has anyone opened the case for any reason, e.g. to clean the dust
out?

3) Has any new software been added?

4) Has the machine been moved to a new location?

5) Have any significant changes been made to the location, e.g. a new
air conditioner, a new microwave oven nearby, or any major electrical
work?

I've seen all of these in one way or other cause bizarre failures.

2) Opening the case could mean cables are now slightly loose, or have
been moved next to something which either induces a signal into the
cable or lets the cable's signal be induced into the box itself. This
will happen when someone decides to neaten up the system, folds the
cables on themselves and then tapes them onto a metal portion of the
case. The signal gets shorted to the case or the folding induces
errors onto itself.

3) Software - obvious.

4) New location - different interference, things coming loose inside
the case, or poor electricity.

5) Anything like this can cause a crash. The clue with microwaves is
that it often happens at lunchtime, when people turn one on.

Best of luck. There are many things to check.

It could also be just incipient failure of the SCSI controller.

--
Bill Vermillion - bv @ wjv . com

Pat

Jan 18, 2002, 11:32:58 PM
Hi Bill, thanks for the input.

I know what you are saying; I have been all over the scenarios myself. Yes,
the systems have been running for some time. The developers here chalked
the errors up to bad software (written in house) and bad application code,
and everyone thought these would be ironed out in time. Well, the system is
near completion from a software standpoint, and the errors are still around.
Believe it or not, they have been here for a long time.

No local changes or electromagnetic interference, as far as I can tell.

These boxes don't get opened. I think one is about to, though...

"Bill Vermillion" <b...@wjv.com> wrote in message news:Gq5Gs...@wjv.com...
