
aic7xxx: SCSI Bus Reset


Mikael Abrahamsson

Dec 24, 2002, 6:34:16 AM
On Tue, 24 Dec 2002, Shanker Balan wrote:

> Hello:
>
> I am experiencing aic7xxx SCSI reset errors, which happen every so often
> on my NFS server:

I get this too, on different drives from time to time. I get the same errors
with either an AH3950U2 (on a VP6 Abit motherboard) or a 7899-based U160
onboard controller (on my new Tyan 2688 motherboard). I run in LVD mode
with a cable less than a meter long and an LVD-SCA 6-drive backplane
with built-in termination.

My stop usually lasts for 30 seconds and then it clears and continues.

I use Redhat 7.3 as well with the redhat supplied kernels (I've tried
different ones).

--
Mikael Abrahamsson email: swm...@swm.pp.se

-
To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
the body of a message to majo...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

Shanker Balan

Dec 24, 2002, 6:21:59 AM
Hello:

I am experiencing aic7xxx SCSI reset errors, which happen every so often
on my NFS server:

Hardware:

Gigabyte GA-7DPXDw Dual SMP Motherboard with 512MB of RAM
2 AMD Athlon 1900
Adaptec AIC-7892A U160/m (rev 02)
QUANTUM Model: ATLAS10K3_18_SCA
QUANTUM Model: ATLAS10K3_73_SCA
QUANTUM Model: ATLAS10K3_73_SCA

Software:

RedHat Linux 7.3 with all updates
[root@master root]# uname -r
2.4.18-18.7.x

In the hope of solving the problem I tried the following with no
success:

- Non-SMP kernel
- SMP kernel in NOAPIC mode
- Booted an older RedHat kernel
- Replaced SCSI controller
- Replaced SCSI cables
- Swapped SCSI controller slots

Here is a snip from syslog:

Dec 24 16:20:38 master kernel: scsi0:0:0:0: Attempting to queue an ABORT message
Dec 24 16:20:38 master kernel: scsi0:0:0:0: Command found on device queue
Dec 24 16:20:38 master kernel: aic7xxx_abort returns 0x2002
Dec 24 16:20:48 master kernel: scsi0:0:0:0: Attempting to queue an ABORT message
Dec 24 16:20:48 master kernel: scsi0:0:0:0: Command found on device queue
Dec 24 16:20:48 master kernel: aic7xxx_abort returns 0x2002
Dec 24 16:20:54 master kernel: scsi0:0:0:0: Attempting to queue an ABORT message
Dec 24 16:20:54 master kernel: scsi0: Dumping Card State while idle, at SEQADDR 0x9
[...]
Dec 24 16:20:55 master kernel: scsi0:0:0:0: Device is disconnected, re-queuing SCB
Dec 24 16:20:55 master kernel: Recovery code sleeping
Dec 24 16:20:55 master kernel: (scsi0:A:0:0): Abort Tag Message Sent
Dec 24 16:20:55 master kernel: (scsi0:A:0:0): SCB 9 - Abort Tag Completed.
Dec 24 16:20:55 master kernel: Recovery SCB completes
Dec 24 16:20:55 master kernel: Recovery code awake
Dec 24 16:20:55 master kernel: aic7xxx_abort returns 0x2002
Dec 24 16:20:55 master kernel: scsi0:0:0:0: Attempting to queue an ABORT message
Dec 24 16:20:55 master kernel: scsi0:0:0:0: Command not found
Dec 24 16:20:55 master kernel: aic7xxx_abort returns 0x2002
Dec 24 16:20:55 master kernel: scsi0:0:0:0: Attempting to queue an ABORT message
Dec 24 16:20:55 master kernel: scsi0:0:0:0: Command not found

The full log is at http://people.exocore.com/shanu/scsi_reset.log
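
For a longer log, it can help to see how often the driver tries to abort commands on each device and when. The following is only a rough sketch: it assumes the syslog line format shown in the snip above, and the names `ABORT_RE`, `abort_attempts` and `SAMPLE_LOG` are made up for illustration, not part of the driver or any tool.

```python
import re
from collections import defaultdict

# Assumed line format, taken from the syslog snip above:
#   "Dec 24 16:20:38 master kernel: scsi0:0:0:0: Attempting to queue an ABORT message"
ABORT_RE = re.compile(
    r"(\w{3}\s+\d+\s+\d\d:\d\d:\d\d) \S+ kernel: "
    r"(scsi\d+:\d+:\d+:\d+): Attempting to queue an ABORT"
)


def abort_attempts(lines):
    """Map each device (host:channel:id:lun) to the timestamps of its ABORT attempts."""
    attempts = defaultdict(list)
    for line in lines:
        # finditer also copes with two log entries fused onto one line
        for match in ABORT_RE.finditer(line):
            timestamp, device = match.groups()
            attempts[device].append(timestamp)
    return dict(attempts)


# A few lines copied from the snip above
SAMPLE_LOG = [
    "Dec 24 16:20:38 master kernel: scsi0:0:0:0: Attempting to queue an ABORT message",
    "Dec 24 16:20:38 master kernel: scsi0:0:0:0: Command found on device queue",
    "Dec 24 16:20:38 master kernel: aic7xxx_abort returns 0x2002",
    "Dec 24 16:20:48 master kernel: scsi0:0:0:0: Attempting to queue an ABORT message",
    "Dec 24 16:20:54 master kernel: scsi0:0:0:0: Attempting to queue an ABORT message",
]

for device, stamps in abort_attempts(SAMPLE_LOG).items():
    print(device, len(stamps), "abort attempts, first at", stamps[0], "last at", stamps[-1])
```

Running the same function over the full log (for example, over the lines of /var/log/messages) would show whether the aborts cluster on one target or are spread across all drives.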

This goes on for a couple of minutes. Sometimes things come crashing
down immediately, and sometimes the system is still usable after the
SCSI reset.

I have tried to track down the problem by going through the linux-scsi
archives and searching Google Groups, but I still have not been able to
find a solution.

Things I have yet to try:

- Flash motherboard BIOS
- Change motherboard
- Change disks

What I would like to know is how to interpret the SCSI messages so that
I can have a better understanding of the problem and make suitable
changes to the system.

Thank you for your time!


-- Shanu
http://shankerbalan.com/

lspci:

00:08.0 SCSI storage controller: Adaptec AIC-7892A U160/m (rev 02)
Subsystem: Adaptec 29160 Ultra160 SCSI Controller
Flags: bus master, 66Mhz, medium devsel, latency 32, IRQ 11
BIST result: 00
I/O ports at e400 [disabled] [size=256]
Memory at f7100000 (64-bit, non-prefetchable) [size=4K]
Expansion ROM at <unassigned> [disabled] [size=128K]
Capabilities: [dc] Power Management version 2


--
It will be advantageous to cross the great stream ... the Dragon is on
the wing in the Sky ... the Great Man rouses himself to his Work.

Shanker Balan

Dec 24, 2002, 7:09:56 AM
Hello:

Mikael Abrahamsson wrote,


> I get this too, on different drives from time to time. I get the same
> errors with either an AH3950U2 (on a VP6 Abit motherboard) or a 7899-based
> U160 onboard controller (on my new Tyan 2688 motherboard).

I was planning to move from the Gigabyte to a Tyan Tiger MPX S2466.

> I run in LVD mode with a cable less than a meter long and an LVD-SCA
> 6-drive backplane with built-in termination.

Hmm.. I never thought about the termination part as I am using a 4U
SuperMicro rack mount cabinet with hot swappable drive bays.

http://www.supermicro.com/PRODUCT/Chassis/SC742-2.htm

Perhaps I should take my drives out and hook them up directly to the
SCSI cable!



> My stop usually lasts for 30 seconds and then it clears and continues.

My box is the NFS server for a 10 node cluster. A small disruption in
service usually brings everything to a halt. :(

> I use Redhat 7.3 as well with the redhat supplied kernels (I've tried
> different ones).

Sigh! Everyone has the same problem :(

-- Shanu
http://shankerbalan.com/

Justin T. Gibbs

Dec 24, 2002, 10:54:57 AM
> Hello:
>
> I am experiencing aic7xxx SCSI reset errors, which happen every so often
> on my NFS server:
>
> Hardware:
>
> Gigabyte GA-7DPXDw Dual SMP Motherboard with 512MB of RAM
> 2 AMD Athlon 1900
> Adaptec AIC-7892A U160/m (rev 02)
> QUANTUM Model: ATLAS10K3_18_SCA
> QUANTUM Model: ATLAS10K3_73_SCA
> QUANTUM Model: ATLAS10K3_73_SCA

Can you provide the firmware revisions of these drives? Anything
less than B440 is suspect. The behavior you report looks like
the drives stop returning queued I/O. I have seen this before on
at least B120, although it was when running at U320/packetized and
not U160. My guess is that the bug in these early firmware revs
is protocol agnostic.

--
Justin

Justin T. Gibbs

Dec 24, 2002, 1:35:00 PM
> On Tue, 24 Dec 2002, Shanker Balan wrote:
>
>> Hello:
>>
>> I am experiencing aic7xxx SCSI reset errors, which happen every so often
>> on my NFS server:
>
> I get this too.

It is hard to say, from the information you have provided, if your problem
is similar or different from Shanker's. Can you provide drive model
numbers and firmware? A dmesg from your system (you can send it privately)
should have this information. It would also be useful to see the actual
driver messages that are displayed when the failure occurs.

Mikael Abrahamsson

Dec 24, 2002, 2:15:36 PM
On Tue, 24 Dec 2002, Justin T. Gibbs wrote:

> > QUANTUM Model: ATLAS10K3_18_SCA
> > QUANTUM Model: ATLAS10K3_73_SCA
> > QUANTUM Model: ATLAS10K3_73_SCA
>
> Can you provide the firmware revisions of these drives? Anything
> less than B440 is suspect. The behavior you report looks like
> the drives stop returning queued I/O. I have seen this before on
> at least B120, although it was when running at U320/packetized and
> not U160. My guess is that the bug in these early firmware revs
> is protocol agnostic.

Host: scsi0 Channel: 00 Id: 01 Lun: 00
Vendor: QUANTUM Model: ATLAS10K3_36_SCA Rev: 120G
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 02 Lun: 00
Vendor: QUANTUM Model: ATLAS10K3_36_SCA Rev: 120G
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 03 Lun: 00
Vendor: QUANTUM Model: ATLAS10K3_73_SCA Rev: 120G
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 04 Lun: 00
Vendor: QUANTUM Model: ATLAS10K3_73_SCA Rev: 120G
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 05 Lun: 00
Vendor: QUANTUM Model: ATLAS10K3_73_SCA Rev: 120G
Type: Direct-Access ANSI SCSI revision: 03

So I guess I should upgrade my firmware? Drive IDs 3 and 4 are the ones
I have had problems with so far.
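
For anyone who wants to collect the same model and firmware information from several machines, here is a small sketch that pulls the `Rev:` field out of a listing like the one above. It assumes the text layout of /proc/scsi/scsi exactly as pasted here; `parse_scsi_listing` and `SAMPLE` are illustrative names, and the sketch deliberately does not try to decide whether one revision string is older than another.

```python
import re

HOST_RE = re.compile(r"\s*Host:\s*(\S+)\s+Channel:\s*(\d+)\s+Id:\s*(\d+)\s+Lun:\s*(\d+)")
VENDOR_RE = re.compile(r"\s*Vendor:\s*(\S+)\s+Model:\s*(.+?)\s+Rev:\s*(\S+)")


def parse_scsi_listing(text):
    """Return one dict per device with host, id, vendor, model and firmware rev."""
    devices = []
    current = None
    for line in text.splitlines():
        host = HOST_RE.match(line)
        if host:
            # A "Host:" line starts a new device record
            current = {"host": host.group(1), "id": int(host.group(3))}
            devices.append(current)
            continue
        vendor = VENDOR_RE.match(line)
        if vendor and current is not None:
            current["vendor"] = vendor.group(1)
            current["model"] = vendor.group(2)
            current["rev"] = vendor.group(3)
    return devices


# Two entries copied from the listing above
SAMPLE = """\
Host: scsi0 Channel: 00 Id: 01 Lun: 00
Vendor: QUANTUM Model: ATLAS10K3_36_SCA Rev: 120G
Type: Direct-Access ANSI SCSI revision: 03
Host: scsi0 Channel: 00 Id: 03 Lun: 00
Vendor: QUANTUM Model: ATLAS10K3_73_SCA Rev: 120G
Type: Direct-Access ANSI SCSI revision: 03
"""

for dev in parse_scsi_listing(SAMPLE):
    print(dev["host"], "id", dev["id"], dev["model"], "firmware", dev["rev"])
```

On a live system the input would come from reading /proc/scsi/scsi instead of the `SAMPLE` string.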

--
Mikael Abrahamsson email: swm...@swm.pp.se


Mikael Abrahamsson

Dec 25, 2002, 4:46:06 AM
On Tue, 24 Dec 2002, Mikael Abrahamsson wrote:

> Host: scsi0 Channel: 00 Id: 05 Lun: 00
> Vendor: QUANTUM Model: ATLAS10K3_73_SCA Rev: 120G
> Type: Direct-Access ANSI SCSI revision: 03
>
> So I guess I should upgrade my firmware? Drive IDs 3 and 4 are the ones
> I have had problems with so far.

I went into ftp://ftpdownload.maxtor.com/pub/Quantum
Products/Disk_Firmware/Atlas-10KIII/160/ and found what I believe is a
firmware from this year. The weird part is that I can find no references
whatsoever, either on Google or on the Maxtor/Quantum website, to these
firmware files, how to install them, what to be mindful of, etc. Searching
Google or AltaVista turns up absolutely no references to the firmware file
names, the above FTP link, etc.

Do you have any links etc describing the procedure? I can't say I get a
warm tingly feeling from the thought of trying to upgrade the firmware of
my precious drives "in the blind".

Any help appreciated.

Shanker Balan

Dec 26, 2002, 12:34:23 AM
Hello:

Justin T. Gibbs wrote,

> Can you provide the firmware revisions of these drives? Anything less
> than B440 is suspect. The behavior you report looks like the drives
> stop returning queued I/O. I have seen this before on at least B120,
> although it was when running at U320/packetized and not U160. My
> guess is that the bug in these early firmware revs is protocol
> agnostic.

Ok, here is a snip from dmesg:

SCSI subsystem driver Revision: 1.00
scsi0 : Adaptec AIC7XXX EISA/VLB/PCI SCSI HBA DRIVER, Rev 6.2.8
<Adaptec 29160 Ultra160 SCSI adapter>
aic7892: Ultra160 Wide Channel A, SCSI Id=7, 32/253 SCBs
blk: queue c256e218, I/O limit 4095Mb (mask 0xffffffff)
Vendor: QUANTUM Model: ATLAS10K3_18_SCA Rev: 020W
Type: Direct-Access ANSI SCSI revision: 03
blk: queue dfd8be18, I/O limit 4095Mb (mask 0xffffffff)
Vendor: QUANTUM Model: ATLAS10K3_73_SCA Rev: 020W
Type: Direct-Access ANSI SCSI revision: 03
blk: queue dfd8ba18, I/O limit 4095Mb (mask 0xffffffff)
Vendor: QUANTUM Model: ATLAS10K3_73_SCA Rev: 020W
Type: Direct-Access ANSI SCSI revision: 03
blk: queue dfd8b618, I/O limit 4095Mb (mask 0xffffffff)
scsi0:A:0:0: Tagged Queuing enabled. Depth 253
scsi0:A:1:0: Tagged Queuing enabled. Depth 253
scsi0:A:2:0: Tagged Queuing enabled. Depth 253
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
Attached scsi disk sdb at scsi0, channel 0, id 1, lun 0
Attached scsi disk sdc at scsi0, channel 0, id 2, lun 0
(scsi0:A:0): 160.000MB/s transfers (80.000MHz DT, offset 127, 16bit)
SCSI device sda: 35916548 512-byte hdwr sectors (18389 MB)
Partition check:
sda: sda1 sda2 sda3 sda4 < sda5 sda6 sda7 sda8 sda9 >
(scsi0:A:1): 160.000MB/s transfers (80.000MHz DT, offset 127, 16bit)
SCSI device sdb: 143666192 512-byte hdwr sectors (73557 MB)
sdb: sdb1 sdb2 sdb3
(scsi0:A:2): 160.000MB/s transfers (80.000MHz DT, offset 127, 16bit)
SCSI device sdc: 143666192 512-byte hdwr sectors (73557 MB)
sdc: sdc1 sdc2 sdc3

-- Shanu
http://shankerbalan.com/


--
It's a good thing we don't get all the government we pay for.

Shanker Balan

Dec 26, 2002, 2:29:09 AM
Hello:

Justin T. Gibbs wrote,


> It is hard to say, from the information you have provided, if your
> problem is similar or different from Shanker's. Can you provide drive
> model numbers and firmware? A dmesg from your system (you can send it
> privately) should have this information. It would also be useful to
> see the actual driver messages that are displayed when the failure
> occurs.

Is "Tagged Command Queuing" (TCQ) the same as queued I/O? Should I try
disabling TCQ by passing aic7xxx=tag_info:{xx} as a boot-time argument so
that I can narrow the problem down to the firmware?

-- Shanu
