Google Groups no longer supports new Usenet posts or subscriptions. Historical content remains viewable.
Dismiss

Problem with current sa(4) driver

2 views
Skip to first unread message

J Wunsch

unread,
Apr 14, 2001, 2:41:47 PM4/14/01
to
Hi folks,

it seems the latest cleanup in CAMs error handling code (ken's commit
from 2001/03/27) has broken the ability for the sa(4) driver to handle
residuals properly. I first noticed it on a new kernel that the
implied test read for any mt(1) operation broke on my Tandberg TDC4222
drive. The test attempts to read 8 KB from the tape, but tape
blocking was usually 10 or 32 KB, depending on how the supplied
cartridge has been written. The symptom was that the driver continued
to read the entire tape, until it eventually hit EOM. Thus, a `mt
blocksize variable' took usually an hour if a full backup cartridge
was in the drive...

I then noticed that any form of tape block size vs. read(2) buffer
sice mismatch causes this behaviour, also for other drive types (see
below). Since Matt replied that he has no idea why, i got interested
in investigating myself :), and pursued it a little.

Let me first start with my recent mail to Matt that contains a summary
of the code paths involved.

> Since i got hold of an old
>
> # camcontrol inquiry sa1
> pass6: <ARCHIVE Python 28388-XXX 5.AC> Removable Sequential Access SCSI-2 device
> pass6: Serial Number DR0J340
> pass6: 7.812MB/s transfers (7.812MHz, offset 8)
>
> drive as well, i've tried the same procedure there. Similar behaviour
> to the Tandberg TDC4222 drive. As soon as i request a too large block
> from the drive, it starts reading the entire tape (until i reset the
> SCSI bus, or it will eventually encounter EOM).
>
> I've now DDB'ed it, and can tell you at least the following:
>
> sadone() calls saerror(), which will always return ERESTART (== -1),
> thus it loops at this point.
>
> saerror(), when called, bcopy()s the sense data, resid etc. into
> softc, then (most likely) enters the SSD_ILI handling, and eventually
> calls cam_periph_error() to handle the error. I think something
> might be wrong with this.
>
> cam_periph_error() eventually calls camperiphscsisenseerror(), which
> in turn calls scsi_error_action(). err_action will (apparently) be
> set by this to SS_RETRY. In the end, this will cause
> cam_periph_error() to return ERESTART.
>
> I did not investigate yet scsi_error_action() itself, nor examined its
> return value directly (only indirectly based on the code path inside
> cam_periph_error()).

So now, i think it's rather in the domain of the sa(4) driver to
handle an illegal length indication properly by itself, since it's
rather special to handling tapes that a `short read' from the tape
(supplied blocksize to read(2) is larger than logical block size on
the tape) is basically a normal operating condition which is never to
be returned as an EIO, but always to be reported (in the b_resid)
filed to the bio layer.

So my first attempt was to catch this case, and never call
cam_periph_error() in the first place at all, but return 0 from
saerror(), indicating it not actually to be an error at all. Copying
the residual value over to csio->resid has already been properly
handled in saerror(). This procedure solves the initial problem; a
test read with a 32 KB blocksize dd command from a 10 KB blocked tar
tape results in 10240 bytes being read. However, using this method,
saclose() always jams the process waiting in state "cbwait", as a
result of the final call to "saprevent(..., PR_ALLOW)".

Somehow i thought it's necessary to actually call cam_periph_error(),
in order to fetch the check condition from the device. OK, i did so,
and rearranged my modification to first call cam_periph_error(), but
to then ignore its returned ERESTART, and have saerror() return 0.
This reproducibly causes a segmentation violation panic with the
following stack trace:

(gdb) bt
#0 0xc011f683 in heap_down (queue_array=0xc04e66fc, index=-1, num_entries=0)
at ../../cam/cam_queue.c:345
#1 0xc011f390 in camq_remove (queue=0xc0851a44, index=-1)
at ../../cam/cam_queue.c:180
#2 0xc0121f0c in xpt_run_dev_sendq (bus=0xc08229c0)
at ../../cam/cam_queue.h:199
#3 0xc012487f in camisr (V_queue=0xc02bc4b4) at ../../cam/cam_xpt.c:6946
#4 0xc015e857 in ithread_loop (arg=0xc04f1980) at ../../kern/kern_intr.c:517
#5 0xc015da01 in fork_exit (callout=0xc015e618 <ithread_loop>,
arg=0xc04f1980, frame=0xc39c8fa8) at ../../kern/kern_fork.c:731

The error happens because of:

(gdb) up
#1 0xc011f390 in camq_remove (queue=0xc0851a44, index=-1)
at ../../cam/cam_queue.c:180
(gdb) p index
$1 = -2

(I have also observed -1 once.)

cam_pinfo *
camq_remove(struct camq *queue, int index)
{
cam_pinfo *removed_entry;

if (index == 0 || index > queue->entries)
return (NULL);
removed_entry = queue->queue_array[index];
if (queue->entries != index) {
queue->queue_array[index] = queue->queue_array[queue->entries];
queue->queue_array[index]->index = index;
=> heap_down(queue->queue_array, index, queue->entries - 1);
}
removed_entry->index = CAM_UNQUEUED_INDEX;
queue->entries--;
return (removed_entry);
}


Now, is this a bug somewhere in the CAM code, or did i overlook
something in my attempt to fix the sa driver?

I've also reviewed the old (working) code, and it seems that one
always returned 0 from cam_periph_error(), after first decrementing
the retry count:

case SSD_CURRENT_ERROR:
{

switch (sense_key) {
case SSD_KEY_NO_SENSE:
^^^^^^^^^^^^^^^^^^^^^^ <- we do have no sense for an ILI cond.
/* Why were we called then? Well don't bail now */
/* FALLTHROUGH */
case SSD_KEY_EQUAL:
/* These should be filtered by the peripheral drivers */
/* FALLTHROUGH */
case SSD_KEY_MISCOMPARE:
print_sense = FALSE;
/* FALLTHROUGH */
case SSD_KEY_RECOVERED_ERROR:

/* decrement the number of retries */
retry = ccb->ccb_h.retry_count > 0;
if (retry)
ccb->ccb_h.retry_count--;

error = 0;
break;

[I've finally subscribed to this list again, so no need to send
me a personal Cc.]
--
cheers, J"org .-.-. --... ...-- -.. . DL8DTL

http://www.sax.de/~joerg/ NIC: JW11-RIPE
Never trust an operating system you don't have sources for. ;-)

To Unsubscribe: send mail to majo...@FreeBSD.org
with "unsubscribe freebsd-scsi" in the body of the message

Justin T. Gibbs

unread,
Apr 15, 2001, 1:06:39 AM4/15/01
to
>So now, i think it's rather in the domain of the sa(4) driver to
>handle an illegal length indication properly by itself, since it's
>rather special to handling tapes that a `short read' from the tape
>(supplied blocksize to read(2) is larger than logical block size on
>the tape) is basically a normal operating condition which is never to
>be returned as an EIO, but always to be reported (in the b_resid)
>filed to the bio layer.

While it is true that the sa driver should be filtering out this
particular case because there is no error, returning ERESTART for
NO_SENSE is also wrong. You should be able to fix that by changing
the table entry for that sense code in cam_periph.c.

>Somehow i thought it's necessary to actually call cam_periph_error(),
>in order to fetch the check condition from the device.

This would only be the case if a controller either did not support
"auto sense" or the CAM status was "auto sense failed".

>OK, i did so,
>and rearranged my modification to first call cam_periph_error(), but
>to then ignore its returned ERESTART, and have saerror() return 0.
>This reproducibly causes a segmentation violation panic with the
>following stack trace:

ERESTART means the error recovery code has already re-queued the
CCB to retry the operation. By ignoring this code, you are telling
the caller of saerror() to complete the command normally resulting in
an eventual release of this particular ccb back to the free pool.
A ccb can only be doing one thing at a time. 8-)

To understand the symptoms of the panic, look at these definitions
in cam.h:

#define CAM_UNQUEUED_INDEX -1
#define CAM_ACTIVE_INDEX -2
#define CAM_DONEQ_INDEX -3

So, depending on exactly when the CCB was reused, the ccbq was
manipulated, and perhaps the error recovery code's retry completed,
you might see any of these three indexes or a valid index.

>[I've finally subscribed to this list again, so no need to send
>me a personal Cc.]

I can feel a David O'Brien moment coming one. 8-)

--
Justin

J Wunsch

unread,
Apr 18, 2001, 4:54:58 PM4/18/01
to
As Justin T. Gibbs wrote:

> While it is true that the sa driver should be filtering out this
> particular case because there is no error, returning ERESTART for
> NO_SENSE is also wrong. You should be able to fix that by changing
> the table entry for that sense code in cam_periph.c.

You mean, like this?

Index: cam_periph.c
===================================================================
RCS file: /home/ncvs/src/sys/cam/cam_periph.c,v
retrieving revision 1.34
diff -c -r1.34 cam_periph.c
*** cam_periph.c 2001/04/04 18:24:35 1.34
--- cam_periph.c 2001/04/17 17:46:11
***************
*** 1369,1374 ****
--- 1369,1376 ----

switch (err_action & SS_MASK) {
case SS_NOP:
+ error = 0;
+ break;
case SS_RETRY:
action_string = "Retrying Command";
error = ERESTART;

Tried this, it fixes the problem with ILI's, sa(4) now properly
returns a short read. However, it uncovers a new bug that was just
waiting around...

% dd if=/dev/sa0 of=/dev/null bs=10k
dd: /dev/sa0: Input/output error
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1+0 records in
1+0 records out
10240 bytes transferred in 70.992489 secs (144 bytes/sec)

/var/log/messages says:

(sa0:sym0:0:1:0): READ(06). CDB: 8 0 0 28 0 0
(sa0:sym0:0:1:0): CAM Status: SCSI Status Error
(sa0:sym0:0:1:0): SCSI Status: Check Condition
(sa0:sym0:0:1:0): NO SENSE info:2800 asc:0,1
(sa0:sym0:0:1:0): Filemark detected
(sa0:sym0:0:1:0): Retries Exhausted

So now, when hitting the EOM filemark, we get an EIO. Again, this
looks like something where sa(4) should IMHO special-case the error
decision, instead of relying on cam_periph_error() to DTRT (which it
cannot).

I tried to manually patch the return value of cam_periph_error() to 0
in kgdb, but this just gets me back at the second problem:

% ps axl
UID PID PPID CPU PRI NI VSZ RSS WCHAN STAT TT TIME COMMAND
...
107 373 1 0 -8 0 244 33 cbwait DWE p0- 0:00.00 dd if=/dev/sa0 of=/dev/null bs=10

It sits there, and waits indefinately. I'm at a loss here to see why
this happens. :-(

> ERESTART means the error recovery code has already re-queued the
> CCB to retry the operation. By ignoring this code, you are telling
> the caller of saerror() to complete the command normally resulting in
> an eventual release of this particular ccb back to the free pool.

OK, understood, thanks for the explanation!

--
cheers, J"org .-.-. --... ...-- -.. . DL8DTL

http://www.sax.de/~joerg/ NIC: JW11-RIPE
Never trust an operating system you don't have sources for. ;-)

To Unsubscribe: send mail to majo...@FreeBSD.org

0 new messages