aac(4) handling of probe when no devices are there

Alexander Sack

Dec 14, 2009, 4:47:50 PM

to freebsd...@freebsd.org, freebs...@freebsd.org

Hello Again:

I have a technical question/concern on which I am looking for
feedback. During the probe sequence, aac(4) handles the completion of
INQUIRY commands differently depending on the target LUN:

aac_cam.c/aac_cam_complete():
    if (command == INQUIRY) {
        if (ccb->ccb_h.status == CAM_REQ_CMP) {
            device = ccb->csio.data_ptr[0] & 0x1f;
            /*
             * We want DASD and PROC devices to only be
             * visible through the pass device.
             */
            if ((device == T_DIRECT) ||
                (device == T_PROCESSOR) ||
                (sc->flags & AAC_FLAGS_CAM_PASSONLY))
                ccb->csio.data_ptr[0] =
                    ((device & 0xe0) | T_NODEVICE);
        } else if (ccb->ccb_h.status == CAM_SEL_TIMEOUT &&
                   ccb->ccb_h.target_lun != 0) {
            /* fix for INQUIRYs on Lun>0 */
            ccb->ccb_h.status = CAM_DEV_NOT_THERE;
        }
    }
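
To make the second branch easier to reason about, here is a minimal
user-space sketch of it. This is not driver code: the MODEL_* constants
are local stand-ins whose values are assumed from sys/cam/cam.h, and
remap_inquiry_status() is a hypothetical helper that mirrors only the
"else if" branch quoted above.

```c
#include <stdint.h>

/* Local stand-ins for CAM status codes; values assumed from sys/cam/cam.h. */
enum model_cam_status {
    MODEL_CAM_REQ_CMP       = 0x01,
    MODEL_CAM_DEV_NOT_THERE = 0x08,
    MODEL_CAM_SEL_TIMEOUT   = 0x0a
};

/*
 * Hypothetical helper mirroring the quoted else-if branch: a selection
 * timeout on an INQUIRY is rewritten to "device not there" only when the
 * LUN is non-zero. Every other status passes through unchanged.
 */
uint32_t
remap_inquiry_status(uint32_t status, uint32_t target_lun)
{
    if (status == MODEL_CAM_SEL_TIMEOUT && target_lun != 0)
        return MODEL_CAM_DEV_NOT_THERE;
    return status;
}
```

In this model LUN 0 keeps CAM_SEL_TIMEOUT while LUN 1 and above report
CAM_DEV_NOT_THERE, which is the asymmetry the question below is about.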

Why is CAM_DEV_NOT_THERE skipped on LUN 0? This is true on my target
6.1-amd64 machine as well as on CURRENT. I ask because, now that
aac(4) is scanned sequentially, a lot of CAM interrupts come in on my
6.x machine, where the threshold is only 500, and I get the interrupt
storm warning for swi2 pretty quickly:

Interrupt storm detected on "swi2:"; throttling interrupt source

Obviously it's contingent on the number of adapters you have in your
system. On CURRENT I didn't see this because the threshold is double
(I think it's 1000 by default).

The issue is the number of xpt_async(AC_LOST_DEVICE, ..) calls during
the scan. The probe sequence in CURRENT as well as 6.1 handles
CAM_SEL_TIMEOUT a little differently depending on context.

scsi_xpt.c/probedone():
        } else if (cam_periph_error(done_ccb, 0,
                                    done_ccb->ccb_h.target_lun > 0
                                    ? SF_RETRY_UA|SF_QUIET_IR
                                    : SF_RETRY_UA,
                                    &softc->saved_ccb) == ERESTART) {
            return;
        } else if ((done_ccb->ccb_h.status & CAM_DEV_QFRZN) != 0) {
            /* Don't wedge the queue */
            xpt_release_devq(done_ccb->ccb_h.path, /*count*/1,
                             /*run_queue*/TRUE);
        }
        /*
         * If we get to this point, we got an error status back
         * from the inquiry and the error status doesn't require
         * automatically retrying the command. Therefore, the
         * inquiry failed. If we had inquiry information before
         * for this device, but this latest inquiry command failed,
         * the device has probably gone away. If this device isn't
         * already marked unconfigured, notify the peripheral
         * drivers that this device is no more.
         */
        if ((path->device->flags & CAM_DEV_UNCONFIGURED) == 0)
            /* Send the async notification. */
            xpt_async(AC_LOST_DEVICE, path, NULL);

        xpt_release_ccb(done_ccb);
        break;
    }

But cam_periph_error() will issue an xpt_async(AC_LOST_DEVICE, path,
NULL) regardless of whether or not the device has been seen already
(compare the comment above), i.e. on every initial bus scan you will
get into the following (on an aac(4) card with LUN > 0):

cam_periph.c/cam_periph_error():
    case CAM_SEL_TIMEOUT:
    {
        .
        .
        /*
         * Let peripheral drivers know that this device has gone
         * away.
         */
        xpt_async(AC_LOST_DEVICE, newpath, NULL);
        xpt_free_path(newpath);
        break;

Is this really right? This generates a lot of interrupt noise when no
devices are attached during the initial scan, i.e. we are treating
failed INQUIRY commands during the initial scan of the SCSI bus as if
we had really lost a device to a selection timeout (we even generate a
path to issue the async event).
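
The difference between the two excerpts can be written down as a tiny
user-space model. It is a sketch of the concern as stated here and
nothing more: model_lost_device_events() is a hypothetical name, and
the model ignores retries, ERESTART and everything elided by the dots
in the cam_periph_error() excerpt.

```c
#include <stdbool.h>

/*
 * Hypothetical model: how many AC_LOST_DEVICE notifications the two
 * quoted excerpts would produce for one failed INQUIRY.
 */
unsigned
model_lost_device_events(bool sel_timeout, bool dev_unconfigured)
{
    unsigned events = 0;

    /* cam_periph_error() excerpt: the CAM_SEL_TIMEOUT case notifies
     * without looking at the device flags. */
    if (sel_timeout)
        events++;

    /* probedone() excerpt: notifies only when the device is not
     * already marked unconfigured. */
    if (!dev_unconfigured)
        events++;

    return events;
}
```

Under these assumptions a never-seen (unconfigured) device that fails
with a selection timeout still yields one notification, while the same
device failing with any other status yields none.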

Obviously, if aac(4) returned CAM_DEV_NOT_THERE you would avoid this,
but there is some history here and I've yet to fully grasp the intent
of the original fix for LUNs greater than zero. What was the problem?

Comments/thoughts appreciated.

Thanks!

-aps
_______________________________________________
freebsd...@freebsd.org mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-current
To unsubscribe, send any mail to "freebsd-curre...@freebsd.org"

Alexander Sack

Dec 14, 2009, 5:09:40 PM

to freebsd...@freebsd.org, freebs...@freebsd.org

I should have titled the thread better, but basically we always
generate a ton of software CAM interrupts during a LUN scan for
targets on aac(4) that do not really exist (i.e. nothing is truly
there). We do this because we treat the failure of the initial INQUIRY
as a selection timeout rather than as the device not being there at
all. There seems to be a historical workaround for part of this issue,
but I am trying to delve deeper in order to do the *right thing* for
our 6.1 deployments (as well as 7.x and CURRENT).

Scott Long

Dec 15, 2009, 4:55:24 AM

to Alexander Sack, freebs...@freebsd.org, freebsd...@freebsd.org

In the parallel SCSI world, a selection timeout means that all LUNs
within the entire target do not (or no longer) exist. So returning
CAM_SEL_TIMEOUT for LUN 1 would tell CAM to invalidate LUN 0 as well.

If you look higher up in this function, you'll see a note about the
error/status codes from the AAC firmware coincidentally matching CAM's
status codes. My guess is that somewhere along the line, someone at
Adaptec stopped reading the SCSI spec and started returning
CAM_SEL_TIMEOUT for LUNs greater than 0, which is why this work-around
is now in the driver.
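
As a rough illustration of that rule, here is a toy user-space model.
It is only a sketch of the semantics described above, not of CAM's
real data structures: struct model_target, MODEL_MAX_LUNS and
model_apply_status() are all hypothetical names.

```c
#include <stdbool.h>
#include <stddef.h>

#define MODEL_MAX_LUNS 8    /* illustrative size, not taken from the driver */

enum model_status { MODEL_OK, MODEL_DEV_NOT_THERE, MODEL_SEL_TIMEOUT };

struct model_target {
    bool lun_present[MODEL_MAX_LUNS];
};

/* Apply the result of probing one LUN to the toy target state. */
void
model_apply_status(struct model_target *t, size_t lun, enum model_status st)
{
    size_t i;

    if (t == NULL || lun >= MODEL_MAX_LUNS)
        return;

    switch (st) {
    case MODEL_SEL_TIMEOUT:
        /* Nobody answered selection: the whole target is gone. */
        for (i = 0; i < MODEL_MAX_LUNS; i++)
            t->lun_present[i] = false;
        break;
    case MODEL_DEV_NOT_THERE:
        /* Only the probed LUN is absent. */
        t->lun_present[lun] = false;
        break;
    case MODEL_OK:
        t->lun_present[lun] = true;
        break;
    }
}
```

In this model, a selection timeout reported for LUN 1 also clears
LUN 0, whereas "device not there" for LUN 1 leaves LUN 0 alone. That is
the effect the driver work-around avoids for LUNs greater than 0.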

> This is true on my target
> 6.1-amd64 machine as well as CURRENT. The reason why I ask this is
> because now that aac(4) is sequential scanned, there are a lot of cam
> interrupts that come in on my 6.x machine where the threshold is only
> 500 and I get the interrupt storm threshold warning for swi2 pretty
> quickly:
>
> Interrupt storm detected on "swi2:"; throttling interrupt source
>
> Obviously its contingent on the number of adapters you have on your
> system. On CURRENT I didn't see this because the threshold is double
> (I think its a 1000 by default).
>
> The issue is the number of xpt_async(AC_LOST_DEVICE, ..) calls during
> the scan. The probe sequence in CURRENT as well as 6.1 handles
> CAM_SEL_TIMEOUT a little differently depending on context.
>

It's not at all clear to me what is going on here. Can you instrument
the code to record the status of everything that is being issued to
the aac_cam module?

Scott

Alexander Sack

Dec 16, 2009, 12:11:50 PM

to Scott Long, freebs...@freebsd.org, freebsd...@freebsd.org

Interesting. You learn something every day. I did not know that a
selection timeout on a non-zero LUN meant that no other LUN was
available. As a colleague noted, "Has Adaptec ever read the SCSI
spec?" Just kidding (somewhat)....

>> This is true on my target
>> 6.1-amd64 machine as well as CURRENT. The reason why I ask this is
>> because now that aac(4) is sequential scanned, there are a lot of cam
>> interrupts that come in on my 6.x machine where the threshold is only
>> 500 and I get the interrupt storm threshold warning for swi2 pretty
>> quickly:
>>
>> Interrupt storm detected on "swi2:"; throttling interrupt source
>>
>> Obviously its contingent on the number of adapters you have on your
>> system. On CURRENT I didn't see this because the threshold is double
>> (I think its a 1000 by default).
>>
>> The issue is the number of xpt_async(AC_LOST_DEVICE, ..) calls during
>> the scan. The probe sequence in CURRENT as well as 6.1 handles
>> CAM_SEL_TIMEOUT a little differently depending on context.

Yeah, I spoke too soon. I think that is a red herring, though, and a
misinterpretation on my part of what that code was really doing (in
this case just seeing the device as unconfigured and moving on).

But I STILL don't understand why it's treated as an AC_LOST_DEVICE
event at scan time. It seems like more overhead than really necessary,
but perhaps I am not thinking of all the possibilities down this code
path: why create a path, then call xpt_async(), all to just set the
flag as unconfigured? Perhaps it's more aligned with the model than
anything else and I'm reading too much into it.

> It's not at all clear to me what is going on here. Can you instrument the
> code to record the status of everything that is being issued to the aac_cam
> module?

Yes, surely. I think what might be happening is that after the
INQUIRY fails, xpt_release_ccb() is called, which I think also checks
whether any more CCBs should be sent to the device and sends them.
Basically, the boot -v output shows that I am getting a
CAM_SEL_TIMEOUT for each target and just hitting the default
interrupt storm threshold of 500 on 6.1.

Let me investigate further... I'm on the right track, but I need to
instrument more... Scott, it's my first time playing with CAM (be
gentle). :D

-aps

Alexander Sack

Dec 19, 2009, 6:36:47 PM

to Scott Long, freebs...@freebsd.org, freebsd...@freebsd.org

Sorry for the delay. It's the holidays and there is a bunch of stuff
going on.

Alright, honestly, it looks like everything is FINE apart from the
fact that camisr() is going to get called a lot given the number of
buses and targets on the system. I instrumented aac_cam_action() and
it appears to be normal:

mfid0: <MFI Logical Disk> on mfi0
mfid0: 238418MB (488280064 sectors) RAID
aacd0: <RAID 5> on aac0
aacd0: 9533430MB (19524464640 sectors)
aacd1: <RAID 5> on aac1
aacd1: 9533430MB (19524464640 sectors)
GEOM: new disk mfid0
GEOM: new disk aacd0
GEOM: new disk aacd1
GEOM_LABEL: Label for provider mfid0 is label/disk0.
GEOM_LABEL: Label for provider aacd0 is label/disk1.
GEOM_LABEL: Label for provider aacd1 is label/disk2.
(probe5:aacp5:0:0:0): Request completed with CAM_REQ_CMP_ERR
(probe5:aacp5:0:0:0): Retrying Command
(probe2:aacp2:0:0:0): Request completed with CAM_REQ_CMP_ERR
(probe2:aacp2:0:0:0): Retrying Command
(probe5:aacp5:0:0:0): Request completed with CAM_REQ_CMP_ERR
(probe5:aacp5:0:0:0): Retrying Command
(probe2:aacp2:0:0:0): Request completed with CAM_REQ_CMP_ERR
(probe2:aacp2:0:0:0): Retrying Command
(probe5:aacp5:0:0:0): Request completed with CAM_REQ_CMP_ERR
(probe5:aacp5:0:0:0): Retrying Command
(probe2:aacp2:0:0:0): Request completed with CAM_REQ_CMP_ERR
(probe2:aacp2:0:0:0): Retrying Command
(probe0:aacp3:0:8:1): error 22
(probe0:aacp3:0:8:1): Unretryable Error
(probe5:aacp5:0:0:0): Request completed with CAM_REQ_CMP_ERR
(probe5:aacp5:0:0:0): Retrying Command
(probe4:aacp0:0:8:1): error 22
(probe4:aacp0:0:8:1): Unretryable Error
(probe2:aacp2:0:0:0): Request completed with CAM_REQ_CMP_ERR
(probe2:aacp2:0:0:0): Retrying Command
(probe0:aacp3:0:8:2): error 22
(probe0:aacp3:0:8:2): Unretryable Error
(probe5:aacp5:0:0:0): Request completed with CAM_REQ_CMP_ERR
(probe5:aacp5:0:0:0): error 5
(probe5:aacp5:0:0:0): Retries Exausted
(probe4:aacp0:0:8:2): error 22
(probe4:aacp0:0:8:2): Unretryable Error
(probe2:aacp2:0:0:0): Request completed with CAM_REQ_CMP_ERR
(probe2:aacp2:0:0:0): error 5
(probe2:aacp2:0:0:0): Retries Exausted
(probe0:aacp3:0:8:3): error 22
(probe0:aacp3:0:8:3): Unretryable Error
.
.
.
(probe0:aacp0:0:19:3): Unretryable Error
(probe2:aacp3:0:19:2): error 22
(probe2:aacp3:0:19:2): Unretryable Error
(probe1:aacp0:0:19:4): error 22
(probe1:aacp0:0:19:4): Unretryable Error
(probe2:aacp3:0:19:3): error 22
(probe2:aacp3:0:19:3): Unretryable Error
(probe0:aacp0:0:19:5): error 22
(probe0:aacp0:0:19:5): Unretryable Error
(probe2:aacp3:0:19:4): error 22
(probe2:aacp3:0:19:4): Unretryable Error
(probe1:aacp0:0:19:6): error 22
(probe1:aacp0:0:19:6): Unretryable Error
(probe2:aacp3:0:19:5): error 22
(probe2:aacp3:0:19:5): Unretryable Error
(probe0:aacp0:0:19:7): error 22
(probe0:aacp0:0:19:7): Unretryable Error
(probe2:aacp3:0:19:6): error 22
(probe2:aacp3:0:19:6): Unretryable Error
(probe2:aacp3:0:19:7): error 22
(probe2:aacp3:0:19:7): Unretryable Error

Interrupt storm detected on "swi2:"; throttling interrupt source

If you look at aac_cam_action()/aac_cam_complete() via KTR:

5:0:1 CAM_DEV_NOT_THERE
144 5:0:1 op: 12

143 4:17:0 0xa
142 4:17:0 op: 12

141 3:8:4 CAM_DEV_NOT_THERE
140 3:8:4 op: 12

139 0:8:3 CAM_DEV_NOT_THERE
138 0:8:3 op: 12

137 1:13:0 0xa
136 1:13:0 op: 12

135 2:0:0 op: 0

134 4:16:0 0xa
133 4:16:0 op: 12

132 3:8:3 CAM_DEV_NOT_THERE
131 3:8:3 op: 12

130 5:0:0 op: 0

129 2:0:0 0x84
128 0:8:2 CAM_DEV_NOT_THERE
127 1:12:0 0xa
126 0:8:2 op: 12

125 1:12:0 op: 12

124 5:0:0 0x84
123 4:15:0 0xa
122 3:8:2 CAM_DEV_NOT_THERE
121 2:0:0 op: 12

120 4:15:0 op: 12

119 3:8:2 op: 12

118 5:0:0 op: 12

117 2:0:0 0x84
116 0:8:1 CAM_DEV_NOT_THERE
115 1:11:0 0xa
114 5:0:0 0x84
113 4:14:0 0xa
112 3:8:1 CAM_DEV_NOT_THERE
111 0:8:1 op: 12

110 1:11:0 op: 12

109 4:14:0 op: 12

108 3:8:1 op: 12

107 2:0:0 op: 12

106 2:0:0 0x84
105 1:10:0 0xa
104 5:0:0 op: 12

103 4:13:0 0xa
102 4:13:0 op: 12

101 5:0:0 0x84
100 4:12:0 0xa
99 0:8:0 op: 0

98 1:10:0 op: 12

97 4:12:0 op: 12

96 2:0:0 op: 12

95 3:8:0 op: 0

94 2:0:0 0x84
93 1:9:0 0xa
92 5:0:0 op: 12

91 4:11:0 0xa
90 4:11:0 op: 12

89 5:0:0 0x84
88 4:10:0 0xa
87 0:8:0 op: 12

86 1:9:0 op: 12

85 4:10:0 op: 12

84 3:8:0 op: 12

83 2:0:0 op: 12

82 0:8:0 op: 1a

81 5:0:0 op: 12

80 2:0:0 0x84
79 1:8:0 0xa
78 4:9:0 0xa
77 4:9:0 op: 12

76 5:0:0 0x84
75 4:8:0 0xa
74 3:8:0 op: 1a

73 1:8:0 op: 12

72 0:8:0 op: 12

71 4:8:0 op: 12

70 3:8:0 op: 12

69 1:7:0 0xa
68 1:7:0 op: 12

67 0:7:0 0xa
66 0:7:0 op: 12

65 4:7:0 0xa
64 4:7:0 op: 12

63 3:7:0 0xa
62 3:7:0 op: 12

61 1:6:0 0xa
60 1:6:0 op: 12

59 0:6:0 0xa
58 0:6:0 op: 12

57 4:6:0 0xa
56 4:6:0 op: 12

55 3:6:0 0xa
54 3:6:0 op: 12

53 1:5:0 0xa
52 1:5:0 op: 12

51 0:5:0 0xa
50 0:5:0 op: 12

49 4:5:0 0xa
48 4:5:0 op: 12

47 3:5:0 0xa
46 3:5:0 op: 12

45 1:4:0 0xa
44 1:4:0 op: 12

43 0:4:0 0xa
42 0:4:0 op: 12

41 4:4:0 0xa
40 2:0:0 op: 12

39 5:0:0 op: 12

38 4:4:0 op: 12

37 3:4:0 0xa
36 3:4:0 op: 12

35 1:3:0 0xa
34 1:3:0 op: 12

33 0:3:0 0xa
32 5:0:0 op: 12

31 2:0:0 op: 12

30 0:3:0 op: 12

29 4:3:0 0xa
28 4:3:0 op: 12

27 3:3:0 0xa
26 3:3:0 op: 12

25 1:2:0 0xa
24 1:2:0 op: 12

23 0:2:0 0xa
22 0:2:0 op: 12

21 4:2:0 0xa
20 4:2:0 op: 12

19 3:2:0 0xa
18 3:2:0 op: 12

17 1:1:0 0xa
16 0:1:0 0xa
15 4:1:0 0xa
14 3:1:0 0xa
13 1:1:0 op: 12

12 4:1:0 op: 12

11 3:1:0 op: 12

10 0:1:0 op: 12

9 1:0:0 0xa
8 4:0:0 0xa
7 3:0:0 0xa
6 0:0:0 0xa
5 5:0:0 op: 12

4 4:0:0 op: 12

3 3:0:0 op: 12

2 2:0:0 op: 12

1 1:0:0 op: 12

0 0:0:0 op: 12

With a limit of 500 on 6.1, it was easy to hit this threshold. I
assumed before that the CAM_REQ_CMP_ERRs were CAM_SEL_TIMEOUTs and
confused the issues. Sorry for that, it's been a long week dealing
with several issues.
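
For anyone reading the raw status words in the trace, a small decoder
helps keep the two cases apart. This is a user-space sketch, not
kernel code: the mask and flag values are assumed from sys/cam/cam.h
and should be checked against your tree, and the model_* names are
hypothetical.

```c
#include <stdint.h>

/* Assumed values from sys/cam/cam.h; verify against your source tree. */
#define MODEL_CAM_STATUS_MASK   0x3fu
#define MODEL_CAM_AUTOSNS_VALID 0x80u

/* Map the low status bits of a raw CCB status word to a name. */
const char *
model_cam_status_name(uint32_t raw)
{
    switch (raw & MODEL_CAM_STATUS_MASK) {
    case 0x01: return "CAM_REQ_CMP";
    case 0x04: return "CAM_REQ_CMP_ERR";
    case 0x08: return "CAM_DEV_NOT_THERE";
    case 0x0a: return "CAM_SEL_TIMEOUT";
    default:   return "other";
    }
}

/* Report whether the autosense-valid flag bit is set in the raw word. */
int
model_cam_autosense_valid(uint32_t raw)
{
    return (raw & MODEL_CAM_AUTOSNS_VALID) != 0;
}
```

Under these assumed values, 0xa decodes as CAM_SEL_TIMEOUT and 0x84 as
CAM_REQ_CMP_ERR with the autosense flag set. That would match the boot
log, where only 2:0:0 and 5:0:0 report CAM_REQ_CMP_ERR. If the "op:"
values are SCSI opcodes printed in hex, 12 is INQUIRY, 0 is TEST UNIT
READY and 1a is MODE SENSE(6).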

With enough cards, you will even hit the 1000 limit in CURRENT (though
I admit it's not really that serious). Right?