
Avago LSI SAS 3008 & Intel SSD Timeouts


list-news

Jun 3, 2016, 4:37:17 PM
Greetings!

For the last month I've been fighting SAS 3008 issues on a
SuperMicro 2028TP-DECR (2-node Twin server). The system is running
FreeBSD 10.3-RELEASE #1 r298749.

The errors we see on this system look very similar to the errors posted
here back in March:
https://lists.freebsd.org/pipermail/freebsd-scsi/2016-March/007000.html

The system hosts a large PostgreSQL database and an application which
spends most of its days calculating log entries on a couple hundred GB
of tables. Initially this system had 2 mirrored WD HDD boot drives and
8 Intel 730 240GB SSDs in a ZFS pool with 4 mirrored vdevs.
Shortly after install/setup, a couple of those Intel 730s began showing
READ and WRITE errors in zpool status; additionally, mpr0 messages were
showing drive timeouts when attempting READ(10), WRITE(10)
and SYNCHRONIZE CACHE(10), such as this:

Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 15
5a 18 c0 00 00 08 00 length 4096 SMID 240 command timeout cm
0xfffffe0000cefb00 ccb 0xfffff805dfeca800 target 16, handle(0x0011)
Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
connector name ( )
Jun 2 19:48:42 s18 kernel: mpr0: timedout cm 0xfffffe0000cefb00
allocated tm 0xfffffe0000cdd3b0
Jun 2 19:48:42 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 15
Aborting command 0xfffffe0000cefb00
Jun 2 19:48:42 s18 kernel: mpr0: Sending reset from mprsas_send_abort
for target ID 16
[...snipped about 50 more lines for brevity]

The immediate thought was that the drives were faulty, so we decided to
replace some of the drives with Intel 480GB DC S3610s. After replacing
4 of the drives we noticed the 3610s were also throwing errors.
Read/write error counts displayed by zpool status would climb into the
thousands over a night of calculations, with dozens of mpr errors in
the logs. This was happening to all of the drives in the array; some
were more likely than others, but given enough time, all drives would
display the errors. This raised suspicions that the problem was
backplane and/or controller related and the drives were fine (or
possibly the drive firmware is the same in the 730s and 3610s and both
exhibit this problem, but the backplane did seem more likely at the time).

We moved the Node A system sled (cpu/mem/sas/mb) into the Node B chassis
slot to see if the errors would change at all (thinking some SATA ports
on the backplane might be bad). We hit a bit of luck and noticed that
the boot drives, when inserted into Node B drive sleds 1 & 2, wouldn't
fire up and were undetectable by the SAS controller. SuperMicro
concluded this was likely a backplane issue and the backplane was
replaced. With the new backplane in, the drives in Node B sleds 1 / 2
are now visible to the controller during boot, confirming the backplane
did have a problem. (This is important to mention, as a commonly
suggested fix on the internet for similar timeout errors is to replace
the backplane and/or cables.)

Unfortunately, the luck ran out: once we put the system back under load,
the errors came back. I rebuilt the pool using just four 3610s as 2
mirrored vdevs and copied all the data over. Here is the pool as of
yesterday:

  pool: zp1
 state: ONLINE
  scan: resilvered 107M in 0h0m with 0 errors on Thu Jun 2 14:52:43 2016
config:

        NAME        STATE     READ WRITE CKSUM
        zp1         ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            da3     ONLINE       0     0     0
            da5     ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            da7     ONLINE       0     0     0
            da8     ONLINE       0     0     0

When performing logging calculations, the system looks like this pretty
much continuously (via gstat):

L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
7 8175 1374 7526 0.5 3000 48931 0.5 65.2| da3
6 8337 1542 8025 0.5 2995 48947 0.5 68.1| da5
2 8740 1529 8584 0.5 3138 52564 0.4 72.6| da7
3 8637 1427 7961 0.5 3136 52564 0.4 65.6| da8

And zpool iostat:

                capacity     operations    bandwidth
pool         alloc   free   read  write   read  write
zp1           414G   474G  10.9K  8.94K  50.2M  51.8M
  mirror      207G   237G  5.44K  4.56K  25.0M  26.2M
    da3          -      -  2.50K  2.37K  12.3M  26.2M
    da5          -      -  2.65K  2.37K  13.3M  26.1M
  mirror      207G   237G  5.42K  4.39K  25.3M  25.6M
    da7          -      -  2.62K  2.40K  13.2M  25.6M
    da8          -      -  2.57K  2.40K  12.5M  25.6M

After a couple minutes (usually between 2 and 10) with drive %busy and a
cpu load around 70%, the system hangs.

During the hang, one of two scenarios plays out (possibly more, but these
are the ones officially witnessed):

Scenario 1) Processes continue to function as long as they don't touch
the storage. One drive immediately drops to 0 r/s and 0 w/s, the rest of
the drives quickly fall to 0 r/s and 0 w/s as well (within maybe 1 - 5
seconds), and the activity lights on the chassis for those drives go
dark. Any commands I enter in the terminal hang after the newline.

During Scenario 1 - gstat output (notice the L(q) column for da7):
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
0 0 0 0 0.0 0 0 0.0 0.0| da3
0 0 0 0 0.0 0 0 0.0 0.0| da5
8 0 0 0 0.0 0 0 0.0 0.0| da7
0 0 0 0 0.0 0 0 0.0 0.0| da8


Scenario 2) Some processes continue to function in a read-only state,
but all writes for all drives fall to 0.

During Scenario 2 - gstat output (again notice the L(q) column for da3):
dT: 1.002s w: 1.000s
L(q) ops/s r/s kBps ms/r w/s kBps ms/w %busy Name
69 0 0 0 0.0 0 0 0.0 0.0| da3
0 3207 3207 45727 0.2 0 0 0.0 37.8| da5
0 2753 2753 24673 0.2 0 0 0.0 23.1| da7
0 3027 3027 26425 0.2 0 0 0.0 28.6| da8

About 60 seconds later the system becomes responsive again (I assume
kern.cam.da.default_timeout is the 60 second timer, although I haven't
changed it to confirm), and a bunch of error logs appear for the drive
that had a number > 0 in the L(q) column (da7 in scenario 1 last night,
da3 in scenario 2 today). Any commands I typed in the terminal finally
fire off, and the cpu and disk usage quickly climb back to 70-80%+. If
the errors are bad enough (or possibly if two drives hang
simultaneously?) zpool status will show read and/or write errors.
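(For what it's worth, that assumption is easy to check without changing
anything; the second line below is just a hypothetical example of
bumping the value while debugging, not something I've actually done
here:)

# sysctl kern.cam.da.default_timeout
# sysctl kern.cam.da.default_timeout=120

I'd expect the first one to report 60 if my guess about the timer is right.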

Failed attempts to repair: replaced the backplane; ran full memory
tests; upgraded the 3610s to the latest firmware; used Node B in place
of Node A; flashed the controller with the latest IT firmware from
SuperMicro; swapped power supplies; pulled one proc and half the memory
off Node A, put that into Node B, and tried the tests again; set
kern.cam.da.X.delete_method=DISABLE; and probably tuned a couple dozen
other cam and zfs sysctl/loader variables in testing.

Nothing stopped these errors.

Until I ran this:
# camcontrol tags daX -N 1
(where X is 3,5,7 and 8)
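(In case it's useful to anyone else, a rough sketch of applying and
verifying the workaround across the pool members; the device list here
is specific to my pool and would need adjusting:)

# for d in da3 da5 da7 da8; do camcontrol tags $d -N 1; done
# for d in da3 da5 da7 da8; do camcontrol tags $d; done

The second loop just reports the current number of openings per device,
so you can confirm the change took.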

With tags set to 1 for each device, all errors for the last 18 hours
have ceased (system ran all night at load). I've not spent the time yet
to determine how much that affects performance, but it is definitely
working.

To reproduce this bug, today I ran:
# camcontrol tags daX -N 8

on all drives. A few minutes later, a drive hangs (see the gstat output
for da3 above in scenario 2, which I had captured just seconds before
the errors that follow):

Jun 3 13:34:10 s18 kernel: (da3:mpr0:0:12:0): READ(10). CDB: 28 00 09
cc 97 30 00 00 08 00 length 4096 SMID 961 command timeout cm
0xfffffe0000d2ad50 ccb 0xfffff807bfcef800 target 12, handle(0x000d)
Jun 3 13:34:10 s18 kernel: mpr0: At enclosure level 0, slot 4,
connector name ( )
Jun 3 13:34:10 s18 kernel: mpr0: timedout cm 0xfffffe0000d2ad50
allocated tm 0xfffffe0000cdc150
Jun 3 13:34:10 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1
Aborting command 0xfffffe0000d2ad50
Jun 3 13:34:10 s18 kernel: mpr0: Sending reset from mprsas_send_abort
for target ID 12
[...snipped about 50 more lines for brevity]

With 1 tag, gstat still shows L(q) values that jump into the 70s at
times, but from what I understand of the source, devstat calculates
queued operations separately from camcontrol's simultaneous
dev_openings. So I assume that's nothing to be concerned about, but I
figured I'd mention it.

# camcontrol tags da7 -v

(pass7:mpr0:0:16:0): dev_openings 0
(pass7:mpr0:0:16:0): dev_active 1
(pass7:mpr0:0:16:0): allocated 1
(pass7:mpr0:0:16:0): queued 0
(pass7:mpr0:0:16:0): held 0
(pass7:mpr0:0:16:0): mintags 2
(pass7:mpr0:0:16:0): maxtags 255

Here is a full error message from last night on da7 (the gstat scenario 1
output above was copied right before these lines). Apologies for the
verbosity, but it may be helpful to some; this was logged with
dev.mpr.0.debug_level=0x037F:

Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 15
5a 18 c0 00 00 08 00 length 4096 SMID 240 command timeout cm
0xfffffe0000cefb00 ccb 0xfffff805dfeca800 target 16, handle(0x0011)
Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
connector name ( )
Jun 2 19:48:42 s18 kernel: mpr0: timedout cm 0xfffffe0000cefb00
allocated tm 0xfffffe0000cdd3b0
Jun 2 19:48:42 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 15
Aborting command 0xfffffe0000cefb00
Jun 2 19:48:42 s18 kernel: mpr0: Sending reset from mprsas_send_abort
for target ID 16
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 14
24 35 e0 00 00 08 00 length 4096 SMID 562 command timeout cm
0xfffffe0000d0a1a0 ccb 0xfffff80daeb84000 target 16, handle(0x0011)
Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
connector name ( )
Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm 0xfffffe0000d0a1a0
for processing by tm 0xfffffe0000cdd3b0
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): SYNCHRONIZE CACHE(10).
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 810 command timeout cm
0xfffffe0000d1e720 ccb 0xfffff80628c3d800 target 16, handle(0x0011)
Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
connector name ( )
Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm 0xfffffe0000d1e720
for processing by tm 0xfffffe0000cdd3b0
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13
cd fc 98 00 00 08 00 length 4096 SMID 571 command timeout cm
0xfffffe0000d0ad70 ccb 0xfffff8018a54b800 target 16, handle(0x0011)
Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
connector name ( )
Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm 0xfffffe0000d0ad70
for processing by tm 0xfffffe0000cdd3b0
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13
cd fc 90 00 00 08 00 length 4096 SMID 540 command timeout cm
0xfffffe0000d084c0 ccb 0xfffff80288e58000 target 16, handle(0x0011)
Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
connector name ( )
Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm 0xfffffe0000d084c0
for processing by tm 0xfffffe0000cdd3b0
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13
cd fc 88 00 00 08 00 length 4096 SMID 817 command timeout cm
0xfffffe0000d1f050 ccb 0xfffff8004c3ef800 target 16, handle(0x0011)
Jun 2 19:48:42 s18 kernel: mpr0: At enclosure level 0, slot 8,
connector name ( )
Jun 2 19:48:42 s18 kernel: mpr0: queued timedout cm 0xfffffe0000d1f050
for processing by tm 0xfffffe0000cdd3b0
Jun 2 19:48:42 s18 kernel: mpr0: EventReply :
Jun 2 19:48:42 s18 kernel: EventDataLength: 2
Jun 2 19:48:42 s18 kernel: AckRequired: 0
Jun 2 19:48:42 s18 kernel: Event: SasDiscovery (0x16)
Jun 2 19:48:42 s18 kernel: EventContext: 0x0
Jun 2 19:48:42 s18 kernel: Flags: 1<InProgress>
Jun 2 19:48:42 s18 kernel: ReasonCode: Discovery Started
Jun 2 19:48:42 s18 kernel: PhysicalPort: 0
Jun 2 19:48:42 s18 kernel: DiscoveryStatus: 0
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): WRITE(10). CDB: 2a 00 11
8d fe 50 00 00 10 00 length 8192 SMID 186 completed cm
0xfffffe0000ceb420 ccb 0xfffff80cd1d88800 during recovery ioc 804b scsi
0 state c xfer 0
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): WRITE(10). CDB: 2a 00 11
8d fe 50 00 00 10 00 length 8192 SMID 186 terminated ioc 804b scsi 0
state c xfer 0
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13
cd fc 98 00 00 08 00 length 4096 SMID 571 completed timedout cm
0xfffffe0000d0ad70 ccb 0xfffff8018a54b800 during recovery ioc 804b scsi
0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc 98 00
00 08 00 length 4096 SMID 571 terminated ioc 804b scsi 0 state c xfer 0
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13
cd fc 90 00 00 08 00 length 4096 SMID 540 completed timedout cm
0xfffffe0000d084c0 ccb 0xfffff80288e58000 during recovery ioc 804b scsi
0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc 90 00
00 08 00 length 4096 SMID 540 terminated ioc 804b scsi 0 state c xfer 0
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13
cd fc 88 00 00 08 00 length 4096 SMID 817 completed timedout cm
0xfffffe0000d1f050 ccb 0xfffff8004c3ef800 during recovery ioc 804b scsi
0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 13 cd fc 88 00
00 08 00 length 4096 SMID 817 terminated ioc 804b scsi 0 state c xfer 0
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 15
5a 18 c0 00 00 08 00 length 4096 SMID 240 completed timedout cm
0xfffffe0000cefb00 ccb 0xfffff805dfeca800 during recovery ioc 8048 scsi
0 state c xf (da7:mpr0:0:16:0): READ(10). CDB: 28 00 14 24 35 e0 00
00 08 00 length 4096 SMID 562 completed timedout cm 0xfffffe0000d0a1a0
ccb 0xfffff80daeb84000 during recovery ioc 804b scsi 0 state c xf
(da7:mpr0:0:16:0): READ(10). CDB: 28 00 14 24 35 e0 00 00 08 00 length
4096 SMID 562 terminated ioc 804b scsi 0 state c xfer 0
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): SYNCHRONIZE CACHE(10).
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 810 completed timedout
cm 0xfffffe0000d1e720 ccb 0xfffff80628c3d800 during recovery ioc 804b
scsi 0
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): SYNCHRONIZE CACHE(10).
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 810 terminated ioc 804b
scsi 0 state c xfer 0
Jun 2 19:48:42 s18 kernel: (xpt0:mpr0:0:16:0): SMID 15 abort TaskMID
240 status 0x0 code 0x0 count 7
Jun 2 19:48:42 s18 kernel: (xpt0:mpr0:0:16:0): SMID 15 finished
recovery after aborting TaskMID 240
Jun 2 19:48:42 s18 kernel: mpr0: Unfreezing devq for target ID 16
Jun 2 19:48:42 s18 kernel: mpr0: EventReply :
Jun 2 19:48:42 s18 kernel: EventDataLength: 4
Jun 2 19:48:42 s18 kernel: AckRequired: 0
Jun 2 19:48:42 s18 kernel: Event: SasTopologyChangeList (0x1c)
Jun 2 19:48:42 s18 kernel: EventContext: 0x0
Jun 2 19:48:42 s18 kernel: EnclosureHandle: 0x2
Jun 2 19:48:42 s18 kernel: ExpanderDevHandle: 0x9
Jun 2 19:48:42 s18 kernel: NumPhys: 31
Jun 2 19:48:42 s18 kernel: NumEntries: 1
Jun 2 19:48:42 s18 kernel: StartPhyNum: 8
Jun 2 19:48:42 s18 kernel: ExpStatus: Responding (0x3)
Jun 2 19:48:42 s18 kernel: PhysicalPort: 0
Jun 2 19:48:42 s18 kernel: PHY[8].AttachedDevHandle: 0x0011
Jun 2 19:48:42 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb)
Jun 2 19:48:42 s18 kernel: PHY[8].PhyStatus: PHYLinkStatusChange
Jun 2 19:48:42 s18 kernel: mpr0: (73)->(mprsas_fw_work) Working on
Event: [16]
Jun 2 19:48:42 s18 kernel: mpr0: (74)->(mprsas_fw_work) Event Free: [16]
Jun 2 19:48:42 s18 kernel: mpr0: (74)->(mprsas_fw_work) Working on
Event: [1c]
Jun 2 19:48:42 s18 kernel: mpr0: (75)->(mprsas_fw_work) Event Free: [1c]
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): READ(10). CDB: 28 00 15
5a 18 c0 00 00 08 00
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): CAM status: Command timeout
Jun 2 19:48:42 s18 kernel: (da7:mpr0:0:16:0): Retrying command
Jun 2 19:48:43 s18 kernel: mpr0: EventReply :
Jun 2 19:48:43 s18 kernel: EventDataLength: 2
Jun 2 19:48:43 s18 kernel: AckRequired: 0
Jun 2 19:48:43 s18 kernel: Event: SasDiscovery (0x16)
Jun 2 19:48:43 s18 kernel: EventContext: 0x0
Jun 2 19:48:43 s18 kernel: Flags: 0
Jun 2 19:48:43 s18 kernel: ReasonCode: Discovery Complete
Jun 2 19:48:43 s18 kernel: PhysicalPort: 0
Jun 2 19:48:43 s18 kernel: DiscoveryStatus: 0
Jun 2 19:48:43 s18 kernel: mpr0: (75)->(mprsas_fw_work) Working on
Event: [16]
Jun 2 19:48:43 s18 kernel: mpr0: (76)->(mprsas_fw_work) Event Free: [16]
Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): WRITE(10). CDB: 2a 00 11
8d fe 50 00 00 10 00
Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): CAM status: SCSI Status Error
Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): SCSI status: Check Condition
Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): SCSI sense: UNIT
ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 2 19:48:43 s18 kernel: (da7:mpr0:0:16:0): Retrying command (per
sense data)

I'm not sure if this is an Intel, CAM, or Avago issue (or something
else?), but I'd be happy to help figure it out if anyone has any
suggestions for me to test.

Thanks!

-Kyle

_______________________________________________
freebs...@freebsd.org mailing list
https://lists.freebsd.org/mailman/listinfo/freebsd-scsi
To unsubscribe, send any mail to "freebsd-scsi...@freebsd.org"

Steven Hartland

Jun 3, 2016, 5:49:48 PM
First thing would be to run gstat with -d to see if you're actually
stacking up deletes, a symptom of which can be r/w dropping to zero.

If you are seeing significant deletes, it could be a FW issue on the drives.

If you can eliminate that, you could still be seeing a cabling /
backplane issue. SSDs stress the signalling much more than HDDs, and on
several occasions we've had to replace backplanes more than once before
the problem was fixed.

The other thing to check is the FW version on the controller and on the
drives: are you up to date?
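(Something along these lines should show both, assuming the mpr driver
and smartmontools; adjust device names to suit:)

# sysctl dev.mpr.0.firmware_version dev.mpr.0.driver_version
# smartctl -i /dev/da3 | grep -i firmware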

list-news

Jun 3, 2016, 6:33:33 PM
Thanks for the input Steven!

#gstat -do
L(q)  ops/s   r/s   kBps  ms/r    w/s   kBps  ms/w    d/s   kBps  ms/d   o/s  ms/o  %busy Name
   1   9671  2977  16275   0.5   3979  68439   0.3   2286  51067   1.8   429   0.5   77.8| da3
   0   9681  3039  19510   0.3   3927  66828   0.3   2286  51067   1.9   429   0.5   76.1| da5
   4   9727  3079  20295   0.3   3917  66485   0.3   2319  50943   1.8   412   0.5   77.4| da7
   1   9936  3246  18353   0.3   3958  66672   0.3   2319  50943   1.6   413   0.5   78.8| da8

# camcontrol tags daX -N 8

(5 minutes later)

#gstat -do (still running after drives hang)
L(q)  ops/s   r/s   kBps  ms/r    w/s   kBps  ms/w    d/s   kBps  ms/d   o/s  ms/o  %busy Name
  69      0     0      0   0.0      0      0   0.0      0      0   0.0     0   0.0    0.0| da3
   0      0     0      0   0.0      0      0   0.0      0      0   0.0     0   0.0    0.0| da5
   0      0     0      0   0.0      0      0   0.0      0      0   0.0     0   0.0    0.0| da7
   0      0     0      0   0.0      0      0   0.0      0      0   0.0     0   0.0    0.0| da8

(60 seconds later)
Jun 3 17:01:36 s18 kernel: (da3:mpr0:0:12:0): READ(10). CDB: 28 00 1b
80 fc c0 00 00 10 00 length 8192 SMID 258 command timeout cm
0xfffffe0000cf12a0 ccb 0xfffff81010eae800 target 12, handle(0x000d)
Jun 3 17:01:36 s18 kernel: mpr0: At enclosure level 0, slot 4,
connector name ( )
Jun 3 17:01:36 s18 kernel: mpr0: timedout cm 0xfffffe0000cf12a0
allocated tm 0xfffffe0000cdc2a0
Jun 3 17:01:36 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 2
Aborting command 0xfffffe0000cf12a0
Jun 3 17:01:36 s18 kernel: mpr0: Sending reset from mprsas_send_abort
for target ID 12
[...]

# camcontrol tags daX -N 1
(20 minutes just passed, no errors)

#gstat -do (still running)
L(q)  ops/s   r/s   kBps  ms/r    w/s   kBps  ms/w    d/s   kBps  ms/d   o/s  ms/o  %busy Name
   2   6790  3647  67501   0.9   1627  21469   0.9   1319  24320   1.2   198   1.0   92.8| da3
   2   6496  3359  70132   0.9   1620  21473   1.0   1319  24320   1.2   198   1.0   93.4| da5
   4   6268  3236  58771   0.9   1455  23094   1.0   1389  25840   1.0   188   0.9   90.4| da7
   4   6675  3654  71142   0.8   1444  23070   0.9   1389  25840   1.0   188   1.0   89.0| da8

##########

sysctl info
dev.mpr.0.driver_version: 09.255.01.00-fbsd
dev.mpr.0.firmware_version: 10.00.03.00

The controller was flashed with SuperMicro's latest recommended
firmware (10.00.03.00) about a week ago. I do believe Avago may have
something newer for the 3008, although I'm not certain how flashing
something from Avago would affect the SuperMicro card. If need be, I
can contact them and gather their thoughts on anything newer than
10.00.03. I'm not sure exactly how to upgrade the driver_version, or if
that's necessary/possible without upgrading to FreeBSD 11.

All the drives have been updated to the latest Intel firmware for the s3610.
# smartctl -a /dev/da3 | grep Firmware
Firmware Version: G2010150

Another point of interest: I have 6 of these nodes in operation (3
servers), all using Intel 730/3500/3610s, and all of them seem to be
exhibiting similar errors over time. Admittedly, they aren't under
anywhere near the same io load, and they aren't seeing errors very often.

Here is a different server (same system type, 2028TP-DECR), using an
older mpr firmware (v6), that had an error on May 9th and then again on
Jun 3. da4 is an Intel 730 in this server, and the OS is 10.2-RELEASE
#0 r286795:

May 9 11:21:41 s15 kernel: (noperiph:mpr0:0:4294967295:0): SMID 2
Aborting command 0xfffffe0000c0b8e0
May 9 11:21:41 s15 kernel: mpr0: Sending reset from mprsas_send_abort
for target ID 12
May 9 11:21:41 s15 kernel: (da4:mpr0:0:12:0): WRITE(10). CDB: 2a 00 0b
5c 9b 9b 00 00 08 00 length 4096 SMID 795 terminated ioc 804b scsi 0
state c xfer 0
May 9 11:21:41 s15 kernel: (da4:mpr0:0:12:0): ATA COMMAND PASS
THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length
512 SMID 578 ter(da4:mpr0:0:12:0): WRITE(10). CDB: 2a 00 0b 9a f3 49 00
00 0c 00
May 9 11:21:41 s15 kernel: minated ioc 804b scsi 0 state c xfer 0
May 9 11:21:41 s15 kernel: (da4:mpr0:0:12:0): CAM status: Command timeout
May 9 11:21:41 s15 kernel: mpr0: (da4:Unfreezing devq for target ID 12
May 9 11:21:41 s15 kernel: mpr0:0:12:0): Retrying command
May 9 11:21:42 s15 kernel: (da4:mpr0:0:12:0): WRITE(10). CDB: 2a 00 0b
9a f3 49 00 00 0c 00
May 9 11:21:42 s15 kernel: (da4:mpr0:0:12:0): CAM status: SCSI Status Error
May 9 11:21:42 s15 kernel: (da4:mpr0:0:12:0): SCSI status: Check Condition
May 9 11:21:42 s15 kernel: (da4:mpr0:0:12:0): SCSI sense: UNIT
ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
May 9 11:21:42 s15 kernel: (da4:mpr0:0:12:0): Retrying command (per
sense data)
Jun 3 12:32:06 s15 kernel: (noperiph:mpr0:0:4294967295:0): SMID 5
Aborting command 0xfffffe0000bf46c0
Jun 3 12:32:06 s15 kernel: mpr0: Sending reset from mprsas_send_abort
for target ID 12
Jun 3 12:32:06 s15 kernel: (da4:mpr0:0:12:0): WRITE(10). CDB: 2a 00 0d
e9 08 9e 00 00 58 00 length 45056 SMID 273 terminated ioc 804b scsi 0
state c xfer 0
Jun 3 12:32:06 s15 kernel: (da4:mpr0:0:12:0): WRITE(10). CDB: 2a 00 0d
d3 24 03 00 00 08 00 length 4096 SMID 155 terminated ioc 804b scsi 0
state c xfer 0
Jun 3 12:32:06 s15 kernel: (da4:mpr0:0:12:0): ATA COMMAND PASS
THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length
512 SMID 845 terminated ioc 804b scsi 0 state c xfer 0
Jun 3 12:32:06 s15 kernel: mpr0: Unfreezing devq for target ID 12
Jun 3 12:32:06 s15 kernel: (da4:mpr0:0:12:0): WRITE(10). CDB: 2a 00 0d
cf 4f 64 00 00 02 00
Jun 3 12:32:06 s15 kernel: (da4:mpr0:0:12:0): CAM status: Command timeout
Jun 3 12:32:06 s15 kernel: (da4:mpr0:0:12:0): Retrying command
Jun 3 12:32:07 s15 kernel: (da4:mpr0:0:12:0): WRITE(10). CDB: 2a 00 0d
e9 08 9e 00 00 58 00
Jun 3 12:32:07 s15 kernel: (da4:mpr0:0:12:0): CAM status: SCSI Status Error
Jun 3 12:32:07 s15 kernel: (da4:mpr0:0:12:0): SCSI status: Check Condition
Jun 3 12:32:07 s15 kernel: (da4:mpr0:0:12:0): SCSI sense: UNIT
ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 3 12:32:07 s15 kernel: (da4:mpr0:0:12:0): Retrying command (per
sense data)

-Kyle

Borja Marcos

Jun 6, 2016, 4:51:52 AM

> On 03 Jun 2016, at 23:49, Steven Hartland <kil...@multiplay.co.uk> wrote:
>
> First thing would be to run gstat with -d to see if you're actually stacking up deletes, a symptom of which can be r/w dropping to zero.
>
> If you are seeing significant deletes it could be a FW issue on the drives.

Hmm. I've suffered that badly with Intel P3500 NVMe drives, which suffer at least from a driver problem: trims are not coalesced.
However I didn't experience command timeouts. Reads and, especially, writes stalled badly.

A quick test for trim related trouble is setting the sysctl variable vfs.zfs.vdev.bio_delete_disable to 1. It doesn't require
a reboot and you can quickly compare results.

In my case, a somewhat similar problem in an IBM server was caused, it seems, by a faulty LSI3008 card. As I didn't have spare LSI3008 cards
at the time, I replaced it with an LSI2008 and everything works perfectly. Before anyone chimes in suggesting card incompatibility of some sort,
I have a twin system with an LSI3008 working like a charm. ;)

Borja.

list-news

Jun 6, 2016, 6:19:26 PM
System was running solid all weekend with camcontrol tags set to 1 for
each device, zero errors.

Last week I did try
# sysctl kern.cam.da.X.delete_method=DISABLE
for each drive, but it still threw errors.

Also, I did try out bio_delete_disable earlier today:
# camcontrol tags daX -N 255
(first resetting tags back to 255 for each device, as they were at 1)

# sysctl vfs.zfs.vdev.bio_delete_disable=1
(a few minutes later)

Jun 6 12:28:36 s18 kernel: (da2:mpr0:0:12:0): SYNCHRONIZE CACHE(10).
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 577 command timeout cm
0xfffffe0001351550 ccb 0xfffff804e78e3800 target 12, handle(0x000c)
Jun 6 12:28:36 s18 kernel: mpr0: At enclosure level 0, slot 4,
connector name ( )
Jun 6 12:28:36 s18 kernel: mpr0: timedout cm 0xfffffe0001351550
allocated tm 0xfffffe0001322150
Jun 6 12:28:36 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1
Aborting command 0xfffffe0001351550
Jun 6 12:28:36 s18 kernel: mpr0: Sending reset from mprsas_send_abort
for target ID 12
Jun 6 12:28:36 s18 kernel: (da2:mpr0:0:12:0): READ(10). CDB: 28 00 18
45 1c c0 00 00 08 00 length 4096 SMID 583 command timeout cm
0xfffffe0001351d30 ccb 0xfffff806b9556800 target 12, handle(0x000c)
Jun 6 12:28:36 s18 kernel: mpr0: At enclosure level 0, slot 4,
connector name ( )
Jun 6 12:28:36 s18 kernel: mpr0: queued timedout cm 0xfffffe0001351d30
for processing by tm 0xfffffe0001322150
...

During the 60 second hang:
# gstat -do

L(q)  ops/s   r/s   kBps  ms/r    w/s   kBps  ms/w    d/s   kBps  ms/d   o/s  ms/o  %busy Name
  70      0     0      0   0.0      0      0   0.0      0      0   0.0     0   0.0    0.0| da2
   0      0     0      0   0.0      0      0   0.0      0      0   0.0     0   0.0    0.0| da4
   0      0     0      0   0.0      0      0   0.0      0      0   0.0     0   0.0    0.0| da6
   0      0     0      0   0.0      0      0   0.0      0      0   0.0     0   0.0    0.0| da7


Also during the 60 second hang:
# camcontrol tags da3 -v
(pass2:mpr0:0:12:0): dev_openings 248
(pass2:mpr0:0:12:0): dev_active 7
(pass2:mpr0:0:12:0): allocated 7
(pass2:mpr0:0:12:0): queued 0
(pass2:mpr0:0:12:0): held 0
(pass2:mpr0:0:12:0): mintags 2
(pass2:mpr0:0:12:0): maxtags 255

Also during the 60 second hang:
# sysctl dev.mpr
dev.mpr.0.spinup_wait_time: 3
dev.mpr.0.chain_alloc_fail: 0
dev.mpr.0.enable_ssu: 1
dev.mpr.0.max_chains: 2048
dev.mpr.0.chain_free_lowwater: 2022
dev.mpr.0.chain_free: 2048
dev.mpr.0.io_cmds_highwater: 71
dev.mpr.0.io_cmds_active: 4
dev.mpr.0.driver_version: 09.255.01.00-fbsd
dev.mpr.0.firmware_version: 10.00.03.00
dev.mpr.0.disable_msi: 0
dev.mpr.0.disable_msix: 0
dev.mpr.0.debug_level: 895
dev.mpr.0.%parent: pci1
dev.mpr.0.%pnpinfo: vendor=0x1000 device=0x0097 subvendor=0x15d9
subdevice=0x0808 class=0x010700
dev.mpr.0.%location: pci0:1:0:0 handle=\_SB_.PCI0.BR1A.H000
dev.mpr.0.%driver: mpr
dev.mpr.0.%desc: Avago Technologies (LSI) SAS3008
dev.mpr.%parent:

Something else that may be worth considering: I ran fio & bonnie++ for
about an hour of heavy io (with tags still set to 255 and drive busy
showing 90-100%). No errors. I fired up my application (a threaded
Java/Postgres application), and within minutes:

# gstat -do
dT: 1.002s w: 1.000s

L(q)  ops/s   r/s   kBps  ms/r    w/s   kBps  ms/w    d/s   kBps  ms/d   o/s  ms/o  %busy Name
   0      0     0      0   0.0      0      0   0.0      0      0   0.0     0   0.0    0.0| da2
   0      0     0      0   0.0      0      0   0.0      0      0   0.0     0   0.0    0.0| da4
  71      0     0      0   0.0      0      0   0.0      0      0   0.0     0   0.0    0.0| da6
   0      0     0      0   0.0      0      0   0.0      0      0   0.0     0   0.0    0.0| da7

Error:
Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 30
65 13 90 00 00 10 00
Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI Status Error
Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check Condition
Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT
ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)

Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): Retrying command (per
sense data)
...

And again 2 minutes later:

Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): WRITE(10). CDB: 2a 00 21
66 63 58 00 00 10 00
Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): CAM status: SCSI Status Error
Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): SCSI status: Check Condition
Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): SCSI sense: UNIT
ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)

Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): Retrying command (per
sense data)
...

And again 3 minutes later:

Jun 6 13:41:29 s18 kernel: (da7:mpr0:0:18:0): WRITE(10). CDB: 2a 00 33
44 b5 b8 00 00 10 00
...

# camcontrol tags daX -N 1
(And now, after 15 minutes, zero errors.)

Putting some thought into this, which may or may not be off base
(please feel free to correct me, btw), I've noticed the following:

1) There doesn't seem to be any indication as to what causes the drive
to time out. The command that fails in the error log is one of
READ(10), WRITE(10), ATA COMMAND PASS THROUGH(16), or SYNCHRONIZE
CACHE(10). As I understand it, that is just the command being executed,
timed out, and retried, not necessarily what caused the drive lock-up.

2) When my application is run, it hammers postgres pretty hard, and when
postgres is running I get the errors. fio & bonnie++ don't give me
errors; daily use of the system doesn't give me errors. I'm assuming
postgresql is sending far more of a certain type of command to the io
subsystem than those other applications, and the first command that
comes to mind is fsync.

3) I turned fsync off in postgresql.conf (I'm brave for science!!) then
ran my application again with tags at 255, 100% cpu load, 70-80% drive
busy%.

1.5 hours later at full load - finally, a single timeout:
Jun 6 16:31:33 s18 kernel: (da2:mpr0:0:12:0): READ(10). CDB: 28 00 2d
50 1b 78 00 00 08 00 length 4096 SMID 556 command timeout cm
0xfffffe000134f9c0 ccb 0xfffff83aa5b25000 target 12, handle(0x000c)

I ran it for another 20 minutes with no additional timeouts.

I assume the fsync call turns into a zfs -> cam -> SYNCHRONIZE CACHE
command for each device, and postgres is sending this command
considerably more often than a typical application (at least with fsync
turned on in postgresql.conf). That would explain why, when fsync is
turned off or minimal fsyncs are being sent (ie typical system usage),
the error is rare, yet when fsync is being sent repeatedly, the errors
start happening every few minutes. The only reason I can think of for
why setting tags to 1 eliminates the errors entirely is that it must
have something to do with the Intel drives not handling parallel
commands from cam when one (or more) of those commands is SYNCHRONIZE
CACHE. Thoughts?
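(If anyone wants to try to reproduce this without my application, I'd
guess something like pgbench would generate a similarly fsync-heavy
load, assuming the postgres client tools are installed; the scale and
client counts below are only rough guesses:)

# createdb testdb
# pgbench -i -s 100 testdb
# pgbench -c 16 -j 4 -T 600 testdb

Each committed transaction should force a WAL fsync (with
synchronous_commit left on), so that ought to hammer SYNCHRONIZE CACHE
much the same way my application does.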

Thanks,

-Kyle

Borja Marcos

Jun 7, 2016, 2:35:17 AM

> On 07 Jun 2016, at 00:19, list-news <list...@mindpackstudios.com> wrote:
>
> # sysctl vfs.zfs.vdev.bio_delete_disable=1
> (a few minutes later)

So trim is not causing it.

>
> Error:
> Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 30 65 13 90 00 00 10 00
> Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI Status Error
> Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check Condition
> Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jun 6 13:36:15 s18 kernel: (da6:mpr0:0:16:0): Retrying command (per sense data)
> ...
>
> And again 2 minutes later:
>
> Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): WRITE(10). CDB: 2a 00 21 66 63 58 00 00 10 00
> Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): CAM status: SCSI Status Error
> Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): SCSI status: Check Condition
> Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> Jun 6 13:38:43 s18 kernel: (da2:mpr0:0:12:0): Retrying command (per sense data)


I suffered this particular symptom because, it seems, of a broken LSI3008 card. Finally I replaced it with an LSI2008 (I didn't have a spare
LSI3008 handy) and the errors vanished. In my case it is an NFS storage system based on ZFS and Samsung SSD disks, serving several Xen
hosts.

In my case the disks are SATA.

I know that it was a defective card, and not a problem with LSI3008 cards or the driver in general, because I have a twin system working like a charm
from day zero.

I would try, if possible, to swap the controller.

list-news

Jun 7, 2016, 1:09:20 PM
The system is a Twin. In the first post I mentioned this but I probably
wasn't clear.

The twin unit is this one:
https://www.supermicro.com/products/system/2u/2028/sys-2028tp-decr.cfm

I've used all components from twin nodes A and B (cpu / memory /
mainboard / controller), and I still get the errors. The backplane was
the original suspect, and that has been RMA'd and replaced - errors
continue. I've even swapped out power supplies with another identical
unit I have here.

In every case the errors continue, until I do this:
# camcontrol tags daX -N 1
(for each drive in the zpool)

Then the errors stop.

The system errors every few minutes while my application is running.
Set tags to -N 1, and everything goes quiet: 16 cores at 100% cpu and
drives 80% busy @ ~15k IO p/s for about 5 hours solid before it
finishes a batch, and no errors are reported with -N set to 1. If I set
tags back to -N 255 for each device, errors start again within 5
minutes and continue every 2-5 minutes until the batch is finished.

-Kyle

Steven Hartland

Jun 7, 2016, 3:02:52 PM
Have you tried direct attaching the drives?

list-news

Jun 7, 2016, 3:24:56 PM
I have additional confirmation that it's not faulty hardware.

I moved the 4 disks that carry the postgresql database over to another
server (same model - TWIN 2028-DECR). Mounted the zpool and fired up my
application.

This server is using a much earlier firmware on the SAS controller.
Different CPU / Memory / etc.

Errors happen within the first couple of minutes, and continue every few
minutes (notice the timestamps: a drive times out every few minutes):

Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 0e
74 79 e0 00 00 08 00 length 4096 SMID 582 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 0e
74 79 e8 00 00 08 00 length 4096 SMID 1009 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): ATA COMMAND PASS
THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length
512 SMID 315 terminated ioc 804b scsi 0 state c xfer 0
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 33
91 5c 68 00 00 08 00 length 4096 SMID 183 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 36
f2 39 40 00 00 10 00 length 8192 SMID 446 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SYNCHRONIZE CACHE(10).
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 715 terminated ioc 804b
scsi 0 state c xfer 0
Jun 7 13:08:32 s17 kernel: mpr0: Unfreezing devq for target ID 14
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 36
ea dc 60 00 00 08 00
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): CAM status: Command timeout
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): Retrying command
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): READ(10). CDB: 28 00 0e
74 79 e0 00 00 08 00
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): CAM status: SCSI Status
Error
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SCSI status: Check Condition
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): SCSI sense: UNIT
ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 7 13:08:32 s17 kernel: (da10:mpr0:0:14:0): Retrying command (per
sense data)
Jun 7 13:11:08 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 4
Aborting command 0xfffffe0000be0140
Jun 7 13:11:08 s17 kernel: mpr0: Sending reset from mprsas_send_abort
for target ID 10
Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d
f6 ee f0 00 00 08 00 length 4096 SMID 335 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d
f6 ee d8 00 00 10 00 length 8192 SMID 262 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): ATA COMMAND PASS
THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length
512 SMID 692 terminated ioc 804b scsi 0 state c xfer 0
Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 19
be 13 a0 00 00 10 00 length 8192 SMID 509 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21
3c 00 d8 00 00 08 00 length 4096 SMID 911 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21
3c 00 d0 00 00 08 00 length 4096 SMID 918 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 21
3c 00 c8 00 00 08 00 length 4096 SMID 585 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10).
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 297 terminated ioc 804b
scsi 0 state c xfer 0
Jun 7 13:11:08 s17 kernel: mpr0: Unfreezing devq for target ID 10
Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 35
26 ca f0 00 00 08 00
Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): CAM status: Command timeout
Jun 7 13:11:08 s17 kernel: (da2:mpr0:0:10:0): Retrying command
Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d
f6 ee f0 00 00 08 00
Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): CAM status: SCSI Status Error
Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): SCSI status: Check Condition
Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): SCSI sense: UNIT
ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 7 13:11:09 s17 kernel: (da2:mpr0:0:10:0): Retrying command (per
sense data)
Jun 7 13:13:04 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 5
Aborting command 0xfffffe0000bfcca0
Jun 7 13:13:04 s17 kernel: mpr0: Sending reset from mprsas_send_abort
for target ID 10
Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): ATA COMMAND PASS
THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length
512 SMID 504 terminated ioc 804b scsi 0 state c xfer 0
Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1b
8d 99 48 00 00 08 00 length 4096 SMID 677 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 13
6b df b8 00 00 10 00 length 8192 SMID 563 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d
f7 cd a8 00 00 08 00 length 4096 SMID 723 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d
f7 cd b0 00 00 08 00 length 4096 SMID 335 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10).
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 478 terminated ioc 804b
scsi 0 state c xfer 0
Jun 7 13:13:04 s17 kernel: mpr0: Unfreezing devq for target ID 10
Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1e
d6 de f0 00 00 08 00
Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): CAM status: Command timeout
Jun 7 13:13:04 s17 kernel: (da2:mpr0:0:10:0): Retrying command
Jun 7 13:13:05 s17 kernel: mpr0: log_info(0x31120440): originator(PL),
code(0x12), sub_code(0x0440)
Jun 7 13:13:05 s17 kernel: mpr0: (da2:mpr0:0:10:0): ATA COMMAND PASS
THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
Jun 7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL),
code(0x12), sub_code(0x0440)
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request
completed with an error
Jun 7 13:13:05 s17 kernel: mpr0: (da2:log_info(0x31120440):
originator(PL), code(0x12), sub_code(0x0440)
Jun 7 13:13:05 s17 kernel: mpr0:0:mpr0: 10:log_info(0x31120440):
originator(PL), code(0x12), sub_code(0x0440)
Jun 7 13:13:05 s17 kernel: 0): mpr0: Retrying command
Jun 7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL),
code(0x12), sub_code(0x0440)
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10).
CDB: 35 00 00 00 00 00 00 00 00 00
Jun 7 13:13:05 s17 kernel: mpr0: (da2:mpr0:0:10:0): CAM status: CCB
request completed with an error
Jun 7 13:13:05 s17 kernel: log_info(0x31120440): originator(PL),
code(0x12), sub_code(0x0440)
Jun 7 13:13:05 s17 kernel: (da2:mpr0: mpr0:0:log_info(0x31120440):
originator(PL), code(0x12), sub_code(0x0440)
Jun 7 13:13:05 s17 kernel: 10:0): Retrying command
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1b
8d 99 48 00 00 08 00
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request
completed with an error
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 13
6b df b8 00 00 10 00
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request
completed with an error
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d
f7 cd a8 00 00 08 00
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request
completed with an error
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 0d
f7 cd b0 00 00 08 00
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request
completed with an error
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): READ(10). CDB: 28 00 1e
d6 de f0 00 00 08 00
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): CAM status: CCB request
completed with an error
Jun 7 13:13:05 s17 kernel: (da2:mpr0:0:10:0): Retrying command
Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SYNCHRONIZE CACHE(10).
CDB: 35 00 00 00 00 00 00 00 00 00
Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): CAM status: SCSI Status Error
Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SCSI status: Check Condition
Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): SCSI sense: UNIT
ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): Error 6, Retries exhausted
Jun 7 13:13:06 s17 kernel: (da2:mpr0:0:10:0): Invalidating pack
Jun 7 13:15:11 s17 kernel: (noperiph:mpr0:0:4294967295:0): SMID 6
Aborting command 0xfffffe0000c1e960
Jun 7 13:15:11 s17 kernel: mpr0: Sending reset from mprsas_send_abort
for target ID 11
Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): ATA COMMAND PASS
THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00 length
512 SMID 942 terminated ioc 804b scsi 0 state c xfer 0
Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 23
7f 21 c0 00 00 08 00 length 4096 SMID 359 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 31
bb 68 30 00 00 08 00 length 4096 SMID 597 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 19
80 02 68 00 00 50 00 length 40960 SMID 786 terminated ioc 804b scsi 0
state c xfer(da3:mpr0:0:11:0): READ(10). CDB: 28 00 22 02 ea 38 00 00 10 00
Jun 7 13:15:12 s17 kernel: 0
Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): CAM status: Command timeout
Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): READ(10). CDB: 28 00 19
7e 0d 30 00 00 10 00 length 8192 SMID 602 terminated ioc 804b scsi 0
state c xfer (da3:0
Jun 7 13:15:12 s17 kernel: mpr0:0: (da3:mpr0:0:11:0): SYNCHRONIZE
CACHE(10). CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 441
terminated ioc 804b scsi 0 sta11:te c xfer 0
Jun 7 13:15:12 s17 kernel: 0): mpr0: Retrying command
Jun 7 13:15:12 s17 kernel: Unfreezing devq for target ID 11
Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SYNCHRONIZE CACHE(10).
CDB: 35 00 00 00 00 00 00 00 00 00
Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): CAM status: SCSI Status Error
Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SCSI status: Check Condition
Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): SCSI sense: UNIT
ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 7 13:15:12 s17 kernel: (da3:mpr0:0:11:0): Retrying command (per
sense data)

gstat output:
(I'm guessing I caught this during the da2 error)

#gstat -do
L(q)  ops/s   r/s   kBps  ms/r    w/s   kBps  ms/w    d/s   kBps  ms/d   o/s  ms/o  %busy Name
  70      0     0      0   0.0      0      0   0.0      0      0   0.0     0   0.0    0.0| da2
   0      0     0      0   0.0      0      0   0.0      0      0   0.0     0   0.0    0.0| da3
   0      0     0      0   0.0      0      0   0.0      0      0   0.0     0   0.0    0.0| da10
   0      0     0      0   0.0      0      0   0.0      0      0   0.0     0   0.0    0.0| da11


I then set the tags down to 1 for each device:

#camcontrol tags da2 -N 1
#camcontrol tags da3 -N 1
#camcontrol tags da10 -N 1
#camcontrol tags da11 -N 1

And, no errors for the last hour, system still running at full load.

Everything is feeling like an NCQ firmware issue. The Intel S3610 says
it supports NCQ with 32 tags in its SSDs, but I've hit the errors with
tags set to 8 plenty of times.

(See NCQ line below.)

# camcontrol identify da2

pass2: <INTEL SSDSC2BX480G4 G2010150> ACS-2 ATA SATA 3.x device
pass2: 1200.000MB/s transfers, Command Queueing Enabled
protocol ATA/ATAPI-9 SATA 3.x
device model INTEL SSDSC2BX480G4
firmware revision G2010150
serial number [redacted]
WWN [redacted]
cylinders 16383
heads 16
sectors/track 63
sector size logical 512, physical 4096, offset 0
LBA supported 268435455 sectors
LBA48 supported 937703088 sectors
PIO supported PIO4
DMA supported WDMA2 UDMA6
media RPM non-rotating

Feature Support Enabled Value Vendor
read ahead yes yes
write cache yes yes
flush cache yes yes
overlap no
Tagged Command Queuing (TCQ) no no
Native Command Queuing (NCQ) yes 32 tags
NCQ Queue Management no
NCQ Streaming no
Receive & Send FPDMA Queued no
SMART yes yes
microcode download yes yes
security yes no
power management yes yes
advanced power management no no
automatic acoustic management no no
media status notification no no
power-up in Standby no no
write-read-verify no no
unload yes yes
general purpose logging yes yes
free-fall no no
Data Set Management (DSM/TRIM) yes
DSM - max 512byte blocks yes 4
DSM - deterministic read yes zeroed
Host Protected Area (HPA) yes no 937703088/937703088
HPA - Security no

And it doesn't appear I have any way to deactivate it in the drive
firmware, which would have been a nice test. I did attempt this, with
no luck:
# camcontrol negotiate da2 -T disable
(pass2:mpr0:0:10:0): transfer speed: 1200.000MB/s
(pass2:mpr0:0:10:0): tagged queueing: enabled
camcontrol: XPT_SET_TRANS_SETTINGS CCB failed

-Kyle

Steven Hartland

Jun 7, 2016, 3:53:20 PM
CDB 85 is a TRIM command IIRC. I know you tried it before using BIO
delete, but assuming you're running ZFS, can you set the following in
loader.conf and see how you get on:
vfs.zfs.trim.enabled=0

Regards
Steve

list-news

Jun 7, 2016, 3:53:34 PM
I don't believe the mainboard has any SATA ports. It does have a PCIe
slot IIRC, though, and I may be able to rig something up with another
LSI adapter I have lying around, if I can get it to fit and find a way
to power the drives.

Although, this seems unlikely to help unless you are seeing something I'm not?

With that last test: If it's the SAS controller, 3 different ones
running two different firmware versions are all causing the issue. If
it's the backplane, I have now tested 3 of them as well, two of which I
can confirm have different revision numbers.

Errors never appear with tags set to 1 for each drive (effectively
eliminating NCQ as I understand it). My brief understanding is that a
higher tag count allows the SAS adapter to send more commands to the
drive in parallel, allowing the drive to make the decisions about
command ordering. If that is accurate, and the controller firmware was
bad, I assume this would be a far more common bug that would have been
fixed already.

On the other hand, if it only happens when heavy SYNCHRONIZE CACHE
commands are issued in parallel to certain Intel SSDs, and only on
controllers (maybe 12Gbps ones?) that can outrun the drive firmware or
trigger a race condition (my suspicion here), it seems far more likely
this would have gone unnoticed by Intel.

-Kyle


On 6/7/16 2:02 PM, Steven Hartland wrote:
> Have you tried direct attaching the drives?
>

list-news

Jun 7, 2016, 4:19:39 PM
Sure Steve:

# cat /boot/loader.conf | grep trim
vfs.zfs.trim.enabled=0

# sysctl vfs.zfs.trim.enabled
vfs.zfs.trim.enabled: 0

# uptime
3:14PM up 11 mins, 3 users, load averages: 6.58, 11.31, 7.07

# tail -f /var/log/messages:
Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b
d8 86 50 00 00 b0 00 length 90112 SMID 1016 command timeout cm
0xfffffe0001375580 ccb 0xfffff8039895f800 target 16, handle(0x0010)
Jun 7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8,
connector name ( )
Jun 7 15:13:50 s18 kernel: mpr0: timedout cm 0xfffffe0001375580
allocated tm 0xfffffe0001322150
Jun 7 15:13:50 s18 kernel: (noperiph:mpr0:0:4294967295:0): SMID 1
Aborting command 0xfffffe0001375580
Jun 7 15:13:50 s18 kernel: mpr0: Sending reset from mprsas_send_abort
for target ID 16
Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10).
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 command timeout cm
0xfffffe00013627a0 ccb 0xfffff8039851e800 target 16, handle(0x0010)
Jun 7 15:13:50 s18 kernel: mpr0: At enclosure level 0, slot 8,
connector name ( )
Jun 7 15:13:50 s18 kernel: mpr0: queued timedout cm 0xfffffe00013627a0
for processing by tm 0xfffffe0001322150
Jun 7 15:13:50 s18 kernel: mpr0: EventReply :
Jun 7 15:13:50 s18 kernel: EventDataLength: 2
Jun 7 15:13:50 s18 kernel: AckRequired: 0
Jun 7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
Jun 7 15:13:50 s18 kernel: EventContext: 0x0
Jun 7 15:13:50 s18 kernel: Flags: 1<InProgress>
Jun 7 15:13:50 s18 kernel: ReasonCode: Discovery Started
Jun 7 15:13:50 s18 kernel: PhysicalPort: 0
Jun 7 15:13:50 s18 kernel: DiscoveryStatus: 0
Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b
43 a8 00 00 00 10 00 length 8192 SMID 624 completed cm
0xfffffe0001355300 ccb 0xfffff803984d4800 during recovery ioc 804b scsi
0 state c xfer 0
Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b
43 a8 00 00 00 10 00 length 8192 SMID 624 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b
43 a7 f0 00 00 10 00 length 8192 SMID 633 completed cm
0xfffffe0001355ed0 ccb 0xfffff803987f0000 during recovery ioc 804b scsi
0 state c xfer 0
Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0b
43 a7 f0 00 00 10 00 length 8192 SMID 633 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0a
25 3f f0 00 00 08 00 length 4096 SMID 133 completed cm
0xfffffe000132ce90 ccb 0xfffff803985fc000 during recovery ioc 804b scsi
0 state c xfer 0
Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): READ(10). CDB: 28 00 0a
25 3f f0 00 00 08 00 length 4096 SMID 133 terminated ioc 804b scsi 0
state c xfer 0
Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b
d8 86 50 00 00 b0 00 length 90112 SMID 1016 completed timedout cm
0xfffffe0001375580 ccb 0xfffff8039895f800 during recovery ioc 8048 scsi
0 state c (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10). CDB: 35 00 00 00
00 00 00 00 00 00 length 0 SMID 786 completed timedout cm
0xfffffe(da6:mpr0:0:16:0): WRITE(10). CDB: 2a 00 2b d8 86 50 00 00 b0 00
Jun 7 15:13:50 s18 kernel: 00013627a0 ccb 0xfffff8039851e800 during
recovery ioc 804b scsi 0 (da6:mpr0:0:16:0): CAM status: Command timeout
Jun 7 15:13:50 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10).
CDB: 35 00 00 00 00 00 00 00 00 00 length 0 SMID 786 terminated ioc 804b
scsi 0 sta(da6:te c xfer 0
Jun 7 15:13:50 s18 kernel: mpr0:0: (xpt0:mpr0:0:16:0): SMID 1 abort
TaskMID 1016 status 0x0 code 0x0 count 5
Jun 7 15:13:50 s18 kernel: 16: (xpt0:mpr0:0:16:0): SMID 1 finished
recovery after aborting TaskMID 1016
Jun 7 15:13:50 s18 kernel: 0): mpr0: Retrying command
Jun 7 15:13:50 s18 kernel: Unfreezing devq for target ID 16
Jun 7 15:13:50 s18 kernel: mpr0: EventReply :
Jun 7 15:13:50 s18 kernel: EventDataLength: 4
Jun 7 15:13:50 s18 kernel: AckRequired: 0
Jun 7 15:13:50 s18 kernel: Event: SasTopologyChangeList (0x1c)
Jun 7 15:13:50 s18 kernel: EventContext: 0x0
Jun 7 15:13:50 s18 kernel: EnclosureHandle: 0x2
Jun 7 15:13:50 s18 kernel: ExpanderDevHandle: 0x9
Jun 7 15:13:50 s18 kernel: NumPhys: 31
Jun 7 15:13:50 s18 kernel: NumEntries: 1
Jun 7 15:13:50 s18 kernel: StartPhyNum: 8
Jun 7 15:13:50 s18 kernel: ExpStatus: Responding (0x3)
Jun 7 15:13:50 s18 kernel: PhysicalPort: 0
Jun 7 15:13:50 s18 kernel: PHY[8].AttachedDevHandle: 0x0010
Jun 7 15:13:50 s18 kernel: PHY[8].LinkRate: 12.0Gbps (0xbb)
Jun 7 15:13:50 s18 kernel: PHY[8].PhyStatus: PHYLinkStatusChange
Jun 7 15:13:50 s18 kernel: mpr0: (0)->(mprsas_fw_work) Working on
Event: [16]
Jun 7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Event Free: [16]
Jun 7 15:13:50 s18 kernel: mpr0: (1)->(mprsas_fw_work) Working on
Event: [1c]
Jun 7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Event Free: [1c]
Jun 7 15:13:50 s18 kernel: mpr0: EventReply :
Jun 7 15:13:50 s18 kernel: EventDataLength: 2
Jun 7 15:13:50 s18 kernel: AckRequired: 0
Jun 7 15:13:50 s18 kernel: Event: SasDiscovery (0x16)
Jun 7 15:13:50 s18 kernel: EventContext: 0x0
Jun 7 15:13:50 s18 kernel: Flags: 0
Jun 7 15:13:50 s18 kernel: ReasonCode: Discovery Complete
Jun 7 15:13:50 s18 kernel: PhysicalPort: 0
Jun 7 15:13:50 s18 kernel: DiscoveryStatus: 0
Jun 7 15:13:50 s18 kernel: mpr0: (2)->(mprsas_fw_work) Working on
Event: [16]
Jun 7 15:13:50 s18 kernel: mpr0: (3)->(mprsas_fw_work) Event Free: [16]
Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SYNCHRONIZE CACHE(10).
CDB: 35 00 00 00 00 00 00 00 00 00
Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): CAM status: SCSI Status Error
Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI status: Check Condition
Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): SCSI sense: UNIT
ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
Jun 7 15:13:51 s18 kernel: (da6:mpr0:0:16:0): Retrying command (per
sense data)

-Kyle

On 6/7/16 2:53 PM, Steven Hartland wrote:
> CDB: 85 is a TRIM command IIRC, I know you tried it before using BIO
> delete but assuming your running ZFS can you set the following in
> loader.conf and see how you get on.
> vfs.zfs.trim.enabled=0
>
> Regards
> Steve


Steven Hartland

Jun 7, 2016, 4:44:24 PM
On 07/06/2016 20:53, list-news wrote:
> I don't believe the mainboard has any SATA ports. It does have a PCIe
> slot IIRC though, and I may be able to rig something up with another
> LSI adapter I have laying around. If I can get it to fit and find a
> way to power the drives.
>
> Although, this seems unlikely unless you are seeing something I'm not?
Nope, but you're assuming that the backplane doesn't have a design
issue, and unfortunately that's more common than most people know, so
my process is to always fall back to the lowest common denominator and
directly attach the disks to the controller.
>
> With that last test: If it's the SAS controller, 3 different ones
> running two different firmware versions are all causing the issue. If
> it's the backplane, I have now tested 3 of them as well, two of which
> I can confirm have different revision numbers.
>
> Errors never appear with tags set to 1 for each drive (effectively
> eliminating NCQ as I understand it). My brief understanding is that a
> higher tag count allows the SAS adapter to send more commands to the
> drive in parallel, allowing the drive to make the decisions about
> command ordering. If that is accurate, and the controller firmware
> was bad, I assume this would be a far more common bug that would have
> been fixed already.
>
> On the other hand, if it only happens during heavy SYNCHRONIZE CACHE
> commands in parallel on certain Intel SSD's and only on controllers
> (maybe 12gbps?) that can outrun the drive firmware or cause a race
> condition (my suspicions here). It seems far more likely this would
> have gone unnoticed by Intel.

All possible, but discount the easy options first. If you have access
to a 2008-based controller, try that; they have always been solid here.
We've not used the 3008 yet.

Steven Hartland

Jun 7, 2016, 5:22:53 PM
Always da6?

list-news

Jun 7, 2016, 6:43:32 PM
No, it threw errors on both da6 and da7, and then I stopped it.

Your last e-mail got me thinking though. I have a server with 2008
controllers (entirely different backplane design, cpu, memory, etc).
I've moved the 4 drives to that and I'm running the test now.

# uname = FreeBSD 10.2-RELEASE-p12 #1 r296215
# sysctl dev.mps.0
dev.mps.0.spinup_wait_time: 3
dev.mps.0.chain_alloc_fail: 0
dev.mps.0.enable_ssu: 1
dev.mps.0.max_chains: 2048
dev.mps.0.chain_free_lowwater: 1176
dev.mps.0.chain_free: 2048
dev.mps.0.io_cmds_highwater: 510
dev.mps.0.io_cmds_active: 0
dev.mps.0.driver_version: 20.00.00.00-fbsd
dev.mps.0.firmware_version: 17.00.01.00
dev.mps.0.disable_msi: 0
dev.mps.0.disable_msix: 0
dev.mps.0.debug_level: 3
dev.mps.0.%parent: pci5
dev.mps.0.%pnpinfo: vendor=0x1000 device=0x0072 subvendor=0x1000
subdevice=0x3020 class=0x010700
dev.mps.0.%location: slot=0 function=0
dev.mps.0.%driver: mps
dev.mps.0.%desc: Avago Technologies (LSI) SAS2008

About 1.5 hours has passed at full load, no errors.

gstat drive busy% seems to hang out around 30-40 instead of ~60-70.
Overall throughput seems to be 20-30% less with my rough benchmarks.

I'm not sure if this gets us closer to the answer, but if it doesn't
time out on the 2008 controller, it looks like one of these:
1) The Intel drive firmware is being overloaded somehow when connected
to the 3008.
or
2) The 3008 firmware or driver has an issue reading drive responses,
sporadically thinking a command timed out (when maybe it really didn't).

Puzzle pieces:
A) Why does setting tags to 1 on drives connected to the 3008 fix the
problem?
B) With tags at 255, why does postgres (presumably with a large fsync
count) seem to cause the problem within minutes, while other heavy i/o
commands (zpool scrub, bonnie++, fio), all of which show similarly high
or higher iops, take hours to cause the problem (if ever)? (A rough way
to check the flush-rate angle is sketched below.)
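A sketch of what I have in mind for B (I believe the o/s column that
gstat prints with -o counts BIO_FLUSH requests, which is what the
SYNCHRONIZE CACHE commands should show up as, so comparing that rate
between workloads should tell us whether postgres really is flush-heavy):

# gstat -do -f da3     (while postgres is running)
# gstat -do -f da3     (while fio / bonnie++ is running, then compare o/s)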

I'll let this continue to run to further test.

Thanks again for all the help.

-Kyle

Steven Hartland

Jun 7, 2016, 7:28:46 PM
If that works I'd switch the 3008 into the machine that currently has
the 2008 in it and retest. That will help confirm whether the 3008 card
and driver is or isn't a potential issue.

Steven Hartland

Jun 7, 2016, 7:30:27 PM
Oh, another thing to test: IIRC the 3008 is supported by mrsas, so you
might want to try adding the following to loader.conf to switch drivers:
hw.mfi.mrsas_enable="1"

David Gwynne

Jun 7, 2016, 7:39:58 PM

> On 8 Jun 2016, at 09:30, Steven Hartland <kil...@multiplay.co.uk> wrote:
>
> Oh another thing to test is iirc 3008 is supported by mrsas so you might want to try adding the following into loader.conf to switch drivers:
> hw.mfi.mrsas_enable="1"

i believe the 3008s can run two different firmwares, one that provides the mpt2 interface and the other that provides the megaraid sas fusion interface. you have to flash them to switch though, you can't just point a driver at it and hope for the best.

each fw presents different pci ids. eg, in http://pciids.sourceforge.net/v2.2/pci.ids you can see:

005f MegaRAID SAS-3 3008 [Fury]
0097 SAS3008 PCI-Express Fusion-MPT SAS-3
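a quick way to check which personality a given board is presenting (a
sketch, going from the pciconf output format i remember):

# pciconf -lv | grep -i -B 3 sas3

the chip= field should read 0x00971000 when the IT firmware is on it
(mpr attaches) and 0x005f1000 for the megaraid fusion firmware (mrsas
attaches).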

dlg