
Re: Device timeouts(?) with LSI SAS3008 on mpr(4)


Yamagi Burmeister

Jul 13, 2015, 5:02:17 AM
Hello,
after some fiddling and testing I managed to track this down. TRIM is
the culprit:

- With vfs.zfs.trim.enabled set to 1, timeouts occur regardless of
cabling and regardless of whether the SSDs sit behind a backplane or
are connected directly. It doesn't matter whether Intel DC S3500 or
S3700 SSDs are connected, though both share the same controller. I
don't have enough onboard SATA ports to test the whole setup without
the 9300-8i HBA, but a short (maybe too short and without enough load)
test with 6 SSDs didn't show any timeouts.

- With vfs.zfs.trim.enabled set to 0 I haven't seen a single timeout
for ~56 hours.
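
For reference, a minimal way to make this setting persistent (assuming
the stock tunable name; the value is quoted as usual for loader.conf)
is a one-line /boot/loader.conf entry:

vfs.zfs.trim.enabled="0"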

Regards,
Yamagi

On Tue, 7 Jul 2015 13:24:16 +0200
Yamagi Burmeister <li...@yamagi.org> wrote:

> Hello,
> I've got 3 new Supermicro servers based upon the X10DRi-LN4+ platform.
> Each server is equipped with 2 LSI SAS9300-8i-SQL SAS adapters. Each
> adapter serves 8 Intel DC S3700 SSDs. Operating system is 10.1-STABLE
> as of r283938 on 2 servers and r285196 on the last one.
>
> The controllers identify themselves as:
>
> ----
>
> mpr0: <Avago Technologies (LSI) SAS3008> port 0x6000-0x60ff mem
> 0xc7240000-0xc724ffff,0xc7200000-0xc723ffff irq 32 at device 0.0 on
> pci2 mpr0: IOCFacts : MsgVersion: 0x205
> HeaderVersion: 0x2300
> IOCNumber: 0
> IOCExceptions: 0x0
> MaxChainDepth: 128
> NumberOfPorts: 1
> RequestCredit: 10240
> ProductID: 0x2221
> IOCRequestFrameSize: 32
> MaxInitiators: 32
> MaxTargets: 1024
> MaxSasExpanders: 42
> MaxEnclosures: 43
> HighPriorityCredit: 128
> MaxReplyDescriptorPostQueueDepth: 65504
> ReplyFrameSize: 32
> MaxVolumes: 0
> MaxDevHandle: 1106
> MaxPersistentEntries: 128
> mpr0: Firmware: 08.00.00.00, Driver: 09.255.01.00-fbsd
> mpr0: IOCCapabilities:
> 7a85c<ScsiTaskFull,DiagTrace,SnapBuf,EEDP,TransRetry,EventReplay,MSIXIndex,HostDisc>
>
> ----
>
> 08.00.00.00 is the latest available firmware.
>
>
> Since day one 'dmesg' is cluttered with CAM errors:
>
> ----
>
> mpr1: Sending reset from mprsas_send_abort for target ID 5
> (da11:mpr1:0:5:0): WRITE(10). CDB: 2a 00 4c 15 1f 88 00 00 08
> 00 length 4096 SMID 554 terminated ioc 804b scsi 0 state c xfer 0
> (da11:mpr1:0:5:0): ATA COMMAND PASS THROUGH(16). CDB: 85 0d 06 00 01 00
> 01 00 00 00 00 00 00 40 06 00 length 512 SMID 506 ter(da11:mpr1:0:5:0):
> READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00 minated ioc 804b scsi 0
> state c xfer 0 (da11:mpr1:0:5:0): CAM status: Command timeout mpr1:
> (da11:Unfreezing devq for target ID 5 mpr1:0:5:0): Retrying command
> (da11:mpr1:0:5:0): READ(10). CDB: 28 00 4c 2b 95 c0 00 00 10 00
> (da11:mpr1:0:5:0): CAM status: SCSI Status Error (da11:mpr1:0:5:0):
> SCSI status: Check Condition (da11:mpr1:0:5:0): SCSI sense: UNIT
> ATTENTION asc:29,0 (Power on, reset, or bus device reset occurred)
> (da11:mpr1:0:5:0): Retrying command (per sense data) (da11:mpr1:0:5:0):
> READ(10). CDB: 28 00 4c 22 b5 b8 00 00 18 00 (da11:mpr1:0:5:0): CAM
> status: SCSI Status Error (da11:mpr1:0:5:0): SCSI status: Check
> Condition (da11:mpr1:0:5:0): SCSI sense: UNIT ATTENTION asc:29,0 (Power
> on, reset, or bus device reset occurred) (da11:mpr1:0:5:0): Retrying
> command (per sense data) (noperiph:mpr1:0:4294967295:0): SMID 2
> Aborting command 0xfffffe0001601a30
>
> mpr1: Sending reset from mprsas_send_abort for target ID 2
> (da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00 00 30 00
> length 24576 SMID 898 terminated ioc 804b scsi 0 state c xfer 0
> (da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 77 cc e0 00 00 18 00 length
> 12288 SMID 604 terminated ioc 804b scsi 0 state c xfer 0 mpr1:
> Unfreezing devq for target ID 2 (da8:mpr1:0:2:0): ATA COMMAND PASS
> THROUGH(16). CDB: 85 0d 06 00 01 00 01 00 00 00 00 00 00 40 06 00
> (da8:mpr1:0:2:0): CAM status: Command timeout (da8:mpr1:0:2:0):
> Retrying command (da8:mpr1:0:2:0): WRITE(10). CDB: 2a 00 59 81 ae 18 00
> 00 30 00 (da8:mpr1:0:2:0): CAM status: SCSI Status Error
> (da8:mpr1:0:2:0): SCSI status: Check Condition (da8:mpr1:0:2:0): SCSI
> sense: UNIT ATTENTION asc:29,0 (Power on, reset, or bus device reset
> occurred) (da8:mpr1:0:2:0): Retrying command (per sense data)
> (da8:mpr1:0:2:0): READ(10). CDB: 28 00 59 41 3d 08 00 00 10 00
> (da8:mpr1:0:2:0): CAM status: SCSI Status Error (da8:mpr1:0:2:0): SCSI
> status: Check Condition (da8:mpr1:0:2:0): SCSI sense: UNIT ATTENTION
> asc:29,0 (Power on, reset, or bus device reset occurred)
> (da8:mpr1:0:2:0): Retrying command (per sense data)
> (noperiph:mpr1:0:4294967295:0): SMID 3 Aborting command
> 0xfffffe000160b660
>
> ----
>
> ZFS doesn't like this and sees read errors or even write errors. In
> extreme cases the device is marked as FAULTED:
>
> ----
>
> pool: examplepool
> state: DEGRADED
> status: One or more devices are faulted in response to persistent
> errors. Sufficient replicas exist for the pool to continue functioning
> in a degraded state.
> action: Replace the faulted device, or use 'zpool clear' to mark the
> device repaired.
> scan: none requested
> config:
>
> NAME             STATE     READ WRITE CKSUM
> examplepool      DEGRADED     0     0     0
>   raidz1-0       ONLINE       0     0     0
>     da3p1        ONLINE       0     0     0
>     da4p1        ONLINE       0     0     0
>     da5p1        ONLINE       0     0     0
> logs
>   da1p1          FAULTED      3     0     0  too many errors
> cache
>   da1p2          FAULTED      3     0     0  too many errors
> spares
>   da2p1          AVAIL
>
> errors: No known data errors
>
> ----
>
> The problems arise on all 3 machines and on all SSDs nearly daily, so
> I highly suspect a software issue. Does anyone have an idea what's
> going on and what I can do to solve these problems? More information
> can be provided if necessary.
>
> Regards,
> Yamagi
>
> --
> Homepage: www.yamagi.org
> XMPP: yam...@yamagi.org
> GnuPG/GPG: 0xEFBCCBCB


--
Homepage: www.yamagi.org
XMPP: yam...@yamagi.org
GnuPG/GPG: 0xEFBCCBCB


Steven Hartland

Jul 13, 2015, 5:13:55 AM
That would indicate that TRIM on your disks is causing a problem,
possibly a firmware bug causing TRIM requests to take an excessively
long time to complete.

What do you see from:
sysctl -a | grep -E '(delete|trim)'

Also, while you're seeing time-outs, what does the output from
gstat -d -p look like?

Regards
Steve

Yamagi Burmeister

Jul 13, 2015, 5:26:11 AM
On Mon, 13 Jul 2015 10:13:32 +0100
Steven Hartland <kil...@multiplay.co.uk> wrote:

> What do you see from:
> sysctl -a | grep -E '(delete|trim)'

% sysctl -a | grep -E '(delete|trim)'
kern.geom.dev.delete_max_sectors: 262144
kern.cam.da.1.delete_max: 8589803520
kern.cam.da.1.delete_method: ATA_TRIM
kern.cam.da.8.delete_max: 12884705280
kern.cam.da.8.delete_method: ATA_TRIM
kern.cam.da.9.delete_max: 12884705280
kern.cam.da.9.delete_method: ATA_TRIM
kern.cam.da.3.delete_max: 12884705280
kern.cam.da.3.delete_method: ATA_TRIM
kern.cam.da.12.delete_max: 12884705280
kern.cam.da.12.delete_method: ATA_TRIM
kern.cam.da.7.delete_max: 12884705280
kern.cam.da.7.delete_method: ATA_TRIM
kern.cam.da.2.delete_max: 12884705280
kern.cam.da.2.delete_method: ATA_TRIM
kern.cam.da.11.delete_max: 12884705280
kern.cam.da.11.delete_method: ATA_TRIM
kern.cam.da.6.delete_max: 12884705280
kern.cam.da.6.delete_method: ATA_TRIM
kern.cam.da.10.delete_max: 12884705280
kern.cam.da.10.delete_method: ATA_TRIM
kern.cam.da.5.delete_max: 12884705280
kern.cam.da.5.delete_method: ATA_TRIM
kern.cam.da.0.delete_max: 8589803520
kern.cam.da.0.delete_method: ATA_TRIM
kern.cam.da.4.delete_max: 12884705280
kern.cam.da.4.delete_method: ATA_TRIM
vfs.zfs.trim.max_interval: 1
vfs.zfs.trim.timeout: 30
vfs.zfs.trim.txg_delay: 32
vfs.zfs.trim.enabled: 1
vfs.zfs.vdev.trim_max_pending: 10000
vfs.zfs.vdev.bio_delete_disable: 0
vfs.zfs.vdev.trim_max_active: 64
vfs.zfs.vdev.trim_min_active: 1
vfs.zfs.vdev.trim_on_init: 1
kstat.zfs.misc.arcstats.deleted: 289783817
kstat.zfs.misc.zio_trim.failed: 431
kstat.zfs.misc.zio_trim.unsupported: 0
kstat.zfs.misc.zio_trim.success: 6457142235
kstat.zfs.misc.zio_trim.bytes: 88207753330688


> Also, while you're seeing time-outs, what does the output from
> gstat -d -p look like?

I'll try to get that data but it may take a while.

Thank you,
Yamagi

--
Homepage: www.yamagi.org
XMPP: yam...@yamagi.org
GnuPG/GPG: 0xEFBCCBCB

Steven Hartland

Jul 13, 2015, 5:55:09 AM
I assume da0 and da1 are different disks then?

With regard to your disk setup: are all of your disks SSDs? If so, why
do you have separate log and cache devices?

One thing you could try is to limit the delete size.

kern.geom.dev.delete_max_sectors limits the size of a single request
allowed by geom, but individual requests can be built back up in CAM,
so I don't think this will help you too much.
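
(For scale, and assuming 512-byte sectors: 262144 sectors works out to
262144 x 512 = 128 MiB per geom request, while the per-device
delete_max values in your sysctl output allow roughly 12 GB, so CAM has
plenty of room to rebuild much larger deletes.)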

Instead I would try limiting the individual device delete_max, so add
one line per disk into /boot/loader.conf of the form:
kern.cam.da.X.delete_max=1073741824

You can actually change these on the fly using sysctl, but in order to
catch any cleanup done on boot, loader.conf is the best place to tune
them permanently.
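
For example (illustrative only; the da numbers and the 1 GiB value
would need to match your setup):

# /boot/loader.conf -- one line per SSD behind the HBA
kern.cam.da.2.delete_max="1073741824"
kern.cam.da.3.delete_max="1073741824"

# or, on the fly for a single disk:
sysctl kern.cam.da.2.delete_max=1073741824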

I've attached a little C utility which you can use to do direct disk
deletes if you have a spare disk you can play with.
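
In case the attachment doesn't make it through the list, here is a
minimal sketch of such a utility for FreeBSD. It is not necessarily
identical to the attached ioctl-delete.c; it just illustrates the idea
using the DIOCGDELETE ioctl from <sys/disk.h>, with the device path,
byte offset and length taken from the command line:

/*
 * Sketch of a direct-delete (TRIM) test tool: issue a BIO_DELETE for a
 * byte range on a disk device via the DIOCGDELETE ioctl.
 */
#include <sys/types.h>
#include <sys/disk.h>
#include <sys/ioctl.h>

#include <err.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
        off_t arg[2];
        int fd;

        if (argc != 4) {
                fprintf(stderr, "usage: %s <device> <offset> <length>\n",
                    argv[0]);
                return (1);
        }

        fd = open(argv[1], O_WRONLY);
        if (fd < 0)
                err(1, "open %s", argv[1]);

        arg[0] = strtoll(argv[2], NULL, 10);    /* byte offset into device */
        arg[1] = strtoll(argv[3], NULL, 10);    /* number of bytes to delete */

        /*
         * Delete (TRIM/UNMAP) the given range; on an SSD behind da(4)
         * this is subject to the delete_max / delete_max_sectors
         * limits discussed above.
         */
        if (ioctl(fd, DIOCGDELETE, arg) == -1)
                err(1, "DIOCGDELETE on %s", argv[1]);

        close(fd);
        return (0);
}

Invoked as e.g. './ioctl-delete /dev/da2 0 1073741824' against a spare
disk, after writing real data to that range first (see the next point).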

Be aware that most controllers optimise deletes out if they know the
cells are empty, hence you do need to have written data to the sectors
each time you test a delete.

As the requests go through geom, anything over
kern.geom.dev.delete_max_sectors will be split, but may well be
recombined in CAM.

Another relevant setting is vfs.zfs.vdev.trim_max_active, which can be
used to limit the number of outstanding geom delete requests to each
device.

Oh, one other thing: it would be interesting to see the output from
camcontrol identify <device>, e.g.
camcontrol identify da8
camcontrol identify da0

Regards
Steve
[attachment: ioctl-delete.c]

Yamagi Burmeister

Jul 27, 2015, 6:02:52 AM
Hello,
let me apologise for my late answer. My colleagues were on vacation and
I had no time to pursue this problem. But now, another round:

- da0 and da1 are 80G Intel DC S3500 SSDs
- All other devices are 800G Intel DC S3700 SSDs

kern.cam.da.X.delete_max seems very high for all devices. Forcing the
tunable to a very conservative value of 65536 helps: no timeouts in 72
hours and no measurable performance impact. So my guess is:

- ZFS tries to TRIM too many blocks in one operation
- The SSD blocks for some time while processing the TRIM command
- The controller thinks that the SSD crashed and sends a reset

I did some tests with the attached tool. I'm able to reproduce the
timeouts when "enough" data has been written to the device. The
question is what counts as "enough": sometimes 50G is enough to trigger
them, and sometimes 75G can be trimmed without any problem.

Nevertheless, a lower kern.cam.da.X.delete_max value helps to work
around the problem, and everything else is just speculation. So:
problem solved. Thank you for your help and input.

Regards,
Yamagi


--
Homepage: www.yamagi.org
XMPP: yam...@yamagi.org
GnuPG/GPG: 0xEFBCCBCB

Steven Hartland

Jul 28, 2015, 4:45:58 AM
65536 is really small and will likely cause a bottleneck when TRIMming
large areas.

I'd suggest a larger value, 2 - 4 MB, but at least matching
kern.geom.dev.delete_max_sectors (262144).
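
A sketch of what that might look like in /boot/loader.conf, again one
line per disk (value illustrative; 4194304 bytes = 4 MB):

kern.cam.da.2.delete_max="4194304"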

Regards
Steve