
CAM Target over FC and UNMAP problem


Emil Muratov

Mar 5, 2015, 12:10:24 PM

I've got an issue with CTL UNMAP and zvol backends.
It seems that an UNMAP from the initiator, passed down to the underlying
disks (which have no TRIM support), causes I/O blocking for the whole pool.
I'm not sure where to address this problem.

My setup:
- plain SATA 7200 rpm drives attached to an Adaptec aacraid SAS controller
- zfs raidz pool over plain drives, no partitioning
- zvol created with volmode=dev
- Qlogic ISP 2532 FC HBA in target mode
- FreeBSD 10.1-STABLE #1 r279593

Create a new LUN with a zvol backend

ctladm realsync off
ctladm port -o on -p 5
ctladm create -b block -o file=/dev/zvol/wd/tst1 -o unmap=on -l 0 \
    -d wd.tst1 -S tst1
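
For completeness, the port and LUN (and the options that took effect) can
be checked with the standard ctladm listing subcommands:

ctladm portlist
ctladm devlist -v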

Both the target and initiator hosts are connected to the FC fabric. The
initiator is a Win2012 server; actually it is a VM with an RDM LUN passed
to the guest OS. Formatting, reading and writing large amounts of data
(file copy/IOmeter) all work fine. But as soon as I try to delete large
files, all I/O to the LUN blocks and the initiator system just sits in
iowait. gstat on the target shows that the underlying disk load jumps to
100% with a queue of up to 10, but no writes are actually in progress,
only a fair amount of reads. After a minute or so I/O unblocks for a
second or two, then blocks again, and so on until all the UNMAPs are done;
it can take up to 5 minutes to delete a 10 GB file. The 'logicalused'
property of the zvol shows that the deleted space was actually released.
The system log is filled with CTL messages:


kernel: (ctl2:isp1:0:0:3): ctlfestart: aborted command 0x12aaf4 discarded
kernel: (2:5:3/3): WRITE(10). CDB: 2a 00 2f d4 74 b8 00 00 08 00
kernel: (2:5:3/3): Tag: 0x12ab24, type 1
kernel: (2:5:3/3): ctl_process_done: 96 seconds
kernel: (ctl2:isp1:0:0:3): ctlfestart: aborted command 0x12afa4 discarded
kernel: (ctl2:isp1:0:0:3): ctlfestart: aborted command 0x12afd4 discarded
kernel: ctlfedone: got XPT_IMMEDIATE_NOTIFY status 0x36 tag 0xffffffff
seq 0x121104
kernel: (ctl2:isp1:0:0:3): ctlfe_done: returning task I/O tag 0xffffffff
seq 0x1210d4


I've tried tuning some sysctls, but with no success so far:

vfs.zfs.vdev.bio_flush_disable: 1
vfs.zfs.vdev.bio_delete_disable: 1
vfs.zfs.trim.enabled=0


Disabling UNMAP in CTL (-o unmap=off) resolves the issue completely, but
then there is no space reclamation for the zvol.

Any hints would be appreciated.




Alexander Motin

Mar 5, 2015, 2:16:52 PM
Hi.

On 05.03.2015 19:10, Emil Muratov wrote:
> I've got an issue with CTL UNMAP and zvol backends.
> It seems that an UNMAP from the initiator, passed down to the underlying
> disks (which have no TRIM support), causes I/O blocking for the whole pool.
> I'm not sure where to address this problem.

There is no direct relation between an UNMAP sent to a ZVOL and UNMAP/TRIM
sent to the underlying disks. A ZVOL UNMAP only frees some pool space,
which may later be trimmed if the disks support it.

> My setup:
> - plain SATA 7200 rpm drives attached to an Adaptec aacraid SAS controller
> - zfs raidz pool over plain drives, no partitioning
> - zvol created with volmode=dev
> - Qlogic ISP 2532 FC HBA in target mode
> - FreeBSD 10.1-STABLE #1 r279593

> Create a new LUN with a zvol backend
>
> ctladm realsync off

Are you sure you need this? Is your data really so uncritical that you
can ignore even explicit cache flushes?

> ctladm port -o on -p 5
> ctladm create -b block -o file=/dev/zvol/wd/tst1 -o unmap=on -l 0 \
>     -d wd.tst1 -S tst1

Just for the record, this configuration can now alternatively be done via
ctld and /etc/ctl.conf.
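
For reference, a minimal /etc/ctl.conf sketch of the same LUN (iSCSI
flavour; the target name and portal address are placeholders, and I have
not re-checked every keyword against the 10.1 ctl.conf(5) man page):

portal-group pg0 {
    discovery-auth-group no-authentication
    listen 0.0.0.0
}

target iqn.2015-03.org.example:tst1 {
    auth-group no-authentication
    portal-group pg0
    lun 0 {
        path /dev/zvol/wd/tst1
        option unmap on
    }
}
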
There were a number of complaints about UNMAP performance on the Illumos
lists too. Six months ago some fixes were committed and merged to
stable/10 that substantially improved the situation. Since then I haven't
observed problems with it in my tests.

As for the large amount of reads during UNMAP, I have two guesses:
1) It may be reads of metadata absent from the ARC, though I doubt there
is so much metadata that reading it would take several minutes.
2) If the UNMAP ranges are not aligned to the ZVOL block size, I guess ZFS
could try to read blocks that need a partial "unmap". I did an experiment
unmapping 512 bytes of an 8K ZVOL block, and it indeed zeroed the
specified 512 bytes, while from a SCSI perspective it would be fine to
just ignore the request.
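
To make guess 2) concrete, here is a small stand-alone shell calculation
(the LBA range is a made-up example, not taken from any trace) of how a
misaligned UNMAP splits against an 8K volblocksize into whole-block frees
plus partial head/tail pieces that have to be zeroed:

#!/bin/sh
vb=16                  # 8K volblocksize expressed in 512-byte LBAs
lba=100; nblks=1000    # hypothetical UNMAP range from the initiator
end=$(( lba + nblks ))
head=$(( (vb - lba % vb) % vb ))   # LBAs up to the first aligned boundary
tail=$(( end % vb ))               # LBAs past the last aligned boundary
echo "partial head: $head LBAs, partial tail: $tail LBAs (zeroed, may need read-modify-write)"
echo "whole 8K blocks actually freed: $(( (nblks - head - tail) / vb ))"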

--
Alexander Motin


Emil Muratov

Mar 6, 2015, 4:49:46 AM
On 05.03.2015 22:16, Alexander Motin wrote:
> Hi.
>
> On 05.03.2015 19:10, Emil Muratov wrote:
>> I've got an issue with CTL UNMAP and zvol backends.
>> It seems that an UNMAP from the initiator, passed down to the underlying
>> disks (which have no TRIM support), causes I/O blocking for the whole pool.
>> I'm not sure where to address this problem.
> There is no direct relation between an UNMAP sent to a ZVOL and UNMAP/TRIM
> sent to the underlying disks. A ZVOL UNMAP only frees some pool space,
> which may later be trimmed if the disks support it.
So as far as I understand, this must be purely a ZFS issue, not related
to CTL at all?

>
>> Create a new LUN with a zvol backend
>>
>> ctladm realsync off
> Are you sure you need this? Is your data really so uncritical that you
> can ignore even explicit cache flushes?
No, it's just for this test lab scenario. I'm not sure whether UNMAP
commands imply a sync or not, so I decided to take the chance, but it made
no difference anyway.

>
>> ctladm port -o on -p 5
>> ctladm create -b block -o file=/dev/zvol/wd/tst1 -o unmap=on -l 0 \
>>     -d wd.tst1 -S tst1
> Just for the record, this configuration can now alternatively be done
> via ctld and /etc/ctl.conf.
>
> There were a number of complaints about UNMAP performance on the Illumos
> lists too. Six months ago some fixes were committed and merged to
> stable/10 that substantially improved the situation. Since then I
> haven't observed problems with it in my tests.
Have you tried UNMAP on zvols with non-SSD backends too? I'm actively
testing this scenario now, but this issue makes it impossible to use UNMAP
in production: the blocking timeouts turn into I/O failures for the
initiator OS.

> As for the large amount of reads during UNMAP, I have two guesses:
> 1) It may be reads of metadata absent from the ARC, though I doubt there
> is so much metadata that reading it would take several minutes.
Just to be sure, I set up an SSD card, made an L2ARC cache on it and set
the zvol property 'secondarycache=metadata'. Then I ran the tests again:
according to gstat the SSD is almost idle for both reads and writes, but
the HDDs are still heavily loaded with reads.
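
For reference, the setup described above boils down to roughly these
commands (the cache device name da5 is a placeholder for my SSD):

zpool add wd cache da5
zfs set secondarycache=metadata wd/tst1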

> 2) If the UNMAP ranges are not aligned to the ZVOL block size, I guess
> ZFS could try to read blocks that need a partial "unmap". I did an
> experiment unmapping 512 bytes of an 8K ZVOL block, and it indeed zeroed
> the specified 512 bytes, while from a SCSI perspective it would be fine
> to just ignore the request.
Maybe I should take a closer look at this. I've tried to do my best to
align the upper-layer filesystem to the zvol blocks: I put a GPT on the
LUN, Win2012 should align it to 1M boundaries, then formatted an NTFS
partition with an 8K cluster size. As far as I can see there are no reads
from the zvol during heavy writes, but I will do some more tests to
investigate this point.
Besides, why should there be so many reads in the first place? Isn't it
enough to just update metadata to mark the unmapped blocks as free?
And most annoying of all, all I/O blocks for a while. I'm not an expert in
this area, but isn't there any way to reorder or delay those UNMAP ops, or
even drop them, if there are a lot of other pending I/Os?

Will be back with more test results later.


Alexander Motin

Mar 6, 2015, 5:28:23 AM
On 06.03.2015 11:49, Emil Muratov wrote:
> On 05.03.2015 22:16, Alexander Motin wrote:
>> On 05.03.2015 19:10, Emil Muratov wrote:
>>> I've got an issue with CTL UNMAP and zvol backends.
>>> It seems that an UNMAP from the initiator, passed down to the underlying
>>> disks (which have no TRIM support), causes I/O blocking for the whole pool.
>>> I'm not sure where to address this problem.
>> There is no direct relation between an UNMAP sent to a ZVOL and UNMAP/TRIM
>> sent to the underlying disks. A ZVOL UNMAP only frees some pool space,
>> which may later be trimmed if the disks support it.
> So as far as I understand, this must be purely a ZFS issue, not related
> to CTL at all?

I think so. CTL just tells ZFS to free the specified range of the ZVOL,
and so far nobody has shown that it does this incorrectly.

>> There were a number of complaints about UNMAP performance on the Illumos
>> lists too. Six months ago some fixes were committed and merged to
>> stable/10 that substantially improved the situation. Since then I
>> haven't observed problems with it in my tests.
> Have you tried UNMAP on zvols with non-SSD backends too? I'm actively
> testing this scenario now, but this issue makes it impossible to use
> UNMAP in production: the blocking timeouts turn into I/O failures for
> the initiator OS.

My primary test system is indeed all-SSD, but I do some testing on an
HDD-based system and will do more of it for UNMAP.

>> As for the large amount of reads during UNMAP, I have two guesses:
>> 1) It may be reads of metadata absent from the ARC, though I doubt there
>> is so much metadata that reading it would take several minutes.
> Just to be sure, I set up an SSD card, made an L2ARC cache on it and set
> the zvol property 'secondarycache=metadata'. Then I ran the tests again:
> according to gstat the SSD is almost idle for both reads and writes, but
> the HDDs are still heavily loaded with reads.

The L2ARC is empty on boot and is filled at a limited rate. You may need
to read the file several times before deleting it so that its metadata
gets into the L2ARC.
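
One way to see how much the L2ARC has actually been filled before
re-running the delete test (arcstats sysctl names as on FreeBSD 10.x):

sysctl kstat.zfs.misc.arcstats.l2_size \
       kstat.zfs.misc.arcstats.l2_hits \
       kstat.zfs.misc.arcstats.l2_misses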

>> 2) If the UNMAP ranges are not aligned to the ZVOL block size, I guess
>> ZFS could try to read blocks that need a partial "unmap". I did an
>> experiment unmapping 512 bytes of an 8K ZVOL block, and it indeed zeroed
>> the specified 512 bytes, while from a SCSI perspective it would be fine
>> to just ignore the request.
> Maybe I should take a closer look at this. I've tried to do my best to
> align the upper-layer filesystem to the zvol blocks: I put a GPT on the
> LUN, Win2012 should align it to 1M boundaries, then formatted an NTFS
> partition with an 8K cluster size. As far as I can see there are no reads
> from the zvol during heavy writes, but I will do some more tests to
> investigate this point.

You should check for reads not only during writes but also during
REwrites. If the initiator actively uses UNMAP, then even a misaligned
initial write may not cause a read-modify-write cycle, since there is
simply nothing to read yet.

> Besides, why should there be so many reads in the first place? Isn't it
> enough to just update metadata to mark the unmapped blocks as free?

As far as I can see in the ZFS code, if an UNMAP is not aligned to zvol
blocks, then the first and last blocks are not unmapped; instead the
affected parts are overwritten with zeroes. Those partial writes may
trigger a read-modify-write cycle if the data are not already in cache.
The SCSI spec allows a device to skip such zero writes, and I am thinking
about implementing such filtering at the CTL level.

> And most annoying of all, all I/O blocks for a while. I'm not an expert
> in this area, but isn't there any way to reorder or delay those UNMAP
> ops, or even drop them, if there are a lot of other pending I/Os?

That was not easy to do, but CTL should be clever about this now: it
should only block access to the blocks that are affected by the specific
UNMAP command. On the other hand, after fixing this issue at the CTL level
I've noticed that in ZFS an UNMAP also significantly affects the
performance of other commands to the same zvol.

To check CTL's possible role in this blocking, you may try adding
`option reordering unrestricted` to your LUN configuration. It makes CTL
not track any potential request collisions. If UNMAP still blocks other
I/Os after that, then all the questions go to ZFS.
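
For example, with the ctladm syntax used earlier in this thread (assuming
-o passes the "reordering" LUN option through the same way it does
"unmap"):

ctladm create -b block -o file=/dev/zvol/wd/tst1 -o unmap=on \
    -o reordering=unrestricted -l 0 -d wd.tst1 -S tst1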

--
Alexander Motin

Emil Muratov

Mar 18, 2015, 9:13:07 AM
On 06.03.2015 13:28, Alexander Motin wrote:
>>> There were a number of complaints about UNMAP performance on the Illumos
>>> lists too. Six months ago some fixes were committed and merged to
>>> stable/10 that substantially improved the situation. Since then I
>>> haven't observed problems with it in my tests.
>> Have you tried UNMAP on zvols with non-SSD backends too? I'm actively
>> testing this scenario now, but this issue makes it impossible to use
>> UNMAP in production: the blocking timeouts turn into I/O failures for
>> the initiator OS.
> My primary test system is indeed all-SSD, but I do some testing on an
> HDD-based system and will do more of it for UNMAP.
>
>

Hi!
I've made some progress with this issue using the iSCSI transport and
sniffing the initiator/target command and response traffic.
I found that the initiator requests the VPD 0xb0 (Block Limits) page and
then sends an UNMAP command with a long LBA range, and then times out
waiting for the response.
Interestingly, the ctladm option 'ublocksize' doesn't make any difference,
so I tried tweaking other values.
I'm not sure how this is supposed to work in the first place, but I found,
if not a solution for ZFS, then at least a workaround for CTL.
I looked through the ctl code and changed the hardcoded values for the
maximum unmap LBA count and maximum unmap block descriptor count to 8 MB
and 128 respectively.
With these values UNMAP works like a charm! No more I/O blocking, I/O
timeouts, log errors or high disk loads, only a moderate performance drop
during even very large unmaps. But that performance drop is nothing
compared to the all-blocking issues. No problems over the Fibre Channel
transport either.

I think it would be nice to at least have CTL options to tune these VPD
values (and maybe others), if not to change the hard-coded defaults.

Here are the options I ended up with:
ctladm create -o file=/dev/zvol/wd/zvol/zvl02 -o unmap=on \
    -o pblocksize=8k -o ublocksize=1m

From the initiator's side, the disk Block Limits VPD page:

$sg_vpd -p bl /dev/sdb
Block limits VPD page (SBC):
Write same no zero (WSNZ): 0
Maximum compare and write length: 255 blocks
Optimal transfer length granularity: 0 blocks
Maximum transfer length: 4294967295 blocks
Optimal transfer length: 2048 blocks
Maximum prefetch length: 0 blocks
Maximum unmap LBA count: 16384
Maximum unmap block descriptor count: 128
Optimal unmap granularity: 2048
Unmap granularity alignment valid: 1
Unmap granularity alignment: 0
Maximum write same length: 0xffffffffffffffff blocks

A patch for ctl.c

--- ./sys/cam/ctl/ctl.c.orig	2015-03-01 19:35:53.000000000 +0300
+++ ./sys/cam/ctl/ctl.c	2015-03-17 11:05:53.000000000 +0300
@@ -10327,9 +10327,11 @@
 	if (lun != NULL) {
 		bs = lun->be_lun->blocksize;
 		scsi_ulto4b(lun->be_lun->opttxferlen,
 		    bl_ptr->opt_txfer_len);
+		// Set Block Limits VPD Maximum unmap LBA count to 0x4000 (8 MB) and
+		// Maximum unmap block descriptor count to 128 (1 GB combined with the max LBA count).
 		if (lun->be_lun->flags & CTL_LUN_FLAG_UNMAP) {
-			scsi_ulto4b(0xffffffff, bl_ptr->max_unmap_lba_cnt);
-			scsi_ulto4b(0xffffffff, bl_ptr->max_unmap_blk_cnt);
+			scsi_ulto4b(0x4000, bl_ptr->max_unmap_lba_cnt);
+			scsi_ulto4b(0x80, bl_ptr->max_unmap_blk_cnt);
 			if (lun->be_lun->ublockexp != 0) {
 				scsi_ulto4b((1 << lun->be_lun->ublockexp),
 				    bl_ptr->opt_unmap_grain);

Alexander Motin

Mar 19, 2015, 10:02:32 AM
Hi.

On 18.03.2015 15:12, Emil Muratov wrote:
> I've made some progress with this issue using the iSCSI transport and
> sniffing the initiator/target command and response traffic.
> I found that the initiator requests the VPD 0xb0 (Block Limits) page and
> then sends an UNMAP command with a long LBA range, and then times out
> waiting for the response.
> Interestingly, the ctladm option 'ublocksize' doesn't make any
> difference, so I tried tweaking other values.
> I'm not sure how this is supposed to work in the first place, but I
> found, if not a solution for ZFS, then at least a workaround for CTL.
> I looked through the ctl code and changed the hardcoded values for the
> maximum unmap LBA count and maximum unmap block descriptor count to 8 MB
> and 128 respectively.
> With these values UNMAP works like a charm! No more I/O blocking, I/O
> timeouts, log errors or high disk loads, only a moderate performance
> drop during even very large unmaps. But that performance drop is nothing
> compared to the all-blocking issues. No problems over the Fibre Channel
> transport either.

In my present understanding of the SBC-4 specification, which is also what
the FreeBSD initiator implements, MAXIMUM UNMAP LBA COUNT is measured not
per segment but per command. From that perspective, limiting it to 8 MB
per UNMAP command is IMHO overkill. Could you try increasing it to
2097152, which is 1 GB, while decreasing MAXIMUM UNMAP BLOCK DESCRIPTOR
COUNT from 128 to 64? Would that give acceptable results?
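
For comparison, some quick shell arithmetic on the two sets of limits
(assuming 512-byte LBAs and reading MAXIMUM UNMAP LBA COUNT as a
per-command limit):

echo "patch above (0x4000 LBAs):    $(( 16384 * 512 / 1024 / 1024 )) MB per UNMAP command"
echo "suggested   (2097152 LBAs):   $(( 2097152 * 512 / 1024 / 1024 / 1024 )) GB per UNMAP command"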

--
Alexander Motin