Summary/Questions:
1. Is the extremely large minimum-IO value of 256KB for the dom0 block devices representing Q4 VM volumes in the thin pool ... intentional?
2. If so, for what purpose (e.g. performance)?
3. If so, has the effect of this value on discards, which we depend on to return unused disk space to the pool, been factored in?
---discussion and supporting cmd output follows---
As you can see below, MIN-IO (minimum IO size for R/W/D) and DISC-GRAN (minimum size allowed for discard/trim commands) are both set to 256KB on most of the thin pool volumes, including the Debian 9 VM volumes and the dom0 root volume. The same is true of all the VM volumes that I cut out of the output below for brevity/privacy.
Everything else in the stack, the drives, partitions, luks/crypt container and even some of the non-VM filesystem pool volumes and/or metadata have much more reasonable MIN-IO and DISC-GRAN values of 512 bytes or 4K...including dom0 swap!
The result is that turning on automatic trimming of the filesystems within VMs requires large holes to be created on the virtual disk before triggering discards that can be transmitted down the stack during deletions. To rephrase: in the default configuration, for data to be recovered from VM volumes back into the pool after deletions, the deletions must include files with large contiguous sections. Also, this negatively impacts physical disk trimming, if the user has configured it.
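To make the arithmetic concrete, here is a minimal sketch of how much of a freed byte range could be discarded at a given granularity. It assumes, as described above, that only whole, aligned granularity-sized chunks are passed down the stack; discardable_bytes is a hypothetical helper name, not part of any tool.

```shell
#!/bin/sh
# Hypothetical helper: for a freed range [start, start+len) print how many
# bytes lie in chunks that are fully covered and aligned to the granularity.
# Assumption: partial or misaligned chunks are simply not discarded.
discardable_bytes() {
    start=$1; len=$2; gran=$3
    end=$((start + len))
    first=$(( (start + gran - 1) / gran * gran ))  # round start up to a boundary
    last=$(( end / gran * gran ))                  # round end down to a boundary
    if [ "$last" -gt "$first" ]; then
        echo $((last - first))
    else
        echo 0
    fi
}

# A 256K hole starting 4K into the volume, with 256K granularity:
discardable_bytes 4096 262144 262144    # prints 0
# The same hole with 4K granularity:
discardable_bytes 4096 262144 4096      # prints 262144
```

So under that assumption a freed 256K region that is not aligned to a 256K boundary returns nothing to the pool, while the same region is fully reclaimable at 4K granularity.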
The 256K value may explain why folks have found that manually invoking 'sudo fstrim -av' is the only guaranteed way to trigger full release of storage back into the pool from VMs. That leaves users who do not regularly trim from inside their VMs at risk of the pool running out of room.
--command-to-create-output--
% lsblk -o TYPE,ROTA,FSTYPE,PHY-SEC,LOG-SEC,MIN-IO,DISC-GRAN,DISC-ZERO,VENDOR,MODEL,REV,NAME -x NAME -x TYPE|sort|uniq
--trimmed-output--
TYPE  ROTA FSTYPE      PHY-SEC LOG-SEC MIN-IO DISC-GRAN DISC-ZERO VENDOR MODEL            REV  NAME
crypt 0    LVM2_member 512     512     512    512B      0                                      luks-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
disk  0                4096    512     4096   4K        0         ATA    xxxxxxxxxxxxxxxx xxxx sdc
disk  0                4096    512     4096   4K        0         ATA    xxxxxxxxxxxxxxxx xxxx sdb
disk  0                512     512     512    512B      0         ATA    xxxxxxxxxxxxxxxx xxxx sda
loop  1    ext3        512     512     512    4K        0                                      loop0
...
lvm   0                512     512     262144 256K      0                                      qubes_dom0-vm--debian--9--private
lvm   0                512     512     262144 256K      0                                      qubes_dom0-vm--debian--9--root
...
lvm   0                512     512     262144 512B      0                                      qubes_dom0-pool00
lvm   0                512     512     262144 512B      0                                      qubes_dom0-pool00-tpool
lvm   0                512     512     512    512B      0                                      qubes_dom0-pool00_tdata
lvm   0                512     512     512    512B      0                                      qubes_dom0-pool00_tmeta
lvm   0    ext4        512     512     262144 256K      0                                      qubes_dom0-root
lvm   0    swap        512     512     512    512B      0                                      qubes_dom0-swap
...
part  0    crypto_LUKS 512     512     512    512B      0                                      sda2
part  0    ext4        512     512     512    512B      0                                      sda1
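The same two values can also be read straight from sysfs, which may be handy for scripting. This is only a sketch: report_discard_limits is a hypothetical helper name, and it assumes the kernel exposes queue/minimum_io_size and queue/discard_granularity for each block device and dm/name for device-mapper volumes.

```shell
#!/bin/sh
# Hypothetical helper: print "name min-io discard-granularity" (both in bytes)
# for every block device found under the given sysfs root.
report_discard_limits() {
    root=${1:-/sys/block}
    for q in "$root"/*/queue; do
        [ -r "$q/discard_granularity" ] || continue
        dev=$(basename "$(dirname "$q")")
        name=$dev
        # Device-mapper volumes show up as dm-N; prefer their mapped name.
        if [ -r "$root/$dev/dm/name" ]; then
            name=$(cat "$root/$dev/dm/name")
        fi
        printf '%s %s %s\n' "$name" \
            "$(cat "$q/minimum_io_size")" \
            "$(cat "$q/discard_granularity")"
    done
}

# Usage (in dom0 or a VM): report_discard_limits
```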
For fun, here's a one-liner I put together to keep an eye on discards on /dev/sd* or /dev/xvd* volumes in dom0 or a VM as I was exploring this issue.
watch -n 1 -d \
"if [ -d /sys/block/sda ] ; then pat=sd ; else pat=xvd ; fi ; sync;echo --DISCARD TOTALS--;cat /sys/block/\$pat*/stat|awk 'BEGIN {print \"DRIVE IOS QMERGES SECTORS MSWAITS\"} {printf \"%5i %-8s %s %15s %11s\\n\",NR,\$12,\$13,\$14,\$15}'"
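For anyone who finds the one-liner hard to read, here is a spelled-out sketch of the same idea. It assumes a kernel new enough to report discard counters in /sys/block/*/stat, where fields 12 to 15 are discard I/Os, merges, sectors and wait time, and sectors are 512-byte units. discard_totals is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical helper: print "device discard-ios discard-bytes" for each
# block device under the given sysfs root. Devices whose stat file has
# fewer than 15 fields (no discard counters) are skipped.
discard_totals() {
    root=${1:-/sys/block}
    for f in "$root"/*/stat; do
        [ -r "$f" ] || continue
        dev=$(basename "$(dirname "$f")")
        # Field 12 = discard I/Os completed, field 14 = sectors discarded.
        awk -v dev="$dev" 'NF >= 15 { printf "%s %s %d\n", dev, $12, $14 * 512 }' "$f"
    done
}

# Usage: run discard_totals before and after a deletion and compare the counts.
```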
Brendan
It would be interesting if thin-lvm min transfer were the reason for
this difference in behavior between fstrim and the filesystem.
However, I think you're wrong to assume that any free block at any scale
should be discarded at the lvm level. This behavior is probably a
feature designed to prevent pool metadata use from exploding to the
point where the volume becomes slow or unmanageable. Controlling
metadata size is a serious issue with COW storage systems and at some
point compromises must be made between data efficiency and metadata
efficiency.
On thin-lvm volumes, maxing-out the allocated metadata space can have
serious consequences including loss of the entire pool. I experienced
this myself several weeks ago and I was just barely able to manage
recovery without reinstalling the whole system – it involved deleting
and re-creating the thin-pool, then restoring all the volumes from backup.
Run the 'lvs' command and look at the Meta% column for pool00. If it's
much more than 50% there is reason for concern, because if you put the
system through a flurry of activity including cloning/snapshotting
and/or modifying many small files then that figure could balloon close
to 100% in a very short period.
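A rough sketch of automating that check follows. check_meta is a hypothetical helper and the 50% default is just the rule of thumb above, not an LVM-defined limit. It expects the comma-separated output of lvs with the lv_name and metadata_percent fields on stdin.

```shell
#!/bin/sh
# Hypothetical helper: read "lv_name,metadata_percent" lines on stdin and
# print any volume whose metadata usage exceeds the threshold (default 50).
# Volumes with no metadata figure (non-pool LVs) are ignored.
check_meta() {
    threshold=${1:-50}
    awk -F, -v t="$threshold" '{
        gsub(/ /, "", $1); gsub(/ /, "", $2)
        if ($2 != "" && $2 + 0 > t) print $1, $2
    }'
}

# Usage (needs root in dom0):
#   sudo lvs --noheadings --separator , -o lv_name,metadata_percent | check_meta 50
```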
Did you mean to say "optimal" or did the docs really say that larger cluster sizes are optional?
In any case, I think the docs I read had a bit of a hedge: they stated a larger value is better for performance, but a smaller value may be better for heavy COW usage.
> This isn't a Qubes choice - it's Fedora, (and, I think, dependent on the
> size of the pool.)
Sure.
Sounds like a lot of folks around here want to get out from under the thumb of Big Red. :P
Brendan
PS - opened three new issues...
#5053 - Qubes Disk Widget - please inform about/alert on pool metadata space
#5054 - LVM Pool metadata default size is too small
#5055 - Add fstrim on shutdown of templates and template-based VMs
Yup.
One could argue that the same solution could be *actively* applied to prevent running out of free space. :) My recollection is that my old Drobo used to do this (for free space, though presumably both).
> This simply makes
> de-fragmentation a maintenance issue (defrag to shrink metadata and get
> performance back). This is what Microsoft did with NTFS and it was the
> right choice; clinging to fixed metadata sizes is merely a state of
> denial that leads to peoples' disks suddenly becoming unusable.
Lack of COW aside, NTFS's odd "separation plus mixing" of storage and metadata is fascinating. I mean, it works! And works pretty well! And is ancient!
It *does* keep you on your toes, though, mitigating for forensics..."NTFS: oh, you have a small file? Well, I'll just store that over here in the metadata stream. You want to delete it? Sure, I'll mark it deleted. Erasing free space? go right ahead, I'll be over here waiting. Oh, it's still here? Well...better talk to Mr. Russinovich if you want to figure out how to really destroy that file..."*
-Brendan
* upon review, I read that in the Q voice, for maximum nerdiness.