Q4.0 - LVM Thin Pool volumes - lsblk returns very large (256kb) MIN-IO and DISC-GRAN values


brenda...@gmail.com

May 24, 2019, 10:00:14 PM
to qubes-users
Hi folks,

Summary/Questions:

1. Is the extremely large minimum-IO value of 256KB for the dom0 block devices representing Q4 VM volumes in the thin pool intentional?
2. And if so, to what purpose (e.g. performance, etc.)?
3. And if so, has the impact of this value on relying on discards to return unused disk space to the pool been factored in?

---discussion and supporting cmd output follows---

As you can see below, MIN-IO (minimum I/O size for reads, writes, and discards) and DISC-GRAN (minimum size allowed for discard/trim commands) are both set to 256KB on most of the thin pool volumes. This is the case for the debian 9 VM and the dom0 root volume, and for all the VM volumes that I cut out of the output below for brevity/privacy.

Everything else in the stack, the drives, partitions, luks/crypt container and even some of the non-VM filesystem pool volumes and/or metadata have much more reasonable MIN-IO and DISC-GRAN values of 512 bytes or 4K...including dom0 swap!

The result is that turning on automatic trimming of the filesystems within VMs requires large holes to be created on the virtual disk before triggering discards that can be transmitted down the stack during deletions. To rephrase: in the default configuration, for data to be recovered from VM volumes back into the pool after deletions, the deletions must include files with large contiguous sections. Also, this negatively impacts physical disk trimming, if the user has configured it.
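To make the granularity effect concrete, here is a minimal sketch of the arithmetic. It assumes a simplified model in which only whole, aligned chunks inside a freed byte range can be handed back to the pool; the function name reclaimable_bytes is made up for illustration, and this is not dm-thin's actual code.

```shell
#!/bin/sh
# Simplified model (an assumption, not the kernel's implementation):
# a discard only frees pool space for chunks it covers completely.
# Usage: reclaimable_bytes START LENGTH GRANULARITY   (all in bytes)
reclaimable_bytes() (
    start=$1; len=$2; gran=$3
    end=$((start + len))
    first=$(( (start + gran - 1) / gran * gran ))  # round start up to a chunk boundary
    last=$(( end / gran * gran ))                  # round end down to a chunk boundary
    if [ "$last" -gt "$first" ]; then
        echo $((last - first))
    else
        echo 0
    fi
)

reclaimable_bytes 0 262144 262144       # aligned 256K hole      -> 262144
reclaimable_bytes 4096 262144 262144    # same hole, shifted 4K  -> 0
reclaimable_bytes 4096 1048576 262144   # 1 MiB hole, shifted 4K -> 786432
```

Under this model a freed 256K extent that is misaligned by a single 4K filesystem block returns nothing to the pool, which matches the observation that deletions need large contiguous holes.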

The 256K value may explain why folks have found that manually invoking 'sudo fstrim -av' is the only guaranteed way to trigger full release of storage back into the pool from VMs. That leaves users who do not regularly trim from inside their VMs at risk of the pool running out of room.

--command-to-create-output--
% lsblk -o TYPE,ROTA,FSTYPE,PHY-SEC,LOG-SEC,MIN-IO,DISC-GRAN,DISC-ZERO,VENDOR,MODEL,REV,NAME -x NAME -x TYPE|sort|uniq

--trimmed-output--
TYPE  ROTA FSTYPE      PHY-SEC LOG-SEC MIN-IO DISC-GRAN DISC-ZERO VENDOR MODEL            REV  NAME
crypt 0    LVM2_member 512     512     512    512B      0                                      luks-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
disk  0                4096    512     4096   4K        0         ATA    xxxxxxxxxxxxxxxx xxxx sdc
disk  0                4096    512     4096   4K        0         ATA    xxxxxxxxxxxxxxxx xxxx sdb
disk  0                512     512     512    512B      0         ATA    xxxxxxxxxxxxxxxx xxxx sda
loop  1    ext3        512     512     512    4K        0                                      loop0
...
lvm   0                512     512     262144 256K      0                                      qubes_dom0-vm--debian--9--private
lvm   0                512     512     262144 256K      0                                      qubes_dom0-vm--debian--9--root
...
lvm   0                512     512     262144 512B      0                                      qubes_dom0-pool00
lvm   0                512     512     262144 512B      0                                      qubes_dom0-pool00-tpool
lvm   0                512     512     512    512B      0                                      qubes_dom0-pool00_tdata
lvm   0                512     512     512    512B      0                                      qubes_dom0-pool00_tmeta
lvm   0    ext4        512     512     262144 256K      0                                      qubes_dom0-root
lvm   0    swap        512     512     512    512B      0                                      qubes_dom0-swap
...
part  0    crypto_LUKS 512     512     512    512B      0                                      sda2
part  0    ext4        512     512     512    512B      0                                      sda1

For fun, here's a one-liner I put together to keep an eye on discards on /dev/sd* or /dev/xvd* volumes in dom0 or a VM as I was exploring this issue.

watch -n 1 -d \
"if [ -d /sys/block/sda ] ; then pat=sd ; else pat=xvd ; fi ; sync;echo --DISCARD TOTALS--;cat /sys/block/\$pat*/stat|awk 'BEGIN {print \"DRIVE IOS QMERGES SECTORS MSWAITS\"} {printf \"%5i %-8s %s %15s %11s\\n\",NR,\$12,\$13,\$14,\$15}'"

Brendan

brenda...@gmail.com

May 24, 2019, 11:03:23 PM
to qubes-users
Looks like the chunksize of the pool is the controlling factor (256kb) here.

% lvs -o name,chunksize|grep pool

Docs say the default value is 64kb (that’s also the minimum for a thin pool). Not sure why the Qubes OS value is higher.
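One way to cross-check that the pool chunk size is what surfaces as MIN-IO and DISC-GRAN is to read the same queue limits straight from sysfs, which is where lsblk gets them. The following is a minimal sketch; the function name show_queue_limits and its optional root argument are illustrative assumptions, added only so the loop can be pointed at a test directory.

```shell
#!/bin/sh
# show_queue_limits [SYSROOT]: print minimum I/O size and discard granularity
# (both in bytes) for each block device under SYSROOT (default: /sys/block).
show_queue_limits() (
    root=${1:-/sys/block}
    for q in "$root"/*/queue; do
        # Skip entries that do not expose both attributes
        [ -r "$q/discard_granularity" ] || continue
        [ -r "$q/minimum_io_size" ] || continue
        dev=${q%/queue}     # strip the trailing /queue
        dev=${dev##*/}      # keep only the device name, e.g. dm-7
        printf '%-12s min_io=%-8s disc_gran=%s\n' "$dev" \
            "$(cat "$q/minimum_io_size")" "$(cat "$q/discard_granularity")"
    done
)

show_queue_limits
```

Device-mapper volumes show up here as dm-N; the matching LVM name can be read from /sys/block/dm-N/dm/name.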

Chris Laprise

May 25, 2019, 12:09:25 PM
to brenda...@gmail.com, qubes-users
On 5/24/19 10:00 PM, brenda...@gmail.com wrote:
> Hi folks,
>
> Summary/Questions:
>
> 1. Is the extremely large minimum-IO value of 256KB for the dom0 block devices representing Q4 VM volume in the thin pool ... intentional?
> 2. And if so, to what purpose (e.g. performance, etc.)?
> 3. And if so, has the impact of this value on depending on discards for returning unused disk space to the pool been factored in?
>
> ---discussion and supporting cmd output follows---
>
> As you can see below, the MIN-IO (minimum IO size for R/W/D) and DISC-GRAN (minimum size allowed for discard/trim commands) on most of the thin pool volumes are both set to 256KB. Shown below, you can see this is the case for the debian 9 VM and the dom0 root volume. Same for all the VM volumes that I cut out of the output below for brevity/privacy.
>
> Everything else in the stack, the drives, partitions, luks/crypt container and even some of the non-VM filesystem pool volumes and/or metadata have much more reasonable MIN-IO and DISC-GRAN values of 512 bytes or 4K...including dom0 swap!
>
> The result is that turning on automatic trimming of the filesystems within VMs requires large holes to be created on the virtual disk before triggering discards that can be transmitted down the stack during deletions. To rephrase: in the default configuration, for data to be recovered from VM volumes back into the pool after deletions, the deletions must include files with large contiguous sections. Also, this negatively impacts physical disk trimming, if the user has configured it.
>
> The 256K value may explain why folks have only found that manually invoking 'sudo fstrim /av' is the only guaranteed way to trigger full release of storage back into the pool from VMs, leaving users who do not regularly trim from inside their VMs at risk of the pool running out of room.

Hi Brendan,

It would be interesting if thin-lvm min transfer were the reason for
this difference in behavior between fstrim and the filesystem.

However, I think you're wrong to assume that any free block at any scale
should be discarded at the lvm level. This behavior is probably a
feature designed to prevent pool metadata use from exploding to the
point where the volume becomes slow or unmanageable. Controlling
metadata size is a serious issue with COW storage systems and at some
point compromises must be made between data efficiency and metadata
efficiency.

On thin-lvm volumes, maxing-out the allocated metadata space can have
serious consequences including loss of the entire pool. I experienced
this myself several weeks ago and I was just barely able to manage
recovery without reinstalling the whole system – it involved deleting
and re-creating the thin-pool, then restoring all the volumes from backup.

Run the 'lvs' command and look at the Meta% column for pool00. If it's
much more than 50% there is reason for concern, because if you put the
system through a flurry of activity including cloning/snapshotting
and/or modifying many small files then that figure could balloon close
to 100% in a very short period.
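The check described above could be scripted along these lines. This is a minimal sketch: the function name meta_status and the default 50% threshold are assumptions taken from the rule of thumb in this message, not an official Qubes or LVM tool.

```shell
#!/bin/sh
# meta_status NAME PERCENT [THRESHOLD]: classify one pool's metadata usage.
meta_status() (
    name=$1; pct=$2; limit=${3:-50}
    # awk does the comparison because lvs reports fractional percentages
    if awk -v p="$pct" -v l="$limit" 'BEGIN { exit !(p > l) }'; then
        echo "WARNING: $name metadata at ${pct}% (limit ${limit}%)"
    else
        echo "ok: $name metadata at ${pct}%"
    fi
)

# Sample values, chosen only for illustration
meta_status pool00 37.42
meta_status pool00 81.05

# On a real system (as root in dom0) the numbers could come from lvs, e.g.:
#   lvs --noheadings -o lv_name,metadata_percent qubes_dom0/pool00 |
#       while read -r name pct; do meta_status "$name" "$pct"; done
```

The volume group and pool names in the commented pipeline are the ones visible in the lsblk output earlier in the thread; adjust them for other layouts.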

--

Chris Laprise, tas...@posteo.net
https://github.com/tasket
https://twitter.com/ttaskett
PGP: BEE2 20C5 356E 764A 73EB 4AB3 1DC4 D106 F07F 1886

Brendan Hoar

May 25, 2019, 12:45:28 PM
to Chris Laprise, qubes-users
On Sat, May 25, 2019 at 12:09 PM Chris Laprise <tas...@posteo.net> wrote:

> It would be interesting if thin-lvm min transfer were the reason for
> this difference in behavior between fstrim and the filesystem.

Indeed. Pretty sure that is the case for some workloads.

> However, I think you're wrong to assume that any free block at any scale
> should be discarded at the lvm level. This behavior is probably a
> feature designed to prevent pool metadata use from exploding to the
> point where the volume becomes slow or unmanageable. Controlling
> metadata size is a serious issue with COW storage systems and at some
> point compromises must be made between data efficiency and metadata
> efficiency.

Agreed. I started with that assumption but as I read through the docs I realized there was some performance-related balancing going on.

> On thin-lvm volumes, maxing-out the allocated metadata space can have
> serious consequences including loss of the entire pool. I experienced
> this myself several weeks ago and I was just barely able to manage
> recovery without reinstalling the whole system – it involved deleting
> and re-creating the thin-pool, then restoring all the volumes from backup.

Ouch!

I’m going to add an Issue/Feature request to add metadata store monitoring and alerts to the disk space widget. :)

—-

I will note that the docs indicate that lvcreate uses the pool allocation size divided by the chunk size times a multiplier to determine the default metadata store size (assuming you don’t override the final value). So if you specify the chunk size the “default” metadata store is *supposed* to scale...
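As a rough illustration of that scaling, here is a minimal sketch of the estimate. The 64 bytes-per-chunk multiplier is an assumption based on my recollection of the lvcreate man page, the 200 GiB pool size is hypothetical, and est_meta_bytes is a made-up name; check the figures against the documentation for your LVM version.

```shell
#!/bin/sh
# est_meta_bytes POOL_BYTES CHUNK_BYTES [BYTES_PER_CHUNK]
# Estimate: (pool size / chunk size) * multiplier. The default multiplier
# of 64 is an assumption, not a value confirmed in this thread.
est_meta_bytes() (
    pool=$1; chunk=$2; per_chunk=${3:-64}
    echo $(( pool / chunk * per_chunk ))
)

GIB=$((1024 * 1024 * 1024))
for kib in 64 128 256; do
    bytes=$(est_meta_bytes $((200 * GIB)) $((kib * 1024)))
    echo "chunk ${kib}K -> about $((bytes / 1024 / 1024)) MiB of metadata"
done
```

The point of the sketch is the inverse relationship: halving the chunk size doubles the number of chunks to track, so the estimated metadata size doubles as well.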

One can also specify a safer (larger) metadata store during lvcreate at the expense of file storage of course.

I ran across a discussion of chunk size guidance and one thing I’ll note is that for heavy COW workloads the recommendation was to keep the chunk size value at the low end but be sure to increase the metadata store size. I’ll see if I can find it in my browser history.

> Run the 'lvs' command and look at the Meta% column for pool00. If its
> much more than 50% there is reason for concern, because if you put the
> system through a flurry of activity including cloning/snapshotting
> and/or modifying many small files then that figure could balloon close
> to 100% in a very short period.

Will do!

In the end I am just puzzled why the default chunk is 256k and not 64k, though. I haven’t found a place in the qubes installer iso source where the size is overridden.

I also ran across docs from red hat saying that the 7.4 to 7.5 rhel transition moved from a default of 64KB to 2MB (possibly due to upstream?)...so discard on delete’s usefulness inside VMs may be even more constrained in the future if I read that right.

I’ll probably open a feature ticket asking for auto fstrim of the mounted rw filesystems on templates/templated VM shutdowns. As it is, I already do this manually on templates after every update and from time to time in VMs that see a lot of file churn.

Brendan

awokd

May 25, 2019, 1:00:02 PM
to qubes...@googlegroups.com
Brendan Hoar wrote on 5/25/19 4:45 PM:
> On Sat, May 25, 2019 at 12:09 PM Chris Laprise <tas...@posteo.net> wrote:

> I’m going to add an Issue/Feature request to add metadata store monitoring
> and alerts to the disk space widget. :)

I had the same thought reading Chris' email.

Chris Laprise

May 25, 2019, 2:28:13 PM
to Brendan Hoar, qubes-users
Based on my experience (two metadata meltdowns since moving to Qubes 4)
I would open another issue to have Qubes double or triple the system's
default metadata size after installation. Proportionally, the loss of
data space is small and it's easy to implement using 'lvresize
--poolmetadatasize'.
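A minimal sketch of what doubling or tripling might look like follows. It only prints the command rather than running it; the helper name grow_meta_cmd and the 100 MiB starting size are hypothetical, and the pool name is taken from the lsblk output earlier in the thread.

```shell
#!/bin/sh
# grow_meta_cmd VG/POOL CURRENT_MIB [FACTOR]
# Dry run: print the lvresize command that would grow the pool's metadata
# LV from CURRENT_MIB to CURRENT_MIB * FACTOR (default factor: 2).
grow_meta_cmd() (
    pool=$1; cur_mib=$2; factor=${3:-2}
    add=$(( cur_mib * (factor - 1) ))   # extra space needed, in MiB
    echo "lvresize --poolmetadatasize +${add}M $pool"
)

grow_meta_cmd qubes_dom0/pool00 100      # double a hypothetical 100 MiB tmeta
grow_meta_cmd qubes_dom0/pool00 100 3    # triple it

# The current size could be looked up first, e.g.:
#   lvs --units m -o lv_name,lv_metadata_size qubes_dom0/pool00
```

Printing instead of executing keeps the sketch safe to try; the volume group needs enough free extents for the real command to succeed.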

>
> I ran across a discussion of chunk size guidance and one thing I’ll note
> is that for heavy COW workloads the recommendation was to keep the chunk
> size value at the low end but be sure to increase the metadata store
> size. I’ll see if I can find it in my browser history.
>
> Run the 'lvs' command and look at the Meta% column for pool00. If its
> much more than 50% there is reason for concern, because if you put the
> system through a flurry of activity including cloning/snapshotting
> and/or modifying many small files then that figure could balloon close
> to 100% in a very short period.
>
>
> Will do!
>
> In the end I am just puzzled why the default chunk is 256k and not 64k,
> though. I haven’t found a place in the qubes installer iso source where
> the size is overriden.

64k is the minimum but this increases when the pool size reaches certain
thresholds. On my system, it's 128k. As for Redhat switching to such a
large (2MB) minimum size, I think it should be regarded as throwing up
one's hands and giving up on the subject. IMO, it's too large and
shouldn't be used.
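The threshold behaviour can be sketched as follows. This assumes, from my recollection of the lvcreate man page, that the chunk size starts at 64KiB and grows so the estimated metadata fits within 128MiB, at an assumed 64 bytes per chunk; doubling as the growth step and the name pick_chunk_kib are also assumptions, so treat the numbers as illustrative.

```shell
#!/bin/sh
# pick_chunk_kib POOL_BYTES
# Assumed policy: start at 64 KiB and double the chunk size until
# (pool / chunk) * 64 bytes of metadata fits within 128 MiB.
pick_chunk_kib() (
    pool_bytes=$1
    limit=$((128 * 1024 * 1024))
    kib=64
    while [ $(( pool_bytes / (kib * 1024) * 64 )) -gt "$limit" ]; do
        kib=$((kib * 2))
    done
    echo "$kib"
)

GIB=$((1024 * 1024 * 1024))
for size in 100 200 400; do
    echo "${size} GiB pool -> $(pick_chunk_kib $((size * GIB)))K chunks"
done
```

Under these assumptions a pool between 128 and 256 GiB would get 128k chunks and one between 256 and 512 GiB would get 256k, which would be consistent with the two values reported in this thread.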

FWIW, Redhat's new COW storage system is a frankenstein patchwork using
xfs volumes like some kind of block layer. It looks about as elegant and
comprehensible as their other gift to the world, systemd. They need to
hire better engineers.

I think the only _good_ way to deal with COW metadata expansion, since
it's always related to data fragmentation, is to keep expanding it and
let system performance degrade accordingly. This simply makes
de-fragmentation a maintenance issue (defrag to shrink metadata and get
performance back). This is what Microsoft did with NTFS and it was the
right choice; clinging to fixed metadata sizes is merely a state of
denial that leads to peoples' disks suddenly becoming unusable.

>
> I also ran across docs from red hat saying the the 7.4 to 7.5 rhel
> transition moved from a default of 64KB to 2MB (possibly due to
> upstream?)...so discard on delete’s usefulness inside VMs may be even
> more constrained in the future if I read that right.

It's a good bet that "upstream" in this case is Redhat.

>
> I’ll probably open a feature ticket asking for auto fstrim of the
> mounted rw filesystems on templates/templated VM shutdowns. As it is, I
> already do this manually on templates after every update and from time
> to time in VMs that see a lot of file churn.

unman

May 25, 2019, 8:50:57 PM
to qubes-users
Docs also say that where a thin pool is used primarily for thin
provisioning a larger value is optional.

This isn't a Qubes choice - it's Fedora (and, I think, dependent on the
size of the pool).

brenda...@gmail.com

May 28, 2019, 8:26:06 AM
to qubes-users
On Saturday, May 25, 2019 at 8:50:57 PM UTC-4, unman wrote:
> Docs also say that where a thin pool is used primarily for thin
> provisioning a larger value is optional.

Did you mean to say "optimal" or did the docs really say that larger cluster sizes are optional?

In any case, I think the docs I read had a bit of a hedge: they stated a larger value is better for performance, but a smaller value may be better for heavy COW usage.



> This isnt a Qubes choice - it's Fedora, (and, I think, dependent on the
> size of the pool.)

Sure.

Sounds like a lot of folks around here want to get out from under the thumb of Big Red. :P

Brendan

PS - opened three new issues...

#5053 - Qubes Disk Widget - please inform about/alert on pool metadata space
#5054 - LVM Pool metadata default size is too small
#5055 - Add fstrim on shutdown of templates and template-based VMs

brenda...@gmail.com

May 28, 2019, 8:42:12 AM
to qubes-users
On Saturday, May 25, 2019 at 2:28:13 PM UTC-4, Chris Laprise wrote:
> I think the only _good_ way to deal with COW metadata expansion, since
> its always related to data fragmentation, is to keep expanding it and
> let system performance degrade accordingly.

Yup.

One could argue that the same solution could be *actively* applied to prevent running out of free space. :) My recollection is that my old Drobo used to do this (for free space, though presumably both).

> This simply makes
> de-fragmentation maintenance issue (defrag to shrink metadata and get
> performance back). This is what Microsoft did with NTFS and it was the
> right choice; clinging to fixed metadata sizes is merely a state of
> denial that leads to peoples' disks suddenly becoming unusable.

Lack of COW aside, NTFS's odd "separation plus mixing" of storage and metadata is fascinating. I mean, it works! And works pretty well! And is ancient!

It *does* keep you on your toes, though, mitigating for forensics..."NTFS: oh, you have a small file? Well, I'll just store that over here in the metadata stream. You want to delete it? Sure, I'll mark it deleted. Erasing free space? go right ahead, I'll be over here waiting. Oh, it's still here? Well...better talk to Mr. Russinovich if you want to figure out how to really destroy that file..."*

-Brendan
* upon review, I read that in the Q voice, for maximum nerdiness.

Chris Laprise

May 28, 2019, 7:12:35 PM
to brenda...@gmail.com, qubes-users
On 5/28/19 8:42 AM, brenda...@gmail.com wrote:
> On Saturday, May 25, 2019 at 2:28:13 PM UTC-4, Chris Laprise wrote:
>> I think the only _good_ way to deal with COW metadata expansion, since
>> its always related to data fragmentation, is to keep expanding it and
>> let system performance degrade accordingly.
>
> Yup.
>
> One could argue that the same solution could be *actively* applied to prevent running out of free space. :) My recollection is that my old Drobo used to do this (for free space, though presumably both).
>

It would not be too presumptuous for Qubes to define thin lvm metadata
as being in the same class as, say, filesystem metadata and let the
system consume available vg space as needed. The best way to start this
process is to lay down a rule saying you should only create/use one thin
pool per volume group on a Qubes system. If the vg is dedicated to
housing the one pool, then the gnarly questions around competing for
storage space disappear.

There is already an implicit policy that the user has to keep their eye
on overall used space in the pool because of over-provisioning. Just
extend that policy to 'volume group' and you can keep expanding tmeta
automatically.

The best possible (or at least better) implementation of this is to have
a hook in lvm block layer itself that can briefly halt file operations
while tmeta is expanded.

>> This simply makes
>> de-fragmentation maintenance issue (defrag to shrink metadata and get
>> performance back). This is what Microsoft did with NTFS and it was the
>> right choice; clinging to fixed metadata sizes is merely a state of
>> denial that leads to peoples' disks suddenly becoming unusable.
>
> Lack of COW aside, NTFS's odd "separation plus mixing" of storage and metadata is fascinating. I mean, it works! And works pretty well! And is ancient!

In my view, NTFS has "COW++". Shouldn't we in Unix land change our
terminology and start calling the combination of COW and snapshots....
shadow copy? :)

>
> It *does* keep you on your toes, though, mitigating for forensics..."NTFS: oh, you have a small file? Well, I'll just store that over here in the metadata stream. You want to delete it? Sure, I'll mark it deleted. Erasing free space? go right ahead, I'll be over here waiting. Oh, it's still here? Well...better talk to Mr. Russinovich if you want to figure out how to really destroy that file..."*
>
> -Brendan
> * upon review, I read that in the Q voice, for maximum nerdiness.
