Trim/discard unallocated thin pool space


brenda...@gmail.com

Jun 17, 2019, 10:28:19 AM6/17/19
to qubes-users
There is a tool in dom0 called "thin_trim", part of the "device-mapper-persistent-data" package. It issues discards to the unallocated space of a dm-thin device that is not in use. This gets a bit trickier if it is an lvm2 device, as there's another management layer above the dm-thin store, but it is possible to use against one as per the guidance here: https://github.com/jthornber/thin-provisioning-tools/issues/76

However, as you can see at the link: thin_trim can only be invoked against a *deactivated* thin pool. This is not possible in Qubes normal operation as dom0 lives *within* the thin pool with the other VMs** and therefore the pools cannot be deactivated.

There seems to be no other tool available to issue discards directly against a thin pool's unallocated space***. `lvremove` does not do so, no matter what the documentation states. Effectively this means that some contents of past VMs, including disposable VMs, are sitting around in the unallocated storage space.

I have a feature request open for an explicit blkdiscard call against LVs before lvremove is invoked, which addresses many but not all cases of remnant data. It would also be good hygiene to opportunistically issue discards against the unallocated thin pool space on a regular basis (e.g. weekly, on boot perhaps).

If I were to brainstorm a bit, there is presumably a point during boot (before pivoting away from the initial ramdisk) or during shutdown (after unmounting dom0's root) where one could potentially invoke the thin_trim command (if you ensured that it and associated libraries were accessible at that point). Any guidance on how one would do so?

Brendan

** might be an argument for dom0 to live in a separate pool?

*** other than explicitly filling the thin pool up to ~99.9% with random data (directly or via a VM-attached LV), then issuing discards against that before/during removal... which is not an efficient approach for time and wear reasons. I also tried adding a linear LV to the VG, to run blkdiscard against it... but the linear LV cannot encroach on the thin pool's allocation, so that wasn't helpful.

Chris Laprise

Jun 17, 2019, 11:16:05 AM6/17/19
to brenda...@gmail.com, qubes-users
On 6/17/19 10:28 AM, brenda...@gmail.com wrote:
> There is a tool in dom0 called "thin_trim" which is part of the "device-mapper-persistent-data" package. It issues discards to the unallocated space of a dm-thin device that is not in use. This gets a bit trickier if it is an lvm2 device, as there's another management layer above the dm-thin store, but it is possible to use against one as per guidance here: https://github.com/jthornber/thin-provisioning-tools/issues/76 )
>
> However, as you can see at the link: thin_trim can only be invoked against a *deactivated* thin pool. This is not possible in Qubes normal operation as dom0 lives *within* the thin pool with the other VMs** and therefore the pools cannot be deactivated.
>
> There seems to be no other tool available to issue discards directly against a thin pool's unallocated space***. `lvremove` does not do so, no matter what the documentation states. Effectively this means that some contents of past VMs, including disposable VMs, are sitting around in the unallocated storage space.

I would fully expect lvremove to issue discards, if lvm is configured
for it. Did you try changing /etc/lvm/lvm.conf so that "issue_discards =
1" ?

>
> I have a feature request open for an explicit blkdiscard call against LVs before lvremove is invoked, which addresses many but not all cases of remnant data. It would also be good hygiene to opportunistically issue discards against the unallocated thin pool space on a regular basis (e.g. weekly, on boot perhaps).
>
> If I were to brainstorm a bit, there is presumably a point during boot (before pivoting away from the initial ramdisk) or during shutdown (after unmounting dom0's root) where one could potentially invoke the thin_trim command (if you ensured that it and associated libraries were accessible at that point). Any guidance on how one would do so?

>
> Brendan
>
> ** might be an argument for dom0 to live in a separate pool?

I think Marek wanted in the future to move dom0 root to a regular
(static) lv in the same volume group. But also, it should be possible
for a user to perform this switch manually by creating a static lv,
copying dom0 root contents to new root, and then changing the necessary
grub entries. If you don't rename the old thin lv and name the static lv
'root', then you'll have to change fstab as well.

At that point, you might be able to deactivate any configured pools from
/lib/systemd/system-shutdown as this gets executed right after
filesystems become read-only. Although a systemd unit might get you
closer to the point you want to be.
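
Purely as a sketch of where such a hook could live: the script path, VG/pool names, and the thin_trim invocation details below are all assumptions (in particular, making the pool's hidden _tdata/_tmeta sub-LVs visible likely needs the dmsetup approach from the issue Brendan linked):

```shell
#!/bin/sh
# Hypothetical /usr/lib/systemd/system-shutdown/thin-trim.shutdown sketch.
# systemd runs these hooks after filesystems go read-only; $1 is the
# shutdown verb. qubes_dom0/pool00 is an assumed VG/pool name.
if [ "$1" = "poweroff" ] || [ "$1" = "reboot" ]; then
    # Deactivate the pool so thin_trim can run against it.
    lvchange -an qubes_dom0/pool00
    # The pool's hidden _tdata/_tmeta sub-LVs must be reachable here;
    # the linked thin-provisioning-tools issue uses dmsetup for this.
    thin_trim --metadata-dev /dev/mapper/qubes_dom0-pool00_tmeta \
              --data-dev /dev/mapper/qubes_dom0-pool00_tdata
fi
```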

But definitely try lvm.conf "issue_discards = 1" first. ;) There is also
the "thin_pool_discards" setting.
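
For reference, both settings live in /etc/lvm/lvm.conf; a minimal fragment (section placement per the stock lvm.conf comments):

```
devices {
    # Issue discards to the underlying device when an LV is removed or shrunk
    issue_discards = 1
}
allocation {
    # "passdown" forwards discards from the thin pool to the underlying device
    thin_pool_discards = "passdown"
}
```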

>
> *** other than explicitly filling the thin pool up to ~99.9% with random data (directly or via a VM-attached LV), then issuing discards against that before/during removal...which is not an efficient approach for time and wear reasons. I also tried adding a linear LV to the VG, to run blkdiscard again... but the linear LV cannot encroach on the thin pool's allocation, so that wasn't helpful.

The way to zero-fill would be with a thin lv. If you think of thin pool
as a filesystem, you would zero-fill an fs by creating a file. Just
don't max out the pool completely, or you may end up with an un-bootable
system.
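
For illustration, a dry-run sketch of that fill-then-discard cycle (the VG/pool name and 'scrub' LV are assumptions; commands are only printed, never executed):

```shell
# Dry-run sketch: print each step of the thin-LV fill/discard cycle.
# qubes_dom0/pool00 and the 'scrub' LV name are assumptions.
run() { echo "+ $*"; }

run lvcreate --thin -V 20G -n scrub qubes_dom0/pool00
# urandom rather than zeroes, since zero writes appeared not to
# allocate pool space in Brendan's test
run dd if=/dev/urandom of=/dev/qubes_dom0/scrub bs=1M
run blkdiscard /dev/qubes_dom0/scrub
run lvremove -y qubes_dom0/scrub
```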

--

Chris Laprise, tas...@posteo.net
https://github.com/tasket
https://twitter.com/ttaskett
PGP: BEE2 20C5 356E 764A 73EB 4AB3 1DC4 D106 F07F 1886

Chris Laprise

Jun 17, 2019, 11:32:50 AM6/17/19
to brenda...@gmail.com, qubes-users
FWIW, if there were any issues with data not being discarded, it would
be with the (size) mismatch between what ext4 considers a discarded
block and what the thin + lvm layers consider a discardable block or chunk.

If I wanted to solve this issue relatively quickly, I'd first consider
moving to btrfs. Its unified approach is more likely to process discards
completely.

brenda...@gmail.com

Jun 17, 2019, 11:38:55 AM6/17/19
to qubes-users
Chris - thanks for jumping on this. :)

On Monday, June 17, 2019 at 11:16:05 AM UTC-4, Chris Laprise wrote:
> I would fully expect lvremove to issue discards, if lvm is configured
> for it. Did you try changing /etc/lvm/lvm.conf so that "issue_discards =
> 1" ?

I've got that set (in dom0, with discards enabled through LUKS as well). As per my comments near the end of this github issue [ https://github.com/QubesOS/qubes-issues/issues/5077 ], I ran some experiments: put /dev/urandom data into files in VMs, blkdiscard them, and watch the discards occur down at the hardware layer using the one-liner below running in dom0; then put more data into the VMs, shut the VMs down, delete them using the Qubes tools, and watch for discards (I don't get any in the latter case).

The one-ish liner to run in dom0 and look for jumps in the discarded blocks count (also works within non-dom0 VMs, but less useful there for this):

#!/bin/bash
watch -n 1 -d \
"if [ -d /sys/block/sda ] ; then pat=sd ; else pat=xvd ; fi ; sync;echo --DISCARD TOTALS--;cat /sys/block/\$pat*/stat|awk 'BEGIN {print \"DRIVE IOS QMERGES SECTORS MSWAITS\"} {printf \"%5i %-8s %s %15s %11s\\n\",NR,\$12,\$13,\$14,\$15}'"

> > ** might be an argument for dom0 to live in a separate pool?
>
> I think Marek wanted in the future to move dom0 root to a regular
> (static) lv in the same volume group.

This makes sense to me.

> But also, it should be possible
> for a user to perform this switch manually by creating a static lv,
> copying dom0 root contents to new root, and then changing the necessary
> grub entries. If you don't rename the old thin lv and name the static lv
> 'root', then you'll have to change fstab as well.

Yeah, I'm playing a bit fast and loose with my daily qubes system, so not ready yet to do that. :) Might experiment a bit on another build.

As it is now, thin-pools can't be shrunk, so one would need to do some gymnastics to get this working on an existing system.

> At that point, you might be able to deactivate any configured pools from
> /lib/systemd/system-shutdown as this gets executed right after
> filesystems become read-only. Although a systemd unit might get you
> closer to the point you want to be.

I will investigate, thanks.

> But definitely try lvm.conf "issue_discards = 1" first. ;) There is also
> the "thin_pool_discards" setting.

As above, I'm already doing so. Let me know if discards happen on VM delete for you. If they *do*, then color me confused, as I've examined the stack a few times and I *DO* get discards out to the hardware from large file deletes inside VMs, but not during deletion of a VM that had large amounts of new data added to it.

> > *** other than explicitly filling the thin pool up to ~99.9% with random data (directly or via a VM-attached LV), then issuing discards against that before/during removal...which is not an efficient approach for time and wear reasons. I also tried adding a linear LV to the VG, to run blkdiscard again... but the linear LV cannot encroach on the thin pool's allocation, so that wasn't helpful.
>
> The way to zero-fill would be with a thin lv. If you think of thin pool
> as a filesystem, you would zero-fill an fs by creating a file. Just
> don't max out the pool completely, or you may end up with an un-bootable
> system.

Yeah, with all that you've said about thin-pool fragility (e.g. filling up the metadata), I'm a little skittish there.

Also, zero-fill seemed not to increase the pool's data usage in an experiment, whereas /dev/urandom did... but slower than I'd like, and I don't want to get bored and overshoot past 99% full. Or at least my memory is that urandom and zero gave me different results last week. Let me double-check.

Brendan

brenda...@gmail.com

Jun 17, 2019, 11:44:49 AM6/17/19
to qubes-users
On Monday, June 17, 2019 at 11:32:50 AM UTC-4, Chris Laprise wrote:
> FWIW, if there were any issues with data not being discarded, it would
> be with the (size) mismatch between what ext4 considers a discarded
> block and what the thin + lvm layers consider a discardable block or chunk.

Yes, this is why turning on discards in fstab inside VM templates really only impacts large file deletes inside the VMs, as the pool chunk size is >= 64KB while the filesystem block size inside the VMs is much smaller. Ensuring that one deletes all of the directories/files (big and small), *then* performing fstrim before shutdown, can partially address this.

Also, having the Qubes tooling explicitly issue a blkdiscard against a VM's persistent LVs *before* performing lvremove partially addresses this as well (for deleted or disposable VMs).
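
As a dry-run sketch of that proposed order (the LV name is hypothetical; commands are printed rather than executed):

```shell
# Print the proposed discard-then-remove order for a VM's private volume.
# The LV name vm-example-private is hypothetical.
run() { echo "+ $*"; }

run blkdiscard /dev/qubes_dom0/vm-example-private
run lvremove -y qubes_dom0/vm-example-private
```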

Partial solutions aren't great for this topic, though.

> If I wanted to solve this issue relatively quickly, I'd first consider
> moving to btrfs. Its unified approach is more likely to process discards
> completely.

I keep getting mixed messages on the stability of btrfs. :)

Brendan

Chris Laprise

Jun 17, 2019, 1:49:01 PM6/17/19
to brenda...@gmail.com, qubes-users
On 6/17/19 11:38 AM, brenda...@gmail.com wrote:
> Chris - thanks for jumping on this. :)
>
> On Monday, June 17, 2019 at 11:16:05 AM UTC-4, Chris Laprise wrote:
>> I would fully expect lvremove to issue discards, if lvm is configured
>> for it. Did you try changing /etc/lvm/lvm.conf so that "issue_discards =
>> 1" ?
>
> I've got that set (also in dom0 & luks as well). As per my comments near the end of this github issue [ https://github.com/QubesOS/qubes-issues/issues/5077 ], do some experiments putting /dev/urandom data into files in VMs, blkdiscard them, watch the discards occur down at the hardware layer using the one liner below running in dom0, then put more data into the VMs, shutdown the VMs and then delete the VMs using Qubes tools and watch for discards (I don't get any in the latter case).
>
> The one-ish liner to run in dom0 and look for jumps in the discarded blocks count (also works within non-dom0 VMs, but less useful there for this):
>
> #!/bin/bash
> watch -n 1 -d \
> "if [ -d /sys/block/sda ] ; then pat=sd ; else pat=xvd ; fi ; sync;echo --DISCARD TOTALS--;cat /sys/block/\$pat*/stat|awk 'BEGIN {print \"DRIVE IOS QMERGES SECTORS MSWAITS\"} {printf \"%5i %-8s %s %15s %11s\\n\",NR,\$12,\$13,\$14,\$15}'"
>
>>> ** might be an argument for dom0 to live in a separate pool?
>>
>> I think Marek wanted in the future to move dom0 root to a regular
>> (static) lv in the same volume group.
>
> This makes sense to me.
>
>> But also, it should be possible
>> for a user to perform this switch manually by creating a static lv,
>> copying dom0 root contents to new root, and then changing the necessary
>> grub entries. If you don't rename the old thin lv and name the static lv
>> 'root', then you'll have to change fstab as well.
>
> Yeah, I'm playing a bit fast and loose with my daily qubes system, so not ready yet to do that. :) Might experiment a bit on another build.
>
> As it is now, thin-pools can't be shrunk so one would need to do some gymnastics to get it working on an existing system.

You know what I did recently to get free space in the volume group...
The unnecessarily huge swap space the installer allocated was staring at
me, so I resized it down to 2GB. Depending on your RAM config at install
time, that could free up in excess of 8GB.
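
Hedged sketch of that resize (the swap LV name is an assumption, and a real run needs root with swap off, so the commands are printed as a dry run):

```shell
# Dry-run sketch: shrink the swap LV to 2GB to free extents in the VG.
# qubes_dom0/swap is an assumed LV name.
run() { echo "+ $*"; }

run swapoff /dev/qubes_dom0/swap
run lvresize -L 2G /dev/qubes_dom0/swap
run mkswap /dev/qubes_dom0/swap
run swapon /dev/qubes_dom0/swap
run vgs qubes_dom0    # confirm the newly freed space
```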

>
>> At that point, you might be able to deactivate any configured pools from
>> /lib/systemd/system-shutdown as this gets executed right after
>> filesystems become read-only. Although a systemd unit might get you
>> closer to the point you want to be.
>
> I will investigate, thanks.

I'll try after I reboot (had to enable 'issue_discards').

>
>> But definitely try lvm.conf "issue_discards = 1" first. ;) There is also
>> the "thin_pool_discards" setting.
>
> As above already doing so. Let me know if discards happen on VM delete for you. If it *does* then color me confused, as I've examined the stack a few times and I *DO* get discards out to the hardware from large file deletes inside VMs, but not during deletion of a VM that had large amounts of new data added to it.
>
>>> *** other than explicitly filling the thin pool up to ~99.9% with random data (directly or via a VM-attached LV), then issuing discards against that before/during removal...which is not an efficient approach for time and wear reasons. I also tried adding a linear LV to the VG, to run blkdiscard again... but the linear LV cannot encroach on the thin pool's allocation, so that wasn't helpful.
>>
>> The way to zero-fill would be with a thin lv. If you think of thin pool
>> as a filesystem, you would zero-fill an fs by creating a file. Just
>> don't max out the pool completely, or you may end up with an un-bootable
>> system.
>
> Yeah, with all that you've said about thin-pool fragility (e.g. filling up the metadata), I'm a little skittish there.

This is where I have to cast another vote for Btrfs. If the data fills
up it acts like a normal filesystem. It has metadata filling issues
also, but recovery is more robust than thin lvm.

As for the stability of Btrfs, it might not be up to ZFS' level, but I
consider it to be a step up from Ext4 + thin lv or even Ext4 alone.

> Also, zero fill seemed to be not increasing the data in an experiment, whereas /dev/urandom was increasing the data...but slower than I'd like and I don't want to get bored and overshoot past 99% full. Or at least my memory was that urandom and zero gave me different results last week. Let me double check.

I guess there is zero-detection going on to help save space. My approach
would be to create a limited amount of urandom data and just repeat it
in different files.
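
That suggestion might look like the following (paths and sizes are illustrative; a real run would repeat the chunk until the target is nearly full):

```shell
# Generate one 1 MiB urandom chunk, then repeat it to build a larger
# fill file cheaply (avoids both zero-detection and slow urandom reads).
tmpdir=$(mktemp -d)
dd if=/dev/urandom of="$tmpdir/chunk" bs=1M count=1 status=none
for i in 1 2 3 4 5 6 7 8; do
    cat "$tmpdir/chunk"
done > "$tmpdir/fill"
stat -c %s "$tmpdir/fill"    # size of the non-zero fill data, in bytes
```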