Block device (LVM) as VM's disk image?

Vít Šesták

May 27, 2015, 3:30:41 PM
to qubes...@googlegroups.com
Hello,
is it possible to use a block device as a VM's image? I am interested in using LVM (on LUKS) as TemplateVM's root.img image and AppVM's private.img.

Reasons:

First, I have some idea about online backups. LVM allows atomic cloning…

Second, I'd like to reduce the overhead to a minimum, and I believe that LVM is better optimized for this purpose than image files on ext4.

I've tried to achieve it in two ways:

a) $ qvm-prefs -s debian-8 root_img=/dev/ssd/template-debian-8-root
ERROR: Wrong property name: 'root_img=/dev/ssd/template-debian-8-root'

b) By symlinking /dev/ssd/template-debian-8-root/root.img to /dev/ssd/template-debian-8-root (and by dd-ing all the data to that device). It also does not work, even after chmod on /dev/dm-<id>:
$ qvm-start untrusted                 
--> Creating volatile image: /var/lib/qubes/appvms/untrusted/volatile.img...
--> Loading the VM (type = AppVM)...
Traceback (most recent call last):
  File "/usr/bin/qvm-start", line 125, in <module>
    main()
  File "/usr/bin/qvm-start", line 109, in main
    xid = vm.start(verbose=options.verbose, preparing_dvm=options.preparing_dvm, start_guid=not options.noguid, notify_function=tray_notify_generic if options.tray else None)
  File "/usr/lib64/python2.7/site-packages/qubes/modules/000QubesVm.py", line 1719, in start
    self.libvirt_domain.createWithFlags(libvirt.VIR_DOMAIN_START_PAUSED)
  File "/usr/lib64/python2.7/site-packages/libvirt.py", line 1037, in createWithFlags
    if ret == -1: raise libvirtError ('virDomainCreateWithFlags() failed', dom=self)
libvirt.libvirtError: internal error: libxenlight failed to create new domain 'untrusted'

Is there any way to get it working?

Regards,
Vít Šesták 'v6ak'
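
For readers who want to reproduce attempt (b), the steps can be sketched roughly as below. This is a dry-run helper that only prints commands and executes nothing. The VM directory, the 10G size and the function name are assumptions for illustration, not values taken from this thread.

```shell
#!/bin/sh
# Dry-run sketch: print the commands that would move a VM image onto an LV
# and leave a symlink in its place. Nothing is executed.
# Arguments: image path, volume group, logical volume name, size.
plan_image_to_lv() {
    img=$1 vg=$2 lv=$3 size=$4
    echo "lvcreate -L $size -n $lv $vg"
    echo "dd if=$img of=/dev/$vg/$lv bs=4M conv=fsync"
    echo "mv $img $img.bak"
    echo "ln -s /dev/$vg/$lv $img"
}

# Example with an assumed image path (adjust to the real VM directory):
plan_image_to_lv /var/lib/qubes/vm-templates/debian-8/root.img \
    ssd template-debian-8-root 10G
```

Keeping the original image as `.bak` until the VM has booted from the LV gives an easy way back.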

Marek Marczykowski-Górecki

May 27, 2015, 3:40:30 PM
to Vít Šesták, qubes...@googlegroups.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1
Hmm, the symlink idea should theoretically work. Check
/var/log/xen/xen-hotplug.log and /var/log/libvirt/libxl/VMNAME.log.

Generally I like this idea and we have plans for easier support for such
a setup - especially using LVM thin provisioning. But it will not happen
soon, probably not before R4.0...

- --
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJVZh2lAAoJENuP0xzK19cstlgH/3KCnRczirmZZ5nsZifYQGm1
lF9uWpylUYovNgrxIiAXybyoxwJkd3Rfd9Or+cK9tOunj7NppuaZIDxAhTgABT7V
SkIQ2YWHTwSJhC9uwOPwhshoEyexiifuNnl9UrPxaUqXSBYsTbVXq5rUBT7jPRre
PcLEzPK9VK2daFyIGpv7BAC2bByvqkpjqAbWLK2yHiwX8TVz6hyhrlUBu2OTo6CP
Bk8Au0n7OcHqbHvkcSXovXXErHXLWJJm9bd48kzmznJtCe5K8u7JXTZ6FGQfjNvV
YaQ2/zsvBFV8t48iCGn50WW4/B1QGlQZ81hjqCEs0GBQMp6/08c11jI1hwpqF5Q=
=LVjQ
-----END PGP SIGNATURE-----

Vít Šesták

May 27, 2015, 3:54:00 PM
to qubes...@googlegroups.com, groups-no-private-mail--con...@v6ak.com
Nothing was added to the VM's libxl log, but something interesting was added to xen-hotplug.log:

==> /var/log/xen/xen-hotplug.log <==

Usage:
 blockdev -V
 blockdev --report [devices]
 blockdev [-v|-q] commands devices

Available commands:
 --getsz                   get size in 512-byte sectors
 --setro                   set read-only
 --setrw                   set read-write
 --getro                   get read-only
 --getdiscardzeroes        get discard zeroes support status
 --getss                   get logical block (sector) size
 --getpbsz                 get physical block (sector) size
 --getiomin                get minimum I/O size
 --getioopt                get optimal I/O size
 --getalignoff             get alignment offset in bytes
 --getmaxsect              get max sectors per request
 --getbsz                  get blocksize
 --setbsz <bytes>          set blocksize on file descriptor opening the block device
 --getsize                 get 32-bit sector count (deprecated, use --getsz)
 --getsize64               get size in bytes
 --setra <sectors>         set readahead
 --getra                   get readahead
 --setfra <sectors>        set filesystem readahead
 --getfra                  get filesystem readahead
 --flushbufs               flush buffers
 --rereadpt                reread partition table


Not sure what exactly happened, but it seems like blockdev got some wrong args and it is complaining about that.

Regards,
Vít Šesták 'v6ak'

Marek Marczykowski-Górecki

May 27, 2015, 4:02:05 PM
to Vít Šesták, qubes...@googlegroups.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

I see. Try this change in /etc/xen/scripts/block-snapshot:
- --- block-snapshot.orig 2015-05-27 22:01:07.986000000 +0200
+++ block-snapshot 2015-05-27 22:01:16.859000000 +0200
@@ -48,6 +48,7 @@
else
test -e "$dev" || fatal "$dev does not exist."
test -b "$dev" || fatal "$dev is not a block device nor file."
+ echo "$dev"
fi
}
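
The effect of the added line can be illustrated with a simplified stand-in for the patched function. This is not the actual block-snapshot code: the function name and the loop-device placeholder are assumptions. The point, presumably, is that the caller reads the device path from the function's standard output, so the block-device branch has to print it just as the file branch does. Otherwise later commands such as blockdev receive an empty argument.

```shell
#!/bin/sh
# Simplified stand-in (hypothetical): resolve an image path to a device path.
fatal() { echo "$*" >&2; exit 1; }

get_dev() {
    dev=$1
    if [ -f "$dev" ]; then
        # The real script would attach a loop device here; this sketch only
        # prints a placeholder so that it needs no privileges.
        echo "loop-for:$dev"
    else
        test -e "$dev" || fatal "$dev does not exist."
        test -b "$dev" || fatal "$dev is not a block device nor file."
        echo "$dev"   # without this line the caller gets an empty string
    fi
}
```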

- --
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJVZiK1AAoJENuP0xzK19cstbwH/13RXN4apewTHzR3R7wstdzg
tPKC8WBCbTLd20Yh7g4FqxXoggPHBUs2p2RArnUglzloUNU59hqNvo5jQdbc50EI
AMHyt+JjeIx4oz4I3Q15o8em0cN1m6QzE7VSMC8dJZ1MQJ10m+atLMa8m7BgL6uT
TKdbREl90ryggULeNgkqhfgF2vbsOElm9v/hz+IANdlU4vckFqfijS7cR8eLJKx7
SVKjJ1upwqrkXPy2jyQ2/CA6i6ye3egkTrdeKl1UMMLk6kRCwZvyKlbMJmcrWsfF
HRyUGpSJ5l2St9ovu+EckFYLPgMcnCuc+z4XJOJkIZxakj+GhcBss/uYhEH3Ha4=
=7NLx
-----END PGP SIGNATURE-----

Vít Šesták

May 27, 2015, 5:31:25 PM
to qubes...@googlegroups.com, groups-no-private-mail--con...@v6ak.com
Great! This one-liner fix seems to work well, at least for the TemplateVM's root.img :) I will hopefully try it for an AppVM's private.img as well.

fstrim on / does not work, though. I assume that this is caused by the COW mechanisms used in Qubes and their limitations. I know that a similar limitation also holds for img-file-based TemplateVMs. For LVM-based TemplateVMs, it might be solved by some fstrim-vm and temporarily attaching the block device to it. (Well, a DisposableVM might be useful for security reasons…)

Are there any known issues with the symlink hack other than broken Qubes backup? (I don't care about that, as I'd like to use a custom LVM-based backup process.)

Unfortunately, I can't directly compare the performance of images on ext4 with LVM, because the change is combined with an SSD->HDD migration, which likely has a much higher performance impact. Despite my former opinions, I will also try LVM on HDD if it proves to work correctly and to be good for backup purposes.

Regards,
Vít Šesták 'v6ak'

Marek Marczykowski-Górecki

May 27, 2015, 6:00:03 PM
to Vít Šesták, qubes...@googlegroups.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, May 27, 2015 at 02:31:25PM -0700, Vít Šesták wrote:
> Great! This one-liner fix seems to work well, at least for TemplateVM's
> root.img :)

Added for next qubes-core-dom0 package version.

> I will hopefuly try it also for AppVM's private.img.

To save some space, it would be a good idea to create thin volumes
(lvcreate -T ...). This should work well with private.img as it is
mounted with -o discard.
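
A thin volume needs a thin pool to live in, so `lvcreate -T ...` usually expands to two steps. The sketch below only prints the commands. The volume group `ssd` matches the device paths used earlier in the thread, while the pool name, volume name, sizes and function name are made-up examples.

```shell
#!/bin/sh
# Dry run: print the lvcreate calls for a thin pool and one thin volume.
# Arguments: volume group, pool name, pool size, volume name, virtual size.
plan_thin_volume() {
    vg=$1 pool=$2 pool_size=$3 lv=$4 virt_size=$5
    echo "lvcreate --type thin-pool -L $pool_size -n $pool $vg"
    echo "lvcreate --type thin -V $virt_size --thinpool $pool -n $lv $vg"
}

plan_thin_volume ssd pool 40G untrusted-private 10G
```

The virtual size (`-V`) may exceed the pool size; that overcommit is what saves space, and also what makes running out of pool space possible.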

> The fstrim on / does not work, though, I assume that it is caused by some
> COW mechanisms used in Qubes and their limitations. I know that a similar
> limitation also holds for img-file based TemplateVMs. For LVM-based
> TemplateVMs, it might be solved by some fstrim-vm and temporary attaching
> the block device to them. (Well, a DisposableVM might be useful for
> security reasons…)

There is already a tool which does exactly that: qvm-trim-template.

> Are there any known issues with the symlink hack other than broken Qubes
> backup? (I don't care about that, as I'd like to use a custom LVM-based
> backup process.)

I don't know of any, but you will probably find some soon... I hope all
of them will be similarly easy to fix.

> Unfortunately, I can't compare directly images on ext4 to LVM in
> performance, because it is also connected to SSD->HDD migration, which has
> likely much higher performance impact. Despite my former opinions, I will
> try LVM also on HDD if it proves to work correctly and be good for backup
> purposes.

It would be interesting to see the results of such a comparison.

- --
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJVZj5dAAoJENuP0xzK19csty8H/AsbjepnECcvo1p/h0is0vDK
tahs3w/bM3/3Chhrp1YdTVz8SOekRuAKVNCSmDFjX/w6ZaHSbHJKdlg+51gjaW2Q
vmVBHbaPYj3zHWM6YGW5FD4JwawYPLTaYasdByrZ/lmXwHN1k3sWZIhiy4HLXF9W
uCyufLZD8IRywCap0L3PEXbwtW5rBVH61xwznh7ETLOmrkdxSSbPX351j7Rv3Jdx
z3fiUQMOcjnHC56vRRI3guPYpOHweGM3Y1l8lA01QtAMhjVZIKIMGUe51JUWofIy
5JSnGGZ42pteAAr8eUbf6KVs9h/+XOIofVF7ueEuJTZHD1zcJMBT7bOIrz/1lEc=
=1WPS
-----END PGP SIGNATURE-----

Vít Šesták

May 28, 2015, 5:43:41 AM
to qubes...@googlegroups.com, groups-no-private-mail--con...@v6ak.com

> Added for next qubes-core-dom0 package version.

Great!

> To save some space, it would be good idea to create thin volumes
> (lvcreate -T ...). This should work well with private.img as it is
> mounted with -o discard.

Thin provisioning brings some potential issues. I am not sure (and can't find any details) how running low on real space is handled:
1. It should be handled proactively, i.e. by warning the user when space is running out. For images on an ordinary filesystem, this is usually monitored by some user process that shows a low-space notification. I am, however, unsure whether there is something similar for LVM thin provisioning.
2. I don't know how this is handled reactively, i.e. whether it just pauses the VM until you free some space in the LVM volume group (which would be similar to the VirtualBox approach) or whether it is handled differently.

I'd like to use thin provisioning for private.img on SSD, as it saves space and fragmentation does not hurt so much there. For root.img on SSD, I can usually predict the space I need. For HDD, I have plenty of space, so I don't need to save space there.

On -o discard: I've seen two reasons to prefer fstrim over discards (based on SSD features), but both of them might be invalid today:
1. The discard option incurs a huge performance penalty when you are deleting many small files, and potentially in some other similar scenarios. But it might work better with modern kernels or more modern SSDs (e.g. thanks to queued TRIM).
2. Immediate discarding might reportedly be bad for wear leveling. Maybe TRIM with some bad controller always requires erasing a whole erase block and relocating the other data? But this seems to depend heavily on the SSD. A non-DRAT (DRAT = deterministic read after TRIM) SSD might somehow defer such TRIM commands, and even a DRAT SSD might just relocate (or deallocate) the “erased” block without actually erasing it.

> There is already tool which does exactly that: qvm-trim-template.

I thought it worked differently (maybe I was just confused by some howto with copy --sparse) and that it would not work with symlinks and LVM at all, but I just looked at the source and it seems to work exactly the way I described.

Now I just have to get DispVMs working. (I had them broken even before using LVM, not sure why. I did not experiment with them, just switched the template from Fedora 21 to Debian 8.)

> I don't know any, but probably you will find some soon... I hope all of
> them will be similar easy to fix.

I assume that you are interested in them (if I encounter any), as you plan LVM (or something similar) in the future.


Vít Šesták

May 28, 2015, 2:44:48 PM
to qubes...@googlegroups.com
What will happen when I try to increase the volume size in Qubes Manager? I have not dared to try…

I created a first volume on SSD without thin provisioning (as it seems I have to create some pool first for TP and as the one VM has relatively fixed-size data) and it works, but fstrim on /rw does not work (fstrim: /rw: the discard operation is not supported). It seems I have to do something like this: https://www.qubes-os.org/doc/DiskTRIM/ .

Except the fstrim, which I'll hopefully fix, it works fine now.
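
Whether fstrim can work at all depends on whether the underlying block device advertises discard support. One way to check is the `discard_max_bytes` attribute in sysfs, where 0 means discards are not supported. A minimal sketch follows. The function name and the optional sysfs-root argument (which exists only so the function can be tried against a fake tree) are assumptions, as is the device name in the example.

```shell
#!/bin/sh
# Report whether a block device advertises discard support.
# Returns 0 if supported, 1 if not, 2 if the attribute cannot be read.
supports_discard() {
    name=$1 sysfs=${2:-/sys}
    f="$sysfs/block/$name/queue/discard_max_bytes"
    [ -r "$f" ] || return 2
    [ "$(cat "$f")" -gt 0 ]
}

# Example (inside the VM, assuming the private volume shows up as xvdb):
#   supports_discard xvdb && echo "discard supported" || echo "no discard"
```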

Regards,
Vít Šesták 'v6ak'

Marek Marczykowski-Górecki

May 28, 2015, 5:36:12 PM
to Vít Šesták, qubes...@googlegroups.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thu, May 28, 2015 at 11:44:47AM -0700, Vít Šesták wrote:
> On Thursday, May 28, 2015 at 11:43:41 AM UTC+2, Vít Šesták wrote:
> >
> >
> > Added for next qubes-core-dom0 package version.
> >>
> > Great!
> >
> >
> >> To save some space, it would be good idea to create thin volumes
> >> (lvcreate -T ...). This should work well with private.img as it is
> >> mounted with -o discard.
> >>
> >
> > Thin provisioning brings some potential issues. I am not sure (and can't
> > find any details) how is low real space handled:
> > 1. It should be handled somehow proactively, i.e. warn the user when
> > running out of space. For imgs on an ordinary filesystem, this is usually
> > monitored by some user process, which shows some notification about low
> > space. I am however unsure if there is something similar for LVM thin
> > provisioning.

There is lvmetad daemon which can send events, but I'm not sure if there
is any GUI indication of that.
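
A proactive check can be approximated with a small script run from cron or a timer. The sketch below assumes the `data_percent` and `metadata_percent` report fields of lvs; the function name and the 80% threshold are arbitrary choices. (Stock LVM normally leaves this kind of monitoring to dmeventd, which can also extend pools automatically when configured to do so.)

```shell
#!/bin/sh
# Return 0 (and print a warning) when either usage figure reaches the
# threshold. Percentages may be fractional, so awk does the comparison.
pool_needs_attention() {
    data=$1 meta=$2 limit=$3
    awk -v d="$data" -v m="$meta" -v l="$limit" 'BEGIN {
        if (d + 0 >= l + 0 || m + 0 >= l + 0) {
            printf "thin pool usage high: data %s%%, metadata %s%%\n", d, m
            exit 0
        }
        exit 1
    }'
}

# On a real system the figures could come from lvs, for example:
#   set -- $(lvs --noheadings -o data_percent,metadata_percent ssd/pool)
#   pool_needs_attention "$1" "$2" 80
```

Checking the metadata figure alongside the data figure matters because the two can fill up independently.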

> > 2. I don't know how is this handled reactively, i.e. if it just pauses the
> > VM until you make some space in the LVM volume group (which would be
> > similar to the VirtualBox approach) or if it is handled somehow differently.

It looks like it crashes badly... I was able to recover test volume
using lvconvert --repair, but some data were lost.

- From lvmthin manual:
(...) Writes to thin LVs are accepted and queued,
with the expectation that pool data space will be extended soon.
Once data space is extended, the queued writes will be processed, and
the thin pool will return to normal operation.

While waiting to be extended, the thin pool will queue writes for up
to 60 seconds (the default). If data space has not been extended
after this time, the queued writes will return an error to the
caller, e.g. the file system. This can result in file system
corruption for non-journaled file systems that may require fsck.
When a thin pool returns errors for writes to a thin LV, any file
system is subject to losing unsynced user data.

There is an option to return an error immediately when volume is full
(so there will be no lost writes from a queue). I'm attaching terminal
output from test session at the end.
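
The option referred to here is presumably `--errorwhenfull`, which lvmthin(7) documents for lvcreate and lvchange. A small dry-run helper, with a hypothetical function name, that maps the two behaviours onto that flag:

```shell
#!/bin/sh
# Dry run: print the lvchange call that selects what a full thin pool does.
# Mode "error" fails writes immediately; mode "queue" is the default
# behaviour of queueing them (for up to 60 seconds, per the manual).
plan_when_full() {
    pool=$1 mode=$2
    case $mode in
        error) echo "lvchange --errorwhenfull y $pool" ;;
        queue) echo "lvchange --errorwhenfull n $pool" ;;
        *) echo "mode must be 'error' or 'queue'" >&2; return 1 ;;
    esac
}

plan_when_full test/pool error
```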

> > I'd like to use thin provisioning for private.img on SSD, as it saved
> > space and fragmentation does not hurt so much. For root.img on SSD, I can
> > usualy predict the space I need. For HDD, I have plenty of space, so I
> > don't need to save some space there.
> >
> > On -o discard: I've seen two reasons to prefer fstrim over discards (based
> > on SSD features), but both of them might be invalid today:
> > 1. The discard option makes a huge performance penalty when you are
> > deleting many small files and potentially in some other similar scenarions.
> > But it might work better with modern kernels or more modern SSDs (e.g.
> > thanks to queued TRIM).
> > 2. Immediate discarding might reportedly be bad for wear leveling. Maybe
> > TRIM with some bad controller just always requires to erase one
> > erasure-block and relocates the other data? But this seems to deeply depend
> > on SSD. A non-DRAT (DRAT=deterministic read after TRIM) SSD might somehow
> > defer such TRIM commands and even DRAT SSD might just relocate (or
> > deallocate) the “erased” block without actually erasing it.

Yes, those probably still apply. Note that -o discard in a VM (enabled
by default) is something different than in dom0 (disabled by default).
In VM it is used to notify dom0 about blocks not used by VM anymore, so
sparse file private.img (or thin volume) can release them. In dom0 it
could be used to tell your SSD of unused space - so this is real TRIM
operation. To make use of it in dom0, you also need to add
"allow_discards" option to dm-crypt.

> >> There is already tool which does exactly that: qvm-trim-template.
> >>
> >
> > I thought it works differently (maybe just confused by some howto with
> > copy --sparse) and it would not work with symlinks and LVM at all, but I
> > just looked at the source and it seems to work exactly the way I described.
> >
> > Now, I just have to make DispVMs working. (I had them broken even before
> > using LVM, not sure why. I did not experiment with them, just switched the
> > template form Fedora 21 to Debian 8.)
> >
> > I don't know any, but probably you will find some soon... I hope all of
> >> them will be similar easy to fix.
> >>
> >
> > I assume that you are interested in them (if I encounter any), as you plan
> > LVM (or something similar) in the future.

Yes. We plan to use libvirt storage pool API for that, so some details
will differ, but generally the final state will be very similar.

> What will happen when I try to increase the volume size in Qubes Manager? I
> have not dared to try…

It will probably fail as it will try to call "truncate ..." on that
image (which isn't valid operation for block device)
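
The distinction can be made explicit in a wrapper: a regular file is grown with truncate, while an LV needs lvextend. This is a dry-run sketch under the assumption that the image path is either a regular file or (a symlink to) an LVM block device. The function name is hypothetical, and the filesystem inside the volume would still have to be grown separately (e.g. with resize2fs for ext4).

```shell
#!/bin/sh
# Print the command that would grow a VM image, depending on what the
# path points at. Dry run only; nothing is resized.
plan_grow() {
    path=$1 size=$2
    if [ -b "$path" ]; then
        echo "lvextend -L $size $path"
    elif [ -f "$path" ]; then
        echo "truncate -s $size $path"
    else
        echo "unsupported: $path" >&2
        return 1
    fi
}
```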

> I created a first volume on SSD without thin provisioning (as it seems I
> have to create some pool first for TP and as the one VM has relatively
> fixed-size data) and it works, but fstrim on /rw does not work (fstrim:
> /rw: the discard operation is not supported). It seems I have to do
> something like this: https://www.qubes-os.org/doc/DiskTRIM/ .
>
> Except the fstrim, which I'll hopefully fix, it works fine now.

- --
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJVZ4pCAAoJENuP0xzK19cskAYH/33IdnNlYuttBg6tX4Z6a7Pf
0lsS3OPmlP+kG8J+EaxKwNaWqKxWLsu5hm55k9euXPYoZ8L35YG2eh40oOvckMw8
72HYTNl0DwMHBMf3+CCADR0UD1Ilal4LVXqBgwyEf0Xk9eUv1TaMvVzwZN438BBC
GMETQ4FYiC2ecSsM1HBZMRHjkG+zJs0aXxIz/wBCQZjpgY9BXu1Tr8CzEKBT3b6I
zQ3flpy/sUeOtVMH53HbJSDwfBBbmb8pjA7N/n0Re6Fnj0tIjwL576pYrHhZ5Sdi
Utzv3s5/UjjcdyVdieB/XzCKc1ck8pCLk+zU/10wt/bto3ykZAD36E/TEblOgGs=
=bY1B
-----END PGP SIGNATURE-----

Marek Marczykowski-Górecki

May 28, 2015, 6:14:10 PM
to Vít Šesták, qubes...@googlegroups.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Thu, May 28, 2015 at 11:36:02PM +0200, Marek Marczykowski-Górecki wrote:
> On Thu, May 28, 2015 at 11:44:47AM -0700, Vít Šesták wrote:
> > On Thursday, May 28, 2015 at 11:43:41 AM UTC+2, Vít Šesták wrote:
> > >
> > >
> > > Added for next qubes-core-dom0 package version.
> > >>
> > > Great!
> > >
> > >
> > >> To save some space, it would be good idea to create thin volumes
> > >> (lvcreate -T ...). This should work well with private.img as it is
> > >> mounted with -o discard.
> > >>
> > >
> > > Thin provisioning brings some potential issues. I am not sure (and can't
> > > find any details) how is low real space handled:
> > > 1. It should be handled somehow proactively, i.e. warn the user when
> > > running out of space. For imgs on an ordinary filesystem, this is usually
> > > monitored by some user process, which shows some notification about low
> > > space. I am however unsure if there is something similar for LVM thin
> > > provisioning.
>
> There is lvmetad daemon which can send events, but I'm not sure if there
> is any GUI indication of that.
>
> > > 2. I don't know how is this handled reactively, i.e. if it just pauses the
> > > VM until you make some space in the LVM volume group (which would be
> > > similar to the VirtualBox approach) or if it is handled somehow differently.
>
> It looks like it crashes badly... I was able to recover test volume
> using lvconvert --repair, but some data were lost.
>
> From lvmthin manual:
> (...) Writes to thin LVs are accepted and queued,
> with the expectation that pool data space will be extended soon.
> Once data space is extended, the queued writes will be processed, and
> the thin pool will return to normal operation.
>
> While waiting to be extended, the thin pool will queue writes for up
> to 60 seconds (the default). If data space has not been extended
> after this time, the queued writes will return an error to the
> caller, e.g. the file system. This can result in file system
> corruption for non-journaled file systems that may require fsck.
> When a thin pool returns errors for writes to a thin LV, any file
> system is subject to losing unsynced user data.
>
> There is an option to return an error immediately when volume is full
> (so there will be no lost writes from a queue). I'm attaching terminal
> output from test session at the end.

Oops, forgot that:
[root@testvm user]# truncate -s 128M test.img
[root@testvm user]# losetup /dev/loop0 test.img
[root@testvm user]# vgcreate test /dev/loop0
Physical volume "/dev/loop0" successfully created
Volume group "test" successfully created
[root@testvm user]# lvcreate -T --thinpool pool -V 10G -L 32M -n vol1 test
Logical volume "vol1" created.
[root@testvm user]# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
pool test twi-aotz-- 32.00m 0.00 0.98
vol1 test Vwi-a-tz-- 10.00g pool 0.00
[root@testvm user]# yes | dd of=/dev/test/vol1 bs=1M count=48 iflag=fullblock
48+0 records in
48+0 records out
50331648 bytes (50 MB) copied, 60.7506 s, 828 kB/s

[root@testvm user]# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
pool test twi-aotzD- 32.00m 100.00 1.37
vol1 test Vwi-aotz-- 10.00g pool 0.31


May 29 00:07:06 testvm kernel: device-mapper: thin: 253:3: reached low water mark for data device: sending event.
May 29 00:07:06 testvm lvm[9814]: Thin test-pool-tpool is now 100% full.
May 29 00:07:06 testvm kernel: device-mapper: thin: 253:3: switching pool to out-of-data-space mode
(...)
May 29 00:08:06 testvm kernel: device-mapper: thin: 253:3: switching pool to read-only mode
May 29 00:08:06 testvm kernel: buffer_io_error: 6135 callbacks suppressed
May 29 00:08:06 testvm kernel: Buffer I/O error on dev dm-5, logical block 8192, lost async page write
May 29 00:08:06 testvm kernel: device-mapper: thin: 253:3: metadata operation 'dm_pool_commit_metadata' failed: error = -1
May 29 00:08:06 testvm kernel: device-mapper: thin: 253:3: aborting current metadata transaction
May 29 00:08:06 testvm kernel: Buffer I/O error on dev dm-5, logical block 8193, lost async page write
May 29 00:08:06 testvm kernel: Buffer I/O error on dev dm-5, logical block 8194, lost async page write
(...)

[root@testvm user]# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
pool test twi-aotzM- 32.00m 100.00 1.37
vol1 test Vwi-a-tz-- 10.00g pool 0.31

[root@testvm user]# hexdump /dev/test/vol1
0000000 0a79 0a79 0a79 0a79 0a79 0a79 0a79 0a79
*
hexdump: /dev/test/vol1: Input/output error
2000000

[root@testvm user]# lvconvert --repair test/pool
Only inactive pool can be repaired.
[root@testvm user]# lvchange -an test/vol1
[root@testvm user]# lvchange -an test/pool
[root@testvm user]# lvconvert --repair test/pool
WARNING: If everything works, remove "test/pool_meta0".
WARNING: Use pvmove command to move "test/pool_tmeta" on the best fitting PV.
[root@testvm user]# lvs
LV VG Attr LSize Pool Origin Data% Meta% Move Log Cpy%Sync Convert
pool test twi---tz-- 32.00m
pool_meta0 test -wi------- 4.00m
vol1 test Vwi---tz-- 10.00g pool
[root@testvm user]# lvchange -ay test/vol1
[root@testvm user]# hexdump /dev/test/vol1
0000000 0a79 0a79 0a79 0a79 0a79 0a79 0a79 0a79
*
2000000 0000 0000 0000 0000 0000 0000 0000 0000
*
280000000
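
Two details of the transcript are easy to miss. The hexdump offsets show that of the 48 MiB written, only the first 32 MiB (the pool size) survived; the rest of the 10 GiB volume reads back as zeros after the repair. The ninth character of the lvs Attr column is the health flag: `D` in the second listing and `M` in the third. The sketch below does the arithmetic and decodes the flag as described in lvs(8); the function name is hypothetical.

```shell
#!/bin/sh
# hexdump offsets from the transcript, converted to binary units.
echo "data ends at $((0x2000000 / 1024 / 1024)) MiB"             # pool size
echo "volume ends at $((0x280000000 / 1024 / 1024 / 1024)) GiB"  # virtual size

# Decode the 9th lv_attr character of a thin pool.
pool_health() {
    c=$(printf '%s' "$1" | cut -c9)
    case $c in
        D) echo "out of data space" ;;
        M) echo "metadata read-only" ;;
        F) echo "failed" ;;
        -) echo "ok" ;;
        *) echo "unknown ($c)" ;;
    esac
}

pool_health twi-aotzD-
```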

- --
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJVZ5MpAAoJENuP0xzK19csPEAIAIfoScmVGYq6o+W8XDMHjxAD
XeEullrcKRFm/TPBCfac85LYv5h1j4/HpDX0+xjyvS90Ls3LuudkpGXAmoeSyHMh
J/il75GDmBKaG1lnqLcxcowMKHLIsvf0SK1RhaG0v4X47e9zXZ3/LX6qkVYTfYPd
vow7/pd4+dFIoWNIY8PrQVKHZqiOUne4+jqjgJFgm1+DfcUDcdapuXS3DdlmvMNe
2JO/eMK8u+pIfwDV9HRkC5zqdtn3c6poZQkuLj6igZrItA6O9BqmgyQyX08qdxeX
Ua9E5NxeHz8bFB/2KypVRRvCCluIoIC5Nfy/talN+q2od795G96WzYDpNjqbjE0=
=wwzj
-----END PGP SIGNATURE-----

Vít Šesták

May 29, 2015, 4:22:08 AM
to qubes...@googlegroups.com, groups-no-private-mail--con...@v6ak.com

> There is an option to return an error immediately when volume is full
> (so there will be no lost writes from a queue). I'm attaching terminal
> output from test session at the end.

That does not sound as bad as my original reading of “If an overflow occurs within the thin pool metadata, then the pool will be corrupted. LVM cannot recover from this.” from https://wiki.gentoo.org/wiki/LVM#Thin_provisioning . So it seems we just have to prevent the metadata from overflowing, not the volumes themselves. I am not sure how metadata overflow happens or whether it is easy to prevent.

I am not sure what the best behavior for Qubes is here. BTW, is overflow of a sparse file handled somehow?

VirtualBox pauses the affected VM, but similar behavior in Qubes could lead to all the VMs being paused within a short time, even by writing something to, say, bash_history or a log file. Maybe pausing all VMs some time in advance would prevent that.

Another way to recover from such a situation would be to add an extra drive (e.g. a large HDD or even something external) to the pool.

Maybe I want no overcommit for dom0, as this seems to be very hard to recover from. Hmm, it is questionable whether dom0, sys-net and sys-firewall are desirable on LVM at all, as it might be easy to make the system unbootable. Fortunately, they are rather fixed-size by design.

> Yes, those probably still apply. Note that -o discard in a VM (enabled
> by default) is something different than in dom0 (disabled by default).
> In VM it is used to notify dom0 about blocks not used by VM anymore, so
> sparse file private.img (or thin volume) can release them. In dom0 it
> could be used to tell your SSD of unused space - so this is real TRIM
> operation. To make use of it in dom0, you also need to add
> "allow_discards" option to dm-crypt.

That applies to image files and probably to thin LVM volumes, but not to thick LVM volumes. So, it depends.

BTW, the “allow_discards” seems to be renamed to “discard” according to man crypttab. I hope that allow_discards is still supported for backward compatibility.
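
Three spellings are in circulation: crypttab(5) uses `discard`, the cryptsetup command line uses `--allow-discards`, and the device-mapper table of the opened mapping shows `allow_discards`. One way to check what is actually active, whichever spelling was used to configure it, is to look at the table line. The helper name and the mapping name in the example are hypothetical.

```shell
#!/bin/sh
# Succeeds when a dm-crypt table line (as printed by `dmsetup table NAME`)
# lists the allow_discards optional parameter.
crypt_allows_discards() {
    case " $1 " in
        *" crypt "*" allow_discards "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example on a live system (the mapping name is an assumption):
#   crypt_allows_discards "$(dmsetup table luks-ssd)" && echo enabled
```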

Vít Šesták

May 29, 2015, 1:45:44 PM
to qubes...@googlegroups.com
I haven't found anything about preventing the metadata overflow, but I've found something else interesting: LVM dangers and caveats: http://serverfault.com/a/279577/81680 (although it is about 3 years old, it was updated very recently and says that not much has changed). Some of the issues mentioned are rather hardware-specific, but there is also something Xen-specific: proper cache configuration, i.e. writethrough vs. writeback: https://osdir.com/ml/xen-users/2010-11/msg00557.html

Regards,
Vít Šesták 'v6ak'

Marek Marczykowski-Górecki

May 30, 2015, 5:00:38 AM
to Vít Šesták, qubes...@googlegroups.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Fri, May 29, 2015 at 01:22:08AM -0700, Vít Šesták wrote:
>
>
> > There is an option to return an error immediately when volume is full
> > (so there will be no lost writes from a queue). I'm attaching terminal
> > output from test session at the end.
> >
>
> It does not sound as bad as I originally interpreted “If an overflow occurs
> within the thin pool metadata, then the pool will be corrupted. *LVM cannot
> recover from this*.” from
> https://wiki.gentoo.org/wiki/LVM#Thin_provisioning . So, it seems we just
> have to prevent *metadata *from overflowing, not the volumes itself. I am
> not sure how metadata overflow happens and if it is easy to prevent.
>
> I am not sure what is the best behavior for Qubes there. BTW, is it handled
> somehow with sparse file overflow?

When the underlying disk is full, writes to a sparse file simply return
an error. But the ext4 driver doesn't handle such a situation nicely,
especially when it happens while flushing some buffers. Anyway, the
journal should theoretically be enough to recover from such a situation.

> VirtualBox pauses the affected VM, but similar behavior in Qubes could lead
> in pausing all the VMs in a short time, even by writing something to, say
> bash_history or to a log file. Maybe pausing all VMs some time in advance
> would prevent that.
>
> Another way to recover from such situation would be adding some extra drive
> (e.g. large HDD or even something external) to the pool.

If you have something like that, you probably wouldn't run into out of
disk space situation in the first place :)

> Maybe I want no overcommit for dom0, as this seems to be very hard to
> recover from. Hmm, it is questionable if dom0, sys-net and sys-firewall are
> desirable on LVM at all, as it might be easy to make the system unbootable.
> Fortunately, they are rather fixed-size by design.
>
> Yes, those probably still apply. Note that -o discard in a VM (enabled
> > by default) is something different than in dom0 (disabled by default).
> > In VM it is used to notify dom0 about blocks not used by VM anymore, so
> > sparse file private.img (or thin volume) can release them. In dom0 it
> > could be used to tell your SSD of unused space - so this is real TRIM
> > operation. To make use of it in dom0, you also need to add
> > "allow_discards" option to dm-crypt.
> >
>
> That applies for imgs and probably for thin LVM volume, but not for thick
> LVM volume. So, it depends.
>
> BTW, the “allow_discards” seems to be renamed to “discard” according to man
> crypttab. I hope that allow_discards is still supported for backward
> compatibility.
>


- --
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJVaXwsAAoJENuP0xzK19cs7BIH/2g6y/u6I+E+JLOTjMqqzXPS
6hVGp28F15NInWkKNnhC/6jjhqj3VXqG8/xHCPhdn1UdRP7u2mkiT5/U42t+ylPg
cG8e8A2m2Gyg40BKSokSdqPL8HVoyEZGWmKO8h6GgM3BnM/P9WFKdEm39Ex0kJLP
slFeWmwzWdHFKfK4uBfkU1MFPbQwEU8axbfv59IY5fZ0oBSvl+IQhVO5ovYMiDuk
q0BgRYBrW6lPzfrxMrMAZh9mZ+vM4ytYKu26y5SfCRz+EPjTiL5aAkdtCM7e4Npr
xRhlRawzq39z9d+5l+H33bEd7a6AHmxVu1xutI1yp3m8Of0AKTIASjciT7CiGOc=
=Q13l
-----END PGP SIGNATURE-----

Bahtiar Gadimov

May 30, 2015, 4:46:07 PM
to qubes...@googlegroups.com
Hi,

I added LVM support to qubes-core-admin a while ago. I've been running it
on Qubes R2 for more than 6 months without any issues. Have a look at
https://github.com/kalkin/qubes-core-admin . You will also need the
patches for qubes-manager: https://github.com/kalkin/qubes-manager . I
will port it to Qubes R3 when I have some free time.

Cheers
kalkin-

Vít Šesták

Jun 1, 2015, 5:42:13 AM
to qubes...@googlegroups.com, groups-no-private-mail--con...@v6ak.com

> When underlying disk is full, writes to sparse file simply return an
> error. But ext4 driver doesn't handle such situation nicely, especially
> when it happens during flushing some buffers. Anyway journal
> theoretically should be enough to recover from such situation.

The metadata overflow seems to be somewhat different from a full disk. There is some space reserved for metadata. If it overflows, LVM is said to be unable to recover from this. Googling suggests that this is not a frequent case, but I am still unsure about it.

> If you have something like that, you probably wouldn't run into out of
> disk space situation in the first place :)

I have a large HDD (well, large for me) and a small (~111 GiB) SSD. Adding the HDD to the volume group might be a temporary solution, but I would not want to keep it that way permanently, because the two drives have different performance characteristics and mixing them might cause major performance penalties.


Regards,
Vít Šesták

Vít Šesták

Jun 1, 2015, 5:43:25 AM
to qubes...@googlegroups.com
Sounds great. I can't test it now, because I have 3.0-RC1.

Anyway, have you somehow handled the metadata issue?

Regards,
Vít Šesták 'v6ak'

Vít Šesták

Jun 6, 2015, 12:01:22 PM
to qubes...@googlegroups.com, groups-no-private-mail--con...@v6ak.com


On Thursday, May 28, 2015 at 12:00:03 AM UTC+2, Marek Marczykowski-Górecki wrote:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On Wed, May 27, 2015 at 02:31:25PM -0700, Vít Šesták wrote:
> Great! This one-liner fix seems to work well, at least for TemplateVM's
> root.img :)

Added for next qubes-core-dom0 package version.



Marek, has the /etc/xen/scripts/block-snapshot one-liner patch been lost somewhere, or is it just not present in the stable version of 3.0-RC1? I updated dom0 and the change was reverted, so I had to apply the patch again.

Regards,
Vít Šesták 'v6ak'

Marek Marczykowski-Górecki

Jun 7, 2015, 6:12:10 PM
to Vít Šesták, qubes...@googlegroups.com
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

It is currently in testing repository (qubes-core-dom0-3.0.14).

- --
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBAgAGBQJVdMG0AAoJENuP0xzK19csvnoH/1oRME+AnUHBCzjk8XQP0CSf
EwFUwsBjvPycABn4/dNCt7iehUSa0rK5UdhSqvNA+6NwxO9e15yir9h0KxmOpbXj
cfDwpxI8pSgqoWDPc8iEk2XEWwby1o1b8QSxqayJCMwsvf64l1O0eCzXrkYXU2dV
AIMdsJNt5OYeDGEYI4xhTSd5/mj+Z6DuKqw0NyWoIMqAZMqO6oROSq8cGJHil7bV
dl8hLplZvPzXrz9Xi2BPv4cIHr32h4l/ZYqjm01yvvucXahndtJ+IJEQ+5ctwm5Z
Nx0wNa33/+2F10YpbeJXUOy0W2rqIp4gf3kvNnmGWmJ3M+USBGHus/s8IQbUuwA=
=LPnR
-----END PGP SIGNATURE-----

Vít Šesták

Jul 29, 2015, 11:38:22 AM
to qubes-users, groups-no-private-mail--con...@v6ak.com
After using the LVM setup for some time, here are my findings:
* It basically just works.
* I haven't tried thin provisioning. I have read a report of a metadata pool overflow. The volume was reportedly still available in R/O mode, which may sound like “OK, I will create a new volume, copy all the data there and remove the old volume.” However, the user was reportedly unable to even remove the overflowed volume; the only way to get rid of it seemed to be removing the whole volume group. This implies too much potential hassle for me. Moreover, it then does not seem wise to install Qubes inside the same volume group, even on a “thickly-provisioned” volume.
* QSB-20 has a somewhat deeper impact with the LVM setup, as the device is always available and it is a more standard device than a loopback:
    1. The VM obviously does not have to be running in order to perform such an attack.
    2. It might be viable to perform a boot-related attack. The attacker may add an arbitrary boot menu item in order to fool the user. I am almost sure this is possible with an unencrypted LVM VG. It is probably also possible with an encrypted LVM VG, but I am not that sure about it.

Moreover, the attacker might try to perform some grub.cfg injection and so on through the weakness described in QSB-20. But this is not specific to the LVM setup.

There is one more thing I am worried about: identifier collisions. If the attacker
1. knows the UUID (or maybe another identifier, like the LVM label) of my root partition or of some partition in /etc/fstab, and
2. has owned one of my VMs that runs on LVM,
then he can use the same identifier within the VM and hope that the VM's image will be mounted instead of the correct one. In some cases, this can result in dom0 compromise. Fortunately, such an attack has strong preconditions. Basically, if I use unpredictable UUIDs in /etc/fstab and keep them secret*, I should be safe.

This attack is not viable with file-based images, as their loopback devices are not available during boot. Well, theoretically, one might run sudo umount /boot && sudo mount /boot, and that is the moment when an attacker with access to one of the running VMs can do his job.
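
To see which mounts would be affected by such a collision, one could list the /etc/fstab entries that locate their device by something stored inside the filesystem. A minimal sketch in Python follows. The function name is hypothetical, and it assumes that only UUID=, LABEL= and the matching /dev/disk/by-uuid and /dev/disk/by-label paths are of interest, since those are the identifiers a VM can write into its own volume.

```python
# Specs that identify a device by data stored inside the filesystem,
# which a VM controlling its own volume could imitate.
CONTENT_DERIVED = ("UUID=", "LABEL=",
                   "/dev/disk/by-uuid/", "/dev/disk/by-label/")


def spoofable_fstab_entries(fstab_text):
    """Return (spec, mountpoint) pairs that are located by UUID or label."""
    found = []
    for line in fstab_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if len(fields) < 2:
            continue  # malformed line, ignore
        spec, mountpoint = fields[0], fields[1]
        if spec.startswith(CONTENT_DERIVED):
            found.append((spec, mountpoint))
    return found
```

Entries that use /dev/<VG name>/<LV name> or /dev/mapper/... are not reported, because those names are assigned in dom0 and not read from the volume's content.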

Mitigations:
a) Use encrypted LVM VG and decrypt it after some time, effectively mitigating the attack during the boot. (Does not mitigate the remount-triggered attack, though.)
b) Add some Qubes-specific header at the very beginning of all the VM images. Not sure how complex or simple it is. Depending on some details, this might mitigate even the remount-triggered attack.

Regards,
Vít Šesták 'v6ak'

*) Recently, I ***-ified my UUIDs in some pastes in this group. I just felt it was better to do that, without being aware of this reason at the time.

Vít Šesták

Aug 18, 2015, 3:42:42 PM
to qubes-users, groups-no-private-mail--con...@v6ak.com
An issue related to DVM with such LVM setup: https://groups.google.com/forum/#!topic/qubes-devel/dtdh6cH78wY

Marek Marczykowski-Górecki

Aug 25, 2015, 10:55:21 PM
to Vít Šesták, qubes-users
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256
The UUID collision doesn't apply to LVM volumes (root, swap), because
those are referenced by VG/LV name, which are outside of VM control. But
indeed /boot is referenced by UUID, so this might be a problem. I think
this could be mitigated by using one of the /dev/disk/by-id or
/dev/disk/by-path symlinks instead.

But the rest of your points still stand, especially the one about
parsing those untrusted volumes.

> This attack is not viable with file-based images, as their loopback devices
> are not available during the boot. Well, theoretically, one might call sudo
> umount /boot && sudo mount /boot, which is the moment when attacker with
> access to one of running VMs can do his job.

For example during AEM installation. This would be even simpler there -
device label is predictable.

> Mitigations:
> a) Use encrypted LVM VG and decrypt it after some time, effectively
> mitigating the attack during the boot. (Does not mitigate the
> remount-triggered attack, though.)
> b) Add some Qubes-specific header at the very beginning of all the VM
> images. Not sure how complex or simple it is. Depending on some details,
> this might mitigate even the remount-triggered attack.

The header idea theoretically might work, but I don't think it is
doable. I don't see any offset-like option in xen-blkback. This could be
done using an additional device-mapper layer, but that seems to be
overkill. Also, AFAIR some metadata could be stored at the end of the
device (Linux software RAID header?), so this space may also be parsed,
and a Qubes-specific header (at the beginning) would not prevent this.

Another solution would be to use a separate VG for VM images, and
configure udev+initramfs+others to not touch LV content there (which
isn't as simple as it seems).

Anyway, as noted in QSB-20, the ultimate solution would be to implement
Untrusted Storage Domain.

- --
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1

iQEcBAEBCAAGBQJV3SqSAAoJENuP0xzK19cs3s0IAJp1YnJHml9eiIKWQPBir8dQ
IRcZ46moVtoJ12ke7nJOpUDa5pJKIK4bwjoFmXq4SjeVYVCZu/GdyIVVuda/5aM6
w0PeRUR34FUfd6/0eh+EP5ksdvt0scuSszBg+Hwh3bLnoP3tXQcWpAfvg0ZEp3QP
4L9xQqsWS/l9BPC8K+2eeJnYEb+0lhtveardVCAWJMsPEFEdhMvEZONmU/pzuM0m
43Dp+nxVsRCkVbBUiajpC8CHeZD5fPyOg1DEUmF64jXMuR96b14YtadlxwejulFz
nH97yNpdu/C0TBm+6pn9dRSWAbGNJ90PHFg7fuBVKIqorGpoMStm/TxYkfh7iaQ=
=rgai
-----END PGP SIGNATURE-----

Vít Šesták

Nov 17, 2015, 8:23:25 AM
to qubes-users
Some exposure: udev in dom0 seems to parse all the LVM volumes. Is there any way to block it? I've looked at it, but the volumes seem to appear as generic dm-* devices in udev, so I am not sure if I can block them selectively. Maybe I could block all dm-*, but this might have some undesired consequences. (I am not sure…)

One consequence of this seems to be that nested LVM (LVM inside a VM's volume) gets picked up by dom0 (I haven't tested this), unless an appropriate (non-default) filter is configured in /etc/lvm/lvm.conf or some other mitigation is applied. Ideally, it should be a global filter.

Note that this exposure does not seem to be triggerable from standard img files with the default LVM config. Loop devices don't seem to be processed by LVM (even after a manual call), because LVM takes its device list from udev by default: obtain_device_list_from_udev is 1 by default. This might be fragile (it requires LVM to be compiled with udev support, which might at least theoretically change in the future), but my experiments suggest that this configuration option works now.
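
To reason about what such a filter would and would not hide, here is a rough Python model of how I understand lvm.conf filter lists to be evaluated: patterns look like "a|regex|" or "r|regex|", the first pattern matching a path name decides for that name, a device is accepted if any of its names is accepted, rejected if the names that matched were all rejected, and accepted if nothing matches. This is only an approximation (LVM has its own regex engine), and the VG name vm_vg and the device names are hypothetical.

```python
import re


def parse_filter(patterns):
    """Turn ["r|regex|", "a|regex|"] into [(action, compiled_regex)]."""
    rules = []
    for item in patterns:
        if len(item) < 3 or item[0] not in ("a", "r") or item[-1] != item[1]:
            raise ValueError("bad filter pattern: %r" % (item,))
        # item[1] is the delimiter; the regex sits between the delimiters.
        rules.append((item[0], re.compile(item[2:-1])))
    return rules


def first_match(rules, path):
    """Return "a" or "r" for the first rule matching path, else None."""
    for action, regex in rules:
        if regex.search(path):
            return action
    return None


def is_scanned(rules, aliases):
    """Model the accept/reject decision for one device with several names."""
    verdicts = [first_match(rules, path) for path in aliases]
    if "a" in verdicts:
        return True   # some name was explicitly accepted
    if "r" in verdicts:
        return False  # every name that matched was rejected
    return True       # nothing matched: accepted by default


# Hypothetical global_filter that hides every LV of a VG named vm_vg.
EXAMPLE_FILTER = ["r|^/dev/vm_vg/|", "r|^/dev/mapper/vm_vg-|"]
```

With EXAMPLE_FILTER, a device known as /dev/dm-7, /dev/vm_vg/work-private and /dev/mapper/vm_vg-work--private would be rejected, while /dev/sda2 would still be scanned.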

Regards,
Vít Šesták 'v6ak'

Vít Šesták

Nov 17, 2015, 2:15:07 PM
to qubes-users
OK, I looked at it wrong. I've solved this by the following rule:

ACTION!="remove", SUBSYSTEM=="block", ATTR{dm/uuid}=="LVM-*", ENV{DM_UDEV_DISABLE_DISK_RULES_FLAG}="1"

Maybe this is not the cleanest way to do this, but udevadm info suggests that this attempt is successful. I was slightly afraid of boot issues when some mountpoints are on LVM, but it seems to work correctly, at least in my case. It would probably cause issues if I referred to the filesystems by their internal UUIDs. However, since I refer to them by /dev/<VG name>/<LV name>, nothing seems to be broken. I haven't tested this with the root filesystem on LVM, but I hope that will also work OK.
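
The udevadm check can be scripted roughly like this (Python). It is a sketch under two assumptions: the text comes from something like udevadm info --query=property --name=/dev/dm-N, which prints KEY=VALUE lines, and the absence of ID_FS_* keys is taken as the sign that the volume's content was not probed. The function names are made up.

```python
def parse_properties(text):
    """Parse KEY=VALUE lines into a dict; lines without "=" are ignored."""
    props = {}
    for line in text.splitlines():
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value
    return props


def looks_unparsed(props):
    """True if the disable flag is set and no ID_FS_* key is present."""
    flagged = props.get("DM_UDEV_DISABLE_DISK_RULES_FLAG") == "1"
    probed = [key for key in props if key.startswith("ID_FS_")]
    return flagged and not probed
```

Running it over every dm-* device after a reboot would show whether the rule is applied early enough, which is the part I am least sure about.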

Regards,
Vít Šesták 'v6ak'

Vít Šesták

Sep 14, 2016, 2:22:18 PM
to qubes-users
Just noting two more pitfalls:

1) When you create a new device, you should overwrite all of its content (a standard mkfs is not enough) before attaching it to a VM. If you don't do so, the VM might get some old data leaked from another VM. Maybe thin LVs behave differently.

2) When booting from the Qubes installation image and trying to perform system recovery, it seems to scan all LVs, regardless of whether they are dom0 LVs or domU LVs. This is potentially dangerous (filesystem parsing bugs). And since the installation image is not updated frequently, there is an even higher probability of a known unpatched vulnerability. Maybe whether an LV should be scanned could be determined from its name.
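
For pitfall 1, the overwrite could be done roughly as follows (a minimal Python sketch, not how Qubes does it). The function name and chunk size are my own choices. It destroys everything on the given path, so it must only be pointed at a freshly created LV.

```python
import os


def zero_fill(path, chunk_size=1024 * 1024):
    """Overwrite an existing file or block device with zeros.

    Returns the number of bytes overwritten. The size is found by
    seeking to the end, which also works for block devices on Linux.
    """
    zeros = bytes(chunk_size)
    with open(path, "r+b", buffering=0) as dev:
        size = dev.seek(0, os.SEEK_END)
        dev.seek(0)
        remaining = size
        while remaining:
            # A raw write may be short, so count what was really written.
            written = dev.write(zeros[:min(chunk_size, remaining)])
            remaining -= written
        os.fsync(dev.fileno())  # make sure the zeros reach the device
    return size
```

For example, zero_fill("/dev/<VG name>/<LV name>") as root in dom0, before the volume is first attached to a VM.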

Since LVM thin volumes are to be used in Qubes 4.0, I'd like to ask you if Qubes addresses those two issues there.

Marek Marczykowski-Górecki

Sep 14, 2016, 5:29:49 PM
to Vít Šesták, qubes-users
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

On Wed, Sep 14, 2016 at 11:22:18AM -0700, Vít Šesták wrote:
> Just noting two more pitfalls:
>
> 1) When you create a new device, you should overwrite all the content (standard mkfs is not enough) before attaching it to a VM. If you don't do so, the VM might get some old data leaked from another VM. Maybe thin LVs have a different behavior.

LVM thin volumes don't have this problem, as blocks are allocated at
first write only (reading blocks not written before will yield zeros).
But we may want to clear the data anyway at VM removal, for various
reasons (like anti-forensics).

> 2) When booting from Qubes installation image and trying to perform system recovery, it seems to scan all LVs, regardless they are dom0 LVs or domU LVs. This is potentially dangerous (filesystem parsing bugs). And since the installation image is not updated frequently, there is even higher probability of a known unpatched vulnerability. Maybe it could be determined by the name if it should be scanned.

Yes, we'll probably include the same udev rules (blacklisting scanning
of VM-related devices) in the installer/recovery image as well.

> Since LVM thin volumes are to be used in Qubes 4.0, I'd like to ask you if Qubes addresses those two issues there.

Thanks for the reminder. I've created an issue so that we don't forget
about this:
https://github.com/QubesOS/qubes-issues/issues/2319

- --
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v2

iQEcBAEBCAAGBQJX2cFHAAoJENuP0xzK19cslpAH/j7fM3Z03hwBPMVf2OCtrLxL
3tAYxxchi1RDCJ8HaAO5v8orNXnrbSIBhvcTduLEyK7/STsErLeD06Y+arn03gTJ
XwkI07DziBxu/TqtN0ahz6h4ryztplJZf2L8snoPO+OMpUqQZbLuNQvOSk+BEphn
dIne8FrMTKjGerBdDt732qiHt5kdUXYQUFP6GFklXXkyJhlBVO9x+1myDu4FFf34
e4ynaSoOw6x3BH8+kMNhGLGEr1PA03hXV6+Whfj4J0grsGJEkVq8jBKAaHCt0pba
kIBjs0QUJDVPeGKzZccnitx9XJo9Dumbhk+9UYLm6izVBya7x1+jsJQVnWWW64o=
=WBMc
-----END PGP SIGNATURE-----