I thought I'd have a go at using image files on local ZFS with ganeti, à la
sanoid. Specifically I wanted ganeti to create a new zfs dataset for each VM, store the disks as image files within it, and remove the dataset when the VM is removed.
TL;DR: I got it to work using ganeti instance hooks, although I needed a small patch to ganeti to make removal work properly. I'm documenting what I did here in case it's useful to anyone else.
1. CREATING THE ZFS POOL
I decided to make a ZFS pool on top of LVM, in order to use the LVM free space, and so that I could grow the ZFS area incrementally as I migrated instances from LVM to ZFS. Boringly I called the zpool "zfs", and made a dataset "zfs/vm" to hold all the VM images on this node.
apt-get install zfsutils-linux
lvcreate -l 100%FREE --name zpool ganeti
zpool create -oashift=12 zfs /dev/ganeti/zpool
zfs set compression=lz4 zfs
zfs create zfs/vm
2. ENABLING LOCAL FILE STORAGE ON GANETI
This was very frustrating. You have to do *three* separate cluster config changes:
gnt-cluster modify --enabled-disk-templates plain,drbd,file
gnt-cluster modify --ipolicy-disk-templates plain,drbd,file
gnt-cluster modify --file-storage-dir /zfs/vm # was previously /srv/ganeti/file-storage
But that's not sufficient.
It turns out that you *also* have to edit /etc/ganeti/file-storage-paths and add `/zfs/vm` there too, and distribute it to all nodes. (I needed to add debug statements to the Python code to find out why it wasn't working!)
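To spell that out: the file is just a whitelist of allowed file-storage directories, one per line (the comment lines here are my own):

```shell
# /etc/ganeti/file-storage-paths
# Allowed file-storage directories, one per line.
/zfs/vm
```

Then push it out to the other nodes, e.g. with `gnt-cluster copyfile /etc/ganeti/file-storage-paths`.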
3. HOOKS
I wrote a small hook using 'printenv' to find out exactly what environment variables ganeti passes to the instance add and remove hooks, then used that information to write the scripts:
===> /etc/ganeti/hooks/instance-add-pre.d/zfs-create <===
#!/bin/sh -e
if [ "$GANETI_INSTANCE_DISK_TEMPLATE" = "file" -a "$GANETI_INSTANCE_PRIMARY" = "$(hostname)" ]; then
zfs create "zfs/vm/$GANETI_INSTANCE_NAME"
fi
===> /etc/ganeti/hooks/instance-remove-post.d/zfs-destroy <===
#!/bin/sh -e
if [ "$GANETI_INSTANCE_DISK_TEMPLATE" = "file" -a "$GANETI_INSTANCE_PRIMARY" = "$(hostname)" ]; then
zfs destroy "zfs/vm/$GANETI_INSTANCE_NAME"
fi
(Remember to chmod +x and to distribute to all nodes)
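For reference, the 'printenv' debugging hook I mentioned was nothing more than a dump of the environment, something like this (the log path is arbitrary), dropped into the relevant hook directory and made executable:

```shell
#!/bin/sh
# Log every GANETI_* variable this hook receives, so you can see
# exactly what ganeti makes available at each hook point.
# "|| true" because grep exits non-zero when nothing matches,
# which would otherwise make a pre-hook fail the whole operation.
env | grep '^GANETI_' | sort >> /tmp/ganeti-hook-env.log || true
```

Run a test `gnt-instance add`, then inspect the log.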
The remove-post script receives GANETI_INSTANCE_DISK0_ID containing the full file path, which could have been used to locate the dataset to remove; but the add-pre script gets no such variable, so in the add script my only option was to use the instance name. For symmetry and simplicity I used the same instance-name logic in the removal script too.
Note: if you specify --file-storage-dir when adding an instance, this value sadly does not appear in the environment variables for the add-pre hook at all, so I couldn't support that option.
This nearly worked, except for one problem: on gnt-instance remove, ganeti tries to remove the enclosing directory /zfs/vm/<instance-name>, which fails with EBUSY because it's a mounted dataset, and then the post hook never runs. I had to make a small patch to ganeti itself to tolerate this:
--- /usr/share/ganeti/2.15/ganeti/backend.py.orig 2016-03-26 07:28:51.000000000 +0000
+++ /usr/share/ganeti/2.15/ganeti/backend.py 2018-04-29 16:40:48.000000000 +0100
@@ -4551,7 +4551,10 @@
     try:
       os.rmdir(file_storage_dir)
     except OSError, err:
-      _Fail("Cannot remove file storage directory '%s': %s",
+      if err.errno == os.errno.EBUSY:
+        pass
+      else:
+        _Fail("Cannot remove file storage directory '%s': %s",
             file_storage_dir, err)
Now you can do the following:
gnt-instance add -n <nodename> -t file -o noop --no-start --no-install --no-name-check --no-ip-check --disk 0:size=1G --disk 1:size=2G foobar
...
gnt-instance remove foobar
This creates the zfs/vm/foobar dataset with the sparse image files inside it, and removes the dataset again when the instance is removed.
4. MIGRATING INSTANCES FROM LVM TO ZFS
This is made awkward because:
a. ganeti does not yet support instance conversion from -t plain to -t file
b. ganeti -t file does not support adoption of existing files
So what I did was:
- stop the instance
- if drbd, change to plain (gnt-instance modify -t plain xxx)
- rename it from xxx to xxx-old
- create a new instance xxx, copying relevant backend/disk/NIC settings from existing instance, e.g.
# copy minmem/maxmem/vcpus from the existing instance
gnt-instance add -o <ostype> --no-install --no-start \
  -B minmem=NNN,maxmem=NNN,vcpus=NNN \
  -t file -s <disk size>G \
  -n <node> xxx
- copy the instance image disk(s) from LVM to overwrite the sparse file(s) (qemu-img is useful for this); get the old LVM path with UUID from gnt-instance info.
qemu-img convert -f raw -O raw /dev/ganeti/XXXX.disk0_data /zfs/vm/xxx/YYYY.file.disk0
# repeat if image has more than one disk
- start the new instance, check it's all working fine
- now you can release the old storage:
gnt-instance remove xxx-old
and on the node where the storage was used, you can hand it over to ZFS:
lvextend -l +100%FREE /dev/ganeti/zpool
zpool online -e zfs /dev/ganeti/zpool
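Since the disk copy is the step where a typo destroys data, it helps to generate the qemu-img commands first and eyeball them before running anything. A throwaway sketch (the instance name and UUID placeholders are illustrative; take the real device paths and file names from gnt-instance info):

```shell
#!/bin/sh -e
# Print (not run) one qemu-img copy command per disk, for review.
# Replace UUID_0/UUID_1 with the identifiers shown by "gnt-instance info".
INSTANCE=xxx
for i in 0 1; do        # one iteration per disk
  echo qemu-img convert -f raw -O raw \
    "/dev/ganeti/UUID_$i.disk${i}_data" \
    "/zfs/vm/$INSTANCE/UUID_$i.file.disk${i}"
done
```

Once the output looks right, pipe it through `sh` or paste the commands back in by hand.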
Tedious, but doable.
5. SNAPSHOTTING AND REPLICATING ZFS DATASETS
This is left as an exercise for the reader. My approach is to run the VM hosts on SSDs, replicating periodically to a backup host with HDDs.
I tried the following tools, which all do the job:
* sanoid + syncoid is, I think, the best option. sanoid makes periodic snapshots according to a configured policy, and syncoid does the replication; it can replicate individually listed datasets or recurse under a parent dataset. syncoid supports both "push" and "pull" replication, but needs to be able to ssh to the remote host as root. sanoid also provides nagios health-check flags (`--monitor-health`, `--monitor-snapshots`).
Limitation: it can't take snapshots more often than hourly.
* zfsnap + simplesnap are in the default Ubuntu/Debian repos. zfsnap creates the rotating snapshots, and simplesnap is the replicator. simplesnap supports only "pull" replication, i.e. the target backuphost has to make an outbound ssh connection to the source host; but on the source host it only needs to be able to run a safe wrapper script.
Simplesnap does not copy any existing snapshots as part of its initial replication. However, any snapshots created subsequently *are* copied; and if they are later removed, they are not deleted from the receiving side, so this may leave manual cleanup to be done.
As the manpage says:
> simplesnap will transfer snapshots made by other tools, but will not
> destroy them on either end.
> ...
> Since simplesnap sends all snapshots, it is possible that locally-created
> snapshots made outside of your rotation scheme will also be sent to your
> backuphost
A single property `org.complete.simplesnap:exclude` can be used to exclude certain datasets from replication; but otherwise all datasets are replicated under `destination/hostname/...`
* zfs-auto-snapshot + znapzend. I hated znapzend: it has bits in C and Perl, all its configuration is stored in zfs properties, and there's a horribly complex tool just to configure it. The benefit of this one is that it supports pre- and post-snapshot hooks, so for example you can quiesce a VM before snapshotting it and let it continue afterwards, ensuring the filesystem inside each snapshot is consistent. (But I don't think it would be too hard to modify sanoid and syncoid to do that.)
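To make the sanoid + syncoid option concrete, a minimal setup might look something like this; the dataset names, retention counts, and host names are just examples, not a recommendation:

```shell
# /etc/sanoid/sanoid.conf (excerpt) -- snapshot policy for the VM datasets
[zfs/vm]
        use_template = production
        recursive = yes

[template_production]
        hourly = 36
        daily = 30
        monthly = 3
        autosnap = yes
        autoprune = yes
```

Then, on the backup host, pull everything under zfs/vm with something like `syncoid --recursive root@vmhost:zfs/vm hddpool/backup/vmhost`.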
----------
TODO: I now need to check how ganeti handles instance moves with -t file, and probably need to add a new hook to support creating and removing datasets as part of the move.
To minimise downtime, what I'd actually rather do is:
- sync the zfs filesystem to the target node while the instance is running
- shut down the instance
- do a final resync to pick up just the latest changed blocks
- start the new instance
But I don't think ganeti will let me do that unless I treat the filesystem as -t sharedfile, and then it would be up to me to perform the copying outside of ganeti itself.
Anyway, I hope there's something useful in the above. It looks like a pretty decent way to run VMs, although I think it would be a little better if ganeti could use qcow2 files rather than sparse raw image files.
Cheers,
Brian.