ZFS as external storage working


John N.

May 6, 2016, 5:14:15 AM
to ganeti
Hi ganeti fans,

I finally had some time to tackle the topic of ZFS as external storage for ganeti and got it working on Debian 8 with Ganeti 2.12 (official Debian packages). For that purpose I used the ZFS extstorage provider mentioned in the ganeti wiki (https://github.com/ffzg/ganeti-extstorage-zfs), but unfortunately this provider is experimental and unmaintained, and my many attempts to contact the author failed. So I decided to fork it and maintain it myself; my fork is available here:

https://github.com/hostingnuggets/ganeti-extstorage-zfs

For now I added and fixed the following:

- detailed logging of all commands (useful for debugging and dev/test)
- fixed "gnt-instance replace-disks" and "gnt-instance activate-disks", which were not working at all
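For reference, these are typical invocations of the two commands the fork fixes; the instance name below is made up:

```shell
# Hypothetical instance name; "-s" replaces the disks on the secondary node.
gnt-instance activate-disks instance1.example.org
gnt-instance replace-disks -s instance1.example.org
```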

So in case you would also like to use ZFS with ganeti, it is possible, and I am hoping to put my first ganeti-on-ZFS cluster into production very soon. Feel free to use this extstorage provider, test it, and report back if you find any issues. If I remember correctly, a few other people in this group were, like me, very interested in using ZFS with ganeti.

Regards,
J.

Phil Regnauld

May 6, 2016, 6:38:13 AM
to gan...@googlegroups.com
John N. (hostingnuggets) writes:
>
> I finally had some time to tackle the topic of ZFS as external storage for
> ganeti and got it working on Debian 8 with Ganeti 2.12 (official Debian
> packages). For that purpose I used the ZFS extstorage provider mentioned in
> the ganeti wiki (https://github.com/ffzg/ganeti-extstorage-zfs) but
> unfortunately this provider is experimental and unmaintained. Also my many
> attempts to contact the author failed.

Hi John,

I tried too and got no response.

> So I just decided to fork it and
> will maintain it in my own fork which is available here:
>
> https://github.com/hostingnuggets/ganeti-extstorage-zfs
>
> For now I added and fixed the following:
>
> - detailed logging of all commands (useful for debugging and dev/test)
> - fixed "gnt-instance replace-disks" and "gnt-instance activate-disks"
> which was not working at all

That's awesome.

> So in case you also would like to use ZFS with ganeti it is possible and I
> am hoping to put my first ganeti on ZFS cluster out in production very
> soon. Feel free to also use this extstorage provider test it and comment
> back if you find any issues. If I remember correctly I few other people in
> this group like me were very interested in using ZFS with ganeti.

I will definitely be looking at it, with 16.04 LTS offering ZFS nearly
out of the box. I've done performance testing with LZ4 compression, and
it's impressive.

John N.

May 6, 2016, 8:23:13 AM
to ganeti


On Friday, May 6, 2016 at 12:38:13 PM UTC+2, Phil Regnauld wrote:
        I will definitely be looking at it, with 16.04 LTS offering ZFS nearly
        out of the box. I've done performance testing with LZ4 compression, and
        it's impressive.

Great, and I also read that Debian 9 should get ZFS out of the box, which will make things easier... I am using LZ4 compression too, with enterprise SSD disks on a ZFS RAIDZ volume, and get roughly 950 MB/s read speed and around 200 MB/s write speed. These speeds are measured from within a ganeti instance, i.e. reading/writing through a ZFS zvol.
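For anyone wanting to reproduce a similar layout, a rough sketch (the pool name and device paths are made up; adjust to your hardware):

```shell
# Hypothetical pool/devices: four-disk RAIDZ with LZ4 compression.
zpool create -o ashift=12 tank raidz /dev/sda /dev/sdb /dev/sdc /dev/sdd
zfs set compression=lz4 tank
zfs get compressratio tank   # shows how well LZ4 is doing on real data
```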

Regards
J.

insrc

Oct 15, 2016, 11:34:19 AM
to gan...@googlegroups.com


On Fri, May 6, 2016 at 11:14 AM, John N. <hosting...@gmail.com> wrote:
Hi ganeti fans,


Hi John,

 
I finally had some time to tackle the topic of ZFS as external storage for ganeti and got it working on Debian 8 with Ganeti 2.12 (official Debian packages). For that purpose I used the ZFS extstorage provider mentioned in the ganeti wiki (https://github.com/ffzg/ganeti-extstorage-zfs) but unfortunately this provider is experimental and unmaintained. Also my many attempts to contact the author failed. So I just decided to fork it and will maintain it in my own fork which is available here:

https://github.com/hostingnuggets/ganeti-extstorage-zfs

For now I added and fixed the following:

- detailed logging of all commands (useful for debugging and dev/test)
- fixed "gnt-instance replace-disks" and "gnt-instance activate-disks" which was not working at all


Thanks for your hard work on this fork!
 
So in case you also would like to use ZFS with ganeti it is possible and I am hoping to put my first ganeti on ZFS cluster out in production very soon. Feel free to also use this extstorage provider test it and comment back if you find any issues. If I remember correctly I few other people in this group like me were very interested in using ZFS with ganeti.

 
I have yet to test it extensively, but I'm just wondering whether you have put the ZFS cluster you're talking about into production and, if so, what your experience with it has been so far :-)

Is the ext storage interface in general, and the ZFS ext storage in particular, as robust as the standard plain and drbd backends?

Any gotchas to look out for?


I've been running a small Ganeti cluster in production for years, but so far I have only used the classic plain (aka lvm) and drbd storage backends.
As ZFS on Linux seems production-ready, I'm thinking that some of the more disk-I/O-demanding instances could take advantage of ZFS's storage tiering by using SSDs for the L2ARC & SLOG.
And if ZFS could also be used as a backend for DRBD, that would be terrific.

Thanks !

Regards,

John N.

Oct 15, 2016, 4:58:06 PM
to ganeti
Hi insrc,

Glad this can also help others.

So yes, I am using my fork of ganeti-extstorage-zfs on a small production cluster of two nodes running Debian 8 with around 15 instances, and have been for a few months now. I find the ext-storage backend robust, and you can do pretty much everything as with the other backends; it's just up to you to implement it. Below I have listed my gotchas and experience from these few months in production.

1) Two DRBD resources suddenly got disconnected for 2 seconds for no apparent reason; on one node I got:

[Mon Aug  1 19:10:13 2016] drbd resource0: meta connection shut down by peer.

and then on the other node:

[Mon Aug  1 19:10:19 2016] drbd resource0: PingAck did not arrive in time.

Could never explain why but it happened only once.

2) ZFS L2ARC does not bring any performance advantage in my specific case of using exclusively SSD disks. I bought an additional expensive NVMe SSD disk just for the L2ARC, but it turned out to be a waste of money. In fact I believe I even had slightly worse performance, due to all the context switches and the "l2arc_feed" kernel threads running all the time to populate the L2ARC. Furthermore, I first made the mistake of having a huge L2ARC of 100 GB; in that case my ARC was rendered useless, as it filled up with pointers to the L2ARC and nothing else. Take great care when this happens: your whole system gets very slow and you have a high load on the server. Once it was so bad I had to reboot the server. So if you still want to use an L2ARC, make sure it is not much bigger than your ARC; I would say your L2ARC should not be larger than 2-3x your ARC. I had around 8 GB reserved for the ARC, and having an L2ARC of 100 GB was just plain naive. In the end I simply recommend not using an L2ARC at all, which is what I do now.
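The sizing rule above can be sketched as a small check (the 8 GB / 100 GB numbers are the ones from my setup):

```shell
#!/bin/sh
# Sanity-check an L2ARC size against the "no more than 2-3x ARC" rule of thumb.
arc_gb=8          # RAM reserved for the ARC
l2arc_gb=100      # proposed L2ARC device size
max_l2arc_gb=$((arc_gb * 3))
if [ "$l2arc_gb" -gt "$max_l2arc_gb" ]; then
    echo "L2ARC too big: ${l2arc_gb} GB > ${max_l2arc_gb} GB (3x ARC)"
fi
```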

3) If you are running Xen and ZFS together, there is a rule of thumb (I think I found it in the ZFS on Linux wiki) which says you should allocate at most 40% of your hypervisor's RAM to the ARC. As such, my hypervisor now has 16 GB of RAM and I have reserved 6 GB of it for the ARC. It seems a bit of a waste, but memory is cheap and with that rule of thumb I am on the safe side.
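As a sketch, the 40% rule translates into a zfs_arc_max setting along these lines (values from my 16 GB node):

```shell
#!/bin/sh
# Cap the ARC at roughly 40% of hypervisor RAM (rule of thumb above).
ram_gb=16
arc_gb=$((ram_gb * 40 / 100))                 # 6 GB on a 16 GB node
arc_bytes=$((arc_gb * 1024 * 1024 * 1024))
# This line would go into /etc/modprobe.d/zfs.conf (takes effect after
# reloading the zfs module or rebooting):
echo "options zfs zfs_arc_max=${arc_bytes}"
```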

4) Around 3 months ago there was a Debian upgrade of the LVM package. This upgrade was a nasty one for me, as for some reason it deleted all the symbolic links to the /dev/zd* devices from the /dev/ffzgvg directory. Luckily I found a gnt-instance or gnt-node command that recreated them for me (maybe it was repair-disks, I can't remember exactly). So now when I see a new package upgrade for LVM, I am more alert.

5) I never really got to understand why the zvol device numbers increment by 16, such as /dev/zd0, /dev/zd16, /dev/zd32...

6) I have the general feeling that ZFS generates a lot of context switches: on a node with 8 instances I have a daily average of 2k context switches, and it is not uncommon to see spikes of 30k. I am not sure what generates these huge spikes, but I don't see that many context switches on an equivalent ganeti cluster using hardware RAID.

7) Again regarding the ARC, I would like to try switching the ZFS primarycache parameter for a zvol to "metadata" instead of "all". This means the ARC would contain just metadata, not the data itself. I think this makes sense, as the data is already cached inside the instance by the guest OS. I was even thinking of not having any ARC at all, but I never tested that.
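If you want to try it, the (untested) idea boils down to the following; the zvol name is hypothetical:

```shell
# Untested: keep only metadata in the ARC for a zvol, since the guest OS
# already caches the data itself.
zfs set primarycache=metadata tank/vm-disk0
zfs get primarycache tank/vm-disk0    # verify the setting
```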

I would be interested to read about your experience with ganeti and ZFS as storage using my fork of ganeti-extstorage-zfs. Let me know how it works for you once you set it up.

Good luck and enjoy!

Regards,
J.

insrc

Oct 16, 2016, 11:26:08 AM
to gan...@googlegroups.com
On Sat, Oct 15, 2016 at 10:58 PM, John N. <hosting...@gmail.com> wrote:
Hi insrc,

Glad this can also help others.

Hi John,

Thanks a lot for taking the time to share your feedback :-)
 
So yes, I am using my fork of ganeti-extstorage-zfs on a small production cluster of two nodes running Debian 8 with around 15 instances, and have been for a few months now. I find the ext-storage backend robust, and you can do pretty much everything as with the other backends; it's just up to you to implement it. Below I have listed my gotchas and experience from these few months in production.

Oh no, implement things? I'm not good at that :-/
Did you have to implement things that were available by default on the classic storage backends?

1) Two DRBD resources suddenly got disconnected for 2 seconds for no apparent reasons, on the one node I got:


Good, that means DRBD over Zvol is actually supported :) 
 (...)

2) ZFS L2ARC does not bring any performance advantages in my specific case of using exclusively SSD disks. I got an additional expensive NVMe SSD disk just for L2ARC but it showed me that it was just a waste of money. In fact I even believe that I had slightly worse performance due to all of the context switches and "l2arc_feed" kernel threads running all the time to populate the L2ARC. Furthermore I first made the mistake of having a huge L2ARC of 100 GB, in that specific case my ARC was rendered useless as it got full by containing pointers to the L2ARC and nothing else. When this happens take great care, your whole systems starts to get very slow and you have a high load on the server. Once it was so bad I had to reboot the server. So if you still want to use a L2ARC make sure it is not much bigger than your ARC, I would say your L2ARC should not be larger than 2-3x your ARC. I had around 8 GB reserved for the ARC and having an L2ARC of 100 GB was just plain naive. Finally I simply recommend not to use an L2ARC at all, which is my case now.


Oh? Now that you mention it, even without all-flash storage it seems that the ZFS L2ARC is pretty much useless, according to this discussion on the zfs ML, as most of the caching is done at the guest level: http://list.zfsonlinux.org/pipermail/zfs-discuss/2016-September/026318.html

As for the benefit of an SSD SLOG device for a Ganeti cluster with mechanical drives as primary storage, I guess it would only help guest write performance in the really narrow use case of a cluster with VMs doing a lot of synchronous writes... which covers only a tiny fraction of the guests I'm running now (basically database VMs).

Hmm, thanks to your remarks I realize I was really naive to think that ZFS would magically boost the performance of my current cluster with its all-mechanical-drive storage, and I'm wondering whether building a new cluster based on the ZFS ext-storage backend is worth the time and effort it would require :-/

I'm no performance analyst, but I guess the workload of a virtualization node is mostly random reads & writes, and if so, ZFS's storage tiering won't make any difference performance-wise; the best and easiest way to boost guest disk I/O is to invest in flash drives like you did.

 
3) If you are running Xen and ZFS together there is a rule of thumb (I think I found that one in the Wiki of ZFS on Linux) which says that you should allocate maximum 40% of your hypervisor RAM to your the ARC. As such my hypervisor has now 16 GB of RAM and I have reserved 6 GB of RAM to the ARC. It seems to be a bit of a waste but memory is cheap and with that specific rule of thumb I am on the safe side.


I'll be running KVM, but the zfs folks using Xen as their hypervisor in the previously linked discussion agree with you.


4) There was around 3 months ago a Debian upgrade for the LVM package. This upgrade was a nasty one for me as it for some reasons deleted all the symbolic links to the /dev/zd* devices from the /dev/ffzgvg directory. Luckily enough I found a way through a gnt-instance or gnt-node command to recreate them for me (maybe it was repair-disks, can't remember exactly). So now if I see an new package upgrade for LVM I will be more alert.

 
Something to be aware of for sure, thanks :-) !
 
5) I never really got to understand why the zvol device numbers increment by 16, such as /dev/zd0, /dev/zd16, /dev/zd32...

6) I have the general feeling that ZFS generates a lot of context switches, on a node with 8 instances I have a daily average of 2k context switches. It is not uncommon to see spikes at 30k context switches. Not sure really what generates these huge spikes but I don't have so many context switches on an equivalent ganeti cluster using hardware RAID.

7) Again regarding ARC I would like to try out switching the ZFS primarycache parameter for a zvol to "metadata" instead of "all". This means the ARC would just contain metadata and not the whole data. I think this makes sense as the data is already cached instance an instance by the OS. I was even thinking of maybe not having any ARC at all. But never tested it.


primarycache="metadata" is also what was recommended for this kind of workload in the linked ZFS ML thread :-)

 
I would be interested to read about your experience with ganeti and ZFS as storage using my fork of ganeti-extstorage-zfs. Let me know how it works for you once you set it up.


Well, as it was the (supposed) performance gains, more than the manageability features (like snapshotting), that made me consider ZFS on Linux for my new Ganeti cluster, I'm not really sure I'll be brave enough to run it in production now that you've made me realize that ZFS won't help on this front :-/
Running nodes with hybrid storage (one big VG with all the mechanical drives, and a small second VG with SSD drives dedicated to the VMs requiring solid I/O performance) would make more sense in my case, I think, if I can't afford an all-flash node setup.

Anyway, thanks a lot for your work on the ZFS ext-storage backend and your precious feedback on it.
Regards,


Brian Foley

Oct 17, 2016, 6:33:45 AM
to gan...@googlegroups.com
On Sun, Oct 16, 2016 at 05:25:46PM +0200, insrc wrote:
> On Sat, Oct 15, 2016 at 10:58 PM, John N. <[1]hosting...@gmail.com>
> wrote:
>
> Hi insrc,
> Glad this can also help others.
>
> Hi John,
> Thanks a lot for taking the time to share your feedback :-)
>
> So yes I am using my fork of ganeti-extstorage-zfs with a small
> production cluster of two nodes running Debian 8 with around 15
> instances and this since a few months now. I find the ext-storage
> backend robust and you can do pretty much everything as with the other
> backends, it's just up to you to implement it. Below I have listed
> listed my gotchas and experience based on this few months in
> production.

Hi John, Phil,

this is great. Would you be interested in adding this into the stable-2.16
branch in the main ganeti repo rather than maintaining a separate repo?

We don't have the engineer bandwidth to officially support it at the moment,
but as a starting point we could create a contribs/extstorage/zfs dir, and put
all these scripts there.

Cheers,
Brian.

John N.

Oct 17, 2016, 1:34:26 PM
to ganeti


On Sunday, October 16, 2016 at 5:26:08 PM UTC+2, insrc wrote:

Oh no, Implement things ? I'm not good at that :-/
Did you have to implement things that was available by default on the classic storage backend please ? 


I might have scared you here for nothing... I thought you were asking how good the extstorage backend is if you have to create a new provider on your own, but of course by using gnt-extstorage-zfs you don't need to implement anything, as it is already implemented. It's basically just a matter of installing the files from gnt-extstorage-zfs into the right ganeti directories on all your ganeti nodes.


Good, that means DRBD over Zvol is actually supported :) 

Yep, I can confirm that.
 

Oh ? Now that you mention it, even without an all flash storage, it seems that ZFS L2ARC is pretty much useless according to this discussion on the zfs ML as most of the caching is done at the guest level: http://list.zfsonlinux.org/pipermail/zfs-discuss/2016-September/026318.html 

Actually, one of the people writing in that specific thread is me ;-) So yes, IMHO the ZFS L2ARC is useless when you use your zvols for hosting virtual machines. Better to give your virtual machines more RAM, since Linux does its own caching; the nearer the cache, the better, and that way you avoid many unnecessary context switches.
 

As for the benefit of an SSD SLOG device for a Ganeti cluster with mechanical drive as primary storage, i guess that it would only help guest write performance in a really narrow use case of a cluster with VMs doing a lot of synchronous write...which is only a tiny fraction of the guest i'm running now (basically database VMs)

I can't say much about the ZFS SLOG, as I never ran any performance tests on it, but in general I would always use an SSD SLOG device (mirrored, of course).
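For completeness, adding a mirrored SLOG to an existing pool looks roughly like this (the pool and device names are made up):

```shell
# Mirrored SLOG on two hypothetical NVMe devices.
zpool add tank log mirror /dev/nvme0n1 /dev/nvme1n1
zpool status tank    # the "logs" section should now list the mirror
```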
 

Hum, thanks to your remarks, i was really naive thinking that ZFS would magically boost the performance of my current cluster and its all mechanical drive storage and i'm just wondering if building a new cluster based on the ZFS ext-storage backend is worth the time and effort it would require :-/ 

As mentioned, I might have scared you before. Using ZFS with ganeti is not much more effort. You basically set up your ZFS zpool as you would set up a RAID array with a hardware RAID card, then deploy gnt-extstorage-zfs on all your nodes, and finally, when initializing your ganeti cluster, enable the ext storage backend (e.g. using the following parameters: "--enabled-disk-templates plain,drbd,ext --ipolicy-disk-templates plain,drbd,ext").
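In other words, the whole setup is roughly the following; the pool name, device list, and cluster name are placeholders, while the disk-template flags are the ones quoted above:

```shell
# 1) create the zpool (stands in for the hardware RAID array)
zpool create tank raidz /dev/sda /dev/sdb /dev/sdc /dev/sdd
# 2) install the gnt-extstorage-zfs files on every node (see the repo), then:
gnt-cluster init --enabled-disk-templates plain,drbd,ext \
    --ipolicy-disk-templates plain,drbd,ext cluster.example.org
```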

 
I'm no performance analyst but i guess that the workload of a virtualization node is mostly  random reads & writes  and if so, ZFS tiered storage feature won't make any difference performance wise and  the best and easiest way to boost guest disk I/O operation is to invest in flash drives like you did. 

In general I avoid mechanical drives these days, except for bulk data storage or backups. We are now in the all-SSD/flash era...
 

I'll be running KVM but the zfs folks using Xen as hypervisor on the previously linked discussion agree with you 


Exactly - this also applies to KVM.
 

Well, as it was the performance gains (i thought) more than the manageability features (like snapshotting) that made me think about zfs on linux for my new Ganeti cluster, i'm not really sure that'll be brave enough to run it in production now that you make me realize that ZFS won't be of any help on this front :-/ 
Running nodes with hybrid storage (one big VG with all mechanical drives and a small second VG with SSD drives dedicated for the VMs requiring solid I/O perf ) would  make more sense in my case i think if i can't afford to go with the all flash nodes setup


I don't use ZFS snapshotting myself yet, and it wasn't my first criterion for choosing ZFS with ganeti. Don't forget that with ZFS you don't need an expensive RAID card. I would typically buy expensive LSI MegaRAID cards with lots of memory, a battery backup unit, and maybe even a license for SSD caching; with that kind of RAID card you easily reach 1000 EUR. ZFS does all of that for you, so the money saved on the RAID card can be invested in SSD disks.
 
Anyway, thanks a lot for you work on the ZFS ext-storage backend and your precious feedback on it.


You're welcome, and let us all know if you change your mind and give ZFS a try :)

Regards
J.

John N.

Oct 17, 2016, 1:38:53 PM
to ganeti


On Monday, October 17, 2016 at 12:33:45 PM UTC+2, Brian Foley wrote:

Hi John, Phil,

this is great. Would you be interested in adding this into the stable-2.16
branch in the main ganeti repo rather than maintaining a separate repo?

We don't have the engineer bandwidth to officially support it at the moment,
but as a starting point we could create a contribs/extstorage/zfs dir, and put
all these scripts there.

Cheers,
Brian.

Hi Brian,

Sure, I would definitely be happy to contribute. I'm quite busy in general, but I will try to find some time if you guide me through the steps of adding gnt-extstorage-zfs to a contrib dir of ganeti.

Cheers,
J.

Phil Regnauld

Oct 17, 2016, 1:43:04 PM
to gan...@googlegroups.com
John N. (hostingnuggets) writes:
>
> > Good, that means DRBD over Zvol is actually supported :)
>
> Yep, I can confirm that.

I'm curious - how do you go about creating a drbd type instance when
the storage is ZFS? I'm sure the answer is in the documentation, so I'll
go and read it (again - it's been a while since I last tried zfs ext storage),
and report back here unless someone beats me to it :)


John N.

Oct 17, 2016, 4:00:26 PM
to ganeti

On Monday, October 17, 2016 at 7:43:04 PM UTC+2, Phil Regnauld wrote:
        I'm curious - how do you go about creating a drbd type instance when
        the storage is ZFS ? I'm sure the answer is in the documentation, so I'll
        go and read it (again - it was a while since I last tried zfs ext storage),
        and report here unless someone beats me to it :)

Hi Phil,

Because the ganeti-extstorage-zfs provider actually "hijacks" the LVM commands, you simply create an instance as you are used to doing (for example with type DRBD); nothing changes for the gnt-instance commands. As an example, have a look at the lvcreate shell script which wraps LVM's lvcreate binary:

https://github.com/hostingnuggets/ganeti-extstorage-zfs/blob/master/sbin/lvcreate
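To make the interception idea concrete, here is a toy sketch. This is NOT the real sbin/lvcreate from the repo; the zfs/<vg>/<name> naming and the option handling are simplified assumptions:

```shell
#!/bin/sh
# Toy sketch: translate "lvcreate -L <size> -n <name> <vg>" into the
# equivalent zvol creation command.
fake_lvcreate() {
    OPTIND=1
    while getopts "L:n:" opt; do
        case $opt in
            L) size=$OPTARG ;;
            n) name=$OPTARG ;;
        esac
    done
    shift $((OPTIND - 1))
    # print instead of executing; the real wrapper would run zfs(8) here
    echo zfs create -V "$size" "zfs/$1/$name"
}
fake_lvcreate -L 10G -n disk0_data testvg
```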

So it all boils down to installing the provider and enabling the ext storage disk template in your ganeti cluster, as done here:

https://github.com/hostingnuggets/ganeti-extstorage-zfs/blob/master/install/1-enable-ext-template.sh

Regards,
J.

candlerb

Nov 9, 2017, 5:58:35 AM
to ganeti
On Monday, 17 October 2016 21:00:26 UTC+1, John N. wrote:

Because the ganeti-extstorage-zfs provider actually "hijacks" the LVM commands, you simply create an instance as you are used to doing (for example with type DRBD); nothing changes for the gnt-instance commands. As an example, have a look at the lvcreate shell script which wraps LVM's lvcreate binary:

https://github.com/hostingnuggets/ganeti-extstorage-zfs/blob/master/sbin/lvcreate


I saw that.

I thought: "Ganeti really just shells out to lvm commands to do work? Ewwww."   But I suppose the native APIs are just too messy to deal with.

John N.

Nov 10, 2017, 4:29:21 PM
to ganeti
You are totally right: ganeti just execs various LVM commands (lvs, lvcreate, etc.), just as it does for the hypervisor (like running "xl" for Xen). You can observe the commands being run by tailing the ganeti node-daemon.log log file.
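For example, on a Debian node (default log path), something along these lines shows the wrapped LVM calls as they happen:

```shell
# Follow the node daemon log and pick out LVM invocations.
tail -f /var/log/ganeti/node-daemon.log | grep -E 'lv(s|create|change|remove)'
```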

Best,
J.

Phil Regnauld

Nov 14, 2017, 4:45:14 PM
to ganeti
I've just tested this on a couple of nested VMs - very neat. It works as documented, and everything I tested worked smoothly. The only thing I couldn't get working
(and which is probably not supported) was having LVM on one side and ZFS on the other - even with the "volume group" name being identical, it complained about not being able to calculate space:

Node A:

# zfs list

NAME                                                         USED  AVAIL  REFER  MOUNTPOINT
zfs                                                          719M  24.1G    19K  /zfs
zfs/testvg                                                   718M  24.1G    19K  /zfs/testvg
zfs/testvg/3c800107-6d4f-45cf-9e2e-a39b2742503b.disk0_data   718M  24.1G   718M  -

Node B:

# vgs

  VG     #PV #LV #SN Attr   VSize  VFree
  testvg   1   0   0 wz--n- 21.00g 21.00g


# gnt-instance modify -t drbd -n nodeb t-drbd
Failure: prerequisites not met for this operation:
error type: environment_error, error details:
Can't compute free disk space on node nodeb for vg testvg, result was 'None'

... but I guess that's a bit pushing it :)

John N.

Nov 22, 2017, 12:32:24 PM
to ganeti
Hi Phil,

Nice you gave it a shot and happy to hear your tests worked fine.

That's correct: currently you can't have mixed nodes, with some using ZFS and others LVM. I think it would not be too difficult to adapt the ext storage provider for that, but I am not sure such a storage-heterogeneous cluster is a good idea. Of course for testing purposes it's just fine :)

In case you or anyone else fancies fixing/extending/etc., feel free to fork https://github.com/brigriffin/ganeti-extstorage-zfs

Best,
John

Phil Regnauld

Nov 22, 2017, 3:00:32 PM
to gan...@googlegroups.com
John N. (hostingnuggets) writes:
>
> In case you or anyone else fancies fixing/extending/etc., feel free to fork
> https://github.com/brigriffin/ganeti-extstorage-zfs

Yep - I'm thinking of making it possible to enable zfs as an ext storage
type, and the LVM emulation bit, separately - or at least see if there's
a less intrusive way than by replacing lv* commands, which will be
overwritten by a package upgrade.

John N.

Nov 30, 2017, 1:29:18 PM
to ganeti
That would be a great initiative. I personally find it both annoying and confusing that all the LVM commands are replaced with wrappers. As you rightly mentioned, a package upgrade of LVM will remove them, and they have to be manually re-installed. That happened to me once last year, and now I always check whether the lvm package has a pending upgrade.
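A crude way to automate that check on Debian (the -s flag only simulates the upgrade, so this is safe to run from cron; the warning text is mine):

```shell
# Warn when an lvm2 upgrade is pending, before it clobbers the wrappers.
if apt-get -s upgrade 2>/dev/null | grep -q '^Inst lvm2'; then
    echo "lvm2 upgrade pending: re-install the ganeti-extstorage-zfs wrappers afterwards"
fi
```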

Best,
J.