
RFC: new zfs based sync/failover tool


Philip Brown

Feb 21, 2012, 2:09:25 PM
I've finally gotten sick enough of what's lacking in the commercial
space, and of the surprising dearth in the free space, to write my own
zfs replication tool.
What I have seen out there already, I consider "good, but not good
enough". There are a few "zfs replication" type scripts out there, but
the ones I've seen don't close the loop fully for someone considering
setting up some kind of active/passive failover system pair.
So... I figured I'd write one myself.

Before I get TOO deep into the coding guts, I'm going to post my
feature plan, and "request comments" on it.

A bit of background, for those who don't know me due to my long absence
from here: I'm the author of pkg-get, among other things. So, you can
expect this to be handled as serious software, rather than some fluff
piece.

I may have forgotten to put some of my ideas down in words, but I
think I've written down most of the "big ideas".

Details follow:
:::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::

WIP: feature design doc for "zrep", a zfs based replication program.
This goes one step beyond other replication utils I've seen, in that
it explicitly targets the concept of production "failover".
This is meant to be "enterprise product" quality, rather than merely
a sysadmin's tool.

# Design goals:
# 1. Easy to configure
# 2. Easy to use
# 3. As robust as possible
# 3.1 Will not be harmful to run every minute, even when the WAN is down.
#     (Will need safety limits on # of snapshots and filesystem space free?)
# 4. Well documented

# Limitations (mostly for ease-of-use reasons):
# Uses "short hostname", not FQDN, in snapshot names; automatically truncates.
# Only one copy destination per filesystem-remotehost combination allowed.
# Stores configuration in filesystem properties of snapshots.
# Need to figure out some sort of "locking". Possibly via filesystem properties.

Usage:

zrep -i/init ZFSfs remotehost destfs == create initial snapshot.
     Should do lots of sanity checks, both local and remote.
     SHOULD it actually do the first sync as well? ....
     Should it allow a hand-created snapshot?
     If so, specify the snap as the ZFSfs arg.

zrep -S/sync ZFSfs remote destfs  # copy/sync after initial snapshot created
zrep -S/sync all  # special case: copies all zfs filesystems that have
                  # been initialized.

zrep -C/changedest ZFSfs remotehost destfs  # changes configs for given ZFSfs
zrep -l/list (ZFSfs ...)  # list existing configured filesystems, and
                          # their config
     # Should it also somehow list INCOMING zrep-synced stuff?
     # Or use a separate option for that? Possibly -L

zrep clear ZFSfs  # clear all configured replication for that fs
zrep clear ZFSfs remotehost  # clear configs for just that remotehost

zrep failover ZFSfs@snapname  # Changes sync direction to non-master
     # Can be run from EITHER side? Or should it be made context-sensitive?

Initial concept of "failover":
     First, ensures that the snapshot exists on both sides.
     (Should it allow hand-created snapshots?)
     Then configures the snapshot on the non-master side, with proper
     naming/properties.
     Renames the snapshot pair to reflect the new direction.
     REMOVES other snapshots for the old outgoing direction.
     At completion of this operation, there will be only 1 zrep-recognized
     snapshot on either side, which will serve as the initial point of sync.
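The failover steps above might be sketched roughly as follows. This is not the actual (as yet unwritten) zrep code; the zrep:master property and all names here are hypothetical placeholders, and every zfs invocation is echo-prefixed so the script only prints what it would run:

```shell
#!/bin/sh
# Hedged sketch of the failover sequence described above.
# "$ZFS" expands to "echo zfs", so commands are printed, never executed.
# The zrep:master property name and snapshot names are hypothetical.
ZFS="echo zfs"

failover_sketch() {
    fs=$1      # e.g. pool1/prodfs
    snap=$2    # e.g. zrep_host1_host2_000042
    # (1) verifying the snapshot exists on both sides is elided here
    # (2) demote this side and make it read-only
    $ZFS set zrep:master=no "$fs"
    $ZFS set readonly=on "$fs"
    # (3) rename the snapshot to reflect the new direction
    $ZFS rename "$fs@$snap" "$fs@${snap}_newdir"
    # (4) destroying other snapshots for the old direction is elided
}

failover_sketch pool1/prodfs zrep_host1_host2_000042
```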


###########################################
# snapshot format:
#
# fs@zrep_host1_host2_#seq#
# fs@zrep_host1_host2_#seq#_sent
# A snapshot will be one or the other of the above.
# Once a snapshot has been successfully copied, it should be auto-renamed,
# so you can tell, without seeing the other side, whether something has
# been synced.
# After initialization, once normal operation has started, there should
# always be at least TWO snapshots:
# the latest "full", and the most recently sent incremental.
# There can also be some number of "just in case" incrementals.
#
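As a concrete illustration of the naming scheme above, a couple of tiny shell helpers could build those names. This is pure string handling, no zfs calls; the six-digit zero-padded sequence is an assumption, since the spec above only says "#seq#":

```shell
#!/bin/sh
# Illustrative helpers for the draft snapshot naming convention.
# The %06d sequence width is an assumption; the doc only specifies "#seq#".

# fs@zrep_srchost_dsthost_SEQ
zrep_snapname() {
    printf '%s@zrep_%s_%s_%06d\n' "$1" "$2" "$3" "$4"
}

# After a successful copy, the snapshot would be auto-renamed with "_sent"
zrep_sentname() {
    printf '%s_sent\n' "$1"
}

zrep_snapname pool1/prodfs host1 host2 42
# pool1/prodfs@zrep_host1_host2_000042
```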

Andrew Gabriel

Feb 22, 2012, 10:44:05 AM
In article <6909160c-7fe8-4f53...@p7g2000yqk.googlegroups.com>,
Philip Brown <ph...@bolthole.com> writes:
> I've finally gotten sick enough of what's lacking in the commercial
> space, and of the surprising dearth in the free space, to write my own
> zfs replication tool.

I would suggest joining the zfs-discuss mailing list at opensolaris.org
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
and bring this up there.

--
Andrew Gabriel
[email address is not usable -- followup in the newsgroup]

Philip Brown

Feb 22, 2012, 12:14:55 PM
On Feb 22, 7:44 am, and...@cucumber.demon.co.uk (Andrew Gabriel)
wrote:
> In article <6909160c-7fe8-4f53-99cc-28549e25e...@p7g2000yqk.googlegroups.com>,
>         Philip Brown <p...@bolthole.com> writes:
>
> > I've finally gotten sick enough of what's lacking in the commercial
> > space, and of the surprising dearth in the free space, to write my own
> > zfs replication tool.
>
> I would suggest joining the zfs-discuss mailing list at opensolaris.org
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss
> and bring this up there.
>

I don't want to join a mailing list. If they (re)enabled some kind of
web access/posting for the discussion group, or if they had a
zfs-specific forum (ie: under forums.oracle.com), I would do so.
But they don't. So I won't.

PS: if you were going to suggest it, even the "just 'subscribe' for
posting access" option is distasteful, since Oracle has been lazy and
used the default mailman archive interface, instead of using the
mailman hooks to replace it with a decent one.

John D Groenveld

Feb 22, 2012, 12:24:43 PM
In article <d00aca17-728b-4f7d...@a15g2000yqf.googlegroups.com>,
Philip Brown <ph...@bolthole.com> wrote:
>I dont want to join a mailing list. If they (re)enabled some kind of

Seems to be the place to reach Solaris and Illumos ZFS experts.

>PS: if you were going to suggest it, even the "just 'subscribe' for
>posting access" option is distasteful, since oracle has been lazy and
>used the default mailman archive interface, instead of using the
>mailman hooks to replace it with a decent one.

I found sending "subscribe" to zfs-discu...@opensolaris.org
to be simple.

John
groe...@acm.org

Andrew Gabriel

Feb 22, 2012, 12:32:16 PM
In article <d00aca17-728b-4f7d...@a15g2000yqf.googlegroups.com>,
I suggested it because I think just about everyone working on
ZFS (inside and outside Oracle) is on it, and it's very active.
Most of them won't see you here, but it's entirely your choice,
of course.

Chris Ridd

Feb 22, 2012, 12:39:46 PM
On 2012-02-22 17:32:16 +0000, Andrew Gabriel said:

> I suggested it because I think just about everyone working on
> ZFS (inside and outside Oracle) is on it, and it's very active.
> Most of them won't see you here, but it's entirely your choice,
> of course.

There was a talk given at LOSUG in July 2010 which described a
home-brew ZFS replication setup. It was pretty neat if memory serves.
It may be worth pinging the author (Luke Marsden).

<http://hub.opensolaris.org/bin/view/User+Group+losug/v%2D2010>

--
Chris

Philip Brown

Feb 22, 2012, 12:39:27 PM
On Feb 22, 9:24 am, groen...@cse.psu.edu (John D Groenveld) wrote:
> In article <d00aca17-728b-4f7d-aa8a-0a287b3a4...@a15g2000yqf.googlegroups.com>,
> Philip Brown  <p...@bolthole.com> wrote:
>
> >I dont want to join a mailing list. If they (re)enabled some kind of(...)
>
> Seems to be the place to reach Solaris and Illumos ZFS experts.

My goal isn't actually to reach "zfs experts". My goal is to reach
interested enterprise-grade sysadmins.
I've got the tech down. The tricky part is designing the best
enterprise-grade interface for admins.
You could say I'm specifically targeting people who are NOT zfs
experts, and don't want to be... they just want the replication
capability that zfs has the potential to provide.
Just like pkg-get is for people who don't want to become pkgadd
experts; they just want to install packages and get on with things.


> >PS: if you were going to suggest it, even the "just 'subscribe' for
> >posting access" option is distasteful, since oracle has been lazy and
> >used the default mailman archive interface, instead of using the
> >mailman hooks to replace it with a decent one.
>
> I found sending "subscribe" to zfs-discuss-requ...@opensolaris.org
> to be simple.

You misunderstand the gist of my objection. I don't have difficulty
subscribing. The issue is in simply and easily checking back on just
what *I* care about, while ignoring other junk I don't care about.

For example, in this newsgroup, if I don't want to read anything else,
I can just bookmark this "thread" page, and come straight back to it
whenever I feel like it.
Can't do that with the zfs-discuss archives.
First off, I don't think there's a single thread-specific summary page.
You can only bookmark message by message.
Yes, there technically is a "sort by thread" option for the monthly
overview. But (apart from showing all the other junk), the minute you
cross a month boundary, you have to go hunt up another starting point.

There are, so I'm told, many other archiver options. They are
standalone products, which is supposedly why the mailman people didn't
put much effort into improving their own. It's too bad they provided
that lackluster default excuse for an archiver, rather than just saying
in the docs, "if you want a decent archiver, go try out XYZ, from
XYZ.org".

Philip Brown

Feb 22, 2012, 12:58:34 PM
On Feb 22, 9:39 am, Chris Ridd <chrisr...@mac.com> wrote:
>
> There was a talk given at LOSUG in July 2010 which described a
> home-brew ZFS replication setup. It was pretty neat if memory serves.
> It may be worth pinging the author (Luke Marsden).
>
> <http://hub.opensolaris.org/bin/view/User+Group+losug/v%2D2010>
>


Interesting. Thanks for the link.
However,
"Luke Marsden talked to us about an open source project which builds
on ZFS and DTrace to replicate data seamlessly across a cluster."
aka "HCFS: n-redundant storage for distributed systems with ZFS"
( http://www.hybrid-cluster.com/talk/ )
(Huh. Also seems that it is no longer open source/free.)

Wow. That sounds like it goes a bit deeper than my target; all the way
into the application. It also seems like it targets LAN-based
clustering. I'm targeting WAN replication.

John D Groenveld

Feb 22, 2012, 2:00:49 PM
In article <15426163-73b1-4d4c...@p21g2000yqm.googlegroups.com>,
Philip Brown <ph...@bolthole.com> wrote:
>My goal isnt actually to reach "zfs experts". My goal is to reach
>interested enterprise-grade sysadmins.
>I've got the tech down. The tricky part is designing the best
>enterprise-grade interface for admins.
>You could say I'm specifically targetting people who are NOT zfs
>experts, and dont want to be... they just want the replication
>capability that zfs has the potential to provide.

There are sysadmins on zfs-discuss.
I think your project would be of interest to those folks.

>You misunderstand the gist of my objection. I dont have difficulty
>subscribing. The issue is in simply and easily checking back on just
>what *I* care about, while ignoring other junk I dont care about.
>
>For example, in this newsgroup, if I dont want to read anything else,
>I can just bookmark this "thread" page, and come straight back to it
>whenever I feel like it.
>Cant do that with zfs-discuss archives.
>First off, I dont think there's a single thread-specific summary page.
>You can only bookmark message by mesage.
>Yes, there technically is a "sort by thread" option for the monthly
>overview. But (apart from showing all the other junk), the minute you
>cross a month boundary, you have to go hunt up another starting point.

<URL:http://dir.gmane.org/gmane.os.solaris.opensolaris.zfs>
I haven't used Gmane's m/l to NNTP gateway, but their archives
are useful in case a m/l goes tits-up.

John
groe...@acm.org

Andrew Gabriel

Feb 22, 2012, 2:16:58 PM

Chris Ridd

Feb 22, 2012, 2:32:53 PM
On 2012-02-22 17:58:34 +0000, Philip Brown said:

> On Feb 22, 9:39 am, Chris Ridd <chrisr...@mac.com> wrote:
>>
>> There was a talk given at LOSUG in July 2010 which described a
>> home-brew ZFS replication setup. It was pretty neat if memory serves.
>> It may be worth pinging the author (Luke Marsden).
>>
>> <http://hub.opensolaris.org/bin/view/User+Group+losug/v%2D2010>
>>
>
>
> Interesting. Thanks for the link.
> however,
> "Luke Marsden talked to us about an open source project which builds
> on ZFS and DTrace to replicate data seamlessly across a cluster. "
> aka "HCFS: n-redundant storage for distributed systems with ZFS"
> ( http://www.hybrid-cluster.com/talk/ )
> (huh. also seems that it is no longer open source/free)

That's a shame, though perhaps good for Luke :-)

> Wow. That sounds like it goes a bit deeper than my target; all the way
> into the application. it also seems like it targets LAN based
> clustering. I'm targetting WAN replication.

I vaguely remember him discussing what would happen with network
splits, which is something that's more likely to happen on the WAN than
LAN.
--
Chris

Philip Brown

Feb 22, 2012, 11:56:28 PM
On Feb 22, 11:00 am, groen...@cse.psu.edu (John D Groenveld) wrote:

> <URL:http://dir.gmane.org/gmane.os.solaris.opensolaris.zfs>
> I haven't used Gmane's m/l to NNTP gateway, but their archives
> are useful in case a m/l goes tits-up.


So, I tried this tack now.
zfs-discuss is an odd sort of mailing list. I've had 3 private replies
so far, but zero on the actual "discussion list" :-}

Yay for usenet ;-)

Philip Brown

Feb 29, 2012, 1:52:52 PM
Planning and testing continues. "Tech testing" of ZFS capabilities is
going well, so I'm starting coding on individual modules at the
moment. I haven't gotten much in the way of additional feature
requests, so I'll take that to mean I'm on the right track :)

FYI, I figure this is going to be homed at

http://www.bolthole.com/solaris/zrep/

Right now, it's just txt-file docs and plans there. I'll only put up
the code when it really does something useful, safely.
Hopefully, I should have something within 7 days.

For the curious, here's a copy of "zrep.overview.txt" at the moment.
As you can see, I'm a big fan of the "KISS" philosophy.

-----------------------------------------------

This document is a highly simplified overview of what should be the
"most common use cases" for zrep.

Please note that all examples below presume that you have the
following setup:

host1 - solaris 10 update 9+
zfs pool "pool1"
root ssh trust to/from host2

host2 - solaris 10 update 9+
zfs pool "pool2"
root ssh trust to/from host1

host1 and host2 are able to "ping" and "ssh" to/from each other



* Initialization of zrep replicated filesystem "prodfs"

host1# zrep -i pool1/prodfs host2 pool2/prodfs

This will create an initial snapshot on prodfs.
It will then create a copy of "prodfs" on host2, and set
"readonly" on it there.
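Under the hood, the init step presumably boils down to something like the following zfs commands. This is only a guess at the eventual implementation; everything is echo-prefixed so it just prints, and the initial snapshot name is a placeholder:

```shell
#!/bin/sh
# Sketch of what "zrep -i pool1/prodfs host2 pool2/prodfs" might do.
# Echo-prefixed so nothing actually runs; the snapshot name is a placeholder.
fs=pool1/prodfs
rhost=host2
rfs=pool2/prodfs
snap="$fs@zrep_000000"

echo zfs snapshot "$snap"                         # initial snapshot
echo "zfs send $snap | ssh $rhost zfs recv $rfs"  # full send to host2
echo ssh "$rhost" zfs set readonly=on "$rfs"      # lock down the copy
```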


* Replication

host1# zrep -S pool1/prodfs

You can call this manually, or from a cron job, as frequently
as once a minute.
It will know from initialization where to replicate the
filesystem to, and will do so.

If you have more than one filesystem to sync, you may also use

# zrep -S all

You can safely set up a cronjob on both host1 and host2 to do
"all", and it will "do the right thing".
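A cron setup for the "run it every minute" pattern described above might look like this crontab fragment; the install path for zrep is an assumption:

```shell
# Illustrative crontab entry, identical on host1 and host2.
# /usr/local/bin/zrep is an assumed install path.
# Runs a sync of all initialized filesystems every minute.
* * * * * /usr/local/bin/zrep -S all >/dev/null 2>&1
```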

* Failover

host1# zrep failover pool1/prodfs

This will configure each side to know that the flow of data
should now be host2 -> host1, and flip readonly bits appropriately.

Running "zrep -S all" on host1 will then ignore pool1/prodfs
Running "zrep -S all" on host2 will sync
pool2/prodfs to pool1/prodfs

* Takeover

host2# zrep takeover pool2/prodfs

Same as the failover example, but the syntax required when
running on the non-active host.


* Status

hostX# zrep status

Will give a list of all filesystems the host is "master" for,
and the date of the last successfully replicated snapshot.
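A status report like the above could be derived purely from the snapshot names. Here is a sketch over canned sample data; real code would presumably read `zfs list -H -o name -t snapshot` instead, and the `_sent` suffix and sequence numbers follow the draft naming spec:

```shell
#!/bin/sh
# Sketch: deriving "last replicated" status from snapshot names alone.
# Canned sample data stands in for real `zfs list -H -o name -t snapshot`
# output; the naming follows the draft fs@zrep_src_dst_SEQ[_sent] spec.
sample='pool1/prodfs@zrep_host1_host2_000001_sent
pool1/prodfs@zrep_host1_host2_000002
pool1/otherfs@zrep_host1_host2_000007_sent'

# Keep only _sent (successfully replicated) snapshots, split fs and seq,
# then report the highest sequence per filesystem.
echo "$sample" |
    sed -n 's/@zrep_.*_\([0-9]\{6\}\)_sent$/ \1/p' |
    sort -k1,1 -k2,2nr |
    awk '!seen[$1]++ { printf "%s: last replicated seq %s\n", $1, $2 }'
# pool1/otherfs: last replicated seq 000007
# pool1/prodfs: last replicated seq 000001
```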

Philip Brown

Mar 2, 2012, 9:18:28 PM
On Wednesday, February 29, 2012 10:52:52 AM UTC-8, Philip Brown wrote:
>
> http://www.bolthole.com/solaris/zrep/
>


I haven't gotten much feedback from the last update, and am starting coding now.
I probably won't get much feedback from a post on a Friday :D but I wanted to mention, for anyone who cares, that I have changed my mind about a major design feature:

I think I shall no longer put the src and dest hostnames in the snapshot name.
The prior spec was

@zrep_host1_host2_#seq#

Now it is just

@zrep_#seq#  (plus possible trailers)

To find the destination, you would now have to do a

zfs get zrep:dest-host fs

This means that it might be more difficult to make a multi-destination stream down the road.
(Currently, the implementation targets "one filesystem, one replication destination", but it was semi-easy to allow multiple destinations in a later design.)

If you feel strongly about having multi-destination replication in this free tool, please speak up to me Very Soon!
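The revised name/property split might be sketched like this. The zrep:dest-host property comes from the post above; the %06d sequence width is an assumption, and the lookup is echo-wrapped so it only prints the command:

```shell
#!/bin/sh
# Sketch of the revised scheme: sequence-only snapshot names, with the
# destination read from a zfs user property instead of parsed from the name.
# The %06d width is an assumption; zfs is echo-prefixed so nothing runs.

new_snapname() {
    printf '%s@zrep_%06d\n' "$1" "$2"
}

# Destination lookup becomes a property read rather than name parsing:
get_dest() {
    echo zfs get -H -o value zrep:dest-host "$1"
}

new_snapname pool1/prodfs 7
# pool1/prodfs@zrep_000007
get_dest pool1/prodfs
```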

Philip Brown

Mar 8, 2012, 8:19:34 PM
FYI, I have a very alpha version of "zrep" in my possession now.
It currently does only very basic "init, sync, list, status" commands.

I don't like releasing it to the general public in its current state, but if anyone is interested in being an "early adopter" and will give me feedback on it, please email me directly.

0 new messages