
zfs pool mirror /unmirror


Philip Brown

Feb 15, 2012, 1:53:08 PM
Maybe this belongs more on the "zfs discussion" mailing list, but I've
always preferred USENET :)

I have a bit of a hairy technical challenge to do with remote site
replication.
Warning: the description is a LITTLE simplified.. but not much :)

Site A: two disk arrays of equal size X, and a solaris 10 server
Site B: one disk array of size X, and a solaris 10 server

We have automatic remote replication capabilities, that asynchronously
(but near real time) copy changes from ONE site A array, to the site B
array.
We currently have a zfs pool on one of the site A arrays. It gets
replicated very nicely to the site B array.
All that stuff is configured happily, and we're happy with failover,
etc, etc. yay happy happy.

That being said.. we'd really like to use a MIRRORED zpool at site A.
There's two problems with that though.
1. We're only licensed for 'X' size of data replication. not 2*X
2. the remote side only has a disk array of size X. not 2*X

more money for extra disk array and replication license is not
forthcoming. that's off the table.
but I was wondering if there is a way to (somewhat) cleanly handle
splitting the mirroring of a zpool.

If anything, I figure that importing "one half" of the pool at site B,
should be relatively easy... it's the "fail back" side that may be
a bit hairy.
"zpool online cxxxxx" looks promising. But does anyone have any
"gotchas" we should be aware of?

Cydrome Leader

Feb 15, 2012, 2:01:25 PM
Philip Brown <ph...@bolthole.com> wrote:
> Maybe this belongs more on the "zfs discussion" mailing list, but I've
> always preferred USENET :)
>
> I have a bit of a hairy technical challenge to do with remote site
> replication.
> Warning: the description is a LITTLE simplified.. but not much :)
>
> Site A: two disk arrays of equal size X, and a solaris 10 server
> Site B: one disk array of size X, and a solaris 10 server
>
> We have automatic remote replication capabilities, that asynchronously
> (but near real time) copy changes from ONE site A array, to the site B
> array.
> We currently have a zfs pool on one of the site A arrays. It gets
> replicated very nicely to the site B array.
> All that stuff is configured happily, and we're happy with failover,
> etc, etc. yay happy happy.
>
> That being said.. we'd really like to use a MIRRORED zpool at site A.
> There's two problems with that though.
> 1. We're only licensed for 'X' size of data replication. not 2*X

so have solaris replicate the data at site A, not the SAN.

> 2. the remote side only has a disk array of size X. not 2*X
>
> more money for extra disk array and replication license is not
> forthcoming. that's off the table.
> but I was wondering if there is a way to (somewhat) cleanly handle
> splitting the mirroring of a zpool.

not sure about this part: ZFS is not a complete and mature volume
manager/filesystem yet.


Philip Brown

Feb 15, 2012, 2:09:57 PM
On Feb 15, 11:01 am, Cydrome Leader <prese...@MUNGEpanix.com> wrote:
> Philip Brown <p...@bolthole.com> wrote:
> > [snip]
> > 1. We're only licensed for 'X' size of data replication. not 2*X
>
> so have solaris replicate the data at site A, not the SAN.

Solaris does not have a nice clean prepackaged option for "replicate
data in near real time, continuously" that I am aware of.
zfs snapshot/zfs send is a "close but no cigar" alternative.
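
For the record, a rough sketch of what that alternative looks like,
assuming a hypothetical pool "tank", an earlier snapshot @prev that
already exists on both sides, and a site B host "siteb" reachable over
ssh; you have to rerun this in a loop, which is exactly where it falls
short of continuous replication:

# zfs snapshot tank@next
# zfs send -i tank@prev tank@next | ssh siteb zfs receive -F tank
# zfs destroy tank@prev ; zfs rename tank@next tank@prev
# ssh siteb "zfs destroy tank@prev ; zfs rename tank@next tank@prev"

Each pass ships only the blocks changed since @prev, but anything
written between passes is exposed, hence "close but no cigar".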

If Oracle put more effort into expanding its capabilities to offer
something more like the old Veritas Volume Replicator type of thing,
then it would be more in the running.

cindy

Feb 15, 2012, 3:02:29 PM
I'd like you to use a mirrored pool too. You might consider zpool split,
a way to split a mirrored pool so that the pool that is split off has
identical contents to the original pool. Sort of like (remote) hardware
replication, only really easy and cheap (free).

http://docs.oracle.com/cd/E23823_01/html/819-5461/gayrd.html#gjooc

See the example below.

Thanks,

Cindy

- Consider that the newly created pool could be imported on site A or
site B.
- Consider that the contents on beta could be copied, beta destroyed,
the disk added back to mothership, and the process repeated on a
weekly basis.

# zpool create mothership c1t1d0
# cp /usr/dict/words /mothership/file.1
# zpool attach mothership c1t1d0 c1t2d0
# cp /usr/dict/words /mothership/file.2
# zpool split mothership beta
# zpool import beta
# zpool list mothership beta
NAME         SIZE  ALLOC   FREE  CAP  HEALTH  ALTROOT
beta          68G   672K  68.0G   0%  ONLINE  -
mothership    68G   678K  68.0G   0%  ONLINE  -
# tail /mothership/file.1
zoology
zoom
Zorn
Zoroaster
Zoroastrian
zounds
z's
zucchini
Zurich
zygote
# tail /beta/file.2
zoology
zoom
Zorn
Zoroaster
Zoroastrian
zounds
z's
zucchini
Zurich
zygote
#

Ron

Feb 15, 2012, 3:11:18 PM
Cindy, why split the pool? The OP is worried about needing an
additional license for the (supposed) additional storage. A mirror
does not increase the amount of storage; it only adds data safety.

--ron

Philip Brown

Feb 15, 2012, 3:32:06 PM
On Feb 15, 12:02 pm, cindy <cindy.swearin...@oracle.com> wrote:
>
>
> > > Philip Brown <p...@bolthole.com> wrote:
>
> > Solaris does not have a nice clean prepackaged option for "replicate
> > data in near real time, continuously" that I am aware of.
> > zfs snapshot/zfs send is a "close but no cigar" alternative.
>
> > If Oracle put more effort into expanding its capabilities to offer
> > something more like the old Veritas Volume  Replicator type of thing,
> > then it would be more in the running.
>
> I'd like you to use a mirrored pool too. You might consider zpool split,
> a way to split a mirrored pool so that the pool that is split off has
> identical contents to the original pool. Sort of like (remote) hardware
> replication, only really easy and cheap (free).

I don't understand how that would meet our goals.
It doesn't sound like a complete replacement for our ongoing
near-real-time replication appliance, so that's out.

And last I heard, zpool split is a one-time operation, so I'm not sure
how that qualifies as "remote hardware replication" either.
One-time replication doesn't cut it.

For what it's worth, my experiments so far seem to indicate that just
blindly replicating one side of the local mirror to the remote DR
site (site B) would work well.
ZFS doesn't complain much about missing one side of the mirror; it just
imports it with a warning.

And if (after zfs export) it is replicated back to site A and
imported... it notices "hey, this stuff is out of sync, I should make
it happy again" and automatically kicks off a resilvering of the
second disk. Which completes fairly quickly.
(Slight update: if I attempt to import both in a single operation, I
also have to use zpool clear to get the resilvering to kick in.)
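
For concreteness, the sequence I'm describing, with a hypothetical
pool name "tank"; the LUN replication between sites happens outside
ZFS:

# zpool import -f tank      <- at site B; imports DEGRADED, warns about the missing half
# zpool export tank         <- at site B, to quiesce before replicating back
# zpool import tank         <- at site A, once the LUN has been copied back
# zpool clear tank          <- clear the stale errors so the resilver kicks in
# zpool status tank         <- should show the second disk resilvering

The pool name is made up, and -f is only needed if import complains
that the pool was in use elsewhere.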

The only issue so far is that my testing has only used one machine
(with manual disabling of the second disk).
I'm now wondering what happens if the machine notices that, while the
disks are still in the "same zfs pool", the most recently modified
one was last used on a different machine.

Ian Collins

Feb 15, 2012, 3:55:27 PM
On 02/16/12 07:53 AM, Philip Brown wrote:
> [snip]
>
> That being said.. we'd really like to use a MIRRORED zpool at site A.
> There's two problems with that though.
> 1. We're only licensed for 'X' size of data replication. not 2*X
> 2. the remote side only has a disk array of size X. not 2*X

That doesn't make sense; a mirror does not increase the size of a pool,
so you will still have X size of data.

> If anything, I figure that importing "one half" of the pool at site B,
> should be relatively easy... it's the "fail back" side that may be
> a bit hairy.
> "zpool online cxxxxx" looks promising. But does anyone have any
> "gotchas" we should be aware of?

How fat is your pipe between sites in relation to your data churn? If
you have enough bandwidth, you could export an iSCSI volume from your
remote site and mirror that with a local iSCSI volume.
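
For what it's worth, a sketch of that on Solaris 10 using the old
shareiscsi property; every name and address here is made up, and
whether the latency is tolerable depends on the distance between
sites:

(on the remote host)
# zfs create -V 100g tank/lun0
# zfs set shareiscsi=on tank/lun0

(on the local host)
# iscsiadm add discovery-address 192.168.10.50:3260
# iscsiadm modify discovery --sendtargets enable
# devfsadm -i iscsi
# zpool create dpool mirror c1t0d0 c2t0d0

where c2t0d0 stands in for whatever device name the iSCSI LUN shows
up as locally.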

--
Ian Collins

cindy

Feb 15, 2012, 4:19:25 PM
What I illustrated is flexibility if he doesn't want to pay any
licensing fees. He can mirror the pool, split it to replicate at
another site, and then re-attach the mirror. If the pool is only split
once a week to be copied, it's mirrored the remaining time.

cs

Philip Brown

Feb 15, 2012, 4:30:19 PM
On Feb 15, 1:19 pm, cindy <cindy.swearin...@oracle.com> wrote:
>
> What I illustrated is flexibility if he doesn't want to pay any
> licensing fees. He can mirror the pool, split it to replicate at
> another site, and then re-attach the mirror. If the pool is only split
> once a week to be copied, it's mirrored the remaining time.
>

Seems you are still misunderstanding the setup.
The current situation does continuous, albeit slightly asynchronous,
mirroring.
Not "once a week mirroring". That's not mirroring, or even
replication. That's once-a-week offsite backup.

To reply to Ian: We sort of have enough bandwidth, but the latency for
iSCSI would kill us. It's in another state.
Also:

> That doesn't make sense; a mirror does not increase the size of a pool,
> so you will still have X size of data.

no, it does increase the size of the pool.
It does not increase the size of the actual data being recorded. But
that's a different issue.
Our replication solution is a "raw bits on disk" replication scheme,
licensed accordingly.

Ian Collins

Feb 15, 2012, 4:41:49 PM
On 02/16/12 10:30 AM, Philip Brown wrote:
> On Feb 15, 1:19 pm, cindy <cindy.swearin...@oracle.com> wrote:
>> [snip]
>
> Seems you are still misunderstanding the setup.
> The current situation does continuous, albeit slightly asynchronous,
> mirroring.
> Not "once a week mirroring". That's not mirroring, or even
> replication. That's once-a-week offsite backup.
>
> To reply to Ian: We sort of have enough bandwidth, but the latency for
> iSCSI would kill us. It's in another state.

I see.

Could you set up an iSCSI mirror between the remote site and your second
local array and replicate locally to that pool?

> Also:
>
>> That doesn't make sense; a mirror does not increase the size of a pool,
>> so you will still have X size of data.
>
> no, it does increase the size of the pool.
> It does not increase the size of the actual data being recorded. But
> that's a different issue.
> Our replication solution is a "raw bits on disk" replication scheme,
> licensed accordingly.

Ouch!

--
Ian Collins

cindy

Feb 15, 2012, 4:45:49 PM
Sounds like a zpool split request was in here somewhere:

> but I was wondering if there is a way to (somewhat) cleanly handle
> splitting the mirroring of a zpool.

Yes, you can do a forced import of the device on site B and then it
will resync to the mirror (on site A), but this isn't the cleanest
solution, nor is manually disabling the disk. Something could go wrong
eventually. Using zpool split is the cleanest way to do this. You can
do the split over and over, as I suggested.
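
Using the mothership/beta names from my earlier example, the weekly
cycle would look roughly like this (device names hypothetical):

# zpool split mothership beta            <- detach one half as its own pool
# zpool import beta
  ...copy the contents of beta off to site B...
# zpool destroy beta                     <- free the disk again
# zpool attach mothership c1t1d0 c1t2d0  <- re-attach; ZFS resilvers the disk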

cs

Cydrome Leader

Feb 15, 2012, 5:04:26 PM
Philip Brown <ph...@bolthole.com> wrote:
> On Feb 15, 11:01 am, Cydrome Leader <prese...@MUNGEpanix.com> wrote:
>> [snip]
>> so have solaris replicate the data at site A, not the SAN.
>
> Solaris does not have a nice clean prepackaged option for "replicate
> data in near real time, continuously", that I am aware of.
> zfs snapshot/zfs send is a "close but no cigar" alternative.

I thought you could write mirrors with ZFS. If not, you can mirror data
with DiskSuite and have ZFS write to a DiskSuite volume. It's all a hack
either way.

> If Oracle put more effort into expanding its capabilities to offer
> something more like the old Veritas Volume Replicator type of thing,
> then it would be more in the running.

oracle doesn't give a damn about what users want at this point. I'm not
sure why this route was picked, but that's what they've chosen.

Cydrome Leader

Feb 15, 2012, 5:21:03 PM
It sounds like the OP wants what we do, just with ZFS.

Our site A must be available at all times. There's tons of mirroring, but
on the SANs, not with the volume manager.

we replicate this data async using the SANs themselves. Hosts in A have no
idea all their block level writes are being replicated over metro ethernet
to site B. Obviously, there's a big license fee for using this feature.

Over at site B we have nearly realtime copies of data from site A being
written to raw disks. If site A blows up or burns down, we just mount the
LUNs in site B and we're back in business with only the loss of up to
a few seconds of data.

Here's the cool part. Veritas can mount anything in site B with no
problems. It's even aware that they are replicas of original data. I'm not
sure how it knows, but it does. It's a journaled filesystem, so if there
was some loss of writes, they roll back to a sane state and no fsck is
needed. You can't fsck ZFS, and if it feels the data is corrupt, that's it,
game over.

It's possible to get more crazy over at site A and then use plexes so
Veritas itself can mirror data across multiple SANs. It supports this, and
you can break and resync these anytime. It has snapshots too, so you can
mirror a filesystem from a point in time if you want, which makes sense
for local backups.

Philip Brown

Feb 15, 2012, 5:43:12 PM
On Feb 15, 2:21 pm, Cydrome Leader <prese...@MUNGEpanix.com> wrote:
Yup, exactly.


>
> Here's the cool part. Veritas can mount anything in site B with no
> problems. It's even aware that they are replicas of original data. I'm not
> sure how it knows, but it does.

That's interesting.

ZFS is *not* aware of replication.
All three of our replicated copies (yes, 3) have the same zfs pool ID.
Seems like, as far as zfs is concerned, "same pool ID == same pool",
no ifs, ands, or buts.

That's okay for us, though.
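
As an aside, the identity ZFS goes by lives in the on-disk label,
which you can inspect directly; device name hypothetical:

# zdb -l /dev/dsk/c1t2d0s0

The label dump includes pool_guid (the pool ID in question) along with
the hostid and hostname of the machine that last imported the pool,
which is why zpool import demands -f when another host appears to own
it.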


> It's a journaled filesystem, so if there
> was some loss of writes, they roll back to a sane state and no fsck is
> needed. You can't fsck ZFS, and if it feels the data is corrupt, that's it,
> game over.

Contrariwise, it has other advantages. For example, in my testing of
impolitely yanking out one side of a ZFS mirror, then continuing to
write to the functioning side... some multiple gigabytes of writes
later, if you re-power the mirror, ZFS detects that the disk 'should'
belong to the pool, and oh by the way it was previously up to date
until (this) point in time, so hey, let's do a resync...
and it resyncs ONLY the data that is "out of date".
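
A hypothetical transcript of that reconnect, with invented names and
timings:

# zpool online tank c1t2d0
# zpool status tank
  ...
  scrub: resilver completed after 0h3m with 0 errors on Wed Feb 15 14:30:11 2012
  ...

Only the blocks written while the disk was absent get copied, which is
why minutes suffice where a full remirror would take hours.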


> It's possible to get more crazy over at site A and then use plexes so
> Veritas itself can mirror data across multiple SANs.

Yeah, but if you have high latency, aka 1000 miles between sites, you
can't just use Veritas Volume Manager any more (well, if your app has a
low-latency response requirement, anyway); you need to use Veritas
Volume Replicator.
Which is why we're doing what we're doing.


> It supports this, and
> you can break and resync these anytime. It has snapshots too, so you can
> mirror a filesystem from a point in time if you want, which makes sense
> for local backups.

What Cindy was saying about pool split makes sense for that same
purpose of "local backups". It would be nice for isolating backup
I/O from the production usage, but regular zfs snapshots work pretty
well for backups also.

Ian Collins

Feb 15, 2012, 6:13:47 PM
On 02/16/12 11:21 AM, Cydrome Leader wrote:
> Ian Collins <ian-...@hotmail.com> wrote:
>>
>> How fat is your pipe between sites in relation to your data churn? If
>> you have enough bandwidth, you could export an iSCSI volume from your
>> remote site and mirror that with a local iSCSI volume.
>
> It sounds like the OP wants what we do, just with ZFS.
>
> Our site A must be available at all times. There's tons of mirroring, but
> on the SANs, not with the volume manager.
>
> we replicate this data async using the SANs themselves. Hosts in A have no
> idea all their block level writes are being replicated over metro ethernet
> to site B. Obviously, there's a big license fee for using this feature.

There's always a gotcha!

> Over at site B we have nearly realtime copies of data from site A being
> written to raw disks. If site A blows up or burns down, we just mount the
> LUNs in site B and we're back in business with only the loss of up to
> a few seconds of data.
>
> Here's the cool part. Veritas can mount anything in site B with no
> problems. It's even aware that they are replicas of original data. I'm not
> sure how it knows, but it does. It's a journaled filesystem, so if there
> was some loss of writes, they roll back to a sane state and no fsck is
> needed. You can't fsck ZFS, and if it feels the data is corrupt, that's it,
> game over.

ZFS is much improved in that regard. You can roll back the last
transactions on pool import to get to a usable state. Not foolproof,
but neither is fsck.
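
That's the recovery mode on zpool import, assuming a pool version
recent enough to have it; a quick sketch with a hypothetical pool
name:

# zpool import -nF tank     <- dry run: report what a rewind would discard
# zpool import -F tank      <- discard the last few transactions and import

The -n flag is only meaningful together with -F.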

Don't forget ZFS will know the data is corrupt; Veritas may not...

--
Ian Collins