
HAST + ZFS + NFS + CARP


Julien Cigar

Jun 30, 2016, 11:04:16 AM
Hello,

I'm still in the process of setting up redundant low-cost storage for
our (small, ~30 people) team here.

I read quite a lot of articles/documentation/etc. and I plan to use HAST
with ZFS for the storage, CARP for the failover and the "good old NFS"
to mount the shares on the clients.

The hardware is 2xHP Proliant DL20 boxes with 2 dedicated disks for the
shared storage.

Assuming the following configuration:
- MASTER is the active node and BACKUP is the standby node.
- two disks in each machine: ada0 and ada1.
- two interfaces in each machine: em0 and em1
- em0 is the primary interface (with CARP setup)
- em1 is dedicated to the HAST traffic (crossover cable)
- FreeBSD is properly installed in each machine.
- a HAST resource "disk0" for ada0p2.
- a HAST resource "disk1" for ada1p2.
- a zpool create zhast mirror /dev/hast/disk0 /dev/hast/disk1 is created
on MASTER
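
For reference, a minimal config sketch matching the layout above. The
hostnames "master"/"backup", the em1 crossover addresses (172.16.0.x) and the
CARP VHID/password are only examples, not part of the actual setup:

/etc/hast.conf (identical on both nodes):

resource disk0 {
        on master {
                local /dev/ada0p2
                remote 172.16.0.2
        }
        on backup {
                local /dev/ada0p2
                remote 172.16.0.1
        }
}

resource disk1 {
        on master {
                local /dev/ada1p2
                remote 172.16.0.2
        }
        on backup {
                local /dev/ada1p2
                remote 172.16.0.1
        }
}

/etc/rc.conf on MASTER (BACKUP gets a higher advskew), plus carp_load="YES"
in /boot/loader.conf:

ifconfig_em0="inet 192.168.1.11/24"
ifconfig_em0_alias0="inet vhid 1 advskew 0 pass mypassword alias 192.168.1.10/32"
ifconfig_em1="inet 172.16.0.1/24"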

A couple of questions I am still wondering about:
- If a disk dies on the MASTER I guess that zpool will not see it and
will transparently use the one on BACKUP through the HAST resource..
Is it a problem? Could this lead to some corruption? At this stage the
common sense would be to replace the disk quickly, but imagine the
worst case scenario where ada1 on MASTER dies: zpool will not see it
and will transparently use the one from the BACKUP node (through the
"disk1" HAST resource); later ada0 on MASTER dies, zpool will not
see it and will transparently use the one from the BACKUP node
(through the "disk0" HAST resource). At this point on MASTER the two
disks are broken but the pool is still considered healthy ... What if
after that we unplug the em0 network cable on BACKUP? Storage is
down..
- Under heavy I/O the MASTER box suddenly dies (for some reason);
thanks to CARP the BACKUP node will switch from standby -> active and
execute the failover script, which does some "hastctl role primary" for
the resources and a zpool import (a rough sketch of such a script
follows after this list). I wondered if there are any situations where
the pool couldn't be imported (= data corruption)? For example what if
the pool hasn't been exported on the MASTER before it dies?
- Is it a problem if the NFS daemons are started at boot on the standby
node, or should they only be started in the failover script? What
about stale files and active connections on the clients?
- A catastrophic power failure occurs and MASTER and BACKUP are suddenly
powered down. Later the power returns; is it possible that some
problems occur (split-brain scenario?) regarding the order in which the
two machines boot up?
- Other things I have not thought of?
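
For illustration, a rough, untested sketch of what such a failover script
could do (resource/pool names as above; error handling and sanity checks
omitted):

#!/bin/sh
# sketch: run on the node that has just become CARP MASTER
hastctl role primary disk0
hastctl role primary disk1
sleep 2                          # give /dev/hast/disk{0,1} time to appear
zpool import -f zhast
service mountd onestart
service nfsd onestart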

Thanks!
Julien

--
Julien Cigar
Belgian Biodiversity Platform (http://www.biodiversity.be)
PGP fingerprint: EEF9 F697 4B68 D275 7B11 6A25 B2BB 3710 A204 23C0
No trees were killed in the creation of this message.
However, many electrons were terribly inconvenienced.

InterNetX - Juergen Gotteswinter

Jun 30, 2016, 11:24:27 AM


On 30.06.2016 at 16:45, Julien Cigar wrote:
> Hello,
>
> I'm always in the process of setting a redundant low-cost storage for
> our (small, ~30 people) team here.
>
> I read quite a lot of articles/documentations/etc and I plan to use HAST
> with ZFS for the storage, CARP for the failover and the "good old NFS"
> to mount the shares on the clients.
>
> The hardware is 2xHP Proliant DL20 boxes with 2 dedicated disks for the
> shared storage.
>
> Assuming the following configuration:
> - MASTER is the active node and BACKUP is the standby node.
> - two disks in each machine: ada0 and ada1.
> - two interfaces in each machine: em0 and em1
> - em0 is the primary interface (with CARP setup)
> - em1 is dedicated to the HAST traffic (crossover cable)
> - FreeBSD is properly installed in each machine.
> - a HAST resource "disk0" for ada0p2.
> - a HAST resource "disk1" for ada1p2.
> - a zpool create zhast mirror /dev/hast/disk0 /dev/hast/disk1 is created
> on MASTER
>
> A couple of questions I am still wondering:
> - If a disk dies on the MASTER I guess that zpool will not see it and
> will transparently use the one on BACKUP through the HAST ressource..

That's right; as long as writes to $anything have been successful, HAST is
happy and won't start whining.

> is it a problem?

IMHO yes, at least from a management point of view.

> could this lead to some corruption?

Probably; I've never heard of anyone who has used that in production for a
long time.

> At this stage the
> common sense would be to replace the disk quickly, but imagine the
> worst case scenario where ada1 on MASTER dies, zpool will not see it
> and will transparently use the one from the BACKUP node (through the
> "disk1" HAST ressource), later ada0 on MASTER dies, zpool will not
> see it and will transparently use the one from the BACKUP node
> (through the "disk0" HAST ressource). At this point on MASTER the two
> disks are broken but the pool is still considered healthy ... What if
> after that we unplug the em0 network cable on BACKUP? Storage is
> down..
> - Under heavy I/O the MASTER box suddently dies (for some reasons),
> thanks to CARP the BACKUP node will switch from standy -> active and
> execute the failover script which does some "hastctl role primary" for
> the ressources and a zpool import. I wondered if there are any
> situations where the pool couldn't be imported (= data corruption)?
> For example what if the pool hasn't been exported on the MASTER before
> it dies?
> - Is it a problem if the NFS daemons are started at boot on the standby
> node, or should they only be started in the failover script? What
> about stale files and active connections on the clients?

Sometimes stale mounts recover, sometimes not; sometimes clients even need
reboots.

> - A catastrophic power failure occur and MASTER and BACKUP are suddently
> powered down. Later the power returns, is it possible that some
> problem occur (split-brain scenario ?) regarding the order in which the

sure, you need an exact procedure to recover

> two machines boot up?

best practice should be to keep everything down after boot

> - Other things I have not thought?
>



> Thanks!
> Julien
>


IMHO:

leave HAST where it is and go for ZFS replication. It will save your butt
sooner or later if you avoid this fragile combination.

Ben RUBSON

Jun 30, 2016, 11:29:04 AM
I was also replying, and finishing with this:
Why don't you set up your slave as an iSCSI target and simply do ZFS mirroring?
ZFS would then know as soon as a disk is failing.
And if the master fails, you only have to import the pool (-f certainly, in case of a master power failure) on the slave.

Ben

Julien Cigar

Jun 30, 2016, 11:30:50 AM
Do you mean a $> zfs snapshot followed by a $> zfs send ... | ssh zfs
receive ... ?

InterNetX - Juergen Gotteswinter

Jun 30, 2016, 11:34:17 AM

>> imho:
>>
>> leave hast where it is, go for zfs replication. will save your butt,
>> sooner or later if you avoid this fragile combination
>
> Do you mean a $> zfs snapshot followed by a $> zfs send ... | ssh zfs
> receive ... ?
>

Exactly. Super simple, super robust, just works without hassle.
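
Roughly something like this (dataset and host names are placeholders):

# initial full copy to the standby box
zfs snapshot tank/share@base
zfs send tank/share@base | ssh backup zfs receive tank/share

# afterwards, periodic incremental updates
zfs snapshot tank/share@now
zfs send -i @base tank/share@now | ssh backup zfs receive -F tank/share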

InterNetX - Juergen Gotteswinter

Jun 30, 2016, 11:35:18 AM


On 30.06.2016 at 17:33, InterNetX - Juergen Gotteswinter wrote:
>
>>> imho:
>>>
>>> leave hast where it is, go for zfs replication. will save your butt,
>>> sooner or later if you avoid this fragile combination
>>
>> Do you mean a $> zfs snapshot followed by a $> zfs send ... | ssh zfs
>> receive ... ?
>>
>
> exactly, super simple, super robust, just works without hassle
>

Check out zrep, it's a solid script for replication and failover (it keeps the
replication partner read-only and manages the switch to read-write).

Matt Churchyard via freebsd-fs

Jun 30, 2016, 11:36:07 AM
Happy to be corrected, but last time I looked at this, the NFS filesystem ID was likely to be different on both systems (and cannot be set like on Linux), and so the mounts would be useless on the clients after failover. You'd need to remount the NFS filesystem on the clients.

> two machines boot up?

>best practice should be to keep everything down after boot

> - Other things I have not thought?
>



> Thanks!
> Julien
>


>imho:

>leave hast where it is, go for zfs replication. will save your butt, sooner or later if you avoid this fragile combination

Personally I agree. This sort of functionality is incredibly difficult to get right and I wouldn't want to run anything critical relying on a few HAST scripts I'd put together manually.

Matt

Julien Cigar

Jun 30, 2016, 11:38:09 AM
On Thu, Jun 30, 2016 at 05:28:41PM +0200, Ben RUBSON wrote:
>
> > On 30 Jun 2016, at 17:14, InterNetX - Juergen Gotteswinter <j...@internetx.com> wrote:
> >
> >
> >
> >> two machines boot up?
> >
> > best practice should be to keep everything down after boot
> >
> >> - Other things I have not thought?
> >>
> >
> >
> >
> >> Thanks!
> >> Julien
> >>
> >
> >
> > imho:
> >
> > leave hast where it is, go for zfs replication. will save your butt,
> > sooner or later if you avoid this fragile combination
>
> I was also replying, and finishing by this :
> Why don't you set your slave as an iSCSI target and simply do ZFS mirroring ?

Yes, that's another option, so a zpool with two mirrors (local +
exported iSCSI)?


Ben RUBSON

Jun 30, 2016, 11:42:29 AM
Yes, you would then have a real-time replication solution (like HAST), as opposed to ZFS send/receive, which is not.
Depends on what you need :)

Chris Watson

Jun 30, 2016, 12:32:39 PM


Sent from my iPhone 5

>
>>
>> Yes that's another option, so a zpool with two mirrors (local +
>> exported iSCSI) ?
>
> Yes, you would then have a real time replication solution (as HAST), compared to ZFS send/receive which is not.
> Depends on what you need :)
>
>>
>>> ZFS would then know as soon as a disk is failing.

So as an aside, but related, for those watching this from the peanut gallery and for the benefit of the OP perhaps those that run with this setup might give some best practices and tips here in this thread on making this a good reliable setup. I can see someone reading this thread and tossing two crappy Ethernet cards in a box and then complaining it doesn't work well.

Perhaps those doing this could list recommended NICs so write performance of the slave over the network isn't slow as sin, or the relevant parts of their config scripts to get this running well, any sysctls set to boost performance, gotchas you've seen using this setup, and if a mirror failed whether you were happy with how it behaved during recovery. Anything that will help current and future folks stumbling across this thread I think would be a big help in getting this good idea used a little more widely.

Chris

Julien Cigar

Jun 30, 2016, 12:36:04 PM
> > Yes that's another option, so a zpool with two mirrors (local +
> > exported iSCSI) ?
>
> Yes, you would then have a real time replication solution (as HAST), compared to ZFS send/receive which is not.
> Depends on what you need :)

More a real-time replication solution in fact ... :)
Do you have any resource which summarizes all the pros and cons of HAST
vs iSCSI? I have found a lot of articles on ZFS + HAST but not that much
on ZFS + iSCSI ..


Julien Cigar

Jun 30, 2016, 2:57:22 PM
On Thu, Jun 30, 2016 at 11:32:17AM -0500, Chris Watson wrote:
>
>
> Sent from my iPhone 5
>
> >
> >>
> >> Yes that's another option, so a zpool with two mirrors (local +
> >> exported iSCSI) ?
> >
> > Yes, you would then have a real time replication solution (as HAST), compared to ZFS send/receive which is not.
> > Depends on what you need :)
> >
> >>
> >>> ZFS would then know as soon as a disk is failing.
>
> So as an aside, but related, for those watching this from the peanut gallery and for the benefit of the OP perhaps those that run with this setup might give some best practices and tips here in this thread on making this a good reliable setup. I can see someone reading this thread and tossing two crappy Ethernet cards in a box and then complaining it doesn't work well.

It would be more than welcome indeed..! I have the feeling that HAST
isn't that much used (but maybe I am wrong) and it's difficult to find
information on its reliability and concrete long-term use cases...

Also the pros vs cons of HAST vs iSCSI.


Ben RUBSON

Jun 30, 2016, 5:36:13 PM
>>> Yes that's another option, so a zpool with two mirrors (local +
>>> exported iSCSI) ?
>>
>> Yes, you would then have a real time replication solution (as HAST), compared to ZFS send/receive which is not.
>> Depends on what you need :)
>
> More a real time replication solution in fact ... :)
> Do you have any resource which resume all the pro(s) and con(s) of HAST
> vs iSCSI ? I have found a lot of article on ZFS + HAST but not that much
> with ZFS + iSCSI ..

# No resources, but some ideas:

- ZFS likes to see all the details of its underlying disks, which is possible with local disks (of course) and iSCSI disks, not with HAST.
- The iSCSI solution is simpler: you only have ZFS to manage, and your replication is done by ZFS itself, not by an additional stack (a small ctld/iscsictl sketch follows this list).
- HAST does not seem to be really maintained (I may be wrong), at least compared to DRBD, which HAST seems to be inspired by.
- You do not have to cross your fingers when you promote your slave to master ("will ZFS be happy with my HAST-replicated disks?"); ZFS mirrored the data by itself, you only have to import [-f].

- (Auto)reconnection of iSCSI may not be as simple as with HAST; iSCSI could require more administration after a disconnection. But this could easily be handled by a script.
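
To make the iSCSI idea concrete, a small sketch (target names, addresses and
device paths are only examples to adapt): the slave exports its raw partitions
with ctld, and the master attaches them with iscsictl.

/etc/ctl.conf on the slave:

portal-group pg0 {
        discovery-auth-group no-authentication
        listen 172.16.0.2
}

target iqn.2016-06.city.perdition:slave.disk0 {
        auth-group no-authentication
        portal-group pg0
        lun 0 {
                path /dev/ada0p2
        }
}

On the master (and the same for a second disk1 target):

iscsictl -A -p 172.16.0.2 -t iqn.2016-06.city.perdition:slave.disk0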

# Some "advices" based on my findings (I'm finishing my tests of such a solution) :

Write performance will suffer from network latency, but while your 2 nodes are in the same room, that should be OK.
If you are over a long distance link, you may add several ms to each write IO, which, depending on the use case, may be wrong, ZFS may also be unresponsive.
Max throughput is also more difficult to achieve over a high latency link.

You will have to choose network cards depending on the number of disks and their throughput.
For example, if you need to resilver a SATA disk (180MB/s), then a simple 1Gb interface (~120MB/s) will be a serious bottleneck.
Think about scrub too.

You will probably have to perform some network tuning (TCP window size, jumbo frames...) to reach your max bandwidth.
Trying to saturate the network link with (for example) iperf before dealing with iSCSI seems to be a good idea.

Here are some interesting sysctls so that ZFS will not hang (too long) in case of an unreachable iSCSI disk:
kern.iscsi.ping_timeout=5
kern.iscsi.iscsid_timeout=5
kern.iscsi.login_timeout=5
kern.iscsi.fail_on_disconnection=1
(adjust the 5 seconds depending on your needs / on your network quality).

Take care when you (auto)replace disks: you may replace an iSCSI disk with a local disk, which of course would work but would be wrong in terms of master/slave redundancy.
Use nice labels on your disks so that if you have a lot of disks in your pool, you quickly know which one is local and which one is remote.
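
For example, with GPT labels (names are only examples):

gpart modify -i 2 -l local-disk0 ada0    # then build the pool on /dev/gpt/local-disk0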

# send/receive pro(s):

In terms of data safety, one of the advantages of ZFS send/receive is that you have a totally different target pool, which can be interesting if ever you have a disaster with your primary pool.
As a 3rd-node solution? On another site? (as send/receive does not suffer from latency the way iSCSI would)


Julien Cigar

Jul 1, 2016, 4:59:46 AM
Thank you very much for those "advices", it is much appreciated!

I'll definitely go with iSCSI (with which I don't have that much
experience) over HAST.

Maybe a stupid question but, assuming ada{0,1} are the local disks on the
MASTER and da{0,1} the iSCSI disks exported from the SLAVE, would
you go with:

$> zpool create storage mirror /dev/ada0s1 /dev/ada1s1 mirror /dev/da0
/dev/da1

or rather:

$> zpool create storage mirror /dev/ada0s1 /dev/da0 mirror /dev/ada1s1
/dev/da1

I guess the former is better, but it's just to be sure .. (or maybe it's
better to iSCSI export a ZVOL from the SLAVE?)

Correct me if I'm wrong but, from a safety point of view this setup is
also the safest as you'll get the "fullsync" equivalent mode of HAST
(but it's also the slowest), so I can be 99.99% confident that the
pool on the SLAVE will never be corrupted, even in the case where the
MASTER suddenly dies (power outage, etc), and that a zpool import -f
storage will always work?

One last thing: this "storage" pool will be exported through NFS to the
clients, and when a failover occurs they should, in theory, not notice
it. I know that it's pretty hypothetical but I wondered if pfsync could
play a role in this area (active connections)..?

Thanks!
Julien


Ben RUBSON

Jul 1, 2016, 5:19:29 AM
No, if you lose the connection with the slave node, your pool will go offline!

> or rather:
>
> $> zpool create storage mirror /dev/ada0s1 /dev/da0 mirror /dev/ada1s1
> /dev/da1

Yes, each master disk is mirrored with a slave disk.
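
With that layout, zpool status should show something like this (device names
from your example); the pool then survives the loss of the whole slave or of
any single disk, but not both disks of the same mirror:

        NAME          STATE     READ WRITE CKSUM
        storage       ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            ada0s1    ONLINE       0     0     0
            da0       ONLINE       0     0     0
          mirror-1    ONLINE       0     0     0
            ada1s1    ONLINE       0     0     0
            da1       ONLINE       0     0     0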

> I guess the former is better, but it's just to be sure .. (or maybe it's
> better to iSCSI export a ZVOL from the SLAVE?)
>
> Correct me if I'm wrong but, from a safety point of view this setup is
> also the safest as you'll get the "fullsync" equivalent mode of HAST
> (but but it's also the slowest), so I can be 99,99% confident that the
> pool on the SLAVE will never be corrupted, even in the case where the
> MASTER suddently die (power outage, etc), and that a zpool import -f
> storage will always work?

The pool on the slave is the same as the pool on the master, as it uses the same disks :)
Only the physical host will change.
So yes, you can be confident.
There is still the case where any ZFS pool could be totally damaged (due to a bug for example).
It "should" not happen, but we never know :)
This is why I was talking about a third node / second pool made from a delayed send/receive.

> One last thing: this "storage" pool will be exported through NFS on the
> clients, and when a failover occur they should, in theory, not notice
> it. I know that it's pretty hypothetical but I wondered if pfsync could
> play a role in this area (active connections)..?

There will certainly be some small timeouts due to the failover delay.
You should make some tests to analyze NFS behaviour depending on the failover delay.

Good question regarding pfsync, I'm not so familiar with it :)



Of course, make a good POC before going with this into production.
Don't forget to test scrub, resilver, power failure, network failure...

And perhaps others may have additional comments / ideas / reservations on this topic?

InterNetX - Juergen Gotteswinter

Jul 1, 2016, 5:42:31 AM
>
> Thank you very much for those "advices", it is much appreciated!
>
> I'll definitively go with iSCSI (for which I haven't that much
> experience) over HAST.

Good luck. I'd rather cut off one of my fingers than use something like this
in production. But it's probably a quick way if one aims to find a new
opportunity ;)

>
> Maybe a stupid question but, assuming on the MASTER with ada{0,1} the
> local disks and da{0,1} the exported iSCSI disks from the SLAVE, would
> you go with:
>
> $> zpool create storage mirror /dev/ada0s1 /dev/ada1s1 mirror /dev/da0
> /dev/da1
>
> or rather:
>
> $> zpool create storage mirror /dev/ada0s1 /dev/da0 mirror /dev/ada1s1
> /dev/da1
>
> I guess the former is better, but it's just to be sure .. (or maybe it's
> better to iSCSI export a ZVOL from the SLAVE?)
>

Are you really sure you understand what you're trying to do? Even if it's
currently so, I bet in a disaster case you will be lost.


> Correct me if I'm wrong but, from a safety point of view this setup is
> also the safest as you'll get the "fullsync" equivalent mode of HAST
> (but but it's also the slowest), so I can be 99,99% confident that the
> pool on the SLAVE will never be corrupted, even in the case where the
> MASTER suddently die (power outage, etc), and that a zpool import -f
> storage will always work?

99.99%? Optimistic, very optimistic.

We are playing with recovery of a test pool which has been imported on
two nodes at the same time. It looks pretty messy.

>
> One last thing: this "storage" pool will be exported through NFS on the
> clients, and when a failover occur they should, in theory, not notice
> it. I know that it's pretty hypothetical but I wondered if pfsync could
> play a role in this area (active connections)..?
>

they will notice, and they will get stuck or worse (reboot)

Julien Cigar

Jul 1, 2016, 6:15:46 AM
On Fri, Jul 01, 2016 at 11:42:13AM +0200, InterNetX - Juergen Gotteswinter wrote:
> >
> > Thank you very much for those "advices", it is much appreciated!
> >
> > I'll definitively go with iSCSI (for which I haven't that much
> > experience) over HAST.
>
> good luck, i rather cut one of my fingers than using something like this
> in production. but its probably a quick way if one targets to find a new
> opportunity ;)

why...? I guess iSCSI is slower but should be safer than HAST, no?

>
> >
> > Maybe a stupid question but, assuming on the MASTER with ada{0,1} the
> > local disks and da{0,1} the exported iSCSI disks from the SLAVE, would
> > you go with:
> >
> > $> zpool create storage mirror /dev/ada0s1 /dev/ada1s1 mirror /dev/da0
> > /dev/da1
> >
> > or rather:
> >
> > $> zpool create storage mirror /dev/ada0s1 /dev/da0 mirror /dev/ada1s1
> > /dev/da1
> >
> > I guess the former is better, but it's just to be sure .. (or maybe it's
> > better to iSCSI export a ZVOL from the SLAVE?)
> >
>
> are you really sure you understand what you trying to do? even if its
> currently so, i bet in a desaster case you will be lost.
>
>

well this is pretty new to me, but I don't see what could be wrong with:

$> zpool create storage mirror /dev/ada0s1 /dev/da0 mirror /dev/ada1s1
/dev/da1

Let's take some use-cases:
- MASTER and SLAVE are alive, the data is "replicated" on both
nodes. As iSCSI is used, ZFS will see all the details of the
underlying disks and we can be sure that no corruption will occur
(contrary to HAST)
- SLAVE dies: correct me if I'm wrong, but the pool is still available;
fix the SLAVE, resilver and that's it ..?
- MASTER dies: CARP will notice it and SLAVE will take the VIP, the
failover script will be executed with a $> zpool import -f

> > Correct me if I'm wrong but, from a safety point of view this setup is
> > also the safest as you'll get the "fullsync" equivalent mode of HAST
> > (but but it's also the slowest), so I can be 99,99% confident that the
> > pool on the SLAVE will never be corrupted, even in the case where the
> > MASTER suddently die (power outage, etc), and that a zpool import -f
> > storage will always work?
>
> 99,99% ? optimistic, very optimistic.

the only situation where corruption could occur is some sort of network
corruption (a bug in the driver, a broken network card, etc.), or a bug in
ZFS ... but you'd have the same problem with a zfs send | ssh zfs receive

>
> we are playing with recovery of a test pool which has been imported on
> two nodes at the same time. looks pretty messy
>
> >
> > One last thing: this "storage" pool will be exported through NFS on the
> > clients, and when a failover occur they should, in theory, not notice
> > it. I know that it's pretty hypothetical but I wondered if pfsync could
> > play a role in this area (active connections)..?
> >
>
> they will notice, and they will stuck or worse (reboot)

this is something that should be properly tested I agree..


InterNetX - Juergen Gotteswinter

Jul 1, 2016, 6:18:59 AM
On 01.07.2016 at 12:15, Julien Cigar wrote:
> On Fri, Jul 01, 2016 at 11:42:13AM +0200, InterNetX - Juergen Gotteswinter wrote:
>>>
>>> Thank you very much for those "advices", it is much appreciated!
>>>
>>> I'll definitively go with iSCSI (for which I haven't that much
>>> experience) over HAST.
>>
>> good luck, i rather cut one of my fingers than using something like this
>> in production. but its probably a quick way if one targets to find a new
>> opportunity ;)
>
> why...? I guess iSCSI is slower but should be safer than HAST, no?

Do your testing, please, even with simulated short network cuts. 10-20
secs are more than enough to give you a picture of what is going to happen.

>> we are playing with recovery of a test pool which has been imported on
>> two nodes at the same time. looks pretty messy
>>
>>>
>>> One last thing: this "storage" pool will be exported through NFS on the
>>> clients, and when a failover occur they should, in theory, not notice
>>> it. I know that it's pretty hypothetical but I wondered if pfsync could
>>> play a role in this area (active connections)..?
>>>
>>
>> they will notice, and they will stuck or worse (reboot)
>
> this is something that should be properly tested I agree..
>

Do your testing, and keep your clients under load while testing. Do
writes onto the NFS mounts and then cut. You will be surprised by the
impact.

Julien Cigar

Jul 1, 2016, 6:57:57 AM
On Fri, Jul 01, 2016 at 12:18:39PM +0200, InterNetX - Juergen Gotteswinter wrote:
> On 01.07.2016 at 12:15, Julien Cigar wrote:
> > On Fri, Jul 01, 2016 at 11:42:13AM +0200, InterNetX - Juergen Gotteswinter wrote:
> >>>
> >>> Thank you very much for those "advices", it is much appreciated!
> >>>
> >>> I'll definitively go with iSCSI (for which I haven't that much
> >>> experience) over HAST.
> >>
> >> good luck, i rather cut one of my fingers than using something like this
> >> in production. but its probably a quick way if one targets to find a new
> >> opportunity ;)
> >
> > why...? I guess iSCSI is slower but should be safer than HAST, no?
>
> do your testing, please. even with simulated short network cuts. 10-20
> secs are way enaugh to give you a picture of what is going to happen

of course I'll test everything properly :) I don't have the hardware yet
so ATM I'm just looking at all the possible "candidates", and I'm
aware that redundant storage is not that easy to implement ...

but what solutions do we have? It's either CARP + ZFS + (HAST|iSCSI),
or zfs send|ssh zfs receive as you suggest (but it's
not realtime), or a distributed FS (which I avoid like the plague..)

InterNetX - Juergen Gotteswinter

Jul 1, 2016, 7:10:15 AM
On 01.07.2016 at 12:57, Julien Cigar wrote:
> On Fri, Jul 01, 2016 at 12:18:39PM +0200, InterNetX - Juergen Gotteswinter wrote:
>> On 01.07.2016 at 12:15, Julien Cigar wrote:
>>> On Fri, Jul 01, 2016 at 11:42:13AM +0200, InterNetX - Juergen Gotteswinter wrote:
>>>>>
>>>>> Thank you very much for those "advices", it is much appreciated!
>>>>>
>>>>> I'll definitively go with iSCSI (for which I haven't that much
>>>>> experience) over HAST.
>>>>
>>>> good luck, i rather cut one of my fingers than using something like this
>>>> in production. but its probably a quick way if one targets to find a new
>>>> opportunity ;)
>>>
>>> why...? I guess iSCSI is slower but should be safer than HAST, no?
>>
>> do your testing, please. even with simulated short network cuts. 10-20
>> secs are way enaugh to give you a picture of what is going to happen
>
> of course I'll test everything properly :) I don't have the hardware yet
> so ATM I'm just looking for all the possible "candidates", and I'm
> aware that a redundant storage is not that easy to implement ...
>
> but what solutions do we have? It's either CARP + ZFS + (HAST|iSCSI),
> either zfs send|ssh zfs receive as you suggest (but it's
> not realtime), either a distributed FS (which I avoid like the plague..)

ZFS send/receive can be nearly realtime.

External JBODs with cross-cabled SAS + a commercial cluster solution like
RSF-1. Anything else is a fragile construction which begs for disaster.

Miroslav Lachman

Jul 1, 2016, 7:40:27 AM
Julien Cigar wrote on 07/01/2016 12:57:

>>> why...? I guess iSCSI is slower but should be safer than HAST, no?
>>
>> do your testing, please. even with simulated short network cuts. 10-20
>> secs are way enaugh to give you a picture of what is going to happen
>
> of course I'll test everything properly :) I don't have the hardware yet
> so ATM I'm just looking for all the possible "candidates", and I'm
> aware that a redundant storage is not that easy to implement ...
>
> but what solutions do we have? It's either CARP + ZFS + (HAST|iSCSI),
> either zfs send|ssh zfs receive as you suggest (but it's
> not realtime), either a distributed FS (which I avoid like the plague..)

When disaster comes you will need to restart NFS clients in almost all
cases (with CARP + ZFS + HAST|iSCSI) and you will lose some writes too.
And if something bad happens with your mgmt scripts or network you can
end up with a corrupted ZFS pool on master and slave too - you will need
to recover from backups. For example in some split-brain scenario when
both nodes try to import the pool.

With ZFS send & receive you will lose some writes, but the chance you
will corrupt both pools is much lower than in the first case and the
setup is much simpler and runtime-error proof.

I rather prefer some downtime with safe data than a shorter downtime
with data at risk. YMMV

Miroslav Lachman

Ben RUBSON

Jul 1, 2016, 7:59:00 AM

> On 01 Jul 2016, at 13:40, Miroslav Lachman <000....@quip.cz> wrote:
>
> Julien Cigar wrote on 07/01/2016 12:57:
>
>>>> why...? I guess iSCSI is slower but should be safer than HAST, no?
>>>
>>> do your testing, please. even with simulated short network cuts. 10-20
>>> secs are way enaugh to give you a picture of what is going to happen
>>
>> of course I'll test everything properly :) I don't have the hardware yet
>> so ATM I'm just looking for all the possible "candidates", and I'm
>> aware that a redundant storage is not that easy to implement ...
>>
>> but what solutions do we have? It's either CARP + ZFS + (HAST|iSCSI),
>> either zfs send|ssh zfs receive as you suggest (but it's
>> not realtime), either a distributed FS (which I avoid like the plague..)
>
> When disaster comes you will need to restart NFS clients in almost all cases (with CARP + ZFS + HAST|iSCSI) and you will lose some writes too.
> And if something bad happens with your mgmt scripts or network you can end up with corrupted ZFS pool on master and slave too - you will need to recovery from backups. For example in some split brain scenario when both nodes will try to import pool.

Of course you must take care that both nodes do not import the pool at the same time.
For the slave to import the pool, first stop the iSCSI targets (ctld), and also put the network replication interface down, to be sure.
Then, import the pool.
Once the old master is repaired, export its pool (if still imported), make its disks iSCSI targets and give them to the old slave (promoted to master just above).
Of course it implies some meticulous administration.
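
In commands, the takeover on the slave could look roughly like this (pool,
service and interface names are examples to adapt):

# on the slave, when promoting it to master (sketch)
service ctld onestop     # stop exporting our local disks to the old master
ifconfig em1 down        # cut the replication link, to be sure
zpool import -f storage
service nfsd onestart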

> With ZFS send & receive you will lose some writes but the chance you will corrupt both pools are much lower than in the first case and the setup is much simpler and runtime error proof.

Only some?
Depending on the write throughput, won't you lose a lot of data on the target/slave?
How do you make ZFS send/receive quite realtime?
while [ 1 ] do ; snapshot ; send/receive ; delete old snapshots ; done ?

Thanks !

Miroslav Lachman

Jul 1, 2016, 8:17:23 AM
Ben RUBSON wrote on 07/01/2016 13:58:

>> With ZFS send & receive you will lose some writes but the chance you will corrupt both pools are much lower than in the first case and the setup is much simpler and runtime error proof.
>
> Only some ?
> Depending on the write throughput, won't you loose a lot of data on the target/slave ?
> How do you make ZFS send/receive quite realtime ?
> while [ 1 ] do ; snapshot ; send/receive ; delete old snapshots ; done ?

It depends on throughput and how often you send. But you need to
compare it to the HAST / iSCSI scenario. Even with this setup ZFS doesn't
write to disk immediately but in batches, delayed according to some sysctl
settings, and you will lose this amount of data in all cases + data on
clients which cannot write and must restart their NFS sessions (again,
this applies to the HAST / iSCSI scenario too).

If you ZFS send often, you may lose about 2 or 4 times more. It
depends on you whether that is too much or not.
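
As for the "while [ 1 ]" loop above, a rough sketch of what it could look like
(dataset / host names are placeholders; it assumes an initial full send already
created a common @base snapshot on both sides, and that old snapshots on the
receiver are pruned separately):

#!/bin/sh
# untested sketch: near-realtime incremental replication loop
DS=tank/share
LAST=base
while true; do
        NOW=repl-$(date +%Y%m%d%H%M%S)
        zfs snapshot ${DS}@${NOW}
        if zfs send -i ${DS}@${LAST} ${DS}@${NOW} | \
            ssh backup zfs receive -F ${DS}; then
                zfs destroy ${DS}@${LAST}
                LAST=${NOW}
        fi
        sleep 60
done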

Miroslav Lachman

Julien Cigar

Jul 1, 2016, 8:38:28 AM
On Fri, Jul 01, 2016 at 01:58:42PM +0200, Ben RUBSON wrote:
>
> > On 01 Jul 2016, at 13:40, Miroslav Lachman <000....@quip.cz> wrote:
> >
> > Julien Cigar wrote on 07/01/2016 12:57:
> >
> >>>> why...? I guess iSCSI is slower but should be safer than HAST, no?
> >>>
> >>> do your testing, please. even with simulated short network cuts. 10-20
> >>> secs are way enaugh to give you a picture of what is going to happen
> >>
> >> of course I'll test everything properly :) I don't have the hardware yet
> >> so ATM I'm just looking for all the possible "candidates", and I'm
> >> aware that a redundant storage is not that easy to implement ...
> >>
> >> but what solutions do we have? It's either CARP + ZFS + (HAST|iSCSI),
> >> either zfs send|ssh zfs receive as you suggest (but it's
> >> not realtime), either a distributed FS (which I avoid like the plague..)
> >
> > When disaster comes you will need to restart NFS clients in almost all cases (with CARP + ZFS + HAST|iSCSI) and you will lose some writes too.
> > And if something bad happens with your mgmt scripts or network you can end up with corrupted ZFS pool on master and slave too - you will need to recovery from backups. For example in some split brain scenario when both nodes will try to import pool.
>
> Of course you must take care that both nodes do not import the pool at the same time.
> For the slave to import the pool, first stop iSCSI targets (ctld), and also put network replication interface down, to be sure.
> Then, import the pool.
> Once old master repaired, export its pool (if still imported), make its disks iSCSI targets and give them the old slave (promoted master just above).
> Of course it implies some meticulous administration.

I was thinking something like this also.. and I definitely think that
the switch from the old slave (promoted master) to the "old master repaired"
MUST be done manually!

>
> > With ZFS send & receive you will lose some writes but the chance you will corrupt both pools are much lower than in the first case and the setup is much simpler and runtime error proof.

I think losing some writes is somewhat unavoidable; corruption, on the
other hand, is unacceptable.


Rick Macklem

Jul 1, 2016, 8:41:08 AM
Choosing the most recent post in the thread (Ben RUBSON):

First off, this topic is of interest to me because a pNFS server
would need a similar fail-over for a metadata server.
Having said that, I know little about this and am learning from
what I've read (ie. keep up the good work, folks;-).

I will put a few NFS related comments inline.

>
> > On 01 Jul 2016, at 13:40, Miroslav Lachman <000....@quip.cz> wrote:
> >
> > Julien Cigar wrote on 07/01/2016 12:57:
> >
> >>>> why...? I guess iSCSI is slower but should be safer than HAST, no?
> >>>
> >>> do your testing, please. even with simulated short network cuts. 10-20
> >>> secs are way enaugh to give you a picture of what is going to happen
> >>
> >> of course I'll test everything properly :) I don't have the hardware yet
> >> so ATM I'm just looking for all the possible "candidates", and I'm
> >> aware that a redundant storage is not that easy to implement ...
> >>
> >> but what solutions do we have? It's either CARP + ZFS + (HAST|iSCSI),
> >> either zfs send|ssh zfs receive as you suggest (but it's
> >> not realtime), either a distributed FS (which I avoid like the plague..)
> >
> > When disaster comes you will need to restart NFS clients in almost all
> > cases (with CARP + ZFS + HAST|iSCSI) and you will lose some writes too.
> > And if something bad happens with your mgmt scripts or network you can end
> > up with corrupted ZFS pool on master and slave too - you will need to
> > recovery from backups. For example in some split brain scenario when both
> > nodes will try to import pool.
>
What the clients need to have happen is their TCP connection fail, so they must
do a new TCP connection. At that point, they will retry any outstanding RPCs,
including ones that did UNSTABLE writes but there was no subsequent Commit RPC
done successfully.

I know nothing about CARP (have never used it), but I have a hunch it will
not do the above at the correct point in time.

In general, I would consider failing over from the NFS server to the backup
one (where the drives have been kept up to date via a ZFS mirror using iSCSI)
to be a major event. I don't think you want to do this for short outages.
(Server overload, server crash/reboot, etc.)
I think it is hard to distinguish between a slow server and a failed one and
you don't want this switchover happening unless it is really needed, imho.

When it happens, I think you want a strict ordering of events like:
- old master server shut down (off the network or at least moved to different
IP addresses so it appears off the network to the clients).
- an orderly switchover of the ZFS file system, so that the backup is now the
only system handling the file system (the paragraph just below here covers
that, I think?)
- new server (was backup) comes up with the same IP address(es) as the old one
had before the failure.
--> I am not sure if anything has to be done to speed up ARP cache invalidation
for the old server's IP->MAC mapping.
(I'm not sure what you do w.r.t. mirroring for the backup server. Ideally there
would be a way for a third server to have mirrored disks resilvered from the
backup server's. I don't know enough about ZFS to know if this could be done
with a third server, where its empty disks are set up as iSCSI mirrors of the
backup server's after the failover? I think this is also covered by the para.
below.)

> Of course you must take care that both nodes do not import the pool at the
> same time.
> For the slave to import the pool, first stop iSCSI targets (ctld), and also
> put network replication interface down, to be sure.
> Then, import the pool.
> Once old master repaired, export its pool (if still imported), make its disks
> iSCSI targets and give them the old slave (promoted master just above).
> Of course it implies some meticulous administration.
>
Yes. I don't think CARP can be trusted to do this. The problem is that there
is a time ordering w.r.t. activities on the two servers. However, you can't
assume that the two servers can communicate with each other.

The simpler/reliable way would be done manually by a sysadmin (who might also
know why and how long the master will be down for).

It might be possible to automate this with daemons on the two servers, with
something like:
- master sends regular heartbeat messages to backup.
- if master gets no NFS client activity for N seconds, it sends "shutdown
message to backup" and then does a full shutdown.

- if slave gets shutdown message from master OR doesn't see a heartbeat
message for something like 2 * N seconds, then it assumes it can take over
and does so.
--> The trick is, the backup can't start a takeover until the master really
is offline.
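
A minimal sketch of the backup-side watchdog under those assumptions (N, the
file paths and the takeover command are placeholders, and it does not solve
the "is the master really offline?" problem mentioned just above):

#!/bin/sh
# untested sketch: backup-side daemon for the heartbeat scheme above
N=10                                  # heartbeat interval used by the master
HB=/var/run/master.heartbeat          # touched whenever a heartbeat arrives
while true; do
        if [ -f /var/run/master.shutdown ]; then
                /usr/local/sbin/takeover.sh && break
        fi
        now=$(date +%s)
        last=$(stat -f %m "$HB" 2>/dev/null || echo "$now")
        if [ $((now - last)) -gt $((2 * N)) ]; then
                /usr/local/sbin/takeover.sh && break
        fi
        sleep "$N"
done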

> > With ZFS send & receive you will lose some writes but the chance you will
> > corrupt both pools are much lower than in the first case and the setup is
> > much simpler and runtime error proof.
>
> Only some ?
> Depending on the write throughput, won't you loose a lot of data on the
> target/slave ?
> How do you make ZFS send/receive quite realtime ?
> while [ 1 ] do ; snapshot ; send/receive ; delete old snapshots ; done ?
>
Well, if the NFS clients aren't buggy and the server isn't running sync=disabled,
then nothing should get lost if the clients recognize that the server has
"crashed/rebooted". This happens when the TCP connection to the server breaks.

You can't have a case where the client TCP connection to the server keeps
functioning but actually goes to the backup server and you can't have the
case where some clients are still doing NFS RPCs to the old master while
others are doing RPCs to the backup.

Good luck with it. It is an interesting challenge and worth exploring.

rick

Joe Love

Jul 1, 2016, 9:18:29 AM

> On Jul 1, 2016, at 6:09 AM, InterNetX - Juergen Gotteswinter <j...@internetx.com> wrote:

>
> On 01.07.2016 at 12:57, Julien Cigar wrote:
>> On Fri, Jul 01, 2016 at 12:18:39PM +0200, InterNetX - Juergen Gotteswinter wrote:
>>
>> of course I'll test everything properly :) I don't have the hardware yet
>> so ATM I'm just looking for all the possible "candidates", and I'm
>> aware that a redundant storage is not that easy to implement ...
>>
>> but what solutions do we have? It's either CARP + ZFS + (HAST|iSCSI),
>> either zfs send|ssh zfs receive as you suggest (but it's
>> not realtime), either a distributed FS (which I avoid like the plague..)
>
> zfs send/receive can be nearly realtime.
>
> external jbods with cross cabled sas + commercial cluster solution like
> rsf-1. anything else is a fragile construction which begs for desaster.

This sounds similar to the CTL-HA code that went in last year, for which I haven’t seen any sort of how-to. The RSF-1 stuff sounds like it has more scaling options, though. Which it probably should, given its commercial operation.

-Joe

InterNetX - Juergen Gotteswinter

Jul 1, 2016, 9:44:58 AM

On 01.07.2016 at 15:18, Joe Love wrote:
>
>> On Jul 1, 2016, at 6:09 AM, InterNetX - Juergen Gotteswinter <j...@internetx.com> wrote:
>>
>> On 01.07.2016 at 12:57, Julien Cigar wrote:
>>> On Fri, Jul 01, 2016 at 12:18:39PM +0200, InterNetX - Juergen Gotteswinter wrote:
>>>
>>> of course I'll test everything properly :) I don't have the hardware yet
>>> so ATM I'm just looking for all the possible "candidates", and I'm
>>> aware that a redundant storage is not that easy to implement ...
>>>
>>> but what solutions do we have? It's either CARP + ZFS + (HAST|iSCSI),
>>> either zfs send|ssh zfs receive as you suggest (but it's
>>> not realtime), either a distributed FS (which I avoid like the plague..)
>>
>> zfs send/receive can be nearly realtime.
>>
>> external jbods with cross cabled sas + commercial cluster solution like
>> rsf-1. anything else is a fragile construction which begs for desaster.
>
> This sounds similar to the CTL-HA code that went in last year, for which I haven’t seen any sort of how-to. The RSF-1 stuff sounds like it has more scaling options, though. Which it probably should, given its commercial operation.

RSF-1 is what pacemaker / heartbeat tries to be. Judge me for linking
whitepapers, but in this case it's not such evil marketing blah:

http://www.high-availability.com/wp-content/uploads/2013/01/RSF-1-HA-PLUGIN-ZFS-STORAGE-CLUSTER.pdf


@ Julien

It seems like you take availability really seriously, so I guess you also have
plans for how to handle network problems like dead switches, flaky
cables and so on.

Like using multiple network cards in the boxes, cross-cabling between
the hosts (RS-232 and Ethernet of course), and using proven, reliable network
switches in a stacked configuration (for example stacked Cisco 3750s). Not
to forget redundant power feeds to redundant power supplies.

If not, I would start again from scratch.

Julien Cigar

Jul 1, 2016, 10:39:39 AM
the only thing that is not redundant (yet?) is our switch, an HP ProCurve
2530-24G .. it's the next step :)


InterNetX - Juergen Gotteswinter

Jul 1, 2016, 10:42:20 AM

Aruba, okay; a quick look at the spec sheet does not seem to list a
stacking option.

What about power?

InterNetX - Juergen Gotteswinter

Jul 1, 2016, 10:44:48 AM
Don't get me wrong; what I'm trying to say is that IMHO you are trying to
reach something which looks great until something goes wrong.

Keep it simple, stupid simple, without many moving parts, and avoid
automagic voodoo wherever possible.

Ben RUBSON

Jul 1, 2016, 11:02:46 AM
I think what we're missing is something like this:
http://milek.blogspot.fr/2007/03/zfs-online-replication.html
http://www.compnect.net/?p=16461

Online replication built in ZFS would be awesome.

Julien Cigar

Jul 1, 2016, 11:04:01 AM
there is a "diesel generator" for the server room, and redundant power
supplies on the "most critical" servers (our PostgreSQL servers for example).

The router and firewall (Soekris 6501) run CARP / pfsync, same for
the load balancer (HAProxy), the local Unbound, etc.

(Everything is jailed and managed by SaltStack, so in the worst case scenario
I could always redeploy "something" (a service, webapp, etc) in minutes on
$somemachine ..)

No, the real SPOF that should be fixed quickly ATM is the shared file
storage: it's an NFS mount on a single machine (with redundant power
supply) and if the hardware dies we're in trouble (...)


Julien Cigar

Jul 1, 2016, 11:12:11 AM
On Fri, Jul 01, 2016 at 04:44:24PM +0200, InterNetX - Juergen Gotteswinter wrote:
> dont get me wrong, what i try to say is that imho you are trying to
> reach something which looks great until something goes wrong.

I agree..! :)

>
> keep it simple, stupid simple, without much moving parts and avoid
> automagic voodoo wherever possible.
>

to be honest I've always been reluctant about "automatic failover", as I
think the problem is not "how" to do it but "when".. and as Rick
said, "The simpler/reliable way would be done manually by a sysadmin"..


Gary Palmer

Jul 1, 2016, 11:47:16 AM
On Fri, Jul 01, 2016 at 05:11:47PM +0200, Julien Cigar wrote:
> On Fri, Jul 01, 2016 at 04:44:24PM +0200, InterNetX - Juergen Gotteswinter wrote:
> > dont get me wrong, what i try to say is that imho you are trying to
> > reach something which looks great until something goes wrong.
>
> I agree..! :)
>
> >
> > keep it simple, stupid simple, without much moving parts and avoid
> > automagic voodoo wherever possible.
> >
>
> to be honnest I've always been relunctant to "automatic failover", as I
> think the problem is always not "how" to do it but "when".. and as Rick
> said "The simpler/reliable way would be done manually be a sysadmin"..

I agree. They can verify that the situation needs a fail over much better
than any script. In a previous job I heard of a setup where the cluster
manager software on the standby node decided that the active node was
down so it did a force takeover of the disks. Since the active node was
still up it somehow managed to wipe out the partition tables on the disks
along with the vxvm configuration (Veritas Volume Manager) inside the
partitions.

They were restoring the partition tables and vxvm config from backups.
From what I remember the backups were printouts, which made it slow going
as they had to be re-entered by hand. The system probably had dozens
of disks (I don't know, but I know what role it was serving so I can
guess)

I'd rather not see that happen ever again

(this was 15+ years ago FWIW, but the lesson is still applicable today)

Gary


Chris Watson

Jul 1, 2016, 12:55:19 PM
Hi Gary!

So I'll add another voice to the KISS camp. I'd rather have two boxes, each with two NICs attached to each other, doing ZFS replication from A to B. Adding more redundant hardware just adds more points of failure. NICs have no moving parts, so as long as they are thermally controlled they won't fail. This is simple and as safe as you can get. As for how to handle an actual failover, I'd really like to try out the CTL-HA option. Maybe this weekend.

Sent from my iPhone 5

Jordan Hubbard

Jul 1, 2016, 1:54:45 PM

> On Jun 30, 2016, at 11:57 AM, Julien Cigar <jul...@perdition.city> wrote:
>
> It would be more than welcome indeed..! I have the feeling that HAST
> isn't that much used (but maybe I am wrong) and it's difficult to find
> informations on it's reliability and concrete long-term use cases...

This has been a long discussion so I’m not even sure where the right place to jump in is, but just speaking as a storage vendor (FreeNAS) I’ll say that we’ve considered HAST many times but also rejected it many times for multiple reasons:

1. Blocks which are found to be corrupt by ZFS (fail checksum) get replicated by HAST nonetheless since it has no idea - it’s below that layer. This means that both good data and corrupt data are replicated to the other pool, which isn’t a fatal flaw but it’s a lot nicer to be replicating only *good* data at a higher layer.

2. When HAST systems go split-brain, it’s apparently hilarious. I don’t have any experience with that in production so I can’t speak authoritatively about it, but the split-brain scenario has been mentioned by some of the folks who are working on clustered filesystems (glusterfs, ceph, etc) and I can easily imagine how that might cause hilarity, given the fact that ZFS has no idea its underlying block store is being replicated and also likes to commit changes in terms of transactions (TXGs), not just individual block writes, and writing a partial TXG (or potentially multiple outstanding TXGs with varying degrees of completion) would Be Bad.

3. HAST only works on a pair of machines with a MASTER/SLAVE relationship, which is pretty ghetto by today’s standards. HDFS (Hadoop’s filesystem) can do block replication across multiple nodes, as can DRBD (Distributed Replicated Block Device), so chasing HAST seems pretty retro and will immediately set you up for embarrassment when the inevitable “OK, that pair of nodes is fine, but I’d like them both to be active and I’d also like to add a 3rd node in this one scenario where I want even more fault-tolerance - other folks can do that, how about you?” question comes up.

In short, the whole thing sounds kind of MEH and that’s why we’ve avoided putting any real time or energy into HAST. DRBD sounds much more interesting, though of course it’s Linux-only. This wouldn’t stop someone else from implementing a similar scheme in a clean-room fashion, of course.

And yes, of course one can layer additional things on top of iSCSI LUNs, just as one can punch through LUNs from older SAN fabrics and put ZFS pools on top of them (been there, done both of those things), though of course the additional indirection has performance and debugging ramifications of its own (when a pool goes sideways, you have additional things in the failure chain to debug). ZFS really likes to “own the disks” in terms of providing block-level fault tolerance and predictable performance characteristics given specific vdev topologies, and once you start abstracting the disks away from it, making statements about predicted IOPs for the pool becomes something of a “???” exercise.

- Jordan

Ben RUBSON

unread,
Jul 1, 2016, 2:24:13 PM7/1/16
to

> On 01 Jul 2016, at 19:54, Jordan Hubbard <j...@ixsystems.com> wrote:
> (...)

> And yes, of course one can layer additional things on top of iSCSI LUNs, just as one can punch through LUNs from older SAN fabrics and put ZFS pools on top of them (been there, done both of those things), though of course the additional indirection has performance and debugging ramifications of its own (when a pool goes sideways, you have additional things in the failure chain to debug). ZFS really likes to “own the disks” in terms of providing block-level fault tolerance and predictable performance characteristics given specific vdev topologies, and once you start abstracting the disks away from it, making statements about predicted IOPs for the pool becomes something of a “???” exercise.

Would you say that giving an iSCSI disk to ZFS hides some details of the raw disk from ZFS ?
I thought that iSCSI would be a totally "transparent" layer, transferring all ZFS requests to the raw disk, giving back the answers, hiding nothing.

As you have experience with iSCSI, any sad stories with iSCSI disks given to ZFS ?

Many thanks for your long feedback Jordan !

Ben RUBSON

unread,
Jul 2, 2016, 11:04:09 AM7/2/16
to

> On 01 Jul 2016, at 20:23, Ben RUBSON <ben.r...@gmail.com> wrote:
>
>
>> On 01 Jul 2016, at 19:54, Jordan Hubbard <j...@ixsystems.com> wrote:
>> (...)
>> And yes, of course one can layer additional things on top of iSCSI LUNs, just as one can punch through LUNs from older SAN fabrics and put ZFS pools on top of them (been there, done both of those things), though of course the additional indirection has performance and debugging ramifications of its own (when a pool goes sideways, you have additional things in the failure chain to debug). ZFS really likes to “own the disks” in terms of providing block-level fault tolerance and predictable performance characteristics given specific vdev topologies, and once you start abstracting the disks away from it, making statements about predicted IOPs for the pool becomes something of a “???” exercise.
>
> Would you say that giving an iSCSI disk to ZFS hides some details of the raw disk to ZFS ?
> I though that iSCSI would have been a totally "transparent" layer, transferring all ZFS requests to the raw disk, giving back the answers, hiding anything.

From #openzfs :
A: typically iSCSI disks still appear as "physical" disks to the OS connecting to them. you can even get iSCSI servers that allow things like SMART pass-thru
Q: so ZFS will be as happy with iSCSI disks as if it used local disks ? or will it miss something ?
A: no, and ZFS isn't "unhappy" per se. but there are optimizations it applies when it knows the disks belong to ZFS only
Q: and using iSCSI disks, ZFS will not apply these optimizations (even if these iSCSI disks are only given to ZFS) ? ie, will ZFS know these iSCSI disks belong to ZFS only ?
A: if it looks like a physical disk, if it quacks like a physical disk...

Ben RUBSON

unread,
Jul 2, 2016, 11:04:40 AM7/2/16
to

> On 30 Jun 2016, at 20:57, Julien Cigar <jul...@perdition.city> wrote:
>
> On Thu, Jun 30, 2016 at 11:32:17AM -0500, Chris Watson wrote:
>>
>>
>> Sent from my iPhone 5
>>
>>>
>>>>
>>>> Yes that's another option, so a zpool with two mirrors (local +
>>>> exported iSCSI) ?
>>>
>>> Yes, you would then have a real time replication solution (as HAST), compared to ZFS send/receive which is not.
>>> Depends on what you need :)
>>>
>>>>
>>>>> ZFS would then know as soon as a disk is failing.
>>
>> So as an aside, but related, for those watching this from the peanut gallery and for the benefit of the OP perhaps those that run with this setup might give some best practices and tips here in this thread on making this a good reliable setup. I can see someone reading this thread and tossing two crappy Ethernet cards in a box and then complaining it doesn't work well.
>
> It would be more than welcome indeed..! I have the feeling that HAST
> isn't that much used (but maybe I am wrong) and it's difficult to find
> informations on it's reliability and concrete long-term use cases...
>
> Also the pros vs cons of HAST vs iSCSI

I made further testing today.

# serverA, serverB :
kern.iscsi.ping_timeout=5
kern.iscsi.iscsid_timeout=5
kern.iscsi.login_timeout=5
kern.iscsi.fail_on_disconnection=1

# Preparation :
- serverB : let's make 2 iSCSI targets : rem3, rem4 (see the config sketch below).
- serverB : let's start ctld.
- serverA : let's create a mirror pool made of 4 disks : loc1, loc2, rem3, rem4.
- serverA : pool is healthy.
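
For reference, a minimal sketch of this preparation (IQNs, the replication IP, device names and the pool name below are only examples, and ctl.conf is stripped down to the bare minimum) :

# serverB : /etc/ctl.conf :
portal-group pg0 {
    discovery-auth-group no-authentication
    listen 192.168.100.2
}
target iqn.2016-07.com.example:rem3 {
    auth-group no-authentication
    portal-group pg0
    lun 0 {
        path /dev/ada3
    }
}
target iqn.2016-07.com.example:rem4 {
    auth-group no-authentication
    portal-group pg0
    lun 0 {
        path /dev/ada4
    }
}

# serverB : start the target side :
sysrc ctld_enable=YES
service ctld start

# serverA : attach the two targets (they show up as new daN devices) :
sysrc iscsid_enable=YES
service iscsid start
iscsictl -A -p 192.168.100.2 -t iqn.2016-07.com.example:rem3
iscsictl -A -p 192.168.100.2 -t iqn.2016-07.com.example:rem4

# serverA : build the 4-way mirror (loc1/loc2 being GPT labels of the local
# disks, rem3/rem4 the freshly attached iSCSI disks, e.g. da2/da3) :
zpool create storage mirror gpt/loc1 gpt/loc2 da2 da3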

# Test 1 :
- serverA : put a lot of data into the pool ;
- serverB : stop ctld ;
- serverA : put a lot of data into the pool ;
- serverB : start ctld ;
- serverA : make all pool disks online : it works, pool is healthy.

# Test 2 :
- serverA : put a lot of data into the pool ;
- serverA : export the pool ;
- serverB : import the pool : it does not work, as ctld locks the disks ! Good news, nice protection (both servers won't be able to access the same disks at the same time).
- serverB : stop ctld ;
- serverB : import the pool : it works, 2 disks missing ;
- serverA : let's make 2 iSCSI targets : rem1, rem2 ;
- serverB : make all pool disks online : it works, pool is healthy.

# Test 3 :
- serverA : put a lot of data into the pool ;
- serverB : stop ctld ;
- serverA : put a lot of data into the pool ;
- serverB : import the pool : it works, 2 disks missing ;
- serverA : let's make 2 iSCSI targets : rem1, rem2 ;
- serverB : make all pool disks online : it works, pool is healthy, but of course data written at step3 is lost.

# Test 4 :
- serverA : put a lot of data into the pool ;
- serverB : stop ctld ;
- serverA : put a lot of data into the pool ;
- serverA : export the pool ;
- serverA : let's make 2 iSCSI targets : rem1, rem2 ;
- serverB : import the pool : it works, pool is healthy, data written at step3 is here.

# Test 5 :
- serverA : rsync a huge remote repo into the pool in the background ;
- serverB : stop ctld ;
- serverA : 2 disks missing, but rsync still runs flawlessly ;
- serverB : start ctld ;
- serverA : make all pool disks online : it works, pool is healthy.
- serverB : ifconfig <replication_interface> down ;
- serverA : 2 disks missing, but rsync still runs flawlessly ;
- serverB : ifconfig <replication_interface> up ;
- serverA : make all pool disks online : it works, pool is healthy.
- serverB : power reset !
- serverA : 2 disks missing, but rsync still runs flawlessly ;
- serverB : let's wait for server to be up ;
- serverA : make all pool disks online : it works, pool is healthy.
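
For the record, the "make all pool disks online" step above is something like the following (assuming the two remote disks re-appear as da2/da3 and the pool is named storage) :

zpool online storage da2 da3
zpool status storage    # both disks resilver and come back ONLINE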

Quite happy with these tests actually :)

Ben

Julien Cigar

unread,
Jul 3, 2016, 3:30:19 PM7/3/16
to
Thank you very much for those quick tests! I'll start my own ones
tomorrow, but based on your preliminary results it *seems* that the
ZFS + iSCSI combination could be a potential candidate for what I'd
like to do..!

>
> Ben
>
> _______________________________________________
> freeb...@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-...@freebsd.org"

signature.asc

Julien Cigar

unread,
Jul 3, 2016, 5:47:54 PM7/3/16
to
another question from a performance point of view, imagine that you
create a single mirror zpool, something like:
$> zpool create storage mirror loc1 loc2 rem1 rem2

(where rem1 and rem2 are iSCSI disks)

I guess that ZFS will split the read requests across all devices in
order to maximize performance... which could lead to the contrary of
what is expected when iSCSI disks are involved, no?
Are there some sysctl params which could prevent this unexpected
behavior?
signature.asc

Jordan Hubbard

unread,
Jul 3, 2016, 6:25:14 PM7/3/16
to

> On Jul 3, 2016, at 2:47 PM, Julien Cigar <jul...@perdition.city> wrote:
>
> I guess that ZFS will split the read requests accross all devices in
> order to maximize performance... which could lead to contrary to what is
> expecpted when iSCSI disks are involved, no?
> Is there some sysctl params which could prevent this unexpected
> behavior?

Nope. You will suffer the performance implications of layering a filesystem that expects “rotating media or SSDs” (with the innate ability to parallelize multiple requests in a way that ADDS performance) on top of a system which is now serializing the requests across an internet connection to another software layer which may offer no performance benefits to having multiple LUNs at all. You can try iSCSI-specific tricks like MPIO to try and increase performance, but ZFS itself is just going to treat everything it sees as “a disk” and so physical concepts like mirrors or multiple vdevs for performance won’t translate across.

Example question: What’s the point of writing multiple copies of data across virtual disks in a mirror configuration if the underlying storage for the virtual disks is already redundant and the I/Os to it serialize?
Example Answer: There is no point. In fact, it’s a pessimization to do so.

This is not a lot different than running ZFS on top of RAID controllers that turn N physical disks into 1 or more virtual disks. You have to make entirely different performance decisions based on such scenarios and that’s just the way it is, which is also why we don’t recommend doing that.

- Jordan

Ben RUBSON

unread,
Jul 4, 2016, 1:56:48 AM7/4/16
to
> I guess that ZFS will split the read requests accross all devices in
> order to maximize performance... which could lead to contrary to what is
> expecpted when iSCSI disks are involved, no?

Not necessarily, no, if your network card is not a bottleneck.
If you only have 1Gbps adapters, forget this solution.
You should have at least 10Gbps adapters, and depending on how many disks you have, you would have to go with 25/40Gbps adapters...
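
A rough back-of-the-envelope, assuming ~200 MB/s of sequential throughput per spinning disk (adjust to your own disks) :

#  2 remote disks :  2 x 200 MB/s x 8 ~  3.2 Gb/s  -> 1 GbE is far too small
#  6 remote disks :  6 x 200 MB/s x 8 ~  9.6 Gb/s  -> 10 GbE is already tight
# 12 remote disks : 12 x 200 MB/s x 8 ~ 19.2 Gb/s  -> 25/40 GbE territory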

> Is there some sysctl params which could prevent this unexpected
> behavior?
>
>>
>>>
>>> Ben
>>>
>>> _______________________________________________
>>> freeb...@freebsd.org mailing list
>>> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
>>> To unsubscribe, send any mail to "freebsd-fs-...@freebsd.org"
>>
>> --
>> Julien Cigar
>> Belgian Biodiversity Platform (http://www.biodiversity.be)
>> PGP fingerprint: EEF9 F697 4B68 D275 7B11 6A25 B2BB 3710 A204 23C0
>> No trees were killed in the creation of this message.
>> However, many electrons were terribly inconvenienced.
>
>
>
> --
> Julien Cigar
> Belgian Biodiversity Platform (http://www.biodiversity.be)
> PGP fingerprint: EEF9 F697 4B68 D275 7B11 6A25 B2BB 3710 A204 23C0
> No trees were killed in the creation of this message.
> However, many electrons were terribly inconvenienced.

Ben RUBSON

unread,
Jul 4, 2016, 2:06:01 AM7/4/16
to
> On 04 Jul 2016, at 00:24, Jordan Hubbard <j...@ixsystems.com> wrote:
>
>> On Jul 3, 2016, at 2:47 PM, Julien Cigar <jul...@perdition.city> wrote:
>>
>> I guess that ZFS will split the read requests accross all devices in
>> order to maximize performance... which could lead to contrary to what is
>> expecpted when iSCSI disks are involved, no?
>> Is there some sysctl params which could prevent this unexpected
>> behavior?
>
> Nope. You will suffer the performance implications of layering a filesystem that expects “rotating media or SSDs” (with the innate ability to parallelize multiple requests in a way that ADD performance) on top of a system which is now serializing the requests across an internet connection to another software layer which may offer no performance benefits to having multiple LUNs at all. You can try iSCSI-specific tricks like MPIO to try and increase performance, but ZFS itself is just going to treat everything it sees as “a disk” and so physical concepts like mirrors or multiple vdevs for performance won’t translate across.
>
> Example question: What’s the point of writing multiple copies of data across virtual disks in a mirror configuration if the underlying storage for the virtual disks is already redundant and the I/Os to it serialize?
> Example Answer: There is no point. In fact, it’s a pessimization to do so.

Of course Jordan, in this topic, we (well at least me :) make the following assumption :
one iSCSI target/disk = one real physical disk (a SAS disk, a SSD disk...), from a server having its own JBOD, no RAID adapter or whatever, just what ZFS likes !

> This is not a lot different than running ZFS on top of RAID controllers that turn N physical disks into 1 or more virtual disks. You have to make entirely different performance decisions based on such scenarios and that’s just the way it is, which is also why we don’t recommend doing that.

Of course you lose all ZFS benefits if you only mirror 2 "disks", a big one from storage array A, the same from storage array B.
There is no point.

Julien Cigar

unread,
Jul 4, 2016, 8:25:50 AM7/4/16
to
another option I had in mind is to export ZVOL through iSCSI:
- on serverA: $> zpool create storage mirror /dev/ada1 /dev/ada2
- on serverB: $> zpool create storage mirror /dev/ada1 /dev/ada2

create a 50G dedicated redundant storage for serverC:
- on serverA: $> zfs create -V 50G storage/serverc
- on serverB: $> zfs create -V 50G storage/serverc

iSCSI export /dev/zvol/storage/serverc from serverA and serverB (ctl.conf sketch below) and
create a ZFS dataset on serverC:
- on serverC: $> zpool create storage mirror /dev/ivol1 /dev/ivol2

(where ivol1 is the volume from serverA and ivol2 the volume from
serverB)

create, for example, a dataset for the dovecot service
- on serverC: $> zfs create -o compress=lz4 storage/dovecot
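
For the iSCSI export step above, the corresponding /etc/ctl.conf stanza on serverA and serverB would look roughly like this (a sketch only ; the IQN is a placeholder and a portal-group pg0 is assumed to be defined as usual) :

target iqn.2016-07.com.example:serverc {
    auth-group no-authentication
    portal-group pg0
    lun 0 {
        path /dev/zvol/storage/serverc
    }
}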

any thoughts?

>
> Ben
>
> _______________________________________________
> freeb...@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-...@freebsd.org"

signature.asc

Miroslav Lachman

unread,
Jul 4, 2016, 8:56:19 AM7/4/16
to
Julien Cigar wrote on 07/04/2016 14:25:

> another option I had in mind is to export ZVOL through iSCSI:
> - on serverA: $> zpool create storage mirror /dev/ada1 /dev/ada2
> - on serverB: $> zpool create storage mirror /dev/ada1 /dev/ada2
>
> create a 50G dedicated redundant storage for serverC:
> - on serverA: $> zfs create -V 50G storage/serverc
> - on serverB: $> zfs create -V 50G storage/serverc
>
> iSCSI export /dev/zvol/storage/serverc from serverA and serverB and
> create a ZFS dataset on serverC:
> - on serverC: $> zpool create storage mirror /dev/ivol1 /dev/ivol2
>
> (where ivol1 is the volume from serverA and ivol2 the volume from
> serverB)
>
> create, for example, a dataset for the dovecot service
> - on serverC: $> zfs create -o compress=lz4 storage/dovecot

I think it will be painfully slow. ZFS on top of ZFS through iSCSI...
too much layering, too much delay.

Miroslav Lachman

Ben RUBSON

unread,
Jul 4, 2016, 8:57:27 AM7/4/16
to

> On 04 Jul 2016, at 14:56, Miroslav Lachman <000....@quip.cz> wrote:
>
> Julien Cigar wrote on 07/04/2016 14:25:
>
>> another option I had in mind is to export ZVOL through iSCSI:
>> - on serverA: $> zpool create storage mirror /dev/ada1 /dev/ada2
>> - on serverB: $> zpool create storage mirror /dev/ada1 /dev/ada2
>>
>> create a 50G dedicated redundant storage for serverC:
>> - on serverA: $> zfs create -V 50G storage/serverc
>> - on serverB: $> zfs create -V 50G storage/serverc
>>
>> iSCSI export /dev/zvol/storage/serverc from serverA and serverB and
>> create a ZFS dataset on serverC:
>> - on serverC: $> zpool create storage mirror /dev/ivol1 /dev/ivol2
>>
>> (where ivol1 is the volume from serverA and ivol2 the volume from
>> serverB)
>>
>> create, for example, a dataset for the dovecot service
>> - on serverC: $> zfs create -o compress=lz4 storage/dovecot
>
> I think it will be painfully slow. ZFS on top of ZFS throught iSCSI... too much layering, too much delays.

And serverC is a SPOF.

InterNetX - Juergen Gotteswinter

unread,
Jul 4, 2016, 9:04:05 AM7/4/16
to
I see so much pain upcoming when $something goes wrong, even the smallest piece
will end up in drama.

ZFS has been designed as a pragmatic, stupid-simple, solid solution, why would
one wreck this concept with such complexity :(


> Miroslav Lachman <000....@quip.cz> hat am 4. Juli 2016 um 14:56 geschrieben:
>
>
> Julien Cigar wrote on 07/04/2016 14:25:
>
> > another option I had in mind is to export ZVOL through iSCSI:
> > - on serverA: $> zpool create storage mirror /dev/ada1 /dev/ada2
> > - on serverB: $> zpool create storage mirror /dev/ada1 /dev/ada2
> >
> > create a 50G dedicated redundant storage for serverC:
> > - on serverA: $> zfs create -V 50G storage/serverc
> > - on serverB: $> zfs create -V 50G storage/serverc
> >
> > iSCSI export /dev/zvol/storage/serverc from serverA and serverB and
> > create a ZFS dataset on serverC:
> > - on serverC: $> zpool create storage mirror /dev/ivol1 /dev/ivol2
> >
> > (where ivol1 is the volume from serverA and ivol2 the volume from
> > serverB)
> >
> > create, for example, a dataset for the dovecot service
> > - on serverC: $> zfs create -o compress=lz4 storage/dovecot
>
> I think it will be painfully slow. ZFS on top of ZFS throught iSCSI...
> too much layering, too much delays.
>
> Miroslav Lachman

Jordan Hubbard

unread,
Jul 4, 2016, 1:56:00 PM7/4/16
to

> On Jul 3, 2016, at 11:05 PM, Ben RUBSON <ben.r...@gmail.com> wrote:
>
> Of course Jordan, in this topic, we (well at least me :) make the following assumption :
> one iSCSI target/disk = one real physical disk (a SAS disk, a SSD disk...), from a server having its own JBOD, no RAID adapter or whatever, just what ZFS likes !

I certainly wouldn’t make that assumption. Once you allow iSCSI to be the back-end in any solution, end-users will avail themselves of the flexibility to also export arbitrary or synthetic devices (like zvols / RAID devices) as “disks”. You can’t stop them from doing so, so you might as well incorporate that scenario into your design. Even if you could somehow enforce the 1:1 mapping of LUN to disk, iSCSI itself is still going to impose a serialization / performance / reporting (iSCSI LUNs don’t report SMART status) penalty that removes a lot of the advantages of having direct physical access to the media, so one might also ask what you’re gaining by imposing those restrictions.

- Jordan

Jordan Hubbard

unread,
Jul 4, 2016, 2:06:54 PM7/4/16
to

> On Jul 1, 2016, at 11:23 AM, Ben RUBSON <ben.r...@gmail.com> wrote:
>
> Would you say that giving an iSCSI disk to ZFS hides some details of the raw disk to ZFS ?

Yes, of course.

> I though that iSCSI would have been a totally "transparent" layer, transferring all ZFS requests to the raw disk, giving back the answers, hiding anything.

Not really, no. There are other ways of talking to a disk or SSD device, such as getting S.M.A.R.T. data to see when/if a drive is failing. Drives also return checksum errors that may be masked by the iSCSI target. Finally, there is SCSI-2 and there is SCSI-3 (where things like persistent reservations are implemented). None of these things are necessarily implemented by iSCSI.

Julien Cigar

unread,
Jul 4, 2016, 2:37:05 PM7/4/16
to
On Mon, Jul 04, 2016 at 10:55:40AM -0700, Jordan Hubbard wrote:
>
> > On Jul 3, 2016, at 11:05 PM, Ben RUBSON <ben.r...@gmail.com> wrote:
> >
> > Of course Jordan, in this topic, we (well at least me :) make the following assumption :
> > one iSCSI target/disk = one real physical disk (a SAS disk, a SSD disk...), from a server having its own JBOD, no RAID adapter or whatever, just what ZFS likes !
>
> I certainly wouldn’t make that assumption. Once you allow iSCSI to be the back-end in any solution, end-users will avail themselves of the flexibility to also export arbitrary or synthetic devices (like zvols / RAID devices) as “disks”. You can’t stop them from doing so, so you might as well incorporate that scenario into your design. Even if you could somehow enforce the 1:1 mapping of LUN to disk, iSCSI itself is still going to impose a serialization / performance / reporting (iSCSI LUNs don’t report SMART status) penalty that removes a lot of the advantages of having direct physical access to the media, so one might also ask what you’re gaining by imposing those restrictions.

I think the discussion evolved a bit since I started this thread, the
original purpose was to build a low-cost redundant storage for a small
infrastructure, no more no less.

The context is the following: I work in a small company, partially
financed by public funds, we started small, evolved a bit to a point
that some redundancy is required for $services.
Unfortunately I'm alone to take care of the infrastructure (and it's
only 50% of my time) and we don't have that much money :(

That's why I was just thinking of two machines with a simple HBA card,
2x4TB, a zpool mirror on those 4 disks (2 local, 2 iSCSI exported), an
NFS share on top, and an easy failover procedure..
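
To give an idea, the failover part could be little more than a devd(8) hook on the CARP state change plus a tiny script - just a sketch, and the vhid/interface, IP, pool name and services below are only examples :

# /etc/devd.conf snippet (on both nodes), example vhid 1 on em0 :
notify 30 {
    match "system" "CARP";
    match "subsystem" "1@em0";
    match "type" "MASTER";
    action "/usr/local/sbin/become_master.sh";
};

# /usr/local/sbin/become_master.sh (sketch, no error handling) :
#!/bin/sh
# the pool was probably not exported cleanly if the other node died, so force the import ;
# the two iSCSI disks coming from the dead node will simply show up as UNAVAIL until it is back
zpool import -f storage
# start NFS on the new active node (rpcbind/mountd/nfsd assumed configured)
service rpcbind onestart
service mountd onestart
service nfsd onestart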

I understand that iSCSI hides some details, but as far as I know it's
the "lowest" level that you can provide to ZFS when the disks are not
local, no ?

Anyway, thanks for your feedback, it's greatly appreciated! :)

Julien

>
> - Jordan
>
> _______________________________________________
> freeb...@freebsd.org mailing list
> https://lists.freebsd.org/mailman/listinfo/freebsd-fs
> To unsubscribe, send any mail to "freebsd-fs-...@freebsd.org"

signature.asc

Jordan Hubbard

unread,
Jul 4, 2016, 2:57:15 PM7/4/16
to

> On Jul 4, 2016, at 11:36 AM, Julien Cigar <jul...@perdition.city> wrote:
>
> I think the discussion evolved a bit since I started this thread, the
> original purpose was to build a low-cost redundant storage for a small
> infrastructure, no more no less.
>
> The context is the following: I work in a small company, partially
> financed by public funds, we started small, evolved a bit to a point
> that some redundancy is required for $services.
> Unfortunately I'm alone to take care of the infrastructure (and it's
> only 50% of my time) and we don't have that much money :(

Sure, I get that part also, but let’s put the entire conversation into context:

1. You’re looking for a solution to provide some redundant storage in a very specific scenario.

2. We’re talking on a public mailing list with a bunch of folks, so the conversation is also naturally going to go from the specific to the general - e.g. “Is there anything of broader applicability to be learned / used here?” I’m speaking more to the larger audience who is probably wondering if there’s a more general solution here using the same “moving parts”.

To get specific again, I am not sure I would do what you are contemplating given your circumstances since it’s not the cheapest / simplest solution. The cheapest / simplest solution would be to create 2 small ZFS servers and simply do zfs snapshot replication between them at periodic intervals, so you have a backup copy of the data for maximum safety as well as a physically separate server in case one goes down hard. Disk storage is the cheap part now, particularly if you have data redundancy and can therefore use inexpensive disks, and ZFS replication is certainly “good enough” for disaster recovery. As others have said, adding additional layers will only increase the overall fragility of the solution, and “fragile” is kind of the last thing you need when you’re frantically trying to deal with a server that has gone down for what could be any number of reasons.
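
For what it's worth, such periodic replication can be as small as a snapshot, an incremental send piped through ssh and a cron entry - a bare-bones sketch only (dataset, host and schedule are placeholders, no error handling, no snapshot pruning, and it assumes ssh keys and receive permissions are already in place):

#!/bin/sh
# replicate.sh : incremental zfs send/receive from the primary to the standby
DS=storage/data
PREV=$(zfs list -H -t snapshot -o name -s creation -d 1 ${DS} | tail -1)
NEW=${DS}@repl-$(date +%Y%m%d%H%M%S)

zfs snapshot ${NEW}
if [ -n "${PREV}" ]; then
    # incremental stream since the last snapshot
    zfs send -i ${PREV} ${NEW} | ssh standby zfs receive -F ${DS}
else
    # first run : full stream
    zfs send ${NEW} | ssh standby zfs receive -F ${DS}
fi

# crontab entry on the primary, every 15 minutes for example :
# */15 * * * * /usr/local/sbin/replicate.sh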

I, for example, use a pair of FreeNAS Minis at home to store all my media and they work fine at minimal cost. I use one as the primary server that talks to all of the VMWare / Plex / iTunes server applications (and serves as a backup device for all my iDevices) and it replicates the entire pool to another secondary server that can be pushed into service as the primary if the first one loses a power supply / catches fire / loses more than 1 drive at a time / etc. Since I have a backup, I can also just use RAIDZ1 for the 4x4Tb drive configuration on the primary and get a good storage / redundancy ratio (I can lose a single drive without data loss but am also not wasting a lot of storage on parity).

Just my two cents. There are a lot of different ways to do this, and like all things involving computers (especially PCs), the simplest way is usually the best.

Julien Cigar

unread,
Jul 4, 2016, 3:32:00 PM7/4/16
to
On Mon, Jul 04, 2016 at 11:56:57AM -0700, Jordan Hubbard wrote:
>
> > On Jul 4, 2016, at 11:36 AM, Julien Cigar <jul...@perdition.city> wrote:
> >
> > I think the discussion evolved a bit since I started this thread, the
> > original purpose was to build a low-cost redundant storage for a small
> > infrastructure, no more no less.
> >
> > The context is the following: I work in a small company, partially
> > financed by public funds, we started small, evolved a bit to a point
> > that some redundancy is required for $services.
> > Unfortunately I'm alone to take care of the infrastructure (and it's
> > only 50% of my time) and we don't have that much money :(
>
> Sure, I get that part also, but let’s put the entire conversation into context:
>
> 1. You’re looking for a solution to provide some redundant storage in a very specific scenario.
>
> 2. We’re talking on a public mailing list with a bunch of folks, so the conversation is also naturally going to go from the specific to the general - e.g. “Is there anything of broader applicability to be learned / used here?” I’m speaking more to the larger audience who is probably wondering if there’s a more general solution here using the same “moving parts”.

of course..! It has been an interesting discussion, I learned some things,
and it's always enjoyable to get different points of view.

>
> To get specific again, I am not sure I would do what you are contemplating given your circumstances since it’s not the cheapest / simplest solution. The cheapest / simplest solution would be to create 2 small ZFS servers and simply do zfs snapshot replication between them at periodic intervals, so you have a backup copy of the data for maximum safety as well as a physically separate server in case one goes down hard. Disk storage is the cheap part now, particularly if you have data redundancy and can therefore use inexpensive disks, and ZFS replication is certainly “good enough” for disaster recovery. As others have said, adding additional layers will only increase the overall fragility of the solution, and “fragile” is kind of the last thing you need when you’re frantically trying to deal with a server that has gone down for what could be any number of reasons.
>
> I, for example, use a pair of FreeNAS Minis at home to store all my media and they work fine at minimal cost. I use one as the primary server that talks to all of the VMWare / Plex / iTunes server applications (and serves as a backup device for all my iDevices) and it replicates the entire pool to another secondary server that can be pushed into service as the primary if the first one loses a power supply / catches fire / loses more than 1 drive at a time / etc. Since I have a backup, I can also just use RAIDZ1 for the 4x4Tb drive configuration on the primary and get a good storage / redundancy ratio (I can lose a single drive without data loss but am also not wasting a lot of storage on parity).

You're right, I'll definitely reconsider the zfs send / zfs receive
approach.

>
> Just my two cents. There are a lot of different ways to do this, and like all things involving computers (especially PCs), the simplest way is usually the best.
>

Thanks!

Julien

> - Jordan
signature.asc

Ben RUBSON

unread,
Jul 5, 2016, 6:37:53 AM7/5/16
to

> On 04 Jul 2016, at 20:06, Jordan Hubbard <j...@ixsystems.com> wrote:
>
>
>> On Jul 1, 2016, at 11:23 AM, Ben RUBSON <ben.r...@gmail.com> wrote:
>>
>> Would you say that giving an iSCSI disk to ZFS hides some details of the raw disk to ZFS ?
>
> Yes, of course.
>
>> I though that iSCSI would have been a totally "transparent" layer, transferring all ZFS requests to the raw disk, giving back the answers, hiding anything.
>
> Not really, no. There are other ways of talking to a disk or SSD device, such as getting S.M.A.R.T. data to see when/if a drive is failing.

Yep but SMART is not used by ZFS itself, according to a dev on #openzfs.
There is however a feature request here : https://github.com/zfsonlinux/zfs/issues/2777

I don't know whether the FreeBSD iSCSI target implementation does SMART pass-thru or not (I don't think so, my tests some months ago did not work).
However SMART data of iSCSI disks can easily be checked on the target server itself (so not on the initiator, I agree) using smartmontools.
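
For example, on the target side something as simple as this does the job (device names are examples) :

# serverB (iSCSI target) : query the real disks backing the exported LUNs :
smartctl -a /dev/ada3
smartctl -a /dev/ada4

# or let smartd(8) do the monitoring, /usr/local/etc/smartd.conf :
/dev/ada3 -a -m root@example.com
/dev/ada4 -a -m root@example.com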

> Drives also return checksum errors that may be masked by the iSCSI target.

Should be caught by smartmontools running on target, right ?

> Finally, there is SCSI-2 and there is SCSI-3 (where things like persistent reservations are implemented). None of these things are necessarily implemented by iSCSI.

At least FreeBSD implements persistent reservations :
https://www.freebsd.org/cgi/man.cgi?query=ctl
However I'm not sure ZFS itself uses such a feature, which is more relevant for clusters. Does it ?

Ben

Jan Bramkamp

unread,
Jul 12, 2016, 9:16:18 AM7/12/16
to
On 04/07/16 19:55, Jordan Hubbard wrote:
>
>> On Jul 3, 2016, at 11:05 PM, Ben RUBSON <ben.r...@gmail.com> wrote:
>>
>> Of course Jordan, in this topic, we (well at least me :) make the following assumption :
>> one iSCSI target/disk = one real physical disk (a SAS disk, a SSD disk...), from a server having its own JBOD, no RAID adapter or whatever, just what ZFS likes !
>
> I certainly wouldn’t make that assumption. Once you allow iSCSI to be the back-end in any solution, end-users will avail themselves of the flexibility to also export arbitrary or synthetic devices (like zvols / RAID devices) as “disks”. You can’t stop them from doing so, so you might as well incorporate that scenario into your design. Even if you could somehow enforce the 1:1 mapping of LUN to disk, iSCSI itself is still going to impose a serialization / performance / reporting (iSCSI LUNs don’t report SMART status) penalty that removes a lot of the advantages of having direct physical access to the media, so one might also ask what you’re gaining by imposing those restrictions.


How about 3way ZFS mirrors spread over three SAS JBODs with dual-ported
expanders connected to two FreeBSD servers with SAS HBAs and a
*reliable* arbiter to the disks. This could either be an external
locking server e.g. consul/etcd/zookeeper and/or SCSI reservations. If
more than two head servers are to share the disks a pair of SAS switches
should do the job.

If N-1 disk redundancy is enough two JBODs and 2way mirrors would work
as well.

While you can't prevent stupid operators from blowing their feet off, it
doesn't offer the same "flexibility" as iSCSI, if only because you can't
conveniently hook up everything talking Ethernet and offering itself as an
iSCSI target. That is, until someone implements a SAS target with CTL and
a suitable HBA in FreeBSD ;-).

This kind of setup should also preserve all assumptions ZFS has
regarding disks.

I have the required spare hardware to build a two JBOD test setup [1]
and could run some tests if anyone is interested in such a setup.


[1]: Test setup

+--------------+     +--------------+
|    MASTER    |     |    SLAVE     |
|              |     |              |
|  HBA0  HBA1  |     |  HBA0  HBA1  |
+---+-----+----+     +---+-----+----+
    ^     ^              ^     ^
    |     |              |     |
    |     |              |     +------+
    |     |              |            |
    |     |         +----+            |
    |     |         |                 |
    |     +-----+   |                 |
    |           |   |                 |
    v           v   v                 |
+---+------+  +-+---+----+            |
|  JBOD 0  |  |  JBOD 1  |            |
+------+---+  +----------+            |
       ^                              |
       |                              |
       +------------------------------+

Ben RUBSON

unread,
Jul 14, 2016, 11:50:56 AM7/14/16
to

> On 12 Jul 2016, at 15:15, Jan Bramkamp <cr...@rlwinm.de> wrote:
>
> On 04/07/16 19:55, Jordan Hubbard wrote:
>>
>>> On Jul 3, 2016, at 11:05 PM, Ben RUBSON <ben.r...@gmail.com> wrote:
>>>
>>> Of course Jordan, in this topic, we (well at least me :) make the following assumption :
>>> one iSCSI target/disk = one real physical disk (a SAS disk, a SSD disk...), from a server having its own JBOD, no RAID adapter or whatever, just what ZFS likes !
>>
>> I certainly wouldn’t make that assumption. Once you allow iSCSI to be the back-end in any solution, end-users will avail themselves of the flexibility to also export arbitrary or synthetic devices (like zvols / RAID devices) as “disks”. You can’t stop them from doing so, so you might as well incorporate that scenario into your design. Even if you could somehow enforce the 1:1 mapping of LUN to disk, iSCSI itself is still going to impose a serialization / performance / reporting (iSCSI LUNs don’t report SMART status) penalty that removes a lot of the advantages of having direct physical access to the media, so one might also ask what you’re gaining by imposing those restrictions.
>
>
> How about 3way ZFS mirrors spread over three SAS JBODs with dual-ported expanders connected to two FreeBSD servers with SAS HBAs and a *reliable* arbiter to the disks. This could either be an external locking server e.g. consul/etcd/zookeeper and/or SCSI reservations. If more than two head servers are to share the disks a pair of SAS switches should do the job.

It would be nice if it could work without a third server, so one important / interesting thing to test would be the SCSI reservations : be sure that when the pool is imported on MASTER, SLAVE can't use the disks anymore.
(this is the case with iSCSI, when SLAVE exports its disks through CTL, it can't import them using ZFS as CTL locks them as soon as it is started)

> If N-1 disk redundancy is enough two JBODs and 2way mirrors would work as well.

Or if we only have 2 JBODs (for whatever reason), we could (should certainly :) use 4way mirrors so that if one JBOD dies, we're still confident with the pool.

> While you can't prevent stupid operators from blowing their feet of it doesn't offer the same "flexibility" as iSCSI if only because you can't conveniently hookup everything talking Ethernet offering itself als iSCSI target. That is until someone implements a SAS target with CTL and a suitable HBA in FreeBSD ;-).

Why would you prefer a SAS target over an iSCSI target ?
How would it fit ?

> This kind of setup should also preserve all assumptions ZFS has regarding disks.

Yep, although AFAIR no one demonstrated ZFS suffers from iSCSI :) (devs on #openzfs stated it does not)

Anyway, this is a nice SAS-only setup, which avoids an additional protocol, a very good reason to go with it.
One good reason for iSCSI is that it allows servers to be in different racks (well there are long SAS cables) / different rooms / buildings.

Ben RUBSON

unread,
Jul 21, 2016, 1:53:06 AM7/21/16
to

> On 01 Jul 2016, at 17:02, Ben RUBSON <ben.r...@gmail.com> wrote:
>
> I think what we miss is some kind of this :
> http://milek.blogspot.fr/2007/03/zfs-online-replication.html
> http://www.compnect.net/?p=16461
>
> Online replication built in ZFS would be awesome.

Note that I opened the following feature request a few days ago :
https://www.illumos.org/issues/7166

Could be interesting to follow it.

Ben

InterNetX - Juergen Gotteswinter

unread,
Jul 21, 2016, 3:59:24 AM7/21/16
to
I would not expect to see something like that before maybe 2040 or so.

No offense against the zfs devs, imho the fs itself is not the right
place for this. I could be wrong, if so someone please feel free to
point it out.

In the end, to me it looks like you have taken the most important
features out of hast, carp and probably rsf-1 cluster, and with some
mixing and stirring one can get a solution like this.

No, it won't work (well, somehow yes, but rather no): I don't know of anyone
missing active sync replication, and manpower for such important parts is
limited, but afaik there's the possibility to sponsor such an addon.

Like, shut up and take my money.

Ben RUBSON

unread,
Jul 21, 2016, 4:09:14 AM7/21/16
to

> On 21 Jul 2016, at 09:51, InterNetX - Juergen Gotteswinter <j...@internetx.com> wrote:
>
> i whould not expect to see something like maybe 2040 or so.
>
> no offense agains the zfs devs, imho the fs itself is not the right
> place for this. I could be wrong, if yes someone please feel free to
> point at it.
>
> In the End, to me it looks like you have taken the most important
> features out of hast, carp and probably rsf-1 cluster and with some
> mixing and stirring one can get a solution like this.

HAST (same for Linux DRBD) adds an additional stack between the disks and ZFS.
In addition, HAST may require a lot of network bandwidth depending on the pool layout, much more than the incoming data throughput.
Built-in ZFS replication would not require much more network bandwidth than the incoming data throughput itself.
Suitable for long-distance replication :)

> no wont work, (somehow yes, but rather no: i dont know of anyone missing
> active sync replication, and manpower for this important parts is
> limited, but afaik theres the possility to sponsor such an addon.
>
> like, shut up and take my money

We may then hope to see it before 2040 ;)

InterNetX - Juergen Gotteswinter

unread,
Jul 21, 2016, 4:25:26 AM7/21/16
to
Nevermind, good to see you having so much fun with this project. I wish you
great success, but probably rethink some points and maybe consider the
keep it fuc... simple thing. Less parts, less moving parts, less black
magic which calls itself being smart / intelligent / proactive - there's
so much bs out there..

Please,

read this

https://www.joyent.com/blog/network-storage-in-the-cloud-delicious-but-deadly

We were at the same point, like they described. Guess what happened.
