This sounds similar to the CTL-HA code that went in last year, for which I haven't seen any sort of how-to. The RSF-1 stuff sounds like it has more scaling options, though. Which it probably should, given that it's a commercial product.
-Joe
On 01.07.2016 at 15:18, Joe Love wrote:
>
>> On Jul 1, 2016, at 6:09 AM, InterNetX - Juergen Gotteswinter <j...@internetx.com> wrote:
>>
>> On 01.07.2016 at 12:57, Julien Cigar wrote:
>>> On Fri, Jul 01, 2016 at 12:18:39PM +0200, InterNetX - Juergen Gotteswinter wrote:
>>>
>>> of course I'll test everything properly :) I don't have the hardware yet
>>> so ATM I'm just looking for all the possible "candidates", and I'm
>>> aware that redundant storage is not that easy to implement ...
>>>
>>> but what solutions do we have? It's either CARP + ZFS + (HAST|iSCSI),
>>> or zfs send | ssh zfs receive as you suggest (but that's
>>> not realtime), or a distributed FS (which I avoid like the plague...)
>>
>> zfs send/receive can be nearly realtime.
>>
>> External JBODs with cross-cabled SAS + a commercial cluster solution like
>> RSF-1. Anything else is a fragile construction which is begging for disaster.
>
> This sounds similar to the CTL-HA code that went in last year, for which I haven't seen any sort of how-to. The RSF-1 stuff sounds like it has more scaling options, though. Which it probably should, given that it's a commercial product.
RSF-1 is what Pacemaker / Heartbeat tries to be. Judge me for linking
whitepapers, but in this case it's not the usual evil marketing blah:
http://www.high-availability.com/wp-content/uploads/2013/01/RSF-1-HA-PLUGIN-ZFS-STORAGE-CLUSTER.pdf
@ Julien
seems like you take availability really seriously, so I guess you've also
got plans for how to handle network problems like dead switches, flaky
cables and so on.
Like using multiple network cards in the boxes, cross cabling between the
hosts (RS232 and Ethernet, of course), and using proven, reliable network
switches in a stacked configuration (for example, stacked Cisco 3750s). Not
to forget redundant power feeds to redundant power supplies.
If not, I would start again from scratch.
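For the multiple-NICs part, lagg(4) in failover mode is the simple version.
An untested sketch (the interface names and address are made up):

    # /etc/rc.conf -- two physical ports, one logical failover interface
    ifconfig_igb0="up"
    ifconfig_igb1="up"
    cloned_interfaces="lagg0"
    ifconfig_lagg0="laggproto failover laggport igb0 laggport igb1 10.0.0.1/24"

Traffic stays on igb0 and fails over to igb1 if the link drops; plug each
port into a different switch of the stack.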
Arubas, okay, a quick look at the spec sheet does not seem to list a
stacking option.
What about power?
Keep it simple, stupid simple, without many moving parts, and avoid
automagic voodoo wherever possible.
Online replication built into ZFS would be awesome.
This has been a long discussion so I’m not even sure where the right place to jump in is, but just speaking as a storage vendor (FreeNAS) I’ll say that we’ve considered HAST many times but also rejected it many times for multiple reasons:
1. Blocks which ZFS finds to be corrupt (they fail checksum) get replicated by HAST nonetheless, since HAST has no idea - it sits below that layer. This means that both good data and corrupt data are replicated to the other pool. That isn't a fatal flaw, but it's a lot nicer to be replicating only *good* data at a higher layer.
2. When HAST systems go split-brain, it's apparently hilarious. I don't have any experience with that in production, so I can't speak authoritatively about it, but the split-brain scenario has been mentioned by some of the folks working on clustered filesystems (glusterfs, ceph, etc), and I can easily imagine how it might cause hilarity: ZFS has no idea its underlying block store is being replicated, and it likes to commit changes in terms of transactions (TXGs), not just individual block writes. Writing a partial TXG (or potentially multiple outstanding TXGs with varying degrees of completion) would Be Bad.
3. HAST only works on a pair of machines with a MASTER/SLAVE relationship, which is pretty ghetto by today's standards. HDFS (Hadoop's filesystem) can do block replication across multiple nodes, as can DRBD (Distributed Replicated Block Device), so chasing HAST seems pretty retro and will immediately set you up for embarrassment when the inevitable question comes up: "OK, that pair of nodes is fine, but I'd like them both to be active, and I'd also like to add a 3rd node in this one scenario where I want even more fault-tolerance - other folks can do that, how about you?"
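For context, a HAST deployment really is a strict two-node primary/secondary
pair; a minimal sketch (resource name, device paths and addresses are made up):

    # /etc/hast.conf, identical on both nodes
    resource tank0 {
            on nodeA {
                    local /dev/da0
                    remote 10.0.0.2
            }
            on nodeB {
                    local /dev/da0
                    remote 10.0.0.1
            }
    }

    # Then, on each node:
    hastctl create tank0                # initialize HAST metadata
    service hastd onestart              # start the HAST daemon
    # And on the active node only:
    hastctl role primary tank0
    zpool create tank /dev/hast/tank0   # ZFS sits on top of the HAST provider

There is simply no syntax for a third node, which is exactly the limitation above.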
In short, the whole thing sounds kind of MEH, and that's why we've avoided putting any real time or energy into HAST. DRBD sounds much more interesting, though of course it's Linux-only. That wouldn't stop someone else from implementing a similar scheme in a clean-room fashion, of course.
And yes, of course one can layer additional things on top of iSCSI LUNs, just as one can punch through LUNs from older SAN fabrics and put ZFS pools on top of them (been there, done both of those things). The additional indirection has performance and debugging ramifications of its own, though: when a pool goes sideways, you have additional things in the failure chain to debug. ZFS really likes to "own the disks", both for providing block-level fault tolerance and for predictable performance characteristics given specific vdev topologies, and once you start abstracting the disks away from it, making statements about predicted IOPs for the pool becomes something of a "???" exercise.
- Jordan
Would you say that giving an iSCSI disk to ZFS hides some details of the raw disk from ZFS?
I thought that iSCSI would be a totally "transparent" layer, transferring all ZFS requests to the raw disk, giving back the answers, hiding nothing.
As you have experience with iSCSI, any sad stories with iSCSI disks given to ZFS?
Many thanks for your long feedback Jordan!
From #openzfs:
A: typically iSCSI disks still appear as "physical" disks to the OS connecting to them. You can even get iSCSI servers that allow things like SMART pass-thru.
Q: so ZFS will be as happy with iSCSI disks as if it used local disks? Or will it miss something?
A: no, and ZFS isn't "unhappy" per se. But there are optimizations it applies when it knows the disks belong to ZFS only.
Q: and using iSCSI disks, ZFS will not apply these optimizations (even if these iSCSI disks are only given to ZFS)? I.e., will ZFS know these iSCSI disks belong to ZFS only?
A: if it looks like a physical disk, if it quacks like a physical disk...
Nope. You will suffer the performance implications of layering a filesystem that expects "rotating media or SSDs" (with the innate ability to parallelize multiple requests in a way that ADDS performance) on top of a system which is now serializing the requests across an internet connection to another software layer, one which may offer no performance benefit to having multiple LUNs at all. You can try iSCSI-specific tricks like MPIO to increase performance, but ZFS itself is just going to treat everything it sees as "a disk", so physical concepts like mirrors or multiple vdevs for performance won't translate across.
Example question: What’s the point of writing multiple copies of data across virtual disks in a mirror configuration if the underlying storage for the virtual disks is already redundant and the I/Os to it serialize?
Example answer: There is no point. In fact, it's a pessimization to do so.
This is not a lot different than running ZFS on top of RAID controllers that turn N physical disks into 1 or more virtual disks. You have to make entirely different performance decisions based on such scenarios and that’s just the way it is, which is also why we don’t recommend doing that.
- Jordan
Of course Jordan, in this topic we (well, at least me :) make the following
assumption:
one iSCSI target/disk = one real physical disk (a SAS disk, an SSD...), from
a server having its own JBOD, no RAID adapter or whatever - just what ZFS
likes!
> This is not a lot different than running ZFS on top of RAID controllers that turn N physical disks into 1 or more virtual disks. You have to make entirely different performance decisions based on such scenarios and that’s just the way it is, which is also why we don’t recommend doing that.
Of course you lose all the ZFS benefits if you only mirror 2 "disks": a big
one from storage array A and the same from storage array B.
There's no point in that.
I certainly wouldn't make that assumption. Once you allow iSCSI to be the back-end in any solution, end-users will avail themselves of the flexibility to also export arbitrary or synthetic devices (like zvols / RAID devices) as "disks". You can't stop them from doing so, so you might as well incorporate that scenario into your design. Even if you could somehow enforce a 1:1 mapping of LUN to disk, iSCSI itself still imposes a serialization / performance / reporting penalty (iSCSI LUNs don't report SMART status) that removes a lot of the advantages of having direct physical access to the media, so one might also ask what you're gaining by imposing those restrictions.
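To see why the 1:1 mapping is unenforceable, consider a sketch of a FreeBSD
/etc/ctl.conf on the target side (the target names and paths are made up):
the initiator has no way to tell which LUN is a real disk and which is synthetic.

    # /etc/ctl.conf -- one target backed by a raw disk, one by a zvol
    portal-group pg0 {
            discovery-auth-group no-authentication
            listen 0.0.0.0
    }
    target iqn.2016-07.com.example:rawdisk {
            auth-group no-authentication
            portal-group pg0
            lun 0 {
                    path /dev/da2              # a real physical disk
            }
    }
    target iqn.2016-07.com.example:zvol {
            auth-group no-authentication
            portal-group pg0
            lun 0 {
                    path /dev/zvol/tank/vol0   # a synthetic device
            }
    }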
- Jordan
Sure, I get that part also, but let’s put the entire conversation into context:
1. You’re looking for a solution to provide some redundant storage in a very specific scenario.
2. We’re talking on a public mailing list with a bunch of folks, so the conversation is also naturally going to go from the specific to the general - e.g. “Is there anything of broader applicability to be learned / used here?” I’m speaking more to the larger audience who is probably wondering if there’s a more general solution here using the same “moving parts”.
To get specific again, I am not sure I would do what you are contemplating given your circumstances, since it's not the cheapest / simplest solution. The cheapest / simplest solution would be to create 2 small ZFS servers and simply do zfs snapshot replication between them at periodic intervals, so you have a backup copy of the data for maximum safety as well as a physically separate server in case one goes down hard. Disk storage is the cheap part now, particularly if you have data redundancy and can therefore use inexpensive disks, and ZFS replication is certainly "good enough" for disaster recovery. As others have said, adding additional layers will only increase the overall fragility of the solution, and "fragile" is about the last thing you need when you're frantically trying to deal with a server that has gone down for what could be any number of reasons.
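An untested sketch of that periodic replication, assuming a dataset named
tank/data and a standby host reachable as "standby" (both names are
placeholders); run it from cron as often as you like, which is also how
zfs send/receive gets "nearly realtime":

    #!/bin/sh
    # Snapshot tank/data and send the delta since the previous snapshot.
    DS=tank/data
    NEW="repl-$(date +%Y%m%d%H%M%S)"
    PREV=$(zfs list -H -d 1 -t snapshot -o name -s creation "$DS" | tail -1)
    zfs snapshot -r "$DS@$NEW"
    if [ -n "$PREV" ]; then
        # Incremental stream since the last replication snapshot
        zfs send -R -i "$PREV" "$DS@$NEW" | ssh standby zfs receive -F "$DS"
    else
        # First run: full stream
        zfs send -R "$DS@$NEW" | ssh standby zfs receive -F "$DS"
    fi

(Old snapshots would still need pruning on both sides.)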
I, for example, use a pair of FreeNAS Minis at home to store all my media and they work fine at minimal cost. I use one as the primary server that talks to all of the VMWare / Plex / iTunes server applications (and serves as a backup device for all my iDevices), and it replicates the entire pool to a secondary server that can be pushed into service as the primary if the first one loses a power supply / catches fire / loses more than 1 drive at a time / etc. Since I have a backup, I can also just use RAIDZ1 for the 4x4TB drive configuration on the primary and get a good storage / redundancy ratio (I can lose a single drive without data loss but am also not wasting a lot of storage on parity).
Just my two cents. There are a lot of different ways to do this, and like all things involving computers (especially PCs), the simplest way is usually the best.
How about 3-way ZFS mirrors spread over three SAS JBODs with dual-ported
expanders, connected to two FreeBSD servers with SAS HBAs, and a
*reliable* arbiter for the disks? The arbiter could be an external
locking server (e.g. consul/etcd/zookeeper) and/or SCSI reservations. If
more than two head servers are to share the disks, a pair of SAS switches
should do the job.
If N-1 disk redundancy is enough, two JBODs and 2-way mirrors would work
as well.
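The pool layout for the three-JBOD case would look something like this
(device names are made up; one member of each mirror sits in each JBOD):

    # Any single JBOD can die without taking the pool down:
    zpool create tank \
        mirror da0 da4 da8 \
        mirror da1 da5 da9 \
        mirror da2 da6 da10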
While you can't prevent stupid operators from blowing their feet off, it
doesn't offer the same "flexibility" as iSCSI, if only because you can't
conveniently hook up everything that talks Ethernet and offers itself as an
iSCSI target. That is, until someone implements a SAS target with CTL and
a suitable HBA in FreeBSD ;-).
This kind of setup should also preserve all assumptions ZFS has
regarding disks.
I have the required spare hardware to build a two-JBOD test setup [1]
and could run some tests if anyone is interested in such a setup.
[1]: Test setup
+-----------+ +-----------+
| MASTER | | SLAVE |
| | | |
| HBA0 HBA1 | | HBA0 HBA1 |
+--+----+---+ +--+----+---+
^ ^ ^ ^
| | | |
| | | +------+
| | | |
| | +----+ |
| | | |
| +-----------+ | |
| | | |
v v v |
+--+--------+ +--+----+---+ |
| JBOD 0 | | JBOD 1 | |
+-------+---+ +-----------+ |
^ |
| |
+-----------------------+
It would be nice if it could work without a third server, so one important / interesting thing to test would be the SCSI reservations: be sure that when the pool is imported on MASTER, SLAVE can't use the disks anymore.
(This is the case with iSCSI: when SLAVE exports its disks through CTL, it can't import them with ZFS, as CTL locks them as soon as it is started.)
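If the disks and HBAs support SCSI-3 persistent reservations, a test could
look like this untested sketch (assuming FreeBSD 11's "camcontrol persist"
subcommand and a shared disk visible as da0 on both heads):

    # On MASTER: register a key, then take a Write Exclusive reservation
    camcontrol persist da0 -o register -K 0xdead0001
    camcontrol persist da0 -o reserve -k 0xdead0001 -T wr_ex
    # On SLAVE: the reservation is now visible, and writes (e.g. a stray
    # zpool import) should fail with a reservation conflict
    camcontrol persist da0 -i read_reservation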
> If N-1 disk redundancy is enough, two JBODs and 2-way mirrors would work as well.
Or if we only have 2 JBODs (for whatever reason), we could (certainly should :) use 4-way mirrors, so that if one JBOD dies we can still be confident in the pool.
> While you can't prevent stupid operators from blowing their feet off, it doesn't offer the same "flexibility" as iSCSI, if only because you can't conveniently hook up everything that talks Ethernet and offers itself as an iSCSI target. That is, until someone implements a SAS target with CTL and a suitable HBA in FreeBSD ;-).
Why would you prefer a SAS target over an iSCSI target?
How would it fit?
> This kind of setup should also preserve all assumptions ZFS has regarding disks.
Yep, although AFAIR no one has demonstrated that ZFS suffers from iSCSI :) (the devs on #openzfs stated it does not)
Anyway, this is a nice SAS-only setup which avoids an additional protocol - a very good reason to go with it.
One good reason for iSCSI is that it allows the servers to be in different racks (well, there are long SAS cables) / different rooms / buildings.
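For reference, attaching such a remote disk on the FreeBSD initiator side is
a one-liner (portal address and target name are made up, reusing the ctl.conf
sketch above):

    # The LUN shows up as an ordinary da(4) disk afterwards
    iscsictl -A -p 10.0.0.10 -t iqn.2016-07.com.example:rawdisk
    camcontrol devlist                 # e.g. the new LUN appears as da5
    zpool create tank mirror da5 da6   # ZFS treats it like a local disk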