Two ESOS Heads sharing the same SAS shelves, NVMe shelves


ace402

Feb 2, 2021, 10:23:07 AM
to esos-users
Hi all,

I'm very new to storage and am planning out a SAN. I'm very interested in a configuration using ESOS, but I'm not sure whether what I want to do makes sense, or is even possible.

I would like to have 2 ESOS nodes/heads, each in their own machine, one as master and one as slave. In the event that one needs to be maintained or rebooted, I want to be able to switch over to the other to maintain availability to the same set of LUNs created from the shared shelves.

I believe the answer to this is to ensure that each shelf I buy has two host connections. For example, if I buy a SAS shelf for holding mechanical drives, it should have two host SAS connections, one connecting to each ESOS box.

However, even though I can see from several conversations around here that multiple ESOS nodes/heads are a possible configuration, what is the standard way to connect these nodes/heads? Does anyone know of a guide showing the hardware required to successfully set up two ESOS boxes as dual heads to the same shared storage?

Additionally, I am wondering if anyone has set up ESOS connected to a SAS shelf and an NVMe shelf at the same time. Do you use FC or RoCE to connect to the NVMe shelf? Does the software handle it nicely?

Thanks for your time

Andrei Wasylyk

Feb 2, 2021, 11:24:11 AM
to esos-...@googlegroups.com
Hi ace,

I'm currently discussing this type of setup with Heiss, see our thread a few threads down.

Note ESPECIALLY the part about this being not for the faint of heart. And if you are new, be prepared to spend a few months prototyping, researching, and testing.

Quick answers to some Qs:
Your desired setup is absolutely possible to achieve. You want two servers connected simultaneously to one or more SAS shelves such that they both see all drives. This requires:
-SAS HBAs in each server
-a shelf (JBOD is the more commonly accepted term) that has "dual expanders"
-no SATA drives

Each SAS drive has two PHYs in its connector, and dual-expander JBODs create two parallel physical paths starting from the host connections down through the backplane and out the downstream SAS ports. "JBOD with two host connections" is not really specific enough: I can imagine a world where someone creates a 60-bay JBOD with two host connections, one connecting the first 30 drives and the other connecting the second 30 drives. That would not achieve what you want - taking that example, you would want two host ports connecting to all of the first 30 drives, and two host ports connecting to all of the second set of 30 drives. On smaller JBODs (say 24-bay) I have never seen a two-host setup split into two separate 12-drive connections, so it's safe to say a 24-bay JBOD with two host connections is probably what you want - but that doesn't mean the split kind doesn't exist. The common terms to look for in the product description are "dual expander", "HA capable", or "supports multipath". If it's still unclear, you'd have to look through the application guide for your chosen unit, where the internal connection routing is visible and listed.
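Once it's cabled up, it's easy to sanity-check from the OS that you really do have two paths to every drive. A rough sketch (assuming a Linux host with sg3_utils and multipath-tools installed; device names are made up):

    # Each dual-ported drive should show up twice, once per HBA/path:
    lsscsi

    # Confirm that two device nodes are the same physical drive by
    # comparing the WWN from the device identification VPD page:
    sg_vpd --page=di /dev/sdb
    sg_vpd --page=di /dev/sdq   # same WWN as /dev/sdb => one drive, two paths

    # With dm-multipath running, both paths collapse into one mapper device:
    multipath -ll               # expect two active paths per drive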

NVMe is tricky. The industry is still trying to converge on a solution, and there's so much movement it's hard to figure out what is considered a standard deployment. I'll give you an example: I have tri-mode HBAs from LSI, but they use special external connectors that aren't quite to spec (they are, but they aren't). As far as I can tell the first-gen Supermicro NVMe JBOF should work, but I can't get any guarantee, nor can I get the LSI firmware supporting NVMe operation on my card. False advertising, but, whatever.

IMO, if you are sticking to the same SAN principles we use for SAS and want to DIY a shared NVMe JBOF connected to two controllers (servers), you are left with a handful of confusing options. On a small scale (because big-time SANs employ either a scale-out design or a high-end unified fabric between controllers and NVMe drives), what you want is some kind of magic card you put into your server that speaks PCIe, connected via an external cable (1 m I've seen, maybe 2 m) to a JBOF that has two PCIe switches, each with its own dedicated path to each NVMe drive in the enclosure, with each drive supporting dual-port operation. All these things exist; finding them is hard, and finding a supplier who will sell them to you is even harder.

RoCE and all that can happen on the front end and will net you great results. On the back end, the only non-proprietary, simple solution that would fit a small-scale DIY build that I know of is pure PCIe over copper.
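Just to show the front end really is the easy part: independent of ESOS (whose target stack is SCST), the generic kernel nvmet plumbing for exporting an NVMe namespace over RoCE is only a handful of configfs writes. A sketch, with made-up names and addresses:

    modprobe nvmet nvmet-rdma
    cd /sys/kernel/config/nvmet

    # create a subsystem and put one namespace in it
    mkdir subsystems/nqn.2021-02.test:jbof0
    echo 1 > subsystems/nqn.2021-02.test:jbof0/attr_allow_any_host
    mkdir subsystems/nqn.2021-02.test:jbof0/namespaces/1
    echo -n /dev/nvme0n1 > subsystems/nqn.2021-02.test:jbof0/namespaces/1/device_path
    echo 1 > subsystems/nqn.2021-02.test:jbof0/namespaces/1/enable

    # expose it on an RDMA (RoCE) port
    mkdir ports/1
    echo rdma     > ports/1/addr_trtype
    echo ipv4     > ports/1/addr_adrfam
    echo 10.0.0.1 > ports/1/addr_traddr
    echo 4420     > ports/1/addr_trsvcid
    ln -s /sys/kernel/config/nvmet/subsystems/nqn.2021-02.test:jbof0 \
        /sys/kernel/config/nvmet/ports/1/subsystems/nqn.2021-02.test:jbof0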

After my disappointment with my very expensive HBAs I got upset and gave up, but I'd be willing to open up that book again and see what's out there.

ESOS supports this as long as your HBA is supported. I know the first time I booted up ESOS the mpt3sas driver was too old for my card, but a month or two later Marc updated the kernel and I was golden.

Andrei


ace402

Feb 2, 2021, 9:54:17 PM
to esos-users
Hey Andrei, thank you so much for your comments. All these details are incredibly helpful to me, as I've found it particularly difficult to find a variety of guides/tutorials on ESOS and SAN design in general.

I did see your conversation with Heiss - I must have misunderstood, but I thought the complication in what he was trying to achieve was a dual master/master head setup with a shared JBOD bay. I hoped that because I am looking to do master/slave, what I want is a more standard usage of ESOS. Now, I am willing and able to put in the time, research, and testing to find out how to accomplish this project with proper care, but at the same time I do not want my setup to become an "edge case". I have made no particular decisions with respect to my project and I have a strong preference to use each component in a more standard/expected way. I'm wondering: is having multiple instances of ESOS sharing the same disks not a "mainstream" intended usage for ESOS? The reason I wanted this is that I considered the ESOS boxes might have RAID controllers, so there is a chance that even before ESOS touches the disks, they will already have a redundancy mechanism protecting them. Further, I have seen that ESOS includes mdadm, which I am also interested in as a possible software RAID solution. I also noticed that ZFS is mentioned on the ESOS project home page - I'm not sure if this is possible, but I think it would be nice to be able to set up LUNs using a ZFS pool.

So, with all that in mind, I thought the idea of having redundant ESOS instances, each with its own mirror copy of a whole JBOD bay, was redundancy overkill, at least for my preference. I would rather use the extra JBOD bay as additional space. If I want to add new disks for a new LUN, I don't want to have to, for example, add a pair of disks as a ZFS mirrored vdev to the first ESOS JBOD and then add another pair of disks as a ZFS mirrored vdev to the second ESOS JBOD, just to get 1/4 of the space, even if such a setup is ultra-redundant. However, I would love to hear any comments or caveats you might have about this higher-level design decision.

Understood that I would be looking for a dual-expander JBOD bay; thank you for this knowledge. I noticed you said "no SATA". Is it that all SAS disks have two PHYs in their connectors by common design, and SATA disks do not? I am not quite at the phase yet where I am picking out the hardware, but I will be careful to pick something such that each of the two host connections goes downstream to every single drive, through what I believe would be mini-SAS cables or a "SAS loop".

Another thing is still not clear to me. Assuming I get the appropriate JBOD bays and set up my master/slave ESOS heads correctly, I would expect to configure LUNs on the master instance and have those configurations replicated to the slave OS. Then, if I wanted to take down the master, the slave could instantly take over. If this all makes sense, through what hardware or existing connection does the LUN configuration from the master head get communicated to the slave head? Before this consideration, in my mind the two ESOS boxes are connected in three ways: via the SAN fabric, the JBOD bay connections, and some Ethernet management network.

Regarding NVMe: since I am not planning on actually building this project for a few months, perhaps things will become clearer by then. The reason I'm interested in NVMe is that I want to use SSDs for certain LUNs, and I have read that SCSI does not take full advantage of the potential of NVMe flash. Do you by any chance know how to calculate the point at which an SSD becomes bottlenecked by SCSI, or, if it affects all of them, how to calculate the performance loss from using them over SCSI?


Thanks again for all your help

Andrei Wasylyk

Feb 3, 2021, 12:04:29 AM
to esos-...@googlegroups.com
Ha HA, good questions. I'm a bit of a scatterbrain, so I will try to keep this structured, as well as give some more pointed general feedback based on what you've told me. To wit:


Now, I am willing and able to put in the time, research, and testing to find out how to accomplish this project with proper care, but at the same time I do not want my setup to become an "edge case".

This is a tough one to answer, for many reasons. I think if we did a quick survey of all ESOS users, you would find that hardly any are using the HA capability. I can tell you, from rummaging through the internet and manually walking through the failover mechanisms employed by vendors like iXsystems (TrueNAS), that Marc has done a wonderful job of designing this functionality. Marc leverages Pacemaker, coupled with his phenomenally maintained ESOS OCF agents, to achieve this. If you walk your way through the scripts (OCF agents are highly structured scripts) and then compare his workflow to what you'll find in TrueNAS's failover.py and the accompanying middlewared libraries, you'll find that they are very similar in function (though fundamentally different in implementation).
In summary, the HA functionality is mature and Marc's work is sound. But you absolutely are an edge case.

I hoped that because I am looking to do master/slave, what I want is a more standard usage of ESOS.

I would say that master/slave is the agreed-upon, fundamentally safe way to handle HA targets for block storage. There are many untested strategies for doing active/active on a single LUN, with limited benefits and huge caveats. I have rolled this myself with success, but given the caveats I didn't keep it for long.

I have made no particular decisions with respect to my project and I have a strong preference to use each component in a more standard/expected way.

Good. This should be your exploratory phase. I have this preference as well. But to my knowledge this is not the standard way to use storage at all. IMO, shared SAS on commodity hardware with open-source frameworks is for one thing: a clustered filesystem.
But let's not lose sight of what we are trying to achieve here; we have a need to satisfy: we hope to build an HA SAN using two controllers (servers) running open-source software. There's a reason not many projects like this exist - they are fucking hard to do - which is why the market is primarily served by proprietary products with very expensive price tags on them.
If you want something easier, more standard, and better organized, I'd highly recommend ESOS Commander licensing. It is the only way to get a "standard" HA ESOS deployment without all the fuss.

I considered the ESOS boxes might have RAID controllers, so there is a chance that even before ESOS touches the disks, they will already have a redundancy mechanism protecting them. Further, I have seen that ESOS includes mdadm, which I am also interested in as a possible software RAID solution. I also noticed that ZFS is mentioned on the ESOS project home page - I'm not sure if this is possible, but I think it would be nice to be able to set up LUNs using a ZFS pool.

RAID cards.... I don't see how they could possibly work in an HA deployment unless they were designed to do so - LSI Syncro (now defunct) had this - but the crux of the matter is: how can you possibly have a RAID card coordinate a parity array without stepping on the opposite card's toes? If you can answer that, please let me know.
I use mdadm with LVM on top; it's lightweight in comparison to ZFS in terms of stopping and starting an array, but they are functionally similar. I recently deployed ZFS at home and I am so impressed by it that I would love to experiment there - I just can't anymore, since I'm already locked into my mdadm/LVM setup.
I'd love it if someone conclusively benchmarked both and posted the results.
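If anyone does, the same fio job pointed at an LV and then at a zvol would be a fair start. A sketch (paths and sizes made up; careful, this writes to the device):

    fio --name=randrw --filename=/dev/vg0/benchlv --ioengine=libaio \
        --direct=1 --rw=randrw --bs=4k --iodepth=32 --numjobs=4 \
        --runtime=60 --time_based --group_reporting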

In either case, coming back to our goals here and how to satisfy them... with a few exceptions, you are trying to manage the sharing of storage types that are fundamentally unshareable. In all my research the limiting factor seems to be parity - neither mdadm nor ZFS can support a shared parity array. Is it a fundamental incompatibility? Who knows; maybe.

mdadm does support shared RAID10 arrays, and on top of that you can layer lvmlockd VGs, which allow either exclusively activated LVs (with the majority of LVM features intact) or shared activated LVs (with almost no features intact). What lvmlockd does get you is the ability to take your md RAID10 array, chop it into whatever LVs you want, and have each server activate LVs exclusively as it wishes.
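A minimal sketch of that stack, with made-up names (assumes lvmlockd and its lock manager, sanlock or dlm, are running and use_lvmlockd=1 is set in lvm.conf; single-active md here, since clustered md with --bitmap=clustered is its own adventure):

    # RAID10 across four shared drives
    mdadm --create /dev/md0 --level=10 --raid-devices=4 \
        /dev/sdb /dev/sdc /dev/sdd /dev/sde

    # shared VG managed by lvmlockd
    vgcreate --shared vg0 /dev/md0
    vgchange --lock-start vg0

    # slice out an LV and activate it exclusively on this node (-aey)
    lvcreate -L 500G -n lun0 vg0
    lvchange -aey vg0/lun0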

md RAID5, well, there you can only activate the array on one server at a time, and by extension the VG you create on that array can only have LVs activated on the server where the array is currently active. This is the same limitation present in ZFS (except that for ZFS it is ALWAYS the case, no matter what vdevs you use in the pool). I am in this case equating a VG to a zpool.

So, if you want a simple setup, make one (or multiple) md arrays out of your disks, create a VG on each array, and slice out LVs as you see fit! In your cluster config you would basically have:
-start md array
-activate VG (on the same node where the array is active)
-activate LVs (on the same node where the VG is active, unless you are auto-activating all LVs when the VG activates)
During a failover you work backwards:
-deactivate lvs
-deactivate vg
-stop md array
-start md array on new server
-activate vg on new server
-activate lvs
THIS entire process, under most circumstances, happens fast enough that clients connected to the targets don't freak out.
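Translated into a cluster config, it's roughly this (crm shell syntax, using the stock heartbeat agents as stand-ins; ESOS ships Marc's own agents, which is what you'd actually use, and all names here are made up):

    primitive p_md0 ocf:heartbeat:Raid1 \
        params raidconf="/etc/mdadm.conf" raiddev="/dev/md0"
    primitive p_vg0 ocf:heartbeat:LVM-activate \
        params vgname="vg0" vg_access_mode="lvmlockd" activation_mode="exclusive"
    # a group starts members in order and stops them in reverse,
    # which gives you exactly the failover sequence above
    group g_storage p_md0 p_vg0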

But hell, maybe instead of doing this, cut half your disks into one array and the other half into a second, and repeat the VG steps for each. In the cluster, make server1 active for array1 and standby for array2, and server2 active for array2 and standby for array1.
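In Pacemaker terms that's just two groups with opposite location preferences (again crm syntax, made-up names):

    location l_array1 g_array1 100: server1
    location l_array2 g_array2 100: server2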

I really recommend looking at how quickly the ZFS OCF agent can fail a pool over, though; I'm kind of starting to fall in love with it. It simplifies things in some ways, because all you have to do to fail over is export the pool and import the pool.
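The whole failover really is just (pool name made up):

    zpool export tank   # on the node giving the pool up
    zpool import tank   # on the node taking over; add -f if the old node died uncleanly

and the ZFS agent shipped in the resource-agents package is essentially a wrapper around those two calls, so in crm it's a one-liner:

    primitive p_tank ocf:heartbeat:ZFS params pool="tank"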

So, with all that in mind, I thought the idea of having redundant ESOS instances, each with its own mirror copy of a whole JBOD bay, was redundancy overkill, at least for my preference. I would rather use the extra JBOD bay as additional space. If I want to add new disks for a new LUN, I don't want to have to, for example, add a pair of disks as a ZFS mirrored vdev to the first ESOS JBOD and then add another pair of disks as a ZFS mirrored vdev to the second ESOS JBOD, just to get 1/4 of the space, even if such a setup is ultra-redundant. However, I would love to hear any comments or caveats you might have about this higher-level design decision.

Ick, in my opinion - no replicating and copying BS. Your servers already have access through the SAS fabric. 1/4 of the space sucks and doesn't really solve any usability problems.

I noticed you said "no SATA". Is it that all SAS disks have two PHYs in their connectors by common design, and SATA disks do not?

Short answer, yes.


Another thing is still not clear to me. Assuming I get the appropriate JBOD bays and set up my master/slave ESOS heads correctly, I would expect to configure LUNs on the master instance and have those configurations replicated to the slave OS. Then, if I wanted to take down the master, the slave could instantly take over. If this all makes sense, through what hardware or existing connection does the LUN configuration from the master head get communicated to the slave head? Before this consideration, in my mind the two ESOS boxes are connected in three ways: via the SAN fabric, the JBOD bay connections, and some Ethernet management network.


Herein lies the problem. No, the target/LUN mappings are not synced; you have to configure it all manually on each server. There is a silver lining to this, though. The initial configuration is tricky, because you need complementary but not completely identical configurations on each host. This means you can't simply sync the .conf and expect it to work. Once the initial config is done, however, and your ALUA groups are configured correctly, adding additional LUNs is easy. I won't get into the details of how and why it is; you'll just have to take my word for it.
Part of what makes it easy is, again, Marc's incredible work on his OCF agents... All state information is handled by the clustering software. In your conf, all devices start inactive and all ALUA target groups start offline - this is your base .conf. When Pacemaker starts up with a properly configured OCF agent, IT will handle activating LUNs and marking target ports active or standby for you, based on the cluster configuration and the current state of the cluster.
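To make that concrete, the ALUA portion of a base scst.conf looks roughly like this (a fragment only, with made-up names; the real file also contains handlers, devices, targets, and LUN assignments, and Pacemaker flips the states at runtime via scstadmin):

    DEVICE_GROUP esos {
            DEVICE lun0
            TARGET_GROUP head1 {
                    group_id 1
                    state offline   # base config: everything starts offline
                    TARGET iqn.2021-02.local.esos:head1
            }
            TARGET_GROUP head2 {
                    group_id 2
                    state offline
                    TARGET iqn.2021-02.local.esos:head2
            }
    }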

If this hasn't frightened you, then we can discuss more. Before we proceed, I suggest that if you have VMware or Hyper-V, you create two ESOS VMs with 4 or 5 shared 10 GB virtual disks (in addition to the OS disk). This will let you get a better feel for how these things work and whether or not it's a solution you are ready to fuck around with.
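If you're on Linux instead, plain QEMU/KVM can fake the shared JBOD too - share-rw=on lets two guests open the same disk image read-write. A sketch under those assumptions (not VMware/Hyper-V specific advice; repeat per VM and per shared disk):

    qemu-img create -f raw /vmstore/shared0.img 10G

    # add to each ESOS VM's command line ("..." = your usual VM flags):
    qemu-system-x86_64 ... \
        -device virtio-scsi-pci,id=scsi0 \
        -drive file=/vmstore/shared0.img,if=none,id=sh0,format=raw,cache=none,file.locking=off \
        -device scsi-hd,bus=scsi0.0,drive=sh0,share-rw=on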
Otherwise, there's always ESOS Commander, which automates and GUI-ifies all this insane madness for you.

Andrei

Andrei Wasylyk

Feb 3, 2021, 12:15:02 AM
to esos-...@googlegroups.com
Whoops - looks like ESOS Commander was discontinued. Sorry.