ESOS setup - would this setup work?


W. Heiss

Feb 1, 2021, 2:37:30 AM
to esos-users
Hi,
I am planning a new storage.
My plan looks like this:
2x DS4243 shelves w/ 24 disks each
       4 QSFP cables, cross-connecting each server to each shelf
2x servers with LSI 9300, 8G FC QLE2564, maybe SSD as cache?
     4x FC links
2x Brocade 300 switches

Behind that, some machines with other FC controllers, each with 2 connections to one Brocade.

The questions I am unsure of:
* Is ESOS able to use the shelves (which expose JBOD, AFAIK) together, i.e. shared? I'd like active/active, not active/passive.
* Would an SSD cache on the "controllers" have any benefit?
* I have some ESX hosts behind this, so I'll try multipathing. The Linux servers behind it will use lvmlockd.
Is my expectation right that this would be an HA setup?

Thanks in advance!

Andrei Wasylyk

Feb 1, 2021, 8:35:42 AM
to esos-...@googlegroups.com
Yes - but it's not for the faint of heart.

What is your experience with pacemaker and crmsh?

Active/active with vdisk_blockio exposing an lvmlockd LV in shared access mode on both servers works. That is the only scenario I have tested, and I haven't used it for long. vdisk_fileio sitting on a GFS2 filesystem could be another option, but there are considerations there; IMO I don't know enough about GFS2 to trust it.

In a pure LVM setup, shared lvmlockd LVs have many caveats: no snapshots, no caches, no splits, no thin pools... Basically all you CAN do is expand an LV, and even that doesn't actually work in shared mode - it quickly switches the LV to exclusive and back again. I'm sure SCST will just love having a handler's storage disappear for a couple of seconds and then reappear, except larger.
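For reference, a minimal sketch of the shared-lvmlockd workflow described above, assuming sanlock as the lock manager; the VG/PV/LV names are made up:

```shell
# Run the lock-manager steps on both servers.
systemctl start lvmlockd sanlock          # lock managers must run on each node
vgcreate --shared vg_shared /dev/mapper/mpatha
vgchange --lockstart vg_shared            # join the VG lockspace (every node)

lvcreate -L 1T -n lun0 vg_shared
lvchange -asy vg_shared/lun0              # "-asy" = shared activation, on BOTH nodes

# About the only online change allowed - and as noted, LVM briefly flips
# the lock to exclusive under the hood to do it:
lvextend -L +100G vg_shared/lun0
```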

In the end, active/active is a lot more trouble than it's worth, it seems, and it always made me nervous.

The best-case scenario IMO is active/active but with separate LUNs. Let's say you have 10 TB of storage: make two 5 TB LUNs and prefer each to a different controller. If you reboot one controller (server), its LUN moves to the other, and back again afterwards.
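The "prefer each LUN to a different controller" idea can be sketched in crmsh with soft location scores; all resource, VG, and node names below are invented, and a real stack needs more agents (SCST targets, ALUA state) than shown:

```shell
# One LVM-activate resource per LUN's VG (hypothetical names throughout).
crm configure primitive p_lun0 ocf:heartbeat:LVM-activate \
    params vgname=vg_lun0 vg_access_mode=system_id activation_mode=exclusive
crm configure primitive p_lun1 ocf:heartbeat:LVM-activate \
    params vgname=vg_lun1 vg_access_mode=system_id activation_mode=exclusive

# Soft preferences: score 100, not INFINITY, so each LUN can still fail over
# to the other head, and (with default stickiness) migrates back afterwards.
crm configure location loc_lun0 p_lun0 100: esos-a
crm configure location loc_lun1 p_lun1 100: esos-b
```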

This is what I use in prod and it works very well.

The docs for HA needed updating last I checked, and to be honest there are so many situations to account for that I'm not sure how I would personally update them without confusing the absolute hell out of people. They are a good starting point, maybe.

But the TUI becomes less useful; I manage it all by hand using Marc's OCF resource agents.

Again, not for the faint of heart. Depending on your expertise, you are looking at months of experimentation and research. I happen to have a job where I can afford to take my time with things and spend months doing - I guess - professional development. Most don't.

Andrei

To view this discussion on the web visit https://groups.google.com/d/msgid/esos-users/9ea25396-8765-431b-9e2d-443106bf2b62n%40googlegroups.com.

W. Heiss

Feb 1, 2021, 10:48:15 AM
to esos-users
In this case it's my own small company, so I'm the one deciding the timeline ;))) - I am just antsy to get that stuff up and running.
Normally I use keepalived, not pacemaker, so there is a learning curve (obviously).

OCFS/GFS scare me a bit... also, they don't seem to be supported in ESOS.

Thank you for your feedback.
I'll change to:
Some kind of RAID on the shelves (ZFS's RAID-6, i.e. raidz2?). As the shelves are redundant, I assume there will not be a shelf failure.

The question I don't know yet: would you treat one ESOS node plus one disk shelf as one unit, or would you "merge" the shelves?
If I merge them, that'd mean no DRBD, which is nicer, and double the capacity, which is also nice - but a bunch of complexity I am not sure I can handle...
Also, GFS/OCFS do not seem to be part of ESOS, and I don't really know lxfs (it is xfs, but l?).
I don't see any ZFS in the TUI, either...
Is the TUI really that limited, or is it me ;)?

For any clustering, 3 nodes seem to be the minimum to prevent split-brain...

Hm...
I think I'll go the easy route:
Create a 24-disk RAID6 (raidz2) zpool, use DRBD to sync between the 2 nodes (over a 10G link), and create 2 image files on ZFS - one for VMware, one for Linux.
VMware doesn't care anyway, since VMFS is cluster-aware, and for Linux I'll LVM the hell out of that LUN with lvmlockd on the clients.

That'd be hot/cold, then.
How does that sound?
Do you think handover between the nodes is more or less seamless?
PS: do I need to run DRBD on each disk individually, or does ZFS create a pseudo-device?

Sorry for all my questions, but I have read so much that I am more confused than before :/

Andrei Wasylyk

Feb 1, 2021, 12:32:24 PM
to esos-...@googlegroups.com
Oh, hold on - I may have misunderstood what you were doing. Let me look at your shelves; I know you mentioned QSFP cross-connections, but I assumed this was functionally similar to SAS with dual expanders.

Sidenote: I spent a lot of time looking for dual-ported SAS drives, only to realize that nowadays SAS drives are all dual-ported by definition.

Wide-port operation - i.e. using both SAS links to double your throughput - is an option on SAS SSDs because the disk's internal transfer rate exceeds a single SAS link; that's not applicable to spinning drives, since to my knowledge no conventional drive comes close to maxing out that speed.

In my setup, I have dual connections to the backend drives through and through. Each controller (server) has an HBA that sees every drive in every shelf, irrespective of the other controller... In my situation I need only make sure the controllers don't step on each other's toes - something that LVM and pacemaker handle reliably.

A third node is not REQUIRED to avoid split-brain... but I did consider it. Fundamentally, there are a couple of different ways to arbitrate a dual-node cluster to avoid split-brain. On Windows clusters, for example, a witness can be used - now, pacemaker is not Windows by any stretch, but there are analogs available. However, as with any system, the solutions available depend heavily on what you are doing, how you are doing it, and on what hardware. In a simple world, one could say you do not even need a witness if you are employing an effective STONITH solution - split-brain cannot occur if you shoot the other node in the head fast enough, haha.

Conceptually your keepalived experience will be useful, but pacemaker is a completely different beast. If you really want to embark on this journey: once you have an idea of the design you'd like to adopt, I can share my own configs with you - assuming the design you choose is compatible with mine on a conceptual level. If you are going ZFS, I have a lot of experience with it now, but I have never used it with ESOS or pacemaker.

The TUI is very good - take it from someone who has failed multiple times to structure an scst.conf by hand: writing the conf is deceptively complex. The TUI is an excellent way to make sure you don't make a stupid mistake that is impossible to troubleshoot. Having said that, for what you are trying to achieve with HA... well, the TUI would need a ton of work to make that happen. This is, simply put, complicated stuff. Just look at how much TrueNAS and other competitors charge for a nice, simple HA storage GUI - and theirs ONLY works on their own specific hardware. Not because they're being dicks, but because it is just really, really difficult.

I'll take a look at your shelves and give you my (possibly incorrect) opinion on how to use them.

Andrei Wasylyk

Feb 1, 2021, 7:42:04 PM
to esos-...@googlegroups.com
Ok, so looking at your shelves and your quick rundown of the proposed setup I recommend the following:

First, what's wrong with your idea:

Forget DRBD: your storage is cut in half, and your performance will be limited to what you can synchronously push over the 10G link. A single zpool can only be used by a single server - full stop. Fundamentally, ZFS will never allow otherwise (I mean, never say never, but in this case I'm saying it). DRBD offers zero advantage here, because the goal of DRBD is to provide (usually synchronous, and in your case ABSOLUTELY must be synchronous) replication of block devices. Conceptually this means that if your replicated zpool is active on server1, well, the replicated zpool is ALSO active on server2. If you tried to force-import it using dual-primary, you would absolutely destroy your zpool, because you'd effectively be activating that pool on two servers - again, something that is impossible; unless your goal is to destroy the pool, then yes, go ahead.

Many people do this setup because they don't have shared SAS, so the only way they can achieve HA is a DRBD-replicated pool. But then there's a lot of trouble during failover: you have to export the zpool on server1, demote the DRBD master on server1, promote the DRBD slave on server2, and import the zpool on server2. On top of that, assuming a single pool with datasets or zvols exposed to SCST, you won't be able to balance targets across servers.

Remember what I said earlier: it is too troublesome to have a single storage object (LV, zvol, zpool, etc.) exposed on both servers at once - however, if you plan your storage around this limitation, then you absolutely CAN have one storage object exposed on one server and another on the second.
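The DRBD failover dance described above can be sketched as shell steps; "r0" and "tank" are hypothetical resource/pool names, and in practice pacemaker would orchestrate this rather than a hand-run script:

```shell
# On server1 (current DRBD primary):
zpool export tank            # must complete before demoting the device
drbdadm secondary r0         # release primary role

# On server2:
drbdadm primary r0           # take over the replicated block device
zpool import tank            # pool is only importable once DRBD is primary here
```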

Now, given your apparent preference for ZFS (no judgement - I'm just working with the impressions you're giving me; I don't use ZFS, but I know enough to help you implement it here), here is the solution I propose:

Both servers connected to both shelves as per the "2 controller connection diagram" in the quick-setup PDF from QNAP. Each server should see ALL drives. This requires SAS HDDs; I've heard you can use an interposer with SATA HDDs, but I do not know the caveats of doing so.

2 x 24-drive zpools. I thought that with ZFS parity levels you were limited to the IOPS (not throughput) of a single drive per vdev - so a 24-disk raidz2 will have the IOPS of a single disk. If you can manage it, I would recommend chopping that up a bit. I thought ZFS can even mix vdev types in a pool, no? If yes, I would recommend (again, stealing from Microsoft) some kind of tiered zpool: say, 2 x 6-drive raidz2 plus a 12-drive raid10 vdev (does ZFS do raid10, or is it just a stripe of mirror vdevs? this is where my knowledge is lacking) - that is, if ZFS allows this type of operation, as well as setting up tiers so that writes always hit your mirror vdev and get pushed to the parity set afterwards. That's what I would like to see, anyway; I don't know if ZFS does this, sorry.
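To partially answer the ZFS questions above: ZFS's "raid10" is indeed just a stripe of mirror vdevs, mixing vdev types in one pool is allowed (zpool warns and wants -f), but ZFS has no write-tiering between vdevs - data is striped across all top-level vdevs - so the tiered idea doesn't map directly. A sketch with made-up device names:

```shell
# 24 disks as 4 x 6-disk raidz2 vdevs (IOPS of ~4 disks instead of 1):
zpool create tank \
    raidz2 d0  d1  d2  d3  d4  d5 \
    raidz2 d6  d7  d8  d9  d10 d11 \
    raidz2 d12 d13 d14 d15 d16 d17 \
    raidz2 d18 d19 d20 d21 d22 d23

# ZFS's "raid10": a stripe of mirror vdevs.
zpool create fastpool mirror d0 d1 mirror d2 d3 mirror d4 d5
```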

In this case, depending on your usage, I would fully and absolutely recommend SSD caching, but those SSDs have to be visible to both servers. This leaves you with two options: get SAS SSDs - you can't have a single SSD caching for different pools, so if you want a 2 TB cache for each pool, that means 2 x 2 TB SSDs. The other option, which will complicate the hell out of your setup but may net you a large amount of performance, would be NVMe drives in each server, replicated with DRBD and then used as cache for each zpool - I doubt this would work well, and it introduces failover lag, possibly too much for VMware.
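A quick sketch of the first option - one shared SAS SSD attached to each pool, with invented pool names and device paths. Note that in ZFS terms "cache" means L2ARC (reads only); accelerating synchronous writes would want a separate "log" (SLOG) device instead:

```shell
zpool add tank0 cache /dev/disk/by-id/scsi-SAS_SSD_0   # L2ARC for pool 0
zpool add tank1 cache /dev/disk/by-id/scsi-SAS_SSD_1   # L2ARC for pool 1
```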

Now, under this (non-DRBD) setup we have two zpools; under normal operation, each zpool is pegged to a different server. This mode of operation is safe. Failing over a zpool consists of: mark all front-end target ports as transitioning, wait for all writes to finish, mark the front-end ports on the active node as standby, export the zpool on the active node, wait for the export to complete, import the zpool on the standby node, and mark its front-end target ports as active. This sequence of events is handled by pacemaker. Once it's set up and working, you never touch it again.
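The failover sequence above, sketched against SCST's sysfs interface; the target driver name is an example, and a production setup would drive this through pacemaker resource agents and ALUA device-group states rather than raw enable/disable:

```shell
SCST=/sys/kernel/scst_tgt/targets/qla2x00t   # FC target driver (example)

# On the currently active head:
for t in "$SCST"/*/enabled; do echo 0 > "$t"; done   # stop presenting targets
zpool export tank0                                   # flushes, then exports

# On the standby head:
zpool import tank0
for t in "$SCST"/*/enabled; do echo 1 > "$t"; done   # present targets again
```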

To my knowledge, this transition should occur fast enough not to upset VMware - but I've never used ZFS.

* ZFS negates the need for lvmlockd; you would no longer be using that concept.
* Forget the TUI. If I wanted to take this route, I would use the TUI for the initial setup of scst.conf, then make the necessary changes to SCST by hand and persist them.

Anyone care to offer an opinion on whether I'm on the right path here? Am I completely incorrect?

W. Heiss, is anything I'm saying making sense? If not, I have documentation to direct you to and advice on where to start learning. When you are ready, we can go further, because (assuming I'm not completely out of my gourd) I believe someone attempting what you want to do should be reading this and saying, "oh, OK, yes - I'm not sure how to achieve that, but I see how it all fits together".

I'm not being condescending, just realistic. This is about the first 5% of what you need to do, and it only gets harder. Not to mention that with ZFS there are many unknowns I won't be able to answer.

Andrei Wasylyk

Feb 1, 2021, 7:47:54 PM
to esos-...@googlegroups.com
Correction:
"Conceptually this means that if your replicated zpool is active on server1, well, the replicated zpool is ALSO active on server2."

I meant to say that the zpool server2 sees is active on server1, even if that zpool is the replica - as far as the zpool is concerned, it is already active on another server and by virtue of that cannot be imported.


"Both servers connected to both shelves as per the "2 controller connection diagram" in the quick setup pdf from QNAP. Each server should see ALL drives."

Excuse me, NETAPP
 


W. Heiss

May 6, 2021, 5:05:29 PM
to esos-users
Hi,
thanks for your very informative replies!

Sorry for the long... long... long... delay in responding.
After fighting the !"$%"§$§ shelves and controllers, I am finally able to SEE the disks (although via an HPE P822e, not the LSI 9300e I wanted to use for that).
A few hundred euros, some IOMs, an additional Dell shelf, some controllers and change later, I am ordering cables... again... because... cables. Memo to self: SFF-8644 to SFF-8080 does suck.

At least now I see the disks, and I'm in the process of reformatting them to the correct logical block size (NetApp formats to 520 bytes, not 512 ;) )

I am not really that much into ZFS; I just assumed it was what ESOS prefers.

The "real" setup now is:
* 2 NetApp shelves at 6 Gbit (IOM3 to IOM6 works, sometimes ;) ) - 15k SAS disks, so ~300 MB/s, or ~2.4 Gbit/s per disk, ideally (I can dream!)
  - bloody cables. At the moment I only have one from the HP P822 to the shelves, so further testing is required. At 50€ apiece I did not really want to buy too many...
    In the end this will be 4 cables - each server can talk to each shelf. Maybe. Hopefully.
  - card: currently an HPE P822e, eventually the LSI 9300e
* 2 HPE DL360p Gen8
  - card: Brocade quad-port FC card (4x 8 Gbit)
  - FC cables (duh)
* Brocade 300 8 Gbit FC switches
From here, a vanilla setup.

From my calculation:
Disks: 300 MB/s ≈ 2.4 Gbit/s each (24 per shelf, 2 shelves)
Servers: one 4-lane 6 Gbit SAS cable per shelf -> 24 Gbit per shelf per server, a total of 96 Gbit/s theoretical throughput
The FC cards do 4 x 8 = 32 Gbit per server, a total of 64 Gbit/s

If I use master/slave, I am at 48 Gbit to the shelves and 32 Gbit out over FC.
As I am not that deep into storage - would you say there is a chance of actually getting close to these numbers? If not, where is the most likely bottleneck?
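The arithmetic above can be checked with a back-of-envelope script (integer Mbit/s throughout); under these assumptions, the single active head's 32 Gbit of FC is the tightest link, not the SAS side or the disks:

```shell
#!/bin/sh
# Bottleneck estimate for the proposed active/passive (master/slave) setup.

disk=2400                          # 300 MB/s * 8 = 2400 Mbit/s per 15k SAS disk
disks=48                           # 24 per shelf, 2 shelves
raw=$((disk * disks))              # aggregate raw disk bandwidth

sas_per_shelf=$((4 * 6000))        # one 4-lane 6G SAS cable per shelf
sas_active=$((2 * sas_per_shelf))  # the active head talks to both shelves

fc_per_server=$((4 * 8000))        # quad-port 8G FC card

# The smallest of the three is the ceiling for a single active head.
bottleneck=$raw
[ "$sas_active" -lt "$bottleneck" ] && bottleneck=$sas_active
[ "$fc_per_server" -lt "$bottleneck" ] && bottleneck=$fc_per_server

echo "raw=$raw sas=$sas_active fc=$fc_per_server bottleneck=$bottleneck"
```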

When playing around, the lower part (DL360 -> Brocade -> server) did work; I have a Cisco booting off a local disk in one of my ESOS nodes ;)
I'll keep you posted. At the moment I just wanna get drunk. Like, seriously drunk. Three months until I can see disks is... not good for my mental state.

> In my setup, I have dual connections to the backend drives through and through. Each controller (server) has an HBA that sees every drive in every shelf, irrespective of the other controller... In my situation I need only make sure the controllers don't step on each other's toes - something that LVM and pacemaker handle reliably.

My planned setup looks similar to yours on the hardware level ;)
A question came to mind: when one shelf fails, any RAID level >1 I am aware of will crash and burn, since losing 50% of the disks is considered catastrophic. That adds a new single point of failure to the whole thing... what do you do? Trust the shelf?
Also, does changing disks really require some kind of reboot? I am confused there right now - when changing the block size, the OS got it right, but the HPE controller needed a kick (reboot)...

> In this case, depending on your usage I would fully and absolutely recommend SSD caching, but those SSDs have to be visible to both servers

SSD caching: a 2 TB SAS SSD is ~500€ oO - that is for... later. For the time being, I have some 2 TB SATA SSDs lying around; I'll try them, they should max out 6 Gbit/s.
NVMe with DRBD sounds... bad. Like, really bad. NVMe over Fabrics (FC, IP) may be better.

I still have the on-board P420i, though - maybe I can wire it to some shared-storage thingie... I need to meditate on that. Maybe a very small 1U dual shelf - any ideas?

> ZFS negates the need for lvmlockd. You are no longer using that concept.

I don't see why - or I was unclear.
I want:
* 2 ESOS heads in active/active (this is what all of the above is about). No idea yet what FS will be used there, or how I'll RAID things.
The storage presented to the machines, however, falls into 2 categories:
* VMware with VMFS: this is cluster-aware already
* Linux servers: to save me all the headache with kernel DLM and so on, lvmlockd seems a way to prevent one server from accessing another's mounted LV.
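The client-side lvmlockd idea in the last bullet can be sketched as follows: the shared VG lives on the multipathed ESOS LUN, and lvmlockd (with sanlock) arbitrates activation, so one client cannot activate an LV another client has mounted. All names below are hypothetical:

```shell
systemctl start lvmlockd sanlock                # on every Linux client
vgcreate --shared vg_lun /dev/mapper/esos_lun   # once, from any one client
vgchange --lockstart vg_lun                     # join the lockspace (every client)

lvcreate -L 200G -n web1 vg_lun
lvchange -aey vg_lun/web1    # exclusive activation: other hosts are refused
mkfs.xfs /dev/vg_lun/web1    # safe - no other host can activate this LV now
```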

W. Heiss

May 14, 2021, 10:38:32 AM
to esos-users
So, 200€ of cables later, my LSI 9300 (SAS3008, PCI-Express Fusion-MPT SAS-3) sees all disks, the shelves work flawlessly, and the IOMs work too. I reformatted the disks, and if anyone wants to know, these cables work for LSI 3008 QSFP to DS4243 IOM6: https://www.amazon.de/gp/product/B08H4YZPDW/ - just reformat the disks with sg_format to 512-byte blocks and all is good.
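For anyone repeating the 520 -> 512 byte reformat mentioned above, the sg3_utils incantation looks like this (the device name is an example - check with lsscsi first; this ERASES the disk and can take hours per drive):

```shell
sg_format --format --size=512 /dev/sg4   # low-level format to 512-byte sectors

# Confirm the new logical block size afterwards:
sg_readcap --long /dev/sg4
```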

I have all disks on all controllers on both heads ;)

Now I am where I wanted to be at the beginning of the thread... ;)