Two node cluster quorum and master/master vs master/slave

Kristián Feldsam

unread,
May 1, 2015, 1:10:57 PM5/1/15
to esos-...@googlegroups.com
Hi all,

I built a two-node master/master ESOS cluster, and everything is working well.

I have a problem with quorum for clvm. I added two_node: 1 to corosync.conf and set no-quorum-policy=ignore in Pacemaker, but clvm still waits for quorum. clvm is not managed by Pacemaker, because the clvm RA is missing in ESOS, and I don't know whether that RA would solve this problem.
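For reference, a minimal two-node votequorum stanza in corosync.conf (corosync 2.x syntax; this is a sketch of the settings under discussion, not the poster's actual config) looks like:

```conf
quorum {
    provider: corosync_votequorum
    expected_votes: 2
    two_node: 1
}
```

Note that in corosync 2.x, two_node: 1 implicitly enables wait_for_all, so the very first cluster startup still wants both nodes present.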

So, with one node down, storage is unavailable. A quorum disk might be the solution, but the mkqdisk command is missing in ESOS.

Another question: what is the advantage of a master/master vs. a master/slave configuration? The SCST daemon must still run master/slave, so ALL initiators read/write to only one node (no performance benefit), or is there some way to spread initiators across the storage nodes?

If there is no benefit, then it would be better to configure a master/slave cluster without clvm.

Thanks for any replies.

Best regards, Feldsam.

Marc Smith

unread,
May 1, 2015, 2:19:16 PM5/1/15
to esos-...@googlegroups.com
On Fri, May 1, 2015 at 1:10 PM, Kristián Feldsam <fel...@gmail.com> wrote:
Hi all,

I built a two-node master/master ESOS cluster, and everything is working well.

I have a problem with quorum for clvm. I added two_node: 1 to corosync.conf and set no-quorum-policy=ignore in Pacemaker, but clvm still waits for quorum. clvm is not managed by Pacemaker, because the clvm RA is missing in ESOS, and I don't know whether that RA would solve this problem.

It's been quite a while since I've done anything with clvm, but if I remember correctly, it doesn't have anything to do with Corosync/Pacemaker... it's simply a front-end for locking, so if you have LVM shared between nodes (a cluster LVM setup), you can safely perform LVM management/provisioning tasks like PV/VG/LV create/destroy/etc. (from one node). The activation of volume groups is handled by the "LVM" RA. And if I recall correctly, clvm relies on DLM (the distributed lock manager), so rc.dlm needs to be enabled/running as well; otherwise I think clvm will just hang waiting.



So, with one node down, storage is unavailable. A quorum disk might be the solution, but the mkqdisk command is missing in ESOS.

If something relevant is missing from ESOS, I'd like to add it. What package does 'mkqdisk' come from?

 

Another question: what is the advantage of a master/master vs. a master/slave configuration? The SCST daemon must still run master/slave, so ALL initiators read/write to only one node (no performance benefit), or is there some way to spread initiators across the storage nodes?

There should never be master/master for SCST across more than one node... the blocks of data can be replicated between the two nodes, but other SCSI functions like SCSI reservations and persistent reservations are held at the SCST layer, and those are NOT replicated between nodes. So if an initiator issues a SCSI reservation that goes to one host, and then expects to release it on the other host (another path), it's not going to work... very bad.

To get more performance, you can still use multipath (e.g., round-robin) across target interfaces that belong to the same host, just not round-robin I/O across targets on different hosts. Say I have two nodes in the cluster, and each has (2) 4 Gb Fibre Channel target interfaces. Configure ALUA with this host's two targets in the "local" group and the other host's two targets in the "remote" group; your initiators can then use a round-robin policy, and they should obey it and only send data across the two paths that belong to one node (the two paths marked "active"), while the two paths on the other host are "nonoptimized" (or whatever state you choose). So you now effectively get 8 Gb of bandwidth, since I/O is going across two target interfaces. I've tested this personally with vSphere 5.5: switching the pathing policy from single path to round-robin across multiple paths while doing I/O tests in a VM makes the gain very clear (the bandwidth nearly doubles on a large sequential I/O test). Test the multipath setup with other types of initiators, but I will say it works correctly in my experience with vSphere.
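As a sketch of what that looks like on the SCST side via its sysfs ALUA interface (the device, group, and target names here are hypothetical; consult the SCST sysfs documentation for the exact layout on your version):

```shell
# One device group containing the shared device,
# with one target group per cluster node.
echo "create esos_dg" > /sys/kernel/scst_tgt/device_groups/mgmt
echo "add disk01"     > /sys/kernel/scst_tgt/device_groups/esos_dg/devices/mgmt

# This node's FC targets: the "active" (optimized) paths
echo "create local" > /sys/kernel/scst_tgt/device_groups/esos_dg/target_groups/mgmt
echo "active"       > /sys/kernel/scst_tgt/device_groups/esos_dg/target_groups/local/state

# The peer node's FC targets: the "nonoptimized" paths
echo "create remote" > /sys/kernel/scst_tgt/device_groups/esos_dg/target_groups/mgmt
echo "nonoptimized"  > /sys/kernel/scst_tgt/device_groups/esos_dg/target_groups/remote/state
```

Each target port would then be added to its group via an "add <target_name>" write to that group's mgmt file.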

 

If there is no benefit, then it would be better to configure a master/slave cluster without clvm.

Are you wanting to use clvm just for cluster LVM? That's the only use I know of, but maybe there is another? Hope this helps.


--Marc


Thanks for any replies.

Best regards, Feldsam.


Kristián Feldsam

unread,
May 1, 2015, 5:46:25 PM5/1/15
to esos-...@googlegroups.com


On Friday, May 1, 2015 at 20:19:16 UTC+2, Marc Smith wrote:


On Fri, May 1, 2015 at 1:10 PM, Kristián Feldsam <fel...@gmail.com> wrote:
Hi all,

I built a two-node master/master ESOS cluster, and everything is working well.

I have a problem with quorum for clvm. I added two_node: 1 to corosync.conf and set no-quorum-policy=ignore in Pacemaker, but clvm still waits for quorum. clvm is not managed by Pacemaker, because the clvm RA is missing in ESOS, and I don't know whether that RA would solve this problem.

It's been quite a while since I've done anything with clvm, but if I remember correctly, it doesn't have anything to do with Corosync/Pacemaker... it's simply a front-end for locking, so if you have LVM shared between nodes (a cluster LVM setup), you can safely perform LVM management/provisioning tasks like PV/VG/LV create/destroy/etc. (from one node). The activation of volume groups is handled by the "LVM" RA. And if I recall correctly, clvm relies on DLM (the distributed lock manager), so rc.dlm needs to be enabled/running as well; otherwise I think clvm will just hang waiting.
 
I read the SUSE Linux Enterprise High Availability Extension docs, and in those docs dlm, clvm, and lvm are all managed by Pacemaker. I believed that having clvm managed by Pacemaker would solve the no-quorum problem, but I found out that clvm relies only on Corosync... I read the Corosync docs and found the option "expected_votes", which should be 1 in a two-node cluster. I tried updating the Corosync config, but dlm asks for fencing every time :( I think a good start would be managing dlm and clvm with Pacemaker, so ESOS needs the clvm RA added.
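For reference, the pattern from the SUSE HA docs for this, in crm shell syntax (the resource and clone names here are illustrative), is roughly:

```shell
# dlm must start before clvmd; clone the pair across both nodes
primitive dlm ocf:pacemaker:controld op monitor interval="60" timeout="60"
primitive clvmd ocf:heartbeat:clvm op monitor interval="60" timeout="60"
group base-group dlm clvmd
clone base-clone base-group meta interleave="true"
```

Here ocf:heartbeat:clvm is the clvm RA in question, and ocf:pacemaker:controld is the DLM control daemon agent.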



So, with one node down, storage is unavailable. A quorum disk might be the solution, but the mkqdisk command is missing in ESOS.

If something relevant is missing from ESOS, I'd like to add it. What package does 'mkqdisk' come from?
I found out that qdisk is related to CMAN clusters, and in Corosync the relevant options are two_node: 1 and expected_votes: 1.

 

Another question: what is the advantage of a master/master vs. a master/slave configuration? The SCST daemon must still run master/slave, so ALL initiators read/write to only one node (no performance benefit), or is there some way to spread initiators across the storage nodes?

There should never be master/master for SCST across more than one node... the blocks of data can be replicated between the two nodes, but other SCSI functions like SCSI reservations and persistent reservations are held at the SCST layer, and those are NOT replicated between nodes. So if an initiator issues a SCSI reservation that goes to one host, and then expects to release it on the other host (another path), it's not going to work... very bad.

To get more performance, you can still use multipath (e.g., round-robin) across target interfaces that belong to the same host, just not round-robin I/O across targets on different hosts. Say I have two nodes in the cluster, and each has (2) 4 Gb Fibre Channel target interfaces. Configure ALUA with this host's two targets in the "local" group and the other host's two targets in the "remote" group; your initiators can then use a round-robin policy, and they should obey it and only send data across the two paths that belong to one node (the two paths marked "active"), while the two paths on the other host are "nonoptimized" (or whatever state you choose). So you now effectively get 8 Gb of bandwidth, since I/O is going across two target interfaces. I've tested this personally with vSphere 5.5: switching the pathing policy from single path to round-robin across multiple paths while doing I/O tests in a VM makes the gain very clear (the bandwidth nearly doubles on a large sequential I/O test). Test the multipath setup with other types of initiators, but I will say it works correctly in my experience with vSphere.

I know, I know; I read the docs, but I am asking about the performance (or any other) benefit of two master/master cluster nodes. I use multipath; I have two dual-port cards in both nodes, so 4 + 4 targets. I followed your article from 2013, where you set up DRBD master/master, cloned LVM with clvm, and ran SCST master/slave. So WHY go master/master with DRBD and clvm when initiators can use only one node? Would it not be better to use master/slave DRBD plus plain LVM with locking_type = 1 (no dlm or clvm), and master/slave SCST? I don't see any advantage of a master/master setup when SCST must be master/slave.
 

If there is no benefit, then it would be better to configure a master/slave cluster without clvm.

Are you wanting to use clvm just for cluster LVM? That's the only use I know of, but maybe there is another? Hope this helps.
I answered this in the previous paragraph.

Kristián Feldsam

unread,
May 1, 2015, 6:00:04 PM5/1/15
to esos-...@googlegroups.com


On Friday, May 1, 2015 at 23:46:25 UTC+2, Kristián Feldsam wrote:

I know, I know; I read the docs, but I am asking about the performance (or any other) benefit of two master/master cluster nodes. I use multipath; I have two dual-port cards in both nodes, so 4 + 4 targets. I followed your article from 2013, where you set up DRBD master/master, cloned LVM with clvm, and ran SCST master/slave. So WHY go master/master with DRBD and clvm when initiators can use only one node? Would it not be better to use master/slave DRBD plus plain LVM with locking_type = 1 (no dlm or clvm), and master/slave SCST? I don't see any advantage of a master/master setup when SCST must be master/slave.
Oh, I found out that for master/slave SCST I need the underlying devices, so I need active LVM volumes... so I need to solve the dlm/clvm no-quorum problem after all... I think I can contribute and add the clvm RA to ESOS myself...

Marc Smith

unread,
May 1, 2015, 10:21:42 PM5/1/15
to esos-...@googlegroups.com
I suspect that was a recent addition to the resource-agents package,
as it doesn't appear the version we're using has it:
$ find work/build/resource_agents-3.9.2/ -iname *clvm*
$

Do you know what version it was added in? Actually, I just looked at the
latest release (3.9.6) and it's in there... a recent addition, added in
2014. Guess that's why I hadn't heard of it when I did that DRBD
setup. =)

I can get that updated in the master branch.


>>
>>
>>
>>>
>>> So, with one node down, storage is unavailable. A quorum disk might be
>>> the solution, but the mkqdisk command is missing in ESOS.
>>
>>
>> If something relevant is missing from ESOS, I'd like to add it. What
>> package does 'mkqdisk' come from?
>
> I found out that qdisk is related to CMAN clusters, and in Corosync the
> relevant options are two_node: 1 and expected_votes: 1.

So 'mkqdisk' has nothing to do with Corosync, only CMAN?

You mean why use the DRBD resource(s) in Master/Master (Primary/Primary) and not Primary/Secondary? Because SCST needs to be able to open the block device on both nodes... if it were Primary/Secondary, it would fail to open the /dev/drbdX device on the Secondary node. With ALUA, SCST doesn't actually deny any commands sent to the targets; it processes them as usual. It's up to the initiators to follow the rules, but periodic reads (I assume getting LU / device information) on the nonoptimized paths are pretty common, at least with vSphere. Paths to both nodes need to be available at all times; if SCST weren't running on one node, this would be an issue, and if you couldn't actually map an SCST device to a LUN, that would be an issue too. This is all based on my experience with SCST and vSphere... there may be quirks, or ways around it, but this is what I found works.
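For reference, dual-primary is enabled in the DRBD resource's net section (DRBD 8.4 syntax; the resource name here is illustrative):

```conf
resource r0 {
    net {
        protocol C;              # synchronous replication, required for dual-primary
        allow-two-primaries yes; # let both nodes hold the Primary role at once
    }
    # per-host "on <hostname> { }" sections with disk/address follow as usual
}
```

Under Pacemaker, the ocf:linbit:drbd resource would then typically be run as a master/slave set with master-max="2" so both nodes are promoted.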

The SCST resource is started on both nodes, but the Master/Slave
states are based on ALUA (and vice versa).

I'll work on updating the resource-agents package to the latest
release so the clvm RA is included. Let me know if anything additional
is needed for the quorum stuff.


--Marc

Marc Smith

unread,
May 1, 2015, 10:51:52 PM5/1/15
to esos-...@googlegroups.com
Sorry for that last reply; I had composed the message earlier in the
evening and sent it without looking... I just realized you had discovered
the answer beforehand. My bad! =)

--Marc

Kristián Feldsam

unread,
May 2, 2015, 3:25:17 AM5/2/15
to esos-...@googlegroups.com


On Saturday, May 2, 2015 at 4:21:42 UTC+2, Marc Smith wrote:
Hi, thank you for the replies. You should also update:

pacemaker from 1.1.11 to 1.1.12
corosync from 2.3.3 to 2.3.4
btier from 1.3.2 to 1.3.11