Testing HA: what if only one node starts correctly?


JR

Jun 3, 2021, 5:19:30 AM
to esos-users
Hi all,
I am testing an ESOS HA setup in a dual-primary configuration, and everything already looks to be working well.
But when I simulate a situation where both nodes suddenly go down (powered off) and then only one node comes back up correctly (alive), I do not know how to put it into "limited" operation.

What are the steps to bring a single node up to a working state if the second one is not ready?
And what steps should follow once the second node has been repaired?


Thanks for the advice,
Jan

Marc Smith

Jun 21, 2021, 10:41:38 AM
to esos-...@googlegroups.com
On Thu, Jun 3, 2021 at 5:19 AM JR <ramb...@gmail.com> wrote:
Hi all,
I am testing an ESOS HA setup in a dual-primary configuration, and everything already looks to be working well.
But when I simulate a situation where both nodes suddenly go down (powered off) and then only one node comes back up correctly (alive), I do not know how to put it into "limited" operation.

It should automatically be "limited" since only one node is running. But all runnable resources should be running on that standing node (verify with "crm status").
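For illustration, a minimal way to run that check on the surviving node (assuming the stock Pacemaker/crmsh tools shipped with ESOS):

crm_mon -1      # one-shot snapshot of node and resource state
crm status      # the same view through the crm shell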
 

What are the steps to bring a single node up to a working state if the second one is not ready?
And what steps should follow once the second node has been repaired?

Depending on your cluster configuration, this should all be automatic based on what nodes are joined/started.
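A hedged sketch of how to see which nodes the cluster considers joined and whether it is quorate (standard Corosync/Pacemaker utilities; exact output varies by version):

corosync-quorumtool -s      # quorum state, expected votes, and current membership
crm_node -l                 # nodes known to Pacemaker, with their IDs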

--Marc

 


Thanks for the advice,
Jan


JR

Jun 22, 2021, 11:19:07 AM
to esos-users
Hello Marc,

I have a two-node configuration. Here is my corosync.conf:
totem {
    version: 2
    crypto_hash: none
    crypto_cipher: none
    rrp_mode: passive
    cluster_name: ESOS_CLUSTER
    ip_version: ipv4
#    transport: udpu
    interface {
        bindnetaddr: 192.168.1.1
        ringnumber: 0
    }
}
logging {
    timestamp: off
    fileline: off
    to_stderr: no
    to_syslog: yes
    syslog_facility: local2
    debug: on
}
quorum {
    provider: corosync_votequorum
    two_node: 1
    wait_for_all: 0
}
nodelist {
    node {
        nodeid: 1
        ring0_addr: 192.168.1.1
        name: esos1.xx.xx
    }
    node {
        nodeid: 2
        ring0_addr: 192.168.1.2
        name: esos2.xx.xx
    }
}

Ready state (ESOS 3_0_12z):
---------------------------
2 nodes configured
9 resources configured
Online: [ esos1.xx.xx esos2.xx.xx ]
Full list of resources:
 Master/Slave Set: ms_drbd [g_drbd]
     Masters: [ esos1.xx.xx esos2.xx.xx ]
 Clone Set: clone_lvm [g_lvm]
     Started: [ esos1.xx.xx esos2.xx.xx ]
 Master/Slave Set: ms_scst [p_scst]
     Masters: [ esos1.xx.xx ]
     Slaves: [ esos2.xx.xx ]
 fence_esos1    (stonith:fence_ipmilan):        Started esos2.xx.xx
 fence_esos2    (stonith:fence_ipmilan):        Started esos1.xx.xx
 p_notify       (ocf::heartbeat:ClusterMon):    Started esos1.xx.xx
-------------------------------

When I power up only one node, simulating a failure of the second node (its power cords disconnected), I get:
----------------------------
2 nodes configured
9 resources configured
Node esos2.xx.xx: UNCLEAN (offline)
Online: [ esos1.xx.xx ]
Full list of resources:
 Master/Slave Set: ms_drbd [g_drbd]
     Stopped: [ esos1.xx.xx esos2.xx.xx ]
 Clone Set: clone_lvm [g_lvm]
     Stopped: [ esos1.xx.xx esos2.xx.xx ]
 Master/Slave Set: ms_scst [p_scst]
     Stopped: [ esos1.xx.xx esos2.xx.xx ]
 fence_esos1    (stonith:fence_ipmilan):        Stopped
 fence_esos2    (stonith:fence_ipmilan):        Stopped
 p_notify       (ocf::heartbeat:ClusterMon):    Stopped
Failed Fencing Actions:
* reboot of esos2.xx.xx failed: delegate=, client=crmd.1310, origin=esos1.xx.xx,
    last-failed='Tue Jun 22 16:33:34 2021'
--------------------------------------

That is the correct state, since the second node failed.
But is it possible to start up the resources without the second node (esos2) running?

If you need more information or configs, I can send you more details.
Thanks for your advice.

JR

On Monday, June 21, 2021 at 4:41:38 PM UTC+2, Marc Smith wrote:

Marc Smith

Jun 22, 2021, 2:04:44 PM
to esos-...@googlegroups.com
What does the output of "crm configure show" look like?

--Marc


JR

Jun 22, 2021, 6:01:50 PM
to esos-users
[root@esos1 ~]# crm configure show
node 1: esos1.xx.xx \
        attributes standby=off
node 2: esos2.xx.xx \
        attributes standby=off
primitive fence_esos1 stonith:fence_ipmilan \
        params pcmk_host_list=esos1.xx.xx ipaddr=192.168.250.132 login=ADMIN passwd=xx \
        op monitor interval=60 \
        meta target-role=Started
primitive fence_esos2 stonith:fence_ipmilan \
        params pcmk_host_list=esos2.xx.xx ipaddr=192.168.250.133 login=ADMIN passwd=xx \
        op monitor interval=60 \
        meta target-role=Started
primitive p_drbd_r0 ocf:linbit:drbd \
        params drbd_resource=r0 \
        op monitor interval=10 role=Master \
        op monitor interval=20 role=Slave \
        op start interval=0 timeout=240 \
        op stop interval=0 timeout=100 \
        meta target-role=Started
primitive p_lvm_r0 LVM \
        params volgrpname=r0 \
        op start interval=0 timeout=30 \
        op stop interval=0 timeout=30 \
        meta target-role=Started
primitive p_notify ClusterMon \
        params user=root update=30 extra_options="-E /usr/local/bin/crm_mon_email.sh -e root" \
        op monitor on-fail=restart interval=10
primitive p_scst ocf:esos:scst \
        params alua=true device_group=esos local_tgt_grp=local remote_tgt_grp=remote m_alua_state=active s_alua_state=nonoptimized \
        op monitor interval=10 role=Master \
        op monitor interval=20 role=Slave \
        op start interval=0 timeout=120 \
        op stop interval=0 timeout=90
group g_drbd p_drbd_r0
group g_lvm p_lvm_r0
ms ms_drbd g_drbd \
        meta master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true interleave=true target-role=Started
ms ms_scst p_scst \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true interleave=true target-role=Started
clone clone_lvm g_lvm \
        meta interleave=true target-role=Started
colocation c_r0 inf: clone_lvm:Started ms_scst:Started ms_drbd:Master
location l_fence_esos1 fence_esos1 -inf: esos1.xx.xx
location l_fence_esos2 fence_esos2 -inf: esos2.xx.xx
order o_r0 inf: ms_drbd:promote clone_lvm:start ms_scst:start
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.20-3c4c782f70 \
        cluster-infrastructure=corosync \
        cluster-name=ESOS_CLUSTER \
        stonith-enabled=true \
        maintenance-mode=off \
        last-lrm-refresh=1621415545 \
        no-quorum-policy=ignore
---------------------------
JR

On Tuesday, June 22, 2021 at 8:04:44 PM UTC+2, Marc Smith wrote:

Marc Smith

Jun 25, 2021, 11:25:04 AM
to esos-...@googlegroups.com

It's not obvious to me from looking at the running config why resources aren't starting on the standing node. If you look in the '/var/log/local2.log' file on the "esos1.xx.xx" node, do you see any failed start attempts for the resources? If that were the case, though, I'd expect to see failed resource actions in your "crm_mon -1" output...
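As an illustration of that check (only a sketch; the log path is the one above, and the grep pattern is just an example):

grep -iE 'error|fail' /var/log/local2.log     # failed resource actions in the cluster log
crm_mon -1 --inactive --failcounts            # status including stopped resources and fail counts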

--Marc



JR

Jun 25, 2021, 2:23:31 PM
to esos-users

Thank you for coming back to me.

In local2.log I can see, repeatedly:
Jun 25 14:25:08 esos1 pengine[1309]:   notice: On loss of CCM Quorum: Ignore
Jun 25 14:25:08 esos1 pengine[1309]:   notice: Cannot pair p_scst:0 with instance of clone_lvm
Jun 25 14:25:08 esos1 pengine[1309]:   notice: Cannot pair p_scst:1 with instance of clone_lvm
Jun 25 14:25:08 esos1 pengine[1309]:  warning: Scheduling Node esos2.xx.xx for STONITH
Jun 25 14:25:08 esos1 pengine[1309]:   notice:  * Fence (reboot) esos2.xx.xx 'node is unclean'
Jun 25 14:25:08 esos1 pengine[1309]:   notice:  * Start      p_drbd_r0:0     ( esos1.xx.xx )
Jun 25 14:25:08 esos1 pengine[1309]:   notice:  * Start      fence_esos2     ( esos1.xx.xx )
Jun 25 14:25:08 esos1 pengine[1309]:   notice:  * Start      p_notify        ( esos1.xx.xx )
Jun 25 14:25:08 esos1 pengine[1309]:  warning: Calculated transition 282 (with warnings), saving inputs in /var/lib/pacemaker/pengine/pe-warn-71.bz2
Jun 25 14:25:08 esos1 crmd[1310]:   notice: Requesting fencing (reboot) of node esos2.xx.xx
Jun 25 14:25:08 esos1 stonith-ng[1306]:   notice: Client crmd.1310.e843c1bb wants to fence (reboot) 'esos2.xx.xx' with device '(any)'
Jun 25 14:25:08 esos1 stonith-ng[1306]:   notice: Requesting peer fencing (reboot) of esos2.xx.xx
Jun 25 14:25:08 esos1 stonith-ng[1306]:   notice: fence_esos2 can fence (reboot) esos2.xx.xx: static-list
Jun 25 14:25:08 esos1 stonith-ng[1306]:   notice: fence_esos2 can fence (reboot) esos2.xx.xx: static-list
Jun 25 14:25:18 esos1 stonith-ng[1306]:  warning: fence_ipmilan[25714] stderr: [ 2021-06-25 14:25:18,819 ERROR: Failed: Unable to obtain correct plug status or plug is not available ]
Jun 25 14:25:18 esos1 stonith-ng[1306]:  warning: fence_ipmilan[25714] stderr: [  ]
Jun 25 14:25:18 esos1 stonith-ng[1306]:  warning: fence_ipmilan[25714] stderr: [  ]
Jun 25 14:25:29 esos1 stonith-ng[1306]:  warning: fence_ipmilan[25728] stderr: [ 2021-06-25 14:25:29,950 ERROR: Failed: Unable to obtain correct plug status or plug is not available ]
Jun 25 14:25:29 esos1 stonith-ng[1306]:  warning: fence_ipmilan[25728] stderr: [  ]
Jun 25 14:25:29 esos1 stonith-ng[1306]:  warning: fence_ipmilan[25728] stderr: [  ]
Jun 25 14:25:29 esos1 stonith-ng[1306]:    error: Operation 'reboot' [25728] (call 284 from crmd.1310) for host 'esos2.xx.xx' with device 'fence_esos2' returned: -201 (Generic Pacemaker error)
Jun 25 14:25:29 esos1 stonith-ng[1306]:   notice: Couldn't find anyone to fence (reboot) esos2.xx.xx with any device
Jun 25 14:25:29 esos1 stonith-ng[1306]:    error: Operation reboot of esos2.xx.xx by <no-one> for crmd...@esos1.xx.xx.24cef2e3: No route to host
Jun 25 14:25:29 esos1 crmd[1310]:   notice: Stonith operation 284/83:282:0:a38f0b2b-19d9-419a-a9bd-20a44028f2e6: No route to host (-113)
Jun 25 14:25:29 esos1 crmd[1310]:   notice: Stonith operation 284 for esos2.xx.xx failed (No route to host): aborting transition.
Jun 25 14:25:29 esos1 crmd[1310]:  warning: Too many failures (283) to fence esos2.xx.xx, giving up
Jun 25 14:25:29 esos1 crmd[1310]:   notice: Transition aborted: Stonith failed
Jun 25 14:25:29 esos1 crmd[1310]:   notice: Peer esos2.xx.xx was not terminated (reboot) by <anyone> on behalf of crmd.1310: No route to host
Jun 25 14:25:29 esos1 crmd[1310]:   notice: Transition 282 (Complete=5, Pending=0, Fired=0, Skipped=0, Incomplete=11, Source=/var/lib/pacemaker/pengine/pe-warn-71.bz2): Complete
Jun 25 14:25:29 esos1 crmd[1310]:   notice: State transition S_TRANSITION_ENGINE -> S_IDLE
Jun 25 14:40:29 esos1 crmd[1310]:   notice: State transition S_IDLE -> S_POLICY_ENGINE

 crm_resource -list
p_drbd_r0:0
p_drbd_r0:1
p_lvm_r0:0
p_lvm_r0:1
p_scst:0
p_scst:1
fence_esos1
fence_esos2
p_notify

crmadmin -S esos1.xx.xx
Status of cr...@esos1.xx.xx: S_IDLE (ok)

crm_mon -1
Stack: corosync
Current DC: esos1.xx.xx (version 1.1.20-3c4c782f70) - partition with quorum
Last updated: Fri Jun 25 18:13:34 2021
Last change: Tue Jun 22 16:24:25 2021 by root via cibadmin on esos1.xx.xx


2 nodes configured
9 resources configured

Node esos2.xx.xx: UNCLEAN (offline)
Online: [ esos1.xx.xx ]

No active resources



Failed Fencing Actions:
* reboot of esos2.xx.xx failed: delegate=, client=crmd.1310, origin=esos1.xx.xx,
    last-failed='Fri Jun 25 18:00:28 2021'

Marc, in pacemaker.log I see allocation and promotion activity for the resources, which finishes with 0 instances promoted to master.
Here is the section of the log for the DRBD resource (see the replay sketch after the excerpt):
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:   notice: unpack_config: On loss of CCM Quorum: Ignore
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:    debug: unpack_config: Node scores: 'red' = -INFINITY, 'yellow' = 0, 'green' = 0
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:     info: determine_online_status_fencing:       Node esos1.xx.xx is active
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:     info: determine_online_status:       Node esos1.xx.xx is online
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:    debug: unpack_find_resource:  Internally renamed p_lvm_r0 on esos1.xx.xx to p_lvm_r0:0
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:    debug: unpack_find_resource:  Internally renamed p_drbd_r0 on esos1.xx.xx to p_drbd_r0:0
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:    debug: unpack_find_resource:  Internally renamed p_scst on esos1.xx.xx to p_scst:0
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:     info: unpack_node_loop:      Node 1 is already processed
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:     info: unpack_node_loop:      Node 1 is already processed
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:     info: clone_print:    Master/Slave Set: ms_drbd [g_drbd]
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:     info: short_print:        Stopped: [ esos1.xx.xx esos2.xx.xx ]
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:     info: clone_print:    Clone Set: clone_lvm [g_lvm]
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:     info: short_print:        Stopped: [ esos1.xx.xx esos2.xx.xx ]
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:     info: clone_print:    Master/Slave Set: ms_scst [p_scst]
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:     info: short_print:        Stopped: [ esos1.xx.xx esos2.xx.xx ]
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:     info: common_print:  fence_esos1     (stonith:fence_ipmilan):        Stopped
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:     info: common_print:  fence_esos2     (stonith:fence_ipmilan):        Stopped
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:     info: common_print:  p_notify        (ocf::heartbeat:ClusterMon):    Stopped
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:    debug: distribute_children:   Allocating up to 2 ms_drbd instances to a possible 1 nodes (at most 1 per host, 2 optimal)
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:    debug: native_assign_node:    Assigning esos1.xx.xx to p_drbd_r0:0
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:    debug: native_assign_node:    All nodes for resource p_drbd_r0:1 are unavailable, unclean or shutting down (esos2.xx.xx: 0, -1000000)
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:    debug: native_assign_node:    Could not allocate a node for p_drbd_r0:1
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:     info: native_color:  Resource p_drbd_r0:1 cannot run anywhere
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:    debug: distribute_children:   Allocated 1 ms_drbd instances of a possible 2
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:    debug: master_color:  g_drbd:1 master score: 0
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:    debug: master_color:  g_drbd:0 master score: -1
Jun 25 14:25:08 [1309] esos1.xx.xx    pengine:     info: master_color:  ms_drbd: Promoted 0 instances of a possible 2 to master
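As a hedged aside, the saved transition input named in the log above can be replayed offline to see why the policy engine made these decisions, assuming crm_simulate from Pacemaker is available on the node:

crm_simulate -S -x /var/lib/pacemaker/pengine/pe-warn-71.bz2     # re-run the scheduler on the saved input and show the planned actions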

The same result shows for the rest of the resources.
Thanks for your advice.

JR



On Friday, June 25, 2021 at 5:25:04 PM UTC+2, Marc Smith wrote: