ALUA setup HA

JR

May 26, 2021, 6:54:08 AM
to esos-users

Hi,

I have an HA setup with ESOS 3.0.12_z on Supero servers (with internal storage). Both servers have a dual-port FC card connected to two Brocade switches. Three ESXi hosts are connected to this ESOS cluster via the same Brocade switches. Each ESXi host correctly shows 4 FC paths, one of them active (I/O), but the storage array type is wrongly reported as VMW_SATP_DEFAULT_AA.

There is a strange message in the log when I run crm node standby on the second node:
scst(p_scst)[26919]:    INFO:  Collecting current configuration: done. -> Making requested changes. -> Setting Target Group attribute 'state' to value '' for target group 'esos/remote': done. WARNING: Received the following error: Target group attribute specified is static. -> Done, 0 change(s) made. All done.

What did I forget, or where did I make a mistake?

Settings:
2 nodes configured
9 resources configured
Online: [ esos1.xx.admin esos2.xx.admin ]
Full list of resources:
 Master/Slave Set: ms_drbd [g_drbd]
     Masters: [ esos1.xx.admin esos2.xx.admin ]
 Clone Set: clone_lvm [g_lvm]
     Started: [ esos1.xx.admin esos2.xx.admin ]
 Master/Slave Set: ms_scst [p_scst]
     Masters: [ esos1.xx.admin ]
     Slaves: [ esos2.xx.admin ]
 fence_esos1    (stonith:fence_ipmilan):        Started esos2.xx.admin
 fence_esos2    (stonith:fence_ipmilan):        Started esos1.xx.admin
 p_notify       (ocf::heartbeat:ClusterMon):    Started esos2.xx.admin

----------------------------------------------
node 1: esos1.xx.admin \
        attributes standby=off
node 2: esos2.xx.admin \
        attributes standby=off
primitive fence_esos1 stonith:fence_ipmilan \
        params pcmk_host_list=esos1.xx.admin ipaddr=192.168.250.132 login=ADMIN passwd=xxxx \
        op monitor interval=60 \
        meta target-role=Started
primitive fence_esos2 stonith:fence_ipmilan \
        params pcmk_host_list=esos2.xx.admin ipaddr=192.168.250.133 login=ADMIN passwd=xxxx \
        op monitor interval=60 \
        meta target-role=Started
primitive p_drbd_r0 ocf:linbit:drbd \
        params drbd_resource=r0 \
        op monitor interval=10 role=Master \
        op monitor interval=20 role=Slave \
        op start interval=0 timeout=240 \
        op stop interval=0 timeout=100 \
        meta target-role=Started
primitive p_lvm_r0 LVM \
        params volgrpname=r0 \
        op start interval=0 timeout=30 \
        op stop interval=0 timeout=30 \
        meta target-role=Started
primitive p_notify ClusterMon \
        params user=root update=30 extra_options="-E /usr/local/bin/crm_mon_email.sh -e root" \
        op monitor on-fail=restart interval=10
primitive p_scst ocf:esos:scst \
        params alua=true device_group=esos local_tgt_grp=local remote_tgt_grp=remote m_alua_state=active s_alua_state=nonoptimized \
        op monitor interval=10 role=Master \
        op monitor interval=20 role=Slave \
        op start interval=0 timeout=120 \
        op stop interval=0 timeout=90
group g_drbd p_drbd_r0
group g_lvm p_lvm_r0
ms ms_drbd g_drbd \
        meta master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true interleave=true target-role=Started
ms ms_scst p_scst \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true interleave=true target-role=Started
clone clone_lvm g_lvm \
        meta interleave=true target-role=Started
colocation c_r0 inf: clone_lvm:Started ms_scst:Started ms_drbd:Master
location l_fence_esos1 fence_esos1 -inf: esos1.xx.admin
location l_fence_esos2 fence_esos2 -inf: esos2.xx.admin
order o_r0 inf: ms_drbd:promote clone_lvm:start ms_scst:start
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=1.1.20-3c4c782f70 \
        cluster-infrastructure=corosync \
        cluster-name=ESOS_CLUSTER \
        stonith-enabled=true \
        maintenance-mode=off \
        last-lrm-refresh=1621415545
-------------------------------------
ESOS1 SCST:
HANDLER vdisk_blockio {
        DEVICE ESOS_DATA {
                cluster_mode 1
                filename /dev/mapper/r0-DATA
                rotational 0
                size 472446402560
                write_through 1
        }
}

TARGET_DRIVER copy_manager {
        TARGET copy_manager_tgt {
                LUN 0 ESOS_DATA
        }
}

TARGET_DRIVER iscsi {
        enabled 1
}

TARGET_DRIVER qla2x00t {
        TARGET 50:01:43:80:16:7c:fb:44 {
                HW_TARGET

                enabled 1
                rel_tgt_id 1

                GROUP TGT_GRP2 {
                        LUN 0 ESOS_DATA

                        INITIATOR 50:01:43:80:33:15:42:1e

                        INITIATOR 50:01:43:80:33:15:42:4c

                        INITIATOR 50:01:43:80:33:15:42:80
                }
        }

        TARGET 50:01:43:80:16:7c:fb:46 {
                HW_TARGET

                enabled 1
                rel_tgt_id 2

                GROUP TGT_GRP1 {
                        LUN 0 ESOS_DATA

                        INITIATOR 50:01:43:80:33:15:42:1c

                        INITIATOR 50:01:43:80:33:15:42:4e

                        INITIATOR 50:01:43:80:33:15:42:82
                }
        }
}

DEVICE_GROUP esos {
        DEVICE ESOS_DATA

        TARGET_GROUP local {
                group_id 1
                state active

                TARGET 50:01:43:80:16:7c:fb:44
                TARGET 50:01:43:80:16:7c:fb:46
        }

        TARGET_GROUP remote {
                group_id 2
                state nonoptimized

                TARGET 50:01:43:80:09:af:79:40 {
                        rel_tgt_id 3
                }
                TARGET 50:01:43:80:09:af:79:42 {
                        rel_tgt_id 4
                }
        }
}
-----------------------------------------------------------------------

ESOS2 SCST:
HANDLER vdisk_blockio {
        DEVICE ESOS_DATA {
                cluster_mode 1
                filename /dev/mapper/r0-DATA
                rotational 0
                size 472446402560
                write_through 1
        }
}

TARGET_DRIVER copy_manager {
        TARGET copy_manager_tgt {
                LUN 0 ESOS_DATA
        }
}

TARGET_DRIVER iscsi {
        enabled 1
}

TARGET_DRIVER qla2x00t {
        TARGET 50:01:43:80:09:af:79:40 {
                HW_TARGET

                enabled 1
                rel_tgt_id 3

                GROUP TGT_GRP2 {
                        LUN 0 ESOS_DATA

                        INITIATOR 50:01:43:80:33:15:42:1c

                        INITIATOR 50:01:43:80:33:15:42:4e

                        INITIATOR 50:01:43:80:33:15:42:82
                }
        }

        TARGET 50:01:43:80:09:af:79:42 {
                HW_TARGET

                enabled 1
                rel_tgt_id 4

                GROUP TGT_GRP1 {
                        LUN 0 ESOS_DATA

                        INITIATOR 50:01:43:80:33:15:42:1e

                        INITIATOR 50:01:43:80:33:15:42:4c

                        INITIATOR 50:01:43:80:33:15:42:80
                }
        }
}

DEVICE_GROUP esos {
        DEVICE ESOS_DATA

        TARGET_GROUP local {
                group_id 2
                state nonoptimized

                TARGET 50:01:43:80:09:af:79:40
                TARGET 50:01:43:80:09:af:79:42
        }

        TARGET_GROUP remote {
                group_id 1
                state active

                TARGET 50:01:43:80:16:7c:fb:44 {
                        rel_tgt_id 1
                }
                TARGET 50:01:43:80:16:7c:fb:46 {
                        rel_tgt_id 2
                }
        }
}
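
A quick way to cross-check the two DEVICE_GROUP esos sections above is to compare them programmatically. The sketch below is only an illustration: it assumes that each node's "local" target group should describe the same ports (same group_id, same rel_tgt_id per WWN) as the peer's "remote" group, which is how the two configs above are laid out. The function name and the dict layout are hypothetical; the values are transcribed by hand from the configs above.

```python
def alua_view_mismatches(node_a, node_b):
    """Compare two nodes' DEVICE_GROUP views and list disagreements.

    Each view maps "local"/"remote" to a dict with "group_id" and
    "targets" (WWN -> rel_tgt_id). Assumption: one node's local group
    should describe the same ports as the peer's remote group.
    """
    problems = []
    for mine, theirs in (("local", "remote"), ("remote", "local")):
        a, b = node_a[mine], node_b[theirs]
        if a["group_id"] != b["group_id"]:
            problems.append(
                f"group_id: {mine}={a['group_id']} vs peer {theirs}={b['group_id']}"
            )
        if a["targets"] != b["targets"]:
            problems.append(f"targets/rel_tgt_id differ: {mine} vs peer {theirs}")
    return problems


# Values transcribed from the ESOS1 and ESOS2 configs in this post.
ESOS1 = {
    "local": {"group_id": 1, "targets": {"50:01:43:80:16:7c:fb:44": 1,
                                         "50:01:43:80:16:7c:fb:46": 2}},
    "remote": {"group_id": 2, "targets": {"50:01:43:80:09:af:79:40": 3,
                                          "50:01:43:80:09:af:79:42": 4}},
}
ESOS2 = {
    "local": {"group_id": 2, "targets": {"50:01:43:80:09:af:79:40": 3,
                                         "50:01:43:80:09:af:79:42": 4}},
    "remote": {"group_id": 1, "targets": {"50:01:43:80:16:7c:fb:44": 1,
                                          "50:01:43:80:16:7c:fb:46": 2}},
}

if __name__ == "__main__":
    for problem in alua_view_mismatches(ESOS1, ESOS2):
        print(problem)
```

With the values as posted, the function finds no disagreement between the two nodes, so under this assumption the group IDs and relative target IDs themselves are consistent.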

Thanks for your help
JR

Andrei Wasylyk

May 26, 2021, 7:24:20 AM
to esos-...@googlegroups.com

I have noticed from time to time that when SCST has trouble starting or stopping, it is often because of an edit I made to scst.conf by hand (even when I am absolutely certain I made no mistakes).
In these cases I generate a brand-new conf using scstadmin, re-entering all the values from my old conf.
Something like this has previously caused me problems like the one you describe, where SCST tells me I'm trying to change an attribute that is static (even though I can't see what that could possibly be).

Is it possible this is the case? Have you recently made changes to the conf so that it is now broken (but was working before), or is it a brand-new install?

You can also try comparing the output of scstadmin -save_config with /etc/scst.conf. This might reveal differences that point to the issue.
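
The comparison could be sketched roughly like this. It is a minimal illustration, not part of scstadmin: it assumes the running configuration has already been dumped to a text file (for example with the scstadmin --write_config call that appears later in this thread) and simply diffs the two texts while ignoring indentation and blank lines. The function name and the sample strings are hypothetical.

```python
import difflib


def config_diff(saved_text, running_text):
    """Return unified-diff lines between two scst.conf-style texts.

    Indentation and blank lines are ignored so that only meaningful
    differences (attributes, targets, groups) remain.
    """
    def normalize(text):
        return [line.strip() for line in text.splitlines() if line.strip()]

    return list(
        difflib.unified_diff(
            normalize(saved_text),
            normalize(running_text),
            fromfile="/etc/scst.conf",
            tofile="running-config",
            lineterm="",
        )
    )


if __name__ == "__main__":
    # Hypothetical inputs: a hand-edited file versus what SCST reports back.
    edited = "DEVICE ESOS_DATA {\n\tactive 0\n\tfilename /dev/mapper/r0-DATA\n}\n"
    running = "DEVICE ESOS_DATA {\n\tcluster_mode 1\n\tfilename /dev/mapper/r0-DATA\n}\n"
    for line in config_diff(edited, running):
        print(line)
```

Lines starting with "-" exist only in the file on disk and lines starting with "+" exist only in the running configuration, which makes attributes that SCST adds or drops on its own easy to spot.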

Just out of curiosity, why bother enabling iSCSI if you are using FC?

Probably unrelated, but personally I don't bother setting cluster_mode 1 on the devices anymore, since the RA already does this after startup. I also usually set active 0, since the RA handles the active/inactive state of the backing device itself and I figure it's safer to have it start up disabled.

Andrei

JR

May 26, 2021, 9:15:30 AM
to esos-users
It is a new install.
In the config I manually disabled iSCSI (enabled 0), then removed cluster_mode and added active 0.
Then I ran scstadmin --config /etc/scst.conf followed by scstadmin --write_config /etc/scst.conf.
After that, /etc/scst.conf still contains cluster_mode 1 (added back automatically) and no longer contains active 0 (removed automatically); otherwise nothing changed.

scstadmin --check_config /etc/scst.conf only shows: WARNING: Driver 'iscsi' has no configured targets, which is correct.

So what do you think, should I completely reconfigure SCST using scstadmin?

ALUA behavior has not changed...


At this point the SCST sysfs attributes are:
On both esos1 and esos2:
/sys/kernel/scst_tgt/devices/ESOS_DATA/expl_alua 0
/sys/kernel/scst_tgt/devices/ESOS_DATA/bind_alua_state 1
/sys/kernel/scst_tgt/devices/ESOS_DATA/active 1
/sys/kernel/scst_tgt/device_groups/esos/devices/ESOS_DATA
/sys/kernel/scst_tgt/device_groups/esos/target_groups/local
/sys/kernel/scst_tgt/device_groups/esos/target_groups/remote

esos1 - device_groups/esos/target_groups/local/group_id     1
        device_groups/esos/target_groups/remote/group_id
esos2 - device_groups/esos/target_groups/local/group_id     2
        device_groups/esos/target_groups/remote/group_id    1

esos1 - device_groups/esos/target_groups/local/preferred    0
        device_groups/esos/target_groups/remote/preferred   0
esos2 - device_groups/esos/target_groups/local/preferred    0
        device_groups/esos/target_groups/remote/preferred   0

esos1 - device_groups/esos/target_groups/local/state        active
        device_groups/esos/target_groups/remote/state       nonoptimized
esos2 - device_groups/esos/target_groups/local/state        nonoptimized
        device_groups/esos/target_groups/remote/state       active

esos1 - esos/target_groups/local/50\:01\:43\:80\:16\:7c\:fb\:44/rel_tgt_id     1
        esos/target_groups/local/50\:01\:43\:80\:16\:7c\:fb\:46/rel_tgt_id     2
        esos/target_groups/remote/50\:01\:43\:80\:09\:af\:79\:40/rel_tgt_id    3
        esos/target_groups/remote/50\:01\:43\:80\:09\:af\:79\:42/rel_tgt_id    4

esos2 - esos/target_groups/local/50\:01\:43\:80\:09\:af\:79\:40/rel_tgt_id     3
        esos/target_groups/local/50\:01\:43\:80\:09\:af\:79\:42/rel_tgt_id     4
        esos/target_groups/remote/50\:01\:43\:80\:16\:7c\:fb\:44/rel_tgt_id    1
        esos/target_groups/remote/50\:01\:43\:80\:16\:7c\:fb\:46/rel_tgt_id    2
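
A listing like the one above can also be collected with a small script instead of by hand. The following is a minimal sketch under a few assumptions: it expects the directory layout shown in this message (a target_groups directory containing one subdirectory per target group, each with group_id, state and preferred files), the function name is hypothetical, and it keeps only the first line of each file in case the kernel appends a marker line such as "[key]" after the value.

```python
import os


def read_target_groups(dev_group_dir):
    """Collect group_id, state and preferred for each target group.

    dev_group_dir is a device group directory such as
    /sys/kernel/scst_tgt/device_groups/esos (path taken from this thread).
    Attributes that are missing or unreadable are reported as None.
    """
    result = {}
    tg_root = os.path.join(dev_group_dir, "target_groups")
    for name in sorted(os.listdir(tg_root)):
        tg_dir = os.path.join(tg_root, name)
        if not os.path.isdir(tg_dir):
            continue  # skip plain files that sit next to the group dirs
        attrs = {}
        for attr in ("group_id", "state", "preferred"):
            try:
                with open(os.path.join(tg_dir, attr)) as fh:
                    # Assumption: the value is on the first line; any
                    # following marker line is ignored.
                    lines = fh.read().splitlines()
                    attrs[attr] = lines[0].strip() if lines else ""
            except OSError:
                attrs[attr] = None
        result[name] = attrs
    return result


if __name__ == "__main__":
    groups = read_target_groups("/sys/kernel/scst_tgt/device_groups/esos")
    for name, attrs in groups.items():
        print(name, attrs)
```

Running it on both nodes and comparing the two results side by side shows an empty or missing attribute immediately, since it comes back as an empty string or None rather than as a silent gap.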
  
Thank you
JR


Andrei Wasylyk

May 26, 2021, 10:50:18 AM
to esos-...@googlegroups.com
Ah, okay, excuse me: you are using bind_alua_state 1 (which is the default; I forgot). bind_alua_state means that when you set the ALUA state to active (or nonoptimized), SCST itself handles activating the backend device. In my setup I don't bind ALUA states: I use bind_alua_state 0 and active 0, and configure the RA with set_dev_active=true. If your LVM volume is running in dual-active mode, then there is no problem with this. And in any case I don't think it is necessarily the source of your issue.

I'm sorry, I was just explaining my confusion. Your way SHOULD work, but in my infrastructure (shared SAS JBOD) I could not get ALUA-based state changes to work.

So yes, if it's not a production system, I would recommend stopping the cluster altogether and generating the config file with scstadmin, manually issuing the command for each step. It's entirely possible this is not your problem, but that would be my next step, because everything else looks good.

You don't have to bother disabling iSCSI if doing that causes more problems. Let's try to fix one problem at a time instead of causing new ones, haha.

If you suspect that bind_alua_state is causing problems, make a copy of your scst.conf and I'll explain the changes needed to handle activation with the RA directly. In fact, make copies before you follow any of my advice, because I might be wrong.

Andrei

JR

May 27, 2021, 3:10:12 AM
to esos-users
I made most of the SCST settings mentioned above via the TUI, but did some minor tuning manually afterwards.

I believe I have set up the cluster correctly:
> crm_mon -R
Stack: corosync
Current DC: esos2.xx.admin (2) (version 1.1.20-3c4c782f70) - partition with quorum
Last updated: Thu May 27 07:47:30 2021
Last change: Thu May 27 07:11:21 2021 by root via crm_attribute on esos1.xx.admin


2 nodes configured
9 resources configured

Online: [ esos1.xx.admin (1) esos2.xx.admin (2) ]

Active resources:

 Master/Slave Set: ms_drbd [g_drbd]
     Resource Group: g_drbd:0
         p_drbd_r0      (ocf::linbit:drbd):     Master esos2.xx.admin
     Resource Group: g_drbd:1
         p_drbd_r0      (ocf::linbit:drbd):     Master esos1.xx.admin

     Masters: [ esos1.xx.admin esos2.xx.admin ]
 Clone Set: clone_lvm [g_lvm]
     Resource Group: g_lvm:0
         p_lvm_r0       (ocf::heartbeat:LVM):   Started esos2.xx.admin
     Resource Group: g_lvm:1
         p_lvm_r0       (ocf::heartbeat:LVM):   Started esos1.xx.admin

     Started: [ esos1.xx.admin esos2.xx.admin ]
 Master/Slave Set: ms_scst [p_scst]
     p_scst     (ocf::esos:scst):       Master esos2.xx.admin
     p_scst     (ocf::esos:scst):       Slave esos1.xx.admin
     Masters: [ esos2.xx.admin ]
     Slaves: [ esos1.xx.admin ]

fence_esos1     (stonith:fence_ipmilan):        Started esos2.xx.admin
fence_esos2     (stonith:fence_ipmilan):        Started esos1.xx.admin
p_notify        (ocf::heartbeat:ClusterMon):    Started esos1.xx.admin

Thanks
JR

JR

Jun 1, 2021, 3:25:56 AM
to esos-users
Thank you, Andrei, for the help; problem solved.
After reconfiguring SCST as you recommended (detaching from ESXi, clearing scst.conf via scstadmin, and creating a new config via scstadmin and the TUI), everything is now working well. On the ESXi side, the storage array type is now VMW_SATP_ALUA.
The only change I made was separating the TARGET_DRIVER GROUP TGT_GRP entries (4x FC target ... 4x TGT_GRP), so that each FC target has its own group. I do not know whether this helped. Then I spent some time testing the fencing, and it appears to work as expected.
ESOS1.....
TARGET_DRIVER qla2x00t {
        TARGET 50:01:43:80:16:7c:fb:44 {
                HW_TARGET
                enabled 1
                rel_tgt_id 1
                GROUP TGT_GRP1 {
                        LUN 1 ESOS_DATA

                        INITIATOR 50:01:43:80:33:15:42:1e
                        INITIATOR 50:01:43:80:33:15:42:4c
                        INITIATOR 50:01:43:80:33:15:42:80
                }
        }
        TARGET 50:01:43:80:16:7c:fb:46 {
                HW_TARGET
                enabled 1
                rel_tgt_id 2
                GROUP TGT_GRP2 {
                        LUN 1 ESOS_DATA

                        INITIATOR 50:01:43:80:33:15:42:1c
                        INITIATOR 50:01:43:80:33:15:42:4e
                        INITIATOR 50:01:43:80:33:15:42:82
                }
        }
}
---------------------------------------------------------------------
ESOS2.....
TARGET_DRIVER qla2x00t {
        TARGET 50:01:43:80:09:af:79:40 {
                HW_TARGET
                enabled 1
                rel_tgt_id 3
                GROUP TGT_GRP3 {
                        LUN 1 ESOS_DATA

                        INITIATOR 50:01:43:80:33:15:42:1c
                        INITIATOR 50:01:43:80:33:15:42:4e
                        INITIATOR 50:01:43:80:33:15:42:82
                }
        }
        TARGET 50:01:43:80:09:af:79:42 {
                HW_TARGET
                enabled 1
                rel_tgt_id 4
                GROUP TGT_GRP4 {
                        LUN 1 ESOS_DATA

                        INITIATOR 50:01:43:80:33:15:42:1e
                        INITIATOR 50:01:43:80:33:15:42:4c
                        INITIATOR 50:01:43:80:33:15:42:80
                }
        }
}
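
To confirm the storage array type on every ESXi host after a change like this, the output of esxcli storage nmp device list can be checked with a short script. This is only a sketch: it assumes the usual layout of that command (an unindented device ID line followed by indented "Key: value" lines), and the function name and the device IDs in the usage example are hypothetical placeholders, not values from this cluster.

```python
def satp_by_device(nmp_output):
    """Map each device ID to its 'Storage Array Type' value.

    Assumes the layout printed by `esxcli storage nmp device list`:
    an unindented device ID line followed by indented 'Key: value' lines.
    """
    result = {}
    device = None
    for line in nmp_output.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace():
            # A new device block starts with an unindented ID line.
            device = line.strip()
        elif device is not None:
            key, sep, value = line.strip().partition(":")
            if sep and key == "Storage Array Type":
                result[device] = value.strip()
    return result


if __name__ == "__main__":
    # Hypothetical captured output; the device IDs are placeholders.
    sample = (
        "naa.example0001\n"
        "   Device Display Name: Example Disk\n"
        "   Storage Array Type: VMW_SATP_ALUA\n"
        "\n"
        "naa.example0002\n"
        "   Storage Array Type: VMW_SATP_DEFAULT_AA\n"
    )
    for dev, satp in satp_by_device(sample).items():
        flag = "" if satp == "VMW_SATP_ALUA" else "  <-- not ALUA"
        print(dev, satp, flag)
```

Any device still claimed by VMW_SATP_DEFAULT_AA stands out in the printed list, which is the symptom described at the start of this thread.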

Thanks JR


