VMware vSphere cluster/heartbeating with ALUA/VAAI support.


webm...@openhome.org

Jun 6, 2014, 12:17:49 PM6/6/14
to esos-...@googlegroups.com
Hi All,
 
First of all - great project!
 
I am currently prototyping a new SAN storage platform built upon ESOS which will be used by 4 VMware hosts and
looking for some input/sanity check -
 
1. Is anyone else doing this with a vSphere 5.5 cluster which has DR/HA enabled? (Auto failover of VMs etc. and heartbeating on datastores)
 
2. I note that the latest SCST has VAAI support / Atomic Test and Set and may not need SCSI Reservations. Has anyone else tested this yet?
3. The docs suggest that Multipath I/O with Round robin is a bad idea. Is this actually an issue if all paths are in standby mode to the secondary storage array with ALUA? (i.e. Active/Standby). Not sure what I am missing here.
 
4. How do we avoid data corruption when iSCSI reservations are not shared across the primary/secondary nodes? Is a single active node enough? What about failover?
 
Thanks

Marc Smith

Jun 6, 2014, 4:41:07 PM6/6/14
to esos-...@googlegroups.com
On Fri, Jun 6, 2014 at 12:17 PM, <webm...@openhome.org> wrote:
> Hi All,
>
> First of all - great project!

Hi,

Thank you.


>
> I am currently prototyping a new SAN storage platform built upon ESOS which
> will be used by 4 VMware hosts and
> looking for some input/sanity check -
>
> 1. Is anyone else doing this with a Vsphere 5.5 cluster which has DR/HA
> enabled? (Auto failover of VMs etc and heart beating on data stores)

Yes. We have (12) ESXi 5.5 hosts (4 per cluster) attached to (4) ESOS
all-SSD disk arrays that are running in production. We use datastore
heartbeating, and vSphere HA / DRS work correctly (Fibre Channel SAN).


>
> 2. I note that the latest SCST has VAAI support / Atomic Test and Set and
> may not need SCSI Reservations. Has anyone else tested this yet?
> 3. The docs suggest that Multipath I/O with Round robin is a bad idea. Is
> this actually an issue if all paths are in standby mode to the secondary
> storage array with ALUA? (i.e. Active/Standby). Not sure what I am missing
> here.
>

I have a couple of new boxes set up in a cluster (ESOS) and I'm just
about to start testing in the next day or two. I just recently updated SCST (a
few days ago) which has the COMPARE AND WRITE functionality in vdisk_*
handlers (for VAAI ATS). I can let you know in a few days, but I don't
expect any problems -- I haven't seen any recent complaints on the
scst-devel list about it.


> 4. How to avoid data corruption when iSCSI Reservation is not shared across
> both primary/secondary node? is single node active enough? what about
> failover?
>

Don't round-robin the paths; only fail over to the other node if you
have to (one active path set per target host).
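For an ESXi 5.x host, pinning one active path per LUN can be done roughly like this (a sketch; the NAA identifier and path runtime name below are placeholders, not values from this thread):

```shell
# Placeholder device ID -- substitute your LUN's NAA identifier.
DEV="naa.60000000000000000000000000000001"

# Use the Fixed path selection policy so only one preferred path carries I/O.
esxcli storage nmp device set --device "$DEV" --psp VMW_PSP_FIXED

# Pin the preferred path (runtime name is a placeholder).
esxcli storage nmp psp fixed deviceconfig set --device "$DEV" \
    --path vmhba2:C0:T0:L0
```

With VMW_PSP_FIXED, ESXi sends all I/O down the preferred path and only moves to another path when that one dies.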


--Marc

> Thanks
>

webm...@openhome.org

Jun 7, 2014, 6:32:09 AM6/7/14
to esos-...@googlegroups.com
Hi Marc,


On Friday, June 6, 2014 9:41:07 PM UTC+1, Marc Smith wrote:
> On Fri, Jun 6, 2014 at 12:17 PM, <webm...@openhome.org> wrote:
>> Hi All,
>>
>> First of all - great project!
>
> Hi,
>
> Thank you.
>
>> I am currently prototyping a new SAN storage platform built upon ESOS which
>> will be used by 4 VMware hosts and
>> looking for some input/sanity check -
>>
>> 1. Is anyone else doing this with a vSphere 5.5 cluster which has DR/HA
>> enabled? (Auto failover of VMs etc. and heartbeating on datastores)
>
> Yes. We have (12) ESXi 5.5 hosts (4 per cluster) attached to (4) ESOS
> all-SSD disk arrays that are running in production. We use datastore
> heartbeating, and vSphere HA / DRS work correctly (Fibre Channel SAN).
>
>> 2. I note that the latest SCST has VAAI support / Atomic Test and Set and
>> may not need SCSI Reservations. Has anyone else tested this yet?
>> 3. The docs suggest that Multipath I/O with Round robin is a bad idea. Is
>> this actually an issue if all paths are in standby mode to the secondary
>> storage array with ALUA? (i.e. Active/Standby). Not sure what I am missing
>> here.
>
> I have a couple of new boxes set up in a cluster (ESOS) and I'm just
> about to start testing in the next day or two. I just recently updated
> SCST (a few days ago) which has the COMPARE AND WRITE functionality in
> vdisk_* handlers (for VAAI ATS). I can let you know in a few days, but
> I don't expect any problems -- I haven't seen any recent complaints on
> the scst-devel list about it.

Sounds good - the ATS code is fairly new, so I will be doing some extensive testing as well and will post my findings back to the list.
 
> 4. How to avoid data corruption when iSCSI Reservation is not shared across
> both primary/secondary node? is single node active enough? what about
> failover?
>

> Don't round-robin the paths; you only fail over to the other node if
> you have to (one active path set to a target host).

I need to be able to make proper use of multiple 1G links in my setup - so one path to each storage array won't work as well as expected (I need 2 Gb/s+ tx/rx for each datastore).

I was considering 4 x 1G paths to each storage array (i.e. 8 paths in total), with 4 active to the same array and the other 4 to the opposite array in standby (i.e. active/standby implicit ALUA), in round-robin mode (4 Gb/s tx/rx).

My testing seems to suggest that this might work when paths change state, and I don't see any I/O going to the standby storage controller.

Any thoughts? The alternative would be to look at multiple datastores, each with a different path active - but that won't give me 1 Gb/s+ in a single stream from a single datastore.
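For reference, a sketch of the ESXi side of that round-robin setup (the NAA identifier is a placeholder; the IOPS=1 tuning is a common suggestion for aggregating 1G links, not something from this thread):

```shell
# Placeholder device ID -- substitute your LUN's NAA identifier.
DEV="naa.60000000000000000000000000000001"

# Round Robin across all active (optimized) paths.
esxcli storage nmp device set --device "$DEV" --psp VMW_PSP_RR

# Optionally switch paths every I/O instead of the default 1000 IOPS,
# which often helps spread load across multiple 1G links.
esxcli storage nmp psp roundrobin deviceconfig set --device "$DEV" \
    --type iops --iops 1

# Verify which paths are active vs. standby/dead.
esxcli storage nmp path list --device "$DEV"
```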

Thanks

Marc Smith

Jun 7, 2014, 10:28:09 AM6/7/14
to esos-...@googlegroups.com
Yes, that sounds like it would work, assuming that the VMware round robin path policy only selects paths that are "optimized". Hopefully it would not select paths that are in the "non-optimized" ALUA state (the ones on your other node). Probably something you'll want to test to confirm, and if it doesn't work as expected, you could always experiment with the other target states.


--Marc

webm...@openhome.org

Jun 7, 2014, 4:21:04 PM6/7/14
to esos-...@googlegroups.com
> Yes, that sounds like it would work, assuming that the VMware round robin path policy only selects paths that are "optimized". Hopefully it would not select paths that are in the "non-optimized" ALUA state (the ones on your other node). Probably something you'll want to test to confirm, and if it doesn't work as expected, you could always experiment with the other target states.

Agreed. In this case I have modified the RA to use 'standby' and 'active' rather than active/non-optimized. I/O to the 'standby' paths is blocked, and VMware reports the paths to the secondary controller as 'dead'. The exact same paths are dead on all initiators in the cluster. These paths become active during failover, with the previously active paths going back to the standby state.

Worst case, if VMware tried to send I/O to a 'standby' path, the I/O would get blocked.
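For reference, flipping the implicit ALUA states through the SCST sysfs interface is roughly what such an RA change does; a sketch (the device group and target group names are placeholders, use whatever your scst.conf / RA defines):

```shell
# "esos", "local", and "remote" are placeholder names for the SCST
# device group and its two target groups (this node / peer node).
DG=/sys/kernel/scst_tgt/device_groups/esos

# On the active node: local targets active, peer targets standby.
echo active  > "$DG"/target_groups/local/state
echo standby > "$DG"/target_groups/remote/state

# Confirm the states SCST will report via REPORT TARGET PORT GROUPS.
cat "$DG"/target_groups/local/state "$DG"/target_groups/remote/state
```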

Marc Smith

Jun 7, 2014, 8:15:49 PM6/7/14
to esos-...@googlegroups.com
One thing that I noticed in my testing from a year or so ago: I
believe I was using the "offline" state (it could have been one of the
others, but 99% sure it was "offline") and with this ALUA state, the
ESXi initiators would not use the other paths under any circumstances
until the targets update the ALUA state (implicit ALUA) and change it
to something that can be used. This can be a problem: if your fabric
(your switches, or whatever is in between) fails rather than one of
the targets, the ALUA state would never change for the offline target,
and ESXi would never send any I/O down it. I suppose you could overcome
this by doing some sort of test from the targets, and then make the
cluster change state if it detects a fabric/switch failure, but that
seems more complex. Leaving it in the "non-optimized" state makes it
so ESXi will use that path in the event of a fabric failure.


>
> worst case scenario if VMware tried to send I/O to a 'standby' path the I/O
> will get blocked.

With SCST that's not really true for ALUA -- SCST will still process
I/O if the initiators send it to a target that isn't "active". I've
been using this SCST ALUA setup in ESOS for over a year in production
without any issues. I assume you've already seen this article, but if
not, here is what I wrote up for that setup:
http://marcitland.blogspot.com/2013/04/building-using-highly-available-esos.html

--Marc

webm...@openhome.org

Jun 7, 2014, 9:34:04 PM6/7/14
to esos-...@googlegroups.com
Thanks for the pointers - I have seen this too in testing.

Multiple redundant paths to each target would also prevent this, as a split in one of the network links would not take all the paths down
to the single target. e.g.

SAN 0 (eth1) -> SAN 1 Switch -> Initiator (eth1) - Implicit ALUA (Active)
SAN 0 (eth2) -> SAN 2 Switch -> Initiator (eth2) - Implicit ALUA (Active)

SAN 1 (eth1) -> SAN 1 Switch -> Initiator (eth1) - Implicit ALUA (Standby (or Offline))
SAN 1 (eth2) -> SAN 2 Switch -> Initiator (eth2) - Implicit ALUA (Standby (or Offline))

The initiator can see 4 paths - only two active, to SAN 0 (the other two marked as 'dead'). As both paths
to the active target take redundant routes through the SAN fabric, any switch outage etc. would only impact one of the paths - without having to make the storage
servers intelligent regarding path state.

During a switch failure, one path would still be active on the current target, at reduced performance, without switching to the secondary storage node.
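The layout above assumes each initiator NIC gets its own iSCSI session; on ESXi that typically means port binding, sketched here with placeholder adapter and vmkernel port names:

```shell
# vmhba33 / vmk1 / vmk2 are placeholders for your software iSCSI
# adapter and the vmkernel ports on the eth1/eth2 equivalents.
esxcli iscsi networkportal add --adapter vmhba33 --nic vmk1
esxcli iscsi networkportal add --adapter vmhba33 --nic vmk2

# List the bound ports to confirm.
esxcli iscsi networkportal list --adapter vmhba33
```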

>
> worst case scenario if VMware tried to send I/O to a 'standby' path the I/O
> will get blocked.

> With SCST that's not really true for ALUA -- SCST will still process
> I/O if the initiators send it to a target that isn't "active".


I don't know if that is the case if the path is marked as 'offline' or 'standby' by ALUA though.

In my testing I took both primary paths down with iptables filtering and the LUN blocked traffic, despite the secondary
storage array still having two connected paths - ALUA was blocking I/O (offline/standby state, and ESXi marked the paths as 'dead').
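The path-failure test described above can be simulated with iptables rules along these lines (a sketch; eth1 is a placeholder for one of the active target's interfaces):

```shell
# Block iSCSI (TCP 3260) on one interface of the active target to
# simulate a path failure; eth1 is a placeholder interface name.
iptables -A INPUT  -i eth1 -p tcp --dport 3260 -j DROP
iptables -A OUTPUT -o eth1 -p tcp --sport 3260 -j DROP

# Remove the rules afterwards to restore the path.
iptables -D INPUT  -i eth1 -p tcp --dport 3260 -j DROP
iptables -D OUTPUT -o eth1 -p tcp --sport 3260 -j DROP
```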

> I've
> been using this SCST ALUA setup in ESOS for over a year in production
> without any issues. I assume you've already seen this article, but if
> not, here is what I wrote up for that setup:
> http://marcitland.blogspot.com/2013/04/building-using-highly-available-esos.html

I have seen the article - it's a great write-up, I'm just not sure how to apply it to my situation,
as I need more than a single path active to my primary target for extra bandwidth (no budget for FC or 10G Ethernet).

Perhaps once I have completed some more thorough testing I could contribute
a write-up using iSCSI with standard 1G Ethernet for anyone else in my situation.

Thanks

Marc Smith

Jun 7, 2014, 10:47:31 PM6/7/14
to esos-...@googlegroups.com
Yes, that is good. I'll be setting up our next generation of clustered ESOS disk arrays in this manner.

I should probably modify the ESOS SCST RA so it allows the user to specify the ALUA states as parameters instead of hard-coding them.



>
> worst case scenario if VMware tried to send I/O to a 'standby' path the I/O
> will get blocked.

> With SCST that's not really true for ALUA -- SCST will still process
> I/O if the initiators send it to a target that isn't "active".


> I don't know if that is the case if the path is marked as 'offline' or 'standby' by ALUA though.

I'd have to double check the SCST README, but I'm pretty sure it says it will still honor/process any I/O sent to it, either way... hopefully all of the initiators behave. =)

 

> In my testing I took both primary paths down with iptables filtering and the LUN blocked traffic, despite the secondary
> storage array still having two paths being connected but ALUA blocking I/O (offline/standby state and ESXi marked paths as 'dead').

> I've
> been using this SCST ALUA setup in ESOS for over a year in production
> without any issues. I assume you've already seen this article, but if
> not, here is what I wrote up for that setup:
> http://marcitland.blogspot.com/2013/04/building-using-highly-available-esos.html

> I have seen the article - it's a great write-up, just not sure how to apply it to my situation
> as I need more than a single path active to my primary target for extra bandwidth (no budget for FC or 10G Ethernet).
>
> Perhaps once I have completed some more thorough testing I could contribute
> a write-up using iSCSI with standard 1G Ethernet for anyone else in my situation.

That would be great and much appreciated!


--Marc

webm...@openhome.org

Jun 12, 2014, 7:14:26 PM6/12/14
to esos-...@googlegroups.com
Hi Marc,

I am starting to make progress with this - I will need to modify the RA to be able to go into the standby state (or accept parameters to do so).

Are you able to accept patches to add this functionality?

Have you had a chance to try out VAAI/ATS yet?

Thanks

Marc Smith

Jun 13, 2014, 9:13:27 AM6/13/14
to esos-...@googlegroups.com
On Thu, Jun 12, 2014 at 7:14 PM, <webm...@openhome.org> wrote:
> Hi Marc,
>
> I am starting to progress on with this - will need to modify the RA to be
> able to go into standby state (or accept parameters to do so)
>
> Are you able to accept patches to add this functionality?

Yes, a patch would be great. I'm sure you already found it, but this is the RA:
http://code.google.com/p/enterprise-storage-os/source/browse/trunk/misc/ocf/scst

I need to revisit that as well -- I noticed during my initial setup it
didn't set the ALUA states correctly for one of the local/remote
target groups. I'll look at my notes later and see where I noticed the
problem.


>
> Have you had a chance to try out VAAI/ATS yet?

Yes, continuing to test, but so far no problems with VAAI/ATS and it
now shows "Supported" under Hardware Acceleration in vSphere.
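For anyone checking the same thing, the per-device VAAI/ATS support that ESXi reports can be read like this (the NAA identifier is a placeholder):

```shell
# Placeholder device ID -- substitute your LUN's NAA identifier.
# The output lists each VAAI primitive (including ATS) and whether the
# device supports it.
esxcli storage core device vaai status get \
    --device naa.60000000000000000000000000000001
```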


--Marc

Marc Smith

Jul 14, 2014, 5:11:14 PM7/14/14
to esos-...@googlegroups.com
I updated the SCST RA today to accept ALUA target port states specified as RA parameters ("m_alua_state" and "s_alua_state") and this change has been committed (r663). The new build should post shortly.
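A sketch of how the new parameters might appear in a Pacemaker configuration; the primitive name, agent path, and monitor settings are assumptions, and any other parameters the RA requires are omitted here:

```shell
# "p_scst" and ocf:esos:scst are assumed names; only the
# m_alua_state / s_alua_state parameters come from this change (r663).
crm configure primitive p_scst ocf:esos:scst \
    params m_alua_state="active" s_alua_state="standby" \
    op monitor interval="10s"
```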

I still have a bit more work to do on this RA to keep the local/remote group ALUA states consistent; I'll get to it later this week.


--Marc