Anyone here using the failover/failback functionality in SyncIQ?


Jason Davis

Jan 24, 2013, 3:47:13 PM
to isilon-u...@googlegroups.com
We have two Isilon clusters at the present time, and the intent was to use SyncIQ to replicate certain data sets for semi-HA.

I know that SyncIQ is really designed for DR, almost like an automated backup. However, in OneFS 7.x they have made this process somewhat easier.

So, is anyone doing this now in their environment? If so, how have you gone about streamlining the process?

Robert Soeting

Jan 25, 2013, 1:33:44 PM
to isilon-u...@googlegroups.com
Yup, two clusters running OneFS 7.x: 7 daily snaps on the primary, and SyncIQ to the other cluster for DR but also for extended backup. On the other cluster we keep 7 daily, 4 weekly and 12 monthly snaps.
We started out with a 10-minute sync, but currently we are back to hourly, as the snapshot deletes were up to 2,500 on busy days.
Failover and failback have improved, but do not forget to suspend your schedule. I completely messed up the sync that way.

SCR512

Jan 28, 2013, 9:03:40 AM
to isilon-u...@googlegroups.com
Are you guys using something along the lines of a DNS CNAME record to point to your SmartConnect zone at the primary site, and then having it flip over to the DR site in the event of an emergency?

What I am envisioning is having a DNS CNAME record in front of the primary site's SmartConnect zone. In the event of a disaster, I could make the DR site read-write and point my CNAME at its SmartConnect zone.

bob.maynard

Jan 31, 2013, 4:04:52 AM
to isilon-u...@googlegroups.com
We are looking to set up a DR cluster along the same lines as you, and are also having problems finding information about how to do this. Has anyone come across any whitepapers or the like?

Robert Soeting

Jan 31, 2013, 1:13:27 PM
to isilon-u...@googlegroups.com
I based our solution on the following:
White Paper- Best Practices for Data Replication with EMC Isilon SyncIQ.pdf

SCR512

Jan 31, 2013, 1:56:57 PM
to isilon-u...@googlegroups.com
Yup, this pretty much details what I've come to understand. So, in order to redirect clients to a r/w DR cluster, it would be easiest just to swing a CNAME over to the DR's SmartConnect zone.

                 files.local (CNAME)
                         |
                         v
              primary.local (SC Zone)


So, in the event of a failure of the primary, I would just do the following (after prepping the SyncIQ target to allow r/w):

                 files.local (CNAME)
                         |
                         v
              dr.local (SC Zone)


Sound sane? :)
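
In DNS terms, a minimal sketch of the records behind those two pictures (the SSIP host names and addresses are made-up placeholders; the assumption is that each SmartConnect zone is delegated to its cluster's SmartConnect service IP):

; normal operation: files.local follows the CNAME to the primary zone
files.local.          IN  CNAME  primary.local.
primary.local.        IN  NS     ssip-primary.local.   ; delegation to the primary SSIP (placeholder)
ssip-primary.local.   IN  A      10.0.0.10             ; placeholder address
dr.local.             IN  NS     ssip-dr.local.        ; delegation to the DR SSIP (placeholder)
ssip-dr.local.        IN  A      10.0.1.10             ; placeholder address

; failover: only the CNAME changes
files.local.          IN  CNAME  dr.local.

The delegations stay static and only the CNAME moves, so the failover itself is a single record change (plus whatever TTL you put on it).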

C Ball

Mar 28, 2013, 2:48:41 PM
to isilon-u...@googlegroups.com


I'm in a similar situation, though not yet on 7.0. We originally deployed two clusters for an archive service, for which replication plus direct access to a distinct "readonly" fileserver name on the DR cluster meets the requirements.

New services will require a quick turnaround on moving file access to the DR location. The DNS CNAME configuration doesn't work for us.
Possible options being considered:

1. Use a load balancer to direct DNS traffic to the (SmartConnect) DNS server on the active cluster
  • Storage admins are delegated the ability to mark LB targets active or inactive
  • Automatic LB failover config may work, but may also be prone to transient issues
  • Works for CIFS as well as NFS
2. Mount both primary and failover on all clients and use an NFS replica mount or a sync tool to manage the root of the tree
    Here's the replica version, which hopefully explains the idea:
Primary (set up directories):
mkdir -p /ifs/wnas/svc1/content   # directory holding content
mkdir /ifs/wnas/svc1/root         # service root directory
cd /ifs/wnas/svc1/root
# make link for primary service
ln -s /svc1-primary/content content
[ sync /ifs/wnas/svc1 to alternate cluster ]

Primary and alternate:
isi nas export --path=/ifs/wnas/svc1/content
isi nas export --path=/ifs/wnas/svc1/root

Client:
mount svc1-primary:/ifs/wnas/svc1 /svc1-primary
mount svc1-alternate:/ifs/wnas/svc1 /svc1-alternate
# replica read-only mount
mount -o ro svc1-primary:/ifs/wnas/svc1/root,svc1-alternate:/ifs/wnas/svc1/root /svc1

# Move the service to the alternate
[ if the primary is out of service, make the alternate RW locally on the Isilon ]
cd /ifs/wnas/svc1/root; rm content
# make link for alternate service
ln -s /svc1-alternate/content content

The second method seems desirable from the perspective that it's manipulating filesystem content to manage the service, but if the primary goes down and the alternate's svc1/root is changed, there would be a gap when the primary is returned to service during which its copy of svc1/root would be inconsistent. This may be OK, since stable clients would not automatically flip back to the primary.
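
For completeness, the cleanup on the primary once it returns would look something like this (same paths as above; the assumption is that it runs before the primary's root export is put back into service):

# the primary's copy of svc1/root still points at /svc1-primary/content;
# repoint it at the alternate so returning clients don't follow a stale link
cd /ifs/wnas/svc1/root
rm content
ln -s /svc1-alternate/content content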

Questions:
- Would a [separate] NFSv4 server for the top level, using a referral instead of a symlink, be a way to achieve this so clients could use a single mount?
  Does Isilon support referrals? Also, how often do clients check for and detect a change in the referral definitions?
- Is either of these alternatives clearly best?
- Is there a way to set up NFS exports or a SmartConnect zone so that it must be manually activated after an event like a power failure?

Jason Davis

Mar 28, 2013, 3:46:03 PM
to isilon-u...@googlegroups.com
I've done testing with the new SyncIQ failover/failback in conjunction with the DNS CNAME flip, and it seems to work for us. I would be somewhat concerned about sticking a load balancer in front, as this is yet another point of complexity/failure (we considered this).

With the new SyncIQ functionality in 7.x, the secondary cluster's files are in an RO state. When you fail over and back, this flip-flops between the two clusters you are syncing with SyncIQ. Also, when failing back from secondary to primary, changes made on the secondary are synced back. All in all, it's fairly atomic.


I suppose you could have a bit of command-line work go and remove/re-add network adapters from the SmartConnect configuration. The same could be done for removing/adding NFS exports via script.

TBH this sounds like overkill really, although if you aren't on 7.0 then I can see why you would do that.

*Knock on wood* 7.0.1.2 (7.0.1.4 is the latest and greatest at the present time) has been good to us so far. We're about to add some X400s into our primary cluster, so it will be fun to see how all of this behaves.






Keith Nargi

Mar 28, 2013, 7:28:19 PM
to isilon-u...@googlegroups.com, isilon-u...@googlegroups.com
Failover/failback works really well for data now with 7.0. If DNS is a concern and CNAMEs don't work for you, a nice feature you might want to look at is adding a SmartConnect zone alias to your failover cluster that matches the name of your primary cluster. If you do this you won't need to leverage a CNAME; all you need to do is change the SIP address of the primary cluster to be that of the failover cluster.
Example:
Primarycluster.company.com is your primary cluster's referral, and its SIP record is sip.company.com with an address of 10.10.10.200.
Your failover cluster has a referral of failto.company.com, with a SIP record called fsip.company.com with an address of 20.20.20.200.

On the failover cluster you issue this command:

isi networks modify pool --name=subnet0:pool0 --add-zone-alias=primarycluster.company.com

That will add the primary cluster's name as an alias on your failover cluster.
Now if you want to fail over, swap the SIP DNS record for the primary cluster to the SIP of the failover cluster and you are good to go.
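
If your DNS allows dynamic updates, that SIP swap can be scripted; here is a minimal nsupdate sketch using the addresses above (ns1.company.com and the key file are placeholders; if the zone only takes manual edits you would make the same A record change by hand):

nsupdate -k /etc/isilon-failover.key <<'EOF'
server ns1.company.com
update delete sip.company.com. A
update add sip.company.com. 300 A 20.20.20.200
send
EOF

Failing back is the same change in reverse: point sip.company.com back at 10.10.10.200.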



Sent from my iPhone

Jason Davis

Mar 28, 2013, 9:33:56 PM
to isilon-u...@googlegroups.com

Ah that's nice.

Using a SmartConnect zone alias plus the SIP swap seems like a much more elegant solution than DNS CNAME trickery. I'll have to try this out.

Thanks for the explanation!

C Ball

Mar 29, 2013, 9:09:40 AM
to isilon-u...@googlegroups.com
The data management sounds great in 7.0, though the 6.5 version works for us so far. We've just about decided that failover to RO is preferable. This allows service owners to make the decision about accepting possible data loss (RPO), or about the effort of recovering orphaned content from the primary, instead of us routinely enabling RW on failover.

Thank you for the explanation of setting up SmartConnect aliases.

Our service-level objective for failover is 15 minutes, but our OLA with the networking group for a manual DNS change is 2 hours. This is the reason I was thinking of using a load balancer, which has delegated admin-access functionality, for the purpose of managing which cluster gets the DNS queries. We can allocate a new subnet so that the load balancer is introduced only for services which need fast failover. A possible limitation of this approach is that every service using this subnet's SIP server would move at once; we couldn't use this to split load (yes, this is a different scenario).

Hmm ... while the main campus DNS requires manual changes by the core group, Active Directory delegates the ability to register hosts to OU admins. I'll check out hosting the SIP server's A record in an AD subdomain.

As I think more about this, we may have a preference for the three-way mount using an NFS replica or NFSv4 referral, as the admins for a service (who get the notification) can change the symlink/referral. We have set up configuration management which enables provisioning of limited delegation capability (e.g. setting quotas on directories within a storage allocation), so content owners should be up for this. With the 7.0 enhancements for data management, we'll look at moving sync policies to the service level and enabling service admins to separately manage file-service failover and activating RW on the DR cluster.


Jason Davis

Apr 1, 2013, 1:46:38 PM
to isilon-u...@googlegroups.com
Does anyone have any experience with (or an explanation of) how SyncIQ behaves if the primary cluster is completely offline (no power, or a general network outage)?

If I set my DR cluster to allow_write, can I reasonably assume that once the primary cluster comes back up, it will see that the DR side has been made writable?


Keith Nargi

Apr 1, 2013, 5:31:23 PM
to isilon-u...@googlegroups.com
Without getting too deep, it works something like this:

The local target (the DR cluster) gets a domain ID and a domain ID value; let's say it's 0. You can see that by doing an isi domain list --long. That domain ID basically says that every file in this part of the tree is part of a SyncIQ job and is locked and marked as RO. When your primary cluster is up and the DR cluster is read-only, that domain ID's value means RO. As soon as you go in and allow writes on the DR cluster, that value now means RW versus read-only. When the primary cluster comes back online and sees that the domain ID's value is now RW, the SyncIQ job will report that it can run because the target cluster is writable. When you go to fail back, the primary cluster initiates the failback process by creating a new SyncIQ job (a mirror policy) on the DR cluster. This job does a comparison of the DR cluster's files and the primary cluster's files. When you run that mirror policy, only the updated or new files are synced back over.

At this point your primary cluster becomes the local target and will also have a domain ID with a value of RO. (Quick aside: to speed up the first failback, you can give your primary a domain ID ahead of time so that all of the files already have a value, meaning RW, ahead of time.) Once that mirror policy runs you can go back to your primary cluster, allow writes, and you are back to your original setup. Obviously you have to do some DNS things to get clients back over to the primary cluster, but this is kind of how it works.
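
A very rough CLI sketch of that sequence; the recovery subcommand names below are an assumption from memory and differ between 7.x releases, so check the isi sync help on your version before relying on them (<policy> is your SyncIQ policy name):

# on the DR cluster: break the read-only lock and allow writes on the target domain
isi sync recovery allow-write <policy>        # assumed subcommand name
# inspect the SyncIQ domains and their state
isi domain list --long
# on the primary, once it is healthy again: prepare failback, which creates the
# mirror policy on the DR cluster (assumed subcommand name)
isi sync recovery resync-prep <policy>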

Hope this helps
Keith
--
Keith 

Keith Nargi

Apr 1, 2013, 5:34:17 PM
to isilon-u...@googlegroups.com, Keith Nargi
This is supposed to say:
As soon as you go in and allow writes on the DR cluster, that value now means RW versus read-only. When the primary cluster comes back online and sees that the domain ID's value is now RW, the SyncIQ job will report that it can't run because the target cluster is writable.
--
Keith 

Daniel Cornel

Sep 17, 2013, 5:57:42 PM
to isilon-u...@googlegroups.com
Thank you for this explanation. I was puzzling over options this morning, and this is by far the fastest. I assume that you remove the alias when it is not needed, and manually add it when you intend to fail over to that particular cluster. Does anyone have any ideas on how to automate this and send a notification?
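
A very rough sketch of what that automation might look like, reusing the alias command Keith posted above (the SSH target host, the key handling, and the notification address are all placeholders):

#!/bin/sh
# add the primary cluster's name as a zone alias on the failover cluster,
# then mail the storage team so the failover is visible
ssh root@failto.company.com \
  "isi networks modify pool --name=subnet0:pool0 --add-zone-alias=primarycluster.company.com" \
  && echo "Zone alias primarycluster.company.com activated on failto.company.com at $(date)" \
     | mail -s "Isilon failover: SmartConnect alias activated" storage-team@company.com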