Planning DR for AWS Environment

Ahmed Osman

Apr 5, 2013, 1:44:39 PM

to alfresco-techn...@googlegroups.com

Hello Everyone,

I’m working on deploying an Alfresco 4.1.3 Enterprise environment to AWS. The Production environment will be two clustered Alfresco instances with a separate Solr server, behind HAProxy, within a VPC. For DR/HA there will be a mirror environment in the same VPC, on a different subnet and AZ, identical except that it has only a single Alfresco instance. The content store is hosted in S3.

I’m looking at two possible configurations:

1) DR environment uses the same S3 content store but a separate database. The Production database is replicated to the DR database.

2) DR environment uses a separate S3 content store and database. The Production content store is replicated to the DR content store (the DR database is not replicated).

If anyone can provide some insight on which configurations are feasible I would greatly appreciate it.

Ahmed Osman

DevOps Engineer

Infrastructure Support Services

TIBCO Spotfire

Richard Esplin

Apr 5, 2013, 4:21:00 PM

to alfresco-techn...@googlegroups.com, Ahmed Osman

Disclaimer: I know we have customers with experience in this scenario (including our public cloud offering), but I haven't personally worked on an Amazon project. I am responding because I told you I would provide what advice I can.

As usual, this is a trade-off between cost and reliability. If you can afford to have a separate S3 content store, then I highly recommend you do it. It eliminates a large single point of failure which will save you if you are ever caught in one of Amazon's S3 outages. If the DR system lags the production system a little, it could also prevent some types of corruption that could occur at the moment the live system goes down.

On the other hand, I believe that Alfresco is more tolerant of problems in the content store than in the database. If the DR database gets corrupted, you won't have a functioning system. If the DR content store gets corrupted, then just the damaged content should be affected. For this reason, sharing the content store might be a reasonable trade-off.

It is also important to remember that disaster recovery does not replace the need for backups. Without a backup, a failed switchover to your DR system would be disastrous. Replicating large amounts of corrupt content or unintended deletions would also result in total system loss. Having multiple points in time from which you can restore provides an important safety net for your DR architecture.

Hopefully one of our resident Amazon experts will also contribute.

Richard

Michael McCarthy

Apr 5, 2013, 5:22:05 PM

to alfresco-techn...@googlegroups.com

Hi Ahmed,

As Richard said, DR/HA is a balance between time to recover and cost. A full discussion is probably beyond the scope of an email, but here are some generalities.

If you are going to have a cluster with a mirrored hot backup, I would recommend that you keep the two systems separate to eliminate a single point of failure. S3 is designed for 11 nines of durability (its availability target is lower, at 99.99%). Each S3 bucket lives in a single Region, so it might be smart to have your DR in a separate Region. Each S3 bucket is already stored across multiple availability zones, so a single AZ outage should not affect S3. Alfresco should be pretty fault tolerant with respect to the content store. Everything is handled in a transaction, so you may end up with additional files that have no corresponding record in the DB. Obviously that is no guarantee that your system will be consistent if you need to switch over to your spare. You could use any method you want to replicate your DB and content store. (There are several projects that allow you to access S3 like a drive.)

Below is a general discussion of HA/DR.

There are 3 areas that you need to cover, file system, database, and server.

File System:
As Richard suggested, S3 is an excellent choice here. You have many options for backup; the simplest is to back up your bucket on a regular basis. You can also set up a replicating content store if you like, using EBS or a second S3 bucket. Either way, you should be using a caching content store that caches to the local instance storage on the EC2 instance. Instance storage is much faster than S3 and will help with read performance. Write performance will be as slow as the slowest content store, which in this case is S3.
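
To make the read/write behaviour concrete, here is a minimal sketch of a read-through cache in Python. It is not Alfresco's caching content store; `SlowStore` and `CachingStore` are hypothetical names, and the in-memory `SlowStore` only stands in for an S3-backed store. It shows why repeated reads are served locally while every write still waits on the slower store.

```python
import hashlib
import os
import tempfile


class SlowStore:
    """Hypothetical in-memory stand-in for the S3-backed content store."""

    def __init__(self):
        self.blobs = {}
        self.reads = 0  # counts how often the slow store is actually read

    def put(self, key, data):
        self.blobs[key] = data

    def get(self, key):
        self.reads += 1
        return self.blobs[key]


class CachingStore:
    """Read-through cache on local disk in front of a slower backing store."""

    def __init__(self, backing, cache_dir):
        self.backing = backing
        self.cache_dir = cache_dir

    def _path(self, key):
        # Hash the key so any content URL maps to one safe local file name.
        name = hashlib.sha256(key.encode("utf-8")).hexdigest()
        return os.path.join(self.cache_dir, name)

    def put(self, key, data):
        # The write goes to the backing store first, so write latency is
        # bounded by the slowest store; the local copy is a bonus.
        self.backing.put(key, data)
        with open(self._path(key), "wb") as fh:
            fh.write(data)

    def get(self, key):
        path = self._path(key)
        if os.path.exists(path):
            # Cache hit: served from local disk, the slow store is not touched.
            with open(path, "rb") as fh:
                return fh.read()
        # Cache miss: fetch from the slow store and keep a local copy.
        data = self.backing.get(key)
        with open(path, "wb") as fh:
            fh.write(data)
        return data
```

A cache directory from `tempfile.mkdtemp()` is enough to try it out; on EC2 the directory would sit on instance storage, which is lost when the instance stops, so the cache must never be the only copy.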

Database:
On AWS, the easiest thing you can do is use RDS MySQL with Multi-AZ, and possibly a read replica for performance. An RDS instance with Multi-AZ provides HA: the second server is kept in sync through replication and can act as a hot backup in the event that your primary DB server goes down. A read replica provides additional read throughput; again, writes go through a single point.
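
As a rough sketch of that setup, the function below creates a Multi-AZ MySQL instance and then a read replica. It assumes the boto3 RDS client (for example `boto3.client("rds")`), which is passed in rather than imported here. The instance class, storage size, retention period and the `-replica` suffix are placeholder assumptions, not recommendations from this thread.

```python
def provision_database(rds, name, username, password):
    """Create a Multi-AZ MySQL instance plus one read replica.

    `rds` is assumed to be a boto3 RDS client. Returns the replica identifier.
    """
    rds.create_db_instance(
        DBInstanceIdentifier=name,
        Engine="mysql",
        DBInstanceClass="db.m5.large",   # placeholder size
        AllocatedStorage=100,            # GiB, placeholder
        MasterUsername=username,
        MasterUserPassword=password,
        MultiAZ=True,                    # standby in a second AZ
        BackupRetentionPeriod=7,         # automated backups must be enabled
                                         # before a replica can be created
    )
    # A replica can only be created once the source instance is available.
    rds.get_waiter("db_instance_available").wait(DBInstanceIdentifier=name)

    replica_name = name + "-replica"     # hypothetical naming convention
    rds.create_db_instance_read_replica(
        DBInstanceIdentifier=replica_name,
        SourceDBInstanceIdentifier=name,
    )
    return replica_name
```

Note that the Multi-AZ standby and the read replica do different jobs: the standby only takes over on failure, while the replica serves reads and lags the primary slightly.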

Server(s):
You can do either a hot backup or a cluster here. A hot backup can be achieved with two Alfresco servers and an Elastic IP address. If the primary server goes down, you can reassign the Elastic IP to the hot backup. This can be done through scripting or through the AWS Management Console. These servers should be in different availability zones. More than likely, you will want a cluster. A cluster would be two Alfresco servers working together at the same time. Again, these should be in two availability zones. If you use a cluster you will want a load balancer in front. You mentioned HAProxy, but I would recommend an ELB instead. ELBs are elastic by nature, so they can scale to handle additional load.
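
The scripted Elastic IP reassignment could look roughly like the sketch below. It assumes the boto3 EC2 client (for example `boto3.client("ec2")`), passed in as `ec2`. The function name `fail_over` and the `is_healthy` callback are hypothetical; how health is judged (an HTTP probe against Alfresco, an instance status check, and so on) is left to the caller.

```python
def fail_over(ec2, allocation_id, primary_id, standby_id, is_healthy):
    """Move the Elastic IP to the standby if the primary fails its check.

    `ec2` is assumed to be a boto3 EC2 client and `allocation_id` the
    allocation ID of a VPC Elastic IP. Returns the instance ID that should
    hold the address afterwards.
    """
    if is_healthy(primary_id):
        # Primary is fine: leave the address where it is.
        return primary_id

    # AllowReassociation lets the address move even though it is still
    # associated with the failed primary.
    ec2.associate_address(
        AllocationId=allocation_id,
        InstanceId=standby_id,
        AllowReassociation=True,
    )
    return standby_id
```

In practice a script like this would run on a schedule from somewhere outside the primary's availability zone, and would require several consecutive failed checks before moving the address, to avoid flapping.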

Here are a couple of resources for you. The first is a presentation I gave at DevCon in 2012. The second is probably too simple for you but I thought I would include it for completeness.
http://devcon.alfresco.com/sanjose/sessions/build-scalable-elastic-alfresco-cluster-aws-5-steps
http://www.alfresco.com/events/webinars/aws-and-alfresco-everything-you-need-know-you-go-devcon

Ahmed Osman

Apr 5, 2013, 6:44:20 PM

to alfresco-techn...@googlegroups.com

Hi Michael,

Thank you VERY much for the detailed response, this is incredibly valuable information (I’ll be watching the two videos you provided later).

The main goal with the environment I described is to provide HA. The DR environment is primarily there to reduce any potential downtime should something happen to the Prod one (for example, if an AZ goes down). I should have elaborated a bit more on the architecture: I’ll have an HAProxy instance in front of Alfresco in both DR and Prod, with an ELB in front of that. In the event that Prod goes down, the ELB will immediately switch over to DR.

I had initially looked into setting up DR in a separate region but had some concerns over how quickly database changes/transactions would be replicated. For that reason I’m putting it in the same VPC, just in a separate AZ.

Using a read replica is an excellent suggestion; I hadn’t considered setting that up. Not only will this give us increased redundancy, it will also improve performance. Regarding backups, I’m planning to rely on RDS snapshots triggered via the API. For the content store, I’m looking into setting up incremental backups to a separate S3 bucket. As for the Solr and Alfresco servers, that’ll most likely be handled with EBS snapshots pushed to the same S3 bucket used for the incremental backups. As I mentioned earlier, I’m aiming for zero downtime, so handling deleted content is almost secondary in this case (potentially making use of a test environment for recovering deleted content).
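
A minimal sketch of the two backup pieces mentioned above, assuming boto3 S3 and RDS clients are passed in. All function names are hypothetical, and the bucket listings are assumed to be available already as `{key: ETag}` dictionaries. Comparing ETags is a simplification: copies of multipart or encrypted objects can end up with a different ETag, in which case comparing on key alone may be the safer test.

```python
import datetime


def plan_incremental_copy(source, backup):
    """Return the keys to copy so the backup catches up with the source.

    `source` and `backup` map object key -> ETag. Keys that exist only in
    the backup are deliberately left alone, so a deletion in production is
    not propagated to the backup.
    """
    return sorted(k for k, etag in source.items() if backup.get(k) != etag)


def run_incremental_copy(s3, src_bucket, dst_bucket, keys):
    """Server-side copy of each planned key (s3: assumed boto3 S3 client)."""
    for key in keys:
        s3.copy_object(
            Bucket=dst_bucket,
            Key=key,
            CopySource={"Bucket": src_bucket, "Key": key},
        )


def snapshot_database(rds, instance_id, now=None):
    """Trigger a manual RDS snapshot named after the instance and UTC time.

    `rds` is assumed to be a boto3 RDS client. Returns the snapshot ID.
    """
    now = now or datetime.datetime.now(datetime.timezone.utc)
    snapshot_id = "%s-%s" % (instance_id, now.strftime("%Y%m%d-%H%M%S"))
    rds.create_db_snapshot(
        DBSnapshotIdentifier=snapshot_id,
        DBInstanceIdentifier=instance_id,
    )
    return snapshot_id
```

Taking the database snapshot before copying the content keeps the backup on the safe side: the content backup may then hold a few files the database snapshot does not reference, which is harmless, rather than the reverse.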

My main concern with option 1 was making sure there wouldn’t be any potential conflicts from having the DR server pointed to the same database, as it will not be part of the production cluster.

Michael McCarthy

Apr 7, 2013, 6:40:51 PM

to alfresco-techn...@googlegroups.com

In Alfresco it is generally not an issue to have "extra" content on the filesystem. If you have additional content in S3 that does not exist in your DB, then Alfresco will just not know of its existence. You run into problems in the opposite scenario, when the DB has transactions with no matching content on your filesystem. So in option 1 that you outlined, I would expect the worst thing that would happen is that you have additional content in your content store in S3 but no corresponding transaction in your DB. I haven't thoroughly thought about the situation, so there may be a circumstance where this wouldn't work, but I can't think of one.
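
The asymmetry can be expressed as a simple set comparison. The sketch below is hypothetical: it assumes you have already extracted the content URLs the database references and the keys present in the content store, and it does not use any Alfresco API.

```python
def check_consistency(db_content_urls, store_keys):
    """Compare what the DB references with what the content store holds.

    Returns a dict with two sorted lists:
      "missing"  - referenced by the DB but absent from the store (a problem)
      "orphaned" - present in the store but unknown to the DB (harmless)
    """
    referenced = set(db_content_urls)
    stored = set(store_keys)
    return {
        "missing": sorted(referenced - stored),
        "orphaned": sorted(stored - referenced),
    }
```

After a switchover, a non-empty "missing" list is the case that needs attention; "orphaned" entries only cost storage.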

Feel free to reach out directly if you want to discuss more.

Thanks,
Michael
