As part of our DR strategy we require passive nodes in our DR site for some services. Rather than adopting a different strategy for each of these services, we'd like to use asynchronous disk storage replication.
We have forked BOSH, and I'm implementing this by modifying the BOSH director and agent to do the following:
1. Add a passive mode to the job definition that prevents the job from starting when it is enabled.
2. Change the BOSH agent to mount/unmount storage around start/stop instead of relying on the mount/unmount messages.
3. Include a DRBD build in the stemcell.
4. Change the BOSH agent to configure DRBD and restart it, updating the persistent store to be the DRBD mapping.
5. Switch over to using LVM on top of the partition.
6. Add a dynamic DNS update to the BOSH agent on startup.
7. Add a DRBD health monitor and add functionality to report status to the client.
8. Fix disk migration, which the above will have broken, probably by using LVM and live migration.
Items 1-3 are done; I'm currently working on 4.
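To give a rough idea of how items 1 and 2 fit together, here is a minimal sketch of the start/stop ordering. It is written in Python purely for illustration and is not the agent code; the `Node` class and its `log` entries are hypothetical names made up for this example.

```python
class Node:
    """Toy model of an agent that mounts storage around job start/stop."""

    def __init__(self, passive):
        self.passive = passive
        self.mounted = False
        self.running = False
        self.log = []  # ordered record of the actions taken

    def start(self):
        if self.passive:
            # A passive node neither mounts the disk nor starts the job.
            self.log.append("skip start (passive)")
            return
        if not self.mounted:
            self.mounted = True
            self.log.append("mount")
        self.running = True
        self.log.append("start job")

    def stop(self):
        # Stop the job first, then release the disk.
        if self.running:
            self.running = False
            self.log.append("stop job")
        if self.mounted:
            self.mounted = False
            self.log.append("unmount")
```

The point of the ordering is that the disk is only ever mounted on a node that is allowed to run the job, so the passive side never touches the replicated device.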
The new job specification will look something like:
- instances: 1
  name: postgres_test2
  networks:
  - name: cf1
    static_ips:
    - 10.93.230.19
  persistent_disk: 4096
  passive: true
  drbd:
    enabled: true
    force_master: false
    replication_node1: 10.93.228.18
    replication_node2: 10.92.230.19
    replication_type: A
    secret: mysecret
  properties:
    db: databases
  release: cf
  resource_pool: medium_z1
  template: postgres
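For item 4, the drbd block above would need to be turned into a DRBD resource definition on each node. The sketch below shows one way that mapping might look. It is only an illustration in Python, loosely following the DRBD 8.4 resource file layout; the resource name `r0`, the device and disk paths, port 7789, the `sha1` HMAC choice and the hostnames are all assumptions of mine and not what the agent currently does.

```python
def render_drbd_resource(drbd, hosts, resource="r0",
                         device="/dev/drbd0", disk="/dev/sdb1", port=7789):
    """Render a DRBD resource stanza from a job's `drbd` properties.

    `hosts` maps each replication IP to the hostname of that node
    (hypothetical input; DRBD `on` sections are keyed by hostname).
    """
    protocol = drbd["replication_type"]
    secret = drbd["secret"]
    lines = [
        f"resource {resource} {{",
        "  net {",
        f"    protocol {protocol};",
        "    cram-hmac-alg sha1;",
        f'    shared-secret "{secret}";',
        "  }",
    ]
    for key in ("replication_node1", "replication_node2"):
        ip = drbd[key]
        lines += [
            f"  on {hosts[ip]} {{",
            f"    device {device};",
            f"    disk {disk};",
            f"    address {ip}:{port};",
            "    meta-disk internal;",
            "  }",
        ]
    lines.append("}")
    return "\n".join(lines) + "\n"
```

Calling it with the drbd properties from the job above and a mapping such as `{"10.93.228.18": "node-a", "10.92.230.19": "node-b"}` gives a stanza with one `on` section per node. The exact option names should be checked against the DRBD version in the stemcell.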
Obviously this means you won't be able to run more than one instance for these types of servers.
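That restriction is the kind of thing the director could reject up front. A minimal sketch of such a check follows, again in Python for illustration only; `validate_job` is a hypothetical helper and not an existing BOSH function. The A/B/C check reflects the three DRBD replication protocols.

```python
def validate_job(job):
    """Return a list of problems for a job that uses DRBD replication."""
    errors = []
    drbd = job.get("drbd", {})
    if drbd.get("enabled"):
        # A replicated disk pairs exactly one node per site.
        if job.get("instances", 1) != 1:
            errors.append("drbd-enabled jobs must have exactly one instance")
        if drbd.get("replication_type") not in ("A", "B", "C"):
            errors.append("replication_type must be A, B or C")
    return errors
```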
We are currently planning on having two BOSH installations split over two datacentres. A cutover would involve changing the main site to 'passive', doing a bosh deploy, then changing the DR site to 'not passive' and doing another deploy. In a disaster we would have to use force_master in the DR site. We would not want to automatically cut over data services to the DR site without human eyes on it, due to the potential data loss.
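The ordering of the two procedures can be sketched as manifest changes, each followed by a bosh deploy against the relevant site. This is a Python illustration only; the helper names are hypothetical, and I'm assuming the disaster case also sets the DR job to not passive alongside force_master.

```python
import copy


def set_passive(job, passive, force_master=False):
    """Return a copy of the job with the passive/force_master flags set."""
    updated = copy.deepcopy(job)
    updated["passive"] = passive
    updated.setdefault("drbd", {})["force_master"] = force_master
    return updated


def planned_cutover(main_job, dr_job):
    """Ordered (site, job) changes; deploy after each one, in this order."""
    return [
        ("main", set_passive(main_job, True)),   # demote the main site first
        ("dr", set_passive(dr_job, False)),      # then promote the DR site
    ]


def disaster_cutover(dr_job):
    """Main site is gone: force the DR side to become master."""
    return [("dr", set_passive(dr_job, False, force_master=True))]
```

Demoting the main site before promoting DR means there is never a moment where both sides are active against the same replicated disk.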
Any application considered high availability will not use these services and will instead use active/active services. However, we do run a whole bunch of applications where consistency is valued over availability, and for those, active/passive instances suit our needs better.
It would be interesting to hear people's thoughts about this.
Ben.