What are the best practices for coordinating a switchover or failover with entities outside of the databases? For example, take a basic switchover:
Pre-outage actions
Verifying the topology of the cluster being changed
Choosing a replica to become the new master (may depend on DC location, state of the instance, how much redo lag exists, etc.)
Creating blackouts in monitoring and notification systems
Sending notifications to interested parties
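To make the candidate-selection step concrete, here is a minimal sketch of how I picture it. The `Replica` fields, the `choose_candidate` function, and the ordering of the criteria (preferred DC first, then lowest lag) are assumptions for illustration only. None of it is taken from Orchestrator.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Replica:
    # Hypothetical fields for illustration; not Orchestrator's data model.
    host: str
    datacenter: str
    healthy: bool
    lag_seconds: float


def choose_candidate(
    replicas: Sequence[Replica],
    preferred_dc: str,
    max_lag_seconds: float = 5.0,
) -> Optional[Replica]:
    """Return the best promotion candidate, or None if nothing qualifies."""
    # Discard replicas that are unhealthy or too far behind.
    eligible = [
        r for r in replicas
        if r.healthy and r.lag_seconds <= max_lag_seconds
    ]
    if not eligible:
        return None
    # Prefer the requested DC, then the least lag, then the host name
    # so that ties resolve deterministically.
    return min(
        eligible,
        key=lambda r: (r.datacenter != preferred_dc, r.lag_seconds, r.host),
    )
```

Returning `None` when no replica qualifies lets the caller abort the switchover before any outage begins, rather than promoting a poor candidate.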
Outage actions
Notifying upstream elements to stop sending traffic to the master and waiting for their ACKs
Ensuring that the master being switched away from cannot accept any new read/write traffic
Allowing in-flight transactions to complete and/or killing them
Ensuring that all logs needed by the replica that will become the new master have been received (and possibly applied)
Starting the shutdown of the master being switched away from
Enabling writes on the new master
Notifying upstream elements of the endpoint of the new master and waiting for their ACKs
Notifying upstream elements to start sending traffic to the master and waiting for their ACKs
Post-outage actions
Ensuring that the old master shuts down cleanly, and restarting it and bringing it down cleanly if it does not
Reinstating the old master as a replica as appropriate
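The sequence above could be sketched as an ordered list of steps that stops at the first failure, wherever that logic ends up living. This is only a rough outline of the coordination I have in mind: the `Switchover` class, the `wait_for_acks` helper, and the step names are all hypothetical, and the lambdas stand in for real checks or calls to external systems.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set, Tuple

# A step returns True on success, for example once all upstream ACKs arrived.
Step = Callable[[], bool]


@dataclass
class Switchover:
    # Each entry is (phase, step name, callable); all names are illustrative.
    steps: List[Tuple[str, str, Step]] = field(default_factory=list)
    log: List[str] = field(default_factory=list)

    def add(self, phase: str, name: str, step: Step) -> None:
        self.steps.append((phase, name, step))

    def run(self) -> bool:
        """Run the steps in order and stop at the first failure."""
        for phase, name, step in self.steps:
            ok = step()
            self.log.append(f"{phase}: {name}: {'ok' if ok else 'FAILED'}")
            if not ok:
                return False
        return True


def wait_for_acks(expected: Set[str], received: Set[str]) -> bool:
    """Placeholder ACK check: succeed only if every upstream element replied."""
    return expected <= received


if __name__ == "__main__":
    upstream = {"proxy-1", "proxy-2"}  # hypothetical upstream elements
    plan = Switchover()
    plan.add("pre-outage", "verify topology", lambda: True)
    plan.add("outage", "stop traffic",
             lambda: wait_for_acks(upstream, {"proxy-1", "proxy-2"}))
    plan.add("outage", "enable writes on new master", lambda: True)
    plan.add("post-outage", "reinstate old master as replica", lambda: True)
    print(plan.run())
    print("\n".join(plan.log))
```

Stopping at the first failed step matters most around the ACKs: writes should not be enabled on the new master until every upstream element has confirmed it stopped sending traffic to the old one.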
Should this be driven by Orchestrator through hooks to external scripts/binaries, or is the expectation that the overall orchestration will be done outside of Orchestrator?
Regards,
John Smiley
Head of Databases
Proton Technologies AG