Hi,
We are running Orchestrator in a 3-node raft configuration, deployed in Kubernetes, and we are monitoring about 50 MySQL instances with that single Orchestrator cluster.
We had a case where the nodes of certain MySQL clusters were not accessible at all: Orchestrator could not reach them, and they could not communicate with each other either.
This lasted for about 5 hours. Orchestrator detected a DeadMaster event and started a recovery, but the recovery could not complete because the nodes were unavailable, and every hour Orchestrator started another recovery. There is a config parameter that specifies the interval before another recovery can start while an old one has not completed, intended to prevent flip-flops; we have it set to one hour. After about 5 hours there were 5 recoveries ongoing (2 of them failed in the meantime, the other 3 continued). When access to the nodes was restored, those 3 recoveries all completed at around the same time, and the 4-node cluster ended up split into 2 independent clusters of 2 nodes each.
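For reference, I believe the parameter in question is RecoveryPeriodBlockSeconds; in our orchestrator.conf.json it is set roughly like this (only the relevant key shown):

    {
      "RecoveryPeriodBlockSeconds": 3600
    }

The 3600 seconds line up with the one-hour spacing between the repeated recovery attempts.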
I am posting the timeline of one of those recoveries below. I have a few questions regarding this.
1. If the old recovery is still ongoing, why would another recovery kick in? Shouldn't the old one FAIL before a new one is started?
2. If multiple recoveries are ongoing for the same cluster, is there a way to serialize them? In this case the recoveries were resetting 2 nodes from the same cluster at almost the same time (a few milliseconds apart).
3. Looking at the recovery timeline posted below, there are big delays between the first few steps. The steps up to "RecoverDeadMaster: regrouping replicas via GTID" should normally complete very quickly.
We suspected a resource problem that could make Orchestrator stall, but we have metrics and the nodes were very underutilized (CPU at only about 10%, load minimal, etc.).
In what circumstances can Orchestrator take such a long time to execute the first steps in the recovery process?
2020-06-20T17:54:16Z will handle DeadMaster event on NODE:PORT
2020-06-20T18:03:27Z Running 1 PreFailoverProcesses hooks
2020-06-20T18:15:14Z Running PreFailoverProcesses hook 1 of 1: echo 'Will recover from DeadMaster on NODE:PORT' >> /tmp/recovery.log
2020-06-20T18:26:22Z Completed PreFailoverProcesses hook 1 of 1 in 2.453457ms
2020-06-20T19:18:17Z done running PreFailoverProcesses hooks
2020-06-20T19:36:28Z RecoverDeadMaster: will recover NODE:PORT
2020-06-20T19:51:15Z RecoverDeadMaster: masterRecoveryType=MasterRecoveryGTID
2020-06-20T20:19:35Z RecoverDeadMaster: regrouping replicas via GTID
2020-06-20T21:51:30Z RecoverDeadMaster: promotedReplicaIsIdeal(PROMOTED:PORT)
2020-06-20T21:57:20Z RecoverDeadMaster: 0 postponed functions
2020-06-20T21:57:29Z checking if should replace promoted replica with a better candidate
2020-06-20T21:57:45Z + checking if promoted replica is the ideal candidate
2020-06-20T21:58:02Z + searching for an ideal candidate
2020-06-20T21:58:10Z + checking if promoted replica is an OK candidate
2020-06-20T21:58:18Z + searching for a candidate
2020-06-20T21:58:27Z + searching for a candidate
2020-06-20T21:58:43Z + found no server to promote on top promoted replica
2020-06-20T21:59:00Z replace-promoted-replica-with-candidate: promoted instance PROMOTED:PORT requires no further action
2020-06-20T21:59:17Z promoted replica: PROMOTED:PORT
2020-06-20T21:59:58Z RecoverDeadMaster: successfully promoted PROMOTED:PORT
2020-06-20T22:00:23Z - RecoverDeadMaster: promoted server coordinates: 4310_bin.000538:237211272
2020-06-20T22:00:48Z - RecoverDeadMaster: will apply MySQL changes to promoted master
2020-06-20T22:01:20Z - RecoverDeadMaster: applying RESET SLAVE ALL on promoted master: success=true
2020-06-20T22:01:53Z - RecoverDeadMaster: applying read-only=0 on promoted master: success=true
2020-06-20T22:02:34Z Writing KV []
2020-06-20T22:02:34Z - RecoverDeadMaster: applying read-only=1 on demoted master: success=false
2020-06-20T22:22:17Z - RecoverDeadMaster: updating cluster_alias: NODE:PORT -> PROMOTED:PORT
2020-06-20T22:26:38Z Running 1 PostMasterFailoverProcesses hooks
2020-06-20T22:31:08Z Running PostMasterFailoverProcesses hook 1 of 1: echo 'Recovered from DeadMaster on myqb28-rr-001:4310. Failed: myqb28-rr-001:4310; Promoted: PROMOTED:PORT' >> /tmp/recovery.log
2020-06-20T22:35:45Z Completed PostMasterFailoverProcesses hook 1 of 1 in 4.701475ms
2020-06-20T22:41:36Z Running PostFailoverProcesses hook 2 of 2: curl -s <COMMAND>
2020-06-20T22:47:18Z Completed PostFailoverProcesses hook 2 of 2 in 8.676853128s
2020-06-20T22:53:25Z done running PostFailoverProcesses hooks
2020-06-20T23:00:20Z Waiting for 0 postponed functions
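For context, the hooks referenced in the timeline are simple echo/curl commands along these lines in our orchestrator.conf.json (a rough sketch: hostnames and the curl command are redacted, and the first PostFailoverProcesses entry is omitted):

    {
      "PreFailoverProcesses": [
        "echo 'Will recover from {failureType} on {failedHost}:{failedPort}' >> /tmp/recovery.log"
      ],
      "PostMasterFailoverProcesses": [
        "echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Promoted: {successorHost}:{successorPort}' >> /tmp/recovery.log"
      ],
      "PostFailoverProcesses": [
        "...",
        "curl -s <COMMAND>"
      ]
    }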