Hi,
We are running Orchestrator in a 3-node raft configuration, deployed in Kubernetes, and we are monitoring about 50 MySQL instances with that single Orchestrator cluster.
We had a case where the nodes of certain MySQL clusters were not accessible at all: Orchestrator could not reach them, and they could not communicate with each other either.
This lasted for about 5 hours. Orchestrator detected a DeadMaster event and started a recovery, but the recovery could not complete because the nodes were unavailable, and every hour Orchestrator started another recovery. There is a config parameter that specifies the interval before another recovery can start while an old one has not completed, intended to prevent flip-flops; we have it set to one hour. After about 5 hours there were 5 recoveries ongoing (2 of them failed in the meantime, the other 3 continued). When access to the nodes was restored, those 3 recoveries all completed at around the same time, and the 4-node cluster ended up split into 2 independent clusters of 2 nodes each.
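For reference, I believe the parameter in question is RecoveryPeriodBlockSeconds; in our orchestrator.conf.json it is set roughly like this (only the relevant key shown):

    {
      "RecoveryPeriodBlockSeconds": 3600
    }

The 3600 seconds line up with the one-hour spacing between the repeated recovery attempts.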
I am posting the timeline of one of those recoveries below. I have a few questions regarding this.
1. If the old recovery is still ongoing, why would another recovery kick in? Shouldn't the old one FAIL before a new one is started?
2. If multiple recoveries are ongoing for the same cluster, is there a way to serialize them? In this case the recoveries were resetting 2 nodes from the same cluster at almost the same time (a few milliseconds apart).
3. Looking at the recovery timeline posted below, there are big delays between the first few steps. The steps up to "RecoverDeadMaster: regrouping replicas via GTID" should normally complete very quickly.
We suspected a resource problem that could make Orchestrator stall, but we have metrics and the nodes were very underutilized (CPU at only about 10%, load minimal, etc.).
In what circumstances can Orchestrator take such a long time to execute the first steps in the recovery process?
2020-06-20T17:54:16Z will handle DeadMaster event on NODE:PORT
2020-06-20T18:03:27Z Running 1 PreFailoverProcesses hooks
2020-06-20T18:15:14Z Running PreFailoverProcesses hook 1 of 1: echo 'Will recover from DeadMaster on NODE:PORT' >> /tmp/recovery.log
2020-06-20T18:26:22Z Completed PreFailoverProcesses hook 1 of 1 in 2.453457ms
2020-06-20T19:18:17Z done running PreFailoverProcesses hooks
2020-06-20T19:36:28Z RecoverDeadMaster: will recover NODE:PORT
2020-06-20T19:51:15Z RecoverDeadMaster: masterRecoveryType=MasterRecoveryGTID
2020-06-20T20:19:35Z RecoverDeadMaster: regrouping replicas via GTID
2020-06-20T21:51:30Z RecoverDeadMaster: promotedReplicaIsIdeal(PROMOTED:PORT)
2020-06-20T21:57:20Z RecoverDeadMaster: 0 postponed functions
2020-06-20T21:57:29Z checking if should replace promoted replica with a better candidate
2020-06-20T21:57:45Z + checking if promoted replica is the ideal candidate
2020-06-20T21:58:02Z + searching for an ideal candidate
2020-06-20T21:58:10Z + checking if promoted replica is an OK candidate
2020-06-20T21:58:18Z + searching for a candidate
2020-06-20T21:58:27Z + searching for a candidate
2020-06-20T21:58:43Z + found no server to promote on top promoted replica
2020-06-20T21:59:00Z replace-promoted-replica-with-candidate: promoted instance PROMOTED:PORT requires no further action
2020-06-20T21:59:17Z promoted replica: PROMOTED:PORT
2020-06-20T21:59:58Z RecoverDeadMaster: successfully promoted PROMOTED:PORT
2020-06-20T22:00:23Z - RecoverDeadMaster: promoted server coordinates: 4310_bin.000538:237211272
2020-06-20T22:00:48Z - RecoverDeadMaster: will apply MySQL changes to promoted master
2020-06-20T22:01:20Z - RecoverDeadMaster: applying RESET SLAVE ALL on promoted master: success=true
2020-06-20T22:01:53Z - RecoverDeadMaster: applying read-only=0 on promoted master: success=true
2020-06-20T22:02:34Z Writing KV []
2020-06-20T22:02:34Z - RecoverDeadMaster: applying read-only=1 on demoted master: success=false
2020-06-20T22:22:17Z - RecoverDeadMaster: updating cluster_alias: NODE:PORT -> PROMOTED:PORT
2020-06-20T22:26:38Z Running 1 PostMasterFailoverProcesses hooks
2020-06-20T22:31:08Z Running PostMasterFailoverProcesses hook 1 of 1: echo 'Recovered from DeadMaster on myqb28-rr-001:4310. Failed: myqb28-rr-001:4310; Promoted: PROMOTED:PORT' >> /tmp/recovery.log
2020-06-20T22:35:45Z Completed PostMasterFailoverProcesses hook 1 of 1 in 4.701475ms
2020-06-20T22:41:36Z Running PostFailoverProcesses hook 2 of 2: curl -s <COMMAND>
2020-06-20T22:47:18Z Completed PostFailoverProcesses hook 2 of 2 in 8.676853128s
2020-06-20T22:53:25Z done running PostFailoverProcesses hooks
2020-06-20T23:00:20Z Waiting for 0 postponed functions
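For context, the hooks referenced in the timeline are simple echo/curl commands along these lines in our orchestrator.conf.json (a rough sketch: hostnames and the curl command are redacted, and the first PostFailoverProcesses entry is omitted):

    {
      "PreFailoverProcesses": [
        "echo 'Will recover from {failureType} on {failedHost}:{failedPort}' >> /tmp/recovery.log"
      ],
      "PostMasterFailoverProcesses": [
        "echo 'Recovered from {failureType} on {failureCluster}. Failed: {failedHost}:{failedPort}; Promoted: {successorHost}:{successorPort}' >> /tmp/recovery.log"
      ],
      "PostFailoverProcesses": [
        "...",
        "curl -s <COMMAND>"
      ]
    }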