We found that orchestrator is stable in most cases, but sometimes the number of goroutines in the application keeps growing. If this goes on long enough, orchestrator can no longer recover clusters from failures; in effect, the application hangs.
We expose Go runtime metrics, mainly the number of goroutines. The existing /debug/metrics API responds in JSON format, so we also added an API that returns the metrics in text format, which Prometheus can scrape.
This turned out to be very important: with this monitoring in place, we were able to analyze many of the performance problems we encountered later.
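As a rough illustration, a text-format metrics endpoint can be as simple as a handler that prints Go runtime indicators in the Prometheus text exposition format. The handler name and URL path below are placeholders rather than our actual implementation; in practice the prometheus/client_golang library is the more common choice.

```go
package main

import (
	"fmt"
	"net/http"
	"runtime"
)

// metricsTextHandler (hypothetical name) renders a few Go runtime indicators
// in the Prometheus text exposition format so they can be scraped.
func metricsTextHandler(w http.ResponseWriter, r *http.Request) {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprintf(w, "# TYPE go_goroutines gauge\n")
	fmt.Fprintf(w, "go_goroutines %d\n", runtime.NumGoroutine())
	fmt.Fprintf(w, "# TYPE go_memstats_heap_inuse_bytes gauge\n")
	fmt.Fprintf(w, "go_memstats_heap_inuse_bytes %d\n", ms.HeapInuse)
}

func main() {
	// Placeholder path; orchestrator's own JSON endpoint stays at /debug/metrics.
	http.HandleFunc("/debug/metrics/prometheus", metricsTextHandler)
	http.ListenAndServe(":3000", nil)
}
```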
2. Optimize fault recovery algorithm
Currently, when a MySQL cluster's master fails and the recovery enters the active period, orchestrator makes a failure recovery attempt every second for the duration of that period. If multiple MySQL clusters fail, or global recovery is disabled, this behavior causes the number of goroutines in the orchestrator process to keep growing.
We now queue recovery attempts, similar to how orchestrator already queues discovery.
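The sketch below illustrates the queueing idea under our assumptions; the type and function names are hypothetical, not orchestrator's actual code. Pending recovery attempts for a cluster are coalesced, and a fixed number of workers drain the queue, so per-second retries no longer spawn unbounded goroutines.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// recoveryQueue coalesces recovery attempts per cluster and lets a fixed pool
// of workers process them, instead of launching a new goroutine every second.
type recoveryQueue struct {
	mu      sync.Mutex
	pending map[string]bool // clusters already queued, to avoid duplicates
	queue   chan string
}

func newRecoveryQueue(capacity, workers int, recover func(cluster string)) *recoveryQueue {
	q := &recoveryQueue{
		pending: make(map[string]bool),
		queue:   make(chan string, capacity),
	}
	for i := 0; i < workers; i++ {
		go func() {
			for cluster := range q.queue {
				recover(cluster)
				q.mu.Lock()
				delete(q.pending, cluster)
				q.mu.Unlock()
			}
		}()
	}
	return q
}

// Push enqueues a recovery attempt unless one is already pending for the cluster.
func (q *recoveryQueue) Push(cluster string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.pending[cluster] {
		return
	}
	select {
	case q.queue <- cluster:
		q.pending[cluster] = true
	default:
		// Queue is full; drop this attempt and rely on the next tick to retry.
	}
}

func main() {
	q := newRecoveryQueue(100, 4, func(cluster string) {
		fmt.Println("attempting recovery for", cluster)
		time.Sleep(100 * time.Millisecond)
	})
	for i := 0; i < 5; i++ {
		q.Push("cluster-a") // duplicate attempts are coalesced into one
	}
	time.Sleep(time.Second)
}
```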
3. Performance problems with ORC raft
We use raft mode for orchestrator high availability, but the raft library that orchestrator depends on is out of date, and we run into strange problems from time to time, including:
- The goroutine count of the leader and of one follower is stable, but the goroutine count of the other follower keeps rising.
- After a large number of failure recoveries, restarting orchestrator causes orchestrator raft to attempt to write a large number of raft log entries.
- When a large number of raft log entries need to be written, orchestrator API calls such as /forget and /discover time out.
High availability of orchestrator itself is very important, and data consistency is the big challenge. We do not have a final solution yet, but there are two options:
- Run two standalone orchestrators for high availability, one with global failure recovery enabled and the other with it disabled (see the sketch after this list).
- Upgrade the raft library, which requires a lot of testing.
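For the first option, one way to ensure that only a single instance performs automated failovers is to toggle global recoveries through orchestrator's HTTP API. The sketch below assumes the /api/enable-global-recoveries and /api/disable-global-recoveries endpoints and uses placeholder host names; it is an illustration, not our production tooling.

```go
package main

import (
	"fmt"
	"net/http"
)

// setGlobalRecoveries toggles automated failover on a standalone orchestrator
// instance via its HTTP API.
func setGlobalRecoveries(baseURL string, enable bool) error {
	endpoint := "/api/disable-global-recoveries"
	if enable {
		endpoint = "/api/enable-global-recoveries"
	}
	resp, err := http.Get(baseURL + endpoint)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("%s returned %s", endpoint, resp.Status)
	}
	return nil
}

func main() {
	// One instance actively recovers; the other only observes.
	_ = setGlobalRecoveries("http://orchestrator-active:3000", true)
	_ = setGlobalRecoveries("http://orchestrator-standby:3000", false)
}
```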
4. The active period mechanism may make master recovery impossible
Suppose a MySQL failover keeps failing for some reason, so the recovery stays in the active period. If we then need to perform a force-takeover/force-failover, we have to acknowledge the recovery to exit the active period so that force-*over can work. However, if the master is still failed at that point, orchestrator's per-second automatic recovery attempts will put the cluster back into the active period.
Our solution is to add cluster-level failure recovery controls. We did this because we believe the feature will be useful in the future, whether for optimizing performance or for adding more features.
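The following is a minimal, hypothetical sketch of what cluster-level recovery control looks like (the names are ours, not orchestrator's): a per-cluster gate that automated recovery consults before acting, so a single problematic cluster can be paused without disabling global recovery for everything else.

```go
package main

import (
	"fmt"
	"sync"
)

// clusterRecoveryGate tracks which clusters have automated recovery disabled.
type clusterRecoveryGate struct {
	mu       sync.RWMutex
	disabled map[string]bool
}

func newClusterRecoveryGate() *clusterRecoveryGate {
	return &clusterRecoveryGate{disabled: make(map[string]bool)}
}

func (g *clusterRecoveryGate) Disable(cluster string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.disabled[cluster] = true
}

func (g *clusterRecoveryGate) Enable(cluster string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.disabled, cluster)
}

// Allowed is consulted before every automated recovery attempt.
func (g *clusterRecoveryGate) Allowed(cluster string) bool {
	g.mu.RLock()
	defer g.mu.RUnlock()
	return !g.disabled[cluster]
}

func main() {
	gate := newClusterRecoveryGate()
	gate.Disable("cluster-a")              // operator takes over cluster-a manually
	fmt.Println(gate.Allowed("cluster-a")) // false: automated recovery is skipped
	fmt.Println(gate.Allowed("cluster-b")) // true: other clusters are unaffected
}
```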
5. Rolling log
We use zap to implement log rolling.
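zap itself does not rotate files, so the usual pattern is to pair a zapcore sink with a rotation library such as lumberjack. The sketch below shows that combination with placeholder file paths and size limits; it illustrates the approach rather than our exact configuration.

```go
package main

import (
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
	"gopkg.in/natefinch/lumberjack.v2"
)

func main() {
	// lumberjack handles size-based rotation and retention of old files.
	rotatingSink := zapcore.AddSync(&lumberjack.Logger{
		Filename:   "/var/log/orchestrator/orchestrator.log", // placeholder path
		MaxSize:    100, // megabytes per file before rotation
		MaxBackups: 10,  // number of rotated files to keep
		MaxAge:     30,  // days to retain rotated files
	})

	core := zapcore.NewCore(
		zapcore.NewJSONEncoder(zap.NewProductionEncoderConfig()),
		rotatingSink,
		zap.InfoLevel,
	)

	logger := zap.New(core)
	defer logger.Sync()

	logger.Info("orchestrator started", zap.Int("port", 3000))
}
```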
6. Some bugs
We have found some bugs, identified their root causes, and fixed them internally. I also very much hope that everyone will report more issues; if you report a bug, please include the relevant logs.
Bugs encountered include: