We found that orchestrator is stable in most cases, but sometimes the number of goroutines in the application keeps growing. If this goes on long enough, orchestrator can no longer recover clusters from failures; in effect, the application hangs.
We expose Go runtime metrics, mainly the number of goroutines. The existing /debug/metrics API responds in JSON format, so we also added an API that returns the metrics in text format, which Prometheus can scrape.
This turned out to be very important: with this monitoring in place, we were able to analyze many of the performance problems we encountered later.
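As a rough illustration, a text-format metrics endpoint can be as simple as a handler that prints Go runtime indicators in the Prometheus text exposition format. The handler name and URL path below are placeholders rather than our actual implementation; in practice the prometheus/client_golang library is the more common choice.

```go
package main

import (
	"fmt"
	"net/http"
	"runtime"
)

// metricsTextHandler (hypothetical name) renders a few Go runtime indicators
// in the Prometheus text exposition format so they can be scraped.
func metricsTextHandler(w http.ResponseWriter, r *http.Request) {
	var ms runtime.MemStats
	runtime.ReadMemStats(&ms)
	w.Header().Set("Content-Type", "text/plain; version=0.0.4")
	fmt.Fprintf(w, "# TYPE go_goroutines gauge\n")
	fmt.Fprintf(w, "go_goroutines %d\n", runtime.NumGoroutine())
	fmt.Fprintf(w, "# TYPE go_memstats_heap_inuse_bytes gauge\n")
	fmt.Fprintf(w, "go_memstats_heap_inuse_bytes %d\n", ms.HeapInuse)
}

func main() {
	// Placeholder path; orchestrator's own JSON endpoint stays at /debug/metrics.
	http.HandleFunc("/debug/metrics/prometheus", metricsTextHandler)
	http.ListenAndServe(":3000", nil)
}
```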
2. Optimize fault recovery algorithm
Currently, when a MySQL cluster's master fails and the recovery enters the active period, orchestrator makes a failure recovery attempt every second for the duration of that period. If multiple MySQL clusters fail, or global recovery is disabled, this behavior causes the number of goroutines in the orchestrator process to keep growing.
We now queue recovery attempts, similar to how orchestrator already queues discovery.
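The sketch below illustrates the queueing idea under our assumptions; the type and function names are hypothetical, not orchestrator's actual code. Pending recovery attempts for a cluster are coalesced, and a fixed number of workers drain the queue, so per-second retries no longer spawn unbounded goroutines.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// recoveryQueue coalesces recovery attempts per cluster and lets a fixed pool
// of workers process them, instead of launching a new goroutine every second.
type recoveryQueue struct {
	mu      sync.Mutex
	pending map[string]bool // clusters already queued, to avoid duplicates
	queue   chan string
}

func newRecoveryQueue(capacity, workers int, recover func(cluster string)) *recoveryQueue {
	q := &recoveryQueue{
		pending: make(map[string]bool),
		queue:   make(chan string, capacity),
	}
	for i := 0; i < workers; i++ {
		go func() {
			for cluster := range q.queue {
				recover(cluster)
				q.mu.Lock()
				delete(q.pending, cluster)
				q.mu.Unlock()
			}
		}()
	}
	return q
}

// Push enqueues a recovery attempt unless one is already pending for the cluster.
func (q *recoveryQueue) Push(cluster string) {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.pending[cluster] {
		return
	}
	select {
	case q.queue <- cluster:
		q.pending[cluster] = true
	default:
		// Queue is full; drop this attempt and rely on the next tick to retry.
	}
}

func main() {
	q := newRecoveryQueue(100, 4, func(cluster string) {
		fmt.Println("attempting recovery for", cluster)
		time.Sleep(100 * time.Millisecond)
	})
	for i := 0; i < 5; i++ {
		q.Push("cluster-a") // duplicate attempts are coalesced into one
	}
	time.Sleep(time.Second)
}
```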
3. Performance problems with ORC raft
We use raft mode for orchestrator high availability, but the raft library that orchestrator depends on is out of date, and we run into strange problems from time to time, including:
- The goroutine count of the leader and of one follower is stable, but the goroutine count of the other follower keeps rising.
- After a large number of failure recoveries, restarting orchestrator causes orchestrator raft to attempt to write a large number of raft log entries.
- When a large number of raft log entries need to be written, orchestrator API calls such as /forget and /discover time out.
High availability of orchestrator itself is very important, and data consistency is the big challenge. We do not have a final solution yet, but there are two options:
- Run two standalone orchestrators for high availability, one with global failure recovery enabled and the other with it disabled (see the sketch after this list).
- Upgrade the raft library, which requires a lot of testing.
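For the first option, one way to ensure that only a single instance performs automated failovers is to toggle global recoveries through orchestrator's HTTP API. The sketch below assumes the /api/enable-global-recoveries and /api/disable-global-recoveries endpoints and uses placeholder host names; it is an illustration, not our production tooling.

```go
package main

import (
	"fmt"
	"net/http"
)

// setGlobalRecoveries toggles automated failover on a standalone orchestrator
// instance via its HTTP API.
func setGlobalRecoveries(baseURL string, enable bool) error {
	endpoint := "/api/disable-global-recoveries"
	if enable {
		endpoint = "/api/enable-global-recoveries"
	}
	resp, err := http.Get(baseURL + endpoint)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("%s returned %s", endpoint, resp.Status)
	}
	return nil
}

func main() {
	// One instance actively recovers; the other only observes.
	_ = setGlobalRecoveries("http://orchestrator-active:3000", true)
	_ = setGlobalRecoveries("http://orchestrator-standby:3000", false)
}
```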
4. The active period mechanism may make master recovery impossible
Suppose a MySQL failover keeps failing for some reason, so the recovery stays in the active period. If we then need to perform a force-takeover/force-failover, we have to acknowledge the recovery to exit the active period so that force-*over can work. However, if the master is still failed at that point, orchestrator's per-second automatic recovery attempts will put the cluster back into the active period.
Our solution is to add cluster-level failure recovery controls. We did this because we believe the feature will be useful in the future, whether for optimizing performance or for adding more features.
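The following is a minimal, hypothetical sketch of what cluster-level recovery control looks like (the names are ours, not orchestrator's): a per-cluster gate that automated recovery consults before acting, so a single problematic cluster can be paused without disabling global recovery for everything else.

```go
package main

import (
	"fmt"
	"sync"
)

// clusterRecoveryGate tracks which clusters have automated recovery disabled.
type clusterRecoveryGate struct {
	mu       sync.RWMutex
	disabled map[string]bool
}

func newClusterRecoveryGate() *clusterRecoveryGate {
	return &clusterRecoveryGate{disabled: make(map[string]bool)}
}

func (g *clusterRecoveryGate) Disable(cluster string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.disabled[cluster] = true
}

func (g *clusterRecoveryGate) Enable(cluster string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	delete(g.disabled, cluster)
}

// Allowed is consulted before every automated recovery attempt.
func (g *clusterRecoveryGate) Allowed(cluster string) bool {
	g.mu.RLock()
	defer g.mu.RUnlock()
	return !g.disabled[cluster]
}

func main() {
	gate := newClusterRecoveryGate()
	gate.Disable("cluster-a")              // operator takes over cluster-a manually
	fmt.Println(gate.Allowed("cluster-a")) // false: automated recovery is skipped
	fmt.Println(gate.Allowed("cluster-b")) // true: other clusters are unaffected
}
```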
5. Rolling log
We use zap to implement log rolling.
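zap itself does not rotate files, so the usual pattern is to pair a zapcore sink with a rotation library such as lumberjack. The sketch below shows that combination with placeholder file paths and size limits; it illustrates the approach rather than our exact configuration.

```go
package main

import (
	"go.uber.org/zap"
	"go.uber.org/zap/zapcore"
	"gopkg.in/natefinch/lumberjack.v2"
)

func main() {
	// lumberjack handles size-based rotation and retention of old files.
	rotatingSink := zapcore.AddSync(&lumberjack.Logger{
		Filename:   "/var/log/orchestrator/orchestrator.log", // placeholder path
		MaxSize:    100, // megabytes per file before rotation
		MaxBackups: 10,  // number of rotated files to keep
		MaxAge:     30,  // days to retain rotated files
	})

	core := zapcore.NewCore(
		zapcore.NewJSONEncoder(zap.NewProductionEncoderConfig()),
		rotatingSink,
		zap.InfoLevel,
	)

	logger := zap.New(core)
	defer logger.Sync()

	logger.Info("orchestrator started", zap.Int("port", 3000))
}
```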
6. Some bugs
We have found some bugs, identified their root causes, and fixed them internally. I also very much hope that everyone will report more issues; if you report a bug, please include the relevant logs.
Bugs encountered include: