Preventing promotion of lagging replicas

Mohan K.N

May 21, 2021, 4:50:37 PM

to orchestrator-mysql

Hi,

We have a 3-node setup on CentOS: 1 primary and 2 replicas, with semi-sync enabled.

We had a scenario where the replicas were lagging by a few hours and the master was not reachable, so one of the replicas was promoted to primary in spite of the huge lag. This resulted in data loss. I am trying to prevent a replica from being promoted in this situation; we would rather take an outage than lose data.

I added this custom hook to PreFailoverProcesses, defined in /etc/orchestrator.json. It queries the orchestrator sqlite3 DB directly to get this info. Is this a good approach? Unfortunately I cannot use pt-heartbeat or any other heartbeat mechanism to detect lag. Are there any variables I can set to detect this lag without having to add a custom hook?

Please share your thoughts and ideas on this topic.

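#!/bin/bash

# Count replicas (master_host is set) whose exec_master_log_pos equals the
# binary_log_pos of an instance that has no master. Only positions are compared.
IS_THERE_A_CANDIDATE=$(/usr/bin/sqlite3 /usr/local/orchestrator/orchestrator.sqlite3 -list "select count(*) from database_instance where master_host!='' and exec_master_log_pos in (select binary_log_pos from database_instance where master_host='');")

# No caught-up replica found: log it and exit non-zero so the failover is aborted.
if [[ "$IS_THERE_A_CANDIDATE" = 0 ]]; then
  echo "There is no candidate" >> /usr/local/appian/orchestrator/logs/orchestrator-recovery.log
  exit 1
fi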

Shlomi Noach

May 22, 2021, 10:37:46 AM

to Mohan K.N, orchestrator-mysql

Please see: https://github.com/openark/orchestrator/blob/master/docs/configuration-recovery.md#promotion-actions

Consider using:

- DelayMasterPromotionIfSQLThreadNotUpToDate

- or FailMasterPromotionIfSQLThreadNotUpToDate

- or FailMasterPromotionOnLagMinutes

These should address your issue in different ways, out of the box.

--
You received this message because you are subscribed to the Google Groups "orchestrator-mysql" group.
To unsubscribe from this group and stop receiving emails from it, send an email to orchestrator-my...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/orchestrator-mysql/79595a48-60e1-4a7a-ab31-f0005d3a3737n%40googlegroups.com.

Mohan K.N

May 25, 2021, 11:29:13 PM

to orchestrator-mysql

Thanks Shlomi. I tested with these two parameters:

"DelayMasterPromotionIfSQLThreadNotUpToDate": true,

"FailMasterPromotionIfSQLThreadNotUpToDate": false

My setup is a 3-node cluster with 1 primary and 2 replicas.

In my test case the SQL threads are stopped on both replicas. With the above parameters set, I still see a replica being promoted, and data is lost.

I am using orchestrator 3.2.4. What am I missing? Thanks in advance.
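
A quick sanity check for a setup like this, sketched under assumptions: list the promotion-related keys that are actually present in the config file. The default path is the one mentioned earlier in this thread; whether the running orchestrator service loads that file is something to verify separately. The function name promotion_settings is hypothetical.

```shell
#!/bin/bash
# Path is an assumption; pass the file your orchestrator service actually loads.
CONF="${1:-/etc/orchestrator.json}"

promotion_settings() {
  # Print every Delay*/Fail* master-promotion key found in the given file.
  grep -E '"(Delay|Fail)MasterPromotion[A-Za-z]*"' "$1"
}

if [[ -r "$CONF" ]]; then
  promotion_settings "$CONF"
else
  echo "cannot read $CONF" >&2
fi
```

If a key that was edited does not show up in the output, the file being checked is probably not the one that was changed.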

Shlomi Noach

May 25, 2021, 11:46:41 PM

to Mohan K.N, orchestrator-mysql

Can you please provide logs with --debug?

Mohan

May 26, 2021, 8:52:25 AM

to Shlomi Noach, orchestrator-mysql

Thanks Shlomi.

Attached are the orchestrator log (with debug enabled) and the conf file.

Here's the test case.

1 master and 2 replicas, semi-sync enabled, MariaDB 10.5.10. binlog_format is MIXED and binlog_row_image is FULL.

Node 1 is the master.

1. Stop the slave SQL_THREAD on nodes 2 and 3.

2. Add transactions on node 1.

3. Stop the master.

4. Orchestrator promoted node 2 to primary and attached node 3 as its slave.

But the data added in step 2 is completely lost and is not present on node 2 or node 3, even with FailMasterPromotionIfSQLThreadNotUpToDate=false and DelayMasterPromotionIfSQLThreadNotUpToDate=true.
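
Steps 1 to 3 of the test case can be sketched as a script, which may make the scenario easier to repeat. This is only an illustration under stated assumptions: the hostnames node1/node2/node3, the table test.t1, the systemd unit name mariadb, and passwordless mysql/ssh access are all hypothetical and not taken from the thread. By default it only prints the commands (DRY_RUN=1).

```shell
#!/bin/bash
# Sketch of the reproduction steps. Hostnames, the test table, and the
# systemd unit name are assumptions, not taken from the thread.
MASTER="${MASTER:-node1}"
REPLICAS="${REPLICAS:-node2 node3}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
  # Print the command; execute it only when DRY_RUN is 0.
  echo "+ $*"
  if [[ "$DRY_RUN" = 0 ]]; then
    "$@"
  fi
}

reproduce() {
  local r
  # Step 1: stop only the SQL thread; the IO thread is left running.
  for r in $REPLICAS; do
    run mysql -h "$r" -e "STOP SLAVE SQL_THREAD"
  done
  # Step 2: write to the master (hypothetical table test.t1).
  run mysql -h "$MASTER" -e "INSERT INTO test.t1 VALUES (1)"
  # Step 3: stop the master (assumes a systemd unit named mariadb).
  run ssh "$MASTER" sudo systemctl stop mariadb
}

reproduce
```

Running it with DRY_RUN=0 executes the commands for real, so it should only be pointed at a disposable test cluster.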

Thanks for your help in advance.

Thanks & have a wonderful day

Mohan K

orchestrator.log

orchestrator.conf.json