Goal: We'd like to use Orchestrator in AWS EC2 for MySQL 5.7 using GTIDs in a typical main-plus-replicas topology -- no multi-master scenarios and no semi-sync plugin configured.
Context: I downloaded the latest GA release (3.0.3) of Orchestrator from the GitHub orchestrator repo and used a modified version of the Dockerfile included in the 3.0.3 zipfile, with no entrypoint.sh and with CMD set to tail -f /dev/null.* I did that so that I could use the orchestrator.sample.conf.json file and start the orchestrator process with the debug flag, pointing it at the /usr/local/orchestrator directory, which is where that Dockerfile puts the executable.
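For reference, the manual startup inside the container looks roughly like this (a sketch of my setup; the container name and the copied config file name are my own choices, not defaults):

    docker exec -it orchestrator bash
    cd /usr/local/orchestrator
    cp orchestrator.sample.conf.json orchestrator.config.json
    ./orchestrator --debug --config=/usr/local/orchestrator/orchestrator.config.json http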
I used the Docker Hub MySQL 5.7 Dockerfile, also modified, dropping the entrypoint shell script and the CMD set to mysqld, so that I could start mysqld independently by calling the docker-entrypoint.sh script with mysqld as its argument. I put both built image tags into a docker-compose file so that I could tie the orchestrator container to four individual MySQL instances set up as one main and three replicas, mount volumes with specific scripts to manually set up and check replication, and put config files for both MySQL and Orchestrator into place.
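The wiring is equivalent to something like this (a sketch only; the image and network names are placeholders, and the real compose file also mounts the volumes mentioned above):

    docker network create orch_net
    # one main and three replicas; with CMD stripped, keep the containers alive
    for i in 1 2 3 4; do
      docker run -d --name mysql$i --network orch_net mysql57-noentry tail -f /dev/null
    done
    # the orchestrator image already has CMD tail -f /dev/null
    docker run -d --name orchestrator --network orch_net orchestrator-noentry
    # mysqld is then started by hand inside each MySQL container, e.g.:
    docker exec -d mysql1 docker-entrypoint.sh mysqld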
*The reason I took the entrypoint out of the Orchestrator Dockerfile is twofold: the defaults didn't pick up information from docker network inspection of the assigned network (and therefore the IPs), and the directory /usr/local/orchestrator isn't one of the default paths Orchestrator searches for its config file.

What I have run into while testing in this Docker Compose environment, and the questions I am hoping you all can help me with, are:

1. I killed the mysqld process and the topology in the GUI showed "DeadMaster" -- no replica was promoted automatically. Does this article still hold true: https://www.percona.com/blog/2016/03/08/orchestrator-mysql-replication-topology-manager/ ? Quoting the Limitations section: "One of the key missing features is that there is no easy way to promote a slave to be the new master. This could be useful in scenarios where the master server has to be upgraded, there is a planned failover, etc. (this is a known feature request)." ("known" = https://github.com/outbrain/orchestrator/issues/151)
Or does the sample config setting "ApplyMySQLPromotionAfterMasterFailover": false have to be changed to true? A sketch of the settings block I have in mind follows.
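Here is the kind of recovery fragment I believe has to be merged into orchestrator.config.json for automatic promotion to happen, pieced together from the sample file and the topology-recovery doc (a minimal sketch; the "*" filters and the echo hooks are placeholders, not recommendations):

    {
      "ApplyMySQLPromotionAfterMasterFailover": true,
      "RecoverMasterClusterFilters": ["*"],
      "RecoverIntermediateMasterClusterFilters": ["*"],
      "FailureDetectionPeriodBlockMinutes": 60,
      "RecoveryPeriodBlockSeconds": 3600,
      "OnFailureDetectionProcesses": [
        "echo 'Detected {failureType} on {failureCluster}' >> /tmp/recovery.log"
      ],
      "PostMasterFailoverProcesses": [
        "echo 'Promoted {successorHost}:{successorPort}' >> /tmp/recovery.log"
      ]
    }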
I plan to test "ApplyMySQLPromotionAfterMasterFailover": true next and wanted to make sure I covered all bases for other orchestrator.config.json settings I probably overlooked or don't understand yet.

2. When I forced the failover with the default orchestrator config file, via orchestrator -c force-master-failover -i 58606e157e68 --config=/usr/local/orchestrator/orchestrator.config.json, a replica was automatically chosen, but the slaves were not reset, nor was the automatically chosen new target set to read/write as I was expecting. I was hoping the remaining replicas would end up under the new master. I read https://github.com/github/orchestrator/blob/master/docs/topology-recovery.md and it isn't as clear to me as the commands discussed in issue 151. (It also looks like switch-master is not a flag in 3.0.3; is that now take-master instead?)
switch-master" never existed. It was _requested_ by a user; that doesn't mean it was implemented the way the user requested; a request by a user isn't means of documentation.I included the log of the command run with the /tmp/recovery.log outputIt looks like the GUI may have initiated the force failover yet didn't acknowledge it at all.
The command line may have tried one replica and then a second; I'm not sure, because line 69 in the file shows both docker container ids. The by-product /tmp files would be great to have around, to verify the steps taken and connect the dots of which commands follow which steps. It looks like they get removed?
3. Would folks be willing to share what the orchestrator.config.json settings are at Booking or GitHub (without any usernames, passwords, or other incriminating things ;) ) that work best for unplanned and planned failover and recovery?
4. I've been looking for the orchestrator -c command-line list of steps to run through as a playbook in various presentations, Percona blogs, etc., and it's not clear to me which orchestrator -c command(s) chatops is running when the log of orchestrator commands is posted to a Jabber/IRC client.
Will this be a moot point if I get the orchestrator.config.json settings tweaked properly to handle an unplanned failover of the master (DeadMaster) or a planned failover? Or does each step in either failover scenario need a corresponding orchestrator -c command? Referencing https://githubengineering.com/mysql-testing-automation-at-github/ . The candidate commands I've collected so far are sketched below.
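For what it's worth, my playbook candidates so far, from the docs and the CLI help (a sketch; the alias and instance names are placeholders):

    # planned switchover while the master is healthy
    orchestrator -c graceful-master-takeover -alias mycluster --config=/usr/local/orchestrator/orchestrator.config.json
    # force a failover even though orchestrator sees no failure
    orchestrator -c force-master-failover -i 58606e157e68 --config=/usr/local/orchestrator/orchestrator.config.json
    # run a recovery against an instance orchestrator has flagged as failed
    orchestrator -c recover -i failed.instance.host:3306 --config=/usr/local/orchestrator/orchestrator.config.json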
bash-4.4# orchestrator -c graceful-master-takeover help
graceful-master-takeover:
    Gracefully discard master and promote another (direct child) instance instead, even if everything is running well.
    This allows for planned switchover.
    NOTE:
    - Promoted instance must be a direct child of the existing master
    - Promoted instance must be the *only* direct child of the existing master. It *is* a planned failover thing.
    - Orchestrator will first issue a "set global read_only=1" on existing master
    - It will promote candidate master to the binlog positions of the existing master after issuing the above
    - There _could_ still be statements issued and executed on the existing master by SUPER users, but those are ignored.
    - Orchestrator then proceeds to handle a DeadMaster failover scenario
    - Orchestrator will issue all relevant pre-failover and post-failover external processes.
    Examples:
    orchestrator -c graceful-master-takeover -alias mycluster
        Indicate cluster by alias. Orchestrator automatically figures out the master and verifies it has a single direct replica
    orchestrator -c force-master-takeover -i instance.in.relevant.cluster.com
        Indicate cluster by an instance. You don't strictly need to specify the master; orchestrator will infer the master's identity.
bash-4.4# orchestrator -c graceful-master-takeover -debug -verbose -i 58606e157e68 --config=/usr/local/orchestrator/orchestrator.config.json
mysql> show slave status \G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: ebb7130cec9f
                  Master_User:
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File:
          Read_Master_Log_Pos: 4
               Relay_Log_File: 58606e157e68-relay-bin.000001
                Relay_Log_Pos: 4
        Relay_Master_Log_File:
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 0
              Relay_Log_Space: 154
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1593
                Last_IO_Error: Fatal error: Invalid (empty) username when attempting to connect to the master server. Connection attempt terminated.
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 0
                  Master_UUID:
             Master_Info_File: /var/lib/mysql/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp: 171226 19:20:11
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set: 56546043-df7b-11e7-b8df-0242ac130002:1-3
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:
1 row in set (0.00 sec)
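If I'm reading that output right, the replica lost its replication credentials during the takeover (empty Master_User), so the fix would be along these lines (a sketch; the repl user and password are placeholders from my test scripts):

    mysql -u root -p -e "STOP SLAVE;
      CHANGE MASTER TO MASTER_HOST='ebb7130cec9f', MASTER_USER='repl',
        MASTER_PASSWORD='repl_pass', MASTER_AUTO_POSITION=1;
      START SLAVE;"

I believe orchestrator also has a ReplicationCredentialsQuery setting that lets it fill these in itself, but I haven't tried it.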