Galera read/write split - Inconsistent failover when stopping nodes.

326 views
Skip to first unread message

spyros...@gmail.com

unread,
Sep 19, 2017, 8:39:36 AM9/19/17
to proxysql
I have a three node cluster that seems to be working ok and is stable. 

My problem is that when I do some tests I get inconsistent "results" with proxysql.
Sometimes when I stop the one write node everything works as expected, the server is "shunned" and another one takes its place.
BUT sometimes no server takes its place, the write node stays "shunned" as expected but the other two servers stay at the "offline_soft" state and the cluster is unusable from proxysql.

When all the nodes are online the proxysql_galera_checker.log file states the following:
Tue Sep 19 15:12:18 EEST 2017 --> Number of writers that are 'ONLINE': 1 : hostgroup: 10
Tue Sep 19 15:12:18 EEST 2017 --> Number of readers that are 'ONLINE': 3 : hostgroup: 20
Tue Sep 19 15:12:18 EEST 2017 ###### Not loading mysql_servers, no change needed ######
Tue Sep 19 15:12:20 EEST 2017 ###### proxysql_galera_checker.sh SUMMARY ######
Tue Sep 19 15:12:20 EEST 2017 Hostgroup writers 10
Tue Sep 19 15:12:20 EEST 2017 Hostgroup readers 20
Tue Sep 19 15:12:20 EEST 2017 Number of writers 1
Tue Sep 19 15:12:20 EEST 2017 Writers are readers 1
Tue Sep 19 15:12:20 EEST 2017 log file /var/lib/proxysql/proxysql_galera_checker.log
Tue Sep 19 15:12:20 EEST 2017 ###### HANDLE WRITER NODES ######
Tue Sep 19 15:12:20 EEST 2017 --> Checking WRITE server 10:192.168.30.12:3306, current status ONLINE, wsrep_local_state 4
Tue Sep 19 15:12:20 EEST 2017 server 10:192.168.30.12:3306 is already ONLINE: 1 of 1 write nodes
Tue Sep 19 15:12:20 EEST 2017 --> Checking WRITE server 10:192.168.30.13:3306, current status OFFLINE_SOFT, wsrep_local_state 4
Tue Sep 19 15:12:20 EEST 2017 server 10:192.168.30.13:3306 is already OFFLINE_SOFT, max write nodes reached (1)
Tue Sep 19 15:12:20 EEST 2017 --> Checking WRITE server 10:192.168.30.15:3306, current status OFFLINE_SOFT, wsrep_local_state 4
Tue Sep 19 15:12:20 EEST 2017 server 10:192.168.30.15:3306 is already OFFLINE_SOFT, max write nodes reached (1)
Tue Sep 19 15:12:20 EEST 2017 ###### HANDLE READER NODES ######
Tue Sep 19 15:12:20 EEST 2017 --> Checking READ server 20:192.168.30.12:3306, current status ONLINE, wsrep_local_state 4
Tue Sep 19 15:12:20 EEST 2017 server 20:192.168.30.12:3306 is already ONLINE
Tue Sep 19 15:12:20 EEST 2017 --> Checking READ server 20:192.168.30.13:3306, current status ONLINE, wsrep_local_state 4
Tue Sep 19 15:12:20 EEST 2017 server 20:192.168.30.13:3306 is already ONLINE
Tue Sep 19 15:12:20 EEST 2017 --> Checking READ server 20:192.168.30.15:3306, current status ONLINE, wsrep_local_state 4
Tue Sep 19 15:12:20 EEST 2017 server 20:192.168.30.15:3306 is already ONLINE
Tue Sep 19 15:12:20 EEST 2017 ###### SUMMARY ######
Tue Sep 19 15:12:20 EEST 2017 --> Number of writers that are 'ONLINE': 1 : hostgroup: 10
Tue Sep 19 15:12:20 EEST 2017 --> Number of readers that are 'ONLINE': 3 : hostgroup: 20
Tue Sep 19 15:12:20 EEST 2017 ###### Not loading mysql_servers, no change needed ######

When I stop one or more nodes (less than 3 of course) I get the following in proxysql_galera_checker.log:
Tue Sep 19 15:09:59 EEST 2017 ###### HANDLE WRITER NODES ######
ERROR 2003 (HY000): Can't connect to MySQL server on '192.168.30.12' (111)
Tue Sep 19 15:09:59 EEST 2017 --> Checking WRITE server 10:192.168.30.12:3306, current status OFFLINE_SOFT, wsrep_local_state 
Tue Sep 19 15:09:59 EEST 2017 server 10:192.168.30.12:3306 is already OFFLINE_SOFT, WSREP status is  which is not ok
ERROR 2003 (HY000): Can't connect to MySQL server on '192.168.30.13' (111)
Tue Sep 19 15:09:59 EEST 2017 --> Checking WRITE server 10:192.168.30.13:3306, current status OFFLINE_SOFT, wsrep_local_state 
Tue Sep 19 15:09:59 EEST 2017 server 10:192.168.30.13:3306 is already OFFLINE_SOFT, WSREP status is  which is not ok
Tue Sep 19 15:09:59 EEST 2017 --> Checking WRITE server 10:192.168.30.15:3306, current status ONLINE, wsrep_local_state 4
Tue Sep 19 15:09:59 EEST 2017 server 10:192.168.30.15:3306 is already ONLINE: 1 of 1 write nodes
Tue Sep 19 15:09:59 EEST 2017 ###### HANDLE READER NODES ######
ERROR 2003 (HY000): Can't connect to MySQL server on '192.168.30.12' (111)
Tue Sep 19 15:09:59 EEST 2017 --> Checking READ server 20:192.168.30.12:3306, current status OFFLINE_SOFT, wsrep_local_state 
Tue Sep 19 15:09:59 EEST 2017 server 20:192.168.30.12:3306 is already OFFLINE_SOFT, WSREP status is  which is not ok
ERROR 2003 (HY000): Can't connect to MySQL server on '192.168.30.13' (111)
Tue Sep 19 15:09:59 EEST 2017 --> Checking READ server 20:192.168.30.13:3306, current status OFFLINE_SOFT, wsrep_local_state 
Tue Sep 19 15:09:59 EEST 2017 server 20:192.168.30.13:3306 is already OFFLINE_SOFT, WSREP status is  which is not ok
Tue Sep 19 15:09:59 EEST 2017 --> Checking READ server 20:192.168.30.15:3306, current status ONLINE, wsrep_local_state 4
Tue Sep 19 15:09:59 EEST 2017 server 20:192.168.30.15:3306 is already ONLINE
Tue Sep 19 15:09:59 EEST 2017 ###### SUMMARY ######
Tue Sep 19 15:09:59 EEST 2017 --> Number of writers that are 'ONLINE': 1 : hostgroup: 10
Tue Sep 19 15:09:59 EEST 2017 --> Number of readers that are 'ONLINE': 1 : hostgroup: 20
Tue Sep 19 15:09:59 EEST 2017 ###### Not loading mysql_servers, no change needed ######
Tue Sep 19 15:10:01 EEST 2017 ###### proxysql_galera_checker.sh SUMMARY ######
Tue Sep 19 15:10:01 EEST 2017 Hostgroup writers 10
Tue Sep 19 15:10:01 EEST 2017 Hostgroup readers 20
Tue Sep 19 15:10:01 EEST 2017 Number of writers 1
Tue Sep 19 15:10:01 EEST 2017 Writers are readers 1
Tue Sep 19 15:10:01 EEST 2017 log file /var/lib/proxysql/proxysql_galera_checker.log
Tue Sep 19 15:10:01 EEST 2017 ###### HANDLE WRITER NODES ######
ERROR 2003 (HY000): Can't connect to MySQL server on '192.168.30.12' (111)
Tue Sep 19 15:10:01 EEST 2017 --> Checking WRITE server 10:192.168.30.12:3306, current status OFFLINE_SOFT, wsrep_local_state 
Tue Sep 19 15:10:01 EEST 2017 server 10:192.168.30.12:3306 is already OFFLINE_SOFT, WSREP status is  which is not ok
ERROR 2003 (HY000): Can't connect to MySQL server on '192.168.30.13' (111)
Tue Sep 19 15:10:01 EEST 2017 --> Checking WRITE server 10:192.168.30.13:3306, current status OFFLINE_SOFT, wsrep_local_state 
Tue Sep 19 15:10:01 EEST 2017 server 10:192.168.30.13:3306 is already OFFLINE_SOFT, WSREP status is  which is not ok
Tue Sep 19 15:10:01 EEST 2017 --> Checking WRITE server 10:192.168.30.15:3306, current status ONLINE, wsrep_local_state 4
Tue Sep 19 15:10:01 EEST 2017 server 10:192.168.30.15:3306 is already ONLINE: 1 of 1 write nodes
Tue Sep 19 15:10:01 EEST 2017 ###### HANDLE READER NODES ######
ERROR 2003 (HY000): Can't connect to MySQL server on '192.168.30.12' (111)
Tue Sep 19 15:10:01 EEST 2017 --> Checking READ server 20:192.168.30.12:3306, current status OFFLINE_SOFT, wsrep_local_state 
Tue Sep 19 15:10:01 EEST 2017 server 20:192.168.30.12:3306 is already OFFLINE_SOFT, WSREP status is  which is not ok
ERROR 2003 (HY000): Can't connect to MySQL server on '192.168.30.13' (111)
Tue Sep 19 15:10:01 EEST 2017 --> Checking READ server 20:192.168.30.13:3306, current status OFFLINE_SOFT, wsrep_local_state 
Tue Sep 19 15:10:01 EEST 2017 server 20:192.168.30.13:3306 is already OFFLINE_SOFT, WSREP status is  which is not ok
Tue Sep 19 15:10:01 EEST 2017 --> Checking READ server 20:192.168.30.15:3306, current status ONLINE, wsrep_local_state 4
Tue Sep 19 15:10:01 EEST 2017 server 20:192.168.30.15:3306 is already ONLINE
Tue Sep 19 15:10:01 EEST 2017 ###### SUMMARY ######
Tue Sep 19 15:10:01 EEST 2017 --> Number of writers that are 'ONLINE': 1 : hostgroup: 10
Tue Sep 19 15:10:01 EEST 2017 --> Number of readers that are 'ONLINE': 1 : hostgroup: 20
Tue Sep 19 15:10:01 EEST 2017 ###### Not loading mysql_servers, no change needed ######


and the following in proxysql.log:
2017-09-19 15:09:13 [INFO] Dumping current MySQL Servers structures for hostgroup ALL
HID: 10 , address: 192.168.30.12 , port: 3306 , weight: 1 , status: OFFLINE_SOFT , max_connections: 100 , max_replication_lag: 10 , use_ssl: 0 , max_latency_ms: 0 , comment: write 1
HID: 10 , address: 192.168.30.13 , port: 3306 , weight: 1 , status: SHUNNED , max_connections: 100 , max_replication_lag: 10 , use_ssl: 0 , max_latency_ms: 0 , comment: write 2
HID: 10 , address: 192.168.30.15 , port: 3306 , weight: 1 , status: OFFLINE_SOFT , max_connections: 100 , max_replication_lag: 10 , use_ssl: 0 , max_latency_ms: 0 , comment: write 3
HID: 20 , address: 192.168.30.12 , port: 3306 , weight: 1 , status: OFFLINE_SOFT , max_connections: 100 , max_replication_lag: 10 , use_ssl: 0 , max_latency_ms: 0 , comment: read 1
HID: 20 , address: 192.168.30.13 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 100 , max_replication_lag: 10 , use_ssl: 0 , max_latency_ms: 0 , comment: read 2
HID: 20 , address: 192.168.30.15 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 100 , max_replication_lag: 10 , use_ssl: 0 , max_latency_ms: 0 , comment: read 3
2017-09-19 15:09:16 mysql_connection.cpp:598:handler(): [ERROR] Failed to mysql_real_connect() on 192.168.30.13:3306 , FD (Conn:24 , MyDS:24) , 2013: Lost connection to MySQL server at 'handshake: reading inital communication packet', system error: 107.
2017-09-19 15:09:16 mysql_connection.cpp:598:handler(): [ERROR] Failed to mysql_real_connect() on 192.168.30.13:3306 , FD (Conn:25 , MyDS:25) , 2013: Lost connection to MySQL server at 'handshake: reading inital communication packet', system error: 107.
2017-09-19 15:09:16 mysql_connection.cpp:598:handler(): [ERROR] Failed to mysql_real_connect() on 192.168.30.13:3306 , FD (Conn:23 , MyDS:23) , 2013: Lost connection to MySQL server at 'handshake: reading inital communication packet', system error: 107.
2017-09-19 15:09:16 MySQL_HostGroups_Manager.cpp:280:connect_error(): [ERROR] Shunning server 192.168.30.13:3306 with 5 errors/sec. Shunning for 10 seconds
2017-09-19 15:09:16 mysql_connection.cpp:598:handler(): [ERROR] Failed to mysql_real_connect() on 192.168.30.13:3306 , FD (Conn:24 , MyDS:24) , 2013: Lost connection to MySQL server at 'handshake: reading inital communication packet', system error: 107.
2017-09-19 15:09:16 mysql_connection.cpp:598:handler(): [ERROR] Failed to mysql_real_connect() on 192.168.30.13:3306 , FD (Conn:23 , MyDS:23) , 2013: Lost connection to MySQL server at 'handshake: reading inital communication packet', system error: 107.
2017-09-19 15:09:18 [INFO] Dumping current MySQL Servers structures for hostgroup ALL
HID: 10 , address: 192.168.30.12 , port: 3306 , weight: 1 , status: OFFLINE_SOFT , max_connections: 100 , max_replication_lag: 10 , use_ssl: 0 , max_latency_ms: 0 , comment: write 1
HID: 10 , address: 192.168.30.13 , port: 3306 , weight: 1 , status: SHUNNED , max_connections: 100 , max_replication_lag: 10 , use_ssl: 0 , max_latency_ms: 0 , comment: write 2
HID: 10 , address: 192.168.30.15 , port: 3306 , weight: 1 , status: OFFLINE_SOFT , max_connections: 100 , max_replication_lag: 10 , use_ssl: 0 , max_latency_ms: 0 , comment: write 3
HID: 20 , address: 192.168.30.12 , port: 3306 , weight: 1 , status: OFFLINE_SOFT , max_connections: 100 , max_replication_lag: 10 , use_ssl: 0 , max_latency_ms: 0 , comment: read 1
HID: 20 , address: 192.168.30.13 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 100 , max_replication_lag: 10 , use_ssl: 0 , max_latency_ms: 0 , comment: read 2
HID: 20 , address: 192.168.30.15 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 100 , max_replication_lag: 10 , use_ssl: 0 , max_latency_ms: 0 , comment: read 3
2017-09-19 15:09:18 [INFO] Dumping current MySQL Servers structures for hostgroup ALL
HID: 10 , address: 192.168.30.12 , port: 3306 , weight: 1 , status: OFFLINE_SOFT , max_connections: 100 , max_replication_lag: 10 , use_ssl: 0 , max_latency_ms: 0 , comment: write 1
HID: 10 , address: 192.168.30.13 , port: 3306 , weight: 1 , status: SHUNNED , max_connections: 100 , max_replication_lag: 10 , use_ssl: 0 , max_latency_ms: 0 , comment: write 2
HID: 10 , address: 192.168.30.15 , port: 3306 , weight: 1 , status: OFFLINE_SOFT , max_connections: 100 , max_replication_lag: 10 , use_ssl: 0 , max_latency_ms: 0 , comment: write 3
HID: 20 , address: 192.168.30.12 , port: 3306 , weight: 1 , status: OFFLINE_SOFT , max_connections: 100 , max_replication_lag: 10 , use_ssl: 0 , max_latency_ms: 0 , comment: read 1
HID: 20 , address: 192.168.30.13 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 100 , max_replication_lag: 10 , use_ssl: 0 , max_latency_ms: 0 , comment: read 2
HID: 20 , address: 192.168.30.15 , port: 3306 , weight: 1 , status: ONLINE , max_connections: 100 , max_replication_lag: 10 , use_ssl: 0 , max_latency_ms: 0 , comment: read 3



Note that the previous two logs where taken while the cluster state was 1 node up and two nodes down for ~10minutes. (Checked multiple times with 'service mysql status' and the two nodes were really down)

Meanwhile cluster control had this status which was obviously wrong as 192.168.30.13 was down:





The problems I mentioned above are not consistent with every test. 
Sometimes the hand-off to the last remaining server works fine, other times not.
Sometimes with two remaining nodes everything is ok, other times again not.


By the way my proxysql.cnf is this and it seems to work most of the time:
datadir="/var/lib/proxysql"
admin_variables=
{
admin_credentials="admin:password"
mysql_ifaces="127.0.0.1:6032;/tmp/proxysql_admin.sock"

cluster_check_interval_ms=1000
}
mysql_variables=
{

connect_retries_on_failure=10
connect_timeout_server=10000
connect_timeout_server_max=10000
connection_delay_multiplex_ms=0

default_query_timeout=36000000


hostgroup_manager_verbose=1
interfaces="0.0.0.0:6033;/tmp/proxysql.sock"
max_connections=2048
max_stmts_cache=10000
max_stmts_per_connection=20
mirror_max_concurrency=16
mirror_max_queue_length=32000
monitor_connect_interval=2000
monitor_connect_timeout=200
monitor_enabled=true
monitor_groupreplication_healthcheck_interval=5000
monitor_groupreplication_healthcheck_timeout=800
monitor_username="proxysql-monitor"
monitor_password="password"
monitor_ping_interval=30000
monitor_ping_max_failures=3
monitor_ping_timeout=1000
monitor_query_interval=60000
monitor_query_timeout=100
monitor_read_only_interval=1000
monitor_read_only_timeout=800
monitor_replication_lag_timeout=1000
monitor_slave_lag_when_null=60
monitor_wait_timeout=true
monitor_writer_is_also_reader=true
multiplexing=true
ping_interval_server_msec=10000
ping_timeout_server=200
poll_timeout=2000
poll_timeout_on_failure=100
query_cache_size_MB=256

query_retries_on_failure=1000
server_capabilities=45578
server_version="5.1.30"

session_idle_show_processlist=false
sessions_sort=true
shun_on_failures=5
  shun_recovery_time_sec=10
stacksize=1048576
threads=4
}
mysql_servers =
(
{ address="192.168.30.12" , port=3306 , hostgroup=10, max_connections=100, max_replication_lag=10, comment="write 1" },
{ address="192.168.30.13" , port=3306 , hostgroup=10, max_connections=100, max_replication_lag=10, comment="write 2" },
{ address="192.168.30.15" , port=3306 , hostgroup=10, max_connections=100, max_replication_lag=10, comment="write 3" },
{ address="192.168.30.12" , port=3306 , hostgroup=20, max_connections=100, max_replication_lag=10, comment="read 1" },
{ address="192.168.30.13" , port=3306 , hostgroup=20, max_connections=100, max_replication_lag=10, comment="read 2" },
{ address="192.168.30.15" , port=3306 , hostgroup=20, max_connections=100, max_replication_lag=10, comment="read 3" }
)
mysql_users:
(
{
username = "admin"
password = "admin"
default_hostgroup = 10
max_connections = 10000
default_schema="information_schema"
active = 1
}
)
mysql_query_rules:
(
{
rule_id=100
active=1
match_pattern=".*@.*"
destination_hostgroup=10
apply=1
},
{
rule_id=200
active=1
match_pattern="^SELECT.*"
destination_hostgroup=20
apply=0
},
{
rule_id=300
active=1
match_pattern="^SELECT.*FOR UPDATE"
destination_hostgroup=10
apply=1
}
)
scheduler=
(
   {
      id=1
      active=1
      interval_ms=2000
      filename="/var/lib/proxysql/proxysql_galera_checker.sh"
      arg1="10"
      arg2="20"
      arg3="1"
      arg4="1"
      arg5="/var/lib/proxysql/proxysql_galera_checker.log"
   }
)



My proxysql version is:
proxysql --version
ProxySQL version 1.4.1-45-gab4e6ee, codename Truls

Thank you!
Reply all
Reply to author
Forward
0 new messages