Strange intermittent losses of sync


Charles Williams

Oct 26, 2016, 8:45:42 AM
to codership
Hello everyone,

I have a 3-node MariaDB (10.0.25) Galera cluster (Debian Wheezy) that works fine except that ca. once a day it loses sync and the entire cluster goes down before (usually) coming back up. Today was an exception: the cluster did not come back as usual.

There are no dropped packets and no significant latency (all nodes are on the same local GBit network).

Here are the my.cnf and a snippet from mysql.log, from shortly before the failure until after it:

https://pastie.itadmins.net/ggljkxkJ

Thanks for any ideas.

hunter86bg

Oct 27, 2016, 12:25:05 AM
to codership
What do you mean by "ca", and why is the cluster going down every day?!?

Charles Williams

Oct 27, 2016, 3:06:14 AM
to codership
"ca" is short for circa (another word for approximately or about).

And that is exactly my question. I cannot figure out why it is going down every day. It doesn't make sense to me atm.

Charles Williams

Oct 27, 2016, 3:16:57 AM
to codership

And the most annoying part of this is that my boss is on the verge of ordering me to drop Galera completely in favor of a legacy MySQL master/slave cluster.

This is something that I would rather avoid.

hunter86bg

Oct 27, 2016, 3:26:20 AM
to codership
Could you provide the logs (from all Galera nodes) for the moments when the disconnects happened on previous days?
Does it happen at the same time your backup starts? Is it possible that you say "disconnect", but it actually means "stop serving clients"?

Lammert Bies

Oct 27, 2016, 3:29:50 AM
to codership
You may want to remove the root password from the config file you posted and change that root password on all your internal machines. It is not a good idea to have such information floating around the internet.
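One common way to keep credentials out of the server config, sketched here assuming admin commands are run as the root user on each node (the file contents below are hypothetical placeholders, not real values):

  # ~/.my.cnf -- picked up automatically by the mysql client tools
  [client]
  user = root
  password = CHANGEME        # placeholder, not the real password

  # make it readable only by its owner:
  #   chmod 600 ~/.my.cnf

With that in place, mysql and mysqladmin no longer need the password on the command line or in /etc/mysql/my.cnf.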

hunter86bg

Oct 27, 2016, 3:40:59 AM
to codership
Also, can you use the following tool to debug the situation?

Thanks to my colleague Martin, I used to debug my cluster with the tools in this git: MYQ_tools (the direct link I used is MYQ_Tools_1.0.4).

Just download it, extract the binary for your OS (e.g. myq_status.linux-amd64), and copy it to /usr/bin/.

Then you can monitor the load with a command like this:

# myq_status.linux-amd64 -a "-uroot -ppassword -S/tmp/mysql.6000.sock" wsrep

You need to replace the above values with the correct ones for your user, password, and socket.

I'm suspecting that the cluster is pausing.

Charles Williams

Oct 27, 2016, 3:44:16 AM
to codership
Here are the ones from the failure this morning: https://pastie.itadmins.net/KCNl1vBM

Charles Williams

Oct 27, 2016, 3:49:01 AM
to codership
Pausing? Why would it pause?

hunter86bg

Oct 27, 2016, 4:52:44 AM
to codership
Ignore the pause thing.

I saw the following:
1.) Internal MariaDB error code 1032 ("can't find record").
This situation is not eventually consistent, and MariaDB cannot resolve it automatically. If the application needs eventual consistency, it has to resolve the conflict itself so that all databases are again in the same state that is correct by application rules.
2.) 10.1.5.63 started an SST (lines 268-272)
  1. 161027  8:43:33 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.1.5.63' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '31893'  '' '
  2. WSREP_SST: [INFO] Streaming with xbstream (20161027 08:43:34.208)
  3. WSREP_SST: [INFO] Using socat as streamer (20161027 08:43:34.211)
  4. WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20161027 08:43:34.258)
  5. 161027  8:43:34 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.1.5.63:4444/xtrabackup_sst//1


Now can you provide the output of this from each node:

mysql -u root -p<pass> -e "SHOW STATUS LIKE 'wsrep%';"

Also, consider that the tmp folder for xtrabackup is the default one (/tmp). That could also cause trouble (lines 78-79):

  1. WSREP_SST: [INFO] Using /tmp/tmp.gur61bRfSu as xtrabackup temporary directory (20161027 08:43:34.940)
  2. WSREP_SST: [INFO] Using /tmp/tmp.I6qhPbqjns as innobackupex temporary directory (20161027 08:43:34.944)
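A minimal sketch of both points, assuming root can connect to the other nodes remotely (e.g. via ~/.my.cnf credentials) and that /var/lib/mysql-sst-tmp is a hypothetical scratch directory on a disk with enough space; as far as I know, wsrep_sst_xtrabackup-v2 reads its tmpdir from the [sst] section of my.cnf:

  # collect the wsrep status from all three nodes in one pass
  for h in 10.1.5.62 10.1.5.63 10.1.5.64; do
      mysql -h "$h" -u root -e "SHOW STATUS LIKE 'wsrep%';" > "wsrep_${h}.txt"
  done

  # /etc/mysql/my.cnf -- move the SST scratch space off /tmp
  [sst]
  tmpdir = /var/lib/mysql-sst-tmp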

James Wang

Oct 27, 2016, 5:10:19 AM
to codership
Do you write to all nodes?

Also, roughly how many writes per second do you see?

Charles Williams

Oct 27, 2016, 5:23:09 AM
to codership


On Thursday, October 27, 2016 at 10:52:44 AM UTC+2, hunter86bg wrote:
Ignore the pause thing.

I saw the following:
1.) Internal MariaDB error code 1032 ("can't find record").
This situation is not eventually consistent, and MariaDB cannot resolve it automatically. If the application needs eventual consistency, it has to resolve the conflict itself so that all databases are again in the same state that is correct by application rules.

Will check that.
 
2.) 10.1.5.63 started an SST (lines 268-272)
  1. 161027  8:43:33 [Note] WSREP: Running: 'wsrep_sst_xtrabackup-v2 --role 'joiner' --address '10.1.5.63' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --defaults-group-suffix '' --parent '31893'  '' '
  2. WSREP_SST: [INFO] Streaming with xbstream (20161027 08:43:34.208)
  3. WSREP_SST: [INFO] Using socat as streamer (20161027 08:43:34.211)
  4. WSREP_SST: [INFO] Evaluating timeout -k 110 100 socat -u TCP-LISTEN:4444,reuseaddr stdio | xbstream -x; RC=( ${PIPESTATUS[@]} ) (20161027 08:43:34.258)
  5. 161027  8:43:34 [Note] WSREP: Prepared SST request: xtrabackup-v2|10.1.5.63:4444/xtrabackup_sst//1


OK, do you see a problem there that I am missing?
 

Now can you provide the output of this from each node:

mysql -u root -p<pass> -e "SHOW STATUS LIKE 'wsrep%';"


 root@galdbc1a /var/log$ mysql -u root -ppassword -e "SHOW STATUS LIKE 'wsrep%';"
+------------------------------+---------------------------------------------------+
| Variable_name                | Value                                             |
+------------------------------+---------------------------------------------------+
| wsrep_local_state_uuid       | e3d6fc67-480f-11e5-9f36-e6094437fc14              |
| wsrep_protocol_version       | 7                                                 |
| wsrep_last_committed         | 1839535885                                        |
| wsrep_replicated             | 12710                                             |
| wsrep_replicated_bytes       | 42442607                                          |
| wsrep_repl_keys              | 49994                                             |
| wsrep_repl_keys_bytes        | 692413                                            |
| wsrep_repl_data_bytes        | 40936754                                          |
| wsrep_repl_other_bytes       | 0                                                 |
| wsrep_received               | 340                                               |
| wsrep_received_bytes         | 287412                                            |
| wsrep_local_commits          | 12577                                             |
| wsrep_local_cert_failures    | 0                                                 |
| wsrep_local_replays          | 0                                                 |
| wsrep_local_send_queue       | 0                                                 |
| wsrep_local_send_queue_max   | 2                                                 |
| wsrep_local_send_queue_min   | 0                                                 |
| wsrep_local_send_queue_avg   | 0.001951                                          |
| wsrep_local_recv_queue       | 0                                                 |
| wsrep_local_recv_queue_max   | 1                                                 |
| wsrep_local_recv_queue_min   | 0                                                 |
| wsrep_local_recv_queue_avg   | 0.000000                                          |
| wsrep_local_cached_downto    | 1839523140                                        |
| wsrep_flow_control_paused_ns | 0                                                 |
| wsrep_flow_control_paused    | 0.000000                                          |
| wsrep_flow_control_sent      | 0                                                 |
| wsrep_flow_control_recv      | 0                                                 |
| wsrep_cert_deps_distance     | 63.244704                                         |
| wsrep_apply_oooe             | 0.003294                                          |
| wsrep_apply_oool             | 0.000078                                          |
| wsrep_apply_window           | 1.004235                                          |
| wsrep_commit_oooe            | 0.000000                                          |
| wsrep_commit_oool            | 0.000000                                          |
| wsrep_commit_window          | 1.000941                                          |
| wsrep_local_state            | 4                                                 |
| wsrep_local_state_comment    | Synced                                            |
| wsrep_cert_index_size        | 168                                               |
| wsrep_causal_reads           | 0                                                 |
| wsrep_cert_interval          | 0.012318                                          |
| wsrep_incoming_addresses     | 10.1.5.62:3306,10.1.5.63:3306,10.1.5.64:3306      |
| wsrep_evs_delayed            |                                                   |
| wsrep_evs_evict_list         |                                                   |
| wsrep_evs_repl_latency       | 0.000492962/0.000717572/0.00146807/0.000163128/95 |
| wsrep_evs_state              | OPERATIONAL                                       |
| wsrep_gcomm_uuid             | 553b373e-9c20-11e6-a688-46fe800b2ed3              |
| wsrep_cluster_conf_id        | 56                                                |
| wsrep_cluster_size           | 3                                                 |
| wsrep_cluster_state_uuid     | e3d6fc67-480f-11e5-9f36-e6094437fc14              |
| wsrep_cluster_status         | Primary                                           |
| wsrep_connected              | ON                                                |
| wsrep_local_bf_aborts        | 0                                                 |
| wsrep_local_index            | 0                                                 |
| wsrep_provider_name          | Galera                                            |
| wsrep_provider_vendor        | Codership Oy <in...@codership.com>                 |
| wsrep_provider_version       | 25.3.15(r3578)                                    |
| wsrep_ready                  | ON                                                |
| wsrep_thread_count           | 9                                                 |
+------------------------------+---------------------------------------------------+

root@galdbc1b /var/log$ mysql -u root -ppassword -e "SHOW STATUS LIKE 'wsrep%';"
+------------------------------+----------------------------------------------+
| Variable_name                | Value                                        |
+------------------------------+----------------------------------------------+
| wsrep_local_state_uuid       | e3d6fc67-480f-11e5-9f36-e6094437fc14         |
| wsrep_protocol_version       | 7                                            |
| wsrep_last_committed         | 1839535812                                   |
| wsrep_replicated             | 0                                            |
| wsrep_replicated_bytes       | 0                                            |
| wsrep_repl_keys              | 0                                            |
| wsrep_repl_keys_bytes        | 0                                            |
| wsrep_repl_data_bytes        | 0                                            |
| wsrep_repl_other_bytes       | 0                                            |
| wsrep_received               | 12927                                        |
| wsrep_received_bytes         | 42356687                                     |
| wsrep_local_commits          | 0                                            |
| wsrep_local_cert_failures    | 0                                            |
| wsrep_local_replays          | 0                                            |
| wsrep_local_send_queue       | 0                                            |
| wsrep_local_send_queue_max   | 1                                            |
| wsrep_local_send_queue_min   | 0                                            |
| wsrep_local_send_queue_avg   | 0.000000                                     |
| wsrep_local_recv_queue       | 0                                            |
| wsrep_local_recv_queue_max   | 11                                           |
| wsrep_local_recv_queue_min   | 0                                            |
| wsrep_local_recv_queue_avg   | 0.008277                                     |
| wsrep_local_cached_downto    | 1839523187                                   |
| wsrep_flow_control_paused_ns | 0                                            |
| wsrep_flow_control_paused    | 0.000000                                     |
| wsrep_flow_control_sent      | 0                                            |
| wsrep_flow_control_recv      | 0                                            |
| wsrep_cert_deps_distance     | 63.305006                                    |
| wsrep_apply_oooe             | 0.024539                                     |
| wsrep_apply_oool             | 0.000237                                     |
| wsrep_apply_window           | 1.061031                                     |
| wsrep_commit_oooe            | 0.000000                                     |
| wsrep_commit_oool            | 0.000000                                     |
| wsrep_commit_window          | 1.047020                                     |
| wsrep_local_state            | 4                                            |
| wsrep_local_state_comment    | Synced                                       |
| wsrep_cert_index_size        | 146                                          |
| wsrep_causal_reads           | 0                                            |
| wsrep_cert_interval          | 0.012435                                     |
| wsrep_incoming_addresses     | 10.1.5.62:3306,10.1.5.63:3306,10.1.5.64:3306 |
| wsrep_evs_delayed            |                                              |
| wsrep_evs_evict_list         |                                              |
| wsrep_evs_repl_latency       | 0.000706298/0.000706298/0.000706298/0/1      |
| wsrep_evs_state              | OPERATIONAL                                  |
| wsrep_gcomm_uuid             | 621ec49f-9c20-11e6-a410-8b2524afaa49         |
| wsrep_cluster_conf_id        | 56                                           |
| wsrep_cluster_size           | 3                                            |
| wsrep_cluster_state_uuid     | e3d6fc67-480f-11e5-9f36-e6094437fc14         |
| wsrep_cluster_status         | Primary                                      |
| wsrep_connected              | ON                                           |
| wsrep_local_bf_aborts        | 0                                            |
| wsrep_local_index            | 1                                            |
| wsrep_provider_name          | Galera                                       |
| wsrep_provider_vendor        | Codership Oy <in...@codership.com>            |
| wsrep_provider_version       | 25.3.15(r3578)                               |
| wsrep_ready                  | ON                                           |
| wsrep_thread_count           | 9                                            |
+------------------------------+----------------------------------------------+

root@galdbc1c ~$ mysql -u root -ppassword -e "SHOW STATUS LIKE 'wsrep%';"
+------------------------------+----------------------------------------------+
| Variable_name                | Value                                        |
+------------------------------+----------------------------------------------+
| wsrep_local_state_uuid       | e3d6fc67-480f-11e5-9f36-e6094437fc14         |
| wsrep_protocol_version       | 7                                            |
| wsrep_last_committed         | 1839536678                                   |
| wsrep_replicated             | 0                                            |
| wsrep_replicated_bytes       | 0                                            |
| wsrep_repl_keys              | 0                                            |
| wsrep_repl_keys_bytes        | 0                                            |
| wsrep_repl_data_bytes        | 0                                            |
| wsrep_repl_other_bytes       | 0                                            |
| wsrep_received               | 13773                                        |
| wsrep_received_bytes         | 43456771                                     |
| wsrep_local_commits          | 0                                            |
| wsrep_local_cert_failures    | 0                                            |
| wsrep_local_replays          | 0                                            |
| wsrep_local_send_queue       | 0                                            |
| wsrep_local_send_queue_max   | 1                                            |
| wsrep_local_send_queue_min   | 0                                            |
| wsrep_local_send_queue_avg   | 0.000000                                     |
| wsrep_local_recv_queue       | 0                                            |
| wsrep_local_recv_queue_max   | 3                                            |
| wsrep_local_recv_queue_min   | 0                                            |
| wsrep_local_recv_queue_avg   | 0.001089                                     |
| wsrep_local_cached_downto    | 1839523222                                   |
| wsrep_flow_control_paused_ns | 0                                            |
| wsrep_flow_control_paused    | 0.000000                                     |
| wsrep_flow_control_sent      | 0                                            |
| wsrep_flow_control_recv      | 0                                            |
| wsrep_cert_deps_distance     | 64.655421                                    |
| wsrep_apply_oooe             | 0.025771                                     |
| wsrep_apply_oool             | 0.000074                                     |
| wsrep_apply_window           | 1.054140                                     |
| wsrep_commit_oooe            | 0.000000                                     |
| wsrep_commit_oool            | 0.000000                                     |
| wsrep_commit_window          | 1.044634                                     |
| wsrep_local_state            | 4                                            |
| wsrep_local_state_comment    | Synced                                       |
| wsrep_cert_index_size        | 129                                          |
| wsrep_causal_reads           | 0                                            |
| wsrep_cert_interval          | 0.012484                                     |
| wsrep_incoming_addresses     | 10.1.5.62:3306,10.1.5.63:3306,10.1.5.64:3306 |
| wsrep_evs_delayed            |                                              |
| wsrep_evs_evict_list         |                                              |
| wsrep_evs_repl_latency       | 0/0/0/0/0                                    |
| wsrep_evs_state              | OPERATIONAL                                  |
| wsrep_gcomm_uuid             | 6b87ce73-9c20-11e6-b6e8-f771c4d24568         |
| wsrep_cluster_conf_id        | 56                                           |
| wsrep_cluster_size           | 3                                            |
| wsrep_cluster_state_uuid     | e3d6fc67-480f-11e5-9f36-e6094437fc14         |
| wsrep_cluster_status         | Primary                                      |
| wsrep_connected              | ON                                           |
| wsrep_local_bf_aborts        | 0                                            |
| wsrep_local_index            | 2                                            |
| wsrep_provider_name          | Galera                                       |
| wsrep_provider_vendor        | Codership Oy <in...@codership.com>            |
| wsrep_provider_version       | 25.3.15(r3578)                               |
| wsrep_ready                  | ON                                           |
| wsrep_thread_count           | 9                                            |
+------------------------------+----------------------------------------------+


Also, consider that the tmp folder for xtrabackup is the default one (/tmp). That could also cause trouble (lines 78-79):

  1. WSREP_SST: [INFO] Using /tmp/tmp.gur61bRfSu as xtrabackup temporary directory (20161027 08:43:34.940)
  2. WSREP_SST: [INFO] Using /tmp/tmp.I6qhPbqjns as innobackupex temporary directory (20161027 08:43:34.944) 

Yeah, will change that as well.

Charles Williams

Oct 27, 2016, 5:27:17 AM
to codership
James,

No, we are currently limiting writes to the first node and letting wsrep replicate to the other two.

We are currently not even breaking 2000 writes per second.

Charles Williams

Oct 27, 2016, 6:09:56 AM
to codership
So,

I changed wsrep_slave_threads from 4 to 8 and it seems to be running better. Could it be that the replication appliers were causing a bottleneck that leads to a timeout on the slave nodes?
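For reference, wsrep_slave_threads is a dynamic variable, so a sketch of applying such a change without a restart (and then persisting it) would look like this:

  -- on each node; controls how many applier threads are scheduled
  SET GLOBAL wsrep_slave_threads = 8;

  # then persist it in /etc/mysql/my.cnf under [mysqld]:
  #   wsrep_slave_threads = 8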

hunter86bg

Oct 27, 2016, 6:15:44 AM
to codership
I saw this in line 80 of the morning logs:
 
WSREP_SST: [INFO] Streaming GTID file before SST (20161027 08:43:34.948)

And then found the following: mariadb-community-mariadb-galera-cluster-gtids-falling-out-of-sync-inspite.

I suspect that GTID could be the culprit, but I'm not using GTID, nor MariaDB (just MySQL 5.6).
Check this: MDEV-10715
If I'm right, you will have to consider either turning GTID off or jumping to 10.2.

Of course, I might be wrong; someone more experienced should confirm or reject this.
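For what it's worth, a quick way to check whether GTIDs are in play on the MariaDB side (a sketch; 10.0 exposes its GTID state through variables such as gtid_binlog_pos and gtid_current_pos):

  mysql -u root -p -e "SHOW GLOBAL VARIABLES LIKE '%gtid%';"
  # non-empty gtid_binlog_pos / gtid_current_pos means GTIDs are being generated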

hunter86bg

Oct 27, 2016, 6:21:56 AM
to codership
About the SST: it's not normal to get an SST rather than an IST, as you have 8 GB of gcache. Of course, the lagging behind must have some kind of reason.
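For context, the gcache is the on-disk ring buffer each node keeps so that a briefly lagging node can rejoin via IST; a full SST is only needed once the missed write-sets have been rotated out of it. It is sized through wsrep_provider_options, e.g. (matching the 8 GB mentioned above, and assuming no other provider options need to share the quoted string):

  # /etc/mysql/my.cnf, [mysqld] section
  wsrep_provider_options = "gcache.size=8G"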

hunter86bg

Oct 27, 2016, 6:26:16 AM
to codership
Regarding the slave threads, if those are dedicated servers you can use
SHOW STATUS LIKE 'wsrep_cert_deps_distance';

and set the slave thread count below that number.
You can read more here: #setting-parallel-slave-threads
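To illustrate the rule of thumb with the value the first node reported earlier in this thread:

  mysql -u root -p -e "SHOW STATUS LIKE 'wsrep_cert_deps_distance';"
  # e.g. 63.24 -> on average ~63 transactions could be applied in parallel,
  # so keep wsrep_slave_threads below that, bounded in practice by CPU cores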

James Wang

Oct 27, 2016, 7:00:13 AM
to codership

Is your 2000/s figure based on Com_* or Handler_* status counters, and does it include INSERTs/UPDATEs/DELETEs?
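For anyone wanting to measure it, a rough sketch that samples the statement counters twice and divides by the interval (Com_* counts statements, Handler_* counts row operations, so the two can differ considerably):

  mysql -u root -p -e "SHOW GLOBAL STATUS WHERE Variable_name IN
      ('Com_insert','Com_update','Com_delete');"
  sleep 60
  mysql -u root -p -e "SHOW GLOBAL STATUS WHERE Variable_name IN
      ('Com_insert','Com_update','Com_delete');"
  # (second sample - first sample) / 60 = write statements per second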

Charles Williams

Oct 27, 2016, 8:57:26 AM
to codership
Well, since increasing the slave threads I have not had a complete failure again, but the cluster still loses sync and I still get aborted connections (although not the hundreds of aborted connections as before).

Charles Williams

Oct 27, 2016, 8:58:26 AM
to codership
Yes, complete.

hunter86bg

Oct 28, 2016, 2:27:51 AM
to codership
It is imperative to find the reason why some nodes drop out of sync. Is there a newer version of MariaDB? Can you update it?

Charles Williams

Oct 28, 2016, 4:09:40 AM
to codership
Hunter,

Since I tweaked wsrep_slave_threads I have had no further full failures. This alone is a huge step forward. I still have to go through the logs to check for sync failures since yesterday. Will post when I have more info.

hunter86bg

Oct 28, 2016, 4:53:10 AM
to codership
Consider increasing the number of threads as per this:

James Wang

Oct 28, 2016, 5:00:39 AM
to codership

I normally set wsrep_slave_threads to 1-2x the number of CPUs.

Charles Williams

Oct 28, 2016, 5:27:12 AM
to codership
Well,

The reason for the loss of sync seems to be that certain databases on the slaves are not consistent with the master. So now I need to find out how to do a full SST for certain databases.

Charles Williams

Oct 28, 2016, 5:46:01 AM
to codership
OK,

Is it correct that I can stop a slave node, delete the grastate.dat file, and after a restart it will do a full SST of all DBs?


James Wang

Oct 28, 2016, 6:06:30 AM
to codership
In such a situation, I would "rm -rf" the datadir on the corrupted node.
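A minimal sketch of that recovery, assuming the stock Debian datadir and sysvinit on Wheezy; moving the directory aside is a slightly safer variant of deleting it outright:

  service mysql stop
  mv /var/lib/mysql /var/lib/mysql.broken     # or: rm -rf /var/lib/mysql
  mkdir /var/lib/mysql && chown mysql:mysql /var/lib/mysql
  service mysql start    # the node joins with an empty datadir and pulls a full SST

Note that an SST always copies the entire dataset from the donor; there is no per-database SST.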