The database writes hang in Galera cluster with Mariadb.

427 views

Skip to first unread message

Chandra Kapate

unread,

Feb 15, 2019, 3:31:17 PM2/15/19

to codership

Hi,

I have a galera cluster with 3 nodes. The database writes hang when I run multiple instances of test program to update

a table (different rows and same row - both the cases). When this happens, all the writes to db, from any node simply hang; but reads

go through. The test programs are running in a loop. When this happens, on one of the node, the database errors keep getting

added to the file. These error keep repeating, continuously. A snippet of the errors are noted below.

Case I - Updating the same row from all the client apps

......

TRANSACTION 1909970, ACTIVE 2930 sec starting index read

mysql tables in use 1, locked 1

LOCK WAIT

MySQL thread id 10, OS thread handle 140552223639296, query id 6422549 Update_rows_log_event::find_row(533884)

update zone_serial_number zs join temp t on t.v_z_id = zs.zone_id set zs.serial_number = ifnull(t.v_z_serial,zs.serial_number)

2019-02-14 20:08:11 140551078930176 [Note] InnoDB: WSREP: BF lock wait long for trx:0x1d24d2 query: update zone_serial_number zs join temp t on t.v_z_id = zs.zone_id set zs.serial_number = ifnull(t.v_z_serial,zs.serial_number)

TRANSACTION 1909970, ACTIVE 2931 sec starting index read

mysql tables in use 1, locked 1

LOCK WAIT

MySQL thread id 10, OS thread handle 140552223639296, query id 6422549 Update_rows_log_event::find_row(533884)

update zone_serial_number zs join temp t on t.v_z_id = zs.zone_id set zs.serial_number = ifnull(t.v_z_serial,zs.serial_number)

2019-02-14 20:08:12 140551078930176 [Note] InnoDB: WSREP: BF lock wait long for trx:0x1d24d2 query: update zone_serial_number zs join temp t on t.v_z_id = zs.zone_id set zs.serial_number = ifnull(t.v_z_serial,zs.serial_number)

TRANSACTION 1909970, ACTIVE 2932 sec starting index read

.......

Case II - Updating different rows of the same table

..........

TRANSACTION 2236864, ACTIVE 4698 sec inserting

mysql tables in use 2, locked 2

LOCK WAIT , undo log entries 1

MySQL thread id 10, OS thread handle 139905415186176, query id 5173438 Write_rows_log_event::write_row(693241)

insert into .... SQL to update the table

mysql tables in use 2, locked 2

LOCK WAIT , undo log entries 1

MySQL thread id 10, OS thread handle 139905415186176, query id 5173438 Write_rows_log_event::write_row(693241)

insert into .... SQL to update the table

TRANSACTION 2236864, ACTIVE 4699 sec inserting

mysql tables in use 2, locked 2

LOCK WAIT , undo log entries 1

MySQL thread id 10, OS thread handle 139905415186176, query id 5173438 Write_rows_log_event::write_row(693241)

insert into .... SQL to update the table

This brings the whole application to stand still. Only way to recover from this is to stop the mariadb service on the node where

the above errors occur. Once, the db is stopped on this node, other db operations on the other nodes start to work. Once, I restart the

db on the error node, it joins the cluster etc.

Question:

In what condition does this happen? Why does the operation come out with a failure instead of continuously trying in a loop?

Any thoughts or suggestions or comments on how to avoid this problem or fix it, are really appreciated.

This is critical as the while cluster is becoming useless when this codition occurs.

Environment:

Mariadb 10.2.21

Galera 25.3.25-1

CentOS 7

Cluster config section from one of the my.cnf file:

[galera]

#wasrep settings

wsrep_on=ON

wsrep_provider=/usr/lib64/galera/libgalera_smm.so

wsrep_cluster_address=gcomm://.....

wsrep_cluster_name=....

wsrep_node_address=........

wsrep_node_name=.......

wsrep_sst_method=mariabackup

wsrep_sst_auth=.......

wsrep_slave_threads=4

default_storage_engine=InnoDB

innodb_autoinc_lock_mode=2

innodb_flush_log_at_trx_commit=0

bind-address=0.0.0.0

innodb_doublewrite=1

wsrep_provider_options="pc.weight=1;pc.bootstrap=YES;pc.recovery=TRUE;pc.wait_prim=FALSE;gcache.recover=YES;pc.ignore_sb=TRUE;

;evs.keepalive_period=PT5S;evs.suspect_timeout=PT30S;evs.inactive_timeout=PT1M;evs.install_timeout=PT1M;

evs.join_retrans_period=PT1S;gcache.size=512M;gcs.fc_factor=0.8;evs.send_window=64;evs.user_send_window=32"

#wsrep_log_conflicts=ON

wsrep_retry_autocommit=4

#wsrep_debug=ON

wsrep_notify_cmd=....scripts/nodeStatusChange.sh

wsrep_auto_increment_control=ON

I can provide other setup information, if needed.

Thanks,

Chandra Kapate

Reply all

Reply to author

Forward

0 new messages