Hi,
I have a galera cluster with 3 nodes. The database writes hang when I run multiple instances of test program to update
a table (different rows and same row - both the cases). When this happens, all the writes to db, from any node simply hang; but reads
go through. The test programs are running in a loop. When this happens, on one of the node, the database errors keep getting
added to the file. These error keep repeating, continuously. A snippet of the errors are noted below.
Case I - Updating the same row from all the client apps
......
TRANSACTION 1909970, ACTIVE 2930 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT
MySQL thread id 10, OS thread handle 140552223639296, query id 6422549 Update_rows_log_event::find_row(533884)
update zone_serial_number zs join temp t on t.v_z_id = zs.zone_id set zs.serial_number = ifnull(t.v_z_serial,zs.serial_number)
2019-02-14 20:08:11 140551078930176 [Note] InnoDB: WSREP: BF lock wait long for trx:0x1d24d2 query: update zone_serial_number zs join temp t on t.v_z_id = zs.zone_id set zs.serial_number = ifnull(t.v_z_serial,zs.serial_number)
TRANSACTION 1909970, ACTIVE 2931 sec starting index read
mysql tables in use 1, locked 1
LOCK WAIT
MySQL thread id 10, OS thread handle 140552223639296, query id 6422549 Update_rows_log_event::find_row(533884)
update zone_serial_number zs join temp t on t.v_z_id = zs.zone_id set zs.serial_number = ifnull(t.v_z_serial,zs.serial_number)
2019-02-14 20:08:12 140551078930176 [Note] InnoDB: WSREP: BF lock wait long for trx:0x1d24d2 query: update zone_serial_number zs join temp t on t.v_z_id = zs.zone_id set zs.serial_number = ifnull(t.v_z_serial,zs.serial_number)
TRANSACTION 1909970, ACTIVE 2932 sec starting index read
.......
Case II - Updating different rows of the same table
..........
TRANSACTION 2236864, ACTIVE 4698 sec inserting
mysql tables in use 2, locked 2
LOCK WAIT , undo log entries 1
MySQL thread id 10, OS thread handle 139905415186176, query id 5173438 Write_rows_log_event::write_row(693241)
insert into .... SQL to update the table
mysql tables in use 2, locked 2
LOCK WAIT , undo log entries 1
MySQL thread id 10, OS thread handle 139905415186176, query id 5173438 Write_rows_log_event::write_row(693241)
insert into .... SQL to update the table
TRANSACTION 2236864, ACTIVE 4699 sec inserting
mysql tables in use 2, locked 2
LOCK WAIT , undo log entries 1
MySQL thread id 10, OS thread handle 139905415186176, query id 5173438 Write_rows_log_event::write_row(693241)
insert into .... SQL to update the table
This brings the whole application to stand still. Only way to recover from this is to stop the mariadb service on the node where
the above errors occur. Once, the db is stopped on this node, other db operations on the other nodes start to work. Once, I restart the
db on the error node, it joins the cluster etc.
Question:
In what condition does this happen? Why does the operation come out with a failure instead of continuously trying in a loop?
Any thoughts or suggestions or comments on how to avoid this problem or fix it, are really appreciated.
This is critical as the while cluster is becoming useless when this codition occurs.
Environment:
Mariadb 10.2.21
Galera 25.3.25-1
CentOS 7
Cluster config section from one of the my.cnf file:
[galera]
#wasrep settings
wsrep_on=ON
wsrep_provider=/usr/lib64/galera/libgalera_smm.so
wsrep_cluster_address=gcomm://.....
wsrep_cluster_name=....
wsrep_node_address=........
wsrep_node_name=.......
wsrep_sst_method=mariabackup
wsrep_sst_auth=.......
wsrep_slave_threads=4
default_storage_engine=InnoDB
innodb_autoinc_lock_mode=2
innodb_flush_log_at_trx_commit=0
bind-address=0.0.0.0
innodb_doublewrite=1
wsrep_provider_options="pc.weight=1;pc.bootstrap=YES;pc.recovery=TRUE;pc.wait_prim=FALSE;gcache.recover=YES;pc.ignore_sb=TRUE;
;evs.keepalive_period=PT5S;evs.suspect_timeout=PT30S;evs.inactive_timeout=PT1M;evs.install_timeout=PT1M;
evs.join_retrans_period=PT1S;gcache.size=512M;gcs.fc_factor=0.8;evs.send_window=64;evs.user_send_window=32"
#wsrep_log_conflicts=ON
wsrep_retry_autocommit=4
#wsrep_debug=ON
wsrep_notify_cmd=....scripts/nodeStatusChange.sh
wsrep_auto_increment_control=ON
I can provide other setup information, if needed.
Thanks,
Chandra Kapate