Mariadb Galera wsrep GTID cluster - after update

Peter Bajan

Apr 14, 2023, 7:18:34 AM
to codership
I had a fully functional MariaDB 10.5.16 installation with a Galera cluster (26.4.11). There were 3 nodes, and GTID was set up and working properly, with every server having identical settings, including server_id = 1 and wsrep_gtid_domain_id=11.

After upgrading to 10.5.19 (Galera 26.4.14), GTID numbering stopped behaving predictably. I noticed the problem when restoring one of the nodes using SST: the node began producing local GTIDs, or its GTID diverged from the other servers, and I was unable to get the GTIDs in the Galera cluster back in sync.

Currently I am testing MariaDB 10.11.2. I even tried restarting the new Galera cluster, followed by SST on the joining nodes, changing the ID numbers, and so on, but nothing worked. I assumed it could be a bug, but after going through all the MariaDB versions up to 10.11.2, there appears to be some new functionality in Galera GTID numbering that doesn't make sense to me.

Have you encountered a similar problem? Is there any documentation on the new behavior of Galera GTID numbering and syncing? I haven't found any information on it.

Seppo Jaakola

Apr 14, 2023, 8:15:45 AM
to codership
This could be related to using KILL commands. At some point in the past, KILL command handling was changed so that the KILL is replicated into the cluster, and this consumes one GTID.
There is an ongoing effort to handle KILL in the cluster without replication. This refactoring is happening as part of the fixes for https://jira.mariadb.org/browse/MDEV-29293

Peter Bajan

Apr 17, 2023, 7:18:14 AM
to codership
Thank you for the information. It does seem likely to be related. For completeness, here are further details and the results of my investigation.

Workaround for syncing GTID in a Galera cluster with 3 nodes on Ubuntu 18.04:
GTIDs in my cluster have the form 1-11-1.

    Start a new Galera cluster (reset binlog and gcache).
    Only one node is running: P1.
    The Galera cluster at this moment only has node P1, which has GTID 1-11-1.
    All writes to the database server are stopped.
    Connect P2 - an SST is performed.
    Node P2 has GTID 1-11-1 (so far, this is okay).
    Perform the first SQL transaction (whether on node P1 or P2 does not matter).
    GTID on P1 increases to 1-11-2.
    On P2, the GTID splits and becomes 0-1-2, 1-11-1. Further SQL transactions increment only the local GTID (0-1-3, 1-11-1, ...), while the Galera cluster GTID remains unchanged.
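
The split in the last step can be spotted mechanically. Below is a minimal Python sketch, not part of the original report, that parses a MariaDB GTID position string (for example the value of gtid_binlog_pos) and lists every entry outside the cluster domain. MariaDB GTIDs have the form domain_id-server_id-seq_no. The helper names parse_gtid_pos and stray_domains are hypothetical, and the cluster domain 1 is an assumption read off the first field of 1-11-1.

```python
from typing import List, NamedTuple


class Gtid(NamedTuple):
    """One MariaDB GTID: domain_id-server_id-seq_no."""
    domain_id: int
    server_id: int
    seq_no: int


def parse_gtid_pos(pos: str) -> List[Gtid]:
    """Parse a GTID position string such as '0-1-2,1-11-1'."""
    gtids = []
    for part in pos.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate empty strings and trailing commas
        domain_id, server_id, seq_no = (int(x) for x in part.split("-"))
        gtids.append(Gtid(domain_id, server_id, seq_no))
    return gtids


def stray_domains(pos: str, cluster_domain_id: int) -> List[Gtid]:
    """Return the GTIDs whose domain differs from the cluster domain."""
    return [g for g in parse_gtid_pos(pos) if g.domain_id != cluster_domain_id]


if __name__ == "__main__":
    # Positions copied from the steps above.
    # Assumption: the cluster domain is 1 (first field of "1-11-1").
    print(stray_domains("1-11-2", 1))        # P1: prints []
    print(stray_domains("0-1-2,1-11-1", 1))  # P2: lists the local 0-1-2 entry
```

An empty result on every node would indicate that all writes are being numbered in the cluster domain. A non-empty result, as on P2 here, shows the local domain that has started to collect transactions.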

Solution:
After performing SST on node P2, I must prevent any SQL transactions from executing and restart the MariaDB process. After that, the GTID is in sync and works correctly.

Observation:
After SST, the GTID splits into a local and a cluster-wide part at the exact moment the SST completes. All new transactions are written to the local GTID (in my case, 0-11-x).

Please note that the transaction number in the local GTID on node P2 changes to 2, which is the same transaction number as on node P1. On P1, however, the transaction is recorded in the Galera cluster GTID (1-11-2). It seems as if, after the SST, the newly added node did not have, or ignored, the information about wsrep_gtid_domain_id, even though it is explicitly set in the configuration file (in my case, to 1).

Maybe this information will help someone.



