Re-bootstrapping a cluster


Jay Janssen

Oct 29, 2012, 11:02:41 AM
to codersh...@googlegroups.com
Everyone knows that sticking a gcomm:// back into your wsrep_cluster_address (or wsrep_urls) will re-bootstrap the cluster.  This is really re-initializing the cluster, and requires all the other nodes to re-SST.

Few know that you can re-bootstrap a running node that is non-Primary with 'set global wsrep_provider_options="pc.bootstrap=true";'.  The advantage here is you are re-using the same internal cluster state so as nodes reconnect here, they can just IST if they have the right state (I think).  


However, there is nothing in between these two options.  I propose that there should be a way to restart a cluster in which all the nodes have failed without using gcomm://, one that *at least* gets a node or two back into a non-Primary state.


Use case:  your over-zealous colo operations need to reboot your cluster nodes to enable the Battery-backed write cache on your RAID controllers (purely hypothetical, but oddly specific, I know).  Instead of obeying your instructions to do one node at a time, they decide to save time by downing all the cluster nodes at once.    The nodes were in a stable state when they were taken down, but because there is no running cluster partition for any of these nodes to connect to, you are forced to use gcomm:// and re-SST all the other nodes (which sucks).  


I *think* it might be reasonable to simply let cluster nodes start even if they can't reach an existing cluster node (starting as non-Primary is fine), but I defer to the collective wisdom here.  



Jay Janssen, Senior MySQL Consultant, Percona Inc.
Percona Live in London, UK Dec 3-4th: http://www.percona.com/live/london-2012/

Alex Yurchenko

Oct 29, 2012, 1:04:51 PM
to codersh...@googlegroups.com
Hi Jay,

I think you have a slightly wrong picture here.

1) IST vs SST logic depends *solely* on the availability of required
writesets in the donor cache. It does not depend on any other
circumstances, except what "availability of required writesets" implies:
that the joiner knows its GTID (so it knows what it requires)
and the donor has a cache (so that it has them available).

So whenever (it is also important to understand what it means) you need
a state transfer, this simple check will be done and donor will act
accordingly. With the next MySQL-wsrep release almost any *individual*
node crash/restart should end with IST.
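
The check described in 1) can be sketched as a small shell function. This is an illustration only, not Galera's code: the function name choose_transfer and its arguments (the joiner's GTID, the cluster UUID, and the lowest and highest seqno the donor still holds in its cache) are hypothetical.

```shell
#!/bin/sh
# Illustrative only: decide between IST and SST the way the text describes.
# All names here are hypothetical; this is not Galera's implementation.
#   $1 joiner GTID as uuid:seqno (seqno -1 means "position unknown")
#   $2 cluster (group) UUID
#   $3 lowest seqno still held in the donor cache
#   $4 highest seqno held in the donor cache (the current cluster position)
choose_transfer() {
    joiner_uuid=${1%:*}
    joiner_seqno=${1##*:}

    # The joiner must know its GTID, and it must belong to this cluster.
    if [ "$joiner_uuid" != "$2" ] || [ "$joiner_seqno" -lt 0 ]; then
        echo SST
        return
    fi

    # Nothing is missing: no state transfer is needed at all.
    if [ "$joiner_seqno" -eq "$4" ]; then
        echo NONE
        return
    fi

    # The first writeset the joiner lacks is seqno+1; the donor can serve
    # IST only if that writeset is still in its cache.
    if [ $((joiner_seqno + 1)) -ge "$3" ] && [ "$joiner_seqno" -lt "$4" ]; then
        echo IST
    else
        echo SST
    fi
}

# Joiner is 76 writesets behind and the donor cache still covers the gap.
choose_transfer "5ebe53e3-1d1b-11e2-0800-220835277bbd:1400624" \
    "5ebe53e3-1d1b-11e2-0800-220835277bbd" 1400000 1400700
```

The cache bounds in the example call are invented; only the GTID format comes from the logs in this thread.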

2) "Whenever" means when the node is informed that the cluster it is
connected to changed. In group communication speak it is called
"configuration change" and does not necessarily involve membership
change (but usually does). CC is an important checkpoint, sort of a
"flush" method. E.g. if a node joins the cluster, it will see all
messages after and including CC and no messages before.

'set global wsrep_provider_options="pc.bootstrap=true";' works by
generating a CC in which the cluster state changes from non-primary to
primary.

So with the 2.2 release you can now use multiple (comma-separated)
addresses in the gcomm:// URL and create a non-primary cluster just by
starting the nodes and connecting them to each other. (IIRC you could do it
with 2.1 too, you just needed to specify the address of the peer in
gcomm://) And then use the "pc.bootstrap=true" directive to make this
cluster a primary component. We want to give it a test drive and, when it
works as expected, deprecate empty gcomm:// usage as too many people get
burned by it and it seems unsafe.
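
As a rough sketch of that procedure, assuming three made-up node addresses: each node is started with a comma-separated peer list, and once the nodes have formed a non-primary component the bootstrap statement is run on one of them. The helper name gcomm_url is hypothetical.

```shell
#!/bin/sh
# Build a multi-address gcomm:// URL from a list of peers.
# gcomm_url and the addresses below are hypothetical.
gcomm_url() {
    url="gcomm://"
    sep=""
    for peer in "$@"; do
        url="${url}${sep}${peer}"
        sep=","
    done
    echo "$url"
}

cluster_address=$(gcomm_url 10.0.0.1 10.0.0.2 10.0.0.3)
echo "wsrep_cluster_address=$cluster_address"

# Once the nodes are up and connected to each other (non-primary), this
# statement is run on ONE node to promote the component to primary.
bootstrap_sql='SET GLOBAL wsrep_provider_options="pc.bootstrap=true";'
echo "$bootstrap_sql"
```

The script only prints the setting and the statement; applying them to my.cnf and a live node is left to the operator.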

I hope this starts answering your question. Now with your particular
example there is a problem, totally unrelated to gcomm://.

As a trivial case of this consider shutting down an idle cluster. It
will restart without SST or IST as there will be no need for state
transfer at all. And this is regardless of whether you use gcomm:// or
pc.bootstrap=1.

Now if we shut down (and with GTID recovery from InnoDB it almost does
not matter, whether gracefully or not) a working cluster, different
nodes may shut down at different GTIDs. So regardless of how you restart
the cluster, the less updated nodes will need state transfer from the
most updated nodes. But currently gcache is not persistent and does not
survive even graceful node restart (that's something that can be fixed),
so you will end up with SST even if you miss a single writeset.

Hope this explains.

Regards,
Alex
--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011

Vadim Tkachenko

Oct 29, 2012, 1:27:05 PM
to Alex Yurchenko, codersh...@googlegroups.com
Alex,

On this topic, can you explain how "GTID recovery from InnoDB" works?
I could not get it working in my recent tests.

Thanks,
Vadim

--
Vadim Tkachenko, CTO, Percona Inc.
Phone +1-925-400-7377, Skype: vadimtk153
Schedule meeting: http://tungle.me/VadimTkachenko

Looking for Replication with Data Consistency?
Try Percona XtraDB Cluster!

Alex Yurchenko

Oct 29, 2012, 1:42:46 PM
to Vadim Tkachenko, codersh...@googlegroups.com
On 2012-10-29 19:27, Vadim Tkachenko wrote:
> Alex,
>
> On this topic, can you explain how "GTID recovery from InnoDB"
> works.
> I could not make it working in my recent tests.
>
> Thanks,
> Vadim

Vadim,

You start mysqld with the --wsrep-recover option and it prints the GTID to
the error log. Have a look at mysqld_safe from lp:codership-mysql/5.5 head.
Let us know if you still have problems.

Vadim Tkachenko

Oct 29, 2012, 1:50:40 PM
to Alex Yurchenko, codersh...@googlegroups.com
Alex,

After I kill the server and start it, I have this in error.log:

121029 10:48:55 mysqld_safe Number of processes running now: 0
121029 10:48:55 mysqld_safe mysqld restarted
121029 10:48:55 mysqld_safe WSREP: Running position recovery
121029 10:49:02 mysqld_safe WSREP: Failed to recover position:
121029 10:49:03 [Note] WSREP: Read nil XID from storage engines,
skipping position init
121029 10:49:03 [Note] WSREP: wsrep_load(): loading provider library
'/usr/local/mysql/lib/libgalera_smm.so'
121029 10:49:03 [Note] WSREP: wsrep_load(): Galera 2.2(r137) by
Codership Oy <in...@codership.com> loaded succesfully.
121029 10:49:03 [Note] WSREP: Found saved state:
935c7aa3-1eec-11e2-0800-31d20f3481c6:-1
121029 10:49:03 [Note] WSREP: Reusing existing '/mnt/data/mysql//galera.cache'.
121029 10:49:03 [Note] WSREP: Passing config to GCS: base_host =
10.5.171.147; base_port = 4567; cert.log_conflicts = no; gcache.dir =
/mnt/data/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0;
gcache.name = /mnt/data/mysql//galera.cache; gcache.page_size = 128M;
gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit
= 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500;
gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807;
gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO;
replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
121029 10:49:03 [Note] WSREP: Assign initial position for
certification: -1, protocol version: -1
121029 10:49:03 [Note] WSREP: wsrep_sst_grab()
121029 10:49:03 [Note] WSREP: Start replication
121029 10:49:03 [Note] WSREP: Setting initial position to
00000000-0000-0000-0000-000000000000:-1
121029 10:49:03 [Note] WSREP: protonet asio version 0
121029 10:49:03 [Note] WSREP: backend: asio




Alex Yurchenko

Oct 29, 2012, 4:29:19 PM
to codersh...@googlegroups.com
Vadim,

You must be missing the --wsrep-recover parameter. This is what it looks
like when starting with --wsrep-recover:

121029 22:26:13 [Warning] Changed limits: max_open_files: 1024
max_connections: 214 table_cache: 400
121029 22:26:13 InnoDB: The InnoDB memory heap is disabled
121029 22:26:13 InnoDB: Mutexes and rw_locks use GCC atomic builtins
121029 22:26:13 InnoDB: Compressed tables use zlib 1.2.3.4
121029 22:26:13 InnoDB: Initializing buffer pool, size = 1.0G
121029 22:26:13 InnoDB: Completed initialization of buffer pool
121029 22:26:13 InnoDB: highest supported file format is Barracuda.
InnoDB: The log sequence number in ibdata files does not match
InnoDB: the log sequence number in the ib_logfiles!
121029 22:26:13 InnoDB: Database was not shut down normally!
InnoDB: Starting crash recovery.
InnoDB: Reading tablespace information from the .ibd files...
InnoDB: Restoring possible half-written data pages from the doublewrite
InnoDB: buffer...
121029 22:26:16 InnoDB: Waiting for the background threads to start
121029 22:26:17 InnoDB: 1.1.8 started; log sequence number 1657970747
121029 22:26:17 [Note] Server hostname (bind-address): '0.0.0.0'; port:
3303
121029 22:26:17 [Note] - '0.0.0.0' resolves to '0.0.0.0';
121029 22:26:17 [Note] Server socket created on IP: '0.0.0.0'.
121029 22:26:17 [Note] Event Scheduler: Loaded 0 events
121029 22:26:17 [Note] WSREP: Recovered position:
5ebe53e3-1d1b-11e2-0800-220835277bbd:1400624
121029 22:26:17 InnoDB: Starting shutdown...
121029 22:26:18 InnoDB: Shutdown completed; log sequence number
1657970747
121029 22:26:18 [Note] /tmp/galera1/mysql/sbin/mysqld: Shutdown
complete
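
The recovered position can be pulled out of such a log with a one-line filter. A minimal sketch, assuming the "WSREP: Recovered position:" note sits on a single line in the real log file (it is wrapped in the mail above). The function name recovered_position is hypothetical, and this is not necessarily how mysqld_safe does it.

```shell
#!/bin/sh
# Print the last "uuid:seqno" reported by a --wsrep-recover run.
# $1 is the path of the log file written during recovery.
recovered_position() {
    sed -n 's/.*WSREP: Recovered position:[[:space:]]*//p' "$1" | tail -n 1
}

# Example with a sample log line taken from the output above.
sample_log=$(mktemp)
echo '121029 22:26:17 [Note] WSREP: Recovered position: 5ebe53e3-1d1b-11e2-0800-220835277bbd:1400624' > "$sample_log"
start_position=$(recovered_position "$sample_log")
rm -f "$sample_log"

# The result could then be handed to the real start, for example:
#   mysqld --wsrep_start_position="$start_position" ...
echo "$start_position"
```

If the log contains no such line the function prints nothing, which a caller should treat as a failed recovery.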

Vadim Tkachenko

Oct 29, 2012, 4:31:35 PM
to Alex Yurchenko, codersh...@googlegroups.com
Alex,

I can try that, but I wonder how it works in combination with mysqld_safe?

As you know mysqld_safe starts mysqld by itself when mysqld crashes.

Alex Yurchenko

Oct 29, 2012, 5:10:54 PM
to Vadim Tkachenko, codersh...@googlegroups.com
Vadim,

I have missed that you used mysqld_safe there:

>>> 121029 10:48:55 mysqld_safe Number of processes running now: 0
>>> 121029 10:48:55 mysqld_safe mysqld restarted
>>> 121029 10:48:55 mysqld_safe WSREP: Running position recovery
>>> 121029 10:49:02 mysqld_safe WSREP: Failed to recover position:

It looks like it tried to do recovery but failed. In particular, it
appears to have failed to create and write an error log file when
running with --wsrep-recover. The error log file is created with mktemp,
so see why this could have failed... Another possibility is of course that
mysqld crashed before even writing anything to the log...

Vadim Tkachenko

Oct 29, 2012, 5:22:40 PM
to Alex Yurchenko, codersh...@googlegroups.com
Alex,

This is repeatable for me on many systems.
Here is one more log:

121029 14:20:44 mysqld_safe Starting mysqld daemon with databases from
/mnt/data/mysql
121029 14:20:44 mysqld_safe WSREP: Running position recovery
121029 14:20:51 mysqld_safe WSREP: Failed to recover position:
121029 14:20:51 [Note] WSREP: Read nil XID from storage engines,
skipping position init
121029 14:20:51 [Note] WSREP: wsrep_load(): loading provider library
'/usr/local/mysql/lib/libgalera_smm.so'
121029 14:20:51 [Note] WSREP: wsrep_load(): Galera 2.2(r137) by
Codership Oy <in...@codership.com> loaded succesfully.
121029 14:20:51 [Note] WSREP: Found saved state:
2f8acf7d-1f93-11e2-0800-316fe0e276a3:-1
121029 14:20:51 [Note] WSREP: Reusing existing '/mnt/data/mysql//galera.cache'.
121029 14:20:51 [Note] WSREP: Passing config to GCS: base_host =
10.7.77.252; base_port = 4567; cert.log_conflicts = no; gcache.dir =
/mnt/data/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0;
gcache.name = /mnt/data/mysql//galera.cache; gcache.page_size = 128M;
gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit
= 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500;
gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807;
gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO;
replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
121029 14:20:51 [Note] WSREP: Assign initial position for
certification: -1, protocol version: -1
121029 14:20:51 [Note] WSREP: wsrep_sst_grab()
121029 14:20:51 [Note] WSREP: Start replication

I do nothing fancy with error.log, the regular setup from my.cnf,
which is:

[mysqld]
datadir=/mnt/data/mysql
user=mysql

log_error=error.log

binlog_format=ROW

wsrep_provider=/usr/local/mysql/lib/libgalera_smm.so

wsrep_cluster_address=gcomm://10.7.75.174

wsrep_slave_threads=4
wsrep_cluster_name=trimethylxanthine
wsrep_sst_method=xtrabackup
wsrep_node_name=node4

innodb_locks_unsafe_for_binlog=1
innodb_autoinc_lock_mode=2

innodb_file_per_table = true
innodb_data_file_path = ibdata1:10M:autoextend
innodb_flush_log_at_trx_commit = 2
innodb_flush_method = O_DIRECT
innodb_log_buffer_size = 16M

innodb_log_file_size=128M
innodb_buffer_pool_size=10G

innodb_read_io_threads = 4
innodb_write_io_threads = 4
innodb_io_capacity=500






Alex Yurchenko

Oct 29, 2012, 6:20:08 PM
to Vadim Tkachenko, codersh...@googlegroups.com
Vadim,

If you look into mysqld_safe you will see that it tries to start mysqld
2 times:

First with --wsrep-recover:

> 121029 14:20:44 mysqld_safe WSREP: Running position recovery
> 121029 14:20:51 mysqld_safe WSREP: Failed to recover position:

and then for real:

> 121029 14:20:51 [Note] WSREP: Read nil XID from storage engines,
> skipping position init
> 121029 14:20:51 [Note] WSREP: wsrep_load(): loading provider library
> '/usr/local/mysql/lib/libgalera_smm.so'
and on.

What I was talking about is the first instantiation, where it is
started with --log_error=<name> option. So it logs everything there.
<name> is created by mktemp and by default goes to /tmp. And it looks
very much like nothing is created or written to. Please check
mysqld_safe code to see what is going on. There must be some system
configuration problem. Do you have mktemp on your system and does it
work? Is it in the path for mysqld_safe?

Vadim Tkachenko

Oct 29, 2012, 6:37:58 PM
to Alex Yurchenko, codersh...@googlegroups.com
Alex,

Yes, I figured out your script.
Now I see a few problems with it (I am not sure if this is CentOS specific).

mktemp creates a file like
-rw------- 1 root root /tmp/tmp.UfwD1rJknX

(I start mysqld_safe as root)

However, I run mysqld as user 'mysql',

so it can't use a file that is 'rw' only for root.
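
The permission problem is easy to reproduce outside mysqld_safe. A small sketch (the variable names are illustrative): mktemp creates the file readable and writable by its owner only, so a process running as another user, such as mysqld under 'mysql', cannot write to it.

```shell
#!/bin/sh
# mktemp creates the file with owner-only permissions.
tmp_log=$(mktemp)

# First ten characters of the ls -l line are the type and permission bits.
mode=$(ls -l "$tmp_log" | cut -c1-10)
echo "mode of $tmp_log: $mode"     # typically -rw-------

rm -f "$tmp_log"
```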






Vadim Tkachenko

Oct 29, 2012, 7:00:57 PM
to Alex Yurchenko, codersh...@googlegroups.com
Alex,

So, to fix the problem with the user you need to use the following:

wr_logfile=$(mktemp)
chown $user $wr_logfile
chmod 755 $wr_logfile

However, here we face another problem.

The script is trying to start mysqld with the following options:
WSREP: nohup /usr/local/Percona-XtraDB-Cluster-5.5.28-362/bin/mysqld
--basedir=/usr/local/Percona-XtraDB-Cluster-5.5.28-362
--datadir=/mnt/data/mysql
--plugin-dir=/usr/local/Percona-XtraDB-Cluster-5.5.28-362/lib/mysql/plugin
--user=mysql --log-error=/mnt/data/mysql/error.log
--pid-file=/mnt/data/mysql/server-1351288709-az-2-region-a-geo-1.localdomain.pid
< /dev/null >> /mnt/data/mysql/error.log 2>&1
--log_error=/tmp/tmp.6kbKgwpluf --wsrep-recover

and it fails with the following error:

/usr/local/Percona-XtraDB-Cluster-5.5.28-362/bin/mysqld: Too many
arguments (first extra is '<')

so it does not produce a position in /tmp/tmp.6kbKgwpluf
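
The "first extra is '<'" message is what one would expect if the command line, including its redirections, is kept in a string and the extra options are appended to it: after plain variable expansion the shell treats '<' and '>>' as ordinary words and passes them on as arguments. Whether this is exactly what the script does is an assumption. A self-contained sketch of the effect, using a stand-in function count_args instead of mysqld:

```shell
#!/bin/sh
# Stand-in for mysqld: report how many arguments were received.
count_args() {
    echo "$#"
}

# Command string that already contains a redirection.
cmd="count_args --user=mysql < /dev/null"

# Plain expansion: '<' and '/dev/null' arrive as literal arguments.
plain=$($cmd --wsrep-recover)

# eval re-parses the string, so '< /dev/null' becomes a real redirection.
evaled=$(eval "$cmd --wsrep-recover")

echo "plain expansion: $plain arguments"
echo "with eval:       $evaled arguments"
```

Building the recovery command from the bare mysqld options, without the redirections, would sidestep the problem.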

Vadim Tkachenko

Oct 29, 2012, 7:33:39 PM
to Alex Yurchenko, codersh...@googlegroups.com
Alex,

With my proposed fix there are several scenarios, and one of them
gives me a very serious problem.

The scenario is as follows:
we did a clean shutdown (via mysqladmin shutdown).

Now we start with
"mysqld_safe &"

In this case the script works, but mysqld does not get to InnoDB
initialization and does not create the socket.
Here is the log:

121029 16:28:08 mysqld_safe Starting mysqld daemon with databases from
/mnt/data/mysql
121029 16:28:08 mysqld_safe WSREP: Running position recovery
121029 16:28:15 mysqld_safe WSREP: Recovered position
2f8acf7d-1f93-11e2-0800-316fe0e276a3:62610891
121029 16:28:15 [Note] WSREP: wsrep_start_position var submitted:
'2f8acf7d-1f93-11e2-0800-316fe0e276a3:62610891'
121029 16:28:15 [Note] WSREP: Read nil XID from storage engines,
skipping position init
121029 16:28:15 [Note] WSREP: wsrep_load(): loading provider library
'/usr/local/mysql/lib/libgalera_smm.so'
121029 16:28:15 [Note] WSREP: wsrep_load(): Galera 2.2(r137) by
Codership Oy <in...@codership.com> loaded succesfully.
121029 16:28:15 [Note] WSREP: Found saved state:
2f8acf7d-1f93-11e2-0800-316fe0e276a3:62610891
121029 16:28:15 [Note] WSREP: Reusing existing '/mnt/data/mysql//galera.cache'.
121029 16:28:15 [Note] WSREP: Passing config to GCS: base_host =
10.7.77.252; base_port = 4567; cert.log_conflicts = no; gcache.dir =
/mnt/data/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0;
gcache.name = /mnt/data/mysql//galera.cache; gcache.page_size = 128M;
gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1; gcs.fc_limit
= 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500;
gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807;
gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = NO;
replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
121029 16:28:15 [Note] WSREP: Assign initial position for
certification: 62610891, protocol version: -1
121029 16:28:15 [Note] WSREP: wsrep_sst_grab()
121029 16:28:15 [Note] WSREP: Start replication
121029 16:28:15 [Note] WSREP: Setting initial position to
2f8acf7d-1f93-11e2-0800-316fe0e276a3:62610891
121029 16:28:15 [Note] WSREP: protonet asio version 0
121029 16:28:15 [Note] WSREP: backend: asio
121029 16:28:15 [Note] WSREP: GMCast version 0
121029 16:28:15 [Note] WSREP: (50558e15-2220-11e2-0800-1792f80bd29d,
'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
121029 16:28:15 [Note] WSREP: (50558e15-2220-11e2-0800-1792f80bd29d,
'tcp://0.0.0.0:4567') multicast: , ttl: 1
121029 16:28:15 [Note] WSREP: EVS version 0
121029 16:28:15 [Note] WSREP: PC version 0
121029 16:28:15 [Note] WSREP: gcomm: connecting to group
'trimethylxanthine', peer '10.7.75.174:'
121029 16:28:15 [Note] WSREP: (50558e15-2220-11e2-0800-1792f80bd29d,
'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive
peers: tcp://10.7.74.195:4567 tcp://10.7.76.10:4567
121029 16:28:15 [Note] WSREP: (50558e15-2220-11e2-0800-1792f80bd29d,
'tcp://0.0.0.0:4567') cleaning up duplicate 0x14dd3f0 after
established 0x14d4d90
121029 16:28:15 [Note] WSREP: (50558e15-2220-11e2-0800-1792f80bd29d,
'tcp://0.0.0.0:4567') turning message relay requesting off
121029 16:28:16 [Note] WSREP: declaring
2f8a1896-1f93-11e2-0800-206d6e27783b stable
121029 16:28:16 [Note] WSREP: declaring
6c47b8cf-1f95-11e2-0800-f4fdffe1cfa7 stable
121029 16:28:16 [Note] WSREP: declaring
f7a93de1-1f94-11e2-0800-fb06dc57f95a stable
121029 16:28:16 [Note] WSREP:
view(view_id(PRIM,2f8a1896-1f93-11e2-0800-206d6e27783b,60) memb {
2f8a1896-1f93-11e2-0800-206d6e27783b,
50558e15-2220-11e2-0800-1792f80bd29d,
6c47b8cf-1f95-11e2-0800-f4fdffe1cfa7,
f7a93de1-1f94-11e2-0800-fb06dc57f95a,
} joined {
} left {
} partitioned {
})
121029 16:28:16 [Note] WSREP: gcomm: connected
121029 16:28:16 [Note] WSREP: Changing maximum packet size to 64500,
resulting msg size: 32636
121029 16:28:16 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
121029 16:28:16 [Note] WSREP: Opened channel 'trimethylxanthine'
121029 16:28:16 [Note] WSREP: New COMPONENT: primary = yes, bootstrap
= no, my_idx = 1, memb_num = 4
121029 16:28:16 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
121029 16:28:16 [Note] WSREP: Waiting for SST to complete.
121029 16:28:16 [Note] WSREP: STATE EXCHANGE: sent state msg:
508d6c4b-2220-11e2-0800-818688cdec35
121029 16:28:16 [Note] WSREP: STATE EXCHANGE: got state msg:
508d6c4b-2220-11e2-0800-818688cdec35 from 0 (node1)
121029 16:28:16 [Note] WSREP: STATE EXCHANGE: got state msg:
508d6c4b-2220-11e2-0800-818688cdec35 from 2 (node4)
121029 16:28:16 [Note] WSREP: STATE EXCHANGE: got state msg:
508d6c4b-2220-11e2-0800-818688cdec35 from 3 (node3)
121029 16:28:16 [Note] WSREP: STATE EXCHANGE: got state msg:
508d6c4b-2220-11e2-0800-818688cdec35 from 1 (node4)
121029 16:28:16 [Note] WSREP: Quorum results:
version = 2,
component = PRIMARY,
conf_id = 53,
members = 4/4 (joined/total),
act_id = 62610891,
last_appl. = -1,
protocols = 0/4/2 (gcs/repl/appl),
group UUID = 2f8acf7d-1f93-11e2-0800-316fe0e276a3
121029 16:28:16 [Note] WSREP: Flow-control interval: [32, 32]
121029 16:28:16 [Note] WSREP: Restored state OPEN -> JOINED (62610891)
121029 16:28:16 [Note] WSREP: New cluster view: global state:
2f8acf7d-1f93-11e2-0800-316fe0e276a3:62610891, view# 54: Primary,
number of nodes: 4, my index: 1, protocol version 2
121029 16:28:16 [Note] WSREP: wsrep_notify_cmd is not defined,
skipping notification.
121029 16:28:16 [Note] WSREP: Assign initial position for
certification: 62610891, protocol version: 2
121029 16:28:16 [Note] WSREP: Member 1 (node4) synced with group.
121029 16:28:16 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 62610891)
121029 16:28:16 [Note] WSREP: Synchronized with group, ready for connections
121029 16:28:16 [Note] WSREP: wsrep_notify_cmd is not defined,
skipping notification.

At this stage nothing happens and, as you see from the log, InnoDB and
the socket were not initialized.

Thanks,
Vadim

Alex Yurchenko

Oct 30, 2012, 4:45:44 AM
to Vadim Tkachenko, codersh...@googlegroups.com
I see. Do you have wsrep_slave_threads=1 there?

Jay Janssen

Oct 30, 2012, 8:40:50 AM
to Alex Yurchenko, codersh...@googlegroups.com
On Oct 29, 2012, at 1:04 PM, Alex Yurchenko <alexey.y...@codership.com> wrote:

> Hi Jay,
>
> I think you're getting a slightly wrong profile here.
>
> 1) IST vs SST logic depends *solely* on the availability of required writesets in the donor cache. It does not depend on any other circumstances, except what "availability of required writesets" imply - that is that the joiner knows its GTID (so it knows what it required) and donor having a cache (so that it has them available)
>
> So whenever (it is also important to understand what it means) you need a state transfer, this simple check will be done and donor will act accordingly. With the next MySQL-wsrep release almost any *individual* node crash/restart should end with IST.
>
> 2) "Whenever" means when the node is informed that the cluster it is connected to changed. In group communication speak it is called "configuration change" and does not necessarily involve membership change (but usually does). CC is an important checkpoint, sort of a "flush" method. E.g. if a node joins the cluster, it will see all messages after and including CC and no messages before.
>
> 'set global wsrep_provider_options="pc.bootstrap=true";' works by generating CC in which cluster state changes from non-primary to primary.
>
> So with 2.2 release you now can use multiple (comma-separated) addresses in gcomm:// URL and create non-primary cluster just by starting the nodes and connecting to each other. (IIRC you could do it with 2.1 too, you just needed to specify the address of the peer in gcomm://) And then use the "pc.bootstrap=true" directive to make this cluster a primary component. We want to give it a test drive and when it works as expected, deprecate empty gcomm:// usage as too many people get burned by it and it seems unsafe.
>
> I hope this starts answering your question. Now with your particular example there is a problem, totally unrelated to gcomm://.
>
> As a trivial case of this consider shutting down an idle cluster. It will restart without SST or IST as there will be no need for state transfer at all. And this is regardless of whether you use gcomm:// or pc.bootstrap=1.

I know there is some distance between PXC and Galera-HEAD, but every time I've restarted the first node in a non-running cluster (i.e., no nodes up at all):

- mysqld can't start without gcomm://
- This always triggers SSTs 

However, based on what you said above, it sounds like you are already considering just deprecating gcomm:// and allowing non-primary sets of the cluster to restart.  


> Now if we shut down (and with GTID recovery from InnoDB it almost does not matter, whether gracefully or not) a working cluster, different nodes may shut down at different GTIDs. So regardless of how you restart the cluster, the less updated nodes will need state transfer from the most updated nodes. But currently gcache is not persistent and does not survive even graceful node restart (that's something that can be fixed), so you will end up with SST even if you miss a single writeset.


--wsrep-recover is not documented on your wiki AFAICT.  Can you explain circumstances where it makes sense that this option is not always enabled?


Ultimately, the cluster should be capable of recovering to the best of its ability from an all-down scenario.  I recognize the technical limitations that may be inherent in this, but from a functional point of view if the power suddenly goes out on all nodes:

- They should all be in the same cluster config state
- They may have different writesets (i.e., wsrep_last_committed might vary slightly)
- I'd expect to be able to start up all the nodes and have the cluster sort out what is missing where and recover to the best of its ability (and tell me where it could not).
- The worst case seems to me to be commits in InnoDB on a single node that were only queued on all other nodes and did not persist in any node's gcache.

How does this differ from what Galera is capable of today in the same scenario?

Vadim Tkachenko

Oct 30, 2012, 10:18:49 AM
to Alex Yurchenko, codersh...@googlegroups.com
wsrep_slave_threads=4



Alex Yurchenko

Oct 30, 2012, 1:26:06 PM
to Jay Janssen, codersh...@googlegroups.com
> - mysqld can't start without gcomm://

Indeed it can't. It times out when it can't connect to the cluster. It does
not have to, but if it doesn't you'll either have a startup script waiting
(potentially forever) for it to reach the primary component, or returning
success when that is not necessarily true. So our choice is to time out if
the primary component can't be connected to for a certain amount of time
(should be 30 seconds). It is an old behavior and is the same in PXC and
Galera-HEAD.

So you need to specify either gcomm:// or gcomm://<address of a running
node>

> - This always triggers SSTs

When the node fails to start it invalidates its GTID for safety
purposes (although in some cases that might be avoided, arguably). If
you had an unsuccessful node start, the next time it starts with a new
UUID and so is incomparable with other nodes.

However if you try this:
1) start the cluster anew.
2) wait for all nodes to become synced
3) shut all nodes down.
4) start the cluster as you did in 1)
- you won't have ANY state transfer.

> However, based on what you said above, it sounds like you are already
> considering just deprecating gcomm:// and allowing non-primary sets
> of
> the cluster to restart.

Well, only to assemble the initial cluster. If you use gcomm://<address
unreachable in 30 seconds> the node will time out and shut down.

>>
>> Now if we shut down (and with GTID recovery from InnoDB it almost
>> does not matter, whether gracefully or not) a working cluster,
>> different nodes may shut down at different GTIDs. So regardless of how
>> you restart the cluster, the less updated nodes will need state
>> transfer from the most updated nodes. But currently gcache is not
>> persistent and does not survive even graceful node restart (that's
>> something that can be fixed), so you will end up with SST even if you
>> miss a single writeset.
>
>
> --wsrep-recover is not documented on your wiki AFAICT. Can you
> explain circumstances where it makes sense that this option is not
> always enabled?

It is basically the same circumstances that require you to have a valid
wsrep_cluster_address on startup: the impossibility of turning storage
engines on and off at runtime. You have to start and stop the whole mysqld
for that.

In this particular case, in order to recover GTID you need to
initialize InnoDB storage engine. Then you can connect and request state
transfer. But all SSTs but mysqldump need to be run BEFORE storage
engine initialization. So you have to first run mysqld with
--wsrep-recover, shut it down, and run it again for real with a known
GTID.

> Ultimately, the cluster should be capable of recovering to the best
> of its ability from an all-down scenario. I recognize the technical
> limitations that may be inherent in this, but from a functional point
> of view if the power suddenly goes out on all nodes:
>
> - They should all be in the same cluster config state
> - They may have different writeset (i.e., wsrep_last_committed might
> vary slightly)
> - I'd expect to be able to startup all the nodes and have the cluster
> sort out what is missing where and recover to the best of its ability
> (and tell me where it could not).
> - The worst case seems to me to be commits in Innodb on a single node
> that were only queued on all other nodes and did not persist in any
> node's gcache.
>
> How does this differ from what Galera is capable of today in the same
> scenario?

Well, it differs in that you have to be very careful here and I think
what you're talking about is actually out of Galera scope - namely
choosing the seed node after all-down crash. This is probably not
something you want to do automatically, although I can recognize that
this process can be sped up and made more safe with an external cluster
management script, that would

- given the list of nodes recover GTIDs on them (that would also
recover InnoDB table space there)
- present the operator with a choice in which order the nodes should be
started

I know, "have the cluster sort out" sounds army-straightforward, but it
is not. Since what is "the cluster" is determined by the order and time
in which nodes start. Start it in different order and it will be a
different cluster.
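
A minimal sketch of the kind of external helper described above, assuming the GTIDs have already been recovered on each node (for example with --wsrep-recover) and collected as "node uuid:seqno" lines. The function name rank_nodes and the input format are hypothetical. It only ranks the nodes and leaves the decision to the operator.

```shell
#!/bin/sh
# Read "node uuid:seqno" lines on stdin and print them most-advanced first,
# so the operator can pick the seed node and the start order.
rank_nodes() {
    # Split the GTID into uuid and seqno, sort by seqno (numeric, descending),
    # then restore the original "node uuid:seqno" layout.
    awk '{
            n = split($2, parts, ":")
            print parts[n], parts[1], $1
         }' | sort -k1,1nr | awk '{ printf "%s %s:%s\n", $3, $2, $1 }'
}

# Sample input: the UUID comes from the logs in this thread, the seqnos of
# node1 and node4 are invented for the example.
rank_nodes <<'EOF'
node1 2f8acf7d-1f93-11e2-0800-316fe0e276a3:62610885
node3 2f8acf7d-1f93-11e2-0800-316fe0e276a3:62610891
node4 2f8acf7d-1f93-11e2-0800-316fe0e276a3:62610889
EOF
```

Nodes that report a different UUID, or a seqno of -1, cannot be compared this way and would need a state transfer whatever the start order.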

Regards,
Alex


Jay Janssen

Oct 31, 2012, 6:22:31 AM
to Alex Yurchenko, codersh...@googlegroups.com
On Oct 30, 2012, at 1:26 PM, Alex Yurchenko <alexey.y...@codership.com> wrote:

> However if you try this:
> 1) start the cluster anew.
> 2) wait for all nodes to become synced
> 3) shut all nodes down.
> 4) start the cluster as you did in 1)
> - you won't have ANY state transfer.

#2 implies SST, right?  I suppose this makes sense:  entire cluster goes down, we need to manually choose the new "seed" node, as you say.  Obviously the process to pick the "best" node may not be very deterministic at this point, but at least we must make some choice and resync the rest of the cluster to that.  

However, if I do a clean restart of the whole cluster in steps 3/4, won't the first node I start need a gcomm:// and therefore (as you said) create a new GTID for safety purposes?  If this is the case, wouldn't the other nodes think they need SST (again)?

Alex Yurchenko

Oct 31, 2012, 10:06:29 AM
to Jay Janssen, codersh...@googlegroups.com
On 2012-10-31 12:22, Jay Janssen wrote:
> On Oct 30, 2012, at 1:26 PM, Alex Yurchenko
> <alexey.y...@codership.com> wrote:
>
>> However if you try this:
>> 1) start the cluster anew.
>> 2) wait for all nodes to become synced
>> 3) shut all nodes down.
>> 4) start the cluster as you did in 1)
>> - you won't have ANY state transfer.
>
> #2 implies SST, right?

In general yes, unless you shut it down cleanly before (see 3-4)

> I suppose this makes sense: entire cluster
> goes down, we need to manually choose the new "seed" node, as you
> say.
> Obviously the process to pick the "best" node may not be very
> deterministic at this point, but at least we must make some choice
> and
> resync the rest of the cluster to that.

Exactly

> However, If I do a clean restart of the whole cluster in steps 3/4,
> won't the first node I start need a gcomm:// and therefore (as you
> said) create a new GTID for safety purposes? If this is the case,
> wouldn't the other nodes think they need SST (again)?

On a clean shutdown the node has a chance to record its GTID in the
grastate.dat file, and it will use that one on restart even with
gcomm://. A new GTID is generated only if you fail to supply a valid GTID
on startup (there's even a wsrep_start_position variable to do it
manually). I think the reason you have had little luck with it so far is
that it is very easy to lose that GTID record by making a mistake during
startup.
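
To make the "valid GTID" condition concrete, here is a minimal sketch in Python of how one might check what a node would find in grastate.dat before starting it. It assumes the file uses plain `uuid:` and `seqno:` key/value lines and that an all-zero UUID or a seqno of -1 means "no usable position"; the function name `read_saved_gtid` is hypothetical and not part of Galera.

```python
from typing import Optional, Tuple

# Assumption: an all-zero UUID means the node has no saved cluster identity.
NIL_UUID = "00000000-0000-0000-0000-000000000000"


def read_saved_gtid(grastate_text: str) -> Optional[Tuple[str, int]]:
    """Return (uuid, seqno) if the text holds a usable GTID, else None."""
    fields = {}
    for line in grastate_text.splitlines():
        line = line.strip()
        # Skip blank lines, comments, and anything that is not "key: value".
        if not line or line.startswith("#") or ":" not in line:
            continue
        key, _, value = line.partition(":")
        fields[key.strip()] = value.strip()

    uuid = fields.get("uuid", NIL_UUID)
    try:
        seqno = int(fields.get("seqno", "-1"))
    except ValueError:
        return None

    # A missing UUID or a negative seqno means the position was not recorded.
    if uuid == NIL_UUID or seqno < 0:
        return None
    return uuid, seqno
```

Running this against the file on each node before a full-cluster restart shows which nodes still carry a recorded position and which would be treated as having none.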


Jay Janssen

Nov 1, 2012, 9:50:28 AM
to Alex Yurchenko, codersh...@googlegroups.com

On Oct 31, 2012, at 10:06 AM, Alex Yurchenko <alexey.y...@codership.com> wrote:

> On a clean shutdown the node has a chance to record GTID to grastate.dat file, and it will use that one on restart even with gcomm://. New GTID is generated only if you fail to supply a valid GTID on startup (there's even a wsrep_start_position variable to do it manually). I think the reason you so far had little luck with it is that it is very easy to lose that GTID record by making a mistake during startup.

This is very true: it is quite easy to lose state, even through a benign mistake. For example, if I make a configuration change and restart mysql but misspell a configuration parameter, mysqld will refuse to start and Galera will drop its state.

I've attempted to manually update grastate.dat, but had limited success, and then remembered wsrep_start_position.  Which should I use?  Is there technically any difference?
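
For reference, wsrep_start_position takes the GTID as `uuid:seqno`. A minimal sketch of building that mysqld argument from the two values follows; the helper name `start_position_arg` and the validation rules are my own assumptions, not anything Galera ships.

```python
import re

# Canonical 8-4-4-4-12 hexadecimal UUID layout.
_UUID_RE = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def start_position_arg(uuid: str, seqno: int) -> str:
    """Build a --wsrep_start_position argument from a UUID and a seqno."""
    if not _UUID_RE.match(uuid):
        raise ValueError(f"not a well-formed UUID: {uuid!r}")
    if seqno < 0:
        # Assumption: a negative seqno is not a usable position, so refuse
        # to build an argument from it.
        raise ValueError(f"seqno must be non-negative, got {seqno}")
    return f"--wsrep_start_position={uuid}:{seqno}"
```

Refusing a negative seqno up front avoids handing the server a position that is unlikely to be what was intended.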


Compounding this annoyance, if I restart mysql without realizing the state was lost (say, fixing said typo and quickly restarting), an SST starts immediately. With the xtrabackup method at least, this immediately starts overwriting the data directory, so I am stuck with a full SST when one could have been avoided. Do you have any modifications coming that will help alleviate this pain? If not, can I suggest a way to delay the start of an SST for some time (say 30 seconds) so I have a chance to abort it? I'm not saying this is the best idea, but it's all I can think of currently.
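
To illustrate the kind of delay I mean, here is a rough sketch in Python of a grace period that can be cancelled before a destructive step runs. This is purely hypothetical and is not an existing Galera or SST script feature; `wait_or_abort` and its parameters are invented names, and the abort check could be anything (for example, testing whether an operator has created a sentinel file).

```python
import time
from typing import Callable


def wait_or_abort(grace_seconds: float,
                  abort_requested: Callable[[], bool],
                  poll_interval: float = 1.0,
                  sleep: Callable[[float], None] = time.sleep) -> bool:
    """Wait out a grace period, polling for an abort request.

    Returns True if the full period elapsed with no abort (safe to proceed),
    or False as soon as an abort is seen.
    """
    waited = 0.0
    while waited < grace_seconds:
        if abort_requested():
            return False
        # Never sleep past the end of the grace period.
        step = min(poll_interval, grace_seconds - waited)
        sleep(step)
        waited += step
    # One last check so an abort during the final sleep is still honoured.
    return not abort_requested()
```

A wrapper around the SST step would call something like `wait_or_abort(30, abort_check)` and only touch the data directory if it returns True. The `sleep` parameter is injectable only so the logic can be exercised without actually waiting.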

Ilias Bertsimas

Nov 1, 2012, 10:40:06 AM
to codersh...@googlegroups.com, Alex Yurchenko
Yes, it seems it is really easy to lose state. If one startup fails, the next one will no longer have the GTID position, just the cluster ID, and from there it will do an SST.
I believe this is by design, as Galera considers a full SST the only way to get the node in sync once the IST has failed. Please correct me if I am wrong.

Kind Regards,
Ilias.

Jay Janssen

Nov 1, 2012, 11:10:28 AM
to Ilias Bertsimas, codersh...@googlegroups.com, Alex Yurchenko

On Nov 1, 2012, at 10:40 AM, Ilias Bertsimas <awar...@gmail.com> wrote:

> I believe this is by design as galera considers the only way to get the node in sync is a full SST since the IST failed once. Please correct me if I am wrong.

I believe you are right. This makes sense in many cases, such as a replication error: mysqld dies, loses state, and the mysqld_safe auto-restart triggers an SST. This is effectively a self-healing cluster, which is good.

It's just a little over-zealous at other times in clearing node state (IMHO).

Alex Yurchenko

Nov 1, 2012, 11:51:24 AM
to Jay Janssen, Ilias Bertsimas, codersh...@googlegroups.com
Hi guys,

Here I'd like to refer you to the following bug:
https://bugs.launchpad.net/galera/+bug/1054171 and a discussion therein.
The reason is that the perceived "overzealousness" of Galera may not be
what it seems. Without detailed analysis of the situation (that is -
logs) it is impossible to say whether grastate.dat invalidation was
really called for or not. (well, without logs it is hard to argue what
really happened at all, so we need them logs)

We'd be happy to fix any _documented_ cases of such unnecessary
grastate.dat invalidation.

Thanks,
Alex

Ilias Bertsimas

Nov 1, 2012, 12:07:46 PM
to codersh...@googlegroups.com, Jay Janssen, Ilias Bertsimas
Yes, I see what you mean. In my case it usually fails to do an IST for whatever reason (not necessarily a bug), and it just aborts and clears the GTID by setting it to -1.

Kind Regards,
Ilias.

Alex Yurchenko

Nov 1, 2012, 12:26:43 PM
to codersh...@googlegroups.com
On 2012-11-01 18:07, Ilias Bertsimas wrote:
> Yes I see what you mean. In my case it usually fails to do an IST for
> whatever reason (not necessarily a bug) and it just aborts and clears
> the GTID by setting it to -1.

To be precise, it clears the GTID _before_ engaging in IST. It just does
not restore it to the old value on abort (and this can probably be fixed).
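
Since the old value is not restored on abort, one possible operator-side precaution is to keep a copy of grastate.dat before each start attempt, so the recorded position can still be looked up afterwards. A minimal sketch in Python, assuming grastate.dat lives directly in the data directory and contains a `seqno:` line; the function name `backup_grastate` and the `.bak` suffix are hypothetical.

```python
import shutil
from pathlib import Path


def backup_grastate(datadir: str, suffix: str = ".bak") -> bool:
    """Copy grastate.dat aside if it still holds a recorded seqno.

    Returns True if a backup was written, False otherwise. An existing
    backup is left untouched when the current file has no usable seqno,
    so a failed start cannot overwrite the last good copy.
    """
    src = Path(datadir) / "grastate.dat"
    if not src.is_file():
        return False

    seqno = None
    for line in src.read_text().splitlines():
        if line.startswith("seqno:"):
            seqno = line.split(":", 1)[1].strip()

    # -1 (or a missing line) means the position is already gone.
    if seqno is None or seqno == "-1":
        return False

    shutil.copy2(src, src.with_name(src.name + suffix))
    return True
```

Calling this from whatever script starts mysqld means that even after a failed IST there is still a file showing the last recorded uuid and seqno, which could then be supplied by hand via wsrep_start_position.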


teemu....@codership.com

Nov 6, 2012, 5:49:16 AM
to Vadim Tkachenko, Alex Yurchenko, codersh...@googlegroups.com

Vadim,

We have identified the reason for this hang: the notification to the main
thread was skipped when handling the first view event from the group.

This bug will be tracked in:
https://bugs.launchpad.net/codership-mysql/+bug/1075495

Thanks for reporting this!

- Teemu