temporarily remove server from cluster

Andy Thompson

May 9, 2012, 12:48:40 PM

to codersh...@googlegroups.com

If I want to temporarily remove a server from my cluster to test various processes against it, can I just set wsrep_on=off, do my testing, and then re-enable it? I can't recall whether that will make the server stop responding entirely. Or do I need to shut it down, set it up in its own cluster for a short time, and then add it back to the live cluster? I just want to make sure that if data is changed by some misfortune, those changes aren't replicated to the cluster.

thanks

Alex Yurchenko

May 9, 2012, 4:40:25 PM

to codersh...@googlegroups.com

Yes, you can set wsrep_on=off globally and nothing will be replicated
TO the cluster. However, the node will still receive and apply all events
FROM the cluster.

If you want the node to disconnect from the cluster entirely, you need to
- either start a separate cluster by setting
  wsrep_cluster_address='gcomm://'
- or unload the wsrep provider completely by setting wsrep_provider='none'
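
For illustration, the three options above could be issued from a shell roughly as follows. This is a minimal sketch and not part of the original reply: the `run_sql` helper, the `DRY_RUN` switch, and the function names are hypothetical, and it assumes the `mysql` client can log in with an account allowed to set global variables.

```shell
# Hypothetical helper: send one statement to the local server,
# or just print it when DRY_RUN is set.
run_sql() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$1"
    else
        mysql -e "$1"
    fi
}

# Option 1: stop replicating local changes TO the cluster.
# The node still receives and applies events FROM the cluster.
stop_outgoing_replication() {
    run_sql "SET GLOBAL wsrep_on=OFF"
}

# Option 2: leave the cluster and start a separate single-node cluster.
start_separate_cluster() {
    run_sql "SET GLOBAL wsrep_cluster_address='gcomm://'"
}

# Option 3: unload the wsrep provider completely.
unload_provider() {
    run_sql "SET GLOBAL wsrep_provider='none'"
}
```

With `DRY_RUN=1` set, each function only prints the statement it would send, which makes it easy to check before touching a live node.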

Regards,
Alex

--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011

Daniel Mauricio Guzmán Burgos

May 10, 2012, 11:47:13 AM

to Alex Yurchenko, codersh...@googlegroups.com

Hi Alex

I've tried both options and this is what happened:

1. Setting wsrep_provider='none':

The node gets disconnected and stays functional. But when I set wsrep_provider back to the path of libgalera_smm.so, the node gets stuck in the initialized state.

Node Log:
120510 14:33:41 [Note] WSREP: Stop replication
120510 14:33:43 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib64/libgalera_smm.so'
120510 14:33:43 [Note] WSREP: wsrep_load(): Galera 2.1dev(r109) by Codership Oy <in...@codership.com> loaded succesfully.
120510 14:33:43 [Note] WSREP: Preallocating 134219048/134219048 bytes in '/vol01/var//galera.cache'...
120510 14:33:43 [Note] WSREP: Passing config to GCS: gcache.dir = /vol01/var/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /vol01/var//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 0.5; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3

And nothing else happened after that.

2. Setting wsrep_cluster_address='gcomm://'

The node gets disconnected and bootstraps a new cluster with itself as the only member. OK

I did some inserts on the "new" cluster and a delete on the "old" cluster (which has 2 nodes as members). The deleted rows also existed on the disconnected node, but since that node was no longer in the old cluster, the rows kept existing on it. OK

Then I restored wsrep_cluster_address to its original value. The node joined the cluster with no problems, but the data never got synced: the rows I deleted on the old cluster (which were present on the disconnected node) were still available on the rejoined node.
Still, the rejoined node could perform selects and new inserts with no problem.

But when I ran the same delete on the rejoined node, the entire cluster failed (because of the classic row replication error HA_ERR_KEY_NOT_FOUND), and the 2 nodes that had never been disconnected from the original cluster asked for SST. In other words, it was as if a new cluster had been bootstrapped, aggravated by the fact that SST on one node failed with "Resource temporarily unavailable". SST method: Xtrabackup.
So I ended up with a single cluster, but with only a single node.

Log of one of the nodes from original cluster:

120510 14:51:19 [Note] WSREP: Flow-control interval: [12, 23]
120510 14:51:19 [ERROR] Slave SQL: Could not execute Delete_rows event on table test.dani; Can't find record in 'dani', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log FIRST, end_log_pos 1085, Error_code: 1032
120510 14:51:19 [Warning] WSREP: RBR event 2 Delete_rows apply warning: 120, 8728
120510 14:51:19 [ERROR] WSREP: Failed to apply trx: source: 1c0e4463-9aaf-11e1-0800-499c4e2eb871 version: 2 local: 0 state: CERTIFYING flags: 1 conn_id: 4 trx_id: 51510 seqnos (l: 8768, g: 8728, s: 8727, d: 8721, ts: 1336661498271159426)
120510 14:51:19 [ERROR] WSREP: Failed to apply app buffer: �իO , seqno: 8728, status: WSREP_FATAL
    at galera/src/replicator_smm.cpp:apply_wscoll():51
    at galera/src/replicator_smm.cpp:apply_trx_ws():122
120510 14:51:19 [ERROR] WSREP: Node consistency compromized, aborting...
120510 14:51:19 [Note] WSREP: Closing send monitor...
120510 14:51:19 [Note] WSREP: Closed send monitor.
120510 14:51:19 [Note] WSREP: gcomm: terminating thread
120510 14:51:19 [Note] WSREP: gcomm: joining thread
120510 14:51:19 [Note] WSREP: gcomm: closing backend
120510 14:51:19 [Note] WSREP: view(view_id(NON_PRIM,1c0e4463-9aaf-11e1-0800-499c4e2eb871,105) memb {
    719b79c6-9954-11e1-0800-f06b656b08da,
} joined {
} left {
} partitioned {
    1c0e4463-9aaf-11e1-0800-499c4e2eb871,
})
120510 14:51:19 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
120510 14:51:19 [Note] WSREP: view((empty))
120510 14:51:19 [Note] WSREP: gcomm: closed
120510 14:51:19 [Note] WSREP: Flow-control interval: [8, 16]
120510 14:51:19 [Note] WSREP: Received NON-PRIMARY.
120510 14:51:19 [Note] WSREP: Shifting SYNCED -> OPEN (TO: 8728)
120510 14:51:19 [Note] WSREP: Received self-leave message.
120510 14:51:19 [Note] WSREP: Flow-control interval: [0, 0]
120510 14:51:19 [Note] WSREP: Received SELF-LEAVE. Closing connection.
120510 14:51:19 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 8728)
120510 14:51:19 [Note] WSREP: RECV thread exiting 0: Success
120510 14:51:19 [Note] WSREP: recv_thread() joined.
120510 14:51:19 [Note] WSREP: Closing slave action queue.
120510 14:51:19 [Note] WSREP: /usr/sbin/mysqld: Terminated.
120510 14:51:19 mysqld_safe Number of processes running now: 0
120510 14:51:19 mysqld_safe mysqld restarted

My question is: when the node rejoins the cluster, shouldn't the cluster notice that the sequence number in the joiner's grastate.dat (and, I suppose, the UUID as well) is different, and request an SST?

Thank you!

Regards
Daniel

2012/5/9 Alex Yurchenko <alexey.y...@codership.com>

--
You received this message because you are subscribed to the Google Groups "codership" group.
To post to this group, send email to codership-team@googlegroups.com.
To unsubscribe from this group, send email to codership-team+unsubscribe@googlegroups.com.
For more options, visit this group at http://groups.google.com/group/codership-team?hl=en.

Alex Yurchenko

May 10, 2012, 2:09:46 PM

to codersh...@googlegroups.com

Hi Daniel,

Alright, in case 1 you didn't tell the new Galera instance where to connect.
It is waiting for wsrep_cluster_address.
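
Put differently, reloading the provider is only half of the step: the node also has to be given a cluster address again. A minimal sketch of that sequence, assuming the `mysql` client can log in with sufficient privileges; `run_sql`, `DRY_RUN`, and `reload_provider_and_join` are hypothetical names used only for this illustration.

```shell
# Hypothetical helper: send one statement to the local server,
# or just print it when DRY_RUN is set.
run_sql() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$1"
    else
        mysql -e "$1"
    fi
}

# Reload the provider, then tell it which cluster to connect to.
# $1: path to the provider library
# $2: address of a node that is already in the target cluster
reload_provider_and_join() {
    run_sql "SET GLOBAL wsrep_provider='$1'"
    run_sql "SET GLOBAL wsrep_cluster_address='gcomm://$2'"
}
```

A call might look like `reload_provider_and_join /usr/lib64/libgalera_smm.so 192.0.2.10`, where the library path is the one from the log above and 192.0.2.10 is only a placeholder address. This covers the connection step only; it does nothing about data that diverged while the node was detached.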

In case 2 the problem is that at no point does the UUID become different.
Sorry, I should have foreseen that. I guess setting wsrep_on=0 globally
would have been the safest option. There is an idea to generate a new
UUID every time wsrep_cluster_address is set to gcomm://, but that would
break GTID continuity between cluster restarts, and that may not be what
we want.
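
The point about the unchanged UUID can be illustrated with a rough sketch. It is a deliberate simplification and not Galera's actual join logic: it only compares the UUID saved on the node with the cluster's UUID, on the assumption that a mismatch is what would force a full state transfer. It also assumes that grastate.dat contains a line of the form `uuid: <value>`; the function names are hypothetical.

```shell
# Print the value of the "uuid:" line from a saved-state file.
# Assumes a line of the form "uuid:    <value>" (format assumed here).
state_uuid() {
    awk '$1 == "uuid:" { print $2; exit }' "$1"
}

# Simplified check: print "yes" if the saved UUID differs from the
# cluster's UUID (histories look unrelated), "no" if they are identical.
# $1: path to grastate.dat, $2: UUID of the cluster being joined
needs_full_sst() {
    local_uuid=$(state_uuid "$1")
    if [ "$local_uuid" != "$2" ]; then
        echo "yes"
    else
        echo "no"
    fi
}
```

In the scenario from this thread the check would print "no": the detached node kept the original UUID, so on rejoin its state looked like part of the same history even though its data had diverged. On a running node the cluster's UUID is typically available as the `wsrep_cluster_state_uuid` status variable.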

Andy Thompson

May 11, 2012, 12:32:17 PM

to codersh...@googlegroups.com

So is the "cleanest" way to put the node back in the cluster to just
shut it down and restart it with a correct gcomm:// address?

-andy

Alex Yurchenko

May 11, 2012, 1:06:18 PM

to codersh...@googlegroups.com

On 2012-05-11 19:32, Andy Thompson wrote:
> So is the "cleanest" way to put the node back in the cluster to just
> shut it down and restart it with a correct gcomm:// address?

The "cleanest" way would also include deleting the grastate.dat file after
shutting the server down.
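
As a rough sketch, the whole sequence might look like the following. Several things here are assumptions rather than facts from this thread: that the server is managed with `service mysql`, that grastate.dat lives in the data directory, and that wsrep_cluster_address in the server configuration already points at the live cluster. The `step` and `clean_rejoin` helpers and the `DRY_RUN` switch are hypothetical.

```shell
# Hypothetical helper: run a command, or just print it when DRY_RUN is set.
step() {
    if [ -n "${DRY_RUN:-}" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Stop the server, drop the saved state, and start it again.
# $1: data directory that holds grastate.dat
# Assumes wsrep_cluster_address in the config already names the live cluster.
clean_rejoin() {
    step service mysql stop
    step rm -f "$1/grastate.dat"
    step service mysql start
}
```

Running `DRY_RUN=1; clean_rejoin /vol01/var` prints the three commands without executing them. Without a saved state the node should come up with no known position and ask the cluster for a full state transfer, which discards whatever was changed locally during testing.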
