Re: [codership-team] State transfer failed: -125(Operation canceled)

304 views
Skip to first unread message

Alex Yurchenko

unread,
Oct 5, 2012, 12:09:05 AM10/5/12
to codersh...@googlegroups.com

On 2012-10-04 14:50, Lee Henson wrote:
> Hi
>
> Our 3-node cluster has just broken down. It looks like one of the
> nodes
> dropped out of the cluster (not sure why) and when it tried to
> connect back
> up again it resulted in the output shown below. The important lines
> look
> (to me) like:
>
>
> mysqld: 121004 3:59:30 [Note] WSREP: You have configured 'rsync'
> state
> snapshot transfer method which cannot be performed on a running
> server.
> Wsrep provider won't be able to fall back to it if other means of
> state
> transfer are unavailable. In that case you will need to restart the
> server.
> .....
> mysqld: 121004 3:59:31 [Warning] WSREP: 0 (pwpdb1n3): State transfer
> to 1
> (pwpdb1n1) failed: -125 (Operation canceled)
> mysqld: 121004 3:59:31 [ERROR] WSREP:
> gcs/src/gcs_group.c:gcs_group_handle_join_msg():691: Will never
> receive
> state. Need to abort.
>
>
> What does code 125 mean?

That we will learn when we see the corresponding log from pwpdb1n3, who
failed to deliver and hence may know something special... ;)

> Full log output:
>
> mysqld: 121004 3:59:19 [Note] WSREP:
> (74650a7c-d6a9-11e1-0800-1c210e737947, 'tcp://0.0.0.0:4567') turning
> message relay requesting on, nonlive peers: tcp://10.183.171.211:4567
> tcp://10.183.171.238:4567
> mysqld: 121004 3:59:20 [Note] WSREP:
> (74650a7c-d6a9-11e1-0800-1c210e737947, 'tcp://0.0.0.0:4567')
> reconnecting
> to c0c08f1c-00a6-11e2-0800-39ac6663a85c (tcp://10.183.171.211:4567),
> attempt 0
> mysqld: 121004 3:59:20 [Note] WSREP:
> (74650a7c-d6a9-11e1-0800-1c210e737947, 'tcp://0.0.0.0:4567')
> reconnecting
> to 34392287-d6a3-11e1-0800-50c8d16e4128 (tcp://10.183.171.238:4567),
> attempt 0
> mysqld: 121004 3:59:20 [Note] WSREP:
> evs::proto(74650a7c-d6a9-11e1-0800-1c210e737947, OPERATIONAL,
> view_id(REG,34392287-d6a3-11e1-0800-50c8d16e4128,22)) suspecting
> node:
> 34392287-d6a3-11e1-0800-50c8d16e4128
> mysqld: 121004 3:59:20 [Note] WSREP:
> evs::proto(74650a7c-d6a9-11e1-0800-1c210e737947, OPERATIONAL,
> view_id(REG,34392287-d6a3-11e1-0800-50c8d16e4128,22)) suspecting
> node:
> c0c08f1c-00a6-11e2-0800-39ac6663a85c
> mysqld: 121004 3:59:21 [Note] WSREP: GMCast::handle_stable_view:
> view(view_id(NON_PRIM,34392287-d6a3-11e1-0800-50c8d16e4128,22) memb {
> mysqld: #01174650a7c-d6a9-11e1-0800-1c210e737947,
> mysqld: } joined {
> mysqld: } left {
> mysqld: } partitioned {
> mysqld: #01134392287-d6a3-11e1-0800-50c8d16e4128,
> mysqld: #011c0c08f1c-00a6-11e2-0800-39ac6663a85c,
> mysqld: })
> mysqld: 121004 3:59:21 [Note] WSREP: GMCast::handle_stable_view:
> view(view_id(NON_PRIM,74650a7c-d6a9-11e1-0800-1c210e737947,23) memb {
> mysqld: #01174650a7c-d6a9-11e1-0800-1c210e737947,
> mysqld: } joined {
> mysqld: } left {
> mysqld: } partitioned {
> mysqld: #01134392287-d6a3-11e1-0800-50c8d16e4128,
> mysqld: #011c0c08f1c-00a6-11e2-0800-39ac6663a85c,
> mysqld: })
> mysqld: 121004 3:59:21 [Note] WSREP: New COMPONENT: primary = no,
> my_idx =
> 0, memb_num = 1
> mysqld: 121004 3:59:21 [Note] WSREP: Flow-control interval: [8, 16]
> mysqld: 121004 3:59:21 [Note] WSREP: Received NON-PRIMARY.
> mysqld: 121004 3:59:21 [Note] WSREP: Shifting SYNCED -> OPEN (TO:
> 82422120)
> mysqld: 121004 3:59:21 [Note] WSREP: New COMPONENT: primary = no,
> my_idx =
> 0, memb_num = 1
> mysqld: 121004 3:59:21 [Note] WSREP: Flow-control interval: [8, 16]
> mysqld: 121004 3:59:21 [Note] WSREP: Received NON-PRIMARY.
> mysqld: 121004 3:59:21 [Note] WSREP: New cluster view: global state:
> 1e42b63f-d6a3-11e1-0800-4a6fc40ca7a2:82422120, view# -1: non-Primary,
> number of nodes: 1, my index: 0, protocol version 1
> mysqld: 121004 3:59:21 [Note] WSREP: wsrep_notify_cmd is not
> defined,
> skipping notification.
> mysqld: 121004 3:59:21 [Note] WSREP: New cluster view: global state:
> 1e42b63f-d6a3-11e1-0800-4a6fc40ca7a2:82422120, view# -1: non-Primary,
> number of nodes: 1, my index: 0, protocol version 1
> mysqld: 121004 3:59:21 [Note] WSREP: wsrep_notify_cmd is not
> defined,
> skipping notification.
> mysqld: 121004 3:59:23 [Note] WSREP:
> (74650a7c-d6a9-11e1-0800-1c210e737947, 'tcp://0.0.0.0:4567')
> reconnecting
> to c0c08f1c-00a6-11e2-0800-39ac6663a85c (tcp://10.183.171.211:4567),
> attempt 0
> mysqld: 121004 3:59:27 [Note] WSREP:
> (74650a7c-d6a9-11e1-0800-1c210e737947, 'tcp://0.0.0.0:4567') turning
> message relay requesting off
> mysqld: 121004 3:59:28 [Note] WSREP: New COMPONENT: primary = yes,
> my_idx
> = 1, memb_num = 3
> mysqld: 121004 3:59:28 [Note] WSREP: STATE EXCHANGE: Waiting for
> state
> UUID.
> mysqld: 121004 3:59:28 [Note] WSREP: GMCast::handle_stable_view:
> view(view_id(PRIM,34392287-d6a3-11e1-0800-50c8d16e4128,24) memb {
> mysqld: #01134392287-d6a3-11e1-0800-50c8d16e4128,
> mysqld: #01174650a7c-d6a9-11e1-0800-1c210e737947,
> mysqld: #011c0c08f1c-00a6-11e2-0800-39ac6663a85c,
> mysqld: } joined {
> mysqld: } left {
> mysqld: } partitioned {
> mysqld: })
> mysqld: 121004 3:59:28 [Note] WSREP: declaring
> 34392287-d6a3-11e1-0800-50c8d16e4128 stable
> mysqld: 121004 3:59:28 [Note] WSREP: declaring
> c0c08f1c-00a6-11e2-0800-39ac6663a85c stable
> mysqld: 121004 3:59:28 [Note] WSREP: STATE EXCHANGE: sent state msg:
> e4ccadbe-0dd7-11e2-0800-263d4b71cb69
> mysqld: 121004 3:59:28 [Note] WSREP: STATE EXCHANGE: got state msg:
> e4ccadbe-0dd7-11e2-0800-263d4b71cb69 from 0 (pwpdb1n3)
> mysqld: 121004 3:59:28 [Note] WSREP: STATE EXCHANGE: got state msg:
> e4ccadbe-0dd7-11e2-0800-263d4b71cb69 from 1 (pwpdb1n1)
> mysqld: 121004 3:59:28 [Note] WSREP: STATE EXCHANGE: got state msg:
> e4ccadbe-0dd7-11e2-0800-263d4b71cb69 from 2 (pwpdb1n2)
> mysqld: 121004 3:59:28 [Note] WSREP: Quorum results:
> mysqld: #011version = 2,
> mysqld: #011component = PRIMARY,
> mysqld: #011conf_id = 21,
> mysqld: #011members = 2/3 (joined/total),
> mysqld: #011act_id = 82422209,
> mysqld: #011last_appl. = 82421930,
> mysqld: #011protocols = 0/3/1 (gcs/repl/appl),
> mysqld: #011group UUID = 1e42b63f-d6a3-11e1-0800-4a6fc40ca7a2
> mysqld: 121004 3:59:28 [Note] WSREP: Flow-control interval: [14, 28]
> mysqld: 121004 3:59:28 [Note] WSREP: Shifting OPEN -> PRIMARY (TO:
> 82422209)
> mysqld: 121004 3:59:28 [Note] WSREP: State transfer required:
> mysqld: #011Group state:
> 1e42b63f-d6a3-11e1-0800-4a6fc40ca7a2:82422209
> mysqld: #011Local state:
> 1e42b63f-d6a3-11e1-0800-4a6fc40ca7a2:82422120
> mysqld: 121004 3:59:28 [Note] WSREP: New cluster view: global state:
> 1e42b63f-d6a3-11e1-0800-4a6fc40ca7a2:82422209, view# 22: Primary,
> number of
> nodes: 3, my index: 1, protocol version 1
> mysqld: 121004 3:59:28 [Warning] WSREP: Gap in state sequence. Need
> state
> transfer.
> mysqld: 121004 3:59:30 [Note] WSREP: SST kill local trx: 82397084
> mysqld: 121004 3:59:30 [Note] WSREP: SST kill local trx: 82393286
> mysqld: 121004 3:59:30 [Note] WSREP: SST kill local trx: 82390801
> mysqld: 121004 3:59:30 [Note] WSREP: SST kill local trx: 82388878
> mysqld: 121004 3:59:30 [Note] WSREP: SST kill local trx: 82385001
> mysqld: 121004 3:59:30 [Note] WSREP: SST kill local trx: 82379491
> mysqld: 121004 3:59:30 [Note] WSREP: SST kill local trx: 82375137
> mysqld: 121004 3:59:30 [Note] WSREP: SST kill local trx: 82373471
> mysqld: 121004 3:59:30 [Note] WSREP: SST kill local trx: 82372237
> mysqld: 121004 3:59:30 [Note] WSREP: You have configured 'rsync'
> state
> snapshot transfer method which cannot be performed on a running
> server.
> Wsrep provider won't be able to fall back to it if other means of
> state
> transfer are unavailable. In that case you will need to restart the
> server.
> mysqld: 121004 3:59:30 [Note] WSREP: wsrep_notify_cmd is not
> defined,
> skipping notification.
> mysqld: 121004 3:59:30 [Note] WSREP: Assign initial position for
> certification: 82422209, protocol version: 2
> mysqld: 121004 3:59:30 [Note] WSREP: Prepared IST receiver,
> listening at:
> tcp://173.45.237.153:4568
> mysqld: 121004 3:59:30 [Note] WSREP: Node 1 (pwpdb1n1) requested
> state
> transfer from '*any*'. Selected 0 (pwpdb1n3)(SYNCED) as donor.
> mysqld: 121004 3:59:30 [Note] WSREP: Shifting PRIMARY -> JOINER (TO:
> 82422245)
> mysqld: 121004 3:59:30 [Note] WSREP: Requesting state transfer:
> success,
> donor: 0
> mysqld: 121004 3:59:30 [Note] WSREP: GCache DEBUG:
> RingBuffer::seqno_reset(): discarded 134196507 bytes
> mysqld: 121004 3:59:30 [Note] WSREP: GCache DEBUG:
> RingBuffer::seqno_reset(): found 0/36 locked buffers
> mysqld: 121004 3:59:30 [Note] WSREP: Receiving IST: 89 writesets,
> seqnos
> 82422120-82422209
> mysqld: 121004 3:59:31 [Warning] WSREP: 0 (pwpdb1n3): State transfer
> to 1
> (pwpdb1n1) failed: -125 (Operation canceled)
> mysqld: 121004 3:59:31 [ERROR] WSREP:
> gcs/src/gcs_group.c:gcs_group_handle_join_msg():691: Will never
> receive
> state. Need to abort.
> mysqld: 121004 3:59:31 [Note] WSREP: gcomm: terminating thread
> mysqld: 121004 3:59:31 [Note] WSREP: gcomm: joining thread
> mysqld: 121004 3:59:31 [Note] WSREP: gcomm: closing backend
> mysqld: 121004 3:59:31 [Note] WSREP:
> evs::proto(74650a7c-d6a9-11e1-0800-1c210e737947, LEAVING,
> view_id(REG,34392287-d6a3-11e1-0800-50c8d16e4128,24)) uuid
> 34392287-d6a3-11e1-0800-50c8d16e4128 missing from install message,
> assuming
> partitioned
> mysqld: 121004 3:59:31 [Note] WSREP:
> evs::proto(74650a7c-d6a9-11e1-0800-1c210e737947, LEAVING,
> view_id(REG,34392287-d6a3-11e1-0800-50c8d16e4128,24)) uuid
> c0c08f1c-00a6-11e2-0800-39ac6663a85c missing from install message,
> assuming
> partitioned
> mysqld: 121004 3:59:31 [Note] WSREP: GMCast::handle_stable_view:
> view(view_id(NON_PRIM,34392287-d6a3-11e1-0800-50c8d16e4128,24) memb {
> mysqld: #01174650a7c-d6a9-11e1-0800-1c210e737947,
> mysqld: } joined {
> mysqld: } left {
> mysqld: } partitioned {
> mysqld: #01134392287-d6a3-11e1-0800-50c8d16e4128,
> mysqld: #011c0c08f1c-00a6-11e2-0800-39ac6663a85c,
> mysqld: })
> mysqld: 121004 3:59:31 [Note] WSREP: GMCast::handle_stable_view:
> view((empty))
> mysqld: 121004 3:59:31 [Note] WSREP: gcomm: closed
> mysqld: 121004 3:59:31 [Note] WSREP: /usr/sbin/mysqld: Terminated.
> mysqld_safe: Number of processes running now: 0
> mysqld_safe: mysqld restarted
> mysqld: 121004 3:59:31 [Note] WSREP: wsrep_load(): loading provider
> library '/usr/lib/galera/libgalera_smm.so'
> mysqld: 121004 3:59:31 [Note] WSREP: wsrep_load(): Galera
> 23.2.0(r120) by
> Codership Oy <in...@codership.com> loaded succesfully.
> mysqld: 121004 3:59:31 [Note] WSREP: Reusing existing
> '/var/lib/mysql//galera.cache'.
> mysqld: 121004 3:59:31 [Note] WSREP: Passing config to GCS:
> gcache.dir =
> /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0;
> gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M;
> gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 0.5;
> gcs.fc_limit =
> 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500;
> gcs.max_throttle
> = 0.25; gcs.recv_q_hard_limit = 9223372036854775807;
> gcs.recv_q_soft_limit
> = 0.25; replicator.causal_read_timeout = PT30S;
> replicator.commit_order = 3
> mysqld: 121004 3:59:31 [Note] WSREP: wsrep_sst_grab()
> mysqld: 121004 3:59:31 [Note] WSREP: Start replication
> mysqld: 121004 3:59:31 [Note] WSREP: Found saved state:
> 00000000-0000-0000-0000-000000000000:-1
> mysqld: 121004 3:59:31 [Note] WSREP: Assign initial position for
> certification: -1, protocol version: -1
> mysqld: 121004 3:59:31 [Note] WSREP: Setting initial position to
> 00000000-0000-0000-0000-000000000000:-1
> mysqld: 121004 3:59:31 [Note] WSREP: protonet asio version 0
> mysqld: 121004 3:59:31 [Note] WSREP: backend: asio
> mysqld: 121004 3:59:31 [Note] WSREP: GMCast version 0
> mysqld: 121004 3:59:31 [Note] WSREP:
> (e6ebeb87-0dd7-11e2-0800-1f5cfae992c7, 'tcp://0.0.0.0:4567')
> listening at
> tcp://0.0.0.0:4567
> mysqld: 121004 3:59:31 [Note] WSREP:
> (e6ebeb87-0dd7-11e2-0800-1f5cfae992c7, 'tcp://0.0.0.0:4567')
> multicast: ,
> ttl: 1
> mysqld: 121004 3:59:31 [Note] WSREP: EVS version 0
> mysqld: 121004 3:59:31 [Note] WSREP: PC version 0
> mysqld: 121004 3:59:31 [Note] WSREP: gcomm: connecting to group
> 'my_wsrep_cluster', peer ''
> mysqld: 121004 3:59:31 [Note] WSREP: GMCast::handle_stable_view:
> view(view_id(PRIM,e6ebeb87-0dd7-11e2-0800-1f5cfae992c7,1) memb {
> mysqld: #011e6ebeb87-0dd7-11e2-0800-1f5cfae992c7,
> mysqld: } joined {
> mysqld: } left {
> mysqld: } partitioned {
> mysqld: })
> mysqld: 121004 3:59:31 [Note] WSREP: gcomm: connected
> mysqld: 121004 3:59:31 [Note] WSREP: Changing maximum packet size to
> 64500, resulting msg size: 32636
> mysqld: 121004 3:59:31 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
> mysqld: 121004 3:59:31 [Note] WSREP: Opened channel
> 'my_wsrep_cluster'
> mysqld: 121004 3:59:31 [Note] WSREP: New COMPONENT: primary = yes,
> my_idx
> = 0, memb_num = 1
> mysqld: 121004 3:59:31 [Note] WSREP: Waiting for SST to complete.
> mysqld: 121004 3:59:31 [Note] WSREP: Starting new group from
> scratch:
> e6eca9d0-0dd7-11e2-0800-970a0d5930d3
> mysqld: 121004 3:59:31 [Note] WSREP: STATE_EXCHANGE: sent state
> UUID:
> e6ecca17-0dd7-11e2-0800-ff093cac64aa
> mysqld: 121004 3:59:31 [Note] WSREP: STATE EXCHANGE: sent state msg:
> e6ecca17-0dd7-11e2-0800-ff093cac64aa
> mysqld: 121004 3:59:31 [Note] WSREP: STATE EXCHANGE: got state msg:
> e6ecca17-0dd7-11e2-0800-ff093cac64aa from 0 (pwpdb1n1)
> mysqld: 121004 3:59:31 [Note] WSREP: Quorum results:
> mysqld: #011version = 2,
> mysqld: #011component = PRIMARY,
> mysqld: #011conf_id = 0,
> mysqld: #011members = 1/1 (joined/total),
> mysqld: #011act_id = 0,
> mysqld: #011last_appl. = -1,
> mysqld: #011protocols = 0/3/1 (gcs/repl/appl),
> mysqld: #011group UUID = e6eca9d0-0dd7-11e2-0800-970a0d5930d3
> mysqld: 121004 3:59:31 [Note] WSREP: Flow-control interval: [8, 16]
> mysqld: 121004 3:59:31 [Note] WSREP: Restored state OPEN -> JOINED
> (0)
> mysqld: 121004 3:59:31 [Note] WSREP: Member 0 (pwpdb1n1) synced with
> group.
> mysqld: 121004 3:59:31 [Note] WSREP: Shifting JOINED -> SYNCED (TO:
> 0)
> mysqld: 121004 3:59:31 [Note] WSREP: New cluster view: global state:
> e6eca9d0-0dd7-11e2-0800-970a0d5930d3:0, view# 1: Primary, number of
> nodes:
> 1, my index: 0, protocol version 1
> mysqld: 121004 3:59:31 [Note] WSREP: SST complete, seqno: 0
> mysqld: 121004 3:59:31 InnoDB: The InnoDB memory heap is disabled
> mysqld: 121004 3:59:31 InnoDB: Mutexes and rw_locks use GCC atomic
> builtins
> mysqld: 121004 3:59:31 InnoDB: Compressed tables use zlib 1.2.3.3
> mysqld: 121004 3:59:31 InnoDB: Using Linux native AIO
> mysqld: 121004 3:59:31 InnoDB: Initializing buffer pool, size =
> 128.0M
> mysqld: 121004 3:59:31 InnoDB: Completed initialization of buffer
> pool
> mysqld: 121004 3:59:31 InnoDB: highest supported file format is
> Barracuda.
> mysqld: InnoDB: The log sequence number in ibdata files does not
> match
> mysqld: InnoDB: the log sequence number in the ib_logfiles!
> mysqld: 121004 3:59:31 InnoDB: Database was not shut down normally!
> mysqld: InnoDB: Starting crash recovery.
> mysqld: InnoDB: Reading tablespace information from the .ibd files...
> ...
>
>
> Cheers
> Lee

--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011

Lee Henson

unread,
Oct 6, 2012, 4:11:17 AM10/6/12
to codersh...@googlegroups.com
Hi. 

Sorry for the delay.

Here's the relevant chunk from pwpdb1n3's syslog:

mysqld: 121004  3:59:31 [Note] WSREP: Node 1 (pwpdb1n1) requested state transfer from '*any*'. Selected 0 (pwpdb1n3)(SYNCED) as donor.
mysqld: 121004  3:59:31 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 82422245)
mysqld: 121004  3:59:31 [Note] WSREP: IST request: 1e42b63f-d6a3-11e1-0800-4a6fc40ca7a2:82422120-82422209|tcp://173.45.237.153:4568
mysqld: 121004  3:59:31 [Note] WSREP: IST first seqno 82422121 not found from cache, falling back to SST
mysqld: 121004  3:59:31 [Warning] WSREP: SST request is null, SST canceled.
mysqld: 121004  3:59:31 [Note] WSREP: Deleted page /var/lib/mysql/gcache.page.77601426
mysqld: 121004  3:59:31 [Warning] WSREP: 0 (pwpdb1n3): State transfer to 1 (pwpdb1n1) failed: -125 (Operation canceled)
mysqld: 121004  3:59:31 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 82422245)
mysqld: 121004  3:59:31 [Note] WSREP: Member 0 (pwpdb1n3) synced with group.
mysqld: 121004  3:59:31 [Note] WSREP: Shifting JOINED -> SYNCED (TO: 82422245)
mysqld: 121004  3:59:31 [Note] WSREP: Synchronized with group, ready for connections
mysqld: 121004  3:59:31 [Note] WSREP: (34392287-d6a3-11e1-0800-50c8d16e4128, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.183.171.180:4567 
mysqld: 121004  3:59:31 [Note] WSREP: GMCast::handle_stable_view: view(view_id(PRIM,34392287-d6a3-11e1-0800-50c8d16e4128,25) memb {
mysqld: #01134392287-d6a3-11e1-0800-50c8d16e4128,
mysqld: #011c0c08f1c-00a6-11e2-0800-39ac6663a85c,
mysqld: } joined {
mysqld: } left {
mysqld: } partitioned {
mysqld: #01174650a7c-d6a9-11e1-0800-1c210e737947,
mysqld: })
mysqld: 121004  3:59:31 [Note] WSREP: forgetting 74650a7c-d6a9-11e1-0800-1c210e737947 (tcp://10.183.171.180:4567)
mysqld: 121004  3:59:31 [Note] WSREP: declaring c0c08f1c-00a6-11e2-0800-39ac6663a85c stable
mysqld: 121004  3:59:31 [Note] WSREP: (34392287-d6a3-11e1-0800-50c8d16e4128, 'tcp://0.0.0.0:4567') turning message relay requesting off
mysqld: 121004  3:59:31 [Note] WSREP: New COMPONENT: primary = yes, my_idx = 0, memb_num = 2
mysqld: 121004  3:59:31 [Note] WSREP: STATE_EXCHANGE: sent state UUID: e68790bb-0dd7-11e2-0800-cf83c5bd16cc
mysqld: 121004  3:59:31 [Note] WSREP: STATE EXCHANGE: sent state msg: e68790bb-0dd7-11e2-0800-cf83c5bd16cc
mysqld: 121004  3:59:31 [Note] WSREP: STATE EXCHANGE: got state msg: e68790bb-0dd7-11e2-0800-cf83c5bd16cc from 0 (pwpdb1n3)
mysqld: 121004  3:59:31 [Note] WSREP: STATE EXCHANGE: got state msg: e68790bb-0dd7-11e2-0800-cf83c5bd16cc from 1 (pwpdb1n2)
mysqld: 121004  3:59:31 [Note] WSREP: Quorum results:
mysqld: #011version    = 2,
mysqld: #011component  = PRIMARY,
mysqld: #011conf_id    = 22,
mysqld: #011members    = 2/2 (joined/total),
mysqld: #011act_id     = 82422245,
mysqld: #011last_appl. = 82421930,
mysqld: #011protocols  = 0/3/1 (gcs/repl/appl),
mysqld: #011group UUID = 1e42b63f-d6a3-11e1-0800-4a6fc40ca7a2
mysqld: 121004  3:59:31 [Note] WSREP: Flow-control interval: [12, 23]
mysqld: 121004  3:59:31 [Note] WSREP: New cluster view: global state: 1e42b63f-d6a3-11e1-0800-4a6fc40ca7a2:82422245, view# 23: Primary, number of nodes: 2, my index: 0, protocol version 1
mysqld: 121004  3:59:31 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
mysqld: 121004  3:59:31 [Note] WSREP: Assign initial position for certification: 82422245, protocol version: 2

I've figured out why pwpdb1n1 then started it's own cluster: we had accidentally left it's gcomm uri empty, rather than pointing at one of the other machines in the cluster. Once it had failed to sync up, the process died and was restarted by monit. On restart, it saw it's gcomm uri was empty and therefore started it's own group again. So that mystery is solved, but I'd still like to understand why it couldn't sync back up the first time it dropped out of the cluster.

Thanks for your help!

Alex Yurchenko

unread,
Oct 6, 2012, 4:42:33 AM10/6/12
to codersh...@googlegroups.com
right.

> but I'd still like to understand why it couldn't sync back up the
> first time it dropped out of the cluster.

> mysqld: 121004 3:59:31 [Note] WSREP: IST first seqno 82422121 not
> found
> from cache, falling back to SST
> mysqld: 121004 3:59:31 [Warning] WSREP: SST request is null, SST
> canceled.

SST request was null because the joining node was configured for rsync
SST - and live node can't accept rsync SST, so it intentionally sent
null SST request, indicating that it does not accept SST.

Now the joining node needed to replay only 89 transactions. The
question is why they were not found in cache and IST was canceled?

> mysqld: 121004 3:59:31 [Note] WSREP: Deleted page
> /var/lib/mysql/gcache.page.77601426

Turns out the donor node was heavily "paging" - that is it had no space
in the static ring buffer and needed to allocate additional pages (77
million???) to hold transactions. With this gcache is under pressure to
free space in ring buffer as soon as transaction is no longer referenced
locally.

77M pages sounds very much like you have gcache.size=0 there.

> Thanks for your help!

Lee Henson

unread,
Oct 8, 2012, 8:17:24 AM10/8/12
to codersh...@googlegroups.com

Turns out the donor node was heavily "paging" - that is it had no space
in the static ring buffer and needed to allocate additional pages (77
million???) to hold transactions. With this gcache is under pressure to
free space in ring buffer as soon as transaction is no longer referenced
locally.

77M pages sounds very much like you have gcache.size=0 there.


I've just checked the mysql/wsrep config on every node in the cluster. There is no explicit assignment of gcache.size anywhere, so I presume that it will be using the default which is 128MB right? I saw on another thread it looked like a reasonable gcache.size would be equal to the size of the databases held on a node. We've got about 4GB of data on these nodes, so would you recommend setting "gcache.size=4G" ?

Alex Yurchenko

unread,
Oct 8, 2012, 9:14:20 AM10/8/12
to codersh...@googlegroups.com
On 2012-10-08 15:17, Lee Henson wrote:
>> Turns out the donor node was heavily "paging" - that is it had no
>> space
>> in the static ring buffer and needed to allocate additional pages
>> (77
>> million???) to hold transactions. With this gcache is under pressure
>> to
>> free space in ring buffer as soon as transaction is no longer
>> referenced
>> locally.
>>
>> 77M pages sounds very much like you have gcache.size=0 there.
>>
>>
> I've just checked the mysql/wsrep config on every node in the
> cluster.
> There is no explicit assignment of gcache.size anywhere, so I presume
> that
> it will be using the default which is 128MB right?

What does SELECT @@wsrep_provider_options\G say?

Could you post more of the preceding log? especially the line about
creation of gcache.page?

Which Galera version are you using?

> I saw on another thread
> it looked like a reasonable gcache.size would be equal to the size of
> the
> databases held on a node. We've got about 4GB of data on these nodes,
> so
> would you recommend setting "gcache.size=4G" ?

Yes, something like that.

Lee Henson

unread,
Oct 12, 2012, 4:34:58 AM10/12/12
to codersh...@googlegroups.com

What does SELECT @@wsrep_provider_options\G say?
 
mysql> SELECT @@wsrep_provider_options\G
*************************** 1. row ***************************
@@wsrep_provider_options: base_port = 4567; evs.debug_log_mask = 0x1; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.info_log_mask = 0; evs.install_timeout = PT15S; evs.join_retrans_period = PT0.3S; evs.keepalive_period = PT1S; evs.max_install_timeouts = 1; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.use_aggregate = true; evs.user_send_window = 2; evs.version = 0; evs.view_forget_timeout = PT5M; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 0.5; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gmcast.listen_addr = tcp://0.0.0.0:4567; gmcast.mcast_addr = ; gmcast.mcast_ttl = 1; gmcast.peer_timeout = PT3S; gmcast.time_wait = PT5S; gmcast.version = 0; ist.recv_addr = 173.45.237.153; pc.checksum = true; pc.ignore_quorum = false; pc.ignore_sb = false; pc.linger = PT2S; pc.npvo = false; pc.version = 0; protonet.backend = asio; protonet.version = 0; replicator.causal_read_timeout = PT30S; replicator.commit_order = 3
1 row in set (0.00 sec)

Could you post more of the preceding log? especially the line about
creation of gcache.page?
 
Whoops, I totally forgot to backup that log and it's been logrotated out. Sorry!

Which Galera version are you using?

dpkg -s galera:

Package: galera
Status: install ok installed
Maintainer: Codership Oy
Architecture: amd64
Version: 23.2.0
Replaces: galera
Provides: wsrep, galera
Depends: libc6 (>= 2.4), libstdc++6 (>= 4.1.1), libgcc1 (>= 4.1.1), libssl0.9.8
Conffiles:
 /etc/default/garb b50555bd995e118d2f944b64cab0ecab
Description: Galera Replication Framework
 Copyright: 2007-2011 by Codership Oy, All Rights Reserved
 Replication framework for transactional applications. Implements wsrep interface.
 
dpkg -s mysql-server-wsrep:

Package: mysql-server-wsrep
Status: install ok installed
Maintainer: Codership Oy
Architecture: amd64
Version: 5.5.20-23.4
Replaces: mysql-server-core (<= 5.5.20), mysql-server-core-5.0 (<= 5.5.20), mysql-server-core-5.1 (<= 5.5.20), mysql-server (<= 5.5.20), mysql-server-5.0 (<= 5.5.20), mysql-server-5.1 (<= 5.5.20), mysql-server-core-5.5 (<= 5.5.20), mysql-server-5.5 (<= 5.5.20)
Provides: mysql-server-core, mysql-server, mysql-server-core-5.5, mysql-server-5.5
Depends: psmisc, debianutils (>= 1.6), libc6 (>= 2.4), libdbi-perl, libdbd-mysql-perl (>= 1.2202), libgcc1 (>= 4.1.1), libncurses5 (>= 5.6), libstdc++6 (>= 4.1.1), libwrap0 (>= 7.6), perl, zlib1g (>= 1.1.4), libaio1, mysql-client
Conffiles:
 /etc/mysql/conf.d/wsrep.cnf 66200619107be141d9796223ce719884
Description: wsrep-enabled MySQL server
 Copyright: MySQL AB, Codership Oy, All Rights Reserved
 MySQL server + wsrep patch (https://launchpad.net/codership-mysql)

Yes, something like that. 

 Ok, we'll tweak that. 

Alex Yurchenko

unread,
Oct 12, 2012, 6:59:19 AM10/12/12
to codersh...@googlegroups.com
Hm, 77 million of 128Mb pages... something is not right there, but I
have no idea yet.
There have been no related changes in gcache since 2.0, but upgrading
to the latest release is a right thing to do - quite a number of bugs
were fixed.

Increase gcache.size and keep an eye on gcache "paging" - it should not
happen during normal operation.

Regards,
Alex

Lee Henson

unread,
Oct 14, 2012, 5:24:45 AM10/14/12
to codersh...@googlegroups.com

There have been no related changes in gcache since 2.0, but upgrading
to the latest release is a right thing to do - quite a number of bugs
were fixed.

Increase gcache.size and keep an eye on gcache "paging" - it should not
happen during normal operation.

Ok, we'll see how that goes. Thanks for your help.

Reply all
Reply to author
Forward
0 new messages