node will not rejoin cluster

Shawn Bright

Apr 24, 2015, 6:05:44 PM

to codersh...@googlegroups.com

I had to stop a node in my four-node cluster.

I cannot for the life of me figure out from the logs how to reconnect it.

According to the logs, it seems it will not enable the query cache because a resize or similar command is in progress, but there isn't one happening.

I did rename a table earlier, but that was a couple of hours ago and this node seemed to be fine.

I really don't mind having to initiate a new state transfer. Is there a way I can force that?
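For what it's worth, one commonly described way to force a full state transfer is to stop mysqld on the node and move grastate.dat out of the data directory, so the node starts with no saved position. The sketch below only illustrates that idea; the function name is hypothetical, and it assumes mysqld is already stopped on this node and the rest of the cluster is healthy.

```python
import os
import tempfile


def set_aside_saved_state(datadir):
    """Rename grastate.dat to grastate.dat.bak so the next start finds no saved state.

    Assumption: mysqld is stopped on this node. Returns the backup path,
    or None if there was no grastate.dat to move.
    """
    src = os.path.join(datadir, "grastate.dat")
    if not os.path.exists(src):
        return None
    dst = src + ".bak"
    os.replace(src, dst)  # overwrites an older backup if one exists
    return dst


# Demo against a throwaway directory, not a real data directory.
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, "grastate.dat"), "w") as fh:
        fh.write("# placeholder, not a real Galera state file\n")
    moved_to = set_aside_saved_state(demo_dir)
    print(os.path.basename(moved_to))
```

On the node in question the directory would be /var/lib/mysql, per the datadir shown in the log. Keeping a backup rather than deleting the file makes it possible to put the saved position back.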

Here is the output of the error log.

Thanks for any suggestions.

150424 16:38:47 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql

150424 16:38:47 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.YXcEpb' --pid-file='/var/lib/mysql/dot129-recover.pid'

nohup: ignoring input

/usr/sbin/mysqld: Query cache is disabled (resize or similar command in progress); repeat this command later

150424 16:39:00 mysqld_safe WSREP: Recovered position 4f93c528-ac04-11e3-ae2c-1eedce9e1c84:4984373331

150424 16:39:01 [Note] WSREP: wsrep_start_position var submitted: '4f93c528-ac04-11e3-ae2c-1eedce9e1c84:4984373331'

/usr/sbin/mysqld: Query cache is disabled (resize or similar command in progress); repeat this command later

150424 16:39:01 [Note] WSREP: Read nil XID from storage engines, skipping position init

150424 16:39:01 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'

150424 16:39:01 [Note] WSREP: wsrep_load(): Galera 25.3.5-wheezy(rXXXX) by Codership Oy <in...@codership.com> loaded successfully.

150424 16:39:01 [Note] WSREP: CRC-32C: using hardware acceleration.

150424 16:39:01 [Note] WSREP: Found saved state: 4f93c528-ac04-11e3-ae2c-1eedce9e1c84:4984373331

150424 16:39:01 [Note] WSREP: Passing config to GCS: base_host = 192.168.1.129; base_port = 4567; cert.log_conflicts = no; debug = no; evs.inactive_check_period = PT0.5S; evs.inactive_timeout = PT15S; evs.join_retrans_period = PT1S; evs.max_install_timeouts = 1; evs.send_window = 4; evs.stats_report_period = PT1M; evs.suspect_timeout = PT5S; evs.user_send_window = 2; evs.view_forget_timeout = PT24H; gcache.dir = /var/lib/mysql/; gcache.keep_pages_size = 0; gcache.mem_size = 0; gcache.name = /var/lib/mysql//galera.cache; gcache.page_size = 128M; gcache.size = 128M; gcs.fc_debug = 0; gcs.fc_factor = 1.0; gcs.fc_limit = 16; gcs.fc_master_slave = no; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; gcs.sync_donor = no; gmcast.segment = 0; gmcast.version = 0; pc.announce_timeout = PT3S; pc.checksum = false; pc.ignore_quorum = false; pc.ignore_sb = false; pc.npvo = false; pc.version = 0; pc.wait_prim = true; pc.wait_prim_timeout = P30S; pc.weight = 1; proton

150424 16:39:01 [Note] WSREP: Service thread queue flushed.

150424 16:39:01 [Note] WSREP: Assign initial position for certification: 4984373331, protocol version: -1

150424 16:39:01 [Note] WSREP: wsrep_sst_grab()

150424 16:39:01 [Note] WSREP: Start replication

150424 16:39:01 [Note] WSREP: Setting initial position to 4f93c528-ac04-11e3-ae2c-1eedce9e1c84:4984373331

150424 16:39:01 [Note] WSREP: protonet asio version 0

150424 16:39:01 [Note] WSREP: Using CRC-32C (optimized) for message checksums.

150424 16:39:01 [Note] WSREP: backend: asio

150424 16:39:01 [Note] WSREP: GMCast version 0

150424 16:39:01 [Note] WSREP: (52251c1e-eaca-11e4-afaa-be5b30d718da, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567

150424 16:39:01 [Note] WSREP: (52251c1e-eaca-11e4-afaa-be5b30d718da, 'tcp://0.0.0.0:4567') multicast: , ttl: 1

150424 16:39:01 [Note] WSREP: EVS version 0

150424 16:39:01 [Note] WSREP: PC version 0a

150424 16:39:01 [Note] WSREP: gcomm: connecting to group 'pivotrac_cluster', peer '192.168.1.113:,192.168.1.119:,192.168.1.126:,192.168.1.129:'

150424 16:39:01 [Warning] WSREP: (52251c1e-eaca-11e4-afaa-be5b30d718da, 'tcp://0.0.0.0:4567') address 'tcp://192.168.1.129:4567' points to own listening address, blacklisting

150424 16:39:01 [Note] WSREP: (52251c1e-eaca-11e4-afaa-be5b30d718da, 'tcp://0.0.0.0:4567') address 'tcp://192.168.1.129:4567' pointing to uuid 52251c1e-eaca-11e4-afaa-be5b30d718da is blacklisted, skipping

150424 16:39:01 [Note] WSREP: declaring 0dfc66c2-acb3-11e4-b2e4-2ec7507bc014 stable

150424 16:39:01 [Note] WSREP: declaring 1a161bd2-c05d-11e4-95d0-2fe34f370ecf stable

150424 16:39:01 [Note] WSREP: declaring 45b5db39-a5d9-11e4-9c7e-86608fa5b128 stable

150424 16:39:01 [Note] WSREP: Node 0dfc66c2-acb3-11e4-b2e4-2ec7507bc014 state prim

150424 16:39:01 [Note] WSREP: view(view_id(PRIM,0dfc66c2-acb3-11e4-b2e4-2ec7507bc014,434) memb {

0dfc66c2-acb3-11e4-b2e4-2ec7507bc014,0

1a161bd2-c05d-11e4-95d0-2fe34f370ecf,0

45b5db39-a5d9-11e4-9c7e-86608fa5b128,0

52251c1e-eaca-11e4-afaa-be5b30d718da,0

} joined {

} left {

} partitioned {

})

150424 16:39:01 [Note] WSREP: gcomm: connected

150424 16:39:01 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636

150424 16:39:01 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)

150424 16:39:01 [Note] WSREP: Opened channel 'pivotrac_cluster'

150424 16:39:01 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 3, memb_num = 4

150424 16:39:01 [Note] WSREP: Waiting for SST to complete.

150424 16:39:01 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.

150424 16:39:01 [Note] WSREP: STATE EXCHANGE: sent state msg: eec43f17-eaca-11e4-8ef0-06276ff16a39

150424 16:39:01 [Note] WSREP: STATE EXCHANGE: got state msg: eec43f17-eaca-11e4-8ef0-06276ff16a39 from 0 (dot113)

150424 16:39:01 [Note] WSREP: STATE EXCHANGE: got state msg: eec43f17-eaca-11e4-8ef0-06276ff16a39 from 1 ()

150424 16:39:01 [Note] WSREP: STATE EXCHANGE: got state msg: eec43f17-eaca-11e4-8ef0-06276ff16a39 from 2 (dot126)

150424 16:39:01 [Note] WSREP: STATE EXCHANGE: got state msg: eec43f17-eaca-11e4-8ef0-06276ff16a39 from 3 ()

150424 16:39:01 [Note] WSREP: Quorum results:

version = 3,

component = PRIMARY,

conf_id = 282,

members = 3/4 (joined/total),

act_id = 4984816135,

last_appl. = -1,

protocols = 0/5/2 (gcs/repl/appl),

group UUID = 4f93c528-ac04-11e3-ae2c-1eedce9e1c84

150424 16:39:01 [Note] WSREP: Flow-control interval: [32, 32]

150424 16:39:01 [Note] WSREP: Shifting OPEN -> PRIMARY (TO: 4984816135)

150424 16:39:01 [Note] WSREP: State transfer required:

Group state: 4f93c528-ac04-11e3-ae2c-1eedce9e1c84:4984816135

Local state: 4f93c528-ac04-11e3-ae2c-1eedce9e1c84:4984373331

150424 16:39:01 [Note] WSREP: New cluster view: global state: 4f93c528-ac04-11e3-ae2c-1eedce9e1c84:4984816135, view# 283: Primary, number of nodes: 4, my index: 3, protocol version 2

150424 16:39:01 [Note] WSREP: closing client connections for protocol change 3 -> 2

150424 16:39:03 [Warning] WSREP: Gap in state sequence. Need state transfer.

150424 16:39:05 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'joiner' --address '192.168.1.129' --auth 'geek:snape99' --datadir '/var/lib/mysql/' --defaults-file '/etc/mysql/my.cnf' --parent '23362''

150424 16:39:05 [Note] WSREP: Prepared SST request: rsync|192.168.1.129:4444/rsync_sst

150424 16:39:05 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

150424 16:39:05 [Note] WSREP: REPL Protocols: 5 (3, 1)

150424 16:39:05 [Note] WSREP: Service thread queue flushed.

150424 16:39:05 [Note] WSREP: Assign initial position for certification: 4984816135, protocol version: 3

150424 16:39:05 [Note] WSREP: Service thread queue flushed.

150424 16:39:05 [Note] WSREP: Prepared IST receiver, listening at: tcp://192.168.1.129:4568

150424 16:39:05 [ERROR] WSREP: Requesting state transfer failed: -113(No route to host)

150424 16:39:05 [ERROR] WSREP: State transfer request failed unrecoverably: 113 (No route to host). Most likely it is due to inability to communicate with the cluster primary component. Restart required.

150424 16:39:05 [Note] WSREP: Closing send monitor...

150424 16:39:05 [Note] WSREP: Closed send monitor.

150424 16:39:05 [Note] WSREP: gcomm: terminating thread

150424 16:39:05 [Note] WSREP: gcomm: joining thread

150424 16:39:05 [Note] WSREP: view(view_id(NON_PRIM,0dfc66c2-acb3-11e4-b2e4-2ec7507bc014,434) memb {

52251c1e-eaca-11e4-afaa-be5b30d718da,0

} joined {

} left {

} partitioned {

0dfc66c2-acb3-11e4-b2e4-2ec7507bc014,0

1a161bd2-c05d-11e4-95d0-2fe34f370ecf,0

45b5db39-a5d9-11e4-9c7e-86608fa5b128,0

})

150424 16:39:05 [Note] WSREP: view((empty))

150424 16:39:05 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1

150424 16:39:05 [Note] WSREP: gcomm: closed

150424 16:39:05 [Note] WSREP: Flow-control interval: [16, 16]

150424 16:39:05 [Note] WSREP: Received NON-PRIMARY.

150424 16:39:05 [Note] WSREP: Shifting PRIMARY -> OPEN (TO: 4984816529)

150424 16:39:05 [Note] WSREP: Received self-leave message.

150424 16:39:05 [Note] WSREP: Flow-control interval: [0, 0]

150424 16:39:05 [Note] WSREP: Received SELF-LEAVE. Closing connection.

150424 16:39:05 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 4984816529)

150424 16:39:05 [Note] WSREP: RECV thread exiting 0: Success

150424 16:39:05 [Note] WSREP: recv_thread() joined.

150424 16:39:05 [Note] WSREP: Closing replication queue.

150424 16:39:05 [Note] WSREP: Closing slave action queue.

150424 16:39:05 [Note] WSREP: /usr/sbin/mysqld: Terminated.

150424 16:39:05 mysqld_safe mysqld from pid file /var/lib/mysql/dot129.pid ended

WSREP_SST: [ERROR] Parent mysqld process (PID:23362) terminated unexpectedly. (20150424 16:39:06.673)

WSREP_SST: [INFO] Joiner cleanup. (20150424 16:39:06.676)

WSREP_SST: [INFO] Joiner cleanup done. (20150424 16:39:07.185)
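To see how far behind the node was, the "State transfer required" lines in the log above can be compared directly. This is a minimal sketch that assumes the log format shown here; the function name is hypothetical.

```python
import re

# Matches lines such as "Group state: <uuid>:<seqno>" and "Local state: <uuid>:<seqno>".
STATE_RE = re.compile(r"(Group|Local) state:\s*([0-9a-f-]{36}):(-?\d+)")


def seqno_gap(log_text):
    """Return (same_cluster, gap) taken from the Group/Local state lines."""
    states = {kind: (uuid, int(seqno))
              for kind, uuid, seqno in STATE_RE.findall(log_text)}
    group_uuid, group_seqno = states["Group"]
    local_uuid, local_seqno = states["Local"]
    return group_uuid == local_uuid, group_seqno - local_seqno


log = """\
Group state: 4f93c528-ac04-11e3-ae2c-1eedce9e1c84:4984816135
Local state: 4f93c528-ac04-11e3-ae2c-1eedce9e1c84:4984373331
"""
same_cluster, gap = seqno_gap(log)
print(same_cluster, gap)  # True 442804
```

The UUIDs match, so the node still belongs to the same cluster history and is 442804 sequence numbers behind the group. Whether a donor can cover that gap incrementally depends on what it still holds in its gcache, which this log does not show.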

Shawn Bright

Apr 25, 2015, 12:03:10 PM

to codersh...@googlegroups.com

More info on this. From the CLI, this was the output.
It mentions a core dump and says that /usr/bin/mysqld_safe aborted at line 182, yet my computer has no file /usr/bin/mysqld_safe (and it still executes).

Here is the output:
sed: -e expression #1, char 40: unknown option to `s'
150425 10:49:22 mysqld_safe Can't log to error log and syslog at the same time. Remove all --log-error configuration options for --syslog to take effect.
150425 10:49:22 mysqld_safe Logging to '/var/log/mysql/error.log'.
150425 10:49:22 mysqld_safe Starting mysqld daemon with databases from /var/lib/mysql
150425 10:49:22 mysqld_safe WSREP: Running position recovery with --log_error='/var/lib/mysql/wsrep_recovery.IjyacC' --pid-file='/var/lib/mysql/dot129-recover.pid'
150425 10:49:36 mysqld_safe WSREP: Recovered position 00000000-0000-0000-0000-000000000000:-1
/usr/bin/mysqld_safe: line 182: 27995 Aborted (core dumped) nohup /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --wsrep-provider=/usr/lib/galera/libgalera_smm.so --log-error=/var/log/mysql/error.log --pid-file=dot129.pid --socket=/var/run/mysqld/mysqld.sock --port=3306 --wsrep_start_position=00000000-0000-0000-0000-000000000000:-1 < /dev/null >> /var/log/mysql/error.log 2>&1
150425 10:49:38 mysqld_safe mysqld from pid file /var/lib/mysql/dot129.pid ended

thanks for any tips

Umarzuki Mochlis

Apr 26, 2015, 9:25:25 AM

to Shawn Bright, codersh...@googlegroups.com

There are these messages:

150424 16:39:05 [ERROR] WSREP: Requesting state transfer failed: -113(No route to host)
150424 16:39:05 [ERROR] WSREP: State transfer request failed unrecoverably: 113 (No route to host). Most likely it is due to inability to communicate with the cluster primary component. Restart required.

Check the IP addresses configured in my.cnf on every node.
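Alongside the config check, it may help to test plain TCP reachability between the nodes. The sketch below parses a peer list like the one in the "gcomm: connecting to group" log line and probes the ports that appear in the logs (4567 for group communication, 4568 for IST, 4444 for rsync SST). The function names are hypothetical, and the default port of 4567 is taken from the base_port shown in the log.

```python
import socket

# Ports as they appear in the posted logs.
GALERA_PORTS = {4567: "group communication", 4568: "IST", 4444: "rsync SST"}


def parse_peers(peer_list, default_port=4567):
    """Turn '192.168.1.113:,192.168.1.119:' into [(host, port), ...]."""
    peers = []
    for entry in peer_list.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, _, port = entry.partition(":")
        peers.append((host, int(port) if port else default_port))
    return peers


def probe(host, port, timeout=3.0):
    """Return None if a TCP connection succeeds, otherwise the error text."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return None
    except OSError as exc:
        return str(exc)


def report(peer_list):
    """Print one line per host and port; run this from the node that fails to join."""
    for host, _ in parse_peers(peer_list):
        for port, purpose in GALERA_PORTS.items():
            err = probe(host, port)
            print(f"{host}:{port} ({purpose}): {err or 'reachable'}")
```

Calling `report("192.168.1.113:,192.168.1.119:,192.168.1.126:,192.168.1.129:")` would walk the four nodes from this thread. Ports 4568 and 4444 normally listen only on a joiner during a transfer, so "Connection refused" there is expected on an idle node; "No route to host" is the result that matches the error in the log and usually points at routing or a firewall rule.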


Shawn Bright

Apr 27, 2015, 9:46:18 AM

to codersh...@googlegroups.com, sh...@skrite.net

You were correct. I am not sure why yet, but if I use 119 as the donor, it fails. I am getting a state transfer right now from 126.

Their configs are the same... gremlins. The state transfer usually takes about 20 minutes, so I will know then whether it all goes well, but so far so good.

Thanks.

Shawn Bright

Apr 27, 2015, 10:59:36 AM

to codersh...@googlegroups.com, sh...@skrite.net

OK, I have more trouble now.

I had to use a different computer to rejoin my node. I have four nodes: dot113, dot119, dot126, and dot129.

129 is the original one I had trouble with. It could not rejoin the cluster using dot119 as the donor; it failed before the state transfer.

When I tried using dot126 as the donor, the state transfer happened (it took about 40 minutes), but when it completed, the donor crashed.

Is it possible that a hardware problem on dot119 is causing all this trouble?

Is there a problem with using 126 as the donor for 129 if 129 was previously used as the donor for 126?

Here is the error log output of dot126.

Please help.

150427 8:38:41 [Note] WSREP: Flow-control interval: [32, 32]

150427 8:38:41 [Note] WSREP: New cluster view: global state: 4f93c528-ac04-11e3-ae2c-1eedce9e1c84:5021643101, view# 295: Primary, num

150427 8:38:41 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

150427 8:38:41 [Note] WSREP: REPL Protocols: 5 (3, 1)

150427 8:38:41 [Note] WSREP: Assign initial position for certification: 5021643101, protocol version: 3

150427 8:38:41 [Note] WSREP: Service thread queue flushed.

150427 8:38:43 [Note] WSREP: Node 0.0 (dot129) requested state transfer from 'dot126'. Selected 3.0 (dot126)(SYNCED) as donor.

150427 8:38:43 [Note] WSREP: Shifting SYNCED -> DONOR/DESYNCED (TO: 5021644627)

150427 8:38:43 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.

150427 8:38:43 [Note] WSREP: Running: 'wsrep_sst_rsync --role 'donor' --address '192.168.1.129:4444/rsync_sst' --auth 'geek:snape99'

150427 8:38:43 [Note] WSREP: sst_donor_thread signaled with 0

150427 8:38:43 [Note] WSREP: Flushing tables for SST...

150427 8:38:43 [Note] WSREP: Provider paused at 4f93c528-ac04-11e3-ae2c-1eedce9e1c84:5021644780

150427 8:38:43 [Note] WSREP: Tables flushed.

150427 8:54:17 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000000 of size 134217728 bytes

150427 9:09:29 [Note] WSREP: Created page /var/lib/mysql/gcache.page.000001 of size 134217728 bytes

150427 9:11:09 [Note] WSREP: Provider resumed.

150427 9:11:09 [Note] WSREP: 3.0 (dot126): State transfer to 0.0 (dot129) complete.

150427 9:11:09 [Note] WSREP: Shifting DONOR/DESYNCED -> JOINED (TO: 5022335111)

150427 9:11:10 [ERROR] mysqld got signal 6 ;

This could be because you hit a bug. It is also possible that this binary

or one of the libraries it was linked against is corrupt, improperly built,

or misconfigured. This error can also be caused by malfunctioning hardware.

To report this bug, see http://kb.askmonty.org/en/reporting-bugs

We will try our best to scrape up some info that will hopefully help

diagnose the problem, but since we have already crashed,

something is definitely wrong and this may fail.

Server version: 5.5.35-MariaDB-1~precise-wsrep

key_buffer_size=16777216

read_buffer_size=131072

max_used_connections=306

max_threads=153

thread_count=12

It is possible that mysqld could use up to

key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 360322 K bytes of memory

Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0x7f0b36412000

Attempting backtrace. You can use the following information to find out

where mysqld died. If you see no messages after this, something went

terribly wrong...

stack_bottom = 0x7f0b373caa20 thread_stack 0x30000

(my_addr_resolve failure: fork)

mysqld(my_print_stacktrace+0x2b) [0x7f0b4739db0b]

mysqld(handle_fatal_signal+0x471) [0x7f0b46fb9e41]

/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f0b45805cb0]

/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f0b44e70425]

/lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7f0b44e73b8b]

/usr/lib/galera/libgalera_smm.so(galera::Certification::purge_for_trx_v3(galera::TrxHandle*)+0x255) [0x7f0b42b28f55]

/usr/lib/galera/libgalera_smm.so(galera::Certification::PurgeAndDiscard::operator()(std::pair<long const, galera::TrxHandle*>&) const+

/usr/lib/galera/libgalera_smm.so(galera::Certification::PurgeAndDiscard std::for_each<std::_Rb_tree_iterator<std::pair<long const, gal

/usr/lib/galera/libgalera_smm.so(galera::Certification::purge_trxs_upto_(long, bool)+0x72) [0x7f0b42b2a722]

/usr/lib/galera/libgalera_smm.so(galera::Certification::append_trx(galera::TrxHandle*)+0x24e) [0x7f0b42b2e6de]

/usr/lib/galera/libgalera_smm.so(galera::ReplicatorSMM::cert(galera::TrxHandle*)+0x8a) [0x7f0b42b5441a]

/usr/lib/galera/libgalera_smm.so(galera::ReplicatorSMM::process_trx(void*, galera::TrxHandle*)+0x2c) [0x7f0b42b54b0c]

/usr/lib/galera/libgalera_smm.so(galera::GcsActionSource::dispatch(void*, gcs_action const&, bool&)+0x3f4) [0x7f0b42b36864]

/usr/lib/galera/libgalera_smm.so(galera::GcsActionSource::process(void*, bool&)+0x5b) [0x7f0b42b374eb]

/usr/lib/galera/libgalera_smm.so(galera::ReplicatorSMM::async_recv(void*)+0x63) [0x7f0b42b56d03]

/usr/lib/galera/libgalera_smm.so(galera_recv+0x23) [0x7f0b42b66eb3]

mysqld(+0x4a25d1) [0x7f0b46f6f5d1]

mysqld(start_wsrep_THD+0x409) [0x7f0b46dd63f9]

/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f0b457fde9a]

/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f0b44f2e3fd]

Trying to get some variables.

Some pointers may be invalid and cause the dump to abort.

Query (0x0): is an invalid pointer

Connection ID (thread ID): 2

Status: NOT_KILLED

Optimizer switch: index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,index_merge_sort_intersect

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains

information that should help you find out what is causing the crash.

Umarzuki Mochlis

Apr 27, 2015, 11:05:13 AM

to Shawn Bright, codersh...@googlegroups.com

I might be wrong, but I'm not sure the rsync wsrep SST method needs auth.
