Hi all (again!) :)
I've successfully deployed Galera on 3 nodes.
However, after about an hour of very light DB usage (no more than 10
requests from the app), mysqld crashed on all 3 nodes.
In /etc/mysql/conf.d/wsrep.cnf on every node, I've set the initial
cluster URL to the IP of one of the other nodes, so that a node rejoins
automatically when it restarts.
With this approach, once all nodes are down, one of them has to be
manually and temporarily set to gcomm:// to bootstrap the cluster (BTW,
any better suggestions for getting the cluster to restart automatically
after all machines are rebooted?).
The log below was taken without doing that: all nodes were offline
after the crash.
Any help?
Thanks.
Here's the log:
...
110811 10:37:40 [Note] WSREP: GMCast::handle_stable_view: view(view_id(PRIM,78372883-c418-11e0-0800-04f541689ba3,13) memb {
78372883-c418-11e0-0800-04f541689ba3,
7abaf0ea-c418-11e0-0800-ae1fc9c3d7c6,
b5b26d3c-c41d-11e0-0800-4615aedbfb68,
} joined {
} left {
} partitioned {
})
110811 10:37:40 [Note] WSREP: New COMPONENT: primary = yes, my_idx = 2, memb_num = 3
110811 10:37:40 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
110811 10:37:40 [Note] WSREP: declaring 78372883-c418-11e0-0800-04f541689ba3 stable
110811 10:37:40 [Note] WSREP: declaring 7abaf0ea-c418-11e0-0800-ae1fc9c3d7c6 stable
110811 10:37:40 [Note] WSREP: STATE EXCHANGE: sent state msg: 158a2ddb-c41f-11e0-0800-50f42779c2ce
110811 10:37:40 [Note] WSREP: STATE EXCHANGE: got state msg: 158a2ddb-c41f-11e0-0800-50f42779c2ce from 0 (server2)
110811 10:37:40 [Note] WSREP: STATE EXCHANGE: got state msg: 158a2ddb-c41f-11e0-0800-50f42779c2ce from 1 (server3)
110811 10:37:40 [Note] WSREP: STATE EXCHANGE: got state msg: 158a2ddb-c41f-11e0-0800-50f42779c2ce from 2 (server1)
110811 10:37:40 [Warning] WSREP: Quorum: No node with complete state:

    Version      : 0
    Flags        : 1
    Protocols    : 0 / 1 / 0
    State        : NON-PRIMARY
    Prim state   : SYNCED
    Prim UUID    : c7709851-c41d-11e0-0800-b25958df6cd9
    Prim JOINED  : 1
    Prim seqno   : 9
    Global seqno : 17
    State UUID   : 158a2ddb-c41f-11e0-0800-50f42779c2ce
    Group UUID   : d214b6c5-c41a-11e0-0800-9ec4a040a0e1
    Name         : 'server2'
    Incoming addr: '192.168.12.2'

    Version      : 0
    Flags        : 0
    Protocols    : 0 / 1 / 0
    State        : NON-PRIMARY
    Prim state   : SYNCED
    Prim UUID    : b5f96b3e-c41d-11e0-0800-1d6122f3fd5e
    Prim JOINED  : 2
    Prim seqno   : 8
    Global seqno : 17
    State UUID   : 158a2ddb-c41f-11e0-0800-50f42779c2ce
    Group UUID   : d214b6c5-c41a-11e0-0800-9ec4a040a0e1
    Name         : 'server3'
    Incoming addr: '192.168.12.3'

    Version      : 0
    Flags        : 0
    Protocols    : 0 / 1 / 0
    State        : NON-PRIMARY
    Prim state   : SYNCED
    Prim UUID    : c7709851-c41d-11e0-0800-b25958df6cd9
    Prim JOINED  : 1
    Prim seqno   : 9
    Globa
110811 10:37:40 [Note] WSREP: Complete merge of primary c7709851-c41d-11e0-0800-b25958df6cd9 found: 1 of 1.
110811 10:37:40 [ERROR] WSREP: gcs/src/gcs_state_msg.c:state_quorum_remerge():511: Found more than one re-merged primary component candidate.
110811 10:37:40 [ERROR] WSREP: gcs/src/gcs_state_msg.c:gcs_state_msg_get_quorum():537: Failed to establish quorum.
110811 10:37:40 [Note] WSREP: Quorum results:
version = 4294967295,
component = NON-PRIMARY,
conf_id = -1,
members = 1/3 (joined/total),
act_id = -1,
last_appl. = 0,
protocols = -1/-1/-1 (gcs/repl/appl),
group UUID = 00000000-0000-0000-0000-000000000000
110811 10:37:40 [Note] WSREP: Flow-control interval: [14, 28]
110811 10:37:40 [Note] WSREP: Received NON-PRIMARY.
110811 10:37:40 [ERROR] WSREP: Unsupported application protocol version: -1
110811 10:37:40 - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.
key_buffer_size=16777216
read_buffer_size=131072
max_used_connections=10
max_threads=151
threads_connected=9
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 346566 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x147cf70
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0x7fb4c28d9ea8 thread_stack 0x30000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0x88e6a5]
/usr/sbin/mysqld(handle_segfault+0x344)[0x5cbad4]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfc60)[0x7fb4c6bf0c60]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35)[0x7fb4c5fa4d05]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x186)[0x7fb4c5fa8ab6]
/usr/sbin/mysqld[0x746c7b]
/usr/lib/galera/libgalera_smm.so(_ZN6galera13ReplicatorSMM17process_view_infoEPvRK15wsrep_view_infoNS_10Replicator5StateEl+0x202)[0x7fb4c4cdd752]
/usr/lib/galera/libgalera_smm.so(_ZN6galera15GcsActionSource8dispatchEPvPKvm12gcs_act_typell+0x37e)[0x7fb4c4cd53be]
/usr/lib/galera/libgalera_smm.so(_ZN6galera15GcsActionSource7processEPv+0x81)[0x7fb4c4cd5551]
/usr/lib/galera/libgalera_smm.so(_ZN6galera13ReplicatorSMM10async_recvEPv+0x118)[0x7fb4c4cd8558]
/usr/lib/galera/libgalera_smm.so(galera_recv+0x23)[0x7fb4c4ceb953]
/usr/sbin/mysqld(_Z25wsrep_replication_processP3THD+0x10e)[0x5dcd8e]
/usr/sbin/mysqld(start_wsrep_THD+0x28a)[0x5d1e2a]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x6d8c)[0x7fb4c6be7d8c]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fb4c605704d]
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query ((nil)): is an invalid pointer
Connection ID (thread ID): 1
Status: NOT_KILLED
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
110811 10:37:40 [Note] WSREP: wsrep_load(): loading provider library '/usr/lib/galera/libgalera_smm.so'
110811 10:37:40 [Note] WSREP: wsrep_load(): Galera 0.8.1(r57) by Codership Oy <in...@codership.com> loaded succesfully.
110811 10:37:40 [Note] WSREP: Passing config to GCS: gcs.fc_debug = 0; gcs.fc_factor = 0.5; gcs.fc_limit = 16; gcs.fc_master_slave = NO; gcs.max_packet_size = 64500; gcs.max_throttle = 0.25; gcs.recv_q_hard_limit = 9223372036854775807; gcs.recv_q_soft_limit = 0.25; replicator.commit_order = 3
110811 10:37:40 [Note] WSREP: wsrep_sst_grab()
110811 10:37:40 [Note] WSREP: Start replication
110811 10:37:40 [Note] WSREP: Found saved state: d214b6c5-c41a-11e0-0800-9ec4a040a0e1:17
110811 10:37:40 [Note] WSREP: Assign initial position for certification: 17, protocol version: 1
110811 10:37:40 [Note] WSREP: Setting initial position to d214b6c5-c41a-11e0-0800-9ec4a040a0e1:17
110811 10:37:40 [Note] WSREP: protonet asio version 0
110811 10:37:40 [Note] WSREP: backend: asio
110811 10:37:40 [Note] WSREP: GMCast version 0
110811 10:37:40 [Note] WSREP: (1598b0e5-c41f-11e0-0800-171bd22bc8c5, 'tcp://0.0.0.0:4567') listening at tcp://0.0.0.0:4567
110811 10:37:40 [Note] WSREP: (1598b0e5-c41f-11e0-0800-171bd22bc8c5, 'tcp://0.0.0.0:4567') multicast: , ttl: 1
110811 10:37:40 [Note] WSREP: EVS version 0
110811 10:37:40 [Note] WSREP: PC version 0
110811 10:37:40 [Note] WSREP: gcomm: connecting to group 'cyclos', peer '192.168.12.2:'
110811 10:37:41 [Warning] WSREP: no nodes coming from prim view, prim not possible
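
In case it helps with reading the state dump in the log, here is a small Python sketch that groups the three nodes by the Prim UUID each one reported. The values are copied by hand from the log. The third entry is cut off before its Name field, so labelling it "server1" is an assumption based on the order of the "got state msg" lines. This is only a reading aid and not how Galera itself computes the quorum. Whether the two different Prim UUIDs are what triggers the "more than one re-merged primary component candidate" error is my guess.

```python
from collections import defaultdict

# Values copied by hand from the state dump in the log.
# The third entry is truncated in the log; its name ("server1") is an
# assumption inferred from the order of the "got state msg" lines.
STATES = [
    {"name": "server2",
     "prim_uuid": "c7709851-c41d-11e0-0800-b25958df6cd9",
     "prim_seqno": 9},
    {"name": "server3",
     "prim_uuid": "b5f96b3e-c41d-11e0-0800-1d6122f3fd5e",
     "prim_seqno": 8},
    {"name": "server1",
     "prim_uuid": "c7709851-c41d-11e0-0800-b25958df6cd9",
     "prim_seqno": 9},
]


def group_by_prim_uuid(states):
    """Map each reported Prim UUID to the names of the nodes reporting it."""
    groups = defaultdict(list)
    for state in states:
        groups[state["prim_uuid"]].append(state["name"])
    return dict(groups)


groups = group_by_prim_uuid(STATES)
for prim_uuid, names in groups.items():
    print(prim_uuid, "->", ", ".join(names))
```

With the values above, the nodes fall into two groups: server2 and server1 report one Prim UUID, and server3 reports a different one.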