Galera and unstable network.

Ilias Bertsimas

Oct 30, 2012, 6:07:02 PM
to codersh...@googlegroups.com
Hello,

We have the bad luck of a generally unstable network at the hosting company where our servers are located. Although smokeping does not show any packet loss in the VLAN we are using, Galera does not seem to cope well from time to time.
We get a lot of:

121030 18:14:08 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
121030 18:15:25 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
121030 18:15:58 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
121030 18:46:24 [Note] WSREP: (a9467c96-228a-11e2-0800-ec09a1ce58db, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.0.60:4567 tcp://192.168.0.65:4567 tcp://192.168.0.66:4567 tcp://192.168.0.75:4567
121030 18:46:24 [Note] WSREP: (a9467c96-228a-11e2-0800-ec09a1ce58db, 'tcp://0.0.0.0:4567') reconnecting to 98987112-206b-11e2-0800-fb9cb9f8da35 (tcp://192.168.0.60:4567), attempt 0
121030 18:46:24 [Note] WSREP: (a9467c96-228a-11e2-0800-ec09a1ce58db, 'tcp://0.0.0.0:4567') reconnecting to 43601a40-200e-11e2-0800-54159196bef8 (tcp://192.168.0.65:4567), attempt 0
121030 18:46:24 [Note] WSREP: (a9467c96-228a-11e2-0800-ec09a1ce58db, 'tcp://0.0.0.0:4567') reconnecting to bcc1cad8-1c71-11e2-0800-40d8b972528a (tcp://192.168.0.66:4567), attempt 0
121030 18:46:24 [Note] WSREP: (a9467c96-228a-11e2-0800-ec09a1ce58db, 'tcp://0.0.0.0:4567') reconnecting to 114424cd-1c35-11e2-0800-5391d86ade2c (tcp://192.168.0.75:4567), attempt 0
121030 18:46:24 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
121030 18:46:24 [Note] WSREP: (a9467c96-228a-11e2-0800-ec09a1ce58db, 'tcp://0.0.0.0:4567') turning message relay requesting off
121030 18:46:25 [Note] WSREP: evs::proto(a9467c96-228a-11e2-0800-ec09a1ce58db, OPERATIONAL, view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,270)) suspecting node: 114424cd-1c35-11e2-0800-5391d86ade2c
121030 18:46:25 [Note] WSREP: evs::proto(a9467c96-228a-11e2-0800-ec09a1ce58db, OPERATIONAL, view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,270)) suspecting node: 43601a40-200e-11e2-0800-54159196bef8
121030 18:46:25 [Note] WSREP: evs::proto(a9467c96-228a-11e2-0800-ec09a1ce58db, OPERATIONAL, view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,270)) suspecting node: 98987112-206b-11e2-0800-fb9cb9f8da35
121030 18:46:25 [Note] WSREP: evs::proto(a9467c96-228a-11e2-0800-ec09a1ce58db, OPERATIONAL, view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,270)) suspecting node: bcc1cad8-1c71-11e2-0800-40d8b972528a

The cluster renegotiates and a lot of slowdowns happen. I was wondering whether there is any tuning I can do to the timeout settings that might make it a bit more stable and, if so, what the downsides of that tuning are.
Also, I noticed Galera supports UDP but uses TCP by default. Is there any chance that changing it to UDP might help in my case?

Kind Regards,
Ilias.


Haris Zukanovic

Oct 31, 2012, 3:47:27 AM
to codersh...@googlegroups.com
http://www.codership.com/wiki/doku.php?id=configuration_tips
Take a look at the WAN replication settings, experiment, and find what suits your level of network instability.
I am running a 3-node cluster over a WAN, between different datacenters, and after applying these settings it works like a charm.
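
A minimal sketch of how such settings might be assembled into a `wsrep_provider_options` string. The `evs.*` names are real Galera provider options, but the values below are illustrative assumptions loosely modelled on commonly cited WAN settings, not values taken from this thread; check the linked wiki page for the recommended ones. The helper name `build_provider_options` is hypothetical.

```python
# Illustrative only: these timeout values are assumptions, not from this thread.
WAN_EVS_OPTIONS = {
    "evs.keepalive_period": "PT3S",
    "evs.inactive_check_period": "PT10S",
    "evs.suspect_timeout": "PT30S",
    "evs.inactive_timeout": "PT1M",
    "evs.install_timeout": "PT1M",
}


def build_provider_options(options):
    """Join pairs into the 'key = value; ...' form used by wsrep_provider_options."""
    return "; ".join(f"{key} = {value}" for key, value in options.items())


if __name__ == "__main__":
    # Prints a line that could be pasted into my.cnf after review.
    print(f'wsrep_provider_options = "{build_provider_options(WAN_EVS_OPTIONS)}"')
```

The main trade-off of longer timeouts is that the cluster also takes longer to notice and evict a node that has really failed.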
--
Haris Zukanovic

Alex Yurchenko

Oct 31, 2012, 9:56:39 AM
to codersh...@googlegroups.com
Hi Ilias,

It may well be that the problem is not in the network:

On 2012-10-31 00:07, Ilias Bertsimas wrote:
> Hello,
>
> We have the bad luck of a generally unstable network in the hosting
> company
> we have our servers. Although smokeping does not show any packet loss
> in
> the vlan we are using, it seems galera does not play well from time
> to time.
> We get a lot of:
>
> 121030 18:14:08 [Warning] WSREP: last inactive check more than PT1.5S
> ago,
> skipping check
> 121030 18:15:25 [Warning] WSREP: last inactive check more than PT1.5S
> ago,
> skipping check
> 121030 18:15:58 [Warning] WSREP: last inactive check more than PT1.5S
> ago,
> skipping check

The above is a bad sign, which (barring possible bugs) means that the
system either blocks on IO (swapping, reading/writing huge amounts of
data) or is otherwise severely overloaded. Literally, it means that one
of Galera's key threads could not get CPU time for more than 1.5 seconds
(there is a small bug that makes it print this once on configuration
change, but this one is for real). Now, if Galera has trouble checking
for keepalives from peers, it must also have trouble sending keepalives
to them. Hence the appearance of an unstable network.

Check if your servers swap or otherwise do some very heavy IO. It does
not have to be mysqld, some other process can be the culprit.
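
One way to check for swapping on Linux, sketched under the assumption that
`/proc/vmstat` is available: sample the cumulative `pswpin`/`pswpout` page
counters twice and compare. The function names and the 5-second interval are
made up for illustration.

```python
import time


def read_swap_counters(vmstat_text):
    """Return (pswpin, pswpout) parsed from the text of /proc/vmstat."""
    counters = {}
    for line in vmstat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in ("pswpin", "pswpout"):
            counters[parts[0]] = int(parts[1])
    return counters.get("pswpin", 0), counters.get("pswpout", 0)


def swap_activity(before, after):
    """Pages swapped in and out between two samples."""
    return after[0] - before[0], after[1] - before[1]


def sample_swap_activity(interval=5.0, path="/proc/vmstat"):
    """Read the counters twice, `interval` seconds apart (Linux only)."""
    with open(path) as f:
        first = read_swap_counters(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = read_swap_counters(f.read())
    return swap_activity(first, second)
```

Calling `sample_swap_activity()` around the time the warning appears and
getting non-zero values would suggest the host was swapping.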
--
Alexey Yurchenko,
Codership Oy, www.codership.com
Skype: alexey.yurchenko, Phone: +358-400-516-011

Ilias Bertsimas

Oct 31, 2012, 10:11:32 AM
to codersh...@googlegroups.com
Hi Alexey,

I checked cacti graphs on the host and there is no excessive cpu or i/o load on the host during those events. No swapping at all and a lot of free RAM.
That node was a dormant slave at the time and it had no load other than keeping in sync with the rest of the cluster.
Also, it seems we are getting a lot of TCP retransmissions: "480 segments retransmited".
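
A cumulative retransmission count is hard to judge on its own; relating it to the number of segments sent gives a rate. A rough sketch, assuming a Linux host where `/proc/net/snmp` exposes the `OutSegs` and `RetransSegs` TCP counters (the function name is hypothetical):

```python
def tcp_retransmit_ratio(snmp_text):
    """Return (RetransSegs, OutSegs, ratio) from the text of /proc/net/snmp."""
    # The file holds a "Tcp:" header line followed by a "Tcp:" values line.
    tcp_lines = [line.split() for line in snmp_text.splitlines()
                 if line.startswith("Tcp:")]
    if len(tcp_lines) < 2:
        raise ValueError("no Tcp header/value pair found")
    header, values = tcp_lines[0][1:], tcp_lines[1][1:]
    stats = dict(zip(header, (int(v) for v in values)))
    retrans, out = stats["RetransSegs"], stats["OutSegs"]
    return retrans, out, (retrans / out if out else 0.0)
```

Both counters are totals since boot, so sampling them before and after an incident is more telling than a single reading.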

It happened again during the night and more than once:

121031  4:58:49 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
121031  5:18:32 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
121031  5:18:32 [Note] WSREP: (6d38875b-22e7-11e2-0800-407588169a7e, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.0.65:4567 tcp://192.168.0.66:4567
121031  5:18:32 [Note] WSREP: (6d38875b-22e7-11e2-0800-407588169a7e, 'tcp://0.0.0.0:4567') turning message relay requesting off
121031  6:07:56 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
121031  6:12:21 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
121031  6:12:22 [Note] WSREP: (6d38875b-22e7-11e2-0800-407588169a7e, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.0.60:4567 tcp://192.168.0.65:4567 tcp://192.168.0.66:4567 tcp://192.168.0.75:4567
121031  6:12:22 [Note] WSREP: (6d38875b-22e7-11e2-0800-407588169a7e, 'tcp://0.0.0.0:4567') reconnecting to 98987112-206b-11e2-0800-fb9cb9f8da35 (tcp://192.168.0.60:4567), attempt 0
121031  6:12:22 [Note] WSREP: (6d38875b-22e7-11e2-0800-407588169a7e, 'tcp://0.0.0.0:4567') reconnecting to 43601a40-200e-11e2-0800-54159196bef8 (tcp://192.168.0.65:4567), attempt 0
121031  6:12:22 [Note] WSREP: (6d38875b-22e7-11e2-0800-407588169a7e, 'tcp://0.0.0.0:4567') reconnecting to bcc1cad8-1c71-11e2-0800-40d8b972528a (tcp://192.168.0.66:4567), attempt 0
121031  6:12:22 [Note] WSREP: (6d38875b-22e7-11e2-0800-407588169a7e, 'tcp://0.0.0.0:4567') reconnecting to 114424cd-1c35-11e2-0800-5391d86ade2c (tcp://192.168.0.75:4567), attempt 0
121031  6:12:24 [Note] WSREP: (6d38875b-22e7-11e2-0800-407588169a7e, 'tcp://0.0.0.0:4567') reconnecting to 98987112-206b-11e2-0800-fb9cb9f8da35 (tcp://192.168.0.60:4567), attempt 0
121031  6:12:24 [Note] WSREP: (6d38875b-22e7-11e2-0800-407588169a7e, 'tcp://0.0.0.0:4567') reconnecting to 43601a40-200e-11e2-0800-54159196bef8 (tcp://192.168.0.65:4567), attempt 0
121031  6:12:24 [Note] WSREP: (6d38875b-22e7-11e2-0800-407588169a7e, 'tcp://0.0.0.0:4567') reconnecting to bcc1cad8-1c71-11e2-0800-40d8b972528a (tcp://192.168.0.66:4567), attempt 0
121031  6:12:24 [Note] WSREP: (6d38875b-22e7-11e2-0800-407588169a7e, 'tcp://0.0.0.0:4567') reconnecting to 114424cd-1c35-11e2-0800-5391d86ade2c (tcp://192.168.0.75:4567), attempt 0
121031  6:12:25 [Note] WSREP: (6d38875b-22e7-11e2-0800-407588169a7e, 'tcp://0.0.0.0:4567') reconnecting to 98987112-206b-11e2-0800-fb9cb9f8da35 (tcp://192.168.0.60:4567), attempt 0
121031  6:12:25 [Note] WSREP: (6d38875b-22e7-11e2-0800-407588169a7e, 'tcp://0.0.0.0:4567') reconnecting to 43601a40-200e-11e2-0800-54159196bef8 (tcp://192.168.0.65:4567), attempt 0
121031  6:12:25 [Note] WSREP: (6d38875b-22e7-11e2-0800-407588169a7e, 'tcp://0.0.0.0:4567') reconnecting to bcc1cad8-1c71-11e2-0800-40d8b972528a (tcp://192.168.0.66:4567), attempt 0
121031  6:12:25 [Note] WSREP: (6d38875b-22e7-11e2-0800-407588169a7e, 'tcp://0.0.0.0:4567') reconnecting to 114424cd-1c35-11e2-0800-5391d86ade2c (tcp://192.168.0.75:4567), attempt 0
121031  6:12:25 [Note] WSREP: (6d38875b-22e7-11e2-0800-407588169a7e, 'tcp://0.0.0.0:4567') turning message relay requesting off
121031  6:12:26 [Note] WSREP: evs::proto(6d38875b-22e7-11e2-0800-407588169a7e, OPERATIONAL, view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,277)) suspecting node: 114424cd-1c35-11e2-0800-5391d86ade2c
121031  6:12:26 [Note] WSREP: evs::proto(6d38875b-22e7-11e2-0800-407588169a7e, OPERATIONAL, view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,277)) suspecting node: 43601a40-200e-11e2-0800-54159196bef8
121031  6:12:26 [Note] WSREP: evs::proto(6d38875b-22e7-11e2-0800-407588169a7e, OPERATIONAL, view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,277)) suspecting node: bcc1cad8-1c71-11e2-0800-40d8b972528a
121031  6:12:26 [Note] WSREP: evs::proto(6d38875b-22e7-11e2-0800-407588169a7e, GATHER, view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,277)) suspecting node: 114424cd-1c35-11e2-0800-5391d86ade2c
121031  6:12:26 [Note] WSREP: evs::proto(6d38875b-22e7-11e2-0800-407588169a7e, GATHER, view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,277)) suspecting node: 43601a40-200e-11e2-0800-54159196bef8
121031  6:12:26 [Note] WSREP: evs::proto(6d38875b-22e7-11e2-0800-407588169a7e, GATHER, view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,277)) suspecting node: 98987112-206b-11e2-0800-fb9cb9f8da35
121031  6:12:26 [Note] WSREP: evs::proto(6d38875b-22e7-11e2-0800-407588169a7e, GATHER, view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,277)) suspecting node: bcc1cad8-1c71-11e2-0800-40d8b972528a


Kind Regards,
Ilias.

Ilias Bertsimas

Nov 11, 2012, 10:47:29 AM
to codersh...@googlegroups.com
Hi,

I keep getting network issues on my Galera cluster. Again, there is no excessive load or anything out of the ordinary on the system.

Here is one that happened just now:

121111 16:41:49 [Warning] WSREP: evs::proto(98987112-206b-11e2-0800-fb9cb9f8da35, GATHER, view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,363))lu (3094639) <= safe_seq(3094639), can't recover message
121111 16:41:49 [Warning] WSREP: evs::proto(98987112-206b-11e2-0800-fb9cb9f8da35, GATHER, view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,363))lu (3094639) <= safe_seq(3094639), can't recover message
121111 16:41:49 [Warning] WSREP: evs::proto(98987112-206b-11e2-0800-fb9cb9f8da35, GATHER, view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,363))lu (3094639) <= safe_seq(3094639), can't recover message
121111 16:41:49 [Warning] WSREP: evs::proto(98987112-206b-11e2-0800-fb9cb9f8da35, GATHER, view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,363))lu (3094639) <= safe_seq(3094639), can't recover message
121111 16:41:49 [Warning] WSREP: evs::proto(98987112-206b-11e2-0800-fb9cb9f8da35, GATHER, view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,363))lu (3094639) <= safe_seq(3094639), can't recover message
121111 16:41:49 [Warning] WSREP: evs::proto(98987112-206b-11e2-0800-fb9cb9f8da35, GATHER, view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,363))lu (3094639) <= safe_seq(3094639), can't recover message
121111 16:41:49 [Warning] WSREP: evs::proto(98987112-206b-11e2-0800-fb9cb9f8da35, GATHER, view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,363))lu (3094639) <= safe_seq(3094639), can't recover message
121111 16:41:49 [Warning] WSREP: evs::proto(98987112-206b-11e2-0800-fb9cb9f8da35, GATHER, view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,363))lu (3094639) <= safe_seq(3094639), can't recover message
121111 16:41:49 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check
121111 16:41:50 [Note] WSREP: (98987112-206b-11e2-0800-fb9cb9f8da35, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.0.64:4567
121111 16:41:50 [Note] WSREP: (98987112-206b-11e2-0800-fb9cb9f8da35, 'tcp://0.0.0.0:4567') turning message relay requesting off
121111 16:41:50 [Warning] WSREP: subsequent views have same members, prev view view(view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,363) memb {
        114424cd-1c35-11e2-0800-5391d86ade2c,
        1f7318e2-28b9-11e2-0800-ae585dffbf7a,
        45737e50-2418-11e2-0800-1b1995a09f69,
        98987112-206b-11e2-0800-fb9cb9f8da35,
} joined {
} left {
} partitioned {
}) current view view(view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,364) memb {
        114424cd-1c35-11e2-0800-5391d86ade2c,
        1f7318e2-28b9-11e2-0800-ae585dffbf7a,
        45737e50-2418-11e2-0800-1b1995a09f69,
        98987112-206b-11e2-0800-fb9cb9f8da35,
} joined {
} left {
} partitioned {
})
121111 16:41:50 [Note] WSREP: declaring 114424cd-1c35-11e2-0800-5391d86ade2c stable
121111 16:41:50 [Note] WSREP: declaring 1f7318e2-28b9-11e2-0800-ae585dffbf7a stable
121111 16:41:50 [Note] WSREP: declaring 45737e50-2418-11e2-0800-1b1995a09f69 stable
121111 16:41:50 [Note] WSREP: view(view_id(PRIM,114424cd-1c35-11e2-0800-5391d86ade2c,364) memb {
        114424cd-1c35-11e2-0800-5391d86ade2c,
        1f7318e2-28b9-11e2-0800-ae585dffbf7a,
        45737e50-2418-11e2-0800-1b1995a09f69,
        98987112-206b-11e2-0800-fb9cb9f8da35,
} joined {
} left {
} partitioned {
})
121111 16:41:50 [Note] WSREP: New COMPONENT: primary = yes, bootstrap = no, my_idx = 3, memb_num = 4
121111 16:41:50 [Note] WSREP: STATE EXCHANGE: Waiting for state UUID.
121111 16:41:50 [Note] WSREP: STATE EXCHANGE: sent state msg: 4f30fb4d-2c16-11e2-0800-46ff82986637
121111 16:41:50 [Note] WSREP: STATE EXCHANGE: got state msg: 4f30fb4d-2c16-11e2-0800-46ff82986637 from 1 (node2)
121111 16:41:50 [Note] WSREP: STATE EXCHANGE: got state msg: 4f30fb4d-2c16-11e2-0800-46ff82986637 from 2 (node3)
121111 16:41:50 [Note] WSREP: STATE EXCHANGE: got state msg: 4f30fb4d-2c16-11e2-0800-46ff82986637 from 3 (node1)
121111 16:41:50 [Note] WSREP: STATE EXCHANGE: got state msg: 4f30fb4d-2c16-11e2-0800-46ff82986637 from 0 (garb)
121111 16:41:50 [Note] WSREP: Quorum results:
        version    = 2,
        component  = PRIMARY,
        conf_id    = 275,
        members    = 4/4 (joined/total),
        act_id     = 619022340,
        last_appl. = 619022208,
        protocols  = 0/4/2 (gcs/repl/appl),
        group UUID = 15422535-0dc2-11e2-0800-1f79137ae519
121111 16:41:50 [Note] WSREP: Flow-control interval: [253, 256]
121111 16:41:50 [Note] WSREP: New cluster view: global state: 15422535-0dc2-11e2-0800-1f79137ae519:619022340, view# 276: Primary, number of nodes: 4, my index: 3, protocol version 2
121111 16:41:50 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
121111 16:41:50 [Note] WSREP: Assign initial position for certification: 619022340, protocol version: 2
121111 16:42:05 [Warning] WSREP: last inactive check more than PT1.5S ago, skipping check

Is there any way to debug the issue further?
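
One possible starting point is to count the relevant WSREP events per hour on each node and compare the timelines across nodes and against system metrics. The sketch below assumes the log line layout seen in the excerpts in this thread; the event labels and function names are made up.

```python
import re
from collections import Counter
from datetime import datetime

# Matches lines such as:
# 121031  4:58:49 [Warning] WSREP: last inactive check more than PT1.5S ago, ...
LINE_RE = re.compile(
    r"^(\d{2})(\d{2})(\d{2})\s+(\d{1,2}):(\d{2}):(\d{2}) \[(\w+)\] WSREP: (.*)$"
)

# Substrings copied from the log excerpts; the labels are arbitrary.
EVENTS = {
    "inactive_check_skipped": "last inactive check more than",
    "relay_on": "turning message relay requesting on",
    "suspecting_node": "suspecting node:",
    "install_timer_expired": "install timer expired",
}


def parse_line(line):
    """Return (timestamp, level, message) for a WSREP log line, else None."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    yy, mo, dd, hh, mi, ss = (int(g) for g in m.groups()[:6])
    return datetime(2000 + yy, mo, dd, hh, mi, ss), m.group(7), m.group(8)


def count_events_per_hour(lines):
    """Count known events keyed by (hour bucket, event label)."""
    counts = Counter()
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # continuation lines of multi-line dumps are ignored
        ts, _level, message = parsed
        for label, needle in EVENTS.items():
            if needle in message:
                counts[(ts.strftime("%Y-%m-%d %H:00"), label)] += 1
    return counts
```

Feeding it the lines of each node's MySQL error log, for example `count_events_per_hour(open(path))`, gives a per-node timeline that can be lined up with the cacti graphs.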


Kind Regards,
Ilias.

Ilias Bertsimas

Nov 11, 2012, 3:34:43 PM
to codersh...@googlegroups.com
New weird disconnect/split:


121111 21:16:34 [Note] WSREP: (1f7318e2-28b9-11e2-0800-ae585dffbf7a, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.0.60:4567
121111 21:16:35 [Note] WSREP: (1f7318e2-28b9-11e2-0800-ae585dffbf7a, 'tcp://0.0.0.0:4567') reconnecting to 98987112-206b-11e2-0800-fb9cb9f8da35 (tcp://192.168.0.60:4567), attempt 0
121111 21:16:35 [Note] WSREP: (1f7318e2-28b9-11e2-0800-ae585dffbf7a, 'tcp://0.0.0.0:4567') turning message relay requesting off
121111 21:17:13 [Note] WSREP: (1f7318e2-28b9-11e2-0800-ae585dffbf7a, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.0.60:4567
121111 21:17:14 [Note] WSREP: (1f7318e2-28b9-11e2-0800-ae585dffbf7a, 'tcp://0.0.0.0:4567') reconnecting to 98987112-206b-11e2-0800-fb9cb9f8da35 (tcp://192.168.0.60:4567), attempt 0
121111 21:17:19 [Note] WSREP: (1f7318e2-28b9-11e2-0800-ae585dffbf7a, 'tcp://0.0.0.0:4567') cleaning up duplicate 0x7fa025c269c0 after established 0x7f9f7d3edb60
121111 21:17:19 [Note] WSREP: (1f7318e2-28b9-11e2-0800-ae585dffbf7a, 'tcp://0.0.0.0:4567') turning message relay requesting off
121111 21:17:28 [Note] WSREP: (1f7318e2-28b9-11e2-0800-ae585dffbf7a, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://192.168.0.60:4567
121111 21:17:29 [Note] WSREP: (1f7318e2-28b9-11e2-0800-ae585dffbf7a, 'tcp://0.0.0.0:4567') reconnecting to 98987112-206b-11e2-0800-fb9cb9f8da35 (tcp://192.168.0.60:4567), attempt 0
121111 21:17:35 [Note] WSREP: (1f7318e2-28b9-11e2-0800-ae585dffbf7a, 'tcp://0.0.0.0:4567') cleaning up established 0x7fa00b20aed0 which is duplicate of 0x7fa009534590
121111 21:17:35 [Note] WSREP: (1f7318e2-28b9-11e2-0800-ae585dffbf7a, 'tcp://0.0.0.0:4567') turning message relay requesting off
121111 21:18:15 [Warning] WSREP: evs::proto(1f7318e2-28b9-11e2-0800-ae585dffbf7a, GATHER, view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366)) install timer expired
evs::proto(evs::proto(1f7318e2-28b9-11e2-0800-ae585dffbf7a, GATHER, view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366)), GATHER) {
current_view=view(view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366) memb {
        114424cd-1c35-11e2-0800-5391d86ade2c,
        1f7318e2-28b9-11e2-0800-ae585dffbf7a,
        45737e50-2418-11e2-0800-1b1995a09f69,
        98987112-206b-11e2-0800-fb9cb9f8da35,
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=457088,safe_seq=457077,node_index=node: {idx=0,range=[457089,457088],safe_seq=457088} node: {idx=1,range=[457089,457088],safe_seq=457088} node: {idx=2,range=[457089,457088],safe_seq=457088} node: {idx=3
,range=[457089,457088],safe_seq=457077} ,msg_index=     (3,457078),evs::msg{version=0,type=1,user_type=1,order=4,seq=457078,seq_range=0,aru_seq=457077,flags=6,source=98987112-206b-11e2-0800-fb9cb9f8da35,source_view_id=view_id(REG,114424c
d-1c35-11e2-0800-5391d86ade2c,366),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=473833149,node_list=()
}
        (3,457079),evs::msg{version=0,type=1,user_type=1,order=4,seq=457079,seq_range=0,aru_seq=457077,flags=6,source=98987112-206b-11e2-0800-fb9cb9f8da35,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=00
000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=473833150,node_list=()
}
        (3,457080),evs::msg{version=0,type=1,user_type=255,order=0,seq=457080,seq_range=0,aru_seq=457077,flags=6,source=98987112-206b-11e2-0800-fb9cb9f8da35,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=
00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=473833151,node_list=()
}
        (3,457081),evs::msg{version=0,type=1,user_type=255,order=0,seq=457081,seq_range=0,aru_seq=457077,flags=6,source=98987112-206b-11e2-0800-fb9cb9f8da35,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=
00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=473833152,node_list=()
}
        (3,457082),evs::msg{version=0,type=1,user_type=1,order=4,seq=457082,seq_range=0,aru_seq=457077,flags=5,source=98987112-206b-11e2-0800-fb9cb9f8da35,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=00
000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=473833153,node_list=()
}
        (3,457083),evs::msg{version=0,type=1,user_type=1,order=4,seq=457083,seq_range=0,aru_seq=457077,flags=5,source=98987112-206b-11e2-0800-fb9cb9f8da35,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=00
000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=473833154,node_list=()
}
        (3,457084),evs::msg{version=0,type=1,user_type=255,order=4,seq=457084,seq_range=0,aru_seq=457077,flags=12,source=98987112-206b-11e2-0800-fb9cb9f8da35,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid
=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=473833155,node_list=()
}
        (3,457085),evs::msg{version=0,type=1,user_type=255,order=0,seq=457085,seq_range=3,aru_seq=457077,flags=4,source=98987112-206b-11e2-0800-fb9cb9f8da35,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=
00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=473833156,node_list=()
}
        (3,457086),evs::msg{version=0,type=1,user_type=255,order=0,seq=457086,seq_range=0,aru_seq=457077,flags=0,source=98987112-206b-11e2-0800-fb9cb9f8da35,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=
00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=-1,node_list=()
}
        (3,457087),evs::msg{version=0,type=1,user_type=255,order=0,seq=457087,seq_range=0,aru_seq=457077,flags=0,source=98987112-206b-11e2-0800-fb9cb9f8da35,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=
00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=-1,node_list=()
}
        (3,457088),evs::msg{version=0,type=1,user_type=255,order=0,seq=457088,seq_range=0,aru_seq=457077,flags=0,source=98987112-206b-11e2-0800-fb9cb9f8da35,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=
00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=-1,node_list=()
}
,recovery_index=        (0,457078),evs::msg{version=0,type=1,user_type=255,order=0,seq=457078,seq_range=0,aru_seq=457077,flags=4,source=114424cd-1c35-11e2-0800-5391d86ade2c,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,
366),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=645053750,node_list=()
}
        (1,457078),evs::msg{version=0,type=1,user_type=255,order=0,seq=457078,seq_range=0,aru_seq=457077,flags=0,source=1f7318e2-28b9-11e2-0800-ae585dffbf7a,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=
00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=150911920,node_list=()
}
        (2,457078),evs::msg{version=0,type=1,user_type=255,order=0,seq=457078,seq_range=0,aru_seq=457077,flags=4,source=45737e50-2418-11e2-0800-1b1995a09f69,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=
00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=335511489,node_list=()
}
        (0,457079),evs::msg{version=0,type=1,user_type=255,order=0,seq=457079,seq_range=0,aru_seq=457077,flags=4,source=114424cd-1c35-11e2-0800-5391d86ade2c,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=
00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=645053751,node_list=()
}
        (1,457079),evs::msg{version=0,type=1,user_type=255,order=0,seq=457079,seq_range=0,aru_seq=457077,flags=0,source=1f7318e2-28b9-11e2-0800-ae585dffbf7a,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=
00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=150911921,node_list=()
}
        (2,457079),evs::msg{version=0,type=1,user_type=255,order=0,seq=457079,seq_range=0,aru_seq=457077,flags=4,source=45737e50-2418-11e2-0800-1b1995a09f69,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=
00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=335511490,node_list=()
}
        (0,457080),evs::msg{version=0,type=1,user_type=255,order=0,seq=457080,seq_range=0,aru_seq=457078,flags=4,source=114424cd-1c35-11e2-0800-5391d86ade2c,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=
00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=645053753,node_list=()
}
        (1,457080),evs::msg{version=0,type=1,user_type=255,order=0,seq=457080,seq_range=0,aru_seq=457077,flags=0,source=1f7318e2-28b9-11e2-0800-ae585dffbf7a,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=
00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=150911922,node_list=()
}
        (2,457080),evs::msg{version=0,type=1,user_type=255,order=0,seq=457080,seq_range=0,aru_seq=457077,flags=4,source=45737e50-2418-11e2-0800-1b1995a09f69,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=
00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=335511491,node_list=()
}
        (0,457081),evs::msg{version=0,type=1,user_type=255,order=0,seq=457081,seq_range=0,aru_seq=457078,flags=4,source=114424cd-1c35-11e2-0800-5391d86ade2c,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=
00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=645053754,node_list=()
}
        (1,457081),evs::msg{version=0,type=1,user_type=255,order=0,seq=457081,seq_range=0,aru_seq=457077,flags=0,source=1f7318e2-28b9-11e2-0800-ae585dffbf7a,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=
00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=150911923,node_list=()
}
        (2,457081),evs::msg{version=0,type=1,user_type=255,order=0,seq=457081,seq_range=0,aru_seq=457077,flags=4,source=45737e50-2418-11e2-0800-1b1995a09f69,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=
00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=335511492,node_list=()
}
        (0,457082),evs::msg{version=0,type=1,user_type=255,order=0,seq=457082,seq_range=0,aru_seq=457078,flags=4,source=114424cd-1c35-11e2-0800-5391d86ade2c,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=
00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=645053755,node_list=()
}
        (1,457082),evs::msg{version=0,type=1,user_type=255,order=0,seq=457082,seq_range=0,aru_seq=457077,flags=0,source=1f7318e2-28b9-11e2-0800-ae585dffbf7a,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=
00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=150911924,node_list=()
}
        (2,457082),evs::msg{version=0,type=1,user_type=255,order=0,seq=457082,seq_range=0,aru_seq=457077,flags=4,source=45737e50-2418-11e2-0800-1b1995a09f69,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=
00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=335511493,node_list=()
}
        (0,457083),evs::msg{version=0,type=1,user_type=255,order=0,seq=457083,seq_range=0,aru_seq=457078,flags=4,source=114424cd-1c35-11e2-0800-5391d86ade2c,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=
00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=645053756,node_list=()
}
        (1,457083),evs::msg{version=0,type=1,user_type=255,order=0,seq=457083,seq_range=0,aru_seq=457077,flags=0,source=1f7318e2-28b9-11e2-0800-ae585dffbf7a,source_view_id=view_id(REG,114424cd-1c35-11e2-0800-5391d86ade2c,366),range_uuid=
00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=150911925,node_list=()

I have not seen the above repeated message before. What does it mean?

Teemu Ollakka

Nov 12, 2012, 12:26:19 PM
to codersh...@googlegroups.com

Hi,

From the logs you sent, it seems that every now and then the connection between two nodes might get stuck. Is there a firewall between the nodes that could cause that?

Anyway, the cluster should recover easily from that kind of connection breakage. Could you send your wsrep_provider_options for review, to rule out configuration issues?

- Teemu

Ilias Bertsimas

Nov 12, 2012, 12:45:02 PM
to codersh...@googlegroups.com
Hi Teemu,

There is a firewall but all the traffic inside the vlan is unfiltered by default (skip vlan interface).

Here are my wsrep options:
wsrep_provider_options = "gcs.fc_limit = 256; gcs.fc_factor = 0.99; gcs.fc_master_slave = yes; gcache.size=5G; gcs.sync_donor=1;"

We use only one node for writes, hence the above config, as suggested in the wiki.
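
The option string above sets only `gcs.*` and `gcache.*` values, so the `evs.*` timeouts are left at whatever defaults the installed Galera version uses. A small sketch that makes this visible by splitting the string into name/value pairs (the function name is hypothetical):

```python
def parse_provider_options(value):
    """Split a wsrep_provider_options string into a {name: value} dict."""
    options = {}
    for item in value.split(";"):
        if "=" not in item:
            continue  # skips the empty piece after a trailing ';'
        name, _, setting = item.partition("=")
        options[name.strip()] = setting.strip()
    return options


current = parse_provider_options(
    "gcs.fc_limit = 256; gcs.fc_factor = 0.99; gcs.fc_master_slave = yes; "
    "gcache.size=5G; gcs.sync_donor=1;"
)
evs_overrides = {k: v for k, v in current.items() if k.startswith("evs.")}
print(evs_overrides)  # {} because no evs.* option appears in the string
```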

Kind Regards,
Ilias.