"exception from gcomm, backend must be restarted: nmi != known_.end(): node X not found from known map (FATAL)"


rma...@pivotal.io

Feb 2, 2015, 5:29:43 PM
to codersh...@googlegroups.com, cf-core-services-eng
Hello,

Recently, we experienced a sudden failure on a 3-node MariaDB Galera cluster, and can't find any documentation or discussion of these particular logs. Prior to the crash, our cluster seemed to be experiencing connectivity issues, and this particular node was partitioned from both its peers. It was in a non-primary state when the following occurred:

 150126  6:26:16 [Note] WSREP: evs::proto(fee139da-a4dc-11e4-896b-13e5de7439e0, GATHER, view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,4)) suspecting node: b3156374-a4dc-11e4-90ef-7745fb12a381
150126  6:26:17 [Warning] WSREP: subsequent views have same members, prev view view(view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,4) memb {
        fee139da-a4dc-11e4-896b-13e5de7439e0,0
} joined {
} left {
} partitioned {
}) current view view(view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5) memb {
        fee139da-a4dc-11e4-896b-13e5de7439e0,0
} joined {
} left {
} partitioned {
})
150126  6:26:17 [Note] WSREP: view(view_id(NON_PRIM,fee139da-a4dc-11e4-896b-13e5de7439e0,5) memb {
        fee139da-a4dc-11e4-896b-13e5de7439e0,0
} joined {
} left {
} partitioned {
        7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,0
        b3156374-a4dc-11e4-90ef-7745fb12a381,0
})
150126  6:26:17 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
150126  6:26:17 [Note] WSREP: Flow-control interval: [16, 16]
150126  6:26:17 [Note] WSREP: Received NON-PRIMARY.
150126  6:26:17 [Note] WSREP: New cluster view: global state: 7ef7ed24-a4dc-11e4-be97-6ee0620c8266:43, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version 3
150126  6:26:17 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
150126  6:26:20 [Note] WSREP: (fee139da-a4dc-11e4-896b-13e5de7439e0, 'tcp://0.0.0.0:4567') reconnecting to 7ef72f2a-a4dc-11e4-8f5b-bb400fba8578 (tcp://10.85.49.128:4567), attempt 0
150126  6:26:28 [Note] WSREP: (fee139da-a4dc-11e4-896b-13e5de7439e0, 'tcp://0.0.0.0:4567') address 'tcp://10.85.49.130:4567' pointing to uuid fee139da-a4dc-11e4-896b-13e5de7439e0 is blacklisted, skipping
150126  6:26:28 [Note] WSREP: (fee139da-a4dc-11e4-896b-13e5de7439e0, 'tcp://0.0.0.0:4567') address 'tcp://10.85.49.130:4567' pointing to uuid fee139da-a4dc-11e4-896b-13e5de7439e0 is blacklisted, skipping
150126  6:26:28 [Note] WSREP: (fee139da-a4dc-11e4-896b-13e5de7439e0, 'tcp://0.0.0.0:4567') address 'tcp://10.85.49.130:4567' pointing to uuid fee139da-a4dc-11e4-896b-13e5de7439e0 is blacklisted, skipping
150126  6:26:28 [Note] WSREP: (fee139da-a4dc-11e4-896b-13e5de7439e0, 'tcp://0.0.0.0:4567') address 'tcp://10.85.49.130:4567' pointing to uuid fee139da-a4dc-11e4-896b-13e5de7439e0 is blacklisted, skipping
150126  6:26:28 [Note] WSREP: (fee139da-a4dc-11e4-896b-13e5de7439e0, 'tcp://0.0.0.0:4567') turning message relay requesting off
150126  6:26:28 [Warning] WSREP: evs::proto(fee139da-a4dc-11e4-896b-13e5de7439e0, INSTALL, view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5)) dropping foreign message from b3156374-a4dc-11e4-90ef-7745fb12a381 in install state
150126  6:26:28 [Warning] WSREP: evs::proto(fee139da-a4dc-11e4-896b-13e5de7439e0, INSTALL, view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5)) dropping foreign message from b3156374-a4dc-11e4-90ef-7745fb12a381 in install state
150126  6:26:28 [Warning] WSREP: evs::proto(fee139da-a4dc-11e4-896b-13e5de7439e0, INSTALL, view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5)) dropping foreign message from b3156374-a4dc-11e4-90ef-7745fb12a381 in install state
150126  6:26:28 [Warning] WSREP: evs::proto(fee139da-a4dc-11e4-896b-13e5de7439e0, INSTALL, view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5)) dropping foreign message from b3156374-a4dc-11e4-90ef-7745fb12a381 in install state
150126  6:26:28 [Warning] WSREP: evs::proto(fee139da-a4dc-11e4-896b-13e5de7439e0, INSTALL, view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5)) dropping foreign message from b3156374-a4dc-11e4-90ef-7745fb12a381 in install state
150126  6:26:28 [Warning] WSREP: evs::proto(fee139da-a4dc-11e4-896b-13e5de7439e0, INSTALL, view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5)) dropping foreign message from b3156374-a4dc-11e4-90ef-7745fb12a381 in install state
150126  6:26:28 [ERROR] WSREP: exception caused by message: evs::msg{version=0,type=3,user_type=255,order=1,seq=-1,seq_range=-1,aru_seq=-1,flags=4,source=7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,source_view_id=view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,6),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=176868,node_list=()
}
 state after handling message: evs::proto(evs::proto(fee139da-a4dc-11e4-896b-13e5de7439e0, INSTALL, view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,6)), INSTALL) {
current_view=view(view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,6) memb {
        7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,0
} joined {
} left {
} partitioned {
}),
input_map=evs::input_map: {aru_seq=-1,safe_seq=-1,node_index=},
fifo_seq=176418,
last_sent=11,
known={
        7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,evs::node{operational=1,suspected=0,installed=1,fifo_seq=176868,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=4,seq_range=-1,aru_seq=4,flags=4,source=7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,source_view_id=view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,5),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=176864,node_list=(   7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,5),safe_seq=4,im_range=[10,9],}
        b3156374-a4dc-11e4-90ef-7745fb12a381,node: {operational=0,suspected=1,leave_seq=-1,view_id=view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,5),safe_seq=4,im_range=[5,4],}
        fee139da-a4dc-11e4-896b-13e5de7439e0,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,4),safe_seq=0,im_range=[1,0],}
)
},
}
        fee139da-a4dc-11e4-896b-13e5de7439e0,evs::node{operational=1,suspected=0,installed=1,fifo_seq=-1,join_message=
evs::msg{version=0,type=4,user_type=255,order=1,seq=11,seq_range=-1,aru_seq=11,flags=0,source=fee139da-a4dc-11e4-896b-13e5de7439e0,source_view_id=view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=176416,node_list=( 7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,5),safe_seq=4,im_range=[10,9],}
        fee139da-a4dc-11e4-896b-13e5de7439e0,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5),safe_seq=11,im_range=[12,11],}
)
},
}
 }
install msg=evs::msg{version=0,type=5,user_type=255,order=1,seq=4,seq_range=-1,aru_seq=4,flags=4,source=7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,source_view_id=view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,5),range_uuid=00000000-0000-0000-0000-000000000000,range=[-1,-1],fifo_seq=176866,node_list=(       7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,5),safe_seq=4,im_range=[10,9],}
        b3156374-a4dc-11e4-90ef-7745fb12a381,node: {operational=0,suspected=1,leave_seq=-1,view_id=view_id(REG,7ef72f2a-a4dc-11e4-8f5b-bb400fba8578,5),safe_seq=4,im_range=[5,4],}
        fee139da-a4dc-11e4-896b-13e5de7439e0,node: {operational=1,suspected=0,leave_seq=-1,view_id=view_id(REG,fee139da-a4dc-11e4-896b-13e5de7439e0,5),safe_seq=11,im_range=[12,11],}
)
}
 }
 
150126  6:26:28 [ERROR] WSREP: exception from gcomm, backend must be restarted: nmi != known_.end(): node b3156374-a4dc-11e4-90ef-7745fb12a381 not found from known map (FATAL)
         at gcomm/src/evs_proto.cpp:shift_to():2363
150126  6:26:28 [Note] WSREP: Received self-leave message.
150126  6:26:28 [Note] WSREP: Flow-control interval: [0, 0]
150126  6:26:28 [Note] WSREP: Received SELF-LEAVE. Closing connection.
150126  6:26:28 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 43)
150126  6:26:28 [Note] WSREP: RECV thread exiting 0: Success
150126  6:28:16 [Note] WSREP: New cluster view: global state: 7ef7ed24-a4dc-11e4-be97-6ee0620c8266:43, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 3
150126  6:28:16 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
150126  6:28:16 [Note] WSREP: applier thread exiting (code:0)

At this point, the mysqld process was still running but no further logging occurred. We understand that we can repair this cluster by bootstrapping a single node, but would like to understand what may have happened to cause this crash. We've highlighted some specific log messages that seemed more significant; we don't understand why a node would be in an 'INSTALL' state after having been up for a while. The logs of the node in question, b3156374-a4dc-11e4-90ef-7745fb12a381, show that the node continued to function (attempting to find cluster members) for quite some time after this crash. It was unable to connect to fee139da-a4dc-11e4-896b-13e5de7439e0 and is still running, still trying to connect at the moment. 
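
For reference, the repair we have in mind is roughly the following; the exact service invocation depends on the init scripts in use, and the datadir path here is assumed, so treat this as a sketch rather than our actual runbook:

  # 1. On each node, find the one with the most advanced committed seqno
  cat /var/lib/mysql/grastate.dat            # check the 'seqno:' line
  # or, while mysqld is still running:
  mysql -e "SHOW STATUS LIKE 'wsrep_last_committed'"

  # 2. Bootstrap a new primary component from that node
  service mysql bootstrap                    # or: mysqld --wsrep-new-cluster

  # 3. Restart the remaining nodes normally so they rejoin via IST/SST
  service mysql start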

Any insight into what's going on here? Looking to see whether this is an effect of the connectivity issues we've seen, or something else entirely. 

Thanks,

Raina and Lyle
Cloud Foundry Services Team

Teemu Ollakka

Feb 3, 2015, 9:05:37 AM
to codersh...@googlegroups.com, cf-core-se...@pivotal.io

Hi,


On Tuesday, February 3, 2015 at 12:29:43 AM UTC+2, rma...@pivotal.io wrote:
Hello,

Recently, we experienced a sudden failure on a 3-node MariaDB Galera cluster, and can't find any documentation or discussion of these particular logs. Prior to the crash, our cluster seemed to be experiencing connectivity issues, and this particular node was partitioned from both its peers. It was in a non-primary state when the following occurred:


[...] 
 
150126  6:26:28 [ERROR] WSREP: exception from gcomm, backend must be restarted: nmi != known_.end(): node b3156374-a4dc-11e4-90ef-7745fb12a381 not found from known map (FATAL)
         at gcomm/src/evs_proto.cpp:shift_to():2363
150126  6:26:28 [Note] WSREP: Received self-leave message.
150126  6:26:28 [Note] WSREP: Flow-control interval: [0, 0]
150126  6:26:28 [Note] WSREP: Received SELF-LEAVE. Closing connection.
150126  6:26:28 [Note] WSREP: Shifting OPEN -> CLOSED (TO: 43)
150126  6:26:28 [Note] WSREP: RECV thread exiting 0: Success
150126  6:28:16 [Note] WSREP: New cluster view: global state: 7ef7ed24-a4dc-11e4-be97-6ee0620c8266:43, view# -1: non-Primary, number of nodes: 0, my index: -1, protocol version 3
150126  6:28:16 [Note] WSREP: wsrep_notify_cmd is not defined, skipping notification.
150126  6:28:16 [Note] WSREP: applier thread exiting (code:0)

At this point, the mysqld process was still running but no further logging occurred. We understand that we can repair this cluster by bootstrapping a single node, but would like to understand what may have happened to cause this crash. We've highlighted some specific log messages that seemed more significant; we don't understand why a node would be in an 'INSTALL' state after having been up for a while.

The INSTALL state is an internal state of the group communication layer and does not reflect the actual state of the database node.
 
The logs of the node in question, b3156374-a4dc-11e4-90ef-7745fb12a381, show that the node continued to function (attempting to find cluster members) for quite some time after this crash. It was unable to connect to fee139da-a4dc-11e4-896b-13e5de7439e0 and is still running, still trying to connect at the moment. 

Any insight into what's going on here? Looking to see whether this is an effect of the connectivity issues we've seen, or something else entirely. 



This was a group communication bug in earlier Galera 2.x and 3.x library versions, sometimes triggered by intermittent network problems. Unfortunately I could not find a public bug report about it or a specific commit that fixed it, but the issue should have been resolved by the improvements made to group communication after the 25.3.5 release.
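
To confirm which provider library a running node has loaded, the wsrep status and system variables can be queried, for example:

  mysql -e "SHOW STATUS LIKE 'wsrep_provider_version'"
  mysql -e "SHOW VARIABLES LIKE 'wsrep_provider'"    # path of the loaded libgalera_smm.so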

- Teemu

rma...@pivotal.io

Feb 3, 2015, 1:49:25 PM
to codersh...@googlegroups.com, cf-core-se...@pivotal.io
Hi Teemu,

We were running the Galera 25.3.5 release when we experienced this issue. Would you suggest we update to a later version from the GitHub repo? If so, I see there are tags up to 25.3.9; which tag would be the latest stable release?

Thanks,

Raina and Lyle
Cloud Foundry Services Team

Teemu Ollakka

Feb 3, 2015, 4:55:11 PM
to codersh...@googlegroups.com, cf-core-se...@pivotal.io

Hi,

I would suggest upgrading to 25.3.9 once it is out.

The official 25.3.9 release will be announced very soon, and official binaries will be available via the galeracluster.com website.

- Teemu