the raft_is_connected state of a raft server stays as false and cannot recover


Yun Zhou

Aug 13, 2020, 8:26:34 PM
to ovs-d...@openvswitch.org, ovn-kub...@googlegroups.com, Girish Moodalbail
Hi,

We need an expert's view to address a problem we are seeing now and then: an ovsdb-server node in a 3-node raft cluster keeps printing the "raft_is_connected: false" message, and its "connected" state in its _Server DB stays false.

According to the ovsdb-server(5) manpage, this means the server is not in contact with a majority of its cluster.
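
For context, if I am reading ovsdb/raft.c correctly, both the log message and the "connected" column come from raft_is_connected(); roughly (a paraphrase, not necessarily the exact source on every branch):

static bool
raft_is_connected(const struct raft *raft)
{
    bool ret = (!raft->candidate_retrying
                && !raft->joining
                && !raft->leaving
                && !raft->left
                && !raft->failed
                && raft->ever_had_leader);
    VLOG_DBG("raft_is_connected: %s", ret ? "true" : "false");
    return ret;
}

In the snapshot below, joining, leaving, left, and failed are all false and ever_had_leader is true, so the stuck candidate_retrying == true is the only condition keeping this function returning false.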

Apart from its "connected" state, as far as we can see this server is in the follower state and works fine, and the connections between it and the other two servers appear healthy as well.

Below is a snapshot of its raft structure at the time of the problem. Note that its candidate_retrying field stays true.

Hopefully the provided information can help figure out what is going wrong here. Unfortunately we don't have a reliable way to reproduce it:

(gdb) print *(struct raft *)0xa872c0
$19 = {
  hmap_node = {
    hash = 2911123117,
    next = 0x0
  },
  log = 0xa83690,
  cid = {
    parts = {2699238234, 2258650653, 3035282424, 813064186}
  },
  sid = {
    parts = {1071328836, 400573240, 2626104521, 1746414343}
  },
  local_address = 0xa874e0 "tcp:10.8.51.55:6643",
  local_nickname = 0xa876d0 "3fdb",
  name = 0xa876b0 "OVN_Northbound",
  servers = {
    buckets = 0xad4bc0,
    one = 0x0,
    mask = 3,
    n = 3
  },
  election_timer = 1000,
  election_timer_new = 0,
  term = 3,
  vote = {
    parts = {1071328836, 400573240, 2626104521, 1746414343}
  },
  synced_term = 3,
  synced_vote = {
    parts = {1071328836, 400573240, 2626104521, 1746414343}
  },
  entries = 0xbf0fe0,
  log_start = 2,
  log_end = 312,
  log_synced = 311,
  allocated_log = 512,
  snap = {
    term = 1,
    data = 0xaafb10,
    eid = {
      parts = {1838862864, 1569866528, 2969429118, 3021055395}
    },
    servers = 0xaafa70,
    election_timer = 1000
  },
  role = RAFT_FOLLOWER,
  commit_index = 311,
  last_applied = 311,
  leader_sid = {
    parts = {642765114, 43797788, 2533161504, 3088745929}
  },
  election_base = 6043283367,
  election_timeout = 6043284593,
  joining = false,
  remote_addresses = {
    map = {
      buckets = 0xa87410,
      one = 0xa879c0,
      mask = 0,
      n = 1
    }
  },
  join_timeout = 6037634820,
  leaving = false,
  left = false,
  leave_timeout = 0,
  failed = false,
  waiters = {
    prev = 0xa87448,
    next = 0xa87448
  },
  listener = 0xaafad0,
  listen_backoff = -9223372036854775808,
  conns = {
    prev = 0xbcd660,
    next = 0xaafc20
  },
  add_servers = {
    buckets = 0xa87480,
    one = 0x0,
    mask = 0,
    n = 0
  },
  remove_server = 0x0,
  commands = {
    buckets = 0xa874a8,
    one = 0x0,
    mask = 0,
    n = 0
  },
  ping_timeout = 6043283700,
  n_votes = 1,
  candidate_retrying = true,
  had_leader = false,
  ever_had_leader = true
}

Thanks
- Yun

Han Zhou

Aug 17, 2020, 1:14:18 AM
to Yun Zhou, ovs-d...@openvswitch.org, ovn-kub...@googlegroups.com, Girish Moodalbail
On Thu, Aug 13, 2020 at 5:26 PM Yun Zhou <yu...@nvidia.com> wrote:
> Hi,
>
> We need an expert's view to address a problem we are seeing now and then: an ovsdb-server node in a 3-node raft cluster keeps printing the "raft_is_connected: false" message, and its "connected" state in its _Server DB stays false.
>
> According to the ovsdb-server(5) manpage, this means the server is not in contact with a majority of its cluster.
>
> Apart from its "connected" state, as far as we can see this server is in the follower state and works fine, and the connections between it and the other two servers appear healthy as well.
>
> Below is a snapshot of its raft structure at the time of the problem. Note that its candidate_retrying field stays true.
>
> Hopefully the provided information can help figure out what is going wrong here. Unfortunately we don't have a reliable way to reproduce it:

Thanks for reporting the issue. This looks really strange. In the state below, leader_sid is non-zero, but candidate_retrying is true.
According to the latest code, whenever leader_sid is set to non-zero (in raft_set_leader()), candidate_retrying is set to false; whenever candidate_retrying is set to true (in raft_start_election()), leader_sid is set to UUID_ZERO. The struct is also initialized with xzalloc(), which ensures candidate_retrying starts out false. So, sorry, I can't explain how it ends up in this conflicting situation. It would be helpful if there were a way to reproduce it. How often does it happen?
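
To illustrate the invariant, here is a rough paraphrase of the two code paths in ovsdb/raft.c (simplified, not the exact source):

/* The only place candidate_retrying is cleared: a leader was seen. */
static void
raft_set_leader(struct raft *raft, const struct uuid *sid)
{
    raft->leader_sid = *sid;                  /* leader_sid becomes non-zero. */
    raft->ever_had_leader = raft->had_leader = true;
    raft->candidate_retrying = false;
}

/* The only place candidate_retrying is set: starting (or retrying) an
 * election. */
static void
raft_start_election(struct raft *raft, bool leadership_transfer)
{
    /* ... */
    raft->leader_sid = UUID_ZERO;             /* Leader is unset here. */
    raft->role = RAFT_CANDIDATE;
    /* If no leader was seen since the last election, we are retrying. */
    raft->candidate_retrying = !raft->had_leader;
    raft->had_leader = false;
    /* ... */
}

So candidate_retrying == true should always imply that leader_sid is UUID_ZERO, which is why the snapshot above is so puzzling.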

Thanks,
Han


Yun Zhou

Aug 17, 2020, 12:18:05 PM
to Han Zhou, ovs-d...@openvswitch.org, ovn-kub...@googlegroups.com, Girish Moodalbail

Han,

Thanks for your reply, and thanks for confirming my reading of the code at the time as well: "from what I can see, raft.leader_sid is also updated in the only two places where raft.candidate_retrying is set (raft_start_election() and raft_set_leader()), which means it should not be possible for raft.candidate_retrying to be true while raft->leader_sid is non-zero".

We don't see it very often, probably once every half month or so. If it happens again, what information do you think we should collect to help with further investigation?

Thanks
Yun

Yun Zhou

Aug 20, 2020, 7:34:24 PM
to Han Zhou, ovs-d...@openvswitch.org, ovn-kub...@googlegroups.com, Girish Moodalbail

Han,

I just found out that we are using OVS built directly from the upstream branch-2.13. It seems this branch does not have the following commit:

# git log -p -1 cdae6100f8
commit cdae6100f89d04c5c29dc86a490b936a204622b7
Author: Han Zhou <hz...@ovn.org>
Date:   Thu Mar 5 23:48:46 2020 -0800

    raft: Unset leader when starting election.

From my reading of the code, the lack of this commit can lead to an expected raft_set_leader() call being missed, and candidate_retrying can therefore stay true.
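
If I follow the change correctly, it unsets the leader whenever an election starts. A rough sketch of the relevant part of raft_start_election() after the commit (paraphrased, not the exact upstream diff; the marked line is what branch-2.13 is missing):

static void
raft_start_election(struct raft *raft, bool leadership_transfer)
{
    /* ... */
    raft->leader_sid = UUID_ZERO;   /* Missing on branch-2.13. */
    raft->role = RAFT_CANDIDATE;
    /* If no leader was seen since the last election, we are retrying. */
    raft->candidate_retrying = !raft->had_leader;
    raft->had_leader = false;
    /* ... */
}

Without that line, leader_sid can still hold the old leader's sid while the node campaigns; when that same leader then re-establishes contact, raft_update_leader() sees an unchanged leader_sid and skips the raft_set_leader() call, so candidate_retrying is never cleared. That would match the state we captured.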

Please let me know if my understanding is correct. If so, the problem should already be fixed on the upstream master branch.

Sorry for the confusion about the OVS version.

 

Thanks
Yun
