Hi,
We need an expert's view on a problem we are seeing now and then: an ovsdb-server node in a 3-node raft cluster keeps printing the "raft_is_connected: false" message, and the "connected" state in its _Server DB stays false.
According to the ovsdb-server(5) manpage, this means the server is not in contact with a majority of its cluster.
Apart from its "connected" state, from what we can see this server is in the follower state and works fine, and the connections between it and the other two servers appear healthy as well.
Below is its raft structure snapshot at the time of the problem. Note that its candidate_retrying field stays true.
Hopefully the provided information can help figure out what goes wrong here. Unfortunately, we don't have a reliable way to reproduce it.
Han,
Thanks for your reply, and thanks for confirming my reading of the code at the time as well: “from what I can see, raft->leader_sid is also updated in the only two places where raft->candidate_retrying is set (raft_start_election() and raft_set_leader()), which means it is not possible for raft->candidate_retrying to be TRUE while raft->leader_sid is non-zero”.
We do not see it very often, probably once every half month or so. If it happens again, what information do you think we should collect to help with further investigation?
Thanks
Yun
From: Han Zhou <zho...@gmail.com>
Sent: Sunday, August 16, 2020 10:14 PM
To: Yun Zhou <yu...@nvidia.com>
Cc: ovs-d...@openvswitch.org; ovn-kub...@googlegroups.com; Girish Moodalbail <gmood...@nvidia.com>
Subject: Re: the raft_is_connected state of a raft server stays as false and cannot recover
Han,
I just found out that we are using OVS built directly from the upstream branch-2.13. It seems this branch does not have the following commit:
# git log -p -1 cdae6100f8
commit cdae6100f89d04c5c29dc86a490b936a204622b7
Author: Han Zhou <hz...@ovn.org>
Date: Thu Mar 5 23:48:46 2020 -0800
raft: Unset leader when starting election.
From my reading of the code, the lack of this commit could cause a raft_set_leader() call to be skipped, so candidate_retrying could stay true.
Please let me know if my understanding is correct. If so, the problem should already be fixed in the upstream master branch.
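To make the suspected sequence concrete, here is a minimal sketch in C. It is a simplified model, not the code in ovsdb/raft.c: the struct, the model_* function names, the integer server IDs, and the assumption that the leader is only set when an incoming ID differs from the stored leader_sid are all illustrative and should be checked against the real source.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the fields discussed in this thread.
 * This is NOT the real struct raft; 0 plays the role of a zero UUID. */
struct model_raft {
    unsigned int leader_sid;   /* 0 means "no known leader". */
    bool had_leader;           /* A leader was known since the last election. */
    bool ever_had_leader;      /* A leader was known at least once. */
    bool candidate_retrying;   /* Election restarted without a leader. */
};

/* Models raft_set_leader(): the only place that clears candidate_retrying. */
static void
model_set_leader(struct model_raft *r, unsigned int sid)
{
    assert(sid != 0);
    r->leader_sid = sid;
    r->had_leader = r->ever_had_leader = true;
    r->candidate_retrying = false;
}

/* Models raft_start_election(). 'unset_leader' stands for whether the
 * "Unset leader when starting election" commit is present. */
static void
model_start_election(struct model_raft *r, bool unset_leader)
{
    if (unset_leader) {
        r->leader_sid = 0;
    }
    r->candidate_retrying = !r->had_leader;
    r->had_leader = false;
}

/* Assumed behavior: the leader is only (re)set when the ID changes. */
static void
model_update_leader(struct model_raft *r, unsigned int sid)
{
    if (sid != r->leader_sid) {
        model_set_leader(r, sid);
    }
}

/* Reduced form of the "connected" check: only the two fields modeled here. */
static bool
model_is_connected(const struct model_raft *r)
{
    return !r->candidate_retrying && r->ever_had_leader;
}

/* Scenario: leader 7 is known, two elections start back to back, then the
 * same leader 7 is heard from again. Returns the final connected state. */
static bool
model_reconnects(bool unset_leader)
{
    struct model_raft r = { 0, false, false, false };

    model_update_leader(&r, 7);
    model_start_election(&r, unset_leader);  /* candidate_retrying: false */
    model_start_election(&r, unset_leader);  /* candidate_retrying: true */
    model_update_leader(&r, 7);              /* Same leader comes back. */
    return model_is_connected(&r);
}
```

In this model, model_reconnects(true) ends up connected, while model_reconnects(false) leaves candidate_retrying stuck at true with a non-zero leader_sid, which matches the state we observed.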
Sorry for the confusion about the OVS version.
Thanks
Yun