Upgrades and read-only nodes


Jérôme Gravel-Niquet

Nov 14, 2020, 7:20:29 PM
to rqlite
We're thinking of maybe using rqlite but we have a few questions:

- What's the best way to upgrade without downtime for a read-only node? If we need to shut down our read-only node and start the new one back up, it'll have to read the whole raft log to get up to date and that would cause downtime, right? Writing to disk might "fix" this?

- We're planning on running 3-5 voting nodes, but probably ~100 (eventually 1000+) read-only nodes. Is that feasible? We're assuming read-only nodes can scale much more than r/w (voting) nodes.

- If we're using a read-only node and writing the database to disk, is it safe to connect directly to that sqlite file if we're running read-only operations on it?

Philip O'Toole

Nov 16, 2020, 10:17:38 AM
to rql...@googlegroups.com
Thanks for your questions -- answers inline.

On Sat, Nov 14, 2020 at 7:20 PM Jérôme Gravel-Niquet <jero...@gmail.com> wrote:
We're thinking of maybe using rqlite but we have a few questions:

- What's the best way to upgrade without downtime for a read-only node? If we need to shut down our read-only node and start the new one back up, it'll have to read the whole raft log to get up to date and that would cause downtime, right? Writing to disk might "fix" this?

Yes, that would cause some downtime. Can you bring up a second node, running a newer version of the code than the first, and switch clients over to it once it's fully up?
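
Something along these lines might work (a sketch only; the hostnames are placeholders, and I'm assuming your read-only nodes are launched with -raft-non-voter):

rqlited -raft-non-voter=true \
        -http-addr "0.0.0.0:4001" -raft-addr "0.0.0.0:4002" \
        -join "http://existing-node.internal:4001" \
        /data/db-new

Once /status on the new node shows it has a leader and its log is caught up, repoint clients to it and retire the old read-only node:

curl -s "http://new-node.internal:4001/status?pretty"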

Writing to disk won't fix this, as the SQLite database is erased on restart, and completely rebuilt from the log, regardless of whether it's on-disk or in-memory. While this might seem like overkill, it allows you to deal with a whole host of potential problems by simply restarting a node.

However, it may not be as slow as you think: recovery is a combination of installing a snapshot from the leader and replaying the remaining log entries. Have you actually measured how much downtime it would involve, in your current setup?
 

- We're planning on running 3-5 voting nodes, but probably ~100 (eventually 1000+) read-only nodes. Is that feasible? We're assuming read-only nodes can scale much more than r/w (voting) nodes.

Yes, that is the goal of read-only nodes. I have not done extensive testing at scale, but the read-only functionality uses the same code as HashiCorp Consul, which claims to scale very well with read-only nodes. I would be very interested in hearing how well the read-only nodes work for you.
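
As a rough sketch, a read-only node is just a non-voting node joined to the cluster, and (if I recall the semantics correctly) it serves queries at none-level read consistency, so reads never touch the leader (addresses are placeholders):

rqlited -raft-non-voter=true \
        -join "http://voter1.internal:4001" \
        /data/db

curl -G "localhost:4001/db/query?level=none" --data-urlencode "q=SELECT count(*) FROM foo"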


- If we're using a read-only node and writing the database to disk, is it safe to connect directly to that sqlite file if we're running read-only operations on it?

I have not done much testing of that, but in practice it should work fine. Again, I would be very interested in learning if it works out well for you.
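
If you do try it, I'd open the file read-only to guard against accidental writes, something like this (the exact file name under your data directory is an assumption, so check what rqlited actually creates there):

sqlite3 -readonly /data/db/db.sqlite "SELECT count(*) FROM foo;"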


Jérôme Gravel-Niquet

Nov 16, 2020, 10:36:05 AM
to rqlite
Thanks! Will let you know how it turns out if we end up using it :)

Philip O'Toole

Nov 16, 2020, 11:35:06 AM
to rql...@googlegroups.com
Great, let me know if you think some changes to rqlite would help in your use case.

Jérôme Gravel-Niquet

Dec 5, 2020, 11:29:54 AM
to rql...@googlegroups.com
If you don't mind, I'll use this thread for questions and issues?

I was finally able to make rqlite work on our platform, but I had to start a single node, wait for it to become leader and then start the 2 other nodes. I thought raft could handle multiple nodes coming up at the same time?

However, I'm unable to restart my nodes and return to a good state afterwards. I'm restarting them 1 by 1, but maybe too fast?

This is my rqlited command:

rqlited \
        -disco-id $DISCO_ID \
        -http-addr "0.0.0.0:4001" -http-adv-addr "[$PRIVATE_NET_IP]:4001" \
        -raft-addr "0.0.0.0:4002" -raft-adv-addr "[$PRIVATE_NET_IP]:4002" \
        -raft-timeout 5s -raft-election-timeout 10s \
        -on-disk \
        /data/db

(I upped the raft timeouts, but I don't think it helped)

I'm getting the following logs for a single node after all nodes have restarted:

2020-12-05T16:21:50.006Z [INFO]  raft: entering follower state: follower="Node at [::]:4002 [Follower]" leader=
2020-12-05T16:21:55.668Z [INFO]  raft: duplicate requestVote for same term: term=16
2020-12-05T16:21:55.671Z [WARN]  raft: duplicate requestVote from: candidate=[::]:4002
2020-12-05T16:22:00.005Z [ERROR] raft: failed to appendEntries to: peer="{Voter [fdaa:0:33:a7b:85:0:3d3:2]:4002 [fdaa:0:33:a7b:85:0:3d3:2]:4002}" error="dial tcp [fdaa:0:33:a7b:85:0:3d3:2]:4002: i/o timeout"
2020-12-05T16:22:00.891Z [ERROR] raft: failed to heartbeat to: peer=[fdaa:0:33:a7b:85:0:3d3:2]:4002 error="dial tcp [fdaa:0:33:a7b:85:0:3d3:2]:4002: i/o timeout"
2020-12-05T16:22:10.022Z [ERROR] raft: failed to appendEntries to: peer="{Voter [fdaa:0:33:a7b:85:0:3d3:2]:4002 [fdaa:0:33:a7b:85:0:3d3:2]:4002}" error="dial tcp [fdaa:0:33:a7b:85:0:3d3:2]:4002: i/o timeout"
2020-12-05T16:22:11.849Z [ERROR] raft: failed to heartbeat to: peer=[fdaa:0:33:a7b:85:0:3d3:2]:4002 error="dial tcp [fdaa:0:33:a7b:85:0:3d3:2]:4002: i/o timeout"
[store] 2020/12/05 16:22:22 received request to join node at [fdaa:0:33:a7b:ab2:0:3d2:2]:4002
[store] 2020/12/05 16:22:23 received request to join node at [fdaa:0:33:a7b:85:0:3d3:2]:4002
[store] 2020/12/05 16:22:27 received request to join node at [fdaa:0:33:a7b:ab2:0:3d2:2]:4002
[store] 2020/12/05 16:22:28 received request to join node at [fdaa:0:33:a7b:85:0:3d3:2]:4002
[store] 2020/12/05 16:22:32 received request to join node at [fdaa:0:33:a7b:ab2:0:3d2:2]:4002
[store] 2020/12/05 16:22:34 received request to join node at [fdaa:0:33:a7b:85:0:3d3:2]:4002
[store] 2020/12/05 16:22:38 received request to join node at [fdaa:0:33:a7b:ab2:0:3d2:2]:4002
[store] 2020/12/05 16:22:39 received request to join node at [fdaa:0:33:a7b:85:0:3d3:2]:4002
[store] 2020/12/05 16:22:43 received request to join node at [fdaa:0:33:a7b:ab2:0:3d2:2]:4002
[store] 2020/12/05 16:22:44 received request to join node at [fdaa:0:33:a7b:85:0:3d3:2]:4002
[store] 2020/12/05 16:22:55 received request to join node at [fdaa:0:33:a7b:ab2:0:3d2:2]:4002
[store] 2020/12/05 16:22:56 received request to join node at [fdaa:0:33:a7b:85:0:3d3:2]:4002
[store] 2020/12/05 16:23:00 received request to join node at [fdaa:0:33:a7b:ab2:0:3d2:2]:4002
[store] 2020/12/05 16:23:01 received request to join node at [fdaa:0:33:a7b:85:0:3d3:2]:4002
[store] 2020/12/05 16:23:05 received request to join node at [fdaa:0:33:a7b:ab2:0:3d2:2]:4002
[store] 2020/12/05 16:23:07 received request to join node at [fdaa:0:33:a7b:85:0:3d3:2]:4002
[store] 2020/12/05 16:23:11 received request to join node at [fdaa:0:33:a7b:ab2:0:3d2:2]:4002
[store] 2020/12/05 16:23:12 received request to join node at [fdaa:0:33:a7b:85:0:3d3:2]:4002
[store] 2020/12/05 16:23:16 received request to join node at [fdaa:0:33:a7b:ab2:0:3d2:2]:4002
[store] 2020/12/05 16:23:17 received request to join node at [fdaa:0:33:a7b:85:0:3d3:2]:4002
2020-12-05T16:23:33.323Z [WARN]  raft: heartbeat timeout reached, starting election: last-leader=[::]:4002
2020-12-05T16:23:33.326Z [INFO]  raft: entering candidate state: node="Node at [::]:4002 [Candidate]" term=18
2020-12-05T16:23:33.330Z [INFO]  raft: duplicate requestVote for same term: term=18
2020-12-05T16:23:33.332Z [WARN]  raft: duplicate requestVote from: candidate=[::]:4002
2020-12-05T16:23:33.334Z [INFO]  raft: election won: tally=2
2020-12-05T16:23:33.336Z [INFO]  raft: entering leader state: leader="Node at [::]:4002 [Leader]"
2020-12-05T16:23:33.337Z [INFO]  raft: added peer, starting replication: peer=[fdaa:0:33:a7b:ab2:0:3d2:2]:4002
2020-12-05T16:23:33.339Z [INFO]  raft: added peer, starting replication: peer=[fdaa:0:33:a7b:85:0:3d3:2]:4002
2020-12-05T16:23:33.342Z [INFO]  raft: entering follower state: follower="Node at [::]:4002 [Follower]" leader=
2020-12-05T16:23:40.730Z [WARN]  raft: heartbeat timeout reached, starting election: last-leader=
2020-12-05T16:23:40.733Z [INFO]  raft: entering candidate state: node="Node at [::]:4002 [Candidate]" term=19
2020-12-05T16:23:40.738Z [INFO]  raft: duplicate requestVote for same term: term=19
2020-12-05T16:23:40.740Z [WARN]  raft: duplicate requestVote from: candidate=[::]:4002
2020-12-05T16:23:40.742Z [INFO]  raft: election won: tally=2
2020-12-05T16:23:40.743Z [INFO]  raft: entering leader state: leader="Node at [::]:4002 [Leader]"
2020-12-05T16:23:40.744Z [INFO]  raft: added peer, starting replication: peer=[fdaa:0:33:a7b:ab2:0:3d2:2]:4002
2020-12-05T16:23:40.746Z [INFO]  raft: added peer, starting replication: peer=[fdaa:0:33:a7b:85:0:3d3:2]:4002
2020-12-05T16:23:40.751Z [INFO]  raft: entering follower state: follower="Node at [::]:4002 [Follower]" leader=
2020-12-05T16:23:40.822Z [WARN]  raft: appendEntries rejected, sending older logs: peer="{Voter [fdaa:0:33:a7b:85:0:3d3:2]:4002 [fdaa:0:33:a7b:85:0:3d3:2]:4002}" next=14
2020-12-05T16:23:50.692Z [WARN]  raft: heartbeat timeout reached, starting election: last-leader=
2020-12-05T16:23:50.694Z [INFO]  raft: entering candidate state: node="Node at [::]:4002 [Candidate]" term=20
2020-12-05T16:23:50.698Z [INFO]  raft: duplicate requestVote for same term: term=20
2020-12-05T16:23:50.700Z [WARN]  raft: duplicate requestVote from: candidate=[::]:4002
2020-12-05T16:23:50.701Z [INFO]  raft: election won: tally=2
2020-12-05T16:23:50.702Z [INFO]  raft: entering leader state: leader="Node at [::]:4002 [Leader]"
2020-12-05T16:23:50.704Z [INFO]  raft: added peer, starting replication: peer=[fdaa:0:33:a7b:ab2:0:3d2:2]:4002
2020-12-05T16:23:50.706Z [INFO]  raft: added peer, starting replication: peer=[fdaa:0:33:a7b:85:0:3d3:2]:4002
2020-12-05T16:23:50.709Z [INFO]  raft: entering follower state: follower="Node at [::]:4002 [Follower]" leader=
2020-12-05T16:23:50.778Z [WARN]  raft: appendEntries rejected, sending older logs: peer="{Voter [fdaa:0:33:a7b:85:0:3d3:2]:4002 [fdaa:0:33:a7b:85:0:3d3:2]:4002}" next=14
2020-12-05T16:23:56.298Z [WARN]  raft: heartbeat timeout reached, starting election: last-leader=
2020-12-05T16:23:56.305Z [INFO]  raft: entering candidate state: node="Node at [::]:4002 [Candidate]" term=21
2020-12-05T16:23:56.310Z [INFO]  raft: duplicate requestVote for same term: term=21
2020-12-05T16:23:56.312Z [WARN]  raft: duplicate requestVote from: candidate=[::]:4002
2020-12-05T16:23:56.313Z [INFO]  raft: election won: tally=2
2020-12-05T16:23:56.314Z [INFO]  raft: entering leader state: leader="Node at [::]:4002 [Leader]"
2020-12-05T16:23:56.315Z [INFO]  raft: added peer, starting replication: peer=[fdaa:0:33:a7b:ab2:0:3d2:2]:4002
2020-12-05T16:23:56.317Z [INFO]  raft: added peer, starting replication: peer=[fdaa:0:33:a7b:85:0:3d3:2]:4002
2020-12-05T16:23:56.319Z [INFO]  raft: entering follower state: follower="Node at [::]:4002 [Follower]" leader=
2020-12-05T16:23:56.390Z [WARN]  raft: appendEntries rejected, sending older logs: peer="{Voter [fdaa:0:33:a7b:85:0:3d3:2]:4002 [fdaa:0:33:a7b:85:0:3d3:2]:4002}" next=14
2020-12-05T16:24:05.151Z [WARN]  raft: heartbeat timeout reached, starting election: last-leader=
2020-12-05T16:24:05.152Z [INFO]  raft: entering candidate state: node="Node at [::]:4002 [Candidate]" term=22
2020-12-05T16:24:05.160Z [INFO]  raft: duplicate requestVote for same term: term=22
2020-12-05T16:24:05.162Z [WARN]  raft: duplicate requestVote from: candidate=[::]:4002
2020-12-05T16:24:05.164Z [INFO]  raft: election won: tally=2
2020-12-05T16:24:05.165Z [INFO]  raft: entering leader state: leader="Node at [::]:4002 [Leader]"
2020-12-05T16:24:05.166Z [INFO]  raft: added peer, starting replication: peer=[fdaa:0:33:a7b:ab2:0:3d2:2]:4002
2020-12-05T16:24:05.168Z [INFO]  raft: added peer, starting replication: peer=[fdaa:0:33:a7b:85:0:3d3:2]:4002
2020-12-05T16:24:05.171Z [INFO]  raft: entering follower state: follower="Node at [::]:4002 [Follower]" leader=
2020-12-05T16:24:05.242Z [WARN]  raft: appendEntries rejected, sending older logs: peer="{Voter [fdaa:0:33:a7b:85:0:3d3:2]:4002 [fdaa:0:33:a7b:85:0:3d3:2]:4002}" next=14
2020-12-05T16:24:12.606Z [WARN]  raft: heartbeat timeout reached, starting election: last-leader=
2020-12-05T16:24:12.608Z [INFO]  raft: entering candidate state: node="Node at [::]:4002 [Candidate]" term=23
2020-12-05T16:24:12.614Z [INFO]  raft: duplicate requestVote for same term: term=23
2020-12-05T16:24:12.615Z [WARN]  raft: duplicate requestVote from: candidate=[::]:4002
2020-12-05T16:24:12.617Z [INFO]  raft: election won: tally=2
2020-12-05T16:24:12.618Z [INFO]  raft: entering leader state: leader="Node at [::]:4002 [Leader]"
2020-12-05T16:24:12.619Z [INFO]  raft: added peer, starting replication: peer=[fdaa:0:33:a7b:ab2:0:3d2:2]:4002
2020-12-05T16:24:12.621Z [INFO]  raft: added peer, starting replication: peer=[fdaa:0:33:a7b:85:0:3d3:2]:4002
2020-12-05T16:24:12.624Z [INFO]  raft: entering follower state: follower="Node at [::]:4002 [Follower]" leader=
2020-12-05T16:24:12.692Z [WARN]  raft: appendEntries rejected, sending older logs: peer="{Voter [fdaa:0:33:a7b:85:0:3d3:2]:4002 [fdaa:0:33:a7b:85:0:3d3:2]:4002}" next=14
2020-12-05T16:24:19.933Z [WARN]  raft: rejecting vote request since our last term is greater: candidate=[::]:4002 last-term=23 last-candidate-term=17
[store] 2020/12/05 16:24:25 received request to join node at [fdaa:0:33:a7b:ab2:0:3d2:2]:4002
2020-12-05T16:24:26.387Z [WARN]  raft: previous log term mis-match: ours=18 remote=24
[store] 2020/12/05 16:24:30 received request to join node at [fdaa:0:33:a7b:ab2:0:3d2:2]:4002
[store] 2020/12/05 16:24:35 received request to join node at [fdaa:0:33:a7b:85:0:3d3:2]:4002
2020-12-05T16:24:35.907Z [WARN]  raft: previous log term mis-match: ours=19 remote=25
[store] 2020/12/05 16:24:35 received request to join node at [fdaa:0:33:a7b:ab2:0:3d2:2]:4002
2020-12-05T16:24:38.173Z [INFO]  raft: duplicate requestVote for same term: term=26
2020-12-05T16:24:38.176Z [WARN]  raft: duplicate requestVote from: candidate=[::]:4002
2020-12-05T16:24:38.220Z [WARN]  raft: clearing log suffix: from=14 to=19
[store] 2020/12/05 16:24:40 received request to join node at [fdaa:0:33:a7b:85:0:3d3:2]:4002
[store] 2020/12/05 16:24:41 received request to join node at [fdaa:0:33:a7b:ab2:0:3d2:2]:4002
[store] 2020/12/05 16:24:45 received request to join node at [fdaa:0:33:a7b:85:0:3d3:2]:4002
[store] 2020/12/05 16:24:46 received request to join node at [fdaa:0:33:a7b:ab2:0:3d2:2]:4002
[store] 2020/12/05 16:24:51 received request to join node at [fdaa:0:33:a7b:85:0:3d3:2]:4002
[store] 2020/12/05 16:24:57 received request to join node at [fdaa:0:33:a7b:ab2:0:3d2:2]:4002
2020-12-05T16:24:57.707Z [WARN]  raft: heartbeat timeout reached, starting election: last-leader=[::]:4002
2020-12-05T16:24:57.711Z [INFO]  raft: entering candidate state: node="Node at [::]:4002 [Candidate]" term=27
2020-12-05T16:24:57.717Z [INFO]  raft: duplicate requestVote for same term: term=27
2020-12-05T16:24:57.719Z [WARN]  raft: duplicate requestVote from: candidate=[::]:4002
2020-12-05T16:24:57.720Z [INFO]  raft: election won: tally=2
2020-12-05T16:24:57.721Z [INFO]  raft: entering leader state: leader="Node at [::]:4002 [Leader]"
2020-12-05T16:24:57.722Z [INFO]  raft: added peer, starting replication: peer=[fdaa:0:33:a7b:ab2:0:3d2:2]:4002
2020-12-05T16:24:57.724Z [INFO]  raft: added peer, starting replication: peer=[fdaa:0:33:a7b:85:0:3d3:2]:4002
2020-12-05T16:24:57.726Z [INFO]  raft: entering follower state: follower="Node at [::]:4002 [Follower]" leader=
[store] 2020/12/05 16:25:01 received request to join node at [fdaa:0:33:a7b:85:0:3d3:2]:4002
[store] 2020/12/05 16:25:02 received request to join node at [fdaa:0:33:a7b:ab2:0:3d2:2]:4002
2020-12-05T16:25:03.130Z [WARN]  raft: heartbeat timeout reached, starting election: last-leader=
2020-12-05T16:25:03.133Z [INFO]  raft: entering candidate state: node="Node at [::]:4002 [Candidate]" term=28
2020-12-05T16:25:03.139Z [INFO]  raft: duplicate requestVote for same term: term=28
2020-12-05T16:25:03.141Z [WARN]  raft: duplicate requestVote from: candidate=[::]:4002
2020-12-05T16:25:03.142Z [INFO]  raft: election won: tally=2
2020-12-05T16:25:03.143Z [INFO]  raft: entering leader state: leader="Node at [::]:4002 [Leader]"
2020-12-05T16:25:03.144Z [INFO]  raft: added peer, starting replication: peer=[fdaa:0:33:a7b:ab2:0:3d2:2]:4002
2020-12-05T16:25:03.146Z [INFO]  raft: added peer, starting replication: peer=[fdaa:0:33:a7b:85:0:3d3:2]:4002
2020-12-05T16:25:03.148Z [INFO]  raft: entering follower state: follower="Node at [::]:4002 [Follower]" leader=
2020-12-05T16:25:03.221Z [WARN]  raft: appendEntries rejected, sending older logs: peer="{Voter [fdaa:0:33:a7b:85:0:3d3:2]:4002 [fdaa:0:33:a7b:85:0:3d3:2]:4002}" next=15
2020-12-05T16:25:09.179Z [WARN]  raft: heartbeat timeout reached, starting election: last-leader=
2020-12-05T16:25:09.182Z [INFO]  raft: entering candidate state: node="Node at [::]:4002 [Candidate]" term=29
2020-12-05T16:25:09.188Z [INFO]  raft: duplicate requestVote for same term: term=29
2020-12-05T16:25:09.190Z [WARN]  raft: duplicate requestVote from: candidate=[::]:4002
2020-12-05T16:25:09.191Z [ERROR] raft: failed to make requestVote RPC: target="{Voter [fdaa:0:33:a7b:85:0:3d3:2]:4002 [fdaa:0:33:a7b:85:0:3d3:2]:4002}" error=EOF
2020-12-05T16:25:09.194Z [INFO]  raft: election won: tally=2
2020-12-05T16:25:09.195Z [INFO]  raft: entering leader state: leader="Node at [::]:4002 [Leader]"
2020-12-05T16:25:09.196Z [INFO]  raft: added peer, starting replication: peer=[fdaa:0:33:a7b:ab2:0:3d2:2]:4002
2020-12-05T16:25:09.198Z [INFO]  raft: added peer, starting replication: peer=[fdaa:0:33:a7b:85:0:3d3:2]:4002
2020-12-05T16:25:09.200Z [ERROR] raft: failed to appendEntries to: peer="{Voter [fdaa:0:33:a7b:85:0:3d3:2]:4002 [fdaa:0:33:a7b:85:0:3d3:2]:4002}" error=EOF
2020-12-05T16:25:09.203Z [INFO]  raft: entering follower state: follower="Node at [::]:4002 [Follower]" leader=
2020-12-05T16:25:15.045Z [WARN]  raft: heartbeat timeout reached, starting election: last-leader=
2020-12-05T16:25:15.048Z [INFO]  raft: entering candidate state: node="Node at [::]:4002 [Candidate]" term=30
2020-12-05T16:25:15.054Z [INFO]  raft: duplicate requestVote for same term: term=30
2020-12-05T16:25:15.056Z [WARN]  raft: duplicate requestVote from: candidate=[::]:4002
2020-12-05T16:25:15.057Z [INFO]  raft: election won: tally=2
2020-12-05T16:25:15.058Z [INFO]  raft: entering leader state: leader="Node at [::]:4002 [Leader]"
2020-12-05T16:25:15.060Z [INFO]  raft: added peer, starting replication: peer=[fdaa:0:33:a7b:ab2:0:3d2:2]:4002
2020-12-05T16:25:15.061Z [INFO]  raft: added peer, starting replication: peer=[fdaa:0:33:a7b:85:0:3d3:2]:4002
2020-12-05T16:25:15.064Z [INFO]  raft: entering follower state: follower="Node at [::]:4002 [Follower]" leader=
2020-12-05T16:25:20.928Z [WARN]  raft: heartbeat timeout reached, starting election: last-leader=
2020-12-05T16:25:20.931Z [INFO]  raft: entering candidate state: node="Node at [::]:4002 [Candidate]" term=31
2020-12-05T16:25:20.936Z [INFO]  raft: duplicate requestVote for same term: term=31
2020-12-05T16:25:20.937Z [WARN]  raft: duplicate requestVote from: candidate=[::]:4002
2020-12-05T16:25:20.940Z [INFO]  raft: election won: tally=2
2020-12-05T16:25:20.941Z [INFO]  raft: entering leader state: leader="Node at [::]:4002 [Leader]"
2020-12-05T16:25:20.943Z [INFO]  raft: added peer, starting replication: peer=[fdaa:0:33:a7b:ab2:0:3d2:2]:4002
2020-12-05T16:25:20.945Z [INFO]  raft: added peer, starting replication: peer=[fdaa:0:33:a7b:85:0:3d3:2]:4002
2020-12-05T16:25:20.947Z [INFO]  raft: entering follower state: follower="Node at [::]:4002 [Follower]" leader=
2020-12-05T16:25:25.001Z [INFO]  raft: duplicate requestVote for same term: term=31
2020-12-05T16:25:25.003Z [WARN]  raft: duplicate requestVote from: candidate=[::]:4002
2020-12-05T16:25:25.027Z [WARN]  raft: clearing log suffix: from=15 to=19
2020-12-05T16:25:25.057Z [ERROR] raft: failed to make requestVote RPC: target="{Voter [fdaa:0:33:a7b:85:0:3d3:2]:4002 [fdaa:0:33:a7b:85:0:3d3:2]:4002}" error="dial tcp [fdaa:0:33:a7b:85:0:3d3:2]:4002: i/o timeout"
2020-12-05T16:25:25.064Z [ERROR] raft: failed to appendEntries to: peer="{Voter [fdaa:0:33:a7b:85:0:3d3:2]:4002 [fdaa:0:33:a7b:85:0:3d3:2]:4002}" error="dial tcp [fdaa:0:33:a7b:85:0:3d3:2]:4002: i/o timeout"

It appears to be stuck in a loop, repeatedly winning elections and then immediately stepping back down to follower, never settling on a stable leader.

Here's lsof showing this node is listening on the right ports and has established connections:

rqlited 501 root    7u  IPv6    542      0t0  TCP *:4002 (LISTEN)
rqlited 501 root    8u  IPv6    860      0t0  TCP fly-local-6pn:47490->[fdaa:0:33:a7b:85:0:3d3:2]:4002 (ESTABLISHED)
rqlited 501 root    9u  IPv6    882      0t0  TCP fly-local-6pn:4002->[fdaa:0:33:a7b:85:0:3d3:2]:42576 (ESTABLISHED)
rqlited 501 root   12u  IPv6    863      0t0  TCP fly-local-6pn:47492->[fdaa:0:33:a7b:85:0:3d3:2]:4002 (ESTABLISHED)
rqlited 501 root   13u  IPv6    596      0t0  TCP [::1]:59110->[::1]:4002 (ESTABLISHED)
rqlited 501 root   14u  IPv6    597      0t0  TCP [::1]:4002->[::1]:59110 (ESTABLISHED)
rqlited 501 root   15u  IPv6    562      0t0  TCP *:4001 (LISTEN)
rqlited 501 root   16u  IPv6    884      0t0  TCP fly-local-6pn:4002->[fdaa:0:33:a7b:85:0:3d3:2]:42580 (ESTABLISHED)
rqlited 501 root   17u  IPv6    866      0t0  TCP fly-local-6pn:4001->[fdaa:0:33:a7b:85:0:3d3:2]:40416 (ESTABLISHED)
rqlited 501 root   20u  IPv6    872      0t0  TCP fly-local-6pn:4001->[fdaa:0:33:a7b:85:0:3d3:2]:40420 (ESTABLISHED)
rqlited 501 root   23u  IPv6    876      0t0  TCP fly-local-6pn:4001->[fdaa:0:33:a7b:85:0:3d3:2]:40426 (ESTABLISHED)

Any ideas what's causing this?


Philip O'Toole

Dec 6, 2020, 12:08:23 PM
to rql...@googlegroups.com
What version are you running? On what OS?

More inline.

On Sat, Dec 5, 2020 at 11:29 AM Jérôme Gravel-Niquet <jero...@gmail.com> wrote:
If you don't mind, I'll use this thread for questions and issues?

I was finally able to make rqlite work on our platform, but I had to start a single node, wait for it to become leader and then start the 2 other nodes. I thought raft could handle multiple nodes coming up at the same time?

What do you actually mean by this? Just so we're all clear what we're talking about, can you show me the 3 launch commands you expect to work if all three are executed at the same time? Once I can see that, I can discuss some more.
 

However, I'm unable to restart my nodes and return to a good state afterwards. I'm restarting them 1 by 1, but maybe too fast? 

This is my rqlited command:

rqlited \
        -disco-id $DISCO_ID \

I see you're using the discovery service. Maybe something funky is going on there. Why did you decide to use the disco service?

Jérôme Gravel-Niquet

Dec 6, 2020, 3:15:53 PM
to rql...@googlegroups.com
Responses inline.


On Sun, Dec 6, 2020 at 12:08 PM 'Philip O'Toole' via rqlite <rql...@googlegroups.com> wrote:
What version are you running? On what OS?

 
Latest version (5.6.0)
 
More inline.

On Sat, Dec 5, 2020 at 11:29 AM Jérôme Gravel-Niquet <jero...@gmail.com> wrote:
If you don't mind, I'll use this thread for questions and issues?

I was finally able to make rqlite work on our platform, but I had to start a single node, wait for it to become leader and then start the 2 other nodes. I thought raft could handle multiple nodes coming up at the same time?

What do you actually mean by this? Just so we're all clear what we're talking about, can you show me the 3 launch commands you expect to work if all three are executed at the same time? Once I can see that, I can discuss some more.

The 3 commands are identical to the one I pasted. The discovery ID doesn't change between them, just the PRIVATE_NET_IP.
 
 

However, I'm unable to restart my nodes and return to a good state afterwards. I'm restarting them 1 by 1, but maybe too fast? 

This is my rqlited command:

rqlited \
        -disco-id $DISCO_ID \

I see you're using the discovery service. Maybe something funky is going on there. Why did you decide to use the disco service?

To simplify! I'm launching this service on our own platform (fly.io), and the IPs are not known until the instances have booted, so we can't hard-code join addresses, or even use DNS names, because we don't know the instance IDs we'll get.

However, now that I've booted my 3 servers, the IPs have stabilized. They are static even through restarts, so maybe I could hard-code them in my docker entrypoint.
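
Something like this in the entrypoint, perhaps (a rough sketch; JOIN_ADDR would be a hypothetical variable we set to the first node's private IP on all but the first node, which bootstraps the cluster):

rqlited \
        -http-addr "0.0.0.0:4001" -http-adv-addr "[$PRIVATE_NET_IP]:4001" \
        -raft-addr "0.0.0.0:4002" -raft-adv-addr "[$PRIVATE_NET_IP]:4002" \
        -on-disk \
        ${JOIN_ADDR:+-join "http://[$JOIN_ADDR]:4001"} \
        /data/db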
 

Philip O'Toole

Dec 6, 2020, 4:12:06 PM
to rql...@googlegroups.com
This sounds like it might be some issue related to the Discovery service. Are you using the one hosted in the cloud, or one you deployed yourself?

Jérôme Gravel-Niquet

Dec 6, 2020, 4:17:43 PM
to rql...@googlegroups.com
Using yours (the cloud-hosted one) to begin with. I'm not yet sure how to self-host.

Philip O'Toole

Dec 7, 2020, 8:52:45 AM
to rql...@googlegroups.com
Hmmm, I wonder if it's something with the Disco service, as I haven't tested it deeply. It's always possible there is some interaction between it and a restarting cluster that I haven't accounted for.

