Getting up and running, hitting "unknown rpc type 80"

Geoffrey Thorpe

May 21, 2021, 5:53:34 PM
to rqlite
Hi all

First off: rqlite looks awesome. I'm hacking up something where sqlite is The Right Thing To Use(tm) but then started head-scratching over how to eliminate SPOF (single point of failure). It didn't take me much searching to find rqlite.

The scenario:
  • I'm using the v5.12.1 binaries from GitHub inside containers. However, I'm not using the Docker Hub (rqlite-docker) images, as I need to run all of this on a particular platform baseline, so I install rqlite into my baseline images by following the same incantation I found in the rqlite-docker/5.12.1 source (roughly the sketch shown after this list).
  • I'm starting two nodes, one at a time: rqlite-node-001 and rqlite-node-006. Both have fairly standard command lines, as the log messages below show.
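
(For reference, the install step amounts to roughly the following. I'm reconstructing the release URL pattern from the rqlite-docker source as best I recall it, so treat this as a sketch rather than gospel:)

$ curl -L https://github.com/rqlite/rqlite/releases/download/v5.12.1/rqlite-v5.12.1-linux-amd64.tar.gz | tar xzf -
$ cp rqlite-v5.12.1-linux-amd64/rqlited rqlite-v5.12.1-linux-amd64/rqlite /usr/local/bin/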
Here are the relevant snippets from the two log files:

-- the first node --
[rqlited] 2021/05/21 21:25:45 rqlited starting, version v5.12.1, commit 181761d953fad4ce05d52ceeee70718be18cbee1, branch master
[rqlited] 2021/05/21 21:25:45 go1.15.7, target architecture is amd64, operating system target is linux
[rqlited] 2021/05/21 21:25:45 launch command: rqlited -node-id rqlite-node-001 -http-addr 0.0.0.0:4001 -raft-addr 0.0.0.0:4002 -join  -on-disk /rqlite/file/data
[rqlited] 2021/05/21 21:25:45 no preexisting node state detected in /rqlite/file/data, node may be bootstrapping
[rqlited] 2021/05/21 21:25:45 no join addresses set
[store] 2021/05/21 21:25:45 opening store with node ID rqlite-node-001
[store] 2021/05/21 21:25:45 ensuring directory at /rqlite/file/data exists
[store] 2021/05/21 21:25:45 0 preexisting snapshots present
[store] 2021/05/21 21:25:45 first log index: 0, last log index: 0, last command log index: 0:
2021-05-21T21:25:45.731Z [INFO]  raft: initial configuration: index=0 servers=[]
[store] 2021/05/21 21:25:45 executing new cluster bootstrap
2021-05-21T21:25:45.731Z [INFO]  raft: entering follower state: follower="Node at [::]:4002 [Follower]" leader=
2021-05-21T21:25:47.430Z [WARN]  raft: heartbeat timeout reached, starting election: last-leader=
2021-05-21T21:25:47.430Z [INFO]  raft: entering candidate state: node="Node at [::]:4002 [Candidate]" term=2
2021-05-21T21:25:47.445Z [INFO]  raft: election won: tally=1
2021-05-21T21:25:47.445Z [INFO]  raft: entering leader state: leader="Node at [::]:4002 [Leader]"
[store] 2021/05/21 21:25:47 waiting for up to 2m0s for application of initial logs
[http] 2021/05/21 21:25:47 service listening on [::]:4001
[rqlited] 2021/05/21 21:25:47 node is ready
2021-05-21T21:26:04.824Z [ERROR] raft-net: failed to decode incoming command: error="unknown rpc type 80"
2021-05-21T21:26:09.827Z [ERROR] raft-net: failed to decode incoming command: error="unknown rpc type 80"
2021-05-21T21:26:14.835Z [ERROR] raft-net: failed to decode incoming command: error="unknown rpc type 80"
[etc]

-- the second node, which is trying to "join" the first --
[rqlited] 2021/05/21 21:26:04 rqlited starting, version v5.12.1, commit 181761d953fad4ce05d52ceeee70718be18cbee1, branch master
[rqlited] 2021/05/21 21:26:04 go1.15.7, target architecture is amd64, operating system target is linux
[rqlited] 2021/05/21 21:26:04 launch command: rqlited -node-id rqlite-node-006 -http-addr 0.0.0.0:4001 -raft-addr 0.0.0.0:4002 -join rqlite-node-001:4002 -on-disk /rqlite/file/data
[rqlited] 2021/05/21 21:26:04 no preexisting node state detected in /rqlite/file/data, node may be bootstrapping
[rqlited] 2021/05/21 21:26:04 join addresses specified, node is not bootstrapping
[store] 2021/05/21 21:26:04 opening store with node ID rqlite-node-006
[store] 2021/05/21 21:26:04 ensuring directory at /rqlite/file/data exists
[store] 2021/05/21 21:26:04 0 preexisting snapshots present
[store] 2021/05/21 21:26:04 first log index: 0, last log index: 0, last command log index: 0:
2021-05-21T21:26:04.823Z [INFO]  raft: initial configuration: index=0 servers=[]
[store] 2021/05/21 21:26:04 no cluster bootstrap requested
[rqlited] 2021/05/21 21:26:04 join addresses are: [rqlite-node-001:4002]
2021-05-21T21:26:04.823Z [INFO]  raft: entering follower state: follower="Node at [::]:4002 [Follower]" leader=
[cluster-join] 2021/05/21 21:26:04 failed to join cluster at [rqlite-node-001:4002]: Post "http://rqlite-node-001:4002/join": EOF, sleeping 5s before retry
2021-05-21T21:26:05.901Z [WARN]  raft: no known peers, aborting election
[cluster-join] 2021/05/21 21:26:09 failed to join cluster at [rqlite-node-001:4002]: Post "http://rqlite-node-001:4002/join": EOF, sleeping 5s before retry
[cluster-join] 2021/05/21 21:26:14 failed to join cluster at [rqlite-node-001:4002]: Post "http://rqlite-node-001:4002/join": EOF, sleeping 5s before retry
[etc]

Judging from some search results, the typical explanation for the "raft-net: failed to decode [...]" errors is confusing the data and cluster ports, but I don't think I've done that here. Perhaps there's some dependency on RPC configuration in the underlying OS? I'll keep digging, but if this is obvious to someone, thanks in advance for any help.

Best regards
Geoff

Geoffrey Thorpe

May 21, 2021, 6:17:43 PM
to rqlite
Small update: I was able to reproduce the issue directly with the rqlite/rqlite image from Docker Hub.

First:
$ docker network create foobar

Then in one shell:
$ docker run --name=rqlite1 --network=foobar rqlite/rqlite

And once that's started, run the following in another shell:
$ docker run --name=rqlite2 --network=foobar rqlite/rqlite -join rqlite1:4002

This produces the same error messages as in my previous post, and should be much easier for others to reproduce. (Apologies for not trying this first; it must be Friday.)

Regards,
Geoff

Philip O'Toole

May 21, 2021, 10:57:27 PM
to rql...@googlegroups.com
Make sure you pass the HTTP API address of the other node to -join, not the Raft address of the other node. I suspect the Raft address is what you're passing. (That would also explain the "unknown rpc type 80" errors: 80 is 0x50, the ASCII code for 'P', i.e. the first byte of an HTTP POST arriving on the Raft port.)


Philip O'Toole

May 21, 2021, 10:58:55 PM
to rql...@googlegroups.com
Your second command should probably be:

rqlited -node-id rqlite-node-006 -http-addr 0.0.0.0:4001 -raft-addr 0.0.0.0:4002 -join rqlite-node-001:4001 -on-disk /rqlite/file/data

(4001 instead of 4002)
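
(One way to confirm the join took, hedged as a suggestion: query rqlite's /status diagnostics endpoint on either node, e.g.)

$ curl 'http://rqlite-node-001:4001/status?pretty'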

Geoffrey Thorpe

May 22, 2021, 11:57:27 AM
to rqlite
Hey Philip, thanks for responding. Sorry for my confusion there, and thanks for clarifying how it's supposed to work. Makes intuitive sense now that I think a little harder about the cluster-control problem.

In my defense though... :-) ... I had already tried what you suggested, and it also failed with a fairly dire-looking error, which made it easier for me to fall into the wrong assumption. I just retried per your suggestion in a freshly-minted GCE VM, and I still see the same error in that vanilla environment. Here are the steps, in case the problem is still in my own (mis)understanding, and/or in case anyone wants to try.

[...apt-based install of docker.io, add user to the "docker" group, etc...]

In shell A:
$ docker network create foobar
$ docker run --rm --name=rqlite1 --network=foobar rqlite/rqlite

In shell B:
$ docker run --rm --name=rqlite2 --network=foobar rqlite/rqlite -join rqlite1:4001

That's pretty simple, so I'm curious whether you think the above _should_ work, or whether you can suggest a corrected but equally simple variant. (FWIW, a couple of reference command lines of this sort would be really helpful in the README for rqlite-docker and/or on the Docker Hub page for rqlite/rqlite.)

The first instance starts up OK. This is its output prior to starting the second instance:
[rqlited] 2021/05/22 15:39:03 rqlited starting, version v5.12.1, commit 181761d953fad4ce05d52ceeee70718be18cbee1, branch master
[rqlited] 2021/05/22 15:39:03 go1.15.7, target architecture is amd64, operating system target is linux
[rqlited] 2021/05/22 15:39:03 launch command: rqlited -http-addr 0.0.0.0:4001 -raft-addr 0.0.0.0:4002 /rqlite/file/data
[rqlited] 2021/05/22 15:39:03 no preexisting node state detected in /rqlite/file/data, node may be bootstrapping
[rqlited] 2021/05/22 15:39:03 no join addresses set
[store] 2021/05/22 15:39:03 opening store with node ID 0.0.0.0:4002
[store] 2021/05/22 15:39:03 ensuring directory at /rqlite/file/data exists
[store] 2021/05/22 15:39:03 0 preexisting snapshots present
[store] 2021/05/22 15:39:03 first log index: 0, last log index: 0, last command log index: 0:
2021-05-22T15:39:03.522Z [INFO]  raft: initial configuration: index=0 servers=[]
[store] 2021/05/22 15:39:03 executing new cluster bootstrap
2021-05-22T15:39:03.522Z [INFO]  raft: entering follower state: follower="Node at [::]:4002 [Follower]" leader=
2021-05-22T15:39:05.521Z [WARN]  raft: heartbeat timeout reached, starting election: last-leader=
2021-05-22T15:39:05.521Z [INFO]  raft: entering candidate state: node="Node at [::]:4002 [Candidate]" term=2
2021-05-22T15:39:05.525Z [INFO]  raft: election won: tally=1
2021-05-22T15:39:05.525Z [INFO]  raft: entering leader state: leader="Node at [::]:4002 [Leader]"
[store] 2021/05/22 15:39:05 waiting for up to 2m0s for application of initial logs
[http] 2021/05/22 15:39:05 service listening on [::]:4001
[rqlited] 2021/05/22 15:39:05 node is ready

When I start the second instance, the first instance starts repeating the following:
[store] 2021/05/22 15:42:03 received request to join node at 0.0.0.0:4002
[store] 2021/05/22 15:42:03 failed to remove node: need at least one voter in configuration: {[]}
[store] 2021/05/22 15:42:09 received request to join node at 0.0.0.0:4002
[store] 2021/05/22 15:42:09 failed to remove node: need at least one voter in configuration: {[]}
[store] 2021/05/22 15:42:14 received request to join node at 0.0.0.0:4002
[store] 2021/05/22 15:42:14 failed to remove node: need at least one voter in configuration: {[]}

And here is the output of the second instance:
[store] 2021/05/22 15:42:03 first log index: 0, last log index: 0, last command log index: 0:
2021-05-22T15:42:03.997Z [INFO]  raft: initial configuration: index=0 servers=[]
[store] 2021/05/22 15:42:03 no cluster bootstrap requested
[rqlited] 2021/05/22 15:42:03 join addresses are: [rqlite1:4001]
2021-05-22T15:42:03.997Z [INFO]  raft: entering follower state: follower="Node at [::]:4002 [Follower]" leader=
[cluster-join] 2021/05/22 15:42:03 failed to join cluster at [rqlite1:4001]: failed to join, node returned: 500 Internal Server Error: (need at least one voter in configuration: {[]}), sleeping 5s before retry
2021-05-22T15:42:05.817Z [WARN]  raft: no known peers, aborting election
[cluster-join] 2021/05/22 15:42:09 failed to join cluster at [rqlite1:4001]: failed to join, node returned: 500 Internal Server Error: (need at least one voter in configuration: {[]}), sleeping 5s before retry
[cluster-join] 2021/05/22 15:42:14 failed to join cluster at [rqlite1:4001]: failed to join, node returned: 500 Internal Server Error: (need at least one voter in configuration: {[]}), sleeping 5s before retry

BTW, this is the first time I've tried to run this in an un-hacked-up environment, and there have been plenty of other moving parts up till now, so one thing I haven't looked at is trying different rqlite versions. I figured "rqlite/rqlite" on Docker Hub was likely to map to a stable release, but if I'm unwittingly running a dev snapshot, or should check other versions for any reason, I'd be thankful for suggestions.

Best,
Geoff

Geoffrey Thorpe

May 24, 2021, 11:20:35 AM
to rqlite
OK, I figured it out: the -http-addr and -raft-addr args seem to be necessary in all circumstances. I had the impression that they defaulted to <hostname>:4001/4002 (when listening on 0.0.0.0:4001/4002), as is the case with the default node-id, which would also seem the most natural, plug-and-play choice here. But they appear to default to localhost:4001/4002 instead. That certainly seems like non-obvious behavior to me, though I haven't considered all the ramifications.
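
For the record, explicit addresses along the following lines got my two nodes clustering. (node1/node2 and the data paths are placeholders, and per Philip's correction, -join takes the other node's HTTP port; this is a sketch of what worked for me, not canonical usage.)

$ rqlited -node-id node1 -http-addr node1:4001 -raft-addr node1:4002 /path/to/data1
$ rqlited -node-id node2 -http-addr node2:4001 -raft-addr node2:4002 -join node1:4001 /path/to/data2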

BTW, once I had a cluster running and a table with a couple of entries in it, I had some "fun" convincing curl to play follow-the-leader. ("-L --post301" were the main tricks to find, though I needed "-D foobar" first in order to figure it out.)
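
(For anyone else hitting this, a sketch of the kind of write I mean, where node2 is a follower and foo is a placeholder table; the 301 redirect sends the POST on to the leader:)

$ curl -L --post301 -XPOST 'http://node2:4001/db/execute?pretty' -H 'Content-Type: application/json' -d '["INSERT INTO foo(name) VALUES(\"fiona\")"]'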

I'd be happy to submit a PR to a README.md or whatever, if you (a) are interested in having some noob command lines for this kind of usage, and (b) can point me to where you'd most like them to live.

Thanks
Geoff

Philip O'Toole

May 24, 2021, 11:47:44 AM
to rql...@googlegroups.com
OK, glad you got it sorted out. Yes, those two params default to localhost, see:


I guess defaulting to the hostname might be reasonable too. Let me think about it for release 6.

Yes, PRs are welcome. In this case, I would propose your update here:

