Problems joining RAFT Cluster using TCP


Martin Tauber

Apr 24, 2025, 7:17:45 AM
to jgroups-raft
Hi everybody,

I hope this is the right group to address my issue.

I am using jgroups-raft to form a cluster of three nodes: uvuyo1, uvuyo2 and uvuyo3. When I try to fire up my second node (uvuyo2) to join the cluster, I get the following error message in the log file:

java.lang.IllegalStateException: raft-id uvuyo2 is not listed in members [uvuyo1]
org.jgroups.protocols.raft.RAFT.start(RAFT.java:573) ~[jgroups-raft-1.0.14.Final.jar!/:na]
org.jgroups.stack.ProtocolStack.startStack(ProtocolStack.java:908) ~[jgroups-5.4.5.Final.jar!/:5.4.5.Final]
...

which looks like the cluster thinks it only has one member (uvuyo1), while the node I am starting is uvuyo2.

The problem I have is:
I configured the members list as uvuyo1,uvuyo2,uvuyo3 on all nodes. I can't post the XML since the protocol configuration is done in Java. (The cluster works in my development environment, but not in pre-production ...) so I assume there is nothing wrong with the configuration done in Java.

When I look into the logs, the first thing that happens is that the state is being rebuilt - so far so good.

In my development environment I then normally see the connection to the JGroups cluster ...

-------------------------------------------------------------------
GMS: address=gateway2, cluster=uvuyo-gateway-2yetis, physical address=192.168.2.42:7801
-------------------------------------------------------------------

I don't see this in production.

So my assumption would be that this error can only come from the local configuration of the second node - since it never really contacted the first node or started the JGroups cluster. But that looks fine ...

Any ideas are welcome!
Thanks
Martin

Martin Tauber

Apr 24, 2025, 7:35:53 AM
to jgroups-raft
Here is some additional information.

(1) The code that builds the protocol stack ...


private Protocol[] getProtocolStack() {
    log.info("Using cluster protocol '" + this.protocol + "'.");

    List<String> membersList;
    if (members == null || "".equals(members)) {
        membersList = List.of(nodeId);
    } else {
        membersList = List.of(members.split(","));
    }

    log.info("Using cluster members '" + membersList + "'.");

    TP tp;
    Protocol ping;

    if (protocol.equals("tcp")) {
        log.info("Using port '" + this.port + "'.");
        log.info("Using port range '" + portRange + "'.");
        log.info("Using bootstrap server '" + bootstrapServers + "'.");

        List<IpAddress> initialHosts = getInitialHosts(bootstrapServers);

        for (IpAddress ipAddress : initialHosts) {
            log.info("Contacting node using address '" + ipAddress.printIpAddress() + "'.");
        }

        tp = new TCP();
        tp.setBindPort(port);

        ping = new TCPPING()
                .setValue("initial_hosts", initialHosts)
                .setValue("port_range", portRange);
    } else {
        tp = new UDP();
        ping = new PING();
    }

    if (!"".equals(externalAddress)) {
        log.info("Using external address '" + externalAddress + "'.");

        try {
            tp.setExternalAddr(InetAddress.getByName(externalAddress));
        } catch (UnknownHostException e) {
            log.error("External address '" + externalAddress + "' was not found.");
        }
    }

    RAFT raft = new RAFT()                            // Raft consensus protocol
            .members(membersList)                     // static membership list
            .setValue("raft_id", nodeId)
            .setValue("log_dir", uvuyoHome + File.separator + "data")
            .setValue("log_prefix", "memdb." + nodeId)
            .setValue("max_log_size", maxLogSize)
            .setValue("log_class", "org.jgroups.protocols.raft.FileBasedLog");

    return new Protocol[]{
            tp,
            ping,                    // Discovery protocol
            new MERGE3(),            // Handles cluster splits
            new FD_SOCK(),           // Failure detection based on nodes connecting to neighbors
            new FD_ALL()             // Failure detection based on heartbeats
                    .setValue("timeout_check_interval", failureDetectionTimeout),
            new VERIFY_SUSPECT(),    // After detecting a failure, check whether the node is really gone
            new NAKACK2(),           // Reliable FIFO multicast message transfer
            new UNICAST3(),          // Reliable FIFO unicast message transfer
            new STABLE(),            // Garbage-collects delivered messages
            new GMS(),               // Handles group membership
            new UFC(),               // Unicast flow control
            new MFC(),               // Multicast flow control
            new FRAG2(),             // Fragments large messages
            new BARRIER(),           // Needed by state transfer
            new STATE_TRANSFER(),    // Allows state to be transferred between nodes
            new ELECTION(),          // Raft leader election
            raft,
            new REDIRECT(),          // Redirects requests to the Raft leader
            new CLIENT()             // Accepts external client requests (e.g. membership changes)
    };
}
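
For reference, this protocol array is then wrapped in a channel roughly like this (a simplified sketch, not the exact startup code; clusterName stands for the real cluster name, e.g. "uvuyo-gateway-2yetis"):

JChannel joinCluster(String clusterName) throws Exception {
    return new JChannel(getProtocolStack())   // programmatic stack instead of an XML config
            .connect(clusterName);            // joins (or forms) the cluster
}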


And here is the log output to show the values ...

2025-04-24T11:42:39.350+02:00  INFO 1070980 --- [           main] n.t.uvuyo.cluster.ClusterManager         : Using cluster protocol 'tcp'.
2025-04-24T11:42:39.350+02:00  INFO 1070980 --- [           main] n.t.uvuyo.cluster.ClusterManager         : Using cluster members '[uvuyo1, uvuyo2, uvuyo3]'.
2025-04-24T11:42:39.350+02:00  INFO 1070980 --- [           main] n.t.uvuyo.cluster.ClusterManager         : Using port '7800'.
2025-04-24T11:42:39.350+02:00  INFO 1070980 --- [           main] n.t.uvuyo.cluster.ClusterManager         : Using port range '0'.
2025-04-24T11:42:39.350+02:00  INFO 1070980 --- [           main] n.t.uvuyo.cluster.ClusterManager         : Using bootstrap server 'bmchlx-stg-tool1:7800,bmchlx-stg-tool2:7800,bmchlx-stg-tool3:7800'.
2025-04-24T11:42:39.358+02:00  INFO 1070980 --- [           main] n.t.uvuyo.cluster.ClusterManager         : Contacting node using address '10.144.19.190:7800'.
2025-04-24T11:42:39.358+02:00  INFO 1070980 --- [           main] n.t.uvuyo.cluster.ClusterManager         : Contacting node using address '10.144.24.85:7800'.
2025-04-24T11:42:39.358+02:00  INFO 1070980 --- [           main] n.t.uvuyo.cluster.ClusterManager         : Contacting node using address '10.144.21.127:7800'.
2025-04-24T11:42:39.584+02:00  INFO 1070980 --- [           main] n.t.uvuyo.cluster.ClusterManager         : Joining RAFT cluster 'uvuyo-gateway-2yetis' ...
2025-04-24T11:42:39.587+02:00  INFO 1070980 --- [           main] org.jgroups.JChannel                     : local_addr: uvuyo2, name: bmchlx-stg-tool2-42215

José Bolina

Apr 24, 2025, 8:59:34 AM
to Martin Tauber, jgroups-raft
Hey, Martin, thanks for reporting.

Can you enable trace logging for RAFT? Something like `.level("trace")` should do the trick.
During protocol initialization, RAFT reads the previous log. Setting trace level will help identify whether a file in that log directory is changing the member list.
Another possibility is to start node "uvuyo1" and check with JMX if the `RAFT.members` list contains the correct value. This should only be different if there were commands to change membership.
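
For the programmatic stack you posted, that could look roughly like this (just a sketch of the idea, adapt it to your builder code):

// Sketch only: enable trace logging on the RAFT protocol when building the stack
RAFT raft = new RAFT().members(membersList);
raft.level("trace");   // traces what RAFT reads from its persisted log during start()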


On a side note, the stack shouldn't need `BARRIER` and `STATE_TRANSFER`. RAFT has state transfer embedded in the algorithm.


Cheers,




--
José Bolina

Bela Ban

Apr 24, 2025, 9:13:39 AM
to jgroup...@googlegroups.com
Note that you could also use `probe.sh RAFT.members,raft_id` to print these 2 attributes of all cluster members.

Martin Tauber

Apr 24, 2025, 9:45:26 AM
to jgroups-raft
Thanks for the quick response ... what I found out in the meantime is that the members seem to be stored in the state file, and it looks like this is why I am getting this error ... I'll try to gather more information ...

Bela Ban

Apr 24, 2025, 9:49:02 AM
to jgroup...@googlegroups.com
The members are stored in the persistent state because we want to be able to dynamically add/remove members from the 'members' list [1]. Adding / removing a member is nothing other than a state modification, and is thus performed like any other state-changing operation.

[1] https://belaban.github.io/jgroups-raft/manual/index.html#DynamicMembership
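
Conceptually, a dynamic membership change per [1] looks roughly like this (a sketch only, with illustrative names; it has to go through the current leader and requires a quorum of the existing members to commit):

// Sketch (method names are my reading of [1]; verify against the manual):
// each change is appended to the Raft log and committed like any other entry.
void growCluster(JChannel channel) throws Exception {
    RAFT raft = (RAFT) channel.getProtocolStack().findProtocol(RAFT.class);
    raft.addServer("uvuyo2").get();   // add one node at a time, waiting for each commit
    raft.addServer("uvuyo3").get();
}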

Martin Tauber

Apr 24, 2025, 9:59:51 AM
to jgroups-raft
I wonder if this is the issue. We started with one node and then decided to create a "real" cluster. Therefore we shut down the node and added the two other members in the configuration file. After that we started the node again. I would think that the node originally had the majority (since it was only one node ...) and I wonder whether during startup the two nodes are then added one by one to the new cluster (even though it no longer has a majority ... at least after adding the first node ...).

The idea was then to start the second node and then the third node ... could this be the issue?

Bela Ban

Apr 24, 2025, 11:46:14 AM
to jgroup...@googlegroups.com
Yes. If you want to add / remove nodes from the 'static' membership, then follow the instructions in [1] plus update the configs.

Bela Ban

Apr 24, 2025, 11:50:39 AM
to jgroup...@googlegroups.com
In addition, nodes #2 and #3 would never get a majority because the log has only node #1. You need to add them via [1].

@Jose: I'm wondering if we can't add some sanity checking code which tries to reconcile configuration and real state in the log, and throws an exception if this fails?


Martin Tauber

Apr 24, 2025, 11:58:47 AM
to jgroups-raft
Thanks for the info. I'm quite new to this, so I was naively hoping that when I change the members on all nodes it would just work ;) I will start the three servers from scratch with the members correctly configured ...

José Bolina

Apr 24, 2025, 12:24:34 PM
to Martin Tauber, jgroups-raft
These operations should be used sparingly. They are intended for maintenance and require a quorum to commit the new configuration.
Running the cluster with all 3 members from the start is the intended approach. The cluster becomes ready to accept commands once a quorum is live.


@Bela +1 on adding some validations around these. I'll think a little bit on this and start a discussion.



--
José Bolina

Martin Tauber

Apr 30, 2025, 3:54:02 AM
to jgroups-raft
Hi everybody,

I just wanted to confirm that this was the issue. The cluster was created with one node and then the configuration was changed without announcing the new nodes to the cluster.

Thanks for your help!
Kind Regards
Martin