Send/receive message passing or atomic communication in distributed systems

Jack Vanlightly

Feb 23, 2024, 2:11:49 AM

to tlaplus

Hi all,

I was going over the MongoDB logless reconfiguration spec today (https://github.com/will62794/logless-reconfig) and I saw how much more compact it was compared to my own various specs of Raft-based systems with reconfiguration. The principal reason is that this spec does not model message passing with separate send and receive actions; communication between nodes is atomic. I have seen a few other specifications use atomic communication as well.

For my part, I have always modeled send/receive message passing, as it seemed possible to miss certain edge cases when communication is atomic. However, message passing makes specifications larger and more complex, with a much larger state space, so there is a real cost.

I'd love to hear others' opinions on whether to model send/receive message passing or use atomic comms in distributed systems.

Thanks

Jack

divyanshu ranjan

Feb 23, 2024, 2:26:40 AM

to tla...@googlegroups.com

Hi Jack,

Lamport has written a paper on when such an assumption is valid, stated in terms of five or six conditions.

[1] https://lamport.azurewebsites.net/pubs/lamport-theorem.pdf


Willy Schultz

Feb 24, 2024, 10:43:53 PM

to tlaplus

Jack,

The abstraction level of those specs was influenced both by our intuition and by pragmatic concerns. Overall, I would say that our work was originally driven by a desire to come up with the most abstract model of the underlying algorithm that was also "useful" from an engineering standpoint. In our context, that meant having a model that gave us a good understanding of the abstract behavior of the algorithm, and one for which model checking was mostly feasible for small/medium sized protocol instances, and also efficient enough to provide prompt feedback while iterating on candidate designs.

More concretely, I think our approach was also influenced by the fact that many of the bug scenarios described in the original Raft reconfig algorithm [6] can essentially be represented even in this abstract model. So, we felt it was a good starting place to express our protocol at this abstraction level so that we could clearly understand these types of bugs at a high level, and work towards a design that would avoid similar issues in MongoDB's system/protocol.

As a general design methodology, I think this is a relatively effective/efficient approach, especially when you are using a spec to iterate on the design of a new protocol (as was the case in our situation). It is roughly analogous to a kind of refinement-driven approach, i.e., designing a protocol at increasingly lower levels of abstraction until you get what you need. In our case, we didn't really do any explicit refinement steps, but I think it was still useful to do most of our design work at a "well chosen" level of the abstraction hierarchy first, before moving to levels that are more complex and/or may make model checking infeasible.

As an additional side note, formalizing the intuition around how to reduce asynchronous message protocols to equivalent synchronous/sequentialized versions is doable, but in my view is tricky/nontrivial [1,2,3]. The relevant ideas are, arguably, straightforward, and old/well-known [4,5], but I view their formalizations as subtle. I think Giuliano's post [7] presents another good example of this type of reasoning, again based on some amount of intuition to justify a type of reduction/sequentialization.

Will

[1] https://dl.acm.org/doi/10.1145/3385412.3385980
[2] https://www.di.ens.fr/~cezarad/cav19.pdf
[3] https://members.loria.fr/SMerz/papers/rp2009.pdf
[4] https://dl.acm.org/doi/10.1145/361227.361234
[5] https://core.ac.uk/download/pdf/82311765.pdf
[6] https://groups.google.com/g/raft-dev/c/t4xj6dJTP6E/m/d2D9LrWRza8J
[7] https://www.losa.fr/blog/streamlet-in-tla+
[8] https://dl.acm.org/doi/abs/10.1145/3497775.3503688

Aman Shaikh

Feb 25, 2024, 7:30:07 PM

to tlaplus

Hi Jack

I have written (or am in the process of writing) TLA+ specs of three distributed systems. For each of these systems, I specify the system as a set of nodes and channels between the nodes. Each channel is point-to-point, as it facilitates communication between a pair of nodes, and hence consists of two queues, one for each direction of the communication. A TLA+ action (which is essentially an 'atomic action') usually consists of a node "processing" a pending message at one of its channels, updating its own state as a result of the message processing, and enqueuing resulting messages for other nodes via the appropriate channels. I then have the 'next' state relation in the TLA+ spec nondeterministically pick a node that has at least one message to process.

Because messages are sent point-to-point and my focus is on what happens when nodes process messages, this way of specifying sending and receiving of messages seems sufficient for my purposes. That said, I can refine my specs to deal with idiosyncrasies of the actual message transmission. For example, in one of the systems, the channel is a TCP connection, which is a fairly involved protocol, but for my purposes it is enough to assume that the TCP channel provides loss-free, in-order message delivery.
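A minimal sketch of the channel model described above, written in Python rather than TLA+ so it can be run directly. The node names, the "ping"/"pong" messages, and the helper names (make_system, enabled_nodes, process_one) are all assumptions made for illustration and do not come from any of the specs discussed here. Each ordered pair of nodes gets its own FIFO queue, and one atomic action dequeues a message, updates the node's state, and enqueues any resulting messages.

```python
from collections import deque


def make_system(nodes):
    """Per-node state plus one FIFO queue per direction of each channel."""
    state = {n: 0 for n in nodes}  # hypothetical state: messages processed
    channels = {(a, b): deque() for a in nodes for b in nodes if a != b}
    return state, channels


def enabled_nodes(channels):
    """Nodes with at least one pending inbound message."""
    return sorted({dst for (_, dst), queue in channels.items() if queue})


def process_one(state, channels, node):
    """One atomic action: dequeue, update local state, enqueue replies."""
    for src, dst in sorted(channels):
        queue = channels[(src, dst)]
        if dst == node and queue:
            msg = queue.popleft()
            state[node] += 1
            if msg == "ping":  # hypothetical protocol: answer ping with pong
                channels[(node, src)].append("pong")
            return msg
    raise ValueError(f"{node} has no pending message")


state, channels = make_system(["a", "b"])
channels[("a", "b")].append("ping")
while enabled_nodes(channels):
    # A model checker would branch over every enabled node; this loop
    # simply takes the first one to produce a single behavior.
    node = enabled_nodes(channels)[0]
    print(node, "processed", process_one(state, channels, node))
```

Note that send is folded into the receiving action here, so the only source of interleaving is the choice of which node processes next.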

Overall, I feel that you can write your TLA+ spec to capture as much detail of your (distributed) system as you want, but the more fine-grained your spec is, the longer it's likely to become, and the more states the model checker will have to grapple with. The latter is of great practical significance in my experience because of the state-space explosion that can occur when you model-check your spec with TLC.

aman

Jack Vanlightly

Feb 26, 2024, 4:20:14 AM

to tla...@googlegroups.com

Thanks for the responses, this has all been very helpful (with lots of reading material).

Regarding the logless reconfig spec, I can understand those arguments well, especially considering that the Raft algorithm is so well understood, with preexisting message passing specs and a proof.

In general, my recent work on Kafka protocols (pull-based Raft for the controller, the Kafka replication protocol, the new replication protocol in Kora and the new consumer group protocol) all involves asynchronous message passing, and all of these specs have state spaces so large that brute-force model checking is generally unviable, so I have been using a lot of simulation. The first three specs have very large state spaces because I have modeled most of the protocol, for example, implementing leader election, replication, reconfiguration and pre-vote for the pull-based Raft. The consumer group protocol has a very large state space simply because it requires a large number of messages to be exchanged to reach convergence. You could say that I've gone the "faithful" route, which has consistently been requested by the engineers I work with. Despite depending on simulation, I have been able to find numerous design flaws, some in pre-existing protocol designs that have been out there for a number of years. I feel that simulation has worked very well at uncovering issues.

Having said that, I am in the process of re-evaluating the approach I have taken in recent work. With the rise of deterministic simulation testing, I am considering whether I should step back from the lower-level faithful approach and use TLA+ at a higher abstraction level, then rely on simulation testing to catch bugs (in the implementation) that could be missed by higher-level TLA+ specs. In this approach, TLA+ adds value by helping me reason about the problem and validate the general approach, while the simulation testing catches the low-level gotchas.

Any further comments would be welcome.

Thanks

Jack


Felipe Oliveira Carvalho

Feb 26, 2024, 8:35:19 AM

to tla...@googlegroups.com

Imagine being able to change whether messaging is assumed to be atomic
or multi-step with a flag passed to the spec/checker. It would speed
up checking and "debugging" of the spec during development but allow
for a more faithful run (if the implementation wants to leverage
non-atomic communication) when you have the time to wait for the
checker to go through the huge state space that non-atomic
communication creates.

I know the right solution for this is refinement mappings, but
something hacky like C pre-processor #ifdefs would go a long way here.
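
To make the flag idea concrete, here is a toy sketch in Python (not TLA+) of a single next-state function that switches between atomic delivery and separate send/receive steps. The protocol itself (a few senders each delivering one message) and the names next_states and reachable are assumptions invented for this example. In a TLA+ spec, a Boolean constant set in the model configuration could presumably play the same role as the atomic parameter.

```python
def next_states(state, senders, atomic):
    """Successor states; `atomic` selects the communication model."""
    sent, inflight, received = state
    for s in senders:
        if s not in sent:
            if atomic:
                # Send and receive collapse into one step.
                yield (sent | {s}, inflight, received | {s})
            else:
                # Send only: the message sits in the network.
                yield (sent | {s}, inflight | {s}, received)
    if not atomic:
        for s in inflight:
            # Separate receive step for each in-flight message.
            yield (sent, inflight - {s}, received | {s})


def reachable(senders, atomic):
    """Exhaustive search over all reachable states from the empty state."""
    init = (frozenset(), frozenset(), frozenset())
    seen, frontier = {init}, [init]
    while frontier:
        state = frontier.pop()
        for nxt in next_states(state, senders, atomic):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    return seen


senders = frozenset({0, 1, 2})
print("atomic:", len(reachable(senders, atomic=True)))   # 2**3 states
print("split: ", len(reachable(senders, atomic=False)))  # 3**3 states
```

With n senders, each sender is either done or not in the atomic model (2^n states), while the split model adds an "in flight" position per sender (3^n states), which is the growth the flag would let you avoid during early iterations.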

--
Felipe


Guo Hua

Apr 10, 2024, 4:04:55 AM

to tlaplus

Hi Jack,

I read the MongoRaftReconfig paper (https://arxiv.org/pdf/2102.11960.pdf), and I like the idea. I think the paper is about designing a new protocol, where high-level abstraction is preferred and necessary to refine one's thinking. I also agree with you that the asynchronous send/receive style is favorable from an engineering standpoint.

Consistency between the TLA+ spec and the implementation has value in software engineering, since every system design must ultimately be implemented in source code.

In my view, the appropriate approach is to work at a higher level of abstraction when designing a new protocol, which is more scalable and straightforward. However, when implementing an already well-designed protocol, it is better to specify it in the asynchronous send/receive style to keep the spec close to the implementation.

I have some unproven ideas about changing the MongoRaftReconfig specification from atomic to asynchronous communication. In the atomic communication style, we assume the system has some global state that can be checked directly. However, real-world systems obtain this global state through unreliable communication, such as send/receive.

For example, the committed variable stores the committed log index and term. Updating this variable corresponds to AdvanceCommitIndex in the original Raft spec. The commit index is advanced when the leader receives confirmation from followers. Some variables need to be added to make the spec asynchronous.

The config variable of the atomic communication version is divided into two variables:

configCommitted, the current configuration of the cluster application;

configNew, the new configuration to be changed.

Thus, for ConfigQuorumCheck and TermQuorumCheck in the spec (corresponding to Q1 and Q2 in the paper), the leader must keep additional variables:

followerConfig, the configCommitted status of other nodes collected by the leader;

followerTerm, the currentTerm status of other nodes collected by the leader.
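
As a rough illustration of the two leader-side variables above, the following Python sketch has the leader evaluate a quorum check against what followers last reported, instead of reading their state directly. Only the names followerConfig and followerTerm come from the post; the helper quorum_agrees, the on_reply handler, the (version, term) encoding of a config, and the three-node cluster are assumptions for this example, and the check is a simplified stand-in, not the definition used in the actual spec.

```python
def quorum_agrees(members, leader, leader_value, reported):
    """True if a strict majority of members match leader_value.

    Followers are judged only by their last report to the leader, which
    may be stale. The leader is assumed to be one of the members.
    """
    agreeing = {leader} | {
        m for m in members if m != leader and reported.get(m) == leader_value
    }
    return 2 * len(agreeing) > len(members)


members = {"n1", "n2", "n3"}
leader = "n1"
leader_config = (2, 1)  # hypothetical (version, term) pair
leader_term = 1

follower_config = {}  # leader's view of each follower's committed config
follower_term = {}    # leader's view of each follower's current term


def on_reply(follower, config, term):
    """Receive action: record what a follower reported about itself."""
    follower_config[follower] = config
    follower_term[follower] = term


# Before any reply arrives, the leader cannot establish either quorum.
print(quorum_agrees(members, leader, leader_config, follower_config))

on_reply("n2", (2, 1), 1)  # n2 has installed the leader's config
on_reply("n3", (1, 1), 1)  # n3 still reports the older config
print(quorum_agrees(members, leader, leader_config, follower_config))
print(quorum_agrees(members, leader, leader_term, follower_term))
```

The point of the sketch is that the check now depends on when replies arrive, which is exactly the extra behavior the asynchronous version has to account for.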

The committed variable will be removed, and the confirmation of each AppendLogEntries guarantees the OplogCommitment property in the spec (corresponding to P1 in the paper). The AppendLog message flow may need to carry the config's version and term as additional values. QuorumsOverlap also requires similar processing, such as checking the config version and term when invoking the AppendLog and RequestVote RPCs.

It is trivial.

I also want to share my thoughts about using levels of abstraction to deal with scale through staged model checking. First, implement the high-level abstraction, for example by assuming atomic communication, and use it to generate a relatively small state space. Then implement the lower-level abstraction, with send and receive actions describing the messages. It is best to make the lower-level abstraction a refinement of the high-level abstraction, although I have little experience ensuring perfect and precise refinement. The states generated by the high-level abstraction become the initial states of the lower-level abstraction, and model checking is then done in stages. For example, Raft is divided into AppendLog and LeaderElection. We first generate the initial states from the high-level abstraction, add the AppendLog operator, and perform model checking. Then, in the next stage, we perform model checking using the initial states from the high-level abstraction plus the LeaderElection operator. I follow this staged approach only through intuition. IPA (https://arxiv.org/pdf/2202.11385.pdf) may provide the tools to make it more rigorous.
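
The staging idea can be sketched in a few lines of Python (not TLA+): explore with one set of actions, then reuse the resulting states as the initial states for a second exploration with a different set of actions. The two toy actions below, bump and copy, are invented stand-ins for the per-stage operators and have nothing to do with Raft itself.

```python
def explore(initial_states, actions):
    """All states reachable from initial_states using only `actions`."""
    seen = set(initial_states)
    frontier = list(seen)
    while frontier:
        state = frontier.pop()
        for action in actions:
            for nxt in action(state):
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
    return seen


def bump(state):
    """Hypothetical stage-one action: increment x up to 2."""
    x, y = state
    if x < 2:
        yield (x + 1, y)


def copy(state):
    """Hypothetical stage-two action: copy x into y."""
    x, y = state
    if y != x:
        yield (x, x)


init = {(0, 0)}
stage_one = explore(init, [bump])      # first stage: bump only
staged = explore(stage_one, [copy])    # second stage, seeded by the first
full = explore(init, [bump, copy])     # unstaged search, for comparison

print(len(staged), len(full), sorted(full - staged))
```

In this toy example the staged search visits fewer states than the full search, because it only covers behaviors in which every stage-one step happens before any stage-two step. Whether the skipped interleavings matter for a real protocol is the kind of question a more rigorous treatment would need to answer.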

Guo Hua
