Model network using set or bag


Alberto Carretero

Mar 18, 2026, 7:21:49 AM (7 days ago)
to tlaplus
I see several discussions here on the topic of message passing or atomic communication, but I have not found any on which representation of the network is best for message passing. I wonder if there is a standard way, or at least one that is better for TLC.

In general I see that there are two different possibilities. There are several specs like PaxosMadeSimple[1] that use a set of all the messages ever sent. Duplicates do not need to be modeled specifically as messages are never deleted and message loss is equivalent to never receiving the message. There is even a comment written presumably by Lamport in the spec that states:

> (* The algorithm is easiest to understand in terms of the set msgs of all  *)
> (* messages that have ever been sent.  A more accurate model would use one *)
> (* or more variables to represent the messages actually in transit, and it *)
> (* would include actions representing message loss and duplication as well *)
> (* as message receipt.                                                     *)

I read "more accurate" as closer to a real implementation but not necessarily more expressive as I think that all the safety properties can also be modeled using a set. Is that right?
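For concreteness, the set-of-all-messages style used in PaxosMadeSimple boils down to something like the following sketch (operator and variable names here are illustrative, not taken from the spec):

```tla
VARIABLE msgs   \* the set of all messages ever sent

Init == msgs = {}

\* Sending only ever adds; nothing is removed from msgs.
Send(m) == msgs' = msgs \cup {m}

\* A receive just reads a sent message and leaves the network unchanged:
\* loss is modeled by never taking this action for m, and duplication by
\* taking it more than once.
Receive(m) == m \in msgs /\ UNCHANGED msgs
```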

Now, on the other hand, we have specs such as Raft[2][3] that use a bag of messages to model the network (a mapping from each message to its count). Duplicates are handled by incrementing the count and drops by decrementing it. To me, this representation should be much less performant when model checking in TLC, as each duplicate / drop changes the state and can be interleaved with any other action. Even if the network is constrained to disallow duplicates, more states will be generated compared to the set, as messages can be received by the destination and then resent by the sender, effectively doubling the number of generated states.
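For comparison, a minimal sketch of the bag-based style using the standard Bags module (names are illustrative; the Raft specs structure this somewhat differently):

```tla
EXTENDS Bags

VARIABLE messages   \* bag mapping each in-flight message to its count

Init == messages = EmptyBag

Send(m)      == messages' = messages (+) SetToBag({m})
Duplicate(m) == /\ BagIn(m, messages)
                /\ messages' = messages (+) SetToBag({m})
\* Drop and Receive have the same effect on the network; they differ in
\* whether the recipient also processes the message. Each of these four
\* actions changes the state, so duplicates and drops interleave with
\* every other action, which is the blow-up described above.
Drop(m)      == /\ BagIn(m, messages)
                /\ messages' = messages (-) SetToBag({m})
Receive(m)   == /\ BagIn(m, messages)
                /\ messages' = messages (-) SetToBag({m})
```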

At this point, it looks to me like using a set would be preferred in general, but then I see most specs preferring a bag. What am I missing? Are there any other advantages? Or maybe sets are more efficient but the effect is not meaningful enough.

[1]: https://github.com/tlaplus/tlaplus/blob/master/general/performance/PaxosMadeSimple/PaxosMadeSimple.tla
[2]: https://github.com/ongardie/raft.tla
[3]: https://github.com/etcd-io/raft/blob/main/tla/etcdraft.tla

Andrew Helwer

Mar 18, 2026, 11:28:15 AM (7 days ago)
to tla...@googlegroups.com
This is a good question and I've been meaning to write a blog post summarizing & comparing all these for basically forever. Off the top of my head, in order of roughly increasing fidelity we have:
  1. Processes directly read each other's state; Lamport uses this approach in many of his specs IIRC. I suspect it is more effective at modeling distributed algorithms than we think!
  2. To send, messages are put in a single buffer value (either on sender, recipient, or shared); if the buffer is currently occupied, messages cannot yet be sent. Recipients pull messages off the buffer value and clear it. More common for modeling communication over a bus within the same computer.
  3. Each process has a message receive queue or (if out-of-order messages are to be modeled) a message receive unordered set, into which messages are added and then removed.
  4. A single queue or unordered set for the entire network, into which messages are added and removed; this usually requires wrapping each message in a header identifying the sender & recipient. Strictly superior to 3 in my opinion because it avoids some weird artificial deadlocks.
  5. A single bag for the entire network, as detailed in your post; this has the drawback of unbounded state, as each counter can be incremented to infinity.
  6. (My favored option) Basically the same as 4 using a single unordered set for the entire network, but sends can nondeterministically fail to add a message to the set (representing message loss) and receives can nondeterministically fail to remove a message from the set (representing message duplication). The unordered set inherently gives you out-of-order messaging. You can also tune these constraints to make your model larger or smaller using the exact same API for both sends & receives, including using a queue for the network instead of a set. I have a nice implementation of this I've been meaning to add to the community modules for a while.
There are probably others but those are the ones I've seen. On the PlusCal side there is also Distributed PlusCal [PDF] which as I understand it adds message passing primitives to the language. It is being worked on by Stephan's group.
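As a point of reference, option 4 can be sketched roughly like this (field and operator names are illustrative, not from any particular spec):

```tla
VARIABLE network   \* one unordered set of in-flight messages for the whole network

Init == network = {}

\* Each message is wrapped in a header naming sender and recipient.
Send(from, to, payload) ==
    network' = network \cup {[src |-> from, dst |-> to, body |-> payload]}

\* A process may receive any in-flight message addressed to it.
Receive(self) ==
    \E m \in network :
        /\ m.dst = self
        /\ network' = network \ {m}
```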

Andrew Helwer


Stephan Merz

Mar 18, 2026, 11:54:55 AM (7 days ago)
to tla...@googlegroups.com
Thanks Andrew, great description. In summary, there is not a single "best" way to model communication, one has to balance abstraction (favoring smaller state spaces) and precision. At the extreme end, (1) doesn’t really model communication at all but can lead to very readable high-level specifications that can then be refined based on some of the higher-numbered variants.

A small niggle: I don’t really understand what "artificial deadlocks" you refer to for (3): as long as message queues are unbounded, why would sending lead to deadlock?

On the PlusCal side there is also Distributed PlusCal [PDF] which as I understand it adds message passing primitives to the language

It’s essentially based on your (3) or (4), depending on how processes use channels, with variants for FIFO or unordered message delivery (mapping to sequences and bags in TLA+, respectively).

Stephan

Markus Kuppe

Mar 18, 2026, 12:30:57 PM (7 days ago)
to tla...@googlegroups.com
Building on Stephan's point that there isn't a single "best" way: in many formalisms, this creates a real modeling dilemma, since you have to pick the "right" level of detail up front. In TLA, refinement lets us turn that into a modeling strategy instead.

In particular, approaches like (1), where processes directly read each other's state, are sometimes dismissed as "not really modeling communication". But in TLA, that's actually useful. It lets us start with a simple, high-level spec that captures the essential safety and liveness properties, without committing early to a specific communication mechanism.

From there, we can introduce lower-level specs using any of the more detailed variants you list (queues, sets, lossy networks, etc.), and then formally show that they refine the more abstract specs. In other words, we can show that the more concrete model is a correct implementation of the high-level one.

This is where TLA+ distinguishes itself: the ability to formally connect different points in the abstraction/precision tradeoff, instead of picking one and hoping it was the right choice. A typical flow is:

* start with a simple, readable model (often close to (1)),
* validate the core properties,
* and then refine toward more concrete communication models like (4) or your favored variant.

So you end up not just with multiple models, but with a clear relationship between them, avoiding both a single model at the wrong granularity and a set of disconnected ones.

This kind of abstraction is also how some algorithms were found in the first place. As Lamport himself writes about Paxos, "It would have been very hard to find this algorithm had my mind been distracted by the irrelevant details introduced by having the processes communicate by messages." [1,2]

Suppressing communication details isn't just about keeping the state space small. It can make the problem easier to think about. Refinement then gives you a way to add the details back in a controlled way.

M.

[1] https://github.com/tlaplus/Examples/blob/master/specifications/PaxosHowToWinATuringAward/Voting.tla#L8-L10
[2] https://lamport.azurewebsites.net/tla/paxos-algorithm.html

Andrew Helwer

Mar 18, 2026, 12:40:16 PM (7 days ago)
to tla...@googlegroups.com
Ah good catch, Stephan - this was all based on a spec I wrote more than two years ago, which is coming back to me a bit more now. I think I mixed up the queue approaches in 3 and 4. Basically: global queue = bad, queue-per-receiver = acceptable. When I had a global queue it interacted with application logic in an unintuitive way to produce a deadlock that I felt was kind of artificial. Moving to a per-receiver queue fixed the problem. So don't use global queues unless it really matches your network topology!

Related here is something A. Jesse Jiryu Davis has been working on lately, based on the 1990 paper Knowledge and common knowledge in a distributed environment by Joseph Y. Halpern and Yoram Moses. It is about models for knowledge in a distributed network, which is perhaps the more fundamental thing we are actually trying to capture with this kind of message passing stuff. Jesse gave a talk at UCLA about this recently but I don't think it was recorded anywhere.

Andrew Helwer 

divyanshu ranjan

Mar 19, 2026, 12:07:01 AM (7 days ago)
to tla...@googlegroups.com
In the first way to model a network, message loss is not a possibility, whereas in the second it is. Depending on how the safety property is specified, it is possible for it to fail model checking in the second model but not in the first.

Consider two hosts that store a "common set" and communicate additions of elements to their sets through messages. To specify that the "common set" is eventually equal, the safety property could be (host1_set + network_set) == (host2_set + network_set), where network_set is the set of all elements in transit. This property might fail to model check in the second model but not in the first. But this is too toy an example.



Alberto Carretero

Mar 24, 2026, 10:23:58 AM (yesterday)
to tla...@googlegroups.com

(I tried sending this message multiple times but somehow it shows up as deleted instantly, hopefully it is not duplicated)

Thank you everyone for the responses! As Markus says, there is no "best formalism" and no need to pick the "right" level of detail up front; instead, refinement can be used to add incremental details. However, I still think some of the approaches are worse than others. For example, depending on the way duplicates* and drops are formalized, we end up effectively model-checking the network abstraction and not the algorithm itself. In approaches (2) to (6) from Andrew's message, the following:

Send(m) -> Duplicate(m) -> Send(m) -> ... -> Drop(m) -> Receive(m)

generates distinct states for the _network_, but from the point of view of our algorithm it is equivalent to:

Send(m) -> Receive(m)

Instead, when using a set with no removal of messages, new states are generated only if sending or receiving multiple times affects the algorithm state itself. To me, this makes the set representation neutral with respect to the algorithm: new states appear only when they are relevant to its specifics.

Alberto.

*: Duplicates can be introduced by the network or by the protocol itself.

Andrew Helwer

Mar 24, 2026, 12:11:45 PM (yesterday)
to tla...@googlegroups.com
Apologies about your messages being deleted. This is a known issue with google groups. I usually manage to avoid it by not going through the google groups web interface, but instead replying from my email account. Then at least if the message is deleted I will be able to find it in my sent messages folder and re-send it. The web interface is also generally awful and best to be avoided.

The reason I like option 6 is that it avoids the problem you mentioned, because sends and receives nondeterministically fail without generating additional steps. In that method, send(msg) has two possible successor states:
  1. The stuttering state, where sending the message failed (equivalent to dropping the message in transit) and no change is made to the network (although other state variables may have changed).
  2. The send succeeds and the message is added to the network set.
Additionally, receive(msg) also has two successor states:
  1. The message is removed from the network set, representing an ordinary receive.
  2. The message is not removed from the network set, representing message duplication; in both this and the previous case the receiver still actually receives the message and can do whatever with it.
Of course, with a set in the middle you also get unordered messages, although these can be swapped out for per-receiver queues instead. Another reason I like this approach is that unlike when using Bags, you can naturally "saturate" the state space instead of having unbounded counters everywhere that have to be limited artificially.
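A minimal sketch of this send/receive scheme (names are illustrative; the actual file mentioned above is not yet published):

```tla
VARIABLE network   \* single unordered set of in-flight messages

Init == network = {}

\* The drop is folded into the send step, so no separate Drop action
\* (and no extra network state) is needed.
Send(m) ==
    \/ network' = network \cup {m}   \* send succeeds
    \/ UNCHANGED network             \* message lost in transit

\* Duplication is folded into the receive step: the receiver always
\* processes m, but a copy may be left behind to be received again.
Receive(m) ==
    /\ m \in network
    /\ \/ network' = network \ {m}   \* ordinary receive
       \/ UNCHANGED network          \* message duplicated
```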

The tricky part is designing a nice API for this, where you can also optionally turn off message removal/duplication/unordering without having to change any of your TLA+ code that interacts with the network. I actually managed to do this - will need to find the file and publish it.

Andrew

冯文翰

3:35 AM (14 hours ago)
to tlaplus
In some real-world systems, we sometimes require more accurate modeling of network messages. For example, in the ZooKeeper specification (https://github.com/apache/zookeeper/tree/master/zookeeper-specifications), the messages sent from node N1 to node N2 are modeled as a tuple (a sequence), which means the message ordering information is preserved. The entire message variable is then modeled as a two-level map, such as:

N1 :> (N1 :> <<m1, m2>> @@ N2 :> <<>>) @@ N2 :> (N1 :> <<>> @@ N2 :> <<>>)

This is because ZooKeeper actually relies on a single TCP connection for communication between nodes during the leader election phase, which ensures that messages sent between two nodes maintain a first-sent, first-received order. If the messages were modeled as a Bag, N2 might receive m2 before m1, allowing TLC to explore states that would never occur in the actual system, thereby producing false positives.
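The per-pair FIFO channels described here can be sketched as follows (names are illustrative, not taken from the ZooKeeper spec):

```tla
EXTENDS Sequences

CONSTANT Node
VARIABLE msgs   \* msgs[i][j]: sequence of messages in flight from i to j

Init == msgs = [i \in Node |-> [j \in Node |-> <<>>]]

Send(i, j, m) == msgs' = [msgs EXCEPT ![i][j] = Append(@, m)]

\* FIFO delivery over the single TCP connection: j can only take the
\* head of its channel from i, so m2 is never received before m1.
Receive(i, j) ==
    /\ msgs[i][j] # <<>>
    /\ msgs' = [msgs EXCEPT ![i][j] = Tail(@)]
```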

In my view, something worth noting here is whether a specification that differs from the actual system implementation might lead TLC into states that would never occur in implementation. For instance, modeling messages as a Set could introduce more interleavings of message arrivals compared to modeling them as a Bag. (Such states might confuse developers of the actual system implementation: “Since our system never enters such states, why must we guarantee properties under those states?”) This issue might be less apparent during early design stages, but it becomes more pronounced when verifying an existing system. Therefore, based on my current experience, modeling messages in a way that closely aligns with the system’s implementation is sometimes a more appropriate choice.
