Is it possible to TLA+ (and PlusCal) to check complex models?

606 views
Skip to first unread message

Shuai Mu

unread,
Feb 23, 2017, 4:50:32 PM2/23/17
to tlaplus
Hi, 

I am trying to use TLA+ to reason a fair complex protocol 
used for transaction and replication. An comparable system 
would be Google's Spanner, i.e. 2PC on top of Paxos. The 
difference is my protocol is not layered like 2PC and Paxos 
so everything is described as one. 

I found that it is hard to do model checking with complex models. 
It is okay to check with simpler models, e.g. 1 shard + 1 replication, 
3 shard + 1 replication, 1 shard+3 replication. The model checker 
can finish in a few minutes on my 24-core workstation. However, 
if I change to a more complex model, such as 3 shard + 3 replication. 
The checker seem never finishes. The longest run I had is a little 
less than 2 days with my 24-core workstation. 

I realize it might has something to do with how the model checker 
does (brutal search?), and perhaps the labels in the PlusCal kind 
of makes things worse? Does adding a label in PlusCal severely 
affect the time for model checking? 

Is there any way to work around this?

-Shuai

Stephan Merz

unread,
Feb 24, 2017, 3:14:06 AM2/24/17
to tla...@googlegroups.com
Hi Shuai,

I had a quick look at your spec, and I am not surprised that TLC blows up on it. Yes, it does "brutal search" of the state space, enumerating the reachable states, creating successor states by trying to evaluate all actions on every state it has seen so far. What you have seen is the well-known problem of combinatorial explosion (i.e., exponential growth of the state space) as you add more processes. PlusCal labels define the atomic actions of your spec: the more you have, the more fine-grained your transitions are and the more states you'll generate. On the other hand, inserting fewer labels may mask problems due to concurrent execution of actions that access shared variables.

Fortunately, exhaustive search tends to find errors in very small configurations because TLC explores states that are extremely unlikely to occur in actual executions. You should therefore ask yourself what are the smallest meaningful configurations that give you confidence in the correctness of the overall system.

Also, can you tear apart your algorithm and verify pieces independently of the rest? (Your description seems to say that this is difficult.)

Finally, it looks like you "only" verify the liveness property <>SrzAcyclic(srz). Liveness checking takes more time than invariant checking, and I'd try to think of invariant properties (describing relations among variables that you expect to hold) and have TLC check those as well. If exhaustive search is taking too long, you may consider doing random exploration (see Advanced Options -> TLC Options -> Simulation mode) which may find deeper invariant violations more quickly than standard breadth-first exploration.

Regards,

Stephan


--
You received this message because you are subscribed to the Google Groups "tlaplus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlaplus+u...@googlegroups.com.
To post to this group, send email to tla...@googlegroups.com.
Visit this group at https://groups.google.com/group/tlaplus.
For more options, visit https://groups.google.com/d/optout.

Hillel Wayne

unread,
Feb 24, 2017, 5:19:42 PM2/24/17
to tlaplus
Hi Shuai,

IIRC liveness checking only uses a single core, because it has to track behaviors. Testing safety conditions is generally faster.

Given N processes with M skip; labels and 1 starting state, you'll have approx (NM)!/(M!)^N possible behaviors. TLA+ will attempt to exhaustively check all of them. Combined with multiple starting states and with statements you hit combinatorial explosion pretty quickly. Usually the simplest optimizations are finding ways to reduce the number of labels without compromising the integrity of your spec. Here's one example:

procedure AggregateGraphForFastPath(pre_accept_acks, partitions) {

   xxxxdddx:

     tmp_partitions := partitions;  

   yyyyassy: while (tmp_partitions /= {}) {

     with (p \in tmp_partitions) {

       tmp_par := p;

       tmp_partitions := @ \ {p};

     };

     tmp_ack := PickXXX(pre_accept_acks[tmp_par]);

     dep_c := @ \union tmp_ack.dep;

   };

   return;

 }      


Since procedures automatically reset local variables arguments after use, you don't need to use tmp_partitions, which means you don't need the xxxxddxx label.

Shuai Mu

unread,
Feb 24, 2017, 9:22:19 PM2/24/17
to tlaplus
Hi Stephan, 

Thank your for your reply. I think your summarized my problem very well. 

I used to check invariant in the IDE on the model overview page. The invariant I used is simply "SrzAcyclic(srz)", if memory serves. It did not trigger any errors, only very slow, perhaps because it is checking for every reachable state? Since the serialization graph (srz) is only increasing in vertex and edge set, I thought checking final state would be enough? (Does the liveness check mean it will check this invariant in every final state?)

Now I am thinking about a possible "workaround" solution. Instead of using "process" for each server in the protocol, I am thinking about only having one process in PlusCal, and have a global msg queue. The one process will then be like a "event loop" that keeps popping messages from the msg queue, process, and push back to the queue. This will lose some correctness, but hopefully not too much. 

Shuai Mu

unread,
Feb 24, 2017, 9:27:59 PM2/24/17
to tlaplus
Hi Hillel, 

Thank you for your reply and thanks for catching the unnecessary label!

Do you mean if I check for liveness, at most only one core will be utilized? So we should not check any properties in the "model overview" page to utilize more than 1 core? This is certainly new knowledge to me!

Markus Alexander Kuppe

unread,
Feb 25, 2017, 8:55:10 AM2/25/17
to tla...@googlegroups.com
On 24.02.2017 23:19, Hillel Wayne wrote:
> IIRC liveness checking only uses a single core, because it has to track
> behaviors.

Hi,

technically, liveness checking - among other things - searches for
strongly connected components. TLC uses Tarjan's algorithm [1], which is
single threaded.

Cheers
Markus

[1]
https://en.wikipedia.org/wiki/Tarjan's_strongly_connected_components_algorithm

Leslie Lamport

unread,
Feb 25, 2017, 10:54:11 AM2/25/17
to tlaplus
   Do you mean if I check for liveness, at most only one core will be
   utilized?  So we should not check any properties in the "model
   overview" page to utilize more than 1 core? 

The "Properties" part of the "What to check?" section of the Model
Overview page is used to specify temporal formulas that can be safety
properties, liveness properties, or a conjunction of the two.  The
safety properties TLC can check are invariance ([] Invariant), step
simulation ([][Action]_vars), and a conjunction of these properties.
To check any property, TLC must compute the state graph.  It does that
using as many cores as the model specifies.  TLC checks for violation
of safety properties while computes the state graph.  It checks for
violation of liveness properties, using a single core, after computing
the state graph and periodically while computing it.  (I think it
does that about every half hour.)

Shuai Mu

unread,
Feb 25, 2017, 9:24:53 PM2/25/17
to tlaplus
Hi Leslie, 

Thank you very much for clarification. Then for my case it is just that the search space is too large.  
Reply all
Reply to author
Forward
0 new messages