Experimenting with PlusCal / TLA+ at Dropbox


Elliott Jin

Aug 4, 2016, 1:59:08 AM
to tlaplus
Hi all,

I work at Dropbox, and we recently had a "Hack Week" (i.e. a week to work on anything we want); I chose to spend the week experimenting with TLA+ / PlusCal for specifying components of our production systems.

Anyway, I just wanted to share the results of my week with this list.  I ended up using PlusCal to specify one of our protocols for maintaining strong consistency between a cache and the underlying database.  I've attached a recent version of the spec; here are a few additional remarks:
  • I actually spent the majority of my time understanding the protocol (i.e. tracking down and talking to the authors, reading design docs, reading code); the process of writing PlusCal went really quickly and smoothly.

  • The model checker did find an issue, but it was already a known and tracked issue (something can go wrong if a writer takes a really long time to commit a database transaction while the cache is being restarted).

  • I had to keep the model's parameters really small, or face a tremendous combinatorial explosion of states (though I'm not ruling out the possibility that I could've formulated the spec in a different way that's more conducive to model checking).
Overall, I thought this was a really interesting and worthwhile exercise.  I definitely plan to write a PlusCal or TLA+ spec next time I get the opportunity to design a complicated component involving lots of concurrency or subtle state transitions.  I also think I've managed to generate some interest among my coworkers; if I'm lucky, I'll be able to get some of them to add these great tools to their toolbox as well.

Thanks for reading; I'd love to hear from you if you have any questions or suggestions!

Regards,
Elliott
ConsistentKV.pdf

Leslie Lamport

Aug 15, 2016, 12:57:55 PM
to tlaplus
Hi Elliott,

Thanks for posting your experience.
  • I actually spent the majority of my time understanding the protocol (i.e. tracking down and talking to the authors, reading design docs, reading code); the process of writing PlusCal went really quickly and smoothly.
Yes, people are so used to starting to think only when they start writing code that they don't realize the hard part of programming is the thinking, not the coding.  And they write bad code because thinking in terms of code leads to bad thinking.  Since specifications are so much smaller than programs, it's harder to make that mistake--especially when the spec language is expressive enough to make it easy to express your ideas in the language.
  • The model checker did find an issue, but it was already a known and tracked issue (something can go wrong if a writer takes a really long time to commit a database transaction while the cache is being restarted).
Did you know about that issue before you wrote your spec?  If not, that's a very good illustration of the usefulness of specs even if the problem had already been discovered.
  • I had to keep the model's parameters really small, or face a tremendous combinatorial explosion of states (though I'm not ruling out the possibility that I could've formulated the spec in a different way that's more conducive to model checking).
As this may have led you to realize, even tiny models can be very effective at finding bugs.

Cheers,

Leslie


Stephan Merz

Aug 23, 2016, 2:59:40 PM
to tla...@googlegroups.com
Hi Elliott,

Thanks very much for sharing your specification. The conclusions that you draw from your experiment are very encouraging to us in the TLA+ community, and I wonder if you would be able to present your experience at the TLA+ community meeting later this year (http://tla2016.loria.fr)?

I played a bit with your spec, and here is some feedback. The modified spec is attached.

  • I changed the reader and writer processes so that they are cyclic, modeling repeated invocations of read and write operations, rather than your terminating (one-shot) processes. This of course increases the number of states that the model checker has to consider, but it may make the result of verification a little more convincing, and the model checker will stop exploring when it hits the same state for the second time.
  • In order to reduce the number of reachable states that differ in irrelevant ways, I initialize the local variables to "type-correct" values and reinitialize them after every iteration. Although this change could in general mask some problems, it is quite obvious that your algorithm never reads an uninitialized variable, so this does not change the visible behavior.
  • It is important to make atomic steps as large as possible, because interleaving is the main culprit of state-space explosion. I therefore removed your "start_pending" operation, which is a no-op, and combined "get_sentinels" and "check_already_pending": the latter only reads local variables, so no difference in behavior can be observed at the level of the global variables. The same goes for "cas_sentinels" and "check_cas_errors". (However, I added another step for reinitializing the local variables of the writers, because a write operation may terminate prematurely if it detects a concurrent pending write.)
  • The model of the "operators" is clearly incomplete since the spec only models startup (integration of readers and writers) but does not model failure. However, I essentially left this as it was – it's quite obvious that the "operator" processes do not play an essential role for the stated correctness properties for this spec.
  • I combined your two variables cache_keys_to_values and cache_keys_to_versions into a single variable cache that stores an array of records. This is a purely cosmetic change that doesn't change the size of the state space, but I find it a bit easier to read.
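To illustrate the first and third changes, a cyclic writer with merged labels might look roughly like this in PlusCal (a sketch only; all labels and variable names here are invented for illustration and do not come from the actual spec):

```tla
process (writer \in Writers)
  variables write_keys = {};      \* "type-correct" initial value
{
w_loop: while (TRUE) {            \* cyclic: models repeated write invocations
          with (ks \in (SUBSET Keys) \ {{}}) {
            write_keys := ks;
          };
w_cas:    \* get_sentinels and check_already_pending merged under one
          \* label: the check reads only local variables, so merging
          \* does not change the observable (global) behavior
          skip;
w_reset:  \* reinitialize locals after each iteration, so states that
          \* differ only in stale local values are not distinguished
          write_keys := {};
        }
}
```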

  • The model checker did find an issue, but it was already a known and tracked issue (something can go wrong if a writer takes a really long time to commit a database transaction while the cache is being restarted).
I understand that this issue is not present in the spec that you posted.

  • I had to keep the model's parameters really small, or face a tremendous combinatorial explosion of states (though I'm not ruling out the possibility that I could've formulated the spec in a different way that's more conducive to model checking).
Indeed, that's the typical experience, and one invariably underestimates the number of reachable states. Nevertheless, as you noticed, small parameter values are often enough to find useful errors. The interesting question is how big parameters should be chosen in order to be convinced of correctness.

For many cache-coherence protocols, it's enough to consider a single cache line (i.e., a single key in the terms of your spec). However, this is not true for your protocol because you consider sets of writes (through multi-CAS): it clearly makes a difference whether all, none, or some of the updates succeed. Still, it seems to me that two keys should be enough to see all relevant situations.

Similarly, two values are enough, since we are only interested in checking (dis)equality.

Finally, I believe that checking for two readers, two writers, and one operator is enough in order to detect all possible interferences.

Moreover, keys, values, readers, and writers can be declared as symmetry sets in the TLC model (reducing the state space that TLC has to explore), since there is no reference to specific values of these sets – except for the initialization of the database, which uses a CHOOSE expression. What I did there was to generalize the model by making the initialization non-deterministic, mapping keys to arbitrary values in the initial state: since all of these states are reachable anyway, they may as well be admitted as initial states.
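The change to the initialization amounts to replacing a CHOOSE, which fixes one particular value and thereby breaks symmetry among values, by a non-deterministic initial predicate (variable names here are illustrative):

```tla
\* Before: one specific initial database, fixed by CHOOSE
\*   db = [k \in Keys |-> CHOOSE v \in Values : TRUE]

\* After: every type-correct mapping is an initial state, so
\* Values can safely be declared a symmetry set in the TLC model
Init == db \in [Keys -> Values]

\* Symmetry declaration for the model (Permutations is from the
\* standard TLC module)
Symmetry == Permutations(Keys) \cup Permutations(Values)
              \cup Permutations(Readers) \cup Permutations(Writers)
```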

With these parameter values fixed, it remains to bound the version numbers that model checking should consider, expressed through a state constraint. With a bound of 2, TLC explores the state space in approximately 45 minutes on my laptop; it takes about a day for a bound of 3.
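Such a state constraint might be written as follows (assuming a per-key version field as in the record-valued cache variable; the actual names in the spec may differ). TLC does not explore successors of states that violate the constraint:

```tla
VersionBound == 2
StateConstraint == \A k \in Keys : cache[k].version <= VersionBound
```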

These results made me quite confident that the protocol actually satisfies the stated correctness properties. If you want to go beyond fixed instances, you may turn to theorem proving. The main complication is that this requires an inductive invariant (a predicate that is true initially and preserved by every possible transition from any state satisfying the invariant), and in general it may be very complicated to come up with such an invariant.

In your case, the invariant seemed quite obvious to me, so I wrote it down and proved it using TLAPS. The proof is contained in the module ConsistentKV_proofs, in case you are interested. You may find it enlightening to read the invariant because it gives a clear explanation of why the protocol works, in contrast to model checking, which "only" assures you that it does. In particular, the invariant may guide modifications and optimizations of the algorithm as it stands. The proof was a breeze because TLAPS essentially did it automatically. The only interaction was to break the proof down into individual steps per instruction (corresponding to the disjuncts of the next-state relation). For a few steps, I helped the prover by separately proving each conjunct in the conclusion, although for all but one it would actually have been enough to raise the timeout a little before the prover gives up ("SMTT(30)" stands for a timeout of 30 seconds for the relevant backend). For some reason, the auxiliary proof of the (full) typing invariant needed a little more interaction, although it appears to be even more straightforward. I didn't bother to prove the property VersionsOK, which looks obvious.
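The overall shape of such a TLAPS proof, with the invariance step broken down per disjunct of the next-state relation, looks roughly like this (Inv stands for the inductive invariant; the disjunct names are illustrative, not taken from the actual spec):

```tla
THEOREM Invariance == Spec => []Inv
<1>1. Init => Inv
  BY DEF Init, Inv
<1>2. Inv /\ [Next]_vars => Inv'
  <2>1. Inv /\ Reader => Inv'      \* one step per disjunct of Next
  <2>2. Inv /\ Writer => Inv'
  <2>3. Inv /\ Operator => Inv'
  <2>4. Inv /\ UNCHANGED vars => Inv'
  <2>5. QED
    BY <2>1, <2>2, <2>3, <2>4 DEF Next
<1>3. QED
  BY <1>1, <1>2, PTL DEF Spec
```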

Thanks again for the interesting experience!

Stephan

ConsistentKV_proofs.tla
ConsistentKV.tla

Daniel Ricketts

Aug 24, 2016, 5:21:07 AM
to tlaplus
Hi Stephan,
 
I understand that this issue is not present in the spec that you posted.

I believe the issue was present in the original spec, but the modifications that you made to cas_sentinels fixed it. In the original spec, a writer can proceed with db_commit if cas_error_keys is empty, even if cas_success_keys is not equal to db_write_keys. Such a state can occur if db_write_keys is non-empty and the writer takes a cas_sentinels step when it is Down. This violates consistency because keys can be modified in the database when they are not Pending in the cache.

You fixed this issue by (a) removing cas_error_keys and comparing cas_success_keys to db_write_keys and (b) changing the if-statement in cas_sentinels to an await.
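Sketched at the level of the PlusCal step (reconstructed from the description above, so the exact guard and labels may differ from the actual spec):

```tla
cas_sentinels:
  \* Before: an if-statement let the writer fall through to db_commit
  \* whenever no CAS reported an error, even when (with the cache Down)
  \* cas_success_keys # db_write_keys:
  \*   if (cas_error_keys = {}) { ... };   \* then on to db_commit
  \* After: block until the sentinel CAS succeeded for every key
  await cas_success_keys = db_write_keys;
```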

- Dan

Stephan Merz

Aug 25, 2016, 12:09:24 PM
to tla...@googlegroups.com
Hi Dan,


I understand that this issue is not present in the spec that you posted.

I believe the issue was present in the original spec, but the modifications that you made to cas_sentinels fixed it. In the original spec, a writer can proceed with db_commit if cas_error_keys is empty, even if cas_success_keys is not equal to db_write_keys. Such a state can occur if db_write_keys is non-empty and the writer takes a cas_sentinels step when it is Down. This violates consistency because keys can be modified in the database when they are not Pending in the cache.

You fixed this issue by (a) removing cas_error_keys and comparing cas_success_keys to db_write_keys and (b) changing the if-statement in cas_sentinels to an await.

Thanks for pointing this out. Indeed, I removed the variable cas_error_keys, relying on the putative invariant that it was the complement of cas_success_keys w.r.t. db_write_keys, and overlooking the fact that this need not hold when the server is down. It appears to me that this is a spec error rather than a protocol error: a server that is down should presumably not write to the database (i.e., why would db_commit not be guarded by the same if-statement as get_sentinels or cas_sentinels?), but only Elliott would be able to tell for sure.

Anyway, the current spec only models server startup, but not server failures. It is easy to extend it to the case where servers have persistent memory that ensures that all local variables are restored after recovering from failure, but I don't know if this is the intended behavior.

Best regards,

Stephan
