Yet more code conformance checking


Ognjen Maric

Jun 18, 2025, 9:01:10 AM
to tlaplus
Hi folks,

I thought I'd post about some recent work I did on checking the correspondence between TLA models and code using trace validation (thanks to Jesse Jiryu Davis for the nudge to post this here!). I know that others have done similar things, but the motivation, the details of my approach, and some code artifacts (a Rust logging library for TLA) may still be interesting.

We've used TLA relatively extensively to verify different parts of the blockchain stack (Internet Computer) that we're working on. TLA modeling went pretty well in the beginning. The process was that I'd normally create the model myself based on the code and then sit down with the engineers owning the code to check that the model and code match (both to verify any findings and to make sure I didn't miss any interesting behaviors by overabstracting).

One particularly fruitful area was checking for reentrancy bugs in smart contracts. Reentrancy bugs are a form of concurrency bug that has historically been responsible for some of the highest-profile hacks in blockchains, and our highly concurrent blockchain makes them more likely to occur (we recently published a paper explaining this, for anyone interested [1]). We found multiple such bugs; one of them, in our wrapped Bitcoin token, was potentially worth millions if exploited (the nice thing about blockchains is that it can be easy to put a price tag on a bug).

But lady Fortuna wasn't quite in the mood. Just a couple of weeks after we had found and fixed this bug, successfully verified the model, and gone through it with the lead engineer, someone accidentally reintroduced a variation of the bug while implementing a new feature. As I was done with the model by that point, I didn't catch the change. Luckily we caught it at the last minute via a manual security review, but it left a bitter taste and raised questions about the utility of the TLA exercise.

Since we don't really have the resources to do code-level verification, I started looking for a cheaper solution that could still give us some hope of keeping models and code in sync. The idea of a runtime verification approach (aka trace validation) popped up. I looked around and found the MongoDB paper on extreme modeling, with the recommendation not to do it :) But our situation was different, as our models were built after the code and align with it pretty closely (though not perfectly). So I started working on this on the side and have recently wrapped up the first project, where we check the traces of the smart contracts governing the platform (e.g., deciding on software upgrades or token minting) against their TLA models.

We do this by instrumenting our smart contract code (written in Rust) with certain logging macros that, during testing, log the TLA traces in the background. The macros are only enabled in test builds, not in production. We then observe all the state changes exercised in tests, and, at the end of each test, we check the logs for conformance against the model. You can find the (slightly marketingy) writeup here:

https://medium.com/p/c7854eb9458d

The Rust logging library is available here:

https://github.com/dfinity/ic/tree/57a4ecdad8167ab17508a37195279bc2ce30b12d/rs/tla_instrumentation

A few technical points that may be interesting for this crowd:

1. I ended up using Apalache instead of TLC for the conformance check, even though I used TLC for the actual model checking. The main reason is that the models often have guards of the form `\E num \in 1..MY_CONSTANT`, where MY_CONSTANT would be very small to make TLC model checking feasible, but might be huge in practice. IIUC, TLC would enumerate everything explicitly, whereas Apalache would turn this into an SMT formula that is trivial to check (since `num` is effectively constrained by the pre- or post-state). Using Apalache also had the side benefit that it comes with a type system. I know this is a bit controversial here ;) but in my experience it made it easier to update the models when the code changed, compared to the "vanilla TLA" experience.
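To illustrate (all names here are made up, not from our actual models), a guard of the kind described might look like this:

```tla
\* Hypothetical action: MY_CONSTANT is kept tiny to make TLC model
\* checking feasible, but the corresponding code-level bound may be huge.
Mint ==
    \E amount \in 1..MY_CONSTANT :
        /\ balance' = balance + amount
        /\ total_supply' = total_supply + amount
```

TLC enumerates all MY_CONSTANT choices of `amount`, while Apalache leaves the existential to the SMT solver; given a recorded pre- and post-state, `amount` is already fixed by `balance' - balance`, so the check is cheap.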

2. Instead of checking the entire trace, I ended up checking just that each individual recorded step conforms to the model's transition relation. This way I could ignore the "environment" steps that are required to obtain a full legal trace of the system, and just focus on the "system under test" instead. I also effectively ignored checking the initial states, as I just stipulate the initial state to be equal to the recorded pre-state. This was helpful for tests which would often start from some predefined fixture state, rather than a "clean" initial state. Obviously it's not really sound, but hey, the entire approach is just best-effort.

3. I also ended up with a quirky way to phrase the check of "does this recorded step conform to the transition relation". Given the transition relation `Next(x, y)` (explicitly parameterized here for clarity), and the recorded pre- and post- states s and s', I ended up creating a new model with `Init(x) == x = s` and `Next2(x, y) == y = s' /\ Next(x, y)`, and asking whether this new model deadlocks or not. If it does, then the step doesn't conform to the transition relation. I don't really remember anymore why I took this exact approach. I *think* I was worried about accidentally allowing stuttering where none was expected (i.e., I wanted to disallow having s = s' unless the Next relation really specifies that this is possible), but maybe there are better ways of handling this concern.
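Spelled out, for a recorded pair (s, s') the generated model described above looks roughly like this, where `vars` is the tuple of model variables and the recorded concrete values are inlined in place of s and s':

```tla
\* Sketch of the generated per-step module; s and s' stand for
\* the inlined concrete recorded values.
Init == vars = s                 \* stipulate the recorded pre-state
Next2 == vars' = s' /\ Next      \* force the recorded post-state
```

If no `Next` transition leads from s to s', then `Next2` is unsatisfiable in state s and the model deadlocks immediately, flagging non-conformance. And a stuttering step (s = s') only passes if `Next` itself allows it.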

If anyone has any ideas on improving this, I'm happy to get feedback. In particular, one moderate pain point of my current approach is that it's slow, though it's just about bearable enough to still include in the CI. The combination of points 2 and 3 above means that I end up with one new model (to ask the "deadlock question") for each state pair collected. So checking each pair incurs the JVM startup cost, Apalache pre-processing costs, etc. This could be mitigated by caching the state pairs and not rerunning the checks for cached pairs, but I haven't gotten around to twisting Bazel (our build system) into submission yet.
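A sketch of the caching idea (hypothetical, not our actual Bazel integration): hash each (pre-state, post-state) pair and only invoke the expensive Apalache check for pairs not seen before.

```python
# Sketch only: deduplicate recorded state pairs so that each distinct
# pair pays the JVM startup / Apalache pre-processing cost at most once.
import hashlib
import json

def pair_key(pre, post):
    """Stable hash of a (pre, post) state pair, with states given as
    JSON-serializable dicts of variable values."""
    blob = json.dumps([pre, post], sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()

class ConformanceCache:
    def __init__(self):
        self._seen = {}  # pair_key -> bool (did the pair conform?)

    def check(self, pre, post, run_apalache):
        """run_apalache is the expensive checker; it is only called for
        pairs we haven't checked before."""
        key = pair_key(pre, post)
        if key not in self._seen:
            self._seen[key] = run_apalache(pre, post)
        return self._seen[key]
```

The cache would of course have to be persisted across CI runs to pay off; that is the part where the build system needs twisting.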

Cheers,
Ognjen

[1] https://arxiv.org/abs/2506.05932

A. Jesse Jiryu Davis

Jun 18, 2025, 10:35:02 PM
to tla...@googlegroups.com
Hi Ognjen! I'm leaving this in my inbox until I have some more time to read and think and follow the links. But thanks for sharing this info, it's appreciated.

--
You received this message because you are subscribed to the Google Groups "tlaplus" group.
To unsubscribe from this group and stop receiving emails from it, send an email to tlaplus+u...@googlegroups.com.
To view this discussion visit https://groups.google.com/d/msgid/tlaplus/858881a5-df2d-4a66-8b62-0171919ba2f0n%40googlegroups.com.

A. Jesse Jiryu Davis

Jul 5, 2025, 1:35:49 PM
to tla...@googlegroups.com
OK, I've read this and the Medium post and tla_instrumentation/README.md in your repo. This is fascinating work and a really innovative approach. I think that points 1 and 3 are especially novel.

Regarding the JVM startup cost, I can imagine it's painful. Would it help if Apalache permitted a streaming API so you could interact with it without restarting? Unfortunately Apalache isn't maintained now, I think.

My intern Finn Hackett is also working on trace-checking, using TLC. He'd considered adding a streaming API to TLC so it could a) handle very long traces but only keep a couple states in memory, and b) run trace-checking concurrently with the implementation test, and abort the test soon after the implementation's behavior diverges from the spec, instead of running after the test completes. He decided he didn't need this feature for his project this summer, but it seems like a worthwhile idea someday.

Igor Konnov

Jul 5, 2025, 2:01:20 PM
to tla...@googlegroups.com
Hi Jesse,

On Sat, Jul 5, 2025 at 7:35 PM A. Jesse Jiryu Davis
<je...@emptysquare.net> wrote:
>
> OK, I've read this and the Medium post and tla_instrumentation/README.md in your repo. This is fascinating work and a really innovative approach. I think that points 1 and 3 are especially novel.
>
> Regarding the JVM startup cost, I can imagine it's painful. Would it help if Apalache permitted a streaming API so you could interact with it without restarting? Unfortunately Apalache isn't maintained now, I think.

Apalache has the server mode, which was designed exactly for
model-based testing and avoiding the repetitive startup costs. A
client can interact with the server with GRPC. There is a Python
client [1]. I believe that developing a client for another language
would not be hard.

It's true that Apalache is not under active development. There is a
little bit of maintenance, but no new features. Actually, we wanted to
introduce a transition-wise exploration mode, similar to how Apalache
does it internally, but this was never implemented.

[1] https://github.com/apalache-mc/apalache-chai

Best regards,
Igor

Ognjen Maric

Jul 6, 2025, 9:44:48 AM
to tla...@googlegroups.com, A. Jesse Jiryu Davis
Hi Jesse,

thank you for the comments, appreciated!

On 7/5/25 7:35 PM, A. Jesse Jiryu Davis wrote:
> Regarding the JVM startup cost, I can imagine it's painful. Would it
> help if Apalache permitted a streaming API so you could interact with it
> without restarting? Unfortunately Apalache isn't maintained now, I think

It would definitely shave off a non-negligible part of the overhead.
Unfortunately, since I generate one model per state pair, this still
incurs the Apalache pre-processing overhead for each state pair, which
wouldn't go away with a streaming version. I guess the ideal solution
would be to directly add functionality for checking the conformance of
a state pair with a formula, but I haven't looked at the Apalache
codebase, so I have no idea how difficult that would be.

> My intern Finn Hackett is also working on trace-checking, using TLC.
> He'd considered adding a streaming API to TLC so it could a) handle very
> long traces but only keep a couple states in memory, and b) run trace-
> checking concurrently with the implementation test, and abort the test
> soon after the implementation's behavior diverges from the spec, instead
> of running after the test completes.
I don't really get how the streaming API allows you to lower the memory
footprint, or indeed how to check a trace interactively (if I understand
you correctly) - could you explain?

As for the second point, this wouldn't be a big win for my current use
cases. There, the setup for an implementation test may indeed take a
while, but once the state pairs start being produced the test usually
ends quickly. So starting the checks in parallel wouldn't give me much.
Of course this could be quite different in other cases.

Cheers,
Ognjen

Ognjen Maric

Jul 6, 2025, 9:48:27 AM
to tla...@googlegroups.com, Igor Konnov
On 7/5/25 8:00 PM, Igor Konnov wrote:
> Apalache has the server mode, which was designed exactly for
> model-based testing and avoiding the repetitive startup costs. A
> client can interact with the server with GRPC. There is a Python
> client [1]. I believe that developing a client for another language
> would not be hard.
Thanks - I wasn't aware of this at all. I'll have a look at how I could
go about integrating this when I find some time. Python may already be
enough for my purposes, though a Rust client would likely make my life
easier.

And since you're here :) I take it extending Apalache to just check the
conformance of a state pair with a formula isn't really trivial?

Ognjen

Finn Hackett

Jul 6, 2025, 3:54:46 PM
to tla...@googlegroups.com
Re: the TLC streaming idea, I can clarify. It's in the context of using the Antithesis implementation fuzzer. The basic idea is to expose the trace as an async, lazily-filled sequence value that populates with elements as the system makes progress (still for one execution; it just means we can check things earlier). On its own, streaming data to TLC might only find a counterexample sooner, since it can stop system execution early. If you have a hypervisor that's constantly cloning and branching your system execution (Antithesis), however, running TLC in parallel with the system means potential work deduplication, since you're exploring branches in both the system's state and TLC's in-progress model check (so no rechecking of path prefixes at the end). Likely less useful outside Antithesis, but that was the idea.

Best,
-Finn


Igor Konnov

Jul 7, 2025, 4:32:42 AM
to Ognjen Maric, tla...@googlegroups.com
Hi Ognjen,

On Sun, Jul 6, 2025 at 3:48 PM Ognjen Maric <ognjen...@gmail.com> wrote:
>
> On 7/5/25 8:00 PM, Igor Konnov wrote:
> > Apalache has the server mode [...]
> Thanks - I wasn't aware of this at all. I'll have a look at how I could
> go about integrating this when I find some time. Python may already be
> enough for my purposes, though a Rust client would likely make my life
> easier.

Back then, writing GRPC clients was a bit tedious, but I guess an LLM
could produce a GRPC client in Rust.

> And since you're here :) I take it extending Apalache to just check the
> conformance of a state pair with a formula isn't really trivial?

Currently, Apalache emits traces in JSON [1]. What you are asking for
is whether it would be possible to load two states,
e.g., in the same format as [1], and evaluate them against a predicate
of a preloaded model. This is definitely doable
and it's not really hard to implement for someone who knows the codebase.

We also had a more ambitious RFC-10 [2]. With this RFC, you would be
able to introduce a feedback loop between the system under test and
the model: basically, evaluating the outputs of the system against the
model and producing the next inputs to the SUT. This was never
implemented, though, again, it's a very feasible task.

[1] https://apalache-mc.org/docs/adr/015adr-trace.html
[2] https://apalache-mc.org/docs/adr/010rfc-transition-explorer.html

Best,
Igor

Igor Konnov

Jul 7, 2025, 4:54:13 AM
to Ognjen Maric, tla...@googlegroups.com
I forgot to mention that there is also a simple way to produce TLA+
specs without going via the whole AST/IR route. In a recent project
for Stellar, we experimented with template engines such as EJS, which
are normally used to produce static web sites. It is relatively easy
to write a TLA+ template that generates a TLA+ spec tuned for fixed
inputs. Here is an example [1]. This does not solve the issue of
parsing TLA+ and doing preprocessing each time, but this approach is
quite convenient from the engineer's p.o.v., as one can load
blockchain transactions using the standard SDKs.

[1] https://github.com/freespek/solarkraft/blob/main/doc/case-studies/xycloans/MCxycloans_monitor.ejs.tla
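As a toy illustration of the idea (not the actual Stellar template; module and field names here are made up), an EJS template that bakes a fixed transaction into a TLA+ spec could look like:

```
---- MODULE MC_monitor ----
EXTENDS monitor
\* fixed inputs substituted by the EJS template engine
\* from a transaction `tx` loaded via the blockchain SDK
Amount == <%= tx.amount %>
Sender == "<%= tx.sender %>"
====
```

Rendering the template with a concrete `tx` object yields an ordinary TLA+ module that any of the tools can consume.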

Best,
Igor