AI-written Metamath verifier take II -- Codex turned Mario Carneiro's mm-lean4 into a verifier verifier ++ expanded AI-created unit test suite

42 views

Skip to first unread message

Zarathustra Goertzel

unread,

May 28, 2026, 4:23:13 PMMay 28

to Metamath

We now have a plausibly correct Metamath verifier in Lean, completing Mario Carneiro's mm-lean4. I believe it adheres to the Metamath book's specs and has modes that align with metamath-exe and metamath-knife's interpretations (which differ conservatively).

We have also extended David Wheeler's metamath-test suite with about 70 new unit tests, largely AI-generated.

History: in Sep 25, I was working on a Metamath verifier in Metta, and was curious to see if Codex/Claude could write a MM verifer in Go. They superficially 'succeeded' to pass the current test suites, yet proper inspection of the code led to a lengthy list of omissions thanks to Matthew House when I shared it here.

In vis words, "Writing a correct Metamath verifier requires very careful attention to detail, which it seems that LLMs still struggle with, short of directed hand-holding or extraordinarily thorough test suites."

This would seem to be true: the LLMs are good at error-driven delevopment, but were not good at spec-driven development.

This prompted me to wonder whether we could generate the comprehensive test suite from the specs: asking LLM-driven coding agents to comb the specs and to create unit tests to check each claim in them. Once having the tests, filling in the "verifier gaps" was relatively easy.

However, it woul be cool to go the extra mile: actually verify that a verifier satisfies them in Lean. This should reduce the surface area of potential errors. Codex and Claude Code struggled a lot with getting loop invariants to work, doing the proofs over lists (vs arrays in the implementation), handling parser invariants (instead of adding a bunch of cheap axioms), etc. Eventually, they seem to have gotten something functional.

I tried to look over the main theorems connecting the operational and declerative semantics, which look probably fine, so I think this time they have likely succeeded :)

[Sorry in advanced if it's yet another dud. Most likelly if issues are found, they can be improved upon without too much pain, so it is probably probably approximately correct in the sense of converging toward correctness. They managed to fix up some issues Mario pointed out ^^;]

On my machine, it verifies set.mm in about 42s, which isn't amazing, but isn't horrible.

Reply all

Reply to author

Forward

0 new messages