Codex wrote an MM verifier in Go pretty easily [AI Progress Report]

Zarathustra Goertzel

Sep 16, 2025, 5:03:21 AM
to Metamath
I was curious whether Codex could write a Metamath verifier, so I asked ChatGPT for a list of existing verifiers and which language it thought would be both easy and interesting: it chose Go.

I started from my mmverify.py repo (where I have a too-slow verifier in another weird language, MeTTa), so it had access to reference implementations.

The task was completed pretty easily (perhaps 40-60 minutes of AI work), with me feeding back some error messages. It seems to pass David Wheeler's metamath-test suite.


FWIW, I decided to see whether it could also handle Metamath Zero, and I must say it struggles (even with just .mm0 + .mmu verification).

Matthew House

Sep 16, 2025, 3:45:00 PM
to meta...@googlegroups.com
A few issues I spotted, many of which are nitpicks and some of which are more serious:
  • The verifier does not check that each file only uses printable ASCII + "\t\n\f\r". This can be problematic, since differences in how characters outside this set are handled (e.g., what counts as white space) can lead to discrepancies between verifiers. (A Go sketch of this check follows the list.)
  • The verifier does not check that keyword tokens are separated by white space, e.g., it accepts $($) as a valid comment, or foo$a|- bar$. as a valid $a statement.
  • The verifier does not check that comments do not contain $( or $).
  • The verifier does not check that ${ $} delimiters for blocks are balanced within each file.
  • The verifier does not check that math symbols do not contain $.
  • The verifier accepts a meaningless dangling $ character at the end of a file.
  • The verifier does not check that active constants and variables cannot be redeclared, nor that variables cannot be declared as constants or vice versa. It has no support for variables becoming inactive at the end of their containing block.
  • The verifier does not check that math symbols in $d statements are distinct active variables.
  • The verifier does not check that all labels are unique, nor that all labels are distinct from all declared math symbols.
  • The verifier does not check that typecodes are active constants.
  • The verifier does not check that $f statements use an active variable, nor that all variables in an $e, $a, or $p statement have an active $f statement.
  • The verifier does not check that a variable has at most one active $f statement at a time, nor that every $f statement for a variable has the same typecode.
  • The verifier does not ignore repeated inclusions of the same file. For instance, a file test.mm containing $[ test.mm $] will lead to a runaway memory leak.
  • The verifier does not ignore comments in certain contexts in statements. For instance, it will interpret foo $a $( baz $) |- bar $. as a statement with typecode $( and expression $( baz $) |- bar, which is nonsense and will lead to discrepancies between verifiers.
  • The verifier does not check for integer overflow in compressed-label steps. This allows multiple ways to write step numbers, as well as ways to write the special code Z as a number, which will lead to discrepancies between verifiers. (A decoding sketch follows the list.)
  • The verifier has no support for unknown steps ?. After reading a proof with a ? step, a typical verifier will continue verifying the rest of the database as usual, but warn the user that it contains incomplete proofs.
  • The verifier does not check that a proof cannot use its own label in a step. This allows incorrect proofs such as $[ set.mm $] evil $p |- ph $= ( evil ) AB $. which insist upon themselves.
  • The verifier does not check that typecodes match when substituting variables. This allows incorrect proofs such as $[ set.mm $] evil $p |- ph $= ( vx wn weq wal equid ax-gen ax6 pm2.24ii ) BCZJDZBEAKBJFGBJHI $. which exploit syntax ambiguities. (The last sketch after this list shows where this check lives.)
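
To make the character-set check concrete, here is a minimal Go sketch; the names are invented for illustration and are not from the verifier under discussion:

    package main

    import "fmt"

    // Illustrative sketch only; names invented for this example.
    // checkFileBytes rejects any byte that is neither printable ASCII
    // (0x21..0x7E) nor one of the five characters the Metamath spec
    // treats as white space, so that all verifiers agree on how a file
    // splits into tokens.
    func checkFileBytes(data []byte) error {
        for i, b := range data {
            switch {
            case b >= 0x21 && b <= 0x7E: // printable ASCII (minus space)
            case b == ' ' || b == '\t' || b == '\n' || b == '\f' || b == '\r':
            default:
                return fmt.Errorf("invalid byte 0x%02X at offset %d", b, i)
            }
        }
        return nil
    }

    func main() {
        fmt.Println(checkFileBytes([]byte("foo $a |- bar $.\n"))) // <nil>
        fmt.Println(checkFileBytes([]byte{0x00}))                 // error
    }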
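
For the integer-overflow item, the decoding loop for the letter block of a compressed proof is small enough to sketch in full. Again the names (and the maxStep bound, which is arbitrary) are mine:

    package main

    import (
        "errors"
        "fmt"
    )

    // Illustrative sketch only. Without some bound, a long run of U..Y
    // digits can silently wrap machine integers, giving two spellings
    // for the same step number (or for the Z marker).
    const maxStep = 1 << 20

    // decodeCompressed turns the letter block of a compressed proof
    // into 1-based step numbers, with 0 standing for the Z marker.
    func decodeCompressed(s string) ([]int, error) {
        var steps []int
        n := 0 // accumulates the high-order base-5 digits (U..Y)
        for _, c := range s {
            switch {
            case c >= 'U' && c <= 'Y': // high digit, values 1..5
                n = n*5 + int(c-'U') + 1
                if n > maxStep {
                    return nil, errors.New("step number overflow")
                }
            case c >= 'A' && c <= 'T': // final base-20 digit, values 1..20
                v := n*20 + int(c-'A') + 1
                if v > maxStep {
                    return nil, errors.New("step number overflow")
                }
                steps = append(steps, v)
                n = 0
            case c == 'Z': // tags the previous step for later reuse
                if n != 0 {
                    return nil, errors.New("Z in the middle of a number")
                }
                steps = append(steps, 0)
            default:
                return nil, fmt.Errorf("bad proof character %q", c)
            }
        }
        if n != 0 {
            return nil, errors.New("truncated number at end of proof")
        }
        return steps, nil
    }

    func main() {
        fmt.Println(decodeCompressed("AB"))  // [1 2] <nil>
        fmt.Println(decodeCompressed("UAZ")) // [21 0] <nil>
    }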
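
And for the typecode item, the check belongs where the proof engine matches stack entries against an assertion's floating hypotheses. One more invented-names Go sketch:

    package main

    import "fmt"

    // Illustrative sketch only; names invented for this example.
    // floating is a $f hypothesis: a typecode constant plus a variable.
    type floating struct {
        typecode string
        variable string
    }

    // bindFloating matches popped stack entries against an assertion's
    // $f hypotheses, building the substitution. The crucial check: each
    // entry's first symbol must equal the hypothesis's typecode, so
    // that e.g. a wff cannot be substituted for a set variable.
    func bindFloating(hyps []floating, entries [][]string) (map[string][]string, error) {
        if len(entries) != len(hyps) {
            return nil, fmt.Errorf("expected %d entries, got %d", len(hyps), len(entries))
        }
        subst := make(map[string][]string)
        for i, h := range hyps {
            e := entries[i]
            if len(e) == 0 || e[0] != h.typecode {
                return nil, fmt.Errorf("typecode mismatch for %s: want %s", h.variable, h.typecode)
            }
            subst[h.variable] = e[1:] // the expression, minus its typecode
        }
        return subst, nil
    }

    func main() {
        hyps := []floating{{typecode: "setvar", variable: "x"}}
        _, err := bindFloating(hyps, [][]string{{"wff", "ph"}})
        fmt.Println(err) // typecode mismatch for x: want setvar
    }
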
Writing a correct Metamath verifier requires very careful attention to detail, which it seems that LLMs still struggle with, short of directed hand-holding or extraordinarily thorough test suites. The original mmverify.py code is quite compact for how many details it captures.

Matthew House

Matthew House

Sep 16, 2025, 5:52:27 PM
to meta...@googlegroups.com
Also, the verifier does not check that a proof only uses its own essential hypotheses. This allows incorrect proofs such as $[ set.mm $] evil $p |- ph $= ( idi.1 ) B $. which turn hypotheses into conclusions.
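
The fix is a small scope check at label-resolution time. A rough Go sketch, with invented names (stmtKind, activeHyps, and so on are mine, not from the verifier in question):

    package main

    import "fmt"

    // Illustrative sketch only; names invented for this example.
    type stmtKind int

    const (
        kindAxiom stmtKind = iota
        kindTheorem
        kindFloating  // $f
        kindEssential // $e
    )

    type statement struct {
        kind stmtKind
        // label, math string, frame, ... elided
    }

    // resolveStep looks up a proof-step label. activeHyps holds the
    // labels of the $f/$e hypotheses in scope for the $p statement
    // currently being verified; a hypothesis belonging to some other
    // statement's frame must be rejected.
    func resolveStep(label string, db map[string]*statement, activeHyps map[string]bool) (*statement, error) {
        st, ok := db[label]
        if !ok {
            return nil, fmt.Errorf("unknown label %s", label)
        }
        if (st.kind == kindFloating || st.kind == kindEssential) && !activeHyps[label] {
            return nil, fmt.Errorf("%s is a hypothesis of another statement", label)
        }
        return st, nil
    }

    func main() {
        db := map[string]*statement{"idi.1": {kind: kindEssential}}
        _, err := resolveStep("idi.1", db, map[string]bool{}) // not in scope here
        fmt.Println(err)
    }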

Jim Kingdon

Sep 16, 2025, 6:42:11 PM
to meta...@googlegroups.com, Matthew House
Those all sound like good things to add to the testsuite. (The tests in the metamath-exe and metamath-knife repos are, I think, better developed, although in principle some of their tests could be fed back into a testsuite that could be run on any verifier.)

David A. Wheeler

Sep 17, 2025, 10:32:10 AM
to Metamath Mailing List, Matthew House


> On Sep 16, 2025, at 6:42 PM, Jim Kingdon <kin...@panix.com> wrote:
>
> Those all sound like good things to add to the testsuite.

Patches welcome :-) to:
https://github.com/david-a-wheeler/metamath-test


> (The tests in the metamath-exe and metamath-knife repos are I think better developed, although in principle some of their tests could be fed back to a testsuite which could be run on any verifier).

That was my goal for metamath-test. I also think it's useful to have a "general" test suite.

I'd be happy to contribute that repo to the https://github.com/metamath organization. They're all MIT licensed, allowing anyone to do practically anything except sue the authors. Anyone want me to do that?

> On September 16, 2025 12:44:47 PM PDT, Matthew House <mattllo...@gmail.com> wrote:
> A few issues I spotted, many of which are nitpicks and some of which are more serious:...
...
> Writing a correct Metamath verifier requires very careful attention to detail, which it seems that LLMs still struggle with, short of directed hand-holding or extraordinarily thorough test suites. The original mmverify.py code is quite compact for how many details it captures.


LLMs can be a *great* productivity aid, but only an *aid*. The emerging term is "AI Code Assistant" which I think gives the right flavor. They are helpful *assistants* to humans, but at least so far, they are not *replacements* for humans.

--- David A. Wheeler


Matthew House

Sep 17, 2025, 5:27:08 PM
to David A. Wheeler, Metamath Mailing List
Coincidentally, by running it on one of my test files, I just noticed that mmverify.py suffers from the exact same issue of allowing proofs to reference inactive hypotheses belonging to other theorems. Centralizing all the test suites probably would be a good idea.

Jim Kingdon

Sep 17, 2025, 5:47:53 PM
to meta...@googlegroups.com, Matthew House, David A. Wheeler, Metamath Mailing List
It isn't either/or. Testsuites specific to one tool can have more assertions on error messages and can also test features (e.g. the proof minimizer or web page generation) other than proof verification.

I do appreciate David's encouragement of contributions to metamath-test though, and am thinking of taking a closer look when I get some time.

Zarathustra Goertzel

Sep 18, 2025, 5:31:40 PM
to Metamath

Whoops, it's embarrassing that I didn't look very closely at the code... yes, the $v processing block hardly checks anything.

Implementing an MM verifier as a human, the approach I took was to carefully follow a reference implementation to make sure I didn't miss any details; clearly the LLM doesn't do that. I am sorry 🙏.

I think it'd be great to unify the test suites. I liked how David Wheeler's metamath-test suite made it easy to test multiple verifiers.

I ran into a few ‘gaps’ in mmverify.py's adherence to the specs (and added a few tests to trigger them, some LLM-generated). 

Incidentally, I gave Matthew House’s list to GPT-5, and it found 4 issues that mmverify.py trips up on (and 1 false negative). I’ve uploaded the four test.mm files: https://github.com/zariuq/mmverify.py/tree/pure/adversarial_tests.

  1. It allows $d statements containing a constant symbol.
  2. It allows $d statements with duplicate variables (e.g., $d x x $.). (A sketch of the corresponding check follows this list.)
  3. It allows redeclaring a variable as a constant.
  4. It doesn't handle incomplete proofs that use ?.
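
For items 1 and 2, the missing validation is small; a rough Go sketch under invented names (checkDisjoint, activeVars):

    package main

    import "fmt"

    // Illustrative sketch only; names invented for this example.
    // checkDisjoint validates the body of a $d statement: every token
    // must be an active variable, and no variable may appear twice.
    func checkDisjoint(tokens []string, activeVars map[string]bool) error {
        seen := make(map[string]bool)
        for _, tok := range tokens {
            if !activeVars[tok] {
                return fmt.Errorf("$d: %s is not an active variable", tok)
            }
            if seen[tok] {
                return fmt.Errorf("$d: variable %s repeated", tok)
            }
            seen[tok] = true
        }
        return nil
    }

    func main() {
        vars := map[string]bool{"x": true, "y": true}
        fmt.Println(checkDisjoint([]string{"x", "y"}, vars)) // <nil>
        fmt.Println(checkDisjoint([]string{"x", "x"}, vars)) // error
    }
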
While the tests seem functional in pinpointing behavioral differences between mmverify.py and metamath-exe, I think they're pretty crude (and could be formulated more elegantly by someone who knows MM better than I do).

This does get me thinking about how to verify the verifiers (aside from actually formally verifying them). While the tests are helpful to an extent, I guess one falls back on "humans reading the code to judge whether it's really doing what we want"...

— Zar Goertzel
