Using general-purpose LLMs for proving theorems in Metamath

133 views
Skip to first unread message

savask

unread,
Jun 2, 2026, 11:54:00 AM (5 days ago) Jun 2
to Metamath
I did a few experiments trying to convince a general-purpose LLM (deepseek thinking in fast mode, in my case) to prove a couple of theorems from set.mm. Since metamath stores its proofs in the compressed format, LLMs seemingly have little chance to learn how metamath proofs actually work, so I was surprised to find out that it's actually able to do something useful.

I used the following prompt which already has an example theorem "loaded in". In all my experiments I waited till the model stops "thinking", and then copied the resulting proof into metamath-lamp. For example, for the theorem above the output would be:

|- ( ph -> A e. RR )
|- ( ph -> 0 < A )
|- ( ph -> 0 <_ A )
|- ( ph -> ( sqrt ` A ) e. RR )
|- ( ph -> 0 < ( sqrt ` A ) )
|- ( ph -> ( sqrt ` A ) =/= 0 )
|- ( ph -> ( ( sqrt ` A ) ^ 2 ) = A )
|- ( ph -> ( ( sqrt ` A ) ^ 2 ) = ( ( sqrt ` A ) x. ( sqrt ` A ) ) )
|- ( ph -> ( ( sqrt ` A ) x. ( sqrt ` A ) ) = A )
|- ( ph -> ( A / ( sqrt ` A ) ) = ( ( ( sqrt ` A ) x. ( sqrt ` A ) ) / ( sqrt ` A ) ) )
|- ( ph -> ( ( ( sqrt ` A ) x. ( sqrt ` A ) ) / ( sqrt ` A ) ) = ( sqrt ` A ) )
|- ( ph -> ( A / ( sqrt ` A ) ) = ( sqrt ` A ) )

Since the resulting proof would rarely work as is, I would go over the steps and use the "unify" tool which is similar to metamath.exe's "improve" command and can fill in short proofs. The result is a working metamath proof:

divsqrtid.1 $e |- ( ph -> A e. RR ) $.
divsqrtid.2 $e |- ( ph -> 0 < A ) $.
divsqrtid $p |- ( ph -> ( A / ( sqrt ` A ) ) = ( sqrt ` A ) )
$= ( csqrt cfv cdiv co cmul c2 cexp cc0 0red ltled resqrtcld eqidd int-sqdefd resqcld sqge0d sqrtge0d sqrtsqd sqr11d eqtrd oveq1d eqcomd recnd elrpd sqrtgt0d gt0ne0d divcan3d ) ABBEFZGZHZUKUKIHZUKULHZUKAUOUMAUNBUKULAUNUKJKHZBAU KUKABCALBAMCDNZOZAUKPQAUPBAUKURRAUKURSCUQAUKURABCU QTUAUBUCUDUEAUKUKAUKURUFZUSAUKABABCDUGUHUIUJUC $.

I should say that this theorem is taken from Thierry's mathbox, where it's left without proof as a comment (and has a bad hypothesis 0 <_ A).

Another example, which I just made up, is the following simple arithmetical theorem:

99p11e110 $p |- ( ; 9 9 + ; 1 1 ) = ; ; 1 1 0
$= ( c9 c1 cdc 9nn0 1nn0 eqid caddc co cc0 9p1e10 oveq1i dec10p eqtri decaddc2 ) AZOBZPPPCZOOCZQDZSEZTRFQFOPGZHZPUAHPICZPUAHQUBUCPU AJZKPLMUDN $.

Here the proof a LLM outputs is almost right, but I had to correct the final step manually (it did however correctly identify decaddc2 as the key theorem).

The longest proof I've been able to squeeze out of a LLM is the following, which is also commented out without a proof in Thierry's mathbox:

$d ph x $.
$d ph y $.
$d p x $.
$d p y $.
reuopan $p |- ( E! p e. { <. x , y >. | ( ( x e. A /\ y e. B ) /\ ph ) } ps <-> E! p e. { <. x , y >. | ( x e. A /\ y e. B ) } ( ph /\ ps ) )
$= ( cv wcel wa copab wreu cop wceq wex weu df-reu elopab anbi1i eubii bitri anass bicomi exbii 19.41v biancomi 19.42v an32 ancom 3bitri bitr4i ) BGCHZEIDHZFIJZAJZCDKZLZGHZULUMMNZUNJZDOZCOZABJZJZG PZVCGUNCDKZLZUQUSUOJZDOZCOZBJZGPZVEUQURUPIZBJZGPVL BGUPQVNVKGVMVJBUOCDURRSTUAVKVDGVKAVBJZBJVCVBJVDVJV OBVJUTAJZDOZCOZVOVIVQCVHVPDVPVHUSUNAUBUCUDUDVRAVAJ ZCOVOVQVSCVQAVAUTADUEUFUDAVACUGUAUASAVBBUHVCVBUIUJ TUAVGURVFIZVCJZGPVEVCGVFQWAVDGVTVBVCUNCDURRSTUAUK $.

The proof the model produced in this case had quite a few misplaced brackets, but after these were corrected, the proof worked without any further modifications from my side.

In my experience, the model had the most difficulty with remembering theorem labels (for example, using a right theorem but justifying it with a wrong label), would sometimes hallucinate theorems (like deduction versions of existing theorems), and would often skip over the details like proving that an expression lies in CC.

I'm sure there are other theorems which can be proved with this method. I should say that I'm by no means an experienced user of LLMs, so I'm sure there's plenty room for improvement. The prompt could be better (I used a theorem about Tarskian geometry, but using a theorem close to the one you're targeting to prove is a better idea), one can use a more powerful model, or maybe even set up an agent workflow, where a LLM interacts with a proof assistant like metamath.exe to look up theorems and verify proof steps as it goes. As we all know too well, metamath notoriously lacks almost any proof automation, so this seems like a promising way forward.

What do you think?

bil...@gmail.com

unread,
Jun 2, 2026, 5:20:00 PM (4 days ago) Jun 2
to Metamath
I am surprised that you got some tangible results using a standard LLM.

I have a txt file that has one of the current versions of the Metamath database with the proofs uncompressed. It has 164899 lines and its size is 248 MB. The comments are stripped out.

 I can email it to you if you are interested.

Here is the beginning of the file:

$c ( $.
$c ) $.
$c -> $.
$c -. $.
$c wff $.
$c |- $.
$c & $.
$c => $.
$v ph $.
$v ps $.
$v ch $.
$v th $.
$v ta $.
$v et $.
$v ze $.

Here are the two lines for canth (I put the label second):

$e canth.1 |- A e. _V $.
$p canth |- -. F : A -onto-> ~P A $= cA cA cpw cF wfo vx cv vx cv cF cfv wcel wn vx cA crab cF crn wcel cA cA cpw cF wfo vx cv vx cv cF cfv wcel wn vx cA crab cA cpw cF crn vx cv vx cv cF cfv wcel wn vx cA crab cA cvv canth.1 vx cv vx cv cF cfv wcel wn vx cA ssrab2 elpwi2 cA cA cpw cF forn syl5eleqr cA cA cpw cF wfo vx cv vx cv cF cfv wcel wn vx cA crab cF crn wcel vy cv cF cfv vx cv vx cv cF cfv wcel wn vx cA crab wceq vy cA wrex vy cv cF cfv vx cv vx cv cF cfv wcel wn vx cA crab wceq vy cA vy cv cA wcel vy cv vy cv cF cfv wcel vy cv vx cv vx cv cF cfv wcel wn vx cA crab wcel wb vy cv cF cfv vx cv vx cv cF cfv wcel wn vx cA crab wceq vy cv cA wcel vy cv vy cv cF cfv wcel wn vy cv vx cv vx cv cF cfv wcel wn vx cA crab wcel wb vy cv vy cv cF cfv wcel vy cv vx cv vx cv cF cfv wcel wn vx cA crab wcel wb wn vy cv vx cv vx cv cF cfv wcel wn vx cA crab wcel vy cv cA wcel vy cv vy cv cF cfv wcel wn vx cv vx cv cF cfv wcel wn vy cv vy cv cF cfv wcel wn vx vy cv cA vx cv vy cv wceq vx cv vx cv cF cfv wcel vy cv vy cv cF cfv wcel vx cv vy cv wceq vx cv vy cv vx cv cF cfv vy cv cF cfv vx cv vy cv wceq id vx cv vy cv cF fveq2 eleq12d notbid elrab baibr vy cv vy cv cF cfv wcel vy cv vx cv vx cv cF cfv wcel wn vx cA crab wcel nbbn sylib vy cv cF cfv vx cv vx cv cF cfv wcel wn vx cA crab vy cv eleq2 nsyl nrex cA cA cpw cF wfo cF cA wfn vx cv vx cv cF cfv wcel wn vx cA crab cF crn wcel vy cv cF cfv vx cv vx cv cF cfv wcel wn vx cA crab wceq vy cA wrex wb cA cA cpw cF fofn vy cA vx cv vx cv cF cfv wcel wn vx cA crab cF fvelrnb syl mtbiri pm2.65i $.

savask

unread,
Jun 3, 2026, 1:11:38 AM (4 days ago) Jun 3
to Metamath
> I have a txt file that has one of the current versions of the Metamath database with the proofs uncompressed. ... I can email it to you if you are interested.

Thank you, but I don't think I have a use for such a huge file. It should be easy enough to generate with metamath.exe if needed, anyway.

There was some discussion in the past that perhaps we should host a fully uncompressed set.mm version somewhere, so that LLM models can pick it up in their training data. I think this is unnecessary, especially if the proofs are represented as a sequence of theorem labels like in your example, since the models seem to have a poor memory for exact theorem names.

My impression is that for a LLM to produce a successful proof one should
1) represent the proof as a sequence of formulae, so the model always has exact proof state in its context
2) avoid technical details like theorem labels, step numbers and disjoint conditions which can be filled in automatically later
3) give an example proof in this format in the prompt.
Of course for this plan to succeed one needs a proof assistant which can read the proof as a sequence of formulae, check grammar and fill in missing steps, disjoint statements, step numbers and theorem labels. The only proof assistants I know which can do that to some degree are mmj2 (which seems to be deprecated) and metamath-lamp (which is a bit clunky). Maybe some of these requirements could be relaxed if the model is allowed to interact with the proof assistant directly instead of trying to "one shot" a proof.

Ender Ting

unread,
Jun 3, 2026, 1:46:06 AM (4 days ago) Jun 3
to Metamath
I'd look at using grammar-constrained output (so if we only have the initial axioms, LLM cannot output any tokens but "ax-1"/"ax-2"/...; in an expression, it would make bracket placement automatic).
Unfortunately, it will only work with local models which have somewhat lower quality than state-of-the-art (and those wouldn't remember so many theorem labels either).

Also, some support is needed if LLM skipped some steps and now doesn't know what to reference; I don't know how to let it return to them.
среда, 3 июня 2026 г. в 08:11:38 UTC+3, savask:

bil...@gmail.com

unread,
Jun 3, 2026, 2:18:11 AM (4 days ago) Jun 3
to Metamath
savask, what are the steps that you are using?

It appears that you are giving a prompt to deepseek thinking with one example of a theorem with its hypotheses, statement, and proof.
You then ask it to prove a statement with some hypotheses.

Is that correct?

A) Are you depending on deepseek to have already been trained on Metamath by metamath theorems being in the training corpus of deepseek?
B) In other words, you are not giving deepseek any additional training.

bil...@gmail.com

unread,
Jun 3, 2026, 2:55:27 AM (4 days ago) Jun 3
to Metamath
A) I think letting deepseek (or other LLM) use metamath-lamp as a tool is a very good idea. This would seem to solve the problems of remembering theorems and their labels.

B) For some theorems (like ax-mp, mp2, mp2b, mpd, syl), I don't think the labels are that important and the LLM should be able to handle their use without a problem.

C) Another approach is fine-tuning. But I think the tool approach is more promising.

Sine Flexione

unread,
Jun 3, 2026, 5:21:09 AM (4 days ago) Jun 3
to Metamath

When I read this post, my first thought was: "This is talking about me."

About three months ago, with the help of Codex, I wrote a proof assistant that I called Base2 (because it is based on classical second-order logic), and it does exactly this: axioms, definitions, theorems, and proofs represented simply as lists of formulas.

Codex not only helped me write the proof assistant itself, but it also generated a few proofs. It did so in such a straightforward way that I ended up thinking: "If it can do it with my system, it can probably do it with any system."

But maybe it's not quite that simple. Reading your post, and seeing that progress on Wiedijk's list of 100 theorems doesn't seem to have accelerated much despite the rise of AI agents (I'm speaking generally, not just about Metamath), makes me think the situation may be more complicated.

I had actually started a discussion about my system on Stack Exchange, but it didn't attract much interest. 

https://proofassistants.stackexchange.com/questions/6480/simple-and-cheap-proof-assistants-for-everyone-base2-as-a-case-study

In the end, I even got into an argument with someone there and decided to delete my account altogether.

Anyway, I think I'll continue working on my proof assistant this summer. It was partially inspired by Metamath, although it takes a somewhat different approach.

savask

unread,
Jun 3, 2026, 6:57:50 AM (4 days ago) Jun 3
to Metamath
> I'd look at using grammar-constrained output ... Unfortunately, it will only work with local models which have somewhat lower quality than state-of-the-art

I can't try it out myself (my hardware doesn't allow me to run any decent models whatsoever), but I suspect grammar-constraining theorem labels isn't going to help much. Theorems are routinely renamed in set.mm, so theorem labels the model learned from its training data quickly will become obsolete.

That being said, grammar-constraining the output formulae (with the exception of model's internal thinking monologue) to adhere to set.mm's grammar could work.


> Also, some support is needed if LLM skipped some steps and now doesn't know what to reference; I don't know how to let it return to them.

In my experience the model almost always skips some steps, so the proof assistant needs some capabilities to fill in basic proofs by itself, like metamath.exe's improve command or metamath-lamp's unification tool.

> savask, what are the steps that you are using? ...

For each theorem I'm starting a new fresh chat with the LLM and put the prompt (from the gist link in my first post) in the first message. Then I let the model think and copy its final output into metamath-lamp - unfortunately this step is a bit tedious, as you have to do it line by line, fixing spacing and grammar in the process. Then for each step of the proof I use the "unify" tool of metamath-lamp with depth 4 to fill in the missing proof steps, when it's required.

You're correct that I'm depending on set.mm being in deepseek's training data, and I'm not giving any additional training to it.


>  I think letting deepseek (or other LLM) use metamath-lamp as a tool is a very good idea.

Metamath-lamp is a web GUI application, so it's probably not very well suited for efficient interaction with LLMs. Ideally one needs some sort of console wrapper over metamath.exe or perhaps metamath-knife which can process basic formula-style proofs (as in my first post), fill in missing steps and disjoint statements, and report grammar and proof mistakes back to the LLM. I think metamath.exe almost fits this purpose, but as far as I know it can't figure out disjoint conditions by itself, and it also doesn't check grammar.

> Another approach is fine-tuning. But I think the tool approach is more promising.

Yes, I think we will all agree that a bespoke metamath-focused model would do much better, but I don't think anyone here has the resources to design and train one. I still hope we can figure out a way to use general-purpose LLMs to make formalization of theorems in metamath easier.

Igor Ieskov

unread,
Jun 4, 2026, 7:06:34 PM (2 days ago) Jun 4
to Metamath

I added some automation to mm-lamp to support LLM driven proofs. The idea is based on the fact that LLMs can generate structured responses. I added some generic JavaScript API functions to mm-lamp. Those functions accept JSON objects as input and run some common actions like add/modify/delete steps, invoke the bottom-up prover for a specific step, put current editor state to the clipboard. Then I instructed an LLM what it should do and how it can use those API functions. Mainly I was doing these few actions in a loop:

  1. Send the current editor state to the LLM. This is as simple as clicking a button in mm-lamp, which puts current state to a clipboard. Then open LLM UI interface and paste the clipboard content.

  2. Get the response from LLM which includes a JSON object in the expected format (from my instruction).

  3. Copy-paste the JSON object to mm-lamp. This made mm-lamp to execute some action.

  4. Go to step 1.


Here are files which I developed:

API functions - https://github.com/expln/metamath-lamp/blob/d6ea38a41fe2439bb956fe9c8a52f6ac15b99eb7/src/metamath/api/macros/llm_api_set_mm.js

Instructions for an LLM - https://github.com/expln/metamath-lamp/blob/d6ea38a41fe2439bb956fe9c8a52f6ac15b99eb7/src/metamath/api/macros/llm_prompt.md


And here is my dialog with an LLM where I asked it to find and prove the solution for a simple equation   - https://claude.ai/share/d46eb564-cac5-45fd-a949-02332e5e1dbc 


(Some of my replicas in the chat were big, so they are hidden in the dialog. The very first my replica is one of them. But it is just the content of the md file with the instruction mentioned above.)


The dialog was very long. I think I could prove the same myself much faster. I had to do some manual corrections in the mm-lamp editor. However, some of them were caused by a bug in mm-lamp. But eventually the LLM succeeded to develop a proof:


${

    hyp1 $e |- X e. CC $.

    hyp2 $e |- ( ( ; 4 5 x. X ) + 8 ) = 3 $.

    goal

        $p |- X = ( -u 1 / 9 )

        $= ( c3 c8 cmin co c4 c5 cdc cdiv c1 cneg c9 4nn0 5nn decnncl nncni nnne0i cmul mulcli 8cn mvlraddi mvllmuli wceq 3cn negsubdi2i eqcomi 5cn gbpart8 mvrladdi negeqi eqtri oveq1i w3a cc wcel a1i cc0 wne divnegd eqeq2d 4animp1 eel0000 9t5e45 oveq2i wtru 9cn 9ne0 5ne0 divcan8d mptru wn 1cnd pm2.18i ) ADZEZFZGZHZIZJZKZGZLZMNZWCGZWBAVSWBVTWAOPQZRZBWBWH SZWBATZGVQVPWBAWIBUAUBZCUCUDWDWEWFWCGZMZWGWDWAWBWC GZMZWNWDWAMZWBWCGZUEZWSWSWSWDWPUEZVSWQWBWCVSVQVPVR GZMZWQXBVSVQVPWLUFZUGUHXAWAVQVPWAXCUIZUJUKULUMUNZX EXEXEWSWSWSWSWTWSWSWSUOZWPWRWDXFWAWBWAUPZUQZXFXDUR WBXGUQXFWIURWBUSZUTXFWJURVAVBVCVDWOWMWOWAWFWAWKGZW CGZWMXKWOXJWBWAWCVEVFUHXKWMUEVGZWFWAWFXGUQZXLVHZUR XHXLXDURWFXIUTZXLVIZURWAXIUTXLVJURVKVLUMULUMWNWGUE ZXQVMZWEWFXRVNXMXRXNURXOXRXPURVAVOUMUM $.

$}


Currently I don’t provide an instruction how to reproduce this due to lack of time. But this is not difficult. I will provide an instruction in a few days as soon as I have free time to write it.


-

Igor

savask

unread,
Jun 5, 2026, 1:05:49 PM (2 days ago) Jun 5
to Metamath
The dialog was very long. I think I could prove the same myself much faster. I had to do some manual corrections in the mm-lamp editor. However, some of them were caused by a bug in mm-lamp. But eventually the LLM succeeded to develop a proof: ... That's very impressive. I have looked through the chat, and it does indeed seem long. I also noticed that you occasionally helped the process, i.e. you were not mechanically transferring JSONs from mm-lamp to Claude and vice versa. What do you think was the main reason why it stumbled on so many steps? Was it because it relied on "prove" too much, instead of adding intermediate steps of its own? Do you think it would be able to prove the theorem all by itself, if you didn't provide any human input, i.e. ran it in a real "agent loop"? I also tried proving this theorem but with the setup I described in the first post. I used the same prompt, but I replaced the example theorem by decpmulnc, since it uses arithmetic. The resulting proof was almost correct (again, up to using the default bottom-up prover with depth 4), but there were a couple of hiccups. Instead of fixing them myself, I also tried prodding the model like this:

Me:
Step
|- ( 8 + -u 5 ) = 3
doesn't follow from previous steps. Provide missing steps.

Model:
|- ( 5 + 3 ) = 8
|- ( 3 + 5 ) = 8
|- 3 = ( 8 - 5 )
|- ( 8 - 5 ) = ( 8 + -u 5 )
|- 3 = ( 8 + -u 5 )
|- ( 8 + -u 5 ) = 3

It sure could be done more efficiently, but in the end I produced the proof in perhaps 15 minutes (I admit that an experienced metamath user would be faster than that ...).

This is a bit offtopic, but, Igor, is it possible to add a feature to mm-lamp which allows adding several proof steps in bulk? It gets a bit tedious to press the + button and enter each line from the model's output manually. Maybe that could be a Javascript macros which accepts text input with one step per line?

Igor Ieskov

unread,
Jun 5, 2026, 2:37:51 PM (2 days ago) Jun 5
to Metamath
> is it possible to add a feature to mm-lamp which allows adding several proof steps in bulk? It gets a bit tedious to press the + button and enter each line from the model's output manually. Maybe that could be a Javascript macros which accepts text input with one step per line?

Yes, it is possible, I will do it soon as well as answer all other questions. But currently I am a bit short on time. I am replying just to give you an idea that you can export the editor state to json, programmatically add new steps to it, and import the json back to the editor. I guess an LLM should be able to write such a program (or a script) to add new statements to the middle of a json editor state.

-
Igor

savask

unread,
Jun 6, 2026, 11:04:32 AM (18 hours ago) Jun 6
to Metamath
> I am replying just to give you an idea that you can export the editor state to json, programmatically add new steps to it, and import the json back to the editor.

Thanks, I tried it, and indeed the JSON format of mm-lamp was easy enough to understand to make a small script.

I did a couple more experiments, for example, I tried reproving (re- since the proof is already in set.mm) theorem goldrapos by Ender:

  ${
    goldraposLLM.1 $e |- F = ( 2 x. ( cos ` ( _pi / 5 ) ) ) $.
    goldraposLLM $p |- 0 < F $=
      ( cc0 c2 cpi c5 cdiv co ccos cfv wbr wcel w3a wa divgt0ii wb ax-mp pm3.2i
      mpbi 3pm3.2i cmul clt 2re cr pire 5re 5ne0 redivcli recoscld pm2.18i 2pos
      wn a1i cneg cioo pipos halfpire lt0neg2 5pos neghalfpire 0re lttri ltdiv2
      2lt5 cxr neghalfpirx rexri elioo2 mpbir cosq14gt0 mulgt0ii eqcomi breq2i
      ) CZDZEZFZGZHZIJZUAHZUBZKVNAWBKVOVTUCZVTUDZLZWEULZVSVSWDLZWFVPVQUEZUFZUGU
      HZUMUIUJUKZVSVPVOVRHZUNZWLUOHLZVNVTWBKWNWGWMVSWBKZVSWLWBKZMZWGWOWPWJWMVNW
      BKZVNVSWBKZNWOWRWSVNWLWBKZWRVPVOWHWCUPZWKOWLWDLWTWRPUQZWLURQSVPVQWHWIXAUS
      ZORWMVNVSUTVAWJVBQVOVQWBKZWPVDVOWDLZVNVOWBKZNZVQWDLZVNVQWBKZNZVPWDLZVNVPW
      BKZNZMXDWPPXGXJXMXEXFWCWKRXHXIWIXCRXKXLWHXARTVOVQVPVCQSTWMVEZLZWLXNLZNWNW
      QPXOXPVFWLXBVGRWMWLVSVHQVIVSVJQVKWAAVNWBAWABVLVMS $.
  $}

The model succeeded after a bit of prodding, but the resulting proof is a bit longer than the one by the original author. I think it's safe to assume that this theorem wasn't in the training data, since it's rather new.

I've also noticed in some of my experiments that the model would start formatting proofs in the mmj2 format like "10:5,8:pm3.2i |- ...", so perhaps mmj2 source code and files got in the training data and somewhat influenced the ability of the model to reason in mmj2 style.


Thierry Arnoux

unread,
Jun 6, 2026, 12:38:14 PM (16 hours ago) Jun 6
to meta...@googlegroups.com, savask
On 06/06/2026 17:04, savask wrote:

> I've also noticed in some of my experiments that the model would start
> formatting proofs in the mmj2 format like "10:5,8:pm3.2i |- ...", so
> perhaps mmj2 source code and files got in the training data and
> somewhat influenced the ability of the model to reason in mmj2 style.

I asked Deepseek to generate MMP format / mmj2 style proofs which I
could directly paste into the editor (I use "Yamma"), and indeed it did.


Igor Ieskov

unread,
Jun 6, 2026, 5:50:55 PM (11 hours ago) Jun 6
to Metamath
> What do you think was the main reason why it stumbled on so many steps?
> Was it because it relied on "prove" too much, instead of adding intermediate steps of its own?


Yes, I think it is because it "overestimated" abilities of the bottom-up prover and followed my instruction to use it extensively.


> Do you think it would be able to prove the theorem all by itself, if you didn't provide any human input, i.e. ran it in a real "agent loop"?

I don't think so, with current set of API functions and current initial prompt. I guess the prompt induced it to rely so much on the "prove" function. But with a more precise prompt and additional API functions, maybe there is a chance that it could prove the theorem without human intervention. There was one interesting moment during the proof process, when the model added a temporary statement to test what works. When the statement was proved it made some conclusion and removed the test statement. But I didn't mention that it can do such things. I am wondering how it will behave if I add new functions that will allow it to explore existing database (for example, to search by pattern).

I experimented with theorems from your initial post.
divsqrtid was proved quite straightforward without my help - https://claude.ai/share/b647a86b-cc63-49fd-a98b-524e6abc10f4
But it could not prove 99p11e110 and reuopan.

Anyway, if anyone wants to experiment with the agentic approach using mm-lamp (version 32), I've updated the JavaScript code and the prompt I used:
https://github.com/expln/metamath-lamp/blob/develop/src/metamath/api/macros/llm_api_set_mm.js
https://github.com/expln/metamath-lamp/blob/develop/src/metamath/api/macros/llm_prompt.md

I've also written an instruction how to install macros in mm-lamp:
https://github.com/expln/metamath-lamp-docs/blob/master/mm_lamp_versions/v32/macros/creating_a_simple_macro.md

The actual instruction starts with the text "To persist macros so that they survive page reloads, do the following". You will need to enter the text "LLM driven proofs" into the "Module name" field and put the entire content of llm_api_set_mm.js to the "Script" field when installing llm_api_set_mm.js.


> is it possible to add a feature to mm-lamp which allows adding several proof steps in bulk?

I implemented such a feature.
https://github.com/expln/metamath-lamp/blob/develop/src/metamath/api/macros/add_steps_in_bulk_example.js
You will need to enter the text "Add steps in bulk" into the "Module name" field and put the entire content of add_steps_in_bulk_example.js to the "Script" field when installing it. Once installed, you can use that feature via the "Run a macro" button (a triangle-shaped button). If no steps selected in the editor, then new steps will be added at the very end. If one step is selected, then new steps will be added before it.

-
Igor

savask

unread,
2:29 AM (3 hours ago) 2:29 AM
to Metamath
> I don't think so, with current set of API functions and current initial prompt. I guess the prompt induced it to rely so much on the "prove" function.

I'm repeating myself, but I think the key thing is that the model shouldn't think about theorem names or justifications, it should produce a list of formulas. That list will almost inevitably contain gaps, but these can be later closed with the "prove" command. I've read some parts of the thinking traces of deepseek when it tries to prove a theorem, and it often tries to remember theorem labels and put step numbers, but it almost always gets these wrong (while the resulting formulas turn out to be correct).

That's why I think, for example, that producing proofs in the .mmp format (mmj2) is a bad idea. The model would spread its scarce thinking resources on trying to remember theorem labels which will turn out to be wrong anyway.

Maybe a better prompting strategy would be a combination of my prompt and yours: first make the model produce a list of formulas, load it into mm-lamp and then proceed in an agentic loop trying to close all the remaining holes.


> I am wondering how it will behave if I add new functions that will allow it to explore existing database (for example, to search by pattern).

I wonder if current top-grade models are powerful enough to put the whole of set.mm into their context window. set.mm is about 6M words long, let's say 2/3 of these are comments and compressed proofs which can be safely removed. We are still left with 2M words which is larger than the 1M tokens context window the best models have, and this doesn't account for a fact that one word isn't always one token.

There must be some smart way to supply the model with the theorems it might need in its proof. In my experiments I tried changing the example proof with a proof of a related theorem (for example, for 99p11e110 I would use a small proof of another arithmetic theorem), and also giving some theorem statements explicitly in the initial prompt, i.e. "You can use the following theorems in your proof (given in set.mm notation) ...". That's, of course, not purely automatic and requires human expertise.

> I implemented such a feature. ...

Thanks a lot! I also have a couple of quibbles about the bottom up prover, but I'll probably express them in a github issue in the mm-lamp repository.
Reply all
Reply to author
Forward
0 new messages