As you know, AI (artificial intelligence) / ML (machine learning)
systems have *already* generated Metamath proofs.
However, the recent breakthroughs in AI/ML, particularly in
large language models (LLMs) like GPT-3 & GPT-4 (which is really multimodal),
make me hope that these newer systems can create (or help create) proofs for the
Metamath system, such as for the set.mm and iset.mm databases.
I'm hoping some people reading this email will be doing that!
I've had some experience over the years with AI/ML systems, so I have
some speculations about how we might help such systems be more likely to
succeed. Like all speculations, they may not be true; they are
hypotheses based on educated guesswork. Still, hypotheses are a good
starting point for experimentation, and I'm hoping that some of them
will be correct & lead to better proofs.
Some of this thinking was inspired by the paper by Bubeck et al,
"Sparks of Artificial General Intelligence: Early experiments with GPT-4"
<https://arxiv.org/abs/2303.12712>. You might want to read it and/or see the summary video
"'Sparks of AGI' - Bombshell GPT-4 Paper: Fully Read w/ 15 Revelations"
by AI explained <https://www.youtube.com/watch?v=Mqg3aTGNxZ0>.
--- David A. Wheeler
=== Speculations ===
1. Many of the newer generative models have an odd limitation: they start
generating text without knowing where they'll end up as a conclusion.
This is highlighted by Bubeck et al several times as a "lack of planning".
One way to address this limitation today would be to train the system & have it
generate Metamath proofs "backwards" (a small sketch follows this item). That is,
the first statement is the goal, followed (repeatedly) by the items supporting it.
Many humans develop proofs this way anyway (start with what you want, then work backwards).
Of course, this is a general problem, and there may be general solutions.
"Memory Augmented Large Language Models are Computationally Universal"
by Dale Schuurmans <https://arxiv.org/abs/2301.04589> shows that
"transformer-based large language models are computationally universal when augmented with an external memory."
Even with such augmentations, it may still be easier for proof generation
tools to work "backwards" by default.
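To make the "backwards" idea concrete, here's a minimal Python sketch;
the step format and function name are made up for illustration:

    # Minimal sketch: emit a Metamath proof "goal first" for training data.
    # Assumes each proof is available as an ordered list of
    # (label, statement) pairs ending with the final goal;
    # this format is hypothetical.
    def goal_first(steps):
        """Return the steps reversed, so the goal comes first."""
        return list(reversed(steps))

    proof = [
        ("min",   "|- ph"),
        ("maj",   "|- ( ph -> ps )"),
        ("ax-mp", "|- ps"),
    ]
    for label, statement in goal_first(proof):
        print(label, statement)
    # Prints the ax-mp goal first, then the maj & min steps supporting it.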
2. More training data == better results in general.
Newer systems & models can learn more quickly, but it's still true that more training
data tends to produce better results. Therefore, I think it'd be helpful to
*automatically* generate proofs with a variety of terms to help the system
learn the underlying rule. For example, set.mm's ax-mp says:
min ⊢ 𝜑
maj ⊢ (𝜑 → 𝜓)
ax-mp ⊢ 𝜓
... but it would be easy to generate examples that instantiate the general assertion
with specific terms (see the sketch at the end of this item). If you don't care exactly
what you're proving, as long as it's true, you can generate lots of true
statements :-). This would give the systems more to work with.
Databricks says that their work "suggests that much of the qualitative gains in state-of-the-art models like ChatGPT may owe to focused corpuses of instruction-following training data, rather than larger or better-tuned base models."
<https://www.databricks.com/blog/2023/03/24/hello-dolly-democratizing-magic-chatgpt-open-models.html>
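Here's a minimal Python sketch of such a generator (the wff list and the
output format are made up, and a real generator would also have to supply
proofs of the min & maj hypotheses):

    import itertools

    # A few concrete wffs to substitute for ph and ps; purely illustrative.
    wffs = ["( 2 + 2 ) = 4", "A e. CC", "x = x"]

    def axmp_example(ph, ps):
        """One instantiated ax-mp training example."""
        return [
            ("min",   f"|- {ph}"),
            ("maj",   f"|- ( {ph} -> {ps} )"),
            ("ax-mp", f"|- {ps}"),
        ]

    # Every ordered pair of distinct wffs yields one example.
    for ph, ps in itertools.permutations(wffs, 2):
        for label, statement in axmp_example(ph, ps):
            print(label, statement)
        print()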
3. Critical reasoning lacks an “inner dialogue”. Newer models *are* able to build
internal models to answer questions to a limited extent, but they're not great at it,
and this really hurts their mathematical capabilities. See the Bubeck paper for more.
I think this can be helped by training the system *not*
just on the sequence of steps, but in a way that shows the *results* of
each intermediate step (the text after "|-" in set.mm and iset.mm). This provides the
models with more information on the internal state, instead of having to figure it out.
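For instance, a training record might expose each step's "|-" result
explicitly; a minimal sketch (the record layout is hypothetical, not an
established format):

    # One training record: step labels *and* the resulting "|-" statements.
    training_record = {
        "goal": "|- ps",
        "steps": [
            {"label": "min",   "result": "|- ph"},
            {"label": "maj",   "result": "|- ( ph -> ps )"},
            {"label": "ax-mp", "result": "|- ps"},
        ],
    }

    # Flatten to one line of training text that shows the internal state.
    text = " ; ".join(f"{s['label']} {s['result']}"
                      for s in training_record["steps"])
    print(text)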
4. Starting with existing models is probably the better way forward.
Instead of starting from scratch, I'd start with a pre-trained model.
Those seem to have worked out broader concepts of locality that make results
more likely to be good.
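As a minimal sketch of what "start pre-trained" looks like in practice,
using the Hugging Face transformers library ("gpt2" is just a stand-in for
whatever base model you'd actually pick):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load pre-trained weights instead of initializing from scratch;
    # fine-tuning on Metamath proof text would continue from here.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("|- ( ph -> ph )", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0]))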
5. Supporting tool use might be helpful.
One of the more remarkable capabilities of GPT-4 & friends is that they can invoke
*other* tools. Just like tool use by humans, tool use by LLMs lets them handle
far more complex problems than they could otherwise. Pre-created "tools" that
automatically prove common cases, or something that can generate solutions for
tasks an LLM would otherwise struggle with (like a SAT solver or a numeric
calculation prover), might make hard tasks trivial.
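A minimal sketch of the dispatch loop such a harness might use (the CALL
syntax and the stub tool are hypothetical):

    def sat_solver(formula):
        # Imagine a real SAT solver invoked here; this stub is illustrative.
        return "satisfiable"

    TOOLS = {"sat": sat_solver}

    def handle(model_output):
        # If the model requested a tool, run it and return the result so
        # the harness can feed it back into the model's context.
        if model_output.startswith("CALL "):
            _, name, arg = model_output.split(" ", 2)
            return "TOOL RESULT: " + TOOLS[name](arg)
        return model_output

    print(handle("CALL sat ( p \\/ -. p )"))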
6. In the long run this may be completely doable on a single computer or even a phone.
There are also many efforts to create open models that can be used for any purpose
(OpenAI doesn't create any open models; its name is rather misleading).
Really, there's a *lot* of work just *starting* to make these models train and/or run on
single (disconnected) systems instead of servers. Historically, making things work at *all* usually
comes first; once something works, others take steps to make them more efficient/cheaper.
The same is true here. There are lots of discussions about this, e.g.:
https://news.ycombinator.com/item?id=35141531
https://news.ycombinator.com/item?id=35184985
https://www.databricks.com/blog/2023/03/24/hello-dolly-democratizing-magic-chatgpt-open-models.html
I *am* concerned about problems of AI safety and especially AI alignment.
In the short term, we're going to be overwhelmed by spam & propaganda, making truth
about the world even harder to determine.
In the long term, I worry about AI alignment; the "universal paperclip" is now a plausible long-term scenario.
However, I see no issue in creating tools to generate proofs that are then
separately verified, so I'd love to see that.
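For example, generated proofs could be checked by running the standard
metamath verifier as a separate process; a minimal sketch, assuming the
metamath executable is on the PATH:

    import subprocess

    # Read the (possibly machine-generated) database & verify every proof.
    result = subprocess.run(
        ["metamath", "read set.mm", "verify proof *", "exit"],
        capture_output=True, text=True)

    # Metamath flags problems with lines beginning "?"; treating any such
    # line as a failure is a simplifying assumption about its output.
    ok = not any(line.startswith("?")
                 for line in result.stdout.splitlines())
    print("proofs verified" if ok else "verification failed")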