As you know, AI (artificial intelligence) / ML (machine learning)
systems have *already* generated Metamath proofs.
However, the recent breakthroughs in AI/ML, particularly in
large language models (LLMs) like GPT-3 & GPT-4 (which is really multimodal),
make me hope that these newer systems can create (or help create) proofs for the
Metamath system, such as for the set.mm and iset.mm databases.
I'm hoping some people reading this email will be doing that!
I've had some experience over the years with AI/ML systems, so I have
some speculations about how we might help such systems be more likely to
succeed. Like all speculations, they may not be true; they are
hypotheses based on educated guesswork. Still, hypotheses are a good
starting point for experimentation, and I'm hoping that some of them
will be correct & lead to better proofs.
Some of this thinking was inspired by the paper by Bubeck et al,
"Sparks of Artificial General Intelligence: Early experiments with GPT-4"
<https://arxiv.org/abs/2303.12712>. You might want to read it and/or see the summary video
"'Sparks of AGI' - Bombshell GPT-4 Paper: Fully Read w/ 15 Revelations"
by AI explained <https://www.youtube.com/watch?v=Mqg3aTGNxZ0>.
--- David A. Wheeler
=== Speculations ===
1. Many of the newer generative models have an odd limitation: they start
generating text without knowing where they'll end up as a conclusion.
This is highlighted by Bubeck et al several times as a "lack of planning".
One way to address this limitation today would be to train the system & have it
generate Metamath proofs "backwards" (a small sketch follows this item). That is,
the first statement is the goal, followed (repeatedly) by the items supporting it.
Many humans develop proofs this way anyway (start with what you want, then work backwards).
Of course, this is a general problem, and there may be general solutions.
"Memory Augmented Large Language Models are Computationally Universal"
by Dale Schuurmans <https://arxiv.org/abs/2301.04589> shows that
"transformer-based large language models are computationally universal when augmented with an external memory."
Even with such augmentations, it may still be easier for proof generation
tools to work "backwards" by default.
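To make the "backwards" idea concrete, here's a minimal Python sketch;
the step format and function name are made up for illustration:

    # Minimal sketch: emit a Metamath proof "goal first" for training data.
    # Assumes each proof is available as an ordered list of
    # (label, statement) pairs ending with the final goal;
    # this format is hypothetical.
    def goal_first(steps):
        """Return the steps reversed, so the goal comes first."""
        return list(reversed(steps))

    proof = [
        ("min",   "|- ph"),
        ("maj",   "|- ( ph -> ps )"),
        ("ax-mp", "|- ps"),
    ]
    for label, statement in goal_first(proof):
        print(label, statement)
    # Prints the ax-mp goal first, then the maj & min steps supporting it.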
2. More training data == better results in general.
Newer systems & models can learn more quickly, but it's still true that more training
data tends to produce better results. Therefore, I think it'd be helpful to
*automatically* generate proofs with a variety of terms to help the system
learn the underlying rule. For example, set.mm's ax-mp says:
min ⊢ 𝜑
maj ⊢ (𝜑 → 𝜓)
ax-mp ⊢ 𝜓
... but it would be easy to generate examples that instantiate the general assertion
with specific terms (see the sketch at the end of this item). If you don't care exactly
what you're proving, as long as it's true, you can generate lots of true
statements :-). This would give the systems more to work with.
Databricks says that their work "suggests that much of the qualitative gains in state-of-the-art models like ChatGPT may owe to focused corpuses of instruction-following training data, rather than larger or better-tuned base models."
<https://www.databricks.com/blog/2023/03/24/hello-dolly-democratizing-magic-chatgpt-open-models.html>
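Here's a minimal Python sketch of such a generator (the wff list and the
output format are made up, and a real generator would also have to supply
proofs of the min & maj hypotheses):

    import itertools

    # A few concrete wffs to substitute for ph and ps; purely illustrative.
    wffs = ["( 2 + 2 ) = 4", "A e. CC", "x = x"]

    def axmp_example(ph, ps):
        """One instantiated ax-mp training example."""
        return [
            ("min",   f"|- {ph}"),
            ("maj",   f"|- ( {ph} -> {ps} )"),
            ("ax-mp", f"|- {ps}"),
        ]

    # Every ordered pair of distinct wffs yields one example.
    for ph, ps in itertools.permutations(wffs, 2):
        for label, statement in axmp_example(ph, ps):
            print(label, statement)
        print()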
3. Critical reasoning lacks an “inner dialogue”. Newer models *are* able to build
internal models to answer questions to a limited extent, but they're not great at it,
and this really hurts their mathematical capabilities. See the Bubeck paper for more.
I think this can be helped by training the system *not*
just on the sequence of steps, but in a way that shows the *results* of
each intermediate step (the text after "|-" in set.mm and iset.mm). This provides the
models with more information on the internal state, instead of having to figure it out.
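For instance, a training record might expose each step's "|-" result
explicitly; a minimal sketch (the record layout is hypothetical, not an
established format):

    # One training record: step labels *and* the resulting "|-" statements.
    training_record = {
        "goal": "|- ps",
        "steps": [
            {"label": "min",   "result": "|- ph"},
            {"label": "maj",   "result": "|- ( ph -> ps )"},
            {"label": "ax-mp", "result": "|- ps"},
        ],
    }

    # Flatten to one line of training text that shows the internal state.
    text = " ; ".join(f"{s['label']} {s['result']}"
                      for s in training_record["steps"])
    print(text)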
4. Starting with existing models is probably the better way forward.
Instead of starting from scratch, I'd start with a pre-trained model.
Those seem to have worked out broader concepts of locality that make results
more likely to be good.
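As a minimal sketch of what "start pre-trained" looks like in practice,
using the Hugging Face transformers library ("gpt2" is just a stand-in for
whatever base model you'd actually pick):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    # Load pre-trained weights instead of initializing from scratch;
    # fine-tuning on Metamath proof text would continue from here.
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    inputs = tokenizer("|- ( ph -> ph )", return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0]))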
5. Supporting tool use might be helpful.
One of the more remarkable capabilities of GPT-4 & friends is that they can invoke
*other* tools. Just like tool use by humans, tool use by LLMs lets them handle
far more complex problems than they could otherwise. Pre-created "tools" that
automatically prove common cases, or something that can generate solutions for
tasks an LLM would otherwise struggle with (like a SAT solver or a numeric
calculation prover), might make hard tasks trivial.
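A minimal sketch of the dispatch loop such a harness might use (the CALL
syntax and the stub tool are hypothetical):

    def sat_solver(formula):
        # Imagine a real SAT solver invoked here; this stub is illustrative.
        return "satisfiable"

    TOOLS = {"sat": sat_solver}

    def handle(model_output):
        # If the model requested a tool, run it and return the result so
        # the harness can feed it back into the model's context.
        if model_output.startswith("CALL "):
            _, name, arg = model_output.split(" ", 2)
            return "TOOL RESULT: " + TOOLS[name](arg)
        return model_output

    print(handle("CALL sat ( p \\/ -. p )"))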
6. In the long run this may be completely doable on a single computer or even a phone.
There are also many efforts to create open models that can be used for any purpose
(OpenAI doesn't create any open models; its name is rather misleading).
Really, there's a *lot* of work just *starting* to make these models train and/or run on
single (disconnected) systems instead of servers. Historically, making things work at *all* usually
comes first; once something works, others take steps to make them more efficient/cheaper.
The same is true here. There are lots of discussions about this, e.g.:
https://news.ycombinator.com/item?id=35141531
https://news.ycombinator.com/item?id=35184985
https://www.databricks.com/blog/2023/03/24/hello-dolly-democratizing-magic-chatgpt-open-models.html
I *am* concerned about problems of AI safety and especially AI alignment.
In the short term, we're going to be overwhelmed by spam & propaganda, making truth
about the world even harder to determine.
In the long term, I worry about AI alignment; the "universal paperclip" is now a plausible long-term scenario.
However, I see no issue in creating tools to generate proofs that are then
separately verified, so I'd love to see that.
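For example, generated proofs could be checked by running the standard
metamath verifier as a separate process; a minimal sketch, assuming the
metamath executable is on the PATH:

    import subprocess

    # Read the (possibly machine-generated) database & verify every proof.
    result = subprocess.run(
        ["metamath", "read set.mm", "verify proof *", "exit"],
        capture_output=True, text=True)

    # Metamath flags problems with lines beginning "?"; treating any such
    # line as a failure is a simplifying assumption about its output.
    ok = not any(line.startswith("?")
                 for line in result.stdout.splitlines())
    print("proofs verified" if ok else "verification failed")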