GPT-f paper


Stanislas Polu

Sep 9, 2020, 4:47:00 AM
to meta...@googlegroups.com
Hi!

We finally released a paper describing our methodology for the proof assistant.

It's available on arXiv here: https://arxiv.org/abs/2009.03393

Any feedback, suggestions, or questions are of course welcome!

A special shout-out to David Wheeler and Mario Carneiro in the
acknowledgements, which I want to reiterate here:

""
Finally, the authors would like to thank the whole Metamath community
for their support, feedback, and encouragement, in particular, David
A. Wheeler for his motivating enthusiasm and Mario Carneiro for his
precious help on a wide variety of technical questions.
""

Hope you enjoy it.

-stan

David A. Wheeler

Sep 9, 2020, 9:04:19 AM
to 'Stanislas Polu' via Metamath
I definitely enjoyed it. I think that's the first time I've been cited for motivating enthusiasm, but I'll take it :-).


> GPT-f found new short proofs that were accepted into the main Metamath library, which is to our knowledge, the first time a deep learning based system has contributed proofs that were adopted by a formal mathematics community.

This seems to me to be a remarkable claim, especially since creating formal proofs is something even relatively early computers were doing. There are many other automated provers: the four-color theorem was formally proved in Coq, and Prover9 found a previously unknown algebraic proof.

Yet I do not know of a counterexample. Does anyone?

I was intrigued that you could only add such a small percentage of synthetic data before it started making things worse. That seems unfortunate, because current machine learning algorithms generally need a large number of examples to learn from. Is it clear what the problem is? Are there obvious ways to overcome it? If the problem is that the synthetic data is too different from the existing training data, maybe follow-on work could learn from that data, generate new examples, and feed them back as more training data; the new examples might then be more similar to the existing set. This doesn't take away from the paper at all (it's impressive), I'm just trying to think of ways to make its results even better.

In 4.3.2 Formal Verifier, first sentence, a word is missing (added in brackets):
> Performing such proof searches requires [us] to tightly couple a Metamath verifier with our models.

Nice work!

--- David A. Wheeler

Stanislas Polu

Sep 9, 2020, 9:40:47 AM
to meta...@googlegroups.com
> This seems to me to be a remarkable claim, especially since creating formal proofs is something even relatively early computers were doing. There are many other automated provers: the four-color theorem was formally proved in Coq, and Prover9 found a previously unknown algebraic proof.

The claim is specifically about *deep learning* based automated provers.
"Classical" provers and hammers, which are very well integrated into
Coq/HOL4/etc., have of course contributed numerous proofs to other
libraries. We've done a fairly thorough review of deep learning based
systems, but we may well have missed something. Arguably, the deep
learning distinction is rather arbitrary, but it would have been a
shame not to claim it :)

> I was intrigued that you could only add such a small percentage of synthetic data before it started making things worse. That seems unfortunate, because current machine learning algorithms generally need a large number of examples to learn from. Is it clear what the problem is? Are there obvious ways to overcome it? If the problem is that the synthetic data is too different from the existing training data, maybe follow-on work could learn from that data, generate new examples, and feed them back as more training data; the new examples might then be more similar to the existing set. This doesn't take away from the paper at all (it's impressive), I'm just trying to think of ways to make its results even better.

You're perfectly right. The synthetic data is fairly
out-of-distribution compared to set.mm, which explains the phenomenon.
Training a model that conjectures statements "in-distribution" and
attempting to prove them would provide data that helps more
efficiently (we've done some work in that direction, but we're not
ready to publish it yet). We still kept the synthetic data because it
helps a bit, it enables the analysis of section 5.5, which I find
quite interesting, and it makes the model much better when used as a
proof assistant to prove simple "math exercises" (which are not so
common in set.mm as of today).
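
In pseudo-code, the kind of loop you describe (conjecture, attempt a
proof, feed successes back) would look roughly like this. This is only
an illustrative sketch; attempt_proof and fine_tune are hypothetical
stand-ins, not our actual pipeline:

    # Illustrative sketch of a conjecture-and-prove feedback loop.
    # All helpers are passed in as parameters and are hypothetical.
    def self_training_round(model, verifier, attempt_proof, fine_tune,
                            n_conjectures=1000):
        new_examples = []
        for _ in range(n_conjectures):
            # Ask the model for a statement it believes is close to
            # set.mm's distribution.
            statement = model.sample_conjecture()
            # Try to prove it; every step is checked by the Metamath
            # verifier, so anything collected here is a valid pair.
            proof = attempt_proof(model, verifier, statement)
            if proof is not None:
                new_examples.append((statement, proof))
        # Feed the verified (statement, proof) pairs back as training data.
        fine_tune(model, new_examples)
        return model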

If we were to evaluate on a benchmark of maths exercises / IMO
problems / etc., we would probably be able to pour in more data from
these generators and others (work for future us!).

> In 4.3.2 Formal Verifier, first sentence, a word is missing (added in brackets):
> > Performing such proof searches requires [us] to tightly couple a Metamath verifier with our models.

Will action that :+1:

> Nice work!

Thanks, it means the world!

-stan

savask

Sep 9, 2020, 11:05:56 AM
to Metamath
Nice work! I really hope that a publication from such a credible AI company will bring more attention to Metamath.

I have several small questions; I hope you can shed some light on some of them.

1) Maybe you mention it in the article and I missed it, but why is the model called "GPT-f"? Does 'f' stand for "formal"?

2) Have you tried augmenting the set.mm database by adding theorems with renamed variables (like a version of id with |- ( ps -> ps ) instead of |- ( ph -> ph ))? I think that some variables are used more often in set.mm than others, so it seems plausible that the model will "learn less" about rare variables. Does GPT-f have less luck proving statements which use rare variables rather than often-used ones?

3) Do you have any plans to release GPT-f in the future as OpenAI did with GPT-2?

Stanislas Polu

Sep 9, 2020, 3:04:52 PM
to meta...@googlegroups.com
> 1) Maybe you mention it in the article and I missed it, but why is the model called "GPT-f"? Does 'f' stand for "formal"?

Exactly

> 2) Have you tried augmenting the set.mm database by adding theorems with renamed variables (like a version of id with |- ( ps -> ps ) instead of |- ( ph -> ph ))? I think that some variables are used more often in set.mm than others, so it seems plausible that the model will "learn less" about rare variables. Does GPT-f have less luck proving statements which use rare variables rather than often-used ones?

This is indeed something that happens quite often. We've run a few
experiments using de Bruijn indices instead of class/setvar names,
but it's easy to get wrong and the uplift appeared limited compared
to the associated risks :) Shuffling variable names seems to be a
very interesting data-augmentation strategy that we should explore.
Concerning stats on GPT-f's performance as a function of the
variables contained in a theorem, I unfortunately don't have numbers
available, but I can probably try to run them. Though I'm not sure how
useful that would be, as I presume there is a correlation between
difficulty and the appearance of specific variable names?
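
For concreteness, a variable-shuffling pass on a tokenized statement
could be as simple as the sketch below (purely illustrative; the wff
metavariable list is the usual set.mm one):

    import random

    # Illustrative sketch: permute wff metavariable names consistently
    # within one statement, so "|- ( ph -> ph )" may become
    # "|- ( ch -> ch )", etc.
    WFF_VARS = ["ph", "ps", "ch", "th", "ta", "et", "ze", "si",
                "rh", "mu", "la", "ka"]

    def shuffle_wff_vars(tokens, rng=random):
        present = [t for t in dict.fromkeys(tokens) if t in WFF_VARS]
        mapping = dict(zip(present, rng.sample(WFF_VARS, len(present))))
        return [mapping.get(t, t) for t in tokens]

    # Example: the statement of id.
    print(" ".join(shuffle_wff_vars("|- ( ph -> ph )".split())))
    # e.g. "|- ( ta -> ta )"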

> 3) Do you have any plans to release GPT-f in the future as OpenAI did with GPT-2?

No plans for now. We've released the proof assistant, and we will
probably release the code we use to generate our datasets from set.mm.
We'll also probably make the model available through our API.

> Nice work! I really hope that a publication from such a credible AI company will bring more attention to Metamath.

Thank you for your questions and encouragements!

-stan

savask

Sep 10, 2020, 2:27:02 AM
to Metamath
> Concerning stats on GPT-f's performance as a function of the
> variables contained in a theorem, I unfortunately don't have numbers
> available, but I can probably try to run them. Though I'm not sure how
> useful that would be, as I presume there is a correlation between
> difficulty and the appearance of specific variable names?

I guess "rarer" variables are usually used in statements which require a lot of variables, and those typically have harder proofs. I was thinking about replacing variables in already existing theorems (something like "|- ( ka -> ka )" in case of id) and seeing if GPT-f has harder time proving these "new" formulations. That way proof difficulty is left the same, since it's essentially the same statement.

Stanislas Polu

Sep 10, 2020, 2:49:36 AM
to meta...@googlegroups.com
> I guess "rarer" variables are usually used in statements which require a lot of variables, and those typically have harder proofs. I was thinking about replacing variables in already existing theorems (something like "|- ( ka -> ka )" in case of id) and seeing if GPT-f has harder time proving these "new" formulations. That way proof difficulty is left the same, since it's essentially the same statement.

Got it. This is indeed an interesting experiment that could be run on
a few hundred theorem statements :+1:
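
Something like the following harness would do it (again just a sketch;
prove stands for whatever interface the prover exposes, and rename
could be the shuffling pass sketched earlier in the thread):

    # Hypothetical harness: same theorems, renamed variables, compare
    # pass rates. prove(statement) -> bool and rename(statement) are
    # stand-ins for the prover interface and the renaming pass.
    def compare_pass_rates(statements, prove, rename):
        n = len(statements)
        original = sum(prove(s) for s in statements) / n
        renamed = sum(prove(rename(s)) for s in statements) / n
        return original, renamed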

-stan