[RFC] EvoGPT-f: An Evolutionary GPT Framework for Benchmarking Formal Math Languages


Johnathan Mercer

Feb 28, 2024, 2:12:24 PM
to Metamath
I didn't get a chance to include Metamath in this (see the Discussion section), but thought you'd enjoy it and might like to provide feedback.

EvoGPT-f: on arXiv
Abstract
Formal mathematics is the discipline of translating mathematics into a programming language in which any statement can be unequivocally checked by a computer. Mathematicians and computer scientists have spent decades of painstaking formalization efforts developing languages such as Coq, HOL, and Lean. Machine learning research has converged on these formal math corpora and given rise to an assortment of methodologies to aid in interactive and automated theorem proving. However, these papers have primarily focused on one method, for one proof task, in one language. This paper introduces EvoGPT-f: a novel evolutionary framework for the first systematic quantitative analysis of the differential machine learnability of five formal math corpora (Lean 3, Lean 4, Coq, HOL 4, HOL Light) using four tokenization methods (character, word-level, Byte Pair Encoding and StarCoder tokenizer). This paper does not put to rest the question of the "best" or "easiest" language to learn. Rather, this framework and preliminary findings begin to illuminate the differential machine learnability of these languages, offering a foundation to forge more systematic quantitative and qualitative comparative research across communities.
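
In case it helps the discussion here, the core loop of the evolutionary hyperparameter search looks roughly like this (an illustrative sketch only, not the paper's actual code; the search space and the train_and_eval function are placeholders):

import random

# Hypothetical search space; the real ranges used in the paper differ.
SPACE = {
    "n_layer": [2, 4, 6, 8],
    "n_head":  [2, 4, 8],
    "n_embd":  [128, 256, 512],
    "dropout": [0.0, 0.1, 0.2],
    "lr":      [1e-4, 3e-4, 1e-3],
}

def sample_config():
    return {k: random.choice(v) for k, v in SPACE.items()}

def mutate(cfg, rate=0.3):
    # Re-sample each "gene" (hyperparameter) with probability `rate`.
    return {k: random.choice(SPACE[k]) if random.random() < rate else v
            for k, v in cfg.items()}

def evolve(train_and_eval, pop_size=8, generations=5):
    # train_and_eval(cfg) is assumed to train a small GPT with config cfg on
    # one corpus/tokenizer pair and return its validation loss (lower is better).
    population = [sample_config() for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=train_and_eval)
        parents = ranked[: pop_size // 2]                 # truncation selection
        children = [mutate(random.choice(parents))
                    for _ in range(pop_size - len(parents))]
        population = parents + children
    return min(population, key=train_and_eval)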

Glauco

Feb 28, 2024, 4:52:38 PM
to Metamath
Hi Johnathan,

I read why you didn't include Metamath; but to help me understand your paper in the Metamath "environment":

- to keep things simple, assume we had a set.mm with all proofs in normal mode (no compression)

- assume you use symbols + labels + metamath reserved keywords  (in set.mm) as your tokens (no BPE applied)

- assume you got a transformer T from your genetic optimizer (on hyperparameters), at generation i

- then, do you fit T's parameters on (let's say) 90% of the proofs in set.mm?

- then, do you measure "learnability" by measuring how good it is at predicting the next token in the 10% out of sample?

- would that mean either the next token in a wff or the next label in a proof?

Is this roughly how it would go in the metamath framework?

I'm trying to look at it from a Proof Assistant standpoint, where maybe the most intriguing part would be to "generate" the next missing label in a proof (where "next" could be "one" of the missing labels).


Thanks in advance for your help
Glauco
 

bil...@gmail.com

Feb 29, 2024, 3:56:25 PM
to Metamath
@Glauco said "would it mean, either the next token in a wff or the next label in a proof?"

I would think it would be easier to use the full expression for a statement rather than just its label in a proof.

That is, the text would be a sequence of expressions that constitute the hypotheses, an expression that constitutes the assertion, followed by expressions that constitute the proof of the assertion.

I don't know whether supplying the labels would even help that much.

Of course, besides the theorems, one would also have to include the axiom expressions and the definition expressions.
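
Concretely, I'm imagining each training example serialized roughly like this (just a sketch of the layout I mean; the <HYP>/<ASSERT>/<STEP> delimiter tokens are made up):

def serialize_theorem(hypotheses, assertion, proof_expressions):
    # Flatten one theorem into a single token stream: hypothesis expressions,
    # then the assertion, then the expressions that make up the proof.
    # The <HYP>/<ASSERT>/<STEP> delimiters are invented for this sketch.
    parts = ["<HYP> " + h for h in hypotheses]
    parts.append("<ASSERT> " + assertion)
    parts += ["<STEP> " + e for e in proof_expressions]
    return " ".join(parts)

# Toy usage (the expressions are illustrative fragments, not real set.mm content):
example = serialize_theorem(
    hypotheses=["|- ph"],
    assertion="|- ( ps -> ph )",
    proof_expressions=["|- ( ph -> ( ps -> ph ) )", "|- ( ps -> ph )"],
)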

Johnathan Mercer

Mar 4, 2024, 1:56:49 AM
to meta...@googlegroups.com
Hi Glauco, 

Apologies for the delay!! I'll try to answer inline

> I read why you didn't include Metamath; but to help me understand your paper in the Metamath "environment":

> - to keep things simple, assume we had a set.mm with all proofs in normal mode (no compression)

> - assume you use symbols + labels + metamath reserved keywords (in set.mm) as your tokens (no BPE applied)

OK, I'm with you; this is most similar to word-level encoding.
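
Since set.mm is whitespace-delimited, that tokenizer is basically just a split-on-whitespace vocabulary. A minimal sketch of what I mean (the example statements are illustrative fragments, not real set.mm content):

def build_vocab(statements):
    # Every whitespace-delimited item -- math symbols, labels, and Metamath
    # keywords like $p, $=, $. -- becomes one token (no BPE).
    vocab = {}
    for s in statements:
        for tok in s.split():
            vocab.setdefault(tok, len(vocab))
    return vocab

def encode(statement, vocab):
    return [vocab[tok] for tok in statement.split()]

# Toy usage:
stmts = ["hyp1 $e |- ( ph -> ps ) $.", "thm1 $p |- ps $= wps wph hyp1 ax-mp $."]
vocab = build_vocab(stmts)
ids = encode(stmts[1], vocab)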

> - assume you got a transformer T from your genetic optimizer (on hyperparameters), at generation i

> - then, do you fit T's parameters on (let's say) 90% of the proofs in set.mm?

> - then, do you measure "learnability" by measuring how good it is at predicting the next token in the 10% out of sample?

Yep, the validation loss is how well it predicts tokens on token sequences it wasn't trained on.
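
In code terms it's roughly this (a PyTorch-style sketch; it assumes the model returns next-token logits and that val_ids is the already-encoded held-out 10%):

import torch
import torch.nn.functional as F

def validation_loss(model, val_ids, block_size=256, batches=100):
    # val_ids: 1-D LongTensor of token ids from the held-out 10% of the corpus.
    # Assumes model(x) returns next-token logits of shape (batch, block, vocab).
    model.eval()
    losses = []
    with torch.no_grad():
        for _ in range(batches):
            i = torch.randint(0, len(val_ids) - block_size - 1, (1,)).item()
            x = val_ids[i : i + block_size].unsqueeze(0)          # context window
            y = val_ids[i + 1 : i + block_size + 1].unsqueeze(0)  # shifted targets
            logits = model(x)
            loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
            losses.append(loss.item())
    return sum(losses) / len(losses)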

> - would that mean either the next token in a wff or the next label in a proof?

It depends on how we created the train/validation token sequences (but in general it can be any token at position t+1, given the preceding tokens {..., t}).

> Is this roughly how it would go in the metamath framework?

> I'm trying to look at it from a Proof Assistant standpoint, where maybe the most intriguing part would be to "generate" the next missing label in a proof (where "next" could be "one" of the missing labels).
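
With a model trained this way, "generate the next missing label" would amount to feeding it the proof tokens up to the gap and taking the most likely next token, restricted to label tokens. Something like this sketch (the label-mask restriction is my assumption here, not something the paper implements):

import torch

def predict_missing_label(model, prefix_ids, label_token_ids):
    # prefix_ids: token ids of the proof up to the gap.
    # label_token_ids: ids of every token that is a statement label, so the
    # prediction is constrained to labels rather than arbitrary symbols.
    model.eval()
    with torch.no_grad():
        logits = model(torch.tensor(prefix_ids).unsqueeze(0))[0, -1]  # last position
        mask = torch.full_like(logits, float("-inf"))
        mask[label_token_ids] = 0.0
        return int(torch.argmax(logits + mask))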

Would you or a group of folks be interested in collaborating on a follow-up paper to use this framework to build a mm GPT?



Glauco

Mar 9, 2024, 4:32:01 AM
to Metamath
I don't write Python, but it looks like I can read it.

If you could create a GitHub repo with an interface for the missing pieces, I'm confident we can adapt the Python parser/validator we already have to expose set.mm pretty much however you need it.

My main concern is that the "uncompressed" version of set.mm (which I expect to be the more effective form for a learning machine) would be enormous, so it would have to be "created" on the fly (not a big deal, anyway).
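
For instance, an interface along these lines (the names are just placeholders for what I have in mind) should be enough for us to wire in our parser/validator, yielding proofs lazily so the expanded database never has to exist on disk:

from abc import ABC, abstractmethod
from typing import Iterator, List

class MetamathCorpus(ABC):
    # Hypothetical interface between an existing set.mm parser/validator and
    # the training code; method names are placeholders, not an existing API.

    @abstractmethod
    def theorems(self) -> Iterator[str]:
        # Yield each theorem as one whitespace-delimited text block:
        # hypotheses, assertion, and the normal-mode (uncompressed) proof,
        # decompressed on the fly, one theorem at a time.
        ...

    @abstractmethod
    def labels(self) -> List[str]:
        # All statement labels, so label tokens can go into the vocabulary.
        ...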