Hi Johnathan,
I read why you didn't include Metamath, but to make sure I understand your paper in the Metamath "environment":
- to keep things simple, assume we had a set.mm with all proofs in normal mode (no compression)
- assume you use symbols + labels + Metamath reserved keywords (in set.mm) as your tokens (no BPE applied)
- assume you got a transformer T from your genetic optimizer (on hyperparameters), at generation i
- then, do you fit T's parameters on (say) 90% of the proofs in set.mm?
- then, do you measure "learnability" by how well T predicts the next token on the held-out 10%?
- and would that mean the next token in a wff, or the next label in a proof?
Is this roughly how it would go in the metamath framework?
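Just to be concrete about what I have in mind, here is a minimal sketch of that train/evaluate loop, assuming whitespace tokenization of uncompressed proofs and using a toy bigram counter in place of your transformer T (the token lists are made up, not real set.mm proofs):

```python
# Hypothetical sketch of the pipeline described above -- not the paper's
# actual code.  Tokens are Metamath symbols/labels/keywords (no BPE), and
# a toy bigram frequency table stands in for the transformer T.
from collections import Counter, defaultdict

def split_proofs(proofs, train_frac=0.9):
    """Deterministic 90/10 split of a list of tokenized proofs."""
    cut = int(len(proofs) * train_frac)
    return proofs[:cut], proofs[cut:]

def fit_bigram(train):
    """Count next-token frequencies; stand-in for fitting T's parameters."""
    table = defaultdict(Counter)
    for proof in train:
        for prev, nxt in zip(proof, proof[1:]):
            table[prev][nxt] += 1
    return table

def next_token_accuracy(table, test):
    """'Learnability' proxy: fraction of held-out next tokens predicted."""
    hits = total = 0
    for proof in test:
        for prev, nxt in zip(proof, proof[1:]):
            total += 1
            if table[prev] and table[prev].most_common(1)[0][0] == nxt:
                hits += 1
    return hits / total if total else 0.0

# Toy "proofs": label sequences as they might appear in uncompressed set.mm.
proofs = [["wph", "wps", "wi", "ax-mp"]] * 9 + [["wph", "wps", "wi", "ax-1"]]
train, test = split_proofs(proofs)
model = fit_bigram(train)
print(next_token_accuracy(model, test))  # -> 0.666... (misses the ax-1 ending)
```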
I'm trying to look at it from a proof-assistant standpoint, where maybe the most intriguing part would be to "generate" the next missing label in a proof (where "next" could be any one of the missing labels).
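To illustrate what I mean by "any one of the missing labels": a purely hypothetical sketch where a plain frequency prior over labels stands in for the model's ranking of candidates (the corpus below is invented, not from set.mm):

```python
# Hedged sketch of the "generate a missing label" task -- illustrative
# only, not from the paper.  Given a partial proof, any still-missing
# label may be proposed next; a unigram label frequency acts as the model.
from collections import Counter

# Toy corpus of proof label sequences (hypothetical).
train = [["wph", "wps", "wi", "ax-mp"],
         ["wph", "wps", "wi", "ax-mp"],
         ["wph", "wi", "ax-1"]]
label_counts = Counter(label for proof in train for label in proof)

def propose_next(missing):
    """Rank the missing labels; 'next' is whichever scores highest."""
    return max(missing, key=lambda l: label_counts[l])

print(propose_next({"ax-mp", "ax-1"}))  # -> ax-mp (seen twice vs. once)
```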
Thanks in advance for your help
Glauco