In the post
repeatedly re-proven statements David A. Wheeler found statements which appear in different proofs again and again. I decided to search for "repeatedly re-proven lemmata", i.e. sequences of theorem labels which often appear as sub-proofs in other theorems.
The rough algorithm is as follows. Say, we want to find lemmata with k steps. We take a compressed proof of some theorem T, and check all k-step subsequences L of its proof. We count a subsequence valid if
1) All its steps reference theorem labels (not a hypothesis and not a step marked with Z);
2) All its steps correspond to |- statements (to avoid counting syntax builder subproofs as lemmata);
3) L leaves exactly one theorem on the proof stack provided it is supplied with enough hypotheses (i.e. L is indeed a subproof, so subsequences like "0re, 0re" are not counted);
4) At least one of the theorems used in the subsequence L is used exactly once in the proof of T (otherwise minimizing with L would append an additional theorem label to the compressed proof of T, and likely that wouldn't minimize the proof).
Next we find all valid subsequences, and give each one a score by the formula (length - 1)*(number of theorems where it occurs). This gives an approximation to the number of saved bytes.
I did this search in
set.mm for subsequences of lengths from 2 to 7, took top 20 of each length and combined into the following list:
The first component of each tuple is the subsequence and the second is its score.
For example, the first line is (["fvex","eqeltri"],791). By reverse-engineering the associated statement in mmj2 we get a reasonable theorem
${
test1.1 $e |- A = ( F ` B ) $.
test1 $p |- A e. _V $=
( cfv cvv fvex eqeltri ) ABCEFDBCGH $.
$}
If we take the line (["0re","1re","elicc2i"],88) we get
${
test2 $p |- ( C e. ( 0 [,] 1 ) <-> ( C e. RR /\ 0 <_ C /\ C <_ 1 ) ) $=
( cc0 c1 0re 1re elicc2i ) BCADEF $.
$}
This is a version of elicc2i specialized to the unit interval. Apparently it is used quite often.
There are some mysterious proof subsequences in
set.mm, for example consider (["toponunii","restid","ax-mp","eqcomi"],93). When expanded it gives
${
test3.1 $e |- A e. C $.
test3.2 $e |- A e. ( TopOn ` B ) $.
test3 $p |- A = ( A |`t B ) $=
( crest co wcel wceq toponunii restid ax-mp eqcomi ) ABFGZAACHNAIDACBBAEJKLM $.
$}
This looks like a theorem with a completely useless hypothesis test3.1, but believe me or not, it does appear in some proofs, for example, in cncfcn1 (one can shorten its proof from 172 to 116 bytes using test3). If we correct the statement a little bit, we get a saner theorem
${
test4.1 $e |- A e. ( TopOn ` B ) $.
test4 $p |- A = ( A |`t B ) $=
( crest co ctopon wcel wceq toponunii restid ax-mp eqcomi cfv ) ABDEZAABFMZGNAHCAOBBACIJKL $.
$}
Which also minimizes the proof of cncfcn1 but this time to 98 bytes.
There are 120 proof sequences in the list I linked above, and surely not all of them deserve to be
set.mm theorems (although many might indeed save byte usage, though I haven't been able to run a full minimize run with theorems test1-test4 as it's too resource intensive for my computer). I'm not an active
set.mm contributor, so I don't know which theorems look reasonable, how to name them, where to put them in
set.mm etc. If anyone is interested in this, I encourage them to take any theorem they like from the
list (including test1-test4) and add them into
set.mm - no reference to me or the list is needed.
I would be glad to hear your opinions and suggestions.