Thanks :-)
Yes, the ones for the second proposal are ties, but I wouldn't say it's worth keeping them all, since huge numbers of ties could be generated from what my tool produces and from the information such a proof – and the collection as a whole – contains. Unless you mean keeping only those two, the historically first discovered and the alphanumerically minimal variants, since both are "relevant" in a way.
Let me explain this and my approach a bit. (Sorry for the incoming wall of text. Feel free to skip at your pace. :-)
A parser can quickly extract all subproof strings, e.g. DD2D1DD22D11DD2D112 (length 19) for theorem *1.5 has 7 non-trivial proper subproofs (i.e. excluding axioms and itself): D11, D22, D2D11, DD22D11, DD2D112, D1DD22D11, and D2D1DD22D11. Since every proof is essentially a modus ponens (two inputs and one output), a proof of length n ≥ 3 (they all have odd lengths) has up to (n-3)/2 non-trivial proper subproofs (there may be overlaps).
These numbers can grow quite large for longer proofs, and there may be many overlaps; e.g. my 1883-step proof of biass alone has 222 different non-trivial proper subproofs. Counting these, the unified collection (pmproofs-unified.txt) consists of 1374 proofs (which are all listed explicitly in data/pmproofs_1-unified.dpc), and FYI, the non-unified collection consists of 1436 proofs (listed explicitly in data/pmproofs_1.dpc).
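To illustrate the subproof extraction, here is a minimal sketch (not my actual parser code, just the idea): a D-proof in this notation is either an axiom reference or 'D' followed by two subproofs, so all D-rooted substrings can be collected in a single recursive pass.

    #include <iostream>
    #include <set>
    #include <string>

    // Minimal sketch (not the actual tool code): a D-proof is either an axiom
    // reference (a non-'D' character) or 'D' followed by two subproofs. Returns the
    // index just past the subproof starting at 'pos' and records every D-rooted
    // subproof string in 'subs'.
    std::size_t collect(const std::string& p, std::size_t pos, std::set<std::string>& subs) {
        if (p[pos] != 'D')
            return pos + 1; // axiom reference: trivial subproof, skip
        std::size_t end = collect(p, collect(p, pos + 1, subs), subs); // two premises
        subs.insert(p.substr(pos, end - pos));
        return end;
    }

    int main() {
        std::string proof = "DD2D1DD22D11DD2D112"; // the 19-step proof of *1.5
        std::set<std::string> subs;
        collect(proof, 0, subs);
        subs.erase(proof); // drop the proof itself, keeping proper subproofs only
        for (const std::string& s : subs)
            std::cout << s << '\n'; // prints the 7 subproofs: D11, D22, D2D11, ...
    }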
Now each such proof has a final formula (the conclusion), and many of those formulas have several shortest (known) proofs. There may even be cases where both a formula and a different schema of that formula are suitable, with both having a minimal proof of the same length.
Now, for a proof with m uses of a certain conclusion where that conclusion has n shortest (known) proofs, there are n^m ways to insert proofs of just that conclusion into the final proof, since each of the m uses can independently be filled with any of the n variants. So we are talking about exponential growth of the number of possibilities based on the number of overlaps and redundant shortest (known) proofs. There is certainly quite a number of other 1883-step proofs of biass that could be deduced from the information that is already in pmproofs.txt, but not from pmproofs-unified.txt, since the overlaps have been removed there.
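A toy calculation with invented numbers shows how fast this grows: say one conclusion is used 3 times and has 2 known shortest proofs, another is used 2 times with 3, and a third 5 times with 2; then the single final proof already stands for 2^3 * 3^2 * 2^5 = 2304 equally short variants.

    #include <cstdint>
    #include <iostream>
    #include <utility>
    #include <vector>

    // Toy illustration with invented numbers: if conclusion i is used m[i] times and
    // has n[i] known shortest proofs, every use can independently be filled with any
    // of the n[i] alternatives, so the variants multiply to the product of n[i]^m[i].
    int main() {
        std::vector<std::pair<std::uint64_t, std::uint64_t>> usage = {{2, 3}, {3, 2}, {2, 5}}; // {n, m} pairs
        std::uint64_t variants = 1;
        for (const auto& [n, m] : usage)
            for (std::uint64_t i = 0; i < m; i++)
                variants *= n;
        std::cout << variants << '\n'; // 2^3 * 3^2 * 2^5 = 8 * 9 * 32 = 2304
    }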
Now for the overall approach, my parser creates the list of non-trivial subproofs merged from all proofs, sorts them by their lengths and parses them all to their conclusions.
Then I used files I had generated, which contain proofs of all possible conclusions of proofs up to length 29, to find better variants, using a lot of schema checks (my formulas are actually tree structures; a rough sketch of such a check follows below). If a conclusion A from my files is a schema of a conclusion B from pmproofs.txt, all string occurrences of proof(B) within pmproofs.txt can simply be replaced by proof(A), which keeps all the proofs valid and ideally shortens them.
So I collected all the best potential replacements, and then applied them from long to short.
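For illustration, here is a rough, simplified sketch of such a schema check (not my actual implementation): A is a schema of B iff B results from A by consistently substituting formulas for A's variables, i.e. a one-way match on the formula trees.

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // Simplified sketch of a schema check on formula trees (not the actual tool code).
    struct Fm {
        std::string var;     // non-empty iff this node is a variable leaf
        std::string op;      // e.g. "->" or "~" for inner nodes
        std::vector<Fm> ch;  // children of an inner node
    };

    bool equal(const Fm& a, const Fm& b) {
        if (a.var != b.var || a.op != b.op || a.ch.size() != b.ch.size())
            return false;
        for (std::size_t i = 0; i < a.ch.size(); i++)
            if (!equal(a.ch[i], b.ch[i]))
                return false;
        return true;
    }

    // One-way matching: does some substitution of A's variables turn A into B?
    bool isSchemaOf(const Fm& a, const Fm& b, std::map<std::string, Fm>& subst) {
        if (!a.var.empty()) { // variable: bind it, or compare with its earlier binding
            auto it = subst.find(a.var);
            if (it == subst.end()) {
                subst.emplace(a.var, b);
                return true;
            }
            return equal(it->second, b);
        }
        if (!b.var.empty() || a.op != b.op || a.ch.size() != b.ch.size())
            return false;
        for (std::size_t i = 0; i < a.ch.size(); i++)
            if (!isSchemaOf(a.ch[i], b.ch[i], subst))
                return false;
        return true;
    }

    int main() {
        // A = (p -> p) is a schema of B = ((q -> r) -> (q -> r)) via p := (q -> r).
        Fm p{"p", "", {}}, q{"q", "", {}}, r{"r", "", {}};
        Fm qr{"", "->", {q, r}};
        Fm A{"", "->", {p, p}}, B{"", "->", {qr, qr}};
        std::map<std::string, Fm> subst;
        std::cout << std::boolalpha << isSchemaOf(A, B, subst) << '\n'; // true
    }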
Not too interesting, I'd say, but rather straightforward (it took me one day, once all the basic functionality was already implemented).
What really took time and great effort was the highly optimized implementation of the D-proof-to-formula (tree structure) parser and subtree checks, and the generator of all possible outcomes. For instance, I can parse pmproofs.txt in less than 170 ms (to what you see in data/pmproofs_2.dpc) using a single thread of my old Intel Core i7-3770, but I use multiple threads to compare and generate results.
The generator of all provable formulas is fully parallelized via Intel's oneTBB library. I essentially implemented a push-down automaton to generate all possible D-proofs of length n+2 once all those of length at most n are already known, then do this step by step (i.e. file by file). But I sorted out all of the redundancies, including proofs with a conclusion for which there is a schema that has a known proof at least as short. So the remaining proofs each have a conclusion that represents a whole bunch of other proofs and potentially other formulas. In short, I kept the amount of data as small as possible.
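The combination step itself is simple; here is a rough sketch of it alone (with demo data and without the conclusion parsing, redundancy removal and schema filtering, which is where all the actual work is): every D-proof of length n+2 is 'D' followed by two already-known D-proofs of odd lengths summing to n+1.

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    // Rough sketch of the combination step only (no conclusion parsing, no redundancy
    // or schema filtering): "D" + p + q with |p| + |q| = n + 1 and p, q already known.
    int main() {
        std::map<std::size_t, std::vector<std::string>> known = {
            {1, {"1", "2", "3"}} // the three axioms; in the tool, longer lengths come from dProofs<n>.txt files
        };
        for (std::size_t n = 3; n <= 5; n += 2) { // small demo limit
            std::vector<std::string>& out = known[n];
            for (std::size_t i = 1; i + 1 < n; i += 2)     // |p| = i, |q| = n - 1 - i
                for (const std::string& p : known[i])
                    for (const std::string& q : known[n - 1 - i])
                        out.push_back("D" + p + q);
            std::cout << n << " : " << out.size() << '\n'; // raw counts, before any filtering
        }
    }

Without any filtering this already yields 9 candidates of length 3 and 54 of length 5, compared to the 6 and 12 representatives in the list below.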
Here is the number of new representative proofs for each length up to 29:
1 : 3
3 : 6
5 : 12
7 : 38
9 : 89
11 : 229
13 : 672
15 : 1844
17 : 5221
19 : 15275
21 : 44206
23 : 129885
25 : 385789
27 : 1149058
29 : 3449251
(5181578 representatives in total)
For example, dProofs29.txt is already 103.477529 megabytes (that is 3449251*(29+1)-1 bytes) in size, and all the parsed data from those files, including the formula trees of all the conclusions, only barely fits into my 32 GiB of DDR memory. What currently takes a lot of time are the schema checks, of which there are too many (quadratic complexity; pairwise comparisons limited only by fitting formula lengths), and I am still hoping to come up with a database structure to make this more efficient. (For the 27->29 step, everything else took just under an hour on my PC, but the schema filtering took over 17 days.) The gain from schema filtering is small but still has an exponential effect on subsequent files. (926015 proofs were filtered out from the proofs of length 29 in that manner.) Anyway, a supercomputer could probably produce more shortest-proof results using my HPC-ready program, and I will try to get access to one.
I hope this is somewhat readable; if something is still unclear, feel free to ask. And let's hope at least David is still alive .. (´ ∀ ` *)
— Samiro Discher