A few days ago I stumbled upon the topic "Formula wanted to estimate 'minimize' run times" by Norman, where he posed an interesting problem: find an approximate formula for minimization run time based on proof length and label number. Such a formula would (in theory) allow scheduling minimization jobs more evenly, which should make a difference when performing full set.mm minimizations.
I decided to give this task a shot, but in the end the results turned out duller than I had hoped. I still think it's worth reporting these findings, as maybe someone with better data analysis skills will get interested in the problem. So, my procedure was as follows.
First of all, I prepared the data as described in the original topic by Norman. Basically, I took run times from the minimization log by Dylan Houlihan and augmented them with label numbers and proof sizes, for a total of 28438 items (a few had to be removed since, for instance, 4exdistrOLD wasn't present in the set.mm version linked in the post). Label numbers in mathboxes were shifted up by 1070 (as suggested in the original post) to compensate for the fact that 'minimize' doesn't scan mathboxes by default.
Perhaps not surprisingly, the majority of theorems (~76%) take less than 15 seconds to minimize. The mean minimization time is ~45s with standard deviation ~228, so even the trivial "formula" predicting 0s for every theorem already gives a "small" mean error :-) Intuitively, the formula ought to look like c*label^a*size^b, where a, b and c are constants. I tried various curve-fitting techniques available in Wolfram Mathematica; for instance, it suggested 0.0000811892*label^0.405776*size^1.33858, which has mean error ~35 and standard deviation ~155. A simpler hand-found formula, 4.33e-9*label*size^1.7, produces a comparable mean error ~25 and standard deviation ~180.
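For readers who want to reproduce this kind of fit, here is a minimal sketch. It fits c*label^a*size^b by least squares in log space (which minimizes relative rather than absolute error, so it is not necessarily what Mathematica's fitter does), and it runs on synthetic noiseless data rather than the real dataset, just to show that the procedure recovers the exponents:

```python
import numpy as np

def fit_power_law(label, size, time):
    """Fit time ~ c * label**a * size**b by least squares in log space.

    Taking logs turns the power law into a linear model:
        log(time) = log(c) + a*log(label) + b*log(size).
    Returns (c, a, b)."""
    X = np.column_stack([np.ones(len(label)), np.log(label), np.log(size)])
    coef, *_ = np.linalg.lstsq(X, np.log(time), rcond=None)
    return np.exp(coef[0]), coef[1], coef[2]

# Synthetic sanity check: data generated from known parameters
# (roughly the ones Mathematica suggested), then recovered by the fit.
rng = np.random.default_rng(0)
label = rng.integers(1, 30000, 1000).astype(float)
size = rng.integers(10, 5000, 1000).astype(float)
time = 8.1e-5 * label**0.41 * size**1.34

c, a, b = fit_power_law(label, size, time)
print(round(a, 2), round(b, 2))  # → 0.41 1.34
```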
These formulae have a small "global" coefficient (~1e-9 in the latter example), which suggests that one should perhaps distinguish between the small-run-time and large-run-time cases. I defined the small class to be theorems taking less than 15s to minimize and trained a basic linear classifier to separate it from the "large time" class. I then curve-fitted a formula of the form c*label^a*size^b separately on the small and large cases to obtain the following prediction formula:
If -0.0572153 + 2.68348e-6*label + 0.000206234*size < 0.5 then 4.92e-6*label^0.68*size^1.03 else 0.000479*label^0.32*size^1.25,
which has mean error ~28 and standard deviation ~156. Clearly this is not a large improvement over the simpler formulae listed above.
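The piecewise predictor above can be written out as a small function (a sketch: the constants are taken verbatim from the fits reported here, while the function name and the 0.5 decision threshold are just spelled out from the formula):

```python
def predict_time(label, size):
    """Predict minimization time (seconds) for a theorem.

    A linear classifier score decides the "small" vs "large"
    run-time class; a separate power law is applied in each case.
    """
    score = -0.0572153 + 2.68348e-6 * label + 0.000206234 * size
    if score < 0.5:
        # "small" run-time class (most theorems)
        return 4.92e-6 * label**0.68 * size**1.03
    # "large" run-time class
    return 0.000479 * label**0.32 * size**1.25
```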
That is more or less what I have for approximating minimization run times. Now, the intended application of such an approximation is scheduling minimization jobs evenly, so let's switch our attention to that.
I assumed that we have 16 CPUs (processes, threads, etc.) available, and that we schedule our theorems for minimization by the Longest Processing Time (LPT) algorithm: jobs are sorted by decreasing predicted time, and each job is assigned to the CPU with the smallest total load so far. Obviously, during the scheduling phase we compute times with one of our formulae, but after the theorems have been spread across CPUs, we can evaluate the procedure using the real run times. The relevant parameters are the average CPU load, which is always ~79495s, and the latency, which I define as the difference between the maximum and minimum CPU times (the smaller the latency, the more even the CPU loads). It is useful to compare our results with a straightforward baseline where one randomly partitions the theorems into 16 (almost) equal lists. In my experiments such an approach yields latency ~39397, so it can be taken as a base value.
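The scheduling procedure just described can be sketched as follows (my own rendering, not code from the experiments: each job carries a predicted and a real time, assignment uses predicted loads, and the returned real loads are what the latency is computed from):

```python
import heapq

def lpt_schedule(jobs, n_cpus=16):
    """Longest Processing Time scheduling.

    jobs: list of (predicted_time, real_time) pairs.
    Jobs are sorted by decreasing predicted time; each job goes to
    the CPU with the smallest *predicted* load so far. Returns the
    *real* total load of every CPU, for evaluating latency."""
    heap = [(0.0, i) for i in range(n_cpus)]  # (predicted load, cpu id)
    heapq.heapify(heap)
    real = [0.0] * n_cpus
    for pred, actual in sorted(jobs, reverse=True):
        load, cpu = heapq.heappop(heap)
        heapq.heappush(heap, (load + pred, cpu))
        real[cpu] += actual
    return real

# Toy example with 3 CPUs and perfect predictions;
# latency = max load - min load.
loads = lpt_schedule([(5, 5), (4, 4), (3, 3), (3, 3), (2, 2), (1, 1)],
                     n_cpus=3)
print(max(loads) - min(loads))  # → 0.0 (perfect balance on this input)
```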
So, the first formula (together with LPT scheduling) produces the following list of CPU run times:
[76126,83042,81573,90121,72944,79387,74600,84677,78886,69388,86309,83830,69727,74918,81779,84567],
which has latency ~20733 and standard deviation ~6055. The second (hand-found) formula gives latency ~32342 and standard deviation ~8781, arguably a worse result (the latency is not much better than with the random approach). Finally, the last (piecewise) formula gives latency ~40756 and standard deviation ~9684, which is worse than the random approach.
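As a sanity check, the latency of the CPU run-time list above can be recomputed directly (a small sketch; the sample standard deviation appears to be the statistic matching the reported ~6055):

```python
import statistics

cpu_times = [76126, 83042, 81573, 90121, 72944, 79387, 74600, 84677,
             78886, 69388, 86309, 83830, 69727, 74918, 81779, 84567]

latency = max(cpu_times) - min(cpu_times)
print(latency)                              # → 20733
print(round(statistics.stdev(cpu_times)))   # → 6055
```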
What is, in my opinion, surprising is that the even simpler formula 'size' (i.e. minimization time proportional to proof size; note that we don't need a multiplicative constant, since we only sum and compare values in the scheduling algorithm) produces latency ~19218 and standard deviation ~6388, which is on par with the first curve-fitted formula. I have tried other simple expressions without much success, so this seems to be the best result I can get.
That is how far I could get with my basic knowledge of statistics. Apparently the end result is that one can simply estimate a theorem's minimization time by its proof size and then use well-known scheduling algorithms to assign jobs :-) I couldn't find any reports on the minimization run performed by Dylan Houlihan (for instance, I don't know the "latency" of that run, or even the number of threads used), so I don't know whether the approach I'm suggesting has any benefits over the method used by Norman and Dylan. Nevertheless, I hope this small report was interesting to read, and maybe it will motivate other people to come up with a better formula and/or approach. Here's the dataset I used: https://gist.github.com/savask/900161e9d093a5b787b6473113dfde16. I'm also looking forward to any questions or criticism regarding the approaches used here. I can't boast a good knowledge of statistics or data science, and my "analysis" must be taken with a grain of salt, so I will be happy to hear other people's opinions.