A formula for 'minimize' run time or how to schedule minimization

savask

Dec 31, 2019, 6:47:00 AM
to Metamath
A few days ago I stumbled upon the topic "Formula wanted to estimate 'minimize' run times" by Norman, where he posed an interesting problem: find an approximate formula for minimization run time based on proof length and label number. Such a formula would (in theory) allow scheduling minimization jobs more evenly, which should make a difference when performing full set.mm minimizations.

I decided to give this task a shot, but in the end the results turned out to be duller than I had hoped. I think it's still worth reporting these findings, as someone with better data analysis skills may get interested in the problem. My procedure was as follows.

First of all, I prepared the data as described in Norman's original topic. Basically, I took the run times from the minimization log by Dylan Houlihan and augmented them with label numbers and proof sizes, for a total of 28438 items (a few had to be removed since, for instance, 4exdistrOLD wasn't present in the set.mm version linked in the post). Label numbers in mathboxes were shifted up by 1070 (as suggested in the original post) to compensate for the fact that 'minimize' doesn't scan mathboxes by default.

Perhaps not surprisingly, the majority of theorems (~76%) take less than 15 seconds to minimize. The mean minimization time is ~45s and the standard deviation is ~228, so already the "formula" predicting 0s for each theorem gives us a "small" mean error :-) Intuitively it seems that the formula ought to look like c*label^a*size^b, where a, b and c are constants. I tried various curve-fitting techniques available in Wolfram Mathematica; for instance, it suggested 0.0000811892*label^0.405776*size^1.33858, which has mean error ~35 and standard deviation ~155. A simpler hand-found formula, 4.33e-9*label*size^1.7, produces a comparable mean error of ~25 and standard deviation of ~180.
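(For anyone who wants to reproduce this kind of fit outside Mathematica, here is a minimal sketch in Python using scipy. The file name and column layout are assumptions about my dataset, and fitting in log space is just one reasonable choice, not necessarily what Mathematica does internally.)

import numpy as np
from scipy.optimize import curve_fit

# Assumed layout: one row per theorem with label number, proof size, run time (s).
label, size, time = np.loadtxt("mintimes.txt", unpack=True)   # hypothetical file name

# Model: time ~ c * label^a * size^b, fitted in log space so that the
# handful of very long-running theorems don't dominate the objective.
def log_model(x, log_c, a, b):
    lbl, sz = x
    return log_c + a * np.log(lbl) + b * np.log(sz)

(log_c, a, b), _ = curve_fit(log_model, (label, size), np.log(time + 1.0),
                             p0=(-10.0, 0.5, 1.5))
pred = np.exp(log_c) * label**a * size**b - 1.0   # undo the +1 used to avoid log(0)
print("c=%g a=%g b=%g  mean abs error=%g  std=%g"
      % (np.exp(log_c), a, b, np.mean(np.abs(pred - time)), np.std(pred - time)))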

These formulae have a small "global" coefficient (~1e-9 in the latter example), which suggests that perhaps one should differentiate between the small-run-time and large-run-time cases. I set the small class to be theorems taking less than 15s to minimize and trained a basic linear classifier to distinguish it from the "large time" class. I then curve-fitted a formula of the form c*label^a*size^b separately in the small and large cases, obtaining the following prediction formula:

If -0.0572153 + 2.68348e-6*label + 0.000206234*size < 0.5 then 4.92e-6*label^0.68*size^1.03 else 0.000479*label^0.32*size^1.25,

which has mean error ~28 and standard deviation ~156. Clearly it's not a large improvement over the simpler formulae listed above.
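(Written out as code, the two-regime predictor is just the following; this is a direct transcription of the constants above, nothing new is fitted.)

def predict_minimize_time(label, size):
    """Two-regime estimate of 'minimize' run time in seconds."""
    if -0.0572153 + 2.68348e-6 * label + 0.000206234 * size < 0.5:
        return 4.92e-6 * label**0.68 * size**1.03    # "small run time" regime
    return 0.000479 * label**0.32 * size**1.25       # "large run time" regime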

That is more or less what I have for approximating minimization run times. Now, the intended application of such an approximation is scheduling minimization jobs evenly, so let's switch our attention to that.

I assumed that we have 16 CPUs (processes, threads, etc.) available and that we schedule our theorems for minimization by the Longest Processing Time (LPT) algorithm (each task is assigned to the CPU with the smallest total load). During the scheduling phase we estimate times with one of our formulae, but after the theorems have been spread over the CPUs we can evaluate the schedule using the real run times. The relevant parameters are the average CPU load, which is always ~79495s, and the latency, which I define as the difference between the maximum and minimum CPU times (the smaller the latency, the more even the CPU loads). It is useful to compare our results with a straightforward scheduling approach where one randomly partitions the theorems into 16 (almost) equal lists. In my experiments such an approach yields latency ~39397, so that can be taken as a baseline.
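(A minimal sketch of this experiment, assuming each theorem comes as a (predicted_time, actual_time) pair; the LPT step always hands the next-largest predicted job to the least-loaded CPU, and a heap keeps that cheap.)

import heapq

def lpt_schedule(jobs, n_cpus=16):
    """jobs: list of (predicted_time, actual_time) pairs.
    Schedules by predicted time (LPT) and returns per-CPU *actual* loads."""
    # Longest Processing Time: largest predicted job first.
    jobs = sorted(jobs, key=lambda j: j[0], reverse=True)
    heap = [(0.0, cpu) for cpu in range(n_cpus)]   # (predicted load, cpu index)
    actual_load = [0.0] * n_cpus
    for predicted, actual in jobs:
        pred_load, cpu = heapq.heappop(heap)       # least-loaded CPU so far
        heapq.heappush(heap, (pred_load + predicted, cpu))
        actual_load[cpu] += actual
    return actual_load

# loads = lpt_schedule(jobs)
# latency = max(loads) - min(loads)   # the "latency" measure used in this post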

So, the first formula (together with LPT scheduling) produces the following list of CPU run times:

[76126,83042,81573,90121,72944,79387,74600,84677,78886,69388,86309,83830,69727,74918,81779,84567],

which has latency ~20733 and standard deviation ~6055. The second (hand-found) formula gives latency ~32342 and standard deviation ~8781, an arguably worse result (the latency is not much different from the random approach). Finally, the last (piecewise) formula gives latency ~40756 and standard deviation ~9684, which is worse than the random approach.

What is, in my opinion, surprising is that the simple formula 'size' (i.e. minimization time proportional to proof size; notice that we don't need a multiplicative constant, since we are only summing and comparing values in the scheduling algorithm) produces latency ~19218 and standard deviation ~6388, which is on par with the first formula found by curve-fitting. I have tried other simple expressions without great success, so this seems like the best result I can get.

That is how far I could get with my basic knowledge of statistics. Apparently the end result is that one can simply estimate a theorem's minimization time by its proof size and then use well-known scheduling algorithms to assign jobs :-) I couldn't find any reports on the minimization run performed by Dylan Houlihan (for instance, I don't know the "latency" of that run or even the number of threads used), so I don't know whether the approach I'm suggesting has any benefits over the method used by Norman and Dylan. Nevertheless, I hope this small report was interesting to read, and maybe it will motivate other people to come up with a better formula and/or approach. Here's the dataset I used: https://gist.github.com/savask/900161e9d093a5b787b6473113dfde16. I'm also looking forward to any questions or criticism regarding the approaches used here. I can't boast a good knowledge of statistics or data science, and my "analysis" should be taken with a grain of salt, so I will be happy to hear other people's opinions.

Jim Kingdon

Dec 31, 2019, 12:30:25 PM
to savask, Metamath
I idly wonder about this while I'm waiting for a (usually single-theorem) minimization to run. So although I don't have as much need to know as someone planning large-scale minimization runs, thank you for posting these results.

Norman Megill

Dec 31, 2019, 4:35:02 PM
to Metamath
On Tuesday, December 31, 2019 at 6:47:00 AM UTC-5, savask wrote:
A few days ago I stumbled upon the topic "Formula wanted to estimate 'minimize' run times" by Norman, where he posed an interesting problem: find an approximate formula for minimization run time based on proof length and label number. Such a formula would (in theory) allow scheduling minimization jobs more evenly, which should make a difference when performing full set.mm minimizations.

This is nice work, thanks for doing it!

For you or anyone wishing to put together the scripts for the current set.mm using your formula(s), I uploaded 3 files from the 2018 run for reference:

http://us2.metamath.org/min2018-emails.zip - contains the correspondence with Dylan in mbox format, with attachments deleted.
http://us2.metamath.org/min2018-jobs.zip - 24 metamath command scripts (called attachment "metamathjobs.zip" in the emails)
http://us2.metamath.org/min2018-logs.zip - individual logs from the runs (called attachment "metamath_logs.zip" in the emails).  I think that the minjobs.log you used is just a concatenation of all of these.

The emails describe how we did this as well as some problems we ran into. Some jobs took much longer than others, and we ended up with 72 logs.

The scripts run "minimize" but just log the results and don't save the proofs.  I collected all the logs and created a script to run the actual minimizations separately.

In min2018-jobs.zip, "minimize_with * /no_new_axioms_from ax-* /time" should be changed to "minimize_with * /allow_new_axioms * /no_new_axioms_from ax-* /time" because of the additional protection against accidentally depending on new axioms that was recently added to "minimize" ("/allow_new_axioms" opens it to all $a statements, and "/no_new_axioms_from" blocks ax-*).
 

I decided to give this task a shot, but in the end the results turned out to be duller than I had hoped. I think it's still worth reporting these findings, as someone with better data analysis skills may get interested in the problem. My procedure was as follows.

First of all, I prepared the data as described in Norman's original topic. Basically, I took the run times from the minimization log by Dylan Houlihan and augmented them with label numbers and proof sizes, for a total of 28438 items (a few had to be removed since, for instance, 4exdistrOLD wasn't present in the set.mm version linked in the post). Label numbers in mathboxes were shifted up by 1070 (as suggested in the original post) to compensate for the fact that 'minimize' doesn't scan mathboxes by default.

Perhaps not surprisingly, the majority of theorems (~76%) take less than 15 seconds to minimize. The mean minimization time is ~45s and the standard deviation is ~228, so already the "formula" predicting 0s for each theorem gives us a "small" mean error :-) Intuitively it seems that the formula ought to look like c*label^a*size^b, where a, b and c are constants. I tried various curve-fitting techniques available in Wolfram Mathematica; for instance, it suggested 0.0000811892*label^0.405776*size^1.33858, which has mean error ~35 and standard deviation ~155. A simpler hand-found formula, 4.33e-9*label*size^1.7, produces a comparable mean error of ~25 and standard deviation of ~180.

This is better than I expected.  I would have been happy if the job run times were within a factor of 2 of each other.  Our problem was that some jobs took orders of magnitude longer.  So either of these formulas should be fine.
 

These formulae have a small "global" coefficient (~1e-9 in the latter example), which suggests that perhaps one should differentiate between the small-run-time and large-run-time cases. I set the small class to be theorems taking less than 15s to minimize and trained a basic linear classifier to distinguish it from the "large time" class. I then curve-fitted a formula of the form c*label^a*size^b separately in the small and large cases, obtaining the following prediction formula:

If -0.0572153 + 2.68348e-6*label + 0.000206234*size < 0.5 then 4.92e-6*label^0.68*size^1.03 else 0.000479*label^0.32*size^1.25,

which has mean error ~28 and standard deviation ~156. Clearly it's not a large improvement over the simpler formulae listed above.

That is more or less what I have for approximating minimization run times. Now, the intended application of such an approximation is scheduling minimization jobs evenly, so let's switch our attention to that.

I assumed that we have 16 CPUs (processes, threads, etc.) available and that we schedule our theorems for minimization by the Longest Processing Time (LPT) algorithm (each task is assigned to the CPU with the smallest total load).

If I understand correctly, we should first add the theorems with the largest expected run time, right?  Otherwise, if there are 999 fast tiny ones added first, they would end up about evenly distributed, and the 1000th giant one added at the end would skew one job significantly.
 
During the scheduling phase we estimate times with one of our formulae, but after the theorems have been spread over the CPUs we can evaluate the schedule using the real run times. The relevant parameters are the average CPU load, which is always ~79495s, and the latency, which I define as the difference between the maximum and minimum CPU times (the smaller the latency, the more even the CPU loads). It is useful to compare our results with a straightforward scheduling approach where one randomly partitions the theorems into 16 (almost) equal lists. In my experiments such an approach yields latency ~39397, so that can be taken as a baseline.

So, the first formula (together with LPT scheduling) produces the following list of CPU run times:

[76126,83042,81573,90121,72944,79387,74600,84677,78886,69388,86309,83830,69727,74918,81779,84567],

which has latency ~20733 and standard deviation ~6055. The second (hand-found) formula gives latency ~32342 and standard deviation ~8781, an arguably worse result (the latency is not much different from the random approach). Finally, the last (piecewise) formula gives latency ~40756 and standard deviation ~9684, which is worse than the random approach.

It will be interesting to see how this works out for the current set.mm, which has had massive modifications in the past 2 years.
 

What is, in my opinion, surprising is that the simple formula 'size' (i.e. minimization time proportional to proof size; notice that we don't need a multiplicative constant, since we are only summing and comparing values in the scheduling algorithm) produces latency ~19218 and standard deviation ~6388, which is on par with the first formula found by curve-fitting. I have tried other simple expressions without great success, so this seems like the best result I can get.

That is how far I could get with my basic knowledge of statistics. Apparently the end result is that one can simply estimate a theorem's minimization time by its proof size and then use well-known scheduling algorithms to assign jobs :-) I couldn't find any reports on the minimization run performed by Dylan Houlihan (for instance, I don't know the "latency" of that run or even the number of threads used), so I don't know whether the approach I'm suggesting has any benefits over the method used by Norman and Dylan. Nevertheless, I hope this small report was interesting to read, and maybe it will motivate other people to come up with a better formula and/or approach. Here's the dataset I used: https://gist.github.com/savask/900161e9d093a5b787b6473113dfde16. I'm also looking forward to any questions or criticism regarding the approaches used here. I can't boast a good knowledge of statistics or data science, and my "analysis" should be taken with a grain of salt, so I will be happy to hear other people's opinions.

BTW, when errors are caused by "Proof modification is discouraged" or missing theorems, the scripts will just go on to the next one rather than abort.  For this purpose I modified metamath.exe to add "_exit_pa" as a synonym for "exit" inside MM-PA, which the MM> prompt will just ignore (with an error message).

Norm

savask

Jan 1, 2020, 11:55:08 AM
to Metamath
For you or anyone wishing to put together the scripts for the current set.mm using your formula(s), I uploaded 3 files from the 2018 run for reference:
...

The emails describe how we did this as well as some problems we ran into. Some jobs took much longer than others, and we ended up with 72 logs.

With that information it is possible to compare the schedule you used in 2018 with schedules built by other approaches. Your schedule apparently had the form:

[41040,57642,44711,45151,50352,45021,49597,46507,42826,43233,44058,43347,49606,48702,45067,39088,45075,44259,44467,50376,48313,45099,47391,225349]

with the maximum job time being 225349s. Meanwhile, if one were to randomly scramble the theorems and distribute them evenly among 24 CPUs, the maximum job time would have been ~80521s, which is ~36% of the 2018 run. The maximum job time resulting from the LPT algorithm with the 'size' heuristic is ~67790s, which is ~30% of the 2018 run. If we could predict minimization run times with absolute precision, the LPT algorithm would have produced a makespan of ~52994s, which is ~24% of the 2018 run; that seems to be out of reach for now :-)

If I understand correctly, we should first add the theorems with the largest expected run time, right?  Otherwise, if there are 999 fast tiny ones added first, they would end up about evenly distributed, and the 1000th giant one added at the end would skew one job significantly.

The Longest Processing Time (LPT) scheduling algorithm works like this. You sort the jobs (theorems) by the time they take (proof sizes in our case) in non-increasing order, and then assign each job, one by one, to the CPU with the smallest load (a CPU's load is the sum of the run times of the jobs assigned to it so far). It's essentially the same as if each CPU pulled a new job from the sorted queue as soon as it completed its previous task.

The best way would of course be to implement that algorithm "online", so that different copies of the metamath program pull theorems to minimize from a common sorted queue, but I don't know a way to do that without some sort of threading library.
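(One way to approximate that "online" behaviour without modifying metamath itself would be a small wrapper that keeps N worker slots busy. A sketch, assuming hypothetical one-theorem command files in the same format as the job1xx.cmd scripts, sorted by decreasing proof size:)

import subprocess
from multiprocessing.pool import ThreadPool

def run_job(cmd_file):
    # Each worker blocks here until its current theorem is done, then
    # automatically pulls the next command file from the shared queue.
    subprocess.run(["./metamath", "submit %s" % cmd_file, "exit"], check=False)

def run_online(cmd_files_sorted, n_workers=24):
    # cmd_files_sorted: one hypothetical .cmd file per theorem,
    # largest expected run time first (the LPT order).
    with ThreadPool(n_workers) as pool:
        pool.map(run_job, cmd_files_sorted, chunksize=1)

Threads are enough here because the real work happens in the external metamath processes; each thread just waits for its own subprocess to finish before taking the next file.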

It will be interesting to see how this works out for the current set.mm, which has had massive modifications in the past 2 years.
 
I made a bash script to generate the job1xx.cmd files automatically (https://gist.github.com/savask/e44ee6086a1321b2bbddbce027975bae). To run it, modify the metamath_exe variable inside the script to point at your local metamath installation and run it as "./preparejobs.sh N filename", where N is the number of CPUs and filename is the database name, for example "./preparejobs.sh 24 set.mm". It will generate N job1xx.cmd files in the same directory, in the same format as in the min2018-jobs.zip archive.

My bash skills are far from perfect, so feel free to report any bugs in that script. I'm very curious to know whether it will improve the full minimization run, so I hope it works well :-)