Hutter Prize Entry "STARLIT" Open For Comments

149 views

James Bowery

Jun 7, 2021, 1:12:31 PMJun 7
to Hutter Prize
The judging committee has ruled Artemiy Margaritov's STARLIT compressor is a new prospective winner of The Hutter Prize!

As per the award rules for The Hutter Prize For Lossless Compression of Human Knowledge, this initiates the open comment period, prior to the final award:

​The prize is awarded as follows:

Jeroen van der Burg

Jun 10, 2021, 4:14:54 PMJun 10
First congratulations to Artemiy Margaritov for this accomplishment!
I find the algorithm description very interesting.

Quote: "Due to the limited size of the buffer, the longer the accumulated information stays in the buffer, the higher the chances that this information will be corrupted or evicted."
and
Quote: "During the searching phase, firstly, each article is mapped to a feature vector using a Doc2Vec model. Secondly, considering each feature vector to be a point in a Euclidean space, the Traveling Salesman Problem is solved resulting in the shortest path visiting each point. As a result of these operations, the found shortest path represents the order of all articles where similar articles are placed nearby."

If I understand correctly, the feature vector is something like {8, 4, 7, 2} for an article and {1, 3, 3, 7} for another. (Can be another number than 4 values and floats instead of integers, but than doesn't matter for my question)
When the articles are ordered you would prefer {1, 3, 3, 7} closer to {2, 3, 2, 6} than to {8, 4, 7, 2} and that is done via the TSP.
What size is this buffer where the information is accumulated? Would it be possible to copy the data in the buffer, process a few articles, and copy back that buffer later? Maybe even recursively remembering a few buffers?
If that would be possible, would a minimum spanning tree also be possible to reorder the articles?
If I remember correctly, a minimum spanning tree with the same coordinates used as a solution for the TSP would have a shorter length of all lines added together.

Regards,
Jeroen

--
You received this message because you are subscribed to the Google Groups "Hutter Prize" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hutter-prize...@googlegroups.com.

Artemiy Margaritov

Jun 14, 2021, 9:22:19 AMJun 14
to Hutter Prize
Dear Jeroen,

Thank you for your kind words and interest in STARLIT.

Q1: If I understand correctly, the feature vector is something like {8, 4, 7, 2} for an article and {1, 3, 3, 7} for another. (Can be another number than 4 values and floats instead of integers, but that doesn't matter for my question)
When the articles are ordered you would prefer {1, 3, 3, 7} closer to {2, 3, 2, 6} than to {8, 4, 7, 2} and that is done via the TSP.
Your understanding is correct. Each article was mapped to a vector of 300 floats. The TSP was solved using the Concorde TSP solver

Q2: What size is this buffer where the information is accumulated?
This is a good question. In the STARLIT description, I mean a logical buffer that is used by a compressor/decompressor for accumulating context information for prediction. The cmix compressor employs a combination of models for prediction, e.g. PAQ8HP, PPMD, LSTM, and others (the full list can be found in \$REPO_ROOT/src/predictor.cpp). Each of the cmix models has its own separate buffer where it stores its model-specific context information. The combined size of all buffers of all models is compressor's/decompressor's RAM usage which is restricted to be under 10GB by the competition rules. Thus, one can say that the overall buffer size is 10GB.

Q3: Would it be possible to copy the data in the buffer, process a few articles, and copy back that buffer later? Maybe even recursively remembering a few buffers?
Yes, it is possible to create a checkpoint of the content of all model buffers to disk as the competition rules allow up to 100GB disk usage (at each moment). As a result, one can have up to 10 saved checkpoints at the same time. However, frequent disk writes/reads can require time and may leave less time for making CPU-time-consuming predictions (e.g. by the LSTM-based model). It seems that there is a trade-off between benefits from checkpoints and losses from waiting for disk operations. This trade-off requires investigation.

Q4: If that would be possible, would a minimum spanning tree also be possible to reorder the articles?
If I understood you correctly, you would like to use a minimum spanning tree to "guide" creating and restoring checkpoints of model buffers. This is an interesting idea. It may improve the overall compression ratio if enwik9 has a set of articles that are 1) close to each other by similarity but 2) far away from all other articles as a group. Since the group is remote, no other articles are similar to articles from the group. As a result, it may be beneficial to drop all context information accumulated during processing articles from the group by loading a previous checkpoint. It would be interesting to see if such remote and dense groups exist in enwik9; it is a good idea for further research on the STARLIT approach. There is one caveat though: I think that encoding of the minimum spanning tree takes more bits than that of the shortest path because the minimum spanning tree requires additional symbols for checkpointing/restoring commands. Thus, when using a minimum spanning tree, the compressor's size can be larger than when using the shortest path.

Please let me know if you have any further questions.

Best regards,
Artemiy Margaritov

by...@byronknoll.com

Jun 15, 2021, 2:59:26 PMJun 15
to Hutter Prize
Hi Artemiy,

Congrats on the great results! I have a couple questions about the entry:

1) What is the size of the latest submitted entry? I tried running the code currently checked into GitHub, and get 402,237 bytes for compressor (S1), 114,942,851 bytes for decompressor (S2), for a total size of 115,345,088 (which is worse than what is reported in the README). For the result currently reported in the README, the decompressor size (S2) is stated as 114713624 bytes. If we compare to the "Expected STARLIT compressor output" section, 114713624 is the size of just the enwik9 compression, without the other parts of the self-extracting archive.

2) Is the source code (PySpark?) for generating new_article_order released anywhere? If not, are you planning to release it?

Thanks!
// Byron

Matt Mahoney

Jun 15, 2021, 4:57:04 PMJun 15
to Hutter-Prize
My results from testing starlit is that it meets the conditions for
the Hutter prize. On June 8 2021:

I am currently testing the May 31 submission on my Lenovo core
i7-1165G7, 2.80 GHz, 16 GB in an Ubuntu 20.04 shell under Windows 10.
The compiled size under clang++-12 is 401505 bytes (cmix_orig = 124984
bytes), which is larger than the prebuilt binary (cmix_amdzen1_ub18 =
114012) but smaller than a zipped directory tree of the source code
and makefile = 639942 bytes (mostly the uncompressed article
reordering data). I will use the executable size for both the Hutter
prize and large text benchmark score.

I have the following results so far. All tests use 9.5 GB of memory
according to top and /usr/bin/time -v.

cmix -n enwik6 -> 199039, 276s, 280s (size, compress and decompress
times, no preprocessing)
cmix -c enwik6 -> 198967, 280s, 270s (preprocessing, no dictionary)
cmix -c .dict enwik6 -> 179224, 186s, 191s (with dictionary)
cmix -c .dict enwik8 -> 15215107, about 5 hours, testing decompression
now. I didn't get an exact time because I had to disable sleep mode
and leave the lid open to keep the CPU running at 100%.

I expect compression and decompression to take about 2 days each (50
hours), which is within the requirements for the Hutter prize (70
hours on a geekbench 5 T = 1427).

On June 13 2021:

Starlit test was successful.
archive9 size 114951433
Compress time 48:19:20
User time 173865 s system 92 s
Max resident set 10230464 kb
Decompress time 47:41:28
User 171549s sys 133s
10233912 kb max resident
enwik9_uncompressed compares OK.

Also I have updated the large text benchmark. starlit is #2.
http://mattmahoney.net/dc/text.html

--
-- Matt Mahoney, mattma...@gmail.com

Artemiy Margaritov

Jun 15, 2021, 4:58:15 PMJun 15
to Hutter Prize
Hi Byron,

1) There is a mistake in the table with results in the README: S2 only includes the size of the cmix output and doesn't include the size of the cmix decompressor code/dictionary. Thank you for pointing this out. On the evaluated AMD processor, the missing part is about 192KB (114012 bytes for cmix code and 77986 bytes for the English dictionary). The correct S2 on the evaluated AMD processor is 114905622 bytes. The ~50KB difference with your result (114,942,851) can be explained by the fact 1) that clang-12 generates a shorter cmix executable for an AMD processor than for an Intel processor (this explains the ~20KB difference) and 2) the seed for the random generator used for initializing LSTM weights is tuned for the evaluated AMD processor (the remaining ~30KB difference).

I've corrected the mistake in the README. FYI, as far as I know, Matt Mahoney also validated STARLIT on an Intel processor and got results similar to yours: S1 - 401505 bytes, S2 - 114951433 bytes (link).

2) Yes, I plan to add the infrastructure for generating a new article order to the STARLIT repo. I will release it by June 21.

Best regards,
Artemiy Margaritov

by...@byronknoll.com

Jun 16, 2021, 3:04:08 AMJun 16
to Hutter Prize
Thanks for the replies! One follow-up question for Matt about the time constraint: is it correct that the time limit for future entries is 70 hours on the T = 1427 machine? That seems to be about double the time limit currently posted on the Hutter Prize website (50000/T).

Matt Mahoney

Jun 16, 2021, 4:12:46 PMJun 16
to Hutter-Prize
Perhaps Marcus Hutter can resolve this. The Hutter prize main page says the decompressor must run in under 100 hours on our test machine, which links to a description of my old Dell laptop Core i5-M620 with 4 GB memory. My own geekbench 4 scores are 2505 single threaded and 4399 multi core, which are close to the posted score from the Geekbench website.

But in the rules page where it says 50000/T hours, T refers to the Geekbench 5 score. On that machine my test got 566 single, 1104 multi. The Dell doesn't have enough memory to run tests since the prize was updated 10x.

The rules page was updated with my new machine, a Lenovo 82HT laptop, Core i7-1165G7 2.80 GHz, 4 cores, 8 threads, 16 GB RAM, 500 GB SSD, Windows 10/Ubuntu 20.04. Geekbench 4 scores are 6638 single, 20384 multi. Geekbench 5 scores are 1427, 4667 as published. So under those rules the time limit is 35 hours (50000/1427) for a single threaded program or 10 hours 42 minutes if more than one thread is running.

Starlit takes 48 hours to compress on my machine. I am testing cmix-hp now. It takes 7.6 hours to compress enwik8, so I would expect it to take 76 hours. Both are single threaded. Starlit uses 10GB resident. Cmix differs only in the larger, memory mapped ppmd model. It uses 5 GB resident and 24 GB virtual memory at 99% CPU and surprisingly few page faults.

So it depends on how you interpret the rules, and Marcus will have the final say. Yes it's under 100 hours on my (different) test machine. And I'm not the only one testing it.

Also I added starlit to the large text benchmark with a description. It is second now but I expect cmix-hp to take that spot. Starlit = article reordering + dictionary from phda9 + reduced cmix.