checkmm ported to TypeScript

82 views
Skip to first unread message

Antony Bartlett

unread,
May 16, 2022, 9:02:08 AM5/16/22
to meta...@googlegroups.com
Hi,

I'm writing to announce I have completed my port of the checkmm verifier from C++ to TypeScript.  The test suite passes.

If you have nodejs installed (and tens of millions of developers worldwide do) you can verify an mm file in your current working directory with a command, e.g.

npx checkmm set.mm

Though naturally it's more efficient to install my checkmm package first.

Link to package and source code repositories here


(that's the announcement done, just a few remarks to follow).

The O(n^2) problem I mentioned previously turned out to be due to the JavaScript Array.shift operation (which removes an item from the front of an array) being an O(n) operation which was used n times.  Glauco reports previously having a similarly unexpected problem with Array.slice.  The takeaway from this is to make sure you properly understand the capabilities of whatever containers you decide to use, and in particular to be wary of a JavaScript array that might be a million items or more long.

I just reversed the array, by the way - which is also an O(n) operation, but I only had to do it once, then my tokens could be removed in the desired order with Array.pop which is an O(1) operation.

What enabled me to track down this problem in about three quarters of an hour was knowing my way around the Visual Studio Code debugger - that's what I didn't have in 2019 when I first encountered this difficulty.  Debuggers have their detractors (Linus Torvalds is probably the best known), and in the context investigating performance issues they're sometimes called the "poor man's profiler", but personally I find a debugger to be an indispensable general purpose tool

I did this port because I like Eric Schmidt's C++ source code for checkmm.  I alway respect a job well done by source code, but actually it's quite rare for me to like the source code itself.  As such my port should be pretty faithful to the original.  Hopefully you should almost be able to step through the two source files line-by-line and treat it like a Rosetta Stone between C++ and TypeScript.


It should be said though that reading code written with C++'s Standard Template Library containers is a bit of a specialised skill.  It's never let me down with unexpectedly slow operations not immediately obvious from the documentation like JavaScript has above, but with that caveat aside, I anticipate my TypeScript port being found to be slightly easier to read.

    Best regards,

        Antony

Thierry Arnoux

unread,
May 16, 2022, 10:45:40 AM5/16/22
to meta...@googlegroups.com, Antony Bartlett

Congrats, a new verifier is always a good news!

And Typescript is a very relevant language.

Interestingly enough, Raph Levien (& al)'s Python verifier has the exact same "trick" with reversing the token buffer, then popping from it. This had me wonder.

What's the current running time?

In any case I would suggest adding it to our set.mm CI!

_
Thierry

--
You received this message because you are subscribed to the Google Groups "Metamath" group.
To unsubscribe from this group and stop receiving emails from it, send an email to metamath+u...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/metamath/CAJ48g%2BD5uC0tesStx%2Bj-bJe1Zmw4bkx1pqG-puoHpAWosX7k0Q%40mail.gmail.com.

Mario Carneiro

unread,
May 16, 2022, 2:41:37 PM5/16/22
to metamath, Antony Bartlett
The pop/reverse does seem unnecessary. I would just keep track of the position in the array, possibly using a little structure containing the array and the current position with some functions to get the current token and move to the next. In fact, you can use this interface to parse the next token on demand, so that you don't need to precalculate the array of tokens (which involves millions of tiny allocations); you just hold on to the input string and pull off tokens as you need.

Antony Bartlett

unread,
May 17, 2022, 4:16:33 AM5/17/22
to meta...@googlegroups.com

Thierry Arnoux writes:

> Congrats, a new verifier is always a good news!

Thanks! :-)

> What's the current running time?

It should be respectable.  It's verifying set.mm for me in about eight seconds on the MacBook Pro I'm using.  Unfortunately XCode is currently broken on said device so I'm unable to build other verifiers in order to provide a fair comparison, other than to make the obvious statement that this is not, and never will be, a competitor for metamath-knife.

Mario's suggestion could potentially shave off maybe a couple of seconds from that, but I won't know until I try.

    Best regards,

        Antony


Thierry Arnoux

unread,
May 17, 2022, 5:30:01 AM5/17/22
to meta...@googlegroups.com, Antony Bartlett

Eight seconds is respectable, we should try to add this new verifier to our set.mm CI.

Mario's suggestion is actually the way metamath-knife works: this avoids both to allocate and to copy each token. However, Typescript seems to create a new String every time even a substring or a slice is taken, so if you want to get this kind of speedup in TS, you'd probably need to use shared array buffers, which `slice` method, I believe, does not copy. That structure shall also allow use in multi-threading, which is another way to get speedup.

_
Thierry

Mario Carneiro

unread,
May 17, 2022, 12:47:23 PM5/17/22
to metamath
On Tue, May 17, 2022 at 5:30 AM Thierry Arnoux <thierry...@gmx.net> wrote:

Eight seconds is respectable, we should try to add this new verifier to our set.mm CI.

Mario's suggestion is actually the way metamath-knife works: this avoids both to allocate and to copy each token. However, Typescript seems to create a new String every time even a substring or a slice is taken, so if you want to get this kind of speedup in TS, you'd probably need to use shared array buffers, which `slice` method, I believe, does not copy. That structure shall also allow use in multi-threading, which is another way to get speedup.


Note that even if subslicing creates a new string, it would still be worthwhile to do the slicing on the fly since these strings only exist for a short time: many of them are discarded (like comment tokens or structural tokens like $c or $.) and only math tokens and label tokens are retained for longer. GC generally does better if you have lots of short lived allocations than if you have them all in flight simultaneously (which is what happens if you first tokenize the entire file and then go through the array of tokens).
 

Antony Bartlett

unread,
Aug 29, 2022, 9:37:23 AM8/29/22
to meta...@googlegroups.com
On Mon, May 16, 2022 at 7:41 PM Mario Carneiro <di....@gmail.com> wrote:
The pop/reverse does seem unnecessary. I would just keep track of the position in the array, possibly using a little structure containing the array and the current position with some functions to get the current token and move to the next. In fact, you can use this interface to parse the next token on demand, so that you don't need to precalculate the array of tokens (which involves millions of tiny allocations); you just hold on to the input string and pull off tokens as you need.

On 17/05/2022 16:16, Antony Bartlett wrote:

It's verifying set.mm for me in about eight seconds on the MacBook Pro I'm using.  [...]

Mario's suggestion could potentially shave off maybe a couple of seconds from that, but I won't know until I try.

 
It's probably about time I reported back about this.  The fact that JavaScript is asynchronous is usually one of its strengths.  You load or download your .mm file, or a chunk of your file.  Use the 'await' or 'then' keyword, and other parts of your program continue to execute until the file is ready.  In an .mm file you're typically never more than three tokens away from the possible end of a file inclusion and needing to load (or download) an additional file.  That means to avoid having to pre-load all the .mm data, the gettoken function needs to be asynchronous so we are able to get a new file should we need one.  There is a slight overhead on making a function asynchronous which I knew about but have never in my career as a programmer had a problem with before.  It's a problem when you have literally millions of calls, as is the case with gettoken because there are millions (about 30 million was it?) of tokens in set.mm.  The checkmm verifier slowed down threefold when I tried this - my eight second verification of set.mm became 24 seconds.

There's a second reason for pre-loading all the .mm data up front, which is there's no maximum token length in the .mm spec, so the only way of knowing you have enough data in your buffer to load the next token is to have all of the data.

I'm aware of one other approach to avoid having to pre-load all the .mm data, which is to run the parser speculatively, throw an exception if you run out of data, and where you catch the exception find a point to recover from, wait for a little more data to be loaded, and try again.  That requires a very well thought out parser that protects the user of the API from this unpredictability.  What I have at the moment is a checkmm verifier that I've deliberately exposed the internals of to make it fully hackable.  This is good at the moment because a unifier and a html generator for example both place new and different demands upon a .mm parser which I have some vague ideas about but don't fully understand yet.  Maybe one day I'll be ready to write a .mm parser that is all things to all people, very fast, and nicely hidden behind a good api, or maybe checkmm will always be good enough for my purposes.  Certainly for the moment it's exactly what I need.

By the way, I stand behind the decision to bundle parser and verifier together, because I feel a verifier should be small and self contained, while at the same time there's no reason not to expose the parser as an API for other programs to use in a context where checkmm is not being used as a verifier.

When I tried to avoid having a million-token array by lexing the data twice, the verifier took eight seconds to run.  I couldn't tell if it was faster or slower than using an array.  When I tried dropping support for file inclusions so that I only had to lex once, I managed to shave a third of a second off my eight second running time.

This demonstrates what I never doubted, which is that Mario is right about a huge token array being an avoidable overhead in a parser/verifier, but it's not a huge consideration - JavaScript arrays actually seem to be very fast and efficient at handling a very large number of little strings.  Therefore I'm sticking with the token array for now.

    Best regards,

        Antony


David A. Wheeler

unread,
Aug 29, 2022, 3:07:53 PM8/29/22
to Metamath Mailing List, Antony Bartlett


> On May 17, 2022, at 5:29 AM, Thierry Arnoux <thierry...@gmx.net> wrote:
>
> Eight seconds is respectable, we should try to add this new verifier to our set.mm CI.

Agreed! That's faster than some other checkers we already use, so if we
just add it in parallel it shouldn't slow down the final result.
The instructions here seem clear:
https://github.com/Antony74/checkmm-ts

Let me know of any objections to adding it to our test suite.

We now have several checkers based on the C++ checkmm by Eric Schmidt,
so an error in the C++ version is more likely to hit other checkers.
It's already known that N-version programming (writing multiple programs with the
same goals) are *not* necessarily independent in terms of errors. This was shown in
a famous software engineering paper:
"Analysis of faults in an N-version software experiment" by Brilliant, Knight, & Leveson,
IEEE Transactions on Software Engineering ( Volume: 16, Issue: 2, February 1990)
https://ieeexplore.ieee.org/document/44387
http://sunnyday.mit.edu/papers/nver2.pdf
Still, while the odds of detection are *less* than one would expect from
independence, it still provides us with improved evidence that the proofs are correct...
and that's a good thing.

--- David A. Wheeler

Benoit

unread,
Aug 30, 2022, 11:13:37 AM8/30/22
to Metamath
A theoretical remark, please correct me if I'm wrong:

Currently, the CI runs 5(?) verifiers on the databases.  Since verifiers are not overly complex, it is very unlikely that they have a "common bug" (i.e., that all have a bug which makes the same "wrong proof" be undetected).  Actually, if there were a loophole in our verification process, it would be more likely to lurk in the CI software or the .yml file or the various scripts it uses.

The bug could also come from the compilers (but again, the various compilers would need to have a kind of "common bug" which is "exploited" by all the verifiers... very unlikely).  Anyway, an additional assurance would come from using other compilers (in place or in addition to the ones currently used).  I do not advocate doing that in the CI, which is already overkill in my opinion, but currently, the CI compiles metamath.c using gcc (Line 102 in https://github.com/metamath/set.mm/blob/develop/.github/workflows/verifiers.yml#L102).  We could compile it using the certified compiler compcert (https://compcert.org/).  It supports "almost all of C11" and "generates efficient code".  Has anyone tried to compile metamath.c with it ?

Benoît

Antony Bartlett

unread,
Aug 31, 2022, 5:01:30 AM8/31/22
to meta...@googlegroups.com
On Monday, August 29, 2022 at 9:07:53 PM UTC+2 David A. Wheeler wrote:
Let me know of any objections to adding it to our test suite. 
 
Well I'd love to see this, but of course I'm biased ;-). I don't think it adds much value because it's a port, and any errors in the original program are likely to have been faithfully reproduced.

Is it possible to get run time statistics from the test suite? - i.e. see how long each validator is taking to run in the CI environment.  That's definitely something I'd like to see, as at the moment you only have my word for it that checkmm-ts is fast (but nowhere near metamath-knife's league).

    Best regards,

        Antony



David A. Wheeler

unread,
Aug 31, 2022, 9:30:00 AM8/31/22
to Metamath Mailing List
Take a look at any pull request, and click on the checks.
For example, if you go here:
https://github.com/metamath/set.mm/pull/2803
Click on any of the green checkmarks, which will show you the list of checks.
You can then see their times.

That only shows the times for that case; if you were serious you'd take a sample.
But we aren't in *that* much of a rush that running some tests hampers anything.

--- David A. Wheeler

Reply all
Reply to author
Forward
0 new messages