Modularity for set.mm; interface design for the "perfect editor"


Mario Carneiro

Apr 16, 2016, 6:53:12 AM4/16/16
to metamath


On Sat, Apr 16, 2016 at 1:09 AM, Norman Megill <n...@alum.mit.edu> wrote:
On 4/15/16 12:25 AM, Marnix Klooster wrote:
Hi all,

I'm not invested in any design direction here, but here is just one
remark from my side.

It seems to me that the goal here is to have some modularity and
information hiding, and for that one normally uses mechanisms like
scoping and modules, instead of human readable comments (in English even).

Therefore IMHO the current design should either be really lightweight,
to be replaced by a real 'module system' later; or the current design
should already be a step in the direction of said module system.

(Goals of such a module system would be to break up set.mm
<http://set.mm> in more decently sized pieces; and to allow people to
use, e.g., the logic part of set.mm <http://set.mm> without any set
theory or anything else.)

[...]

Getting back to Marnix's comment, our general philosophy (outside of the special cases above) has been to make available everything that came previously in order to achieve efficient proofs and provide the user with the most flexibility.  If someone wishes to add to the logic theorems in set.mm, there is no reason not to make it part of set.mm itself, say in a mathbox, so that everyone will benefit.

[...]

As for modularity, in set.mm you will find commented out file includes for things like logic.mm.  At one point I had several pieces, but they were a pain to keep in sync (without special software that didn't exist and still doesn't).  I think anyone who has done major work on set.mm will agree that it is very convenient to have everything in one file for global label/symbol changes, moving sections around, etc. with a text editor.  Eventually set.mm will have to be broken up, but we aren't at that point yet.

Actually, I've been brainstorming how to fix this: what should the perfect Metamath editor look like? It's quite clear to me, at least, that we are not there, and having one 16MB file is a symptom of this. Most file editors are just not designed for this, and it is not good to need to read the whole file and store it in memory in order to get to a single theorem. Edits to an .mm file are almost always local, touching only a tiny part of that giant file, and splitting it up is one way to focus development a bit more.

Could you repeat the reasons why you think having set.mm in one file is a good thing?

1. Moving sections around is trivial if sections are in separate files: it just means moving a file and possibly changing the dependencies. Moving a theorem or range of theorems works the same with multiple files: just cut and paste.

2. Global renames work just as well with multiple files; almost all file editors have a "find/replace in files" feature, which would do the trick. Ideally this should be done by the aforementioned "perfect metamath editor", though, so that it can be syntax-aware and only do the replacement where it is syntactically valid.


If you could start over from scratch, that is, some kind of "Metamath 2.0", how would you design the interface? What is the core structure of MM file data? Some thoughts:

* The basic design is a sequence of theorems with possibly shared environments. Within each theorem, though, the data structure is something complex, which we have decided not to store in plain text in the present version, necessitating interaction with a proof editor in any case. (That is, a user cannot practically write a complete mm file by hand, and must resort to a program to write the proof parts.)

* There is a logical two-level interaction pattern with data in the file:
1. Maintenance mode: Adding and removing theorems, changing theorem statements, changing hypotheses and disjoint variable conditions.
2. Proof mode: Writing proof bodies.

* The raw MM file presentation is suboptimal: on the one hand, users must interact with it directly for maintenance mode yet are presented with human-unreadable data in the proofs, while on the other hand, existing tools support proof mode work but do not handle any maintenance mode activities.

* An easy presentation format for maintenance mode is the raw MM file, but with proofs collapsed or omitted. The operations on such a representation might also be restricted beyond arbitrary text edits, to a more well-formed set of operations such as adding and removing groups, and moving statements around as atomic elements. A graphical interface for this might work, but it would need a lot of usability testing to get right.

* For proof mode, the best current contender is the MMP file format, but mmj2 is the only program that can read it, and there is no connection between this and maintenance mode in mmj2 (in fact mmj2 does not edit the MM file at all). I don't think this is the best direction in order to support proof scripts in the future, and I would support something more like a sequence of commands. (Tactic modes exist in Lean, Coq, HOL Light, Isabelle and I can explain them in more detail if anyone doesn't know how they work.) An MMP style output could be written as a sequence of ` apply xxx "|- math" ` commands like those used in MM-PA (the math strings are optional for user input).

* Linking in to an existing IDE like Eclipse is an easy way to get a lot of nice features like syntax highlighting out of the box. But any attempt to do this in maintenance mode is severely restricted by the sheer size of the file, even if proofs are omitted. Eclipse assumes that a software project has a modular structure, and I daresay that most other IDEs do as well. Even simple section-by-section splits would mostly resolve this issue.

* A primary problem that is not currently addressed by mmj2 is the ability to change the state of the database while the program is running (that is, maintenance mode changes affecting proof mode). MM-PA is able to handle this well, although it does not allow many maintenance mode changes in the first place. A good editor would have both modes and could use each to affect the other.


I'd like to collect opinions on what a good editor should look like, and establish known shortcomings of existing methods, to get a roadmap for future editors. What does your "perfect Metamath editor" look like?

Mario

fl

Apr 16, 2016, 7:21:30 AM4/16/16
to Metamath

Actually, I've been brainstorming how to fix this, what the perfect metamath editor should look like. It's quite clear to me, at least, that we are not there, and having one 16MB file is a symptom of this.

A normal database is big. Having only one file has its benefits. You can download it easily. You can
use an editor to search for proofs. Interfacing Metamath with powerful existing text tools is very good.
Parceling out the file would make everything more restrictive.

If you want to change the format, I suggest that the theorems themselves be kept in one place. Maybe
the proofs could be placed somewhere else.

--
FL

fl

Apr 16, 2016, 7:40:10 AM4/16/16
to Metamath

If you want to change the format I suggest that the theorems themselves are kept in one place. Maybe you
may place the proofs somewhere else.

I think that mathematics is linear in its essence. This is currently the way Metamath
manages it: one theorem after the other. A structure in the form of a net would make everything
more difficult to understand and manage. With a net you never know where things are.

But a simple db2 database could be considered. I'm not sure, however, that the potential benefits are worth the trouble.

The most serious current drawbacks in Metamath are the time required to minimize proofs and to generate
HTML pages. Maybe additional data structures and caches could be added to reduce those times.

--
FL

fl

Apr 16, 2016, 7:43:08 AM4/16/16
to Metamath
And for the perfect editor, I think that better-designed algorithms and routines can help to achieve
such a goal.

(In particular, the ability to add keywords to theorems so as to find them more easily.)

--
FL

Norman Megill

Apr 16, 2016, 10:17:04 PM4/16/16
to meta...@googlegroups.com
> logic theorems in set.mm <http://set.mm>, there is no reason not to
> make it part of set.mm <http://set.mm> itself, say in a mathbox, so
> that everyone will benefit.
>
> [...]
>
> As for modularity, in set.mm <http://set.mm> you will find commented
> out file includes for things like logic.mm <http://logic.mm>. At
> one point I had several pieces, but they were a pain to keep in sync
> (without special software that didn't exist and still doesn't). I
> think anyone who has done major work on set.mm <http://set.mm> will
> agree that it is very convenient to have everything in one file for
> global label/symbol changes, moving sections around, etc. with a
> text editor. Eventually set.mm <http://set.mm> will have to be
> broken up, but we aren't at that point yet.
>
>
> Actually, I've been brainstorming how to fix this, what the perfect
> metamath editor should look like. It's quite clear to me, at least, that
> we are not there, and having one 16MB file is a symptom of this. Most
> file editors are just not designed for this, and it is not good to need
> to read it all and store it in memory in order to get to a single
> theorem.

I'm not sure why it's not good. On today's computers, 28MB is nothing
compared to the available 4 to 8GB of memory. If an editor has trouble
with a file of that size, maybe the editor design is not good. While I
haven't checked
https://en.wikipedia.org/wiki/Comparison_of_text_editors#Basic_features
carefully, the currently supported ones I recognize handle several GB or
more. Emacs supposedly handles 2EB (10^9 GB).

The main editor I use reads set.mm in well under 1 sec. When I search
for a string at the end of the file it gets there essentially instantly.
I just tried a search and replace of the 27K instances of '$p' with a
larger string, and it completed before I could lift my finger from the
mouse. When I tried browsing a 2.1GB ASCII file of a human genome, it
became a little sluggish, taking 6 sec to find a string at the end of
the file.

> Edits to an .mm file almost always are local, touching only a
> tiny tiny part of that giant file, and splitting it up is one way to
> focus development a bit more.
>
> Could you repeat the reasons why you think having set.mm <http://set.mm>
> in one file is a good thing?

Simplicity. Speed. Lowest common denominator. Convenience.

> 1. Moving sections around is trivial if sections are in separate files,
> it just means moving a file and possibly changing the dependencies,
> while moving a theorem or range of theorems is just the same with
> multiple files, just cut and paste.

How do you prevent undo/redo confusion when switching among various files?

> 2. Global renames are completely identical with multiple files; almost
> all file editors have a "find/replace in files" feature, which would do
> the trick.

What if we forget to load one of them into the editor?

Anyway, the main editor I use doesn't have this feature. I could change
editors, but it would be asking me to sacrifice something I'm
comfortable with.

> Ideally this should be done by the aforementioned "perfect
> metamath editor", though, so it can be more syntax-aware and only do the
> replacement where it is syntactically valid.

The multiple-module feature you want could be done now with the $[ $]
includes.

For those wanting a single file, the includes would be commented out.
For those wanting multiple files, the includes would be uncommented.

We'd have to agree on the commenting convention, but I imagine it would
be similar to what is in set.mm now if you search for '$( $['.
Conversion between single and multiple would of course be automated in
metamath.exe or another program.

I think something like this should satisfy everyone.
FYI the current plans on my to-do list for metamath.exe include a change
to its data structure to allow any part of a statement to be modified
internally.

Applications I have in mind are global label and math token renames,
checking for conflicts before applying and also updating comment markup
and htmldefs. This will prevent accidental changes to English words
that might happen in an editor. Maybe I'll update the Recent Label
Changes automatically.

I don't have plans at this point to provide for insertion, deletion, or
changing the order of statements, or to provide for editing comments.

> I'd like to collect opinions on what a good editor should look like, and
> establish known shortcomings of existing methods, to get a roadmap for
> future editors. What does your "perfect Metamath editor" look like?

These are interesting ideas for a Metamath 2.0, and I'd like to hear
other opinions too.

As a general comment, I think it would be useful for Metamath 2.0 to be
translatable to and from the current version. If 2.0 needs additional
keywords for its features, maybe they could be hidden in special
comments in the 1.0 version so the versions could be completely
compatible. If 2.0 has say local scoping of things like labels, the
translator could rename them for the 1.0 version with some standard
suffix to make them unique. The tools we have for 1.0 (metamath.exe,
numerous verifiers) may or may not be practical to rewrite for 2.0, but
with perfect translatability they could continue to be used.

And ideally 2.0 could be verified with a 74-line program. :)

Norm

Mario Carneiro

Apr 17, 2016, 6:24:41 AM4/17/16
to metamath
On Sat, Apr 16, 2016 at 10:16 PM, Norman Megill <n...@alum.mit.edu> wrote:
On 4/16/16 6:53 AM, Mario Carneiro wrote:
Actually, I've been brainstorming how to fix this, what the perfect
metamath editor should look like. It's quite clear to me, at least, that
we are not there, and having one 16MB file is a symptom of this. Most
file editors are just not designed for this, and it is not good to need
to read it all and store it in memory in order to get to a single
theorem.

I'm not sure why it's not good.  On today's computers, 28MB is nothing compared to the available 4 to 8GB of memory.  If an editor has trouble with a file of that size, maybe the editor design is not good.  While I haven't checked https://en.wikipedia.org/wiki/Comparison_of_text_editors#Basic_features carefully, the currently supported ones I recognize handle several GB or more.  Emacs supposedly handles 2EB (10^9 GB).

The thing is that it's not just 28MB (wow, it's grown a lot since I last checked). An editor will likely need several times that in order to store not just the raw text, but also the internal processed form of the data. Remember that the "perfect Metamath editor" will not be a plain text editor, because that doesn't correctly reflect the structure of Metamath data.

One interesting data structure I'd like to focus on is the information needed by proof mode. Roughly, this is the MM file with proofs omitted. I just did a test, stubbing out all the proofs, and the result is about 12MB which is alarmingly large. Removing comments brings it down to 5.8MB. I think there is a good argument for separating this to a different file from the proofs, so that an editor need not read the proofs if one is only interested in the "context" of one proof. This structure is also useful as the context needed for proof HTML generation.

The main editor I use reads set.mm in well under 1 sec.  When I search for a string at the end of the file it gets there essentially instantly.  I just tried a search and replace of the 27K instances of '$p' with a larger string, and it completed before I could lift my finger from the mouse.  When I tried browsing a 2.1GB ASCII file of a human genome, it became a little sluggish, taking 6 sec to find a string at the end of the file.

You have managed to get impressively far using only plain text tools, and the mm file format is somewhat tailored for use this way (e.g. spaces around label names everywhere). But I think that Metamath 2.0 should not rely on this, and use instead a more structured format, with tools that respect this format. (This might still basically be the current mm format, but the tool would not just be a text editor.)

Edits to an .mm file almost always are local, touching only a
tiny tiny part of that giant file, and splitting it up is one way to
focus development a bit more.

Could you repeat the reasons why you think having set.mm <http://set.mm>
in one file is a good thing?

Simplicity.  Speed.  Lowest common denominator.  Convenience.

I will certainly grant that it is easy to use a text editor for maintenance activities, when it can be done. The major drawback is that there are certain maintenance activities that can't be done with find/replace, and others that "almost" work but need some cleanup afterwards. A proper tool to do these refactorings in a correctness-preserving manner is warranted.
 
1. Moving sections around is trivial if sections are in separate files,
it just means moving a file and possibly changing the dependencies,
while moving a theorem or range of theorems is just the same with
multiple files, just cut and paste.

How do you prevent undo/redo confusion when switching among various files?

Some editors do shared undo, and usually a theorem move only involves one or two files anyway. But I think the correct solution here is just to commit often, and use the commits as cross-file undo checkpoints.
 
2. Global renames are completely identical with multiple files; almost
all file editors have a "find/replace in files" feature, which would do
the trick.

What if we forget to load one of them into the editor?

Usually you don't have to; my editor allows you to select a directory to search, and it will automatically open all the files that need to change, or you can just change them in place without opening them. A good maintenance tool for Metamath will search for and update all downstream files under a global rename (because otherwise it wouldn't be correctness-preserving).

Anyway, the main editor I use doesn't have this feature.  I could change editors, but it would be asking me to sacrifice something I'm comfortable with.

Like I said, my goal is to figure out what features we need in an editor of the future, so that you don't have to sacrifice anything.
 
Ideally this should be done by the aforementioned "perfect
metamath editor", though, so it can be more syntax-aware and only do the
replacement where it is syntactically valid.

The multiple-module feature you want could be done now with the $[ $] includes.

For those wanting a single file, the includes would be commented out.
For those wanting multiple files, the includes would be uncommented.

We'd have to agree on the commenting convention, but I imagine it would be similar to what is in set.mm now if you search for '$( $['. Conversion between single and multiple would of course be automated in metamath.exe or another program.

I think something like this should satisfy everyone.

A simple short term solution would be a tool to separate set.mm by section, and daisy-chain the files together with include statements. If this can be done in a perfectly reversible manner, then we could pass between each form painlessly. That is, you could take:

$( =-=- section 1 =-=- $)
thm1 $p ...
$( =-=- section 2 =-=- $)
thm2 $p ...
$( =-=- section 3 =-=- $)
thm3 $p ...

from set.mm, and form the files section1.mm:

$( =-=- section 1 =-=- $)
thm1 $p ...

section2.mm:

$[ section1.mm $]

$( =-=- section 2 =-=- $)
thm2 $p ...

section3.mm:

$[ section2.mm $]

$( =-=- section 3 =-=- $)
thm3 $p ...
 

I believe Stefan has already used such a tool for the set.mm-history project.
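For concreteness, here is a rough sketch (in Python, not any existing tool) of the kind of reversible splitter I have in mind. It keys on the toy "=-=-" headers from the example above rather than the real set.mm section-header decoration, and it leaves out choosing file names and generating the $[ ... $] include chain:

import re

# Section headers as written in the example above (a real tool would match
# the actual set.mm header decoration instead).
SECTION = re.compile(r"^\$\( =-=- (.*?) =-=- \$\)")

def split_sections(text):
    """Yield (section_title, section_body) pairs from a single .mm source."""
    title, body = "preamble", []
    for line in text.splitlines(keepends=True):
        m = SECTION.match(line)
        if m:
            if body:
                yield title, "".join(body)
            title, body = m.group(1), []
        body.append(line)
    if body:
        yield title, "".join(body)

def join_sections(bodies):
    """Concatenating the pieces in order recovers the original file exactly."""
    return "".join(bodies)

Round-tripping is then just checking that join_sections(body for _, body in split_sections(text)) equals the original text; the only extra work for the multi-file form is writing each body to its own file and prepending the include of the previous one.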

FYI the current plans on my to-do list for metamath.exe include a change to its data structure to allow any part of a statement to be modified internally.

Applications I have in mind are global label and math token renames, checking for conflicts before applying and also updating comment markup and htmldefs.  This will prevent accidental changes to English words that might happen in an editor.  Maybe I'll update the Recent Label Changes automatically.

Right, this is exactly what I have in mind. Doing this with a syntax aware tool allows you to perform these kinds of edits more reliably than a plain text editor. In fact, I question whether there are *any* maintenance activities that are appropriate for a plain text editor, with find and replace, which don't sometimes cause unforeseen mistakes.

As a general comment, I think it would be useful for Metamath 2.0 to be translatable to and from the current version. If 2.0 needs additional keywords for its features, maybe they could be hidden in special comments in the 1.0 version so the versions could be completely compatible. If 2.0 has say local scoping of things like labels, the translator could rename them for the 1.0 version with some standard suffix to make them unique.  The tools we have for 1.0 (metamath.exe, numerous verifiers) may or may not be practical to rewrite for 2.0, but with perfect translatability they could continue to be used.

I totally agree. I think that full backwards compatibility is very important for Metamath in general, so that proofs are supported long into the future, and we retain the current simplicity that has encouraged the existing proliferation of verifiers.

However, I am not certain that all features can be supported with the current mm format, in particular the separation of proofs and theorem statements, so I would relax this restriction to having a (fairly simple) translation from 1.0 to and from 2.0.


Here's a simple proposal for separating the proofs: take the proofs out of the .mm file, so that a $p statement takes the simple form:

label $p math $.

For forward compatibility, the label $p math $= ... $. form is also valid, and in this case a separate proof is not needed. Then, have a parallel file, say set.mmc to go with set.mm, which contains any proofs in binary format. Details of this format are up for discussion, but let me point out a few facts:

* It only needs to have the structure {{"thm1", "proof of thm1"}, {"thm2", "proof of thm2"}, ...} where the entries for each theorem can come in any order convenient for lookup or update (like a sorted binary tree or something).

* The file can be binary because all the human-readable information has been removed and placed in the other file. The file should be optimized for quick seeks, so that single-proof examination can operate without reading the whole file.

* Proof checking can be done in parallel (this has actually always been true, but it is harder to mix reading the file with proof checking at the same time), provided that care is taken with sequence numbers from the mm file to ensure that a theorem does not reference later theorems.

Here's an actual binary format which satisfies the above constraints:

Part 1: Header

  - 4D4D4301, 4 bytes = [MMC<hex 1>]: magic number (format specifier)
  - Length of header, 4 bytes (big endian)
  - List of records of the form:
      - Label string, zero-terminated. This is the search key, the first label in alphabetical order in the right subtree.
      - Length of left subtree, 4 bytes (skip this many bytes to get to the right subtree)
      - Length of all proofs in the left subtree, 4 bytes (skip this many bytes in part 2 to jump over all the proofs for the left subtree)
  - A zero byte (empty label string) indicates an empty subtree.

Part 2: Proof data

  - List of records of the form:
      - Length of record, 4 bytes
      - Label string, zero-terminated (this and the record length are not necessary if you use the header, but they make it easier to traverse all proofs)
      - Proof data, in a format reminiscent of the compressed proof format but without the printable-text restriction. It has the following structure:
          - List of label strings, separated by zero bytes (the referenced theorems)
          - A second zero byte denoting the end of the label list
          - The "letters" part of the compressed proof (discussed below)


That's all that is needed for efficient storage and retrieval. In order to find a specific record, one needs to read log n entries in the header, then skip straight to the proof location. Traversal over all records is easily accomplished by skipping the header, then reading each proof data record. Insertion or proof data replacement is also efficient, requiring updates to log n entries in the header. Renames are most easily accomplished as a deletion followed by an insertion. Rebalancing may be necessary, but only requires adjustment to the header.

Is it over-engineered? Probably, but the idea here is that an editor would not store this data in memory at all, and would instead just open the file whenever it needed to consult a new proof, and just use the mm file for contextual information otherwise. This, I think, can scale much better than the single giant .mm file we currently have.
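To make the lookup procedure concrete, here is a rough Python sketch under one possible reading of the proposal above (it assumes the header length counts the bytes after the length field, that the record length counts the whole part 2 record, and that part 2 stores records in alphabetical order by label; none of this is a finalized format):

import struct

def read_cstr(f):
    """Read a zero-terminated label string; b'' means an empty subtree."""
    out = bytearray()
    while (ch := f.read(1)) not in (b"", b"\0"):
        out += ch
    return bytes(out)

def find_proof(path, label):
    label = label.encode()
    with open(path, "rb") as f:
        assert f.read(4) == b"MMC\x01"                  # magic number 4D4D4301
        header_len = struct.unpack(">I", f.read(4))[0]
        part2_start = 8 + header_len                    # first byte of proof data
        offset = 0                                      # proof bytes to skip in part 2
        while True:
            key = read_cstr(f)                          # first label of the right subtree
            if not key:                                 # zero byte: empty subtree, stop
                break
            left_len, left_proofs = struct.unpack(">II", f.read(8))
            if label >= key:                            # go right:
                f.seek(left_len, 1)                     #   skip the left subtree in the header...
                offset += left_proofs                   #   ...and its proofs in part 2
            # otherwise go left, which is simply the next bytes of the header
        f.seek(part2_start + offset)
        rec_len = struct.unpack(">I", f.read(4))[0]
        rec_label = read_cstr(f)
        if rec_label != label:
            return None                                 # no proof stored for this label
        return f.read(rec_len - 4 - len(rec_label) - 1) # the remaining proof data

The point is only that a lookup is a short header walk plus a single seek, with nothing kept in memory between lookups.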

For the "letter list" part, I'll start by describing the general scheme, without picking the cutoffs yet. Divide the 256 possible byte values into three regions, A,B,C. A single number takes the form CC...CA or CC...CB (that is, all but the last byte have values in region C, and the last is in region A or B). If there are a,b,c numbers in each region (with a+b+c = 256), then sequence (a+b+cn),...,(a+b+c2),(a+b+c1),a0 translates to (((cn*c+...)*c+c2)*c+c1)*a+a0, and the sequence (a+b+cn),...,(a+b+c2),(a+b+c1),a+b0 translates to (((cn*c+...)*c+c2)*c+c1)*b+b0.

This is identical to the mixed-base representation in the standard compressed proof format, except there are three regions instead of two. The difference between the A numbers and the B numbers is that B numbers are marked, in the same sense as the "Z" mark in a compressed proof. This way, a single marked step can fit in only one byte instead of two, if the number is small enough.
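Here is a minimal sketch of the encoder and decoder for this scheme (the {a, b} = {208, 46} defaults are just the best split found below; nothing here is an implemented format yet):

def encode(n, marked, a=208, b=46):
    """Encode one step number n >= 0; 'marked' plays the role of the Z tag."""
    c = 256 - a - b
    if marked:
        out = [a + n % b]          # terminal byte in region B marks the step
        n //= b
    else:
        out = [n % a]              # terminal byte in region A
        n //= a
    while n > 0:                   # higher digits go in region C
        out.append(a + b + n % c)
        n //= c
    return bytes(reversed(out))    # most significant C digit first

def decode(data, pos, a=208, b=46):
    """Decode one number starting at data[pos]; return (value, marked, next_pos)."""
    c = 256 - a - b
    acc = 0
    while data[pos] >= a + b:      # region C bytes accumulate the high digits
        acc = acc * c + (data[pos] - (a + b))
        pos += 1
    t = data[pos]
    if t < a:                      # region A terminator: unmarked step
        return acc * a + t, False, pos + 1
    return acc * b + (t - a), True, pos + 1   # region B terminator: marked step

With these defaults, for example, encode(905, False) gives the four bytes (255, 254, 254, 73), and decode on those bytes returns (905, False, 4).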

I gathered statistics on how often each number, marked or not, is used in the compressed proofs, and tested what choices {a,b,c} lead to the shortest overall size. The largest number in set.mm is 905, and 10% of steps are marked, with a nice exponential decay for both groups.

Surprisingly, the best choices have c being incredibly small; the best solution was {208, 46, 2} with 6084 KB to store all the compressed numbers in set.mm. (For comparison, the current compressed proof format uses roughly {20, 20, 5}, which gives 9713 KB.) My instinct is to go with a more balanced choice, such as {192, 56, 8} = 6118 KB, but I guess the fact that there is a loose upper bound on proof lengths currently in force in set.mm means that numbers never get so large that more than 3 digits are needed.

Mario

fl

Apr 17, 2016, 11:15:41 AM4/17/16
to Metamath

Right, this is exactly what I have in mind. Doing this with a syntax aware tool allows you to perform these kinds of edits more reliably than a plain text editor. 

But a syntax-aware tool exists: it is mmj2. Its parser is very efficient and well designed. Now it is simple: use the syntax tree delivered by mmj2 to add
syntax-aware routines.

Currently the search/replace functions in mmj2 are poor in my opinion. They deserve to be completely redesigned.

And all those problems are not related to the way the database is stored on a disk anyway.

> In fact, I question whether there are *any* maintenance activities that are appropriate for a plain text editor, with find and replace, which don't sometimes cause 
> unforeseen mistakes.

And frankly, readable, text-based databases are never superseded by binary databases. Binary data can be useful for speed or other efficiency issues,
but it is always at the cost of poorer interfacing. I think computer engineers spend, I would say, 80% of their time with text-based
tools to manage data, systems and so on. Text-based databases have the peculiarity of interfacing well with an important element of a computer: human beings.

The errors introduced by a non-syntax-aware system are easily corrected, I think, especially with Metamath, where an error in the data is always
caught by the checker (at least for the proofs and formulas).

-- 
FL

Norman Megill

Apr 17, 2016, 7:51:21 PM4/17/16
to meta...@googlegroups.com
On 4/17/16 6:24 AM, Mario Carneiro wrote:
>
>
>
> On Sat, Apr 16, 2016 at 10:16 PM, Norman Megill <n...@alum.mit.edu
> <mailto:n...@alum.mit.edu>> wrote:
>
> On 4/16/16 6:53 AM, Mario Carneiro wrote:
...
>
> The multiple-module feature you want could be done now with the $[
> $] includes.
>
> For those wanting a single file, the includes would be commented out.
> For those wanting multiple files, the includes would be uncommented.
>
> We'd have to agree on the commenting convention, but I imagine it
> would be similar to what is in set.mm <http://set.mm> now if you
> search for '$( $['. Conversion between single and multiple would of
> course be automated in metamath.exe or another program.
>
> I think something like this should satisfy everyone.
>
>
> A simple short term solution would be a tool to separate set.mm
> <http://set.mm> by section, and daisy-chain the files together with
> include statements. If this can be done in a perfectly reversible
> manner, then we could pass between each form painlessly. That is, you
> could take:
>
> $( =-=- section 1 =-=- $)

Do you mean the existing section headers? If so, how would the user put
multiple tiny sections in one file or alternately break a huge section
into 2 files, or is that not possible/desirable with your scheme? What
is the algorithm for mapping the string "section 1" to "section1.mm"?
Or if you don't mean that, wouldn't it be clearer just to put the file
name instead of "section 1"?

> thm1 $p ...
> $( =-=- section 2 =-=- $)
> thm2 $p ...
> $( =-=- section 3 =-=- $)
> thm3 $p ...
>
> from set.mm <http://set.mm>, and form the files section1.mm
> <http://section1.mm>:
>
> $( =-=- section 1 =-=- $)
> thm1 $p ...
>
> section2.mm <http://section2.mm>:

(Aside: Is there any way to turn off Google's insertion of spurious
http links on file names?)

>
> $[ section1.mm <http://section1.mm> $]
>
> $( =-=- section 2 =-=- $)
> thm2 $p ...
>
> section3.mm <http://section3.mm>:
>
> $[ section2.mm <http://section2.mm> $]
>
> $( =-=- section 3 =-=- $)
> thm3 $p ...
>
>
> I believe Stefan has already used such a tool for the set.mm-history
> project.

Could Stefan report the format he uses, and any other opinions he might
have on what our "official" convention should be?


...

>
> As a general comment, I think it would be useful for Metamath 2.0 to
> be translatable to and from the current version. If 2.0 needs
> additional keywords for its features, maybe they could be hidden in
> special comments in the 1.0 version so the versions could be
> completely compatible. If 2.0 has say local scoping of things like
> labels, the translator could rename them for the 1.0 version with
> some standard suffix to make them unique. The tools we have for 1.0
> (metamath.exe, numerous verifiers) may or may not be practical to
> rewrite for 2.0, but with perfect translatability they could
> continue to be used.
>
>
> I totally agree. I think that full backwards compatibility is very
> important for Metamath in general, so that proofs are supported long
> into the future, and we retain the current simplicity that has
> encouraged the existing proliferation of verifiers.
>
> However, I am not certain that all features can be supported with the
> current mm format, in particular the separation of proofs and theorem
> statements, so I would relax this restriction to having a (fairly
> simple) translation from 1.0 to and from 2.0.

Can you clarify this? I'm not opposed to hiding new 2.0 features in
special 1.0 comments so that 2.0 could be recovered from it exactly. If
we have perfect (lossless) translations back and forth between 1.0 and
2.0 (which is what I was trying to suggest), what are you relaxing?

>
>
> Here's a simple proposal for separating the proofs: take the proofs out
> of the .mm file, so that a $p statement takes the simple form:
>
> label $p math $.
>
> For forward compatibility, the label $p math $= ... $. form is also
> valid, and in this case a separate proof is not needed. Then, have a
> parallel file, say set.mmc to go with set.mm <http://set.mm>, which
> shortest overall size. The largest number in set.mm <http://set.mm> is
> 905, and 10% of steps are marked, with a nice exponential decay for both
> groups.
>
> Surprisingly, the best choices have c being incredibly small; the best
> solution was {208, 46, 2} with 6084 KB to store all the compressed
> numbers in set.mm <http://set.mm>. (For comparison, the currrent
> compressed proof format uses roughly {20, 20, 5} which gives 9713 KB.)
> My instinct is to go with a more balanced choice, such as {192, 56, 8} =
> 6118 KB, but I guess the fact that there is a loose upper bound on proof
> lengths currently in force in set.mm <http://set.mm> means that numbers
> never get so large that more than 3 digits is needed.

Interesting. I'm curious - how well do these formats compress further
with say bzip2 or xz?

Norm

Mario Carneiro

Apr 17, 2016, 8:16:24 PM4/17/16
to metamath
On Sun, Apr 17, 2016 at 11:15 AM, 'fl' via Metamath <meta...@googlegroups.com> wrote:

Right, this is exactly what I have in mind. Doing this with a syntax aware tool allows you to perform these kinds of edits more reliably than a plain text editor. 

But a syntax aware tool exists: it is mmj2. Its parser is very efficient and well designed. Now it is simple: use the syntax tree delivered the by mmj2 to add
syntax-aware routines.

But it's not an *editor*. It was explicitly designed not to modify the .mm file, which was IMO a mistake. The task of simply reading an .mm file and converting it to an internal data structure is not that hard, and is performed by every verifier. It is true that mmj2 handles "proof mode" better than any other existing solutions, but it does not do "maintenance mode" at all, nor can it handle the interplay between the two (I have a script to restart mmj2 whenever I do a maintenance mode edit, which I do using a text editor open to the .mm file).

Currently the search/replace functions in mmj2 are poor in my opinion. They deserve to be completely redesigned.

True. Ideas are welcome.
 
And all those problems are not related to the way the database is stored on a disk anyway.

The disk storage question is an attempt to address the scalability of the full library going forward. Almost all activities you want to do with the library are "local", and as set.mm becomes more modular this will become even more so. You can't "seek" in a text file.
 
> In fact, I question whether there are *any* maintenance activities that are appropriate for a plain text editor, with find and replace, which don't sometimes cause 
> unforeseen mistakes.

And frankly readable, text-based databases are never superseded by binary databases. Binary data can be useful for speed or other efficiency issues 
but it is always at the cost of poorer interfacing. I think computer engineers must spend I would say  80% of their time with text-based
tools to manage data, systems and so on. Text-based databases have the peculiarity of interfacing well with important element of a computer: human beings.

You're kidding yourself if you think that compressed proofs are "readable". Currently an .mm file is this odd mix of human readable and unreadable data. The basic idea here is to take the unreadable part out of an .mm file so it becomes pure "source code", then make the human-unreadable part *really* unreadable, i.e. full binary, taking advantage of all the unutilized byte values to achieve a significant savings in proof size (~30% decrease), and making it possible to jump around the file more efficiently than just loading the whole thing into memory.

Also, a binary format like this one is naturally paired with the aforementioned "proof mode" software, which will unpack a proof into a human readable format like .mmp style. That way you get the best of both worlds.
 
The quantity of errors introduce by a non aware-based system are easily corrected I think especially with metamath where an error in the data is always
captured by the checker (at least for the proofs an formulas).

 Just because an error is recognized doesn't mean it's easy to fix. It is quite easy to cause thousands of errors with a simple change in one spot, and syntax aware editors can automatically resolve these downstream errors in many cases.



On Sun, Apr 17, 2016 at 7:51 PM, Norman Megill <n...@alum.mit.edu> wrote:
On 4/17/16 6:24 AM, Mario Carneiro wrote:
A simple short term solution would be a tool to separate set.mm
<http://set.mm> by section, and daisy-chain the files together with

include statements. If this can be done in a perfectly reversible
manner, then we could pass between each form painlessly. That is, you
could take:

$( =-=- section 1 =-=- $)
Do you mean the existing section headers?  If so, how would the user put multiple tiny sections in one file or alternately break a huge section into 2 files, or is that not possible/desirable with your scheme?  What is the algorithm for mapping the string "section 1" to "section1.mm"? Or if you don't mean that, wouldn't it be clearer just to put the file name instead of "section 1"?

Yes, those are the existing section headers. Splitting and joining files is done just like you would in a software project: make a new file, copy the theorems of interest over. I don't think it is necessary to specifically support this action just because it happens to be simple with the current approach. Also, you could just add a second section to one of the single-section files, then round trip it and it will get broken up.

You are right that there is no easy way to build section names appropriate as file names, since currently we don't have any short naming system for sections that would naturally have already developed if we used multiple files (notice that existing split attempts have such short names, like logic.mm, dedekind.mm instead of "Dedekind-cut construction of real and complex numbers"). We may want to consider a format for annotating short names in the section headers. Other options include using the section numbering (which is sensitive to insertion/deletion), or using the news website approach: take the English title, lowercase everything, replace all non-alphanumerics with hyphen, then collapse multiple consecutive hyphens and cut it off if it is longer than some upper bound, say 50 chars.
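For what it's worth, that last approach is only a couple of lines; a sketch (the cleanup rules and the 50-character cutoff here are just the suggestion above, not an agreed convention):

import re

def section_slug(title, max_len=50):
    """Lowercase, map non-alphanumeric runs to hyphens, trim, and truncate."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug[:max_len].rstrip("-")

# section_slug("Dedekind-cut construction of real and complex numbers")
#   -> "dedekind-cut-construction-of-real-and-complex-numb"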
 

thm1 $p ...
$( =-=- section 2 =-=- $)
thm2 $p ...
$( =-=- section 3 =-=- $)
thm3 $p ...

from set.mm <http://set.mm>, and form the files section1.mm
<http://section1.mm>:


$( =-=- section 1 =-=- $)
thm1 $p ...


(Aside:  Is there any way to turn off Google's insertion of spurious http links on file names?)

Haha, you have it bad because Google turns these names into links, which your plaintext email client turns into "a <b>" format, which Google turns into links... I bet if we keep passing this quote around it will blow up exponentially.

I totally agree. I think that full backwards compatibility is very
important for Metamath in general, so that proofs are supported long
into the future, and we retain the current simplicity that has
encouraged the existing proliferation of verifiers.

However, I am not certain that all features can be supported with the
current mm format, in particular the separation of proofs and theorem
statements, so I would relax this restriction to having a (fairly
simple) translation from 1.0 to and from 2.0.

Can you clarify this?  I'm not opposed to hiding new 2.0 features in special 1.0 comments so that 2.0 could be recovered from it exactly.  If we have perfect (lossless) translations back and forth between 1.0 and 2.0 (which is what I was trying to suggest), what are you relaxing?

Oh, I just meant it may not be backwards or forwards compatible, in the sense that one spec is a subset or superset of the other, but there is an embedding and maybe a bijection between formats.

I gathered statistics on how often each number, marked or not, is used
in the compressed proofs, and tested what choices {a,b,c} lead to the
shortest overall size. The largest number in set.mm <http://set.mm> is

905, and 10% of steps are marked, with a nice exponential decay for both
groups.

Surprisingly, the best choices have c being incredibly small; the best
solution was {208, 46, 2} with 6084 KB to store all the compressed
numbers in set.mm <http://set.mm>. (For comparison, the currrent

compressed proof format uses roughly {20, 20, 5} which gives 9713 KB.)
My instinct is to go with a more balanced choice, such as {192, 56, 8} =
6118 KB, but I guess the fact that there is a loose upper bound on proof
lengths currently in force in set.mm <http://set.mm> means that numbers

never get so large that more than 3 digits is needed.
Interesting.  I'm curious - how well do these formats compress further with say bzip2 or xz?

There is probably still some room for compression, since the label lists are still English-like. But a proper answer would require implementing the algorithm (those KB numbers are projections based on the statistics of set.mm; I never actually made a 6084 KB file). I'll let you know if/when 2.0 is actually built.

Mario

Stefan O'Rear

Apr 17, 2016, 8:53:13 PM4/17/16
to Metamath
On Sunday, April 17, 2016 at 3:24:41 AM UTC-7, Mario Carneiro wrote:
On Sat, Apr 16, 2016 at 10:16 PM, Norman Megill <n...@alum.mit.edu> wrote:
On 4/16/16 6:53 AM, Mario Carneiro wrote:
Actually, I've been brainstorming how to fix this, what the perfect
metamath editor should look like. It's quite clear to me, at least, that
we are not there, and having one 16MB file is a symptom of this. Most
file editors are just not designed for this, and it is not good to need
to read it all and store it in memory in order to get to a single
theorem.

I'm not sure why it's not good.  On today's computers, 28MB is nothing compared to the available 4 to 8GB of memory.  If an editor has trouble with a file of that size, maybe the editor design is not good.  While I haven't checked https://en.wikipedia.org/wiki/Comparison_of_text_editors#Basic_features carefully, the currently supported ones I recognize handle several GB or more.  Emacs supposedly handles 2EB (10^9 GB).

Being able to load a 28MB file is not the same thing as being able to handle a 28MB file _quickly_.  Most files encountered in programming are much smaller (look at the C files in the Metamath distribution), and most specialized programmer's editors don't place a high priority on being able to edit responsively with a large file loaded, even if that file can theoretically be loaded.  I believe I reported a long time ago that vim was having 1/4 - 1/2 seconds of lag to move around in the 17MB set.mm.  It's great that your editor of choice has made tradeoffs that allow it to efficiently handle large files, but other people have editors of choice that make different tradeoffs, and keeping set.mm in smaller pieces will help make them happy.

The version control systems I use also get noticeably sluggish with files that large.

Moreover, it's important to plan for the future as well as the present.  If we ever succeed in attracting a significant number of new contributors, that file will increase in size much faster than it has done in the past.  I consider the fact that set.mm is growing slower than Moore's law to be a bug, not a feature. 

The thing is that it's not just 28MB (wow, it's grown a lot since I last checked). An editor will likely need several times that in order to store not just the raw text, but also the internal processed form of the data. Remember that the "perfect Metamath editor" will not be a plain text editor, because that doesn't correctly reflect the structure of Metamath data.

Exactly, and this goes double for my zany idea of the browser-based proof editor, because Javascript implementations give you very little control over the heap layout (security reasons) and many data structures will require a lot more memory than they would in a C program.  (Unless we do the asm.js "everything in a big Int8Array" approach).
 
One interesting data structure I'd like to focus on is the information needed by proof mode. Roughly, this is the MM file with proofs omitted. I just did a test, stubbing out all the proofs, and the result is about 12MB which is alarmingly large. Removing comments brings it down to 5.8MB. I think there is a good argument for separating this to a different file from the proofs, so that an editor need not read the proofs if one is only interested in the "context" of one proof. This structure is also useful as the context needed for proof HTML generation.

My plan for SMM-PA was to keep the proofs in text format when they weren't being edited, because that's denser than almost any other possible data structure.  Still, maintaining the parse trees for all statements took ~ 120 MB last time I tried it.  Developing a denser format for the parse trees was on my TODO list.
 
Edits to an .mm file almost always are local, touching only a
tiny tiny part of that giant file, and splitting it up is one way to
focus development a bit more.

Could you repeat the reasons why you think having set.mm <http://set.mm>
in one file is a good thing?

Simplicity.  Speed.  Lowest common denominator.  Convenience.

I will certainly grant that it is easy to use a text editor for maintenance activities, when it can be done. The major drawback is that there are certain maintenance activities that can't be done with find/replace, and others that "almost" work but need some cleanup afterwards. A proper tool to do these refactorings in a correctness-preserving manner is warranted.

Simplicity is a big deal for me.  I made up a proposal for a complicated namespacing system when I first joined; I forget if I ever shared it but I no longer support it.

Straw proposal that I _could_ get behind today: replace the single .mm file with a directory of files, which are read in alphabetical order.  Tools which don't understand the directory design can be supported with a very simple concatenation and splitting program; the concatenation, in particular, is a 1-liner in UNIX shell.

Important: This proposal DOES NOT use $[ $].  The directory structure is king.  I regard $[ $] as a mistake because it gives the database author too much flexibility; the database should be a database, not a program.

This gives us a large number of small files (although Construction_of_a_vector_space_from_a_Hilbert_lattice could still stand to be split up :) ), making our tree similar to other trees and play nicer with the widest variety of existing text manipulation tools.  (A current Linux kernel checkout is 624 MB, or 12 KB per source file...)

Some editors do shared undo, and usually a theorem move only involves one or two files anyway. But I think the correct solution here is just to commit often, and use the commits as cross-file undo checkpoints.

Agreed on commits and the value of integrating version control into our workflow.  Again, git and other systems I've used will be happier with hundreds of small files than one largish one.
 
Here's a simple proposal for separating the proofs: take the proofs out of the .mm file, so that a $p statement takes the simple form:

Not a big fan of separating the proofs.  Seems to add a lot of complexity for a gain that will be erased by just a year or two of growth.  Splitting the database into _many_ files can scale.

If the file were split into sections, the mythical Metamath Two could automatically create interface cache files for each section.  Pre-tokenizing the _statements_ would be a substantial space win, and cache files that are automatically regenerated based on timestamps would add no complexity whatsoever to the maintenance and version control stories (after all, your compiler + make already does exactly that).

-sorear 

Stefan O'Rear

Apr 17, 2016, 9:02:19 PM4/17/16
to Metamath
On Sunday, April 17, 2016 at 4:51:21 PM UTC-7, Norman Megill wrote: 
> I believe Stefan has already used such a tool for the set.mm-history
> project.

Could Stefan report the format he uses, and any other opinions he might
have on what our "official" convention should be?


A newer version of essentially the same logic: https://github.com/sorear/smm/blob/master/lib/smm/outline.js

I have already stated my opposition to $[ $].
 
Interesting.  I'm curious - how well do these formats compress further
with say bzip2 or xz?

I'd really encourage us to _not_ spend much time on optimizing compression with databases at rest.  I think I could make something half the size of .mm.xz, given my experience with format-specific data compressors (the general-purpose compressors are a lot worse for specific formats than people realize), but at least for me 28MB on disk is very little.  I care much more about 120MB of parse trees in RAM, occupying space in a browser heap already split between N tabs...

This might be US city-dweller bias.  FL might have choice words on the importance of optimizing databases for transmission over very slow unreliable links.

-sorear

Stefan O'Rear

Apr 17, 2016, 9:05:20 PM4/17/16
to Metamath
On Sunday, April 17, 2016 at 5:16:24 PM UTC-7, Mario Carneiro wrote:
You are right that there is no easy way to build section names appropriate as file names, since currently we don't have any short naming system for sections that would naturally have already developed if we used multiple files (notice that existing split attempts have such short names, like logic.mm, dedekind.mm instead of "Dedekind-cut construction of real and complex numbers"). We may want to consider a format for annotating short names in the section headers. Other options include using the section numbering (which is sensitive to insertion/deletion), or using the news website approach: take the English title, lowercase everything, replace all non-alphanumerics with hyphen, then collapse multiple consecutive hyphens and cut it off if it is longer than some upper bound, say 50 chars.

I posted two implementations of the "news website approach" above.  It needs a little tweaking for Metamath but works well.
 
Haha, you have it bad because Google turns these names into links, which your plaintext email client turns into "a <b>" format, which Google turns into links... I bet if we keep passing this quote around it will blow up exponentially.

Two-letter file extensions are best avoided for new projects because of the ccTLD issue...
 
-sorear

David A. Wheeler

Apr 17, 2016, 10:18:31 PM4/17/16
to meta...@googlegroups.com, Mario Carneiro
I do like the readability of text files. If seekability is important, perhaps they could be stored as a bunch of small text files, with one proof per file.

--- David A.Wheeler

Norman Megill

Apr 17, 2016, 11:35:35 PM4/17/16
to meta...@googlegroups.com
On 4/17/16 8:16 PM, Mario Carneiro wrote:
>
>
> On Sun, Apr 17, 2016 at 11:15 AM, 'fl' via Metamath
> <meta...@googlegroups.com <mailto:meta...@googlegroups.com>> wrote:
...

> > In fact, I question whether there are *any* maintenance activities that are appropriate for a plain text editor, with find and replace, which don't sometimes cause
> > unforeseen mistakes.
>
> And frankly readable, text-based databases are never superseded by
> binary databases. Binary data can be useful for speed or other
> efficiency issues
> but it is always at the cost of poorer interfacing. I think computer
> engineers must spend I would say 80% of their time with text-based
> tools to manage data, systems and so on. Text-based databases have
> the peculiarity of interfacing well with important element of a
> computer: human beings.
>
>
> You're kidding yourself if you think that compressed proofs are
> "readable". Currently an .mm file is this odd mix of human readable and
> unreadable data. The basic idea here is to take the unreadable part out
> of an .mm file so it becomes pure "source code", then make the
> human-unreadable part *really* unreadable, i.e. full binary, taking
> advantage of all the unutilized byte values to achieve a significant
> savings in proof size (~30% decrease), and making it possible to jump
> around the file more efficiently than just loading the whole thing into
> memory.
>
> Also, a binary format like this one is naturally paired with the
> aforementioned "proof mode" software, which will unpack a proof into a
> human readable format like .mmp style. That way you get the best of both
> worlds.

I'll mention that I am somewhat uncomfortable with binary formats, but
as long as we can convert to ASCII 1.0 I guess it's tolerable. For some
things like label renames, a syntax-aware Metamath-specific tool is
certainly preferred. But no such tool can anticipate all the possible
manipulations we may want to do to a .mm file. Having the ability to
process it with text-manipulating tools from sed to powerful text editor
scripting has been extremely useful in the past, and that ability goes
away with binary formats. Unlike a computer program where editing can
introduce hidden bugs, the ability to verify proofs makes .mm files
relatively robust against editing mistakes.

A computer program source file 28MB in size is hard to even imagine, and
I wouldn't want to be responsible for its maintenance. The fact that
set.mm flourishes with no significant problems in spite of its size
suggests that it has a fundamentally different character from a computer
program. The same principles of structure and modularity, intended to
reduce bugs and increase maintainability, may or may not apply. set.mm
is also quite different from an ordinary SQL database, which could be
riddled with unverifiable data entry errors that can only partially be
checked for consistency (such as correct zip code). I don't know what
category of data object set.mm would be classified as, but it seems
different from the usual ones of our experience.

As for file size, of course smaller files are preferable as a general
principle, but I don't think it is worth giving up flexibility. I still
don't understand the concern with our present file sizes, where set.mm
takes less than 1% of available memory on most computers. (BTW my text
editor uses an additional 13MB to work with the 28MB set.mm, growing up
to a user-specified limit for the undo stack.) I know you are trying to
anticipate future scalability, and theoretical studies of different
proof compression methods are certainly interesting, but it will be quite
a while before saving 30% of the size will make a noticeable difference.

> The quantity of errors introduce by a non aware-based system are
> easily corrected I think especially with metamath where an error in
> the data is always
> captured by the checker (at least for the proofs an formulas).
>
>
> Just because an error is recognized doesn't mean it's easy to fix. It
> is quite easy to cause thousands of errors with a simple change in one
> spot, and syntax aware editors can automatically resolve these
> downstream errors in many cases.

What is an example of an error introduced by a careless edit that is
hard to fix? (Personally, during an editing session I usually save the
file frequently and run the fast 'verify proof */syntax_only'.)

Norm

Mario Carneiro

Apr 18, 2016, 1:31:23 AM4/18/16
to metamath
On Sun, Apr 17, 2016 at 8:53 PM, Stefan O'Rear <stef...@cox.net> wrote:
Simplicity is a big deal for me.  I made up a proposal for a complicated namespacing system when I first joined; I forget if I ever shared it but I no longer support it.

Straw proposal that I _could_ get behind today: replace the single .mm file with a directory of files, which are read in alphabetical order.  Tools which don't understand the directory design can be supported with a very simple concatenation and splitting program; the concatenation, in particular, is a 1-liner in UNIX shell.

I like this general approach, although reading in alphabetical order will require index numbers in the section file names, which means a lot of shuffling around for insertions.

A variant of this that is forward compatible is to have a folder full of .mm file fragments, not expected to compile on their own, and one big index file, like so:

index.mm:
$[ fol\prop\axioms.mm $]  <-- thru sec 1.2.2
$[ fol\prop\imp.mm $]  <-- sec 1.2.3
$[ fol\prop\neg.mm $]  <-- sec 1.2.4
...
$[ zf\ax-ext.mm $]   <-- sec 2.1.1
$[ zf\cab.mm $]   <-- sec 2.1.2
...

This way, we have a single place to go for section reordering, and there is a lot of flexibility with file naming and folder structure. In fact, it might also be useful to put the section comments themselves in this index file, so that section organization (the stuff that Gerard Lang likes to work with) can be separated from the theorems.

the database should be a database, not a program.

I think this bears repeating. A good portion of the complexity of manipulating an .mm file is caused by all the possible changes to the parser state that can occur while reading the file. That is, it is difficult to check a proof in a second pass because you have to remember all the changes to global state that have "since" happened from directives later in the file, and undo them using sequence number filters and so on. This design is easy for dynamic work, i.e. verifiers, but makes static analysis (syntax aware maintenance mode) more complex.

A simple "optimization" that simplifies static analysis:

* All $c, $v, $f statements, and syntax axioms come at the beginning of the file. In other words, the entire database works under a single grammar. (It's not necessary that this actually be true, but rather that transplanting all the statements to the start doesn't break anything.)

Luckily we don't have to worry about theorems being hidden to the point of inaccessibility, followed by name reuse, or else that would also be on this list.
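
As a rough test of the single-grammar property above on an existing database, something like the following Python sketch could report $c/$v/$f statements that occur after the first $p. It strips comments but ignores ${ $} scoping, so it is only an approximate check, and the helper name is made up.

grammar_order_check.py:
import re
import sys

def check(path):
    text = open(path).read()
    # Strip $( ... $) comments so tokens inside them are not counted.
    text = re.sub(r'\$\(.*?\$\)', ' ', text, flags=re.S)
    tokens = text.split()
    seen_proof = False
    for i, tok in enumerate(tokens):
        if tok == '$p':
            seen_proof = True
        elif tok in ('$c', '$v', '$f') and seen_proof:
            # This declaration would have to be hoisted (and possibly merged)
            # for the single-grammar property to hold.
            print('%s at token %d appears after the first $p' % (tok, i))

if __name__ == '__main__':
    check(sys.argv[1])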

Here's a simple proposal for separating the proofs: take the proofs out of the .mm file, so that a $p statement takes the simple form:

Not a big fan of separating the proofs.  Seems to add a lot of complexity for a gain that will be erased by just a year or two of growth.  Splitting the database into _many_ files can scale.

It may not have been obvious, but the proposal was actually suggesting one .mmc proof index file for *each* .mm file in the project. So if the .mm file is broken up, so too will the .mmc files.

The gain here is not just the ~30% file size savings, although that is nice. The main point is that this file need not be stored in memory _at all_, regardless of how the proof data is compressed. Whenever a proof record is needed, you open the file, seek to the location of the data, then read the record and close the file. So that's a 100% savings on in-memory proof data storage (although if your memory statistics are correct, this may still not be a very big savings, since it's only about 70% of set.mm, compared to your 1000% data structure).
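
A minimal sketch of that access pattern, assuming a hypothetical .mmc layout in which a small in-memory index maps each $p label to an (offset, length) pair while the proof bytes themselves stay on disk:

mmc_reader.py:
class ProofStore:
    def __init__(self, path, index):
        self.path = path      # e.g. 'fol/prop/imp.mmc' (hypothetical)
        self.index = index    # {'syl': (0, 123), 'a1i': (123, 87), ...}

    def proof(self, label):
        # Open, seek, read one record, close; nothing else is held in memory.
        offset, length = self.index[label]
        with open(self.path, 'rb') as f:
            f.seek(offset)
            return f.read(length)   # the raw (compressed) proof record

# usage with made-up data:
#   store = ProofStore('fol/prop/imp.mmc', {'syl': (0, 123)})
#   blob = store.proof('syl')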
 
If the file were split into sections, the mythical Metamath Two could automatically create interface cache files for each section.  Pre-tokenizing the _statements_ would be a substantial space win, and cache files that are automatically regenerated based on timestamps would add no complexity whatsoever to the maintenance and version control stories (after all, your compiler + make already does exactly that).

I think caching is a good idea, particularly if it can somehow be segmented so that the entire cache is not invalidated when any part changes. (This is yet another argument for splitting the file, so that you have multiple timestamps.) If you could get something similar to the in-memory conception of the database onto the disk as a cache, then you could speed up loading the database by an order of magnitude.
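
In the same make-like spirit, a small Python sketch of per-fragment cache regeneration by timestamp; the cache naming convention and the build_index helper are placeholders:

cache_refresh.py:
import os
import pickle

def load_with_cache(fragment, build_index):
    cache = fragment + '.cache'
    if os.path.exists(cache) and \
            os.path.getmtime(cache) >= os.path.getmtime(fragment):
        with open(cache, 'rb') as f:
            return pickle.load(f)       # the fragment has not changed
    index = build_index(fragment)       # expensive: parse and index the fragment
    with open(cache, 'wb') as f:
        pickle.dump(index, f)           # regenerate, like make does for .o files
    return index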


On second thought, I think the .mmc format should not contain the header, which is too brittle for version control. Instead, it would only have what I labeled "part 2", and the program would construct the binary tree or something like it in memory, maybe also outputting it as part of the cache, but not checked in to version control.

On Sun, Apr 17, 2016 at 11:35 PM, Norman Megill <n...@alum.mit.edu> wrote:
I'll mention that I am somewhat uncomfortable with binary formats, but as long as we can convert to ASCII 1.0 I guess it's tolerable.  For some things like label renames, a syntax-aware Metamath-specific tool is certainly preferred.  But no such tool can anticipate all the possible manipulations we may want to do to a .mm file.  Having the ability to process it with text-manipulating tools from sed to powerful text editor scripting has been extremely useful in the past, and that ability goes away with binary formats.  Unlike a computer program where editing can introduce hidden bugs, the ability to verify proofs makes .mm files relatively robust against editing mistakes.

I will emphasize that I am only talking about the stuff that is meaningless, sed-unprocessable "ASFAKLHRAWHLA" garbage already. I don't have any proposal for the .mm format any easier to use than the current one (besides taking the proof blocks out). Compressed proofs *are* binary data, even if they are currently being stored in a text file.
 
A computer program source file 28MB in size is hard to even imagine, and I wouldn't want to be responsible for its maintenance.  The fact that set.mm flourishes with no significant problems in spite of its size suggests that it has a fundamentally different character from a computer program.  The same principles of structure and modularity, intended to reduce bugs and increase maintainability, may or may not apply.

I don't buy this argument. It is quite possible to use a 28MB C file, it's not all that different from 28MB worth of separate C files as far as the compiler is concerned. And if you had to, and the tools did not have an issue with large files, it wouldn't be that bad for day-to-day work. But it's terrible code style. Not to put too fine a point on it, but I'm pretty sure that the main reason set.mm is one file today is because you think it's a good idea, which isn't actually an argument. Of course I presume that you have reasons for this decision, which are perfectly good arguments, but the mere fact that this is how it has been is not proof that it is the best way.
 
set.mm is also quite different from an ordinary SQL database, which could be riddled with unverifiable data entry errors that can only partially be checked for consistency (such as correct zip code).  I don't know what category of data object set.mm would be classified as, but it seems different from the usual ones of our experience.

As we've seen, not all data errors in set.mm can be checked. The definition checker, grammar parsing, and theorem scoping are all examples of data integrity checks that the average verifier does not perform. Furthermore, testing that theorems and definitions mean what we want will never be automatically checkable. This is exactly the same sort of thing as an SQL database with unverifiable data entry errors, or a C program which might compile but still have logical bugs.

The biggest difference between set.mm and an SQL database is instead the online nature of the data (online as in https://en.wikipedia.org/wiki/Online_algorithm). If all the grammar was defined up front, it actually wouldn't be all that bad to store a metamath file in a DBMS.
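
For what it's worth, a toy sqlite sketch of that idea, with statements as rows ordered by a sequence number so that the declaration-before-use structure survives; the schema is purely illustrative:

mm_db.py:
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE statement (
        seq    INTEGER PRIMARY KEY,   -- global order in the database
        label  TEXT,                  -- NULL for $c / $v declarations
        kind   TEXT NOT NULL,         -- '$c', '$v', '$f', '$e', '$a', '$p'
        math   TEXT NOT NULL,         -- the math symbol string
        proof  BLOB                   -- compressed proof, only for $p rows
    );
''')
conn.execute("INSERT INTO statement VALUES "
             "(1, 'wi', '$a', 'wff ( ph -> ps )', NULL)")
for row in conn.execute("SELECT label, kind, math FROM statement ORDER BY seq"):
    print(row)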

As for file size, of course smaller files are preferable as a general principle, but I don't think it is worth giving up flexibility.  I still don't understand the concern with our present file sizes, where set.mm takes less than 1% of available memory on most computers.  (BTW my text editor uses an additional 13MB to work with the 28MB set.mm, growing up to a user-specified limit for the undo stack.)  I know you are trying to anticipate future scalability, and theoretical studies of different proof compression methods is certainly interesting, but it will be quite a while before saving 30% of the size will make a noticeable difference.

I think Stefan had the right idea here. This view forgets two important facts: (1) The size on disk is often quite different from the size in memory when read by a *syntax aware* tool. A file editor will probably not need much more space than the file itself because it makes no attempt to understand the data it is manipulating (which leads to subtle or not-so-subtle errors in maintenance mode). (2) Just because a file "fits" in memory (itself a fuzzy concept when you consider paging) does not mean that it can be *quickly* manipulated. This is essential for proof assistants like mmj2; I am starting to feel the pressure of scanning the database on each unify (which I do once every 3 seconds or so), where a 0.5 - 1 second delay is a pain point.
 
Just because an error is recognized doesn't mean it's easy to fix. It
is quite easy to cause thousands of errors with a simple change in one
spot, and syntax aware editors can automatically resolve these
downstream errors in many cases.

What is an example of an error introduced by a careless edit that is hard to fix?  (Personally, during an editing session I usually save the file frequently and run the fast 'verify proof */syntax_only'.)

An almost trivial example: Change syl to $e |- ( ps -> ch ) $e |- ( ch -> ph ) $p |- ( ps -> ph ) $. By "fix" here I don't mean "undo the change", but "fix the file so that it's not broken, but keep the change".

Mario

fl

unread,
Apr 18, 2016, 6:08:17 AM4/18/16
to Metamath


> But a syntax aware tool exists: it is mmj2. Its parser is very efficient and well designed. Now it is simple: use the syntax tree delivered by mmj2 to add

Frankly, I doubt it would be that difficult to transform mmj2 into an editor. Add another buffer
where the whole database is displayed, and allow saving the data to a file. And there you are.

You can also externalize the parser and make an external filter in the "find" spirit. I think it would
allow to do think.

By the way, it could add automatic periodic backups.

--
FL

fl

unread,
Apr 18, 2016, 6:13:11 AM4/18/16
to Metamath

>  I think it would allow to do think.

To do things.

--
FL

Stefan O'Rear

unread,
Apr 19, 2016, 12:51:00 AM4/19/16
to Metamath
On Sunday, April 17, 2016 at 10:31:23 PM UTC-7, Mario Carneiro wrote:
On Sun, Apr 17, 2016 at 8:53 PM, Stefan O'Rear <stef...@cox.net> wrote:
Simplicity is a big deal for me.  I made up a proposal for a complicated namespacing system when I first joined; I forget if I ever shared it but I no longer support it.

Straw proposal that I _could_ get behind today: replace the single .mm file with a directory of files, which are read in alphabetical order.  Tools which don't understand the directory design can be supported with a very simple concatenation and splitting program; the concatenation, in particular, is a 1-liner in UNIX shell.

I like this general approach, although the alphabetical ordering will require numeric indexes in the section file names, which means a lot of shuffling around for insertions.

That was a very quick idea, and sure to have a number of problems… this is a fairly hard problem to solve, and I still don't have all the answers yet.  Definitely interested in future exploration and discussion of the "how do we make a _good_ dual-mode editor" space.

What do we want to do in terms of platform issues?  I still feel like the JVM is a major liability, but I see a shortage of time continuing for the next few months so I'm not very excited to maintain my own system.
 
A variant of this that is forward compatible is to have a folder full of .mm file fragments, not expected to compile on their own, and one big index file, like so:

index.mm:
$[ fol\prop\axioms.mm $]  <-- thru sec 1.2.2
$[ fol\prop\imp.mm $]  <-- sec 1.2.3
$[ fol\prop\neg.mm $]  <-- sec 1.2.4
...
$[ zf\ax-ext.mm $]   <-- sec 2.1.1
$[ zf\cab.mm $]   <-- sec 2.1.2
...

This way, we have a single place to go for section reordering, and there is a lot of flexibility with file naming and folder structure. In fact, it might also be useful to put the section comments themselves in this index file, so that section organization (the stuff that Gerard Lang likes to work with) can be separated from the theorems.

That could work well.  At that point it would essentially be a project file; I feel like there's something _almost_ here that would be a big help to local work / handling subsets of the database, but I'm not quite sure what it is.

Vaguely I feel like the Database should always be, abstractly, a singular entity, but it need not be worked with _as_ a singular entity, if that makes any sense at all.
 
the database should be a database, not a program.

I think this bears repeating. A good portion of the complexity of manipulating an .mm file is caused by all the possible changes to the parser state that can occur while reading the file. That is, it is difficult to check a proof in a second pass because you have to remember all the changes to global state that have "since" happened from directives later in the file, and undo them using sequence number filters and so on. This design is easy for dynamic work, i.e. verifiers, but makes static analysis (syntax aware maintenance mode) more complex.

A simple "optimization" that simplifies static analysis:

* All $c, $v, $f statements, and syntax axioms come at the beginning of the file. In other words, the entire database works under a single grammar. (It's not necessary that this actually be true, but rather that transplanting all the statements to the start doesn't break anything.)

Isn't this already true?
 
Luckily we don't have to worry about theorems being hidden to the point of inaccessibility, followed by name reuse, or else that would also be on this list.

Here's a simple proposal for separating the proofs: take the proofs out of the .mm file, so that a $p statement takes the simple form:

Not a big fan of separating the proofs.  Seems to add a lot of complexity for a gain that will be erased by just a year or two of growth.  Splitting the database into _many_ files can scale.

It may not have been obvious, but the proposal was actually suggesting one .mmc proof index file for *each* .mm file in the project. So if the .mm file is broken up, so too will the .mmc files.

Sounds good, assuming we have a splitting mechanism that is rationalized enough that we can actually handle files in isolation.  (Even C compiler authors are turning against textual inclusion these days.  http://clang.llvm.org/docs/Modules.html — I have my hopes.)
 
The gain here is not just the ~30% file size savings, although that is nice. The main point is that this file need not be stored in memory _at all_, regardless of how the proof data is compressed. Whenever a proof record is needed, you open the file, seek to the location of the data, then read the record and close the file. So that's a 100% savings on in-memory proof data storage (although if your memory statistics are correct, this may still not be a very big savings, since it's only about 70% of set.mm, compared to your 1000% data structure).

When I did the memory profiling on smm that came up with the 120MB number, the worst offender was large numbers of short string objects, specifically the "(" "A" "+" "B" ")" strings that came from first-stage tokenizing the math strings.  I'm pretty sure the memory usage could be cut substantially, first by using an intern table for tokens, second by more restricted caching of things like parse trees that can be locally constructed.  There are tradeoffs here; I was trying to cut every last centisecond from the verification time, and I reasoned that would add quite a few (although to be honest I never measured it, I could do that).
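
A tiny sketch of the intern-table idea: store each distinct math token once and let every statement hold references to the shared copy. (CPython already interns many short strings on its own, so the point here is the pattern, which matters more in a lower-level implementation.)

intern.py:
class Interner:
    def __init__(self):
        self.table = {}

    def intern(self, token):
        # Return the canonical copy of this token, creating it if needed.
        return self.table.setdefault(token, token)

interner = Interner()
stmt1 = [interner.intern(t) for t in '( A + B )'.split()]
stmt2 = [interner.intern(t) for t in '( B + A )'.split()]
# Both statements now share one object per distinct symbol.
assert stmt1[0] is stmt2[0]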

"Memory" versus "disk" is not necessarily the best way to conceptualize this, since we're really dealing with a hierarchy of caches.  We want to minimize the size of the auxiliary data structures that are constructed (to an extent), and the amount of ongoing work to maintain them.  We can't go around rebuilding all of the indexes on load and on every edit.  A very important part of the smm maintenance mode work will be developing a vocabulary of editing commands that can do useful changes without requiring all indexes to be rebuilt; I've got a few half-baked ideas here.  They'll have to respect tree structure and scope structure in a few ways.
 
If the file were split into sections, the mythical Metamath Two could automatically create interface cache files for each section.  Pre-tokenizing the _statements_ would be a substantial space win, and cache files that are automatically regenerated based on timestamps would add no complexity whatsoever to the maintenance and version control stories (after all, your compiler + make already does exactly that).

I think caching is a good idea, particularly if it can somehow be segmented so that the entire cache is not invalidated when any part changes. (This is yet another argument for splitting the file, so that you have multiple timestamps.) If you could get something similar to the in-memory conception of the database onto the disk as a cache, then you could speed up loading the database by an order of magnitude.

That's what I had in mind - splitting the database into 100 or so files with separate timestamps.  Don't go crazy though as nobody ever added transitive directory timestamps to UNIX and stat overhead can be a killer when you have 50,000+ files (see also: doing almost anything with said Linux checkout).

On second thought, I think the .mmc format should not contain the header, which is too brittle for version control. Instead, it would only have what I labeled "part 2", and the program would construct the binary tree or something like it in memory, maybe also outputting it as part of the cache, but not checked in to version control.

Absolutely version control should have a single point of truth.  If we decide to store MMC files on disk (absolute loading speed may not be important once we have a mature maintenance mode, so I could go either way on this), they should be .gitignored.
 
On Sun, Apr 17, 2016 at 11:35 PM, Norman Megill <n...@alum.mit.edu> wrote:
I'll mention that I am somewhat uncomfortable with binary formats, but as long as we can convert to ASCII 1.0 I guess it's tolerable.  For some things like label renames, a syntax-aware Metamath-specific tool is certainly preferred.  But no such tool can anticipate all the possible manipulations we may want to do to a .mm file.  Having the ability to process it with text-manipulating tools from sed to powerful text editor scripting has been extremely useful in the past, and that ability goes away with binary formats.  Unlike a computer program where editing can introduce hidden bugs, the ability to verify proofs makes .mm files relatively robust against editing mistakes.

I will emphasize that I am only talking about the stuff that is meaningless, sed-unprocessable "ASFAKLHRAWHLA" garbage already. I don't have any proposal for the .mm format any easier to use than the current one (besides taking the proof blocks out). Compressed proofs *are* binary data, even if they are currently being stored in a text file.

Once we have the chunking and indexing procedure resolved, I don't see major advantages to formats other than the current one.  Certainly there are better entropy coders that could be used for the compressed proof data, and we can get rid of most of the syntax steps while we're at it (I need to write and benchmark a unifying proof verifier.  I suspect that with a grammatical .mm file, it may be "fast enough", as the trees give unification a lot more to bite into.); but that doesn't save us much more than size at rest, which isn't that important to begin with.  A metamath editor won't need to maintain data structures for the contents of most proofs most of the time.
 
A computer program source file 28MB in size is hard to even imagine, and I wouldn't want to be responsible for its maintenance.  The fact that set.mm flourishes with no significant problems in spite of its size suggests that it has a fundamentally different character from a computer program.  The same principles of structure and modularity, intended to reduce bugs and increase maintainability, may or may not apply.

I don't buy this argument. It is quite possible to use a 28MB C file, it's not all that different from 28MB worth of separate C files as far as the compiler is concerned. And if you had to, and the tools did not have an issue with large files, it wouldn't be that bad for day-to-day work. But it's terrible code style. Not to put too fine a point on it, but I'm pretty sure that the main reason set.mm is one file today is because you think it's a good idea, which isn't actually an argument. Of course I presume that you have reasons for this decision, which are perfectly good arguments, but the mere fact that this is how it has been is not proof that it is the best way.

The largest legitimate (not a firmware dump, etc) single source file in my Linux checkout is 534 KB.  It's a matter of degree, not kind.
 
I think Stefan had the right idea here. This view forgets two important facts: (1) The size on disk is often quite different from the size in memory when read by a *syntax aware* tool. A file editor will probably not need much more space than the file itself because it makes no attempt to understand the data it is manipulating (which leads to subtle or not-so-subtle errors in maintenance mode). (2) Just because a file "fits" in memory (itself a fuzzy concept when you consider paging) does not mean that it can be *quickly* manipulated. This is essential for proof assistants like mmj2; I am starting to feel the pressure of scanning the database on each unify (which I do once every 3 seconds or so), where a 0.5 - 1 second delay is a pain point.

Worth noting that one of my big research plans for a proof tool is a fast highly-indexed search function.  I haven't worked out very many of the details yet but I've spent a fair amount of time looking at the algorithms used by Lucene et al, and I do think something can be usefully done, *but* managing memory use is going to be one of the hardest parts.
 
-sorear

Stefan O'Rear

unread,
Apr 19, 2016, 12:55:52 AM4/19/16
to Metamath
On Monday, April 18, 2016 at 3:08:17 AM UTC-7, fl wrote:
Frankly, I doubt it would be that difficult to transform mmj2 into an editor. Add another buffer
where the whole database is displayed, and allow saving the data to a file. And there you are.

I suspect if you did this you would discover that, as bad as commodity programmer's editors are at handling 28MB files, JTextArea is worse.

And don't even think about enabling the syntax highlighter. 

Mario Carneiro

unread,
Apr 19, 2016, 4:47:50 AM4/19/16
to metamath
On Tue, Apr 19, 2016 at 12:51 AM, Stefan O'Rear <stef...@cox.net> wrote:
What do we want to do in terms of platform issues?  I still feel like the JVM is a major liability, but I see a shortage of time continuing for the next few months so I'm not very excited to maintain my own system.

I've been investigating a possible replacement for metamath.exe written in Rust (which targets the same platforms as C). As long as you are running locally, it seems useful to just run as a command-line tool, like a compiler, which is invoked by actual text editors to do the heavy lifting. The biggest problem with this model is the large startup cost, which suggests a more server-oriented approach, in which the compiler runs as a service and updates its memory based on commands from the editor.
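
A bare-bones sketch of that service model, assuming a hypothetical line-oriented command protocol on stdin/stdout; load_database and verify are placeholders for the real work, and a real tool would want something richer (JSON-RPC, say):

mm_server.py:
import sys

def load_database(path):
    # Placeholder: parse and index the real .mm database here.
    return {'path': path}

def verify(db, label):
    # Placeholder: check the named proof against the in-memory database.
    return 'ok'

def main():
    db = None
    for line in sys.stdin:
        cmd, _, arg = line.strip().partition(' ')
        if cmd == 'load':
            db = load_database(arg)      # pay the startup cost once
            print('loaded', arg)
        elif cmd == 'verify':
            print(arg, verify(db, arg))  # later requests reuse the loaded db
        elif cmd == 'quit':
            break
        sys.stdout.flush()

if __name__ == '__main__':
    main()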
 
A variant of this that is forward compatible is to have a folder full of .mm file fragments, not expected to compile on their own, and one big index file, like so:

index.mm:
$[ fol\prop\axioms.mm $]  <-- thru sec 1.2.2
$[ fol\prop\imp.mm $]  <-- sec 1.2.3
$[ fol\prop\neg.mm $]  <-- sec 1.2.4
...
$[ zf\ax-ext.mm $]   <-- sec 2.1.1
$[ zf\cab.mm $]   <-- sec 2.1.2
...

This way, we have a single place to go for section reordering, and there is a lot of flexibility with file naming and folder structure. In fact, it might also be useful to put the section comments themselves in this index file, so that section organization (the stuff that Gerard Lang likes to work with) can be separated from the theorems.

That could work well.  At that point it would essentially be a project file; I feel like there's something _almost_ here that would be a big help to local work / handling subsets of the database, but I'm not quite sure what it is.

One idea I've been toying with is snapshotting the database after each file, so that you have a restart point when some file in the middle changes. Not sure if it's not simpler to just filter the full database and throw out everything after a given sequence number.
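
A sketch of the snapshot idea, assuming the in-memory database can be copied cheaply enough (which is of course the real question); read_file stands in for whatever actually parses one fragment:

snapshots.py:
import copy

def load_all(files, read_file):
    db, snapshots = {}, []
    for f in files:
        read_file(db, f)                     # mutate db with this file's statements
        snapshots.append(copy.deepcopy(db))  # restart point after each file
    return db, snapshots

def reload_from(files, read_file, snapshots, k):
    # File index k changed: restart from the snapshot taken after file k-1
    # and re-read only files k..n instead of the whole database.
    db = copy.deepcopy(snapshots[k - 1]) if k > 0 else {}
    for f in files[k:]:
        read_file(db, f)
    return db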

* All $c, $v, $f statements, and syntax axioms come at the beginning of the file. In other words, the entire database works under a single grammar. (It's not necessary that this actually be true, but rather that transplanting all the statements to the start doesn't break anything.)

Isn't this already true?

It's almost true; moving local $v's to the start causes duplicate $v errors, which you can fix by merging the duplicates, but local $f's might cause an irreparable conflict if the type of the variable is different in the two $f statements. A recent suggestion of mine for ensuring that local $f assignments always use the same type will resolve this.

But the other half of the story is that moving the grammar up means that more statements will be accepted, for example a propositional calc statement might make reference to a constant declared later (like CHOICE), which would parse "correctly" even though it is not valid. I'm not sure how to resolve this; probably it is best to just load the full grammar and then postprocess the parse to make sure it does not use any forward references. (The same problem applies to user scripts, which may have hardcoded references to theorems from later on.)
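
The postprocessing step could look something like this sketch: walk the parse tree and reject a statement if any syntax axiom it uses has a higher sequence number than the statement itself. The tuple-based tree shape and the example numbers are just assumptions.

forward_ref_check.py:
def uses_only_prior(node, seq_of_axiom, stmt_seq):
    # node is ('syntax_axiom_label', [children]) for nonterminals,
    # or a plain token string for leaves.
    if isinstance(node, str):
        return True
    label, children = node
    if seq_of_axiom[label] >= stmt_seq:
        return False    # forward reference, e.g. to a constant like CHOICE
    return all(uses_only_prior(c, seq_of_axiom, stmt_seq) for c in children)

# usage with made-up sequence numbers:
#   seq = {'wi': 10, 'wceq': 500}
#   tree = ('wi', ['ph', ('wceq', ['x', 'y'])])
#   uses_only_prior(tree, seq, stmt_seq=300)   # False: wceq is declared later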
 
"Memory" versus "disk" is not necessarily the best way to conceptualize this, since we're really dealing with a hierarchy of caches.  We want to minimize the size of the auxiliary data structures that are constructed (to an extent), and the amount of ongoing work to maintain them.  We can't go around rebuilding all of the indexes on load and on every edit.  A very important part of the smm maintenance mode work will be developing a vocabulary of editing commands that can do useful changes without requiring all indexes to be rebuilt; I've got a few half-baked ideas here.  They'll have to respect tree structure and scope structure in a few ways.

Alternatively, you can present the user with the raw .mm file, and diff any changes against the indexed version, then attempt to reverse engineer which statements are untouched and only rebuild the index for changed statements. Doing this quickly is tough, though, even assuming you are working with a reasonably sized .mm file fragment.
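
A rough sketch of that diff-and-reindex idea using difflib at statement granularity; splitting on '$.' ignores comments and ${ $} blocks, so a real implementation would need a proper statement splitter:

reindex_diff.py:
import difflib

def statements(text):
    # Crude statement splitter: good enough to illustrate the diff step.
    return [s.strip() + ' $.' for s in text.split('$.') if s.strip()]

def changed_statements(old_text, new_text):
    old, new = statements(old_text), statements(new_text)
    sm = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changed = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != 'equal':
            changed.extend(new[j1:j2])   # only these need fresh index entries
    return changed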