Modularity for set.mm; interface design for the "perfect editor"

Mario Carneiro

unread,
Apr 16, 2016, 6:53:12 AM4/16/16
to metamath


On Sat, Apr 16, 2016 at 1:09 AM, Norman Megill <n...@alum.mit.edu> wrote:
On 4/15/16 12:25 AM, Marnix Klooster wrote:
Hi all,

I'm not invested in any design direction here, but here is just one
remark from my side.

It seems to me that the goal here is to have some modularity and
information hiding, and for that one normally uses mechanisms like
scoping and modules, instead of human readable comments (in English even).

Therefore IMHO the current design should either be really lightweight,
to be replaced by a real 'module system' later; or the current design
should already be a step in the direction of said module system.

(Goals of such a module system would be to break up set.mm
<http://set.mm> in more decently sized pieces; and to allow people to
use, e.g., the logic part of set.mm <http://set.mm> without any set
theory or anything else.)

[...]

Getting back to Marnix's comment, our general philosophy (outside of the special cases above) has been to make available everything that came previously in order to achieve efficient proofs and provide the user with the most flexibility.  If someone wishes to add to the logic theorems in set.mm, there is no reason not to make it part of set.mm itself, say in a mathbox, so that everyone will benefit.

[...]

As for modularity, in set.mm you will find commented out file includes for things like logic.mm.  At one point I had several pieces, but they were a pain to keep in sync (without special software that didn't exist and still doesn't).  I think anyone who has done major work on set.mm will agree that it is very convenient to have everything in one file for global label/symbol changes, moving sections around, etc. with a text editor.  Eventually set.mm will have to be broken up, but we aren't at that point yet.

Actually, I've been brainstorming how to fix this, what the perfect metamath editor should look like. It's quite clear to me, at least, that we are not there, and having one 16MB file is a symptom of this. Most file editors are just not designed for this, and it is not good to need to read it all and store it in memory in order to get to a single theorem. Edits to an .mm file almost always are local, touching only a tiny tiny part of that giant file, and splitting it up is one way to focus development a bit more.

Could you repeat the reasons why you think having set.mm in one file is a good thing?

1. Moving sections around is trivial if sections are in separate files: it just means moving a file and possibly changing the dependencies. Moving a theorem or range of theorems works just the same with multiple files: cut and paste.

2. Global renames are completely identical with multiple files; almost all file editors have a "find/replace in files" feature, which would do the trick. Ideally this should be done by the aforementioned "perfect metamath editor", though, so it can be more syntax-aware and only do the replacement where it is syntactically valid.


If you could start over from scratch, that is, some kind of "Metamath 2.0", how would you design the interface? What is the core structure of MM file data? Some thoughts:

* The basic design is a sequence of theorems with possibly shared environments. Within each theorem, though, the data structure is something complex, which we have decided not to store in plain text in the present version, necessitating interaction with a proof editor in any case. (That is, a user can not practically write a complete mm file by hand, and must resort to a program to write the proof parts.)

* There is a logical two-level interaction pattern with data in the file:
1. Maintenance mode: Adding and removing theorems, changing theorem statements, changing hypotheses and disjoint variable conditions.
2. Proof mode: Writing proof bodies.

* The raw MM file presentation is suboptimal in that on the one hand users must interact with it directly for maintenance mode, yet are presented with human-unreadable data in the proofs, while on the other hand existing tools allow proof mode work but do not handle any maintenance mode activities.

* An easy presentation format for maintenance mode is the raw MM file, but with proofs collapsed or omitted. The operations on such a representation may also be limited beyond arbitrary text edits, to a more well-formed approach such as adding and removing groups, and moving statements around as atomic elements. A graphical interface for this might work, but would need a lot of usability testing to get right.

* For proof mode, the best current contender is the MMP file format, but mmj2 is the only program that can read it, and there is no connection between this and maintenance mode in mmj2 (in fact mmj2 does not edit the MM file at all). I don't think this is the best direction in order to support proof scripts in the future, and I would support something more like a sequence of commands. (Tactic modes exist in Lean, Coq, HOL Light, Isabelle and I can explain them in more detail if anyone doesn't know how they work.) An MMP style output could be written as a sequence of ` apply xxx "|- math" ` commands like those used in MM-PA (the math strings are optional for user input).

* Linking in to an existing IDE like Eclipse is an easy way to get a lot of nice features like syntax highlighting out of the box. But any attempt to do this in maintenance mode is severely restricted by the sheer size of the file, even if proofs are omitted. Eclipse assumes that a software project has a modular structure, and I daresay that most other IDEs do as well. Even simple section-by-section splits would mostly resolve this issue.

* A primary problem that is not currently addressed by mmj2 is the ability to change the state of the database while the program is running (that is, maintenance mode changes affecting proof mode). MM-PA is able to handle this well, although it does not allow many maintenance mode changes in the first place. A good editor would have both modes and could use each to affect the other.


I'd like to collect opinions on what a good editor should look like, and establish known shortcomings of existing methods, to get a roadmap for future editors. What does your "perfect Metamath editor" look like?

Mario

fl

unread,
Apr 16, 2016, 7:21:30 AM4/16/16
to Metamath

Actually, I've been brainstorming how to fix this, what the perfect metamath editor should look like. It's quite clear to me, at least, that we are not there, and having one 16MB file is a symptom of this.

A normal database is big. Having only one file has its benefits. You can download it easily. You can
use an editor to look for the proofs. Interfacing metamath with powerful existing text tools is very good.
Parceling out the file would make everything more restrictive.

If you want to change the format I suggest that the theorems themselves are kept in one place. Maybe you
may place the proofs somewhere else.

--
FL

fl

unread,
Apr 16, 2016, 7:40:10 AM4/16/16
to Metamath

If you want to change the format I suggest that the theorems themselves are kept in one place. Maybe you
may place the proofs somewhere else.

I think that mathematics is linear in its essence. This is currently the way metamath
manages it: one theorem after the other. A structure in the form of a net would make everything
more difficult to understand and manage. With a net you never know where things are.

But a simple db2 database could be considered. I'm not sure, however, that the potential benefits are worth the trouble.

The most serious current drawbacks in metamath are the times required to minimize proofs or generate
html pages. Maybe additional elements and caches could be added to reduce those times.

--
FL

fl

unread,
Apr 16, 2016, 7:43:08 AM4/16/16
to Metamath
And for the perfect editor, I think that more well-designed algorithms and routines can help to achieve
such a goal.

(In particular the ability to add keywords to the theorems, to find them more easily.)

--
FL

Norman Megill

unread,
Apr 16, 2016, 10:17:04 PM4/16/16
to meta...@googlegroups.com
> logic theorems in set.mm <http://set.mm>, there is no reason not to
> make it part of set.mm <http://set.mm> itself, say in a mathbox, so
> that everyone will benefit.
>
> [...]
>
> As for modularity, in set.mm <http://set.mm> you will find commented
> out file includes for things like logic.mm <http://logic.mm>. At
> one point I had several pieces, but they were a pain to keep in sync
> (without special software that didn't exist and still doesn't). I
> think anyone who has done major work on set.mm <http://set.mm> will
> agree that it is very convenient to have everything in one file for
> global label/symbol changes, moving sections around, etc. with a
> text editor. Eventually set.mm <http://set.mm> will have to be
> broken up, but we aren't at that point yet.
>
>
> Actually, I've been brainstorming how to fix this, what the perfect
> metamath editor should look like. It's quite clear to me, at least, that
> we are not there, and having one 16MB file is a symptom of this. Most
> file editors are just not designed for this, and it is not good to need
> to read it all and store it in memory in order to get to a single
> theorem.

I'm not sure why it's not good. On today's computers, 28MB is nothing
compared to the available 4 to 8GB of memory. If an editor has trouble
with a file of that size, maybe the editor design is not good. While I
haven't checked
https://en.wikipedia.org/wiki/Comparison_of_text_editors#Basic_features
carefully, the currently supported ones I recognize handle several GB or
more. Emacs supposedly handles 2EB (10^9 GB).

The main editor I use reads set.mm in well under 1 sec. When I search
for a string at the end of the file it gets there essentially instantly.
I just tried a search and replace of the 27K instances of '$p' with a
larger string, and it completed before I could lift my finger from the
mouse. When I tried browsing a 2.1GB ASCII file of a human genome, it
became a little sluggish, taking 6 sec to find a string at the end of
the file.

> Edits to an .mm file almost always are local, touching only a
> tiny tiny part of that giant file, and splitting it up is one way to
> focus development a bit more.
>
> Could you repeat the reasons why you think having set.mm <http://set.mm>
> in one file is a good thing?

Simplicity. Speed. Lowest common denominator. Convenience.

> 1. Moving sections around is trivial if sections are in separate files,
> it just means moving a file and possibly changing the dependencies,
> while moving a theorem or range of theorems is just the same with
> multiple files, just cut and paste.

How do you prevent undo/redo confusion when switching among various files?

> 2. Global renames are completely identical with multiple files; almost
> all file editors have a "find/replace in files" feature, which would do
> the trick.

What if we forget to load one of them into the editor?

Anyway, the main editor I use doesn't have this feature. I could change
editors, but it would be asking me to sacrifice something I'm
comfortable with.

> Ideally this should be done by the aforementioned "perfect
> metamath editor", though, so it can be more syntax-aware and only do the
> replacement where it is syntactically valid.

The multiple-module feature you want could be done now with the $[ $]
includes.

For those wanting a single file, the includes would be commented out.
For those wanting multiple files, the includes would be uncommented.

We'd have to agree on the commenting convention, but I imagine it would
be similar to what is in set.mm now if you search for '$( $['.
Conversion between single and multiple would of course be automated in
metamath.exe or another program.

I think something like this should satisfy everyone.
FYI the current plans on my to-do list for metamath.exe include a change
to its data structure to allow any part of a statement to be modified
internally.

Applications I have in mind are global label and math token renames,
checking for conflicts before applying and also updating comment markup
and htmldefs. This will prevent accidental changes to English words
that might happen in an editor. Maybe I'll update the Recent Label
Changes automatically.

I don't have plans at this point to provide for insertion, deletion, or
changing the order of statements, or to provide for editing comments.

> I'd like to collect opinions on what a good editor should look like, and
> establish known shortcomings of existing methods, to get a roadmap for
> future editors. What does your "perfect Metamath editor" look like?

These are interesting ideas for a Metamath 2.0, and I'd like to hear
other opinions too.

As a general comment, I think it would be useful for Metamath 2.0 to be
translatable to and from the current version. If 2.0 needs additional
keywords for its features, maybe they could be hidden in special
comments in the 1.0 version so the versions could be completely
compatible. If 2.0 has say local scoping of things like labels, the
translator could rename them for the 1.0 version with some standard
suffix to make them unique. The tools we have for 1.0 (metamath.exe,
numerous verifiers) may or may not be practical to rewrite for 2.0, but
with perfect translatability they could continue to be used.

And ideally 2.0 could be verified with a 74-line program. :)

Norm

Mario Carneiro

unread,
Apr 17, 2016, 6:24:41 AM4/17/16
to metamath
On Sat, Apr 16, 2016 at 10:16 PM, Norman Megill <n...@alum.mit.edu> wrote:
On 4/16/16 6:53 AM, Mario Carneiro wrote:
Actually, I've been brainstorming how to fix this, what the perfect
metamath editor should look like. It's quite clear to me, at least, that
we are not there, and having one 16MB file is a symptom of this. Most
file editors are just not designed for this, and it is not good to need
to read it all and store it in memory in order to get to a single
theorem.

I'm not sure why it's not good.  On today's computers, 28MB is nothing compared to the available 4 to 8GB of memory.  If an editor has trouble with a file of that size, maybe the editor design is not good.  While I haven't checked https://en.wikipedia.org/wiki/Comparison_of_text_editors#Basic_features carefully, the currently supported ones I recognize handle several GB or more.  Emacs supposedly handles 2EB (10^9 GB).

The thing is that it's not just 28MB (wow, it's grown a lot since I last checked). An editor will likely need several times that in order to store not just the raw text, but also the internal processed form of the data. Remember that the "perfect Metamath editor" will not be a plain text editor, because that doesn't correctly reflect the structure of Metamath data.

One interesting data structure I'd like to focus on is the information needed by proof mode. Roughly, this is the MM file with proofs omitted. I just did a test, stubbing out all the proofs, and the result is about 12MB which is alarmingly large. Removing comments brings it down to 5.8MB. I think there is a good argument for separating this to a different file from the proofs, so that an editor need not read the proofs if one is only interested in the "context" of one proof. This structure is also useful as the context needed for proof HTML generation.
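(A rough sketch of the kind of stubbing script that measurement implies, in Python; not necessarily the exact one used, and it ignores the corner case of '$=' appearing inside a comment:)

import re

def stub_proofs(text):
    # Replace every proof body ($= ... $.) with the placeholder proof "?",
    # leaving the $p statement itself intact.
    return re.sub(r'\$=.*?\$\.', '$= ? $.', text, flags=re.DOTALL)

def strip_comments(text):
    # Remove $( ... $) comments as well.
    return re.sub(r'\$\(.*?\$\)', '', text, flags=re.DOTALL)

with open('set.mm') as f:
    src = f.read()
stubbed = stub_proofs(src)
print('proofs stubbed:', len(stubbed) // 1024, 'KB')
print('comments also removed:', len(strip_comments(stubbed)) // 1024, 'KB')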

The main editor I use reads set.mm in well under 1 sec.  When I search for a string at the end of the file it gets there essentially instantly.  I just tried a search and replace of the 27K instances of '$p' with a larger string, and it completed before I could lift my finger from the mouse.  When I tried browsing a 2.1GB ASCII file of a human genome, it became a little sluggish, taking 6 sec to find a string at the end of the file.

You have managed to get impressively far using only plain text tools, and the mm file format is somewhat tailored for use this way (e.g. spaces around label names everywhere). But I think that Metamath 2.0 should not rely on this, and use instead a more structured format, with tools that respect this format. (This might still basically be the current mm format, but the tool would not just be a text editor.)

Edits to an .mm file almost always are local, touching only a
tiny tiny part of that giant file, and splitting it up is one way to
focus development a bit more.

Could you repeat the reasons why you think having set.mm <http://set.mm>
in one file is a good thing?

Simplicity.  Speed.  Lowest common denominator.  Convenience.

I will certainly grant that it is easy to use a text editor for maintenance activities, when it can be done. The major drawback is that there are certain maintenance activities that can't be done with find/replace, and others that "almost" work but need some cleanup afterwards. A proper tool to do these refactorings in a correctness-preserving manner is warranted.
 
1. Moving sections around is trivial if sections are in separate files,
it just means moving a file and possibly changing the dependencies,
while moving a theorem or range of theorems is just the same with
multiple files, just cut and paste.

How do you prevent undo/redo confusion when switching among various files?

Some editors do shared undo, and usually a theorem move only involves one or two files anyway. But I think the correct solution here is just to commit often, and use the commits as cross-file undo checkpoints.
 
2. Global renames are completely identical with multiple files; almost
all file editors have a "find/replace in files" feature, which would do
the trick.

What if we forget to load one of them into the editor?

Usually you don't have to; my editor allows you to select a directory to search, and it will automatically open all the files that need to change, or you can just change them in place without opening them. A good maintenance tool for Metamath will search for and update all downstream files under a global rename (because otherwise it wouldn't be correctness-preserving).

Anyway, the main editor I use doesn't have this feature.  I could change editors, but it would be asking me to sacrifice something I'm comfortable with.

Like I said, my goal is to figure out what features we need in an editor of the future, so that you don't have to sacrifice anything.
 
Ideally this should be done by the aforementioned "perfect
metamath editor", though, so it can be more syntax-aware and only do the
replacement where it is syntactically valid.

The multiple-module feature you want could be done now with the $[ $] includes.

For those wanting a single file, the includes would be commented out.
For those wanting multiple files, the includes would be uncommented.

We'd have to agree on the commenting convention, but I imagine it would be similar to what is in set.mm now if you search for '$( $['. Conversion between single and multiple would of course be automated in metamath.exe or another program.

I think something like this should satisfy everyone.

A simple short term solution would be a tool to separate set.mm by section, and daisy-chain the files together with include statements. If this can be done in a perfectly reversible manner, then we could pass between each form painlessly. That is, you could take:

$( =-=- section 1 =-=- $)
thm1 $p ...
$( =-=- section 2 =-=- $)
thm2 $p ...
$( =-=- section 3 =-=- $)
thm3 $p ...

from set.mm, and form the files section1.mm:

$( =-=- section 1 =-=- $)
thm1 $p ...

section2.mm:

$[ section1.mm $]

$( =-=- section 2 =-=- $)
thm2 $p ...

section3.mm:

$[ section2.mm $]

$( =-=- section 3 =-=- $)
thm3 $p ...
 

I believe Stefan has already used such a tool for the set.mm-history project.
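A sketch of what such a round-trip splitter could look like, in Python. It is purely illustrative: it keys off the '=-=-' markers shown above, drops anything before the first header, and makes no attempt to reproduce set.mm's real multi-line section headers.

import os, re

HEADER = re.compile(r'^\$\( =-=- (.*?) =-=- \$\)$', re.M)

def split_mm(src_path, out_dir):
    # Split one .mm file into per-section files, daisy-chained with $[ $].
    text = open(src_path).read()
    parts = HEADER.split(text)   # [preamble, title1, body1, title2, body2, ...]
    os.makedirs(out_dir, exist_ok=True)
    prev = None
    for i in range(1, len(parts), 2):
        title, body = parts[i], parts[i + 1]
        fname = 'section%d.mm' % ((i + 1) // 2)
        with open(os.path.join(out_dir, fname), 'w') as f:
            if prev:
                f.write('$[ %s $]\n\n' % prev)   # include the previous piece
            f.write('$( =-=- %s =-=- $)' % title)
            f.write(body)                        # body begins with its own newline
        prev = fname

def join_mm(last_path):
    # Reverse the split: follow the $[ $] chain backwards, dropping the includes.
    text = open(last_path).read()
    m = re.match(r'\$\[ (\S+) \$\]\s*', text)
    if m:
        prefix = join_mm(os.path.join(os.path.dirname(last_path), m.group(1)))
        return prefix + text[m.end():]
    return text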

FYI the current plans on my to-do list for metamath.exe include a change to its data structure to allow any part of a statement to be modified internally.

Applications I have in mind are global label and math token renames, checking for conflicts before applying and also updating comment markup and htmldefs.  This will prevent accidental changes to English words that might happen in an editor.  Maybe I'll update the Recent Label Changes automatically.

Right, this is exactly what I have in mind. Doing this with a syntax aware tool allows you to perform these kinds of edits more reliably than a plain text editor. In fact, I question whether there are *any* maintenance activities that are appropriate for a plain text editor, with find and replace, which don't sometimes cause unforeseen mistakes.

As a general comment, I think it would be useful for Metamath 2.0 to be translatable to and from the current version. If 2.0 needs additional keywords for its features, maybe they could be hidden in special comments in the 1.0 version so the versions could be completely compatible. If 2.0 has say local scoping of things like labels, the translator could rename them for the 1.0 version with some standard suffix to make them unique.  The tools we have for 1.0 (metamath.exe, numerous verifiers) may or may not be practical to rewrite for 2.0, but with perfect translatability they could continue to be used.

I totally agree. I think that full backwards compatibility is very important for Metamath in general, so that proofs are supported long into the future, and we retain the current simplicity that has encouraged the existing proliferation of verifiers.

However, I am not certain that all features can be supported with the current mm format, in particular the separation of proofs and theorem statements, so I would relax this restriction to having a (fairly simple) translation from 1.0 to and from 2.0.


Here's a simple proposal for separating the proofs: take the proofs out of the .mm file, so that a $p statement takes the simple form:

label $p math $.

For forward compatibility, the label $p math $= ... $. form is also valid, and in this case a separate proof is not needed. Then, have a parallel file, say set.mmc to go with set.mm, which contains any proofs in binary format. Details of this format are up for discussion, but let me point out a few facts:

* It only needs to have the structure {{"thm1", "proof of thm1"}, {"thm2", "proof of thm2"}, ...} where the entries for each theorem can come in any order convenient for lookup or update (like a sorted binary tree or something).

* The file can be binary because all the human-readable information has been removed and placed in the other file. The file should be optimized for quick seeks, so that single-proof examination can operate without reading the whole file.

* Proof checking can be done in parallel (this has actually always been true, but it is harder to mix reading the file with proof checking at the same time), provided that care is taken with sequence numbers from the mm file to ensure that a theorem does not reference later theorems.

Here's an actual binary format which satisfies the above constraints:

Part 1: Header

4D4D4301, 4 bytes = [MMC<hex 1>]: magic number (format specifier)
length of header, 4 bytes (big endian)

List of records of the form:
Label string, zero-terminated. This is the search key, the first label in alphabetical order in the right subtree.
Length of left subtree, 4 bytes (skip this many bytes to get to the right subtree)
Length of all proofs in the left subtree, 4 bytes (skip this many bytes in part 2 to jump over all the proofs for the left subtree)

A zero byte (empty label string) indicates an empty subtree.

Part 2: Proof data

List of records of the form:
Length of record, 4 bytes
Label string, zero-terminated (This and the record length are not necessary if you use the header, but make it easier to traverse all proofs)
Proof data. The proof data has a format reminiscent of the compressed proof format, but without the printable text restriction. It has the following structure:
List of label strings, separated by zero bytes (the referenced theorems)
A second zero byte denoting the end of the label list
The "letters" part of the compressed proof (discussed below)


That's all that is needed for efficient storage and retrieval. In order to find a specific record, one needs to read log n entries in the header, then skip straight to the proof location. Traversal over all records is easily accomplished by skipping the header, then reading each proof data record. Insertion or proof data replacement is also efficient, requiring updates to log n entries in the header. Renames are most easily accomplished as a deletion followed by an insertion. Rebalancing may be necessary, but only requires adjustment to the header.
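As an illustration of the "log n entries, then seek" claim, here is a rough Python sketch of the lookup side, under one particular reading of the layout above: each internal header record is followed immediately by its left subtree and then its right subtree, every length is 4-byte big-endian, the stored record length counts the bytes after the length field, and part 2 keeps the proofs in label order with one (empty) leaf per proof. None of this is a settled format, just the shape of the idea.

import struct

def read_cstr(f):
    # Read a zero-terminated byte string from the current position.
    out = bytearray()
    while True:
        ch = f.read(1)
        if ch in (b'', b'\x00'):
            return bytes(out)
        out += ch

def find_proof(path, label):
    # Return (referenced labels, letter bytes) for `label`, or None.
    label = label.encode('ascii')
    with open(path, 'rb') as f:
        assert f.read(4) == bytes.fromhex('4D4D4301')     # magic "MMC" + hex 1
        (header_len,) = struct.unpack('>I', f.read(4))
        proof_off = 8 + header_len       # start of part 2
        node = 8                         # offset of the root header record
        while True:
            f.seek(node)
            key = read_cstr(f)
            if key == b'':               # empty subtree: search has converged
                break
            left_len, left_proofs = struct.unpack('>II', f.read(8))
            if label < key:
                node = f.tell()                    # descend into the left subtree
            else:
                node = f.tell() + left_len         # skip the left subtree records
                proof_off += left_proofs           # and the proofs under them
        f.seek(proof_off)                # read the single record we landed on
        (rec_len,) = struct.unpack('>I', f.read(4))
        if read_cstr(f) != label:
            return None
        refs = []
        while True:
            lab = read_cstr(f)
            if lab == b'':               # second zero byte ends the label list
                break
            refs.append(lab.decode('ascii'))
        letters = f.read(proof_off + 4 + rec_len - f.tell())
        return refs, letters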

Is it over-engineered? Probably, but the idea here is that an editor would not store this data in memory at all; it would instead just open the file whenever it needed to consult a new proof, and just use the mm file for contextual information otherwise. This, I think, can scale up much better than the one giant .mm file we currently have.

For the "letter list" part, I'll start by describing the general scheme, without picking the cutoffs yet. Divide the 256 possible byte values into three regions, A,B,C. A single number takes the form CC...CA or CC...CB (that is, all but the last byte have values in region C, and the last is in region A or B). If there are a,b,c numbers in each region (with a+b+c = 256), then sequence (a+b+cn),...,(a+b+c2),(a+b+c1),a0 translates to (((cn*c+...)*c+c2)*c+c1)*a+a0, and the sequence (a+b+cn),...,(a+b+c2),(a+b+c1),a+b0 translates to (((cn*c+...)*c+c2)*c+c1)*b+b0.

This is identical to the mixed-base representation in the standard compressed proof format, except there are three regions instead of two. The difference between the A numbers and the B numbers is that B numbers are marked, in the same sense as the "Z" mark in a compressed proof. This way, a single marked step can fit in only one byte instead of two, if the number is small enough.
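To make the scheme concrete, a small Python sketch of an encoder/decoder for a single (number, marked) pair under a choice of region sizes {a, b, c}; it implements exactly the digit rule above and nothing else:

def encode(n, marked, a, b, c):
    # One final digit in region A (unmarked) or B (marked), preceded by
    # zero or more base-c digits in region C, most significant first.
    assert a + b + c == 256
    final_base = b if marked else a
    last = n % final_base
    out = [a + last if marked else last]
    n //= final_base
    while n > 0:
        n, d = divmod(n, c)
        out.append(a + b + d)
    return bytes(reversed(out))

def decode(data, a, b, c):
    # Inverse of encode: returns (number, marked).
    n = 0
    for byte in data[:-1]:
        n = n * c + (byte - a - b)       # region-C digits
    last = data[-1]
    if last < a:
        return n * a + last, False       # region A: unmarked
    return n * b + (last - a), True      # region B: marked

assert decode(encode(905, True, 192, 56, 8), 192, 56, 8) == (905, True)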

I gathered statistics on how often each number, marked or not, is used in the compressed proofs, and tested what choices {a,b,c} lead to the shortest overall size. The largest number in set.mm is 905, and 10% of steps are marked, with a nice exponential decay for both groups.

Surprisingly, the best choices have c being incredibly small; the best solution was {208, 46, 2} with 6084 KB to store all the compressed numbers in set.mm. (For comparison, the current compressed proof format uses roughly {20, 20, 5}, which gives 9713 KB.) My instinct is to go with a more balanced choice, such as {192, 56, 8} = 6118 KB, but I guess the fact that there is a loose upper bound on proof lengths currently in force in set.mm means that numbers never get so large that more than 3 digits are needed.
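For anyone redoing this experiment, scoring a candidate {a, b, c} is a one-liner once the statistics are in hand; a sketch, assuming the statistics have been tabulated as (value, marked, count) triples:

def encoded_len(n, marked, a, b, c):
    # Bytes needed for one (number, marked) pair under region sizes {a, b, c}.
    q = n // (b if marked else a)
    digits = 1
    while q > 0:
        q //= c
        digits += 1
    return digits

def total_size(stats, a, b, c):
    # Projected size in bytes of all compressed numbers for this choice.
    return sum(count * encoded_len(value, marked, a, b, c)
               for value, marked, count in stats)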

Mario

fl

unread,
Apr 17, 2016, 11:15:41 AM4/17/16
to Metamath

Right, this is exactly what I have in mind. Doing this with a syntax aware tool allows you to perform these kinds of edits more reliably than a plain text editor. 

But a syntax-aware tool exists: it is mmj2. Its parser is very efficient and well designed. Now it is simple: use the syntax tree delivered by mmj2 to add
syntax-aware routines.

Currently the search/replace functions in mmj2 are poor in my opinion. They deserve to be completely redesigned.

And all those problems are not related to the way the database is stored on a disk anyway.

> In fact, I question whether there are *any* maintenance activities that are appropriate for a plain text editor, with find and replace, which don't sometimes cause 
> unforeseen mistakes.

And frankly, readable, text-based databases are never superseded by binary databases. Binary data can be useful for speed or other efficiency issues
but it is always at the cost of poorer interfacing. I think computer engineers spend, I would say, 80% of their time with text-based
tools to manage data, systems and so on. Text-based databases have the peculiarity of interfacing well with an important element of a computer: human beings.

The errors introduced by a non-syntax-aware system are easily corrected, I think, especially with metamath, where an error in the data is always
captured by the checker (at least for the proofs and formulas).

-- 
FL

Norman Megill

unread,
Apr 17, 2016, 7:51:21 PM4/17/16
to meta...@googlegroups.com
On 4/17/16 6:24 AM, Mario Carneiro wrote:
>
>
>
> On Sat, Apr 16, 2016 at 10:16 PM, Norman Megill <n...@alum.mit.edu
> <mailto:n...@alum.mit.edu>> wrote:
>
> On 4/16/16 6:53 AM, Mario Carneiro wrote:
...
>
> The multiple-module feature you want could be done now with the $[
> $] includes.
>
> For those wanting a single file, the includes would be commented out.
> For those wanting multiple files, the includes would be uncommented.
>
> We'd have to agree on the commenting convention, but I imagine it
> would be similar to what is in set.mm <http://set.mm> now if you
> search for '$( $['. Conversion between single and multiple would of
> course be automated in metamath.exe or another program.
>
> I think something like this should satisfy everyone.
>
>
> A simple short term solution would be a tool to separate set.mm
> <http://set.mm> by section, and daisy-chain the files together with
> include statements. If this can be done in a perfectly reversible
> manner, then we could pass between each form painlessly. That is, you
> could take:
>
> $( =-=- section 1 =-=- $)

Do you mean the existing section headers? If so, how would the user put
multiple tiny sections in one file or alternately break a huge section
into 2 files, or is that not possible/desirable with your scheme? What
is the algorithm for mapping the string "section 1" to "section1.mm"?
Or if you don't mean that, wouldn't it be clearer just to put the file
name instead of "section 1"?

> thm1 $p ...
> $( =-=- section 2 =-=- $)
> thm2 $p ...
> $( =-=- section 3 =-=- $)
> thm3 $p ...
>
> from set.mm <http://set.mm>, and form the files section1.mm
> <http://section1.mm>:
>
> $( =-=- section 1 =-=- $)
> thm1 $p ...
>
> section2.mm <http://section2.mm>:

(Aside: Is there any way to turn off Google's insertion of spurious
http links on file names?)

>
> $[ section1.mm <http://section1.mm> $]
>
> $( =-=- section 2 =-=- $)
> thm2 $p ...
>
> section3.mm <http://section3.mm>:
>
> $[ section2.mm <http://section2.mm> $]
>
> $( =-=- section 3 =-=- $)
> thm3 $p ...
>
>
> I believe Stefan has already used such a tool for the set.mm-history
> project.

Could Stefan report the format he uses, and any other opinions he might
have on what our "official" convention should be?


...

>
> As a general comment, I think it would be useful for Metamath 2.0 to
> be translatable to and from the current version. If 2.0 needs
> additional keywords for its features, maybe they could be hidden in
> special comments in the 1.0 version so the versions could be
> completely compatible. If 2.0 has say local scoping of things like
> labels, the translator could rename them for the 1.0 version with
> some standard suffix to make them unique. The tools we have for 1.0
> (metamath.exe, numerous verifiers) may or may not be practical to
> rewrite for 2.0, but with perfect translatability they could
> continue to be used.
>
>
> I totally agree. I think that full backwards compatibility is very
> important for Metamath in general, so that proofs are supported long
> into the future, and we retain the current simplicity that has
> encouraged the existing proliferation of verifiers.
>
> However, I am not certain that all features can be supported with the
> current mm format, in particular the separation of proofs and theorem
> statements, so I would relax this restriction to having a (fairly
> simple) translation from 1.0 to and from 2.0.

Can you clarify this? I'm not opposed to hiding new 2.0 features in
special 1.0 comments so that 2.0 could be recovered from it exactly. If
we have perfect (lossless) translations back and forth between 1.0 and
2.0 (which is what I was trying to suggest), what are you relaxing?

>
>
> Here's a simple proposal for separating the proofs: take the proofs out
> of the .mm file, so that a $p statement takes the simple form:
>
> label $p math $.
>
> For forward compatibility, the label $p math $= ... $. form is also
> valid, and in this case a separate proof is not needed. Then, have a
> parallel file, say set.mmc to go with set.mm <http://set.mm>, which
> shortest overall size. The largest number in set.mm <http://set.mm> is
> 905, and 10% of steps are marked, with a nice exponential decay for both
> groups.
>
> Surprisingly, the best choices have c being incredibly small; the best
> solution was {208, 46, 2} with 6084 KB to store all the compressed
> numbers in set.mm <http://set.mm>. (For comparison, the current
> compressed proof format uses roughly {20, 20, 5} which gives 9713 KB.)
> My instinct is to go with a more balanced choice, such as {192, 56, 8} =
> 6118 KB, but I guess the fact that there is a loose upper bound on proof
> lengths currently in force in set.mm <http://set.mm> means that numbers
> never get so large that more than 3 digits is needed.

Interesting. I'm curious - how well do these formats compress further
with say bzip2 or xz?

Norm

Mario Carneiro

unread,
Apr 17, 2016, 8:16:24 PM4/17/16
to metamath
On Sun, Apr 17, 2016 at 11:15 AM, 'fl' via Metamath <meta...@googlegroups.com> wrote:

Right, this is exactly what I have in mind. Doing this with a syntax aware tool allows you to perform these kinds of edits more reliably than a plain text editor. 

But a syntax aware tool exists: it is mmj2. Its parser is very efficient and well designed. Now it is simple: use the syntax tree delivered the by mmj2 to add
syntax-aware routines.

But it's not an *editor*. It was explicitly designed not to modify the .mm file, which was IMO a mistake. The task of simply reading an .mm file and converting it to an internal data structure is not that hard, and is performed by every verifier. It is true that mmj2 handles "proof mode" better than any other existing solutions, but it does not do "maintenance mode" at all, nor can it handle the interplay between the two (I have a script to restart mmj2 whenever I do a maintenance mode edit, which I do using a text editor open to the .mm file).

Currently the search/replace functions in mmj2 are poor in my opinion. They deserve to be completely redesigned.

True. Ideas are welcome.
 
And all those problems are not related to the way the database is stored on a disk anyway.

The disk storage question is an attempt to address the scalability of the full library going forward. Almost all activities you want to do with the library are "local", and as set.mm becomes more modular this will become even more so. You can't "seek" in a text file.
 
> In fact, I question whether there are *any* maintenance activities that are appropriate for a plain text editor, with find and replace, which don't sometimes cause 
> unforeseen mistakes.

And frankly, readable, text-based databases are never superseded by binary databases. Binary data can be useful for speed or other efficiency issues
but it is always at the cost of poorer interfacing. I think computer engineers spend, I would say, 80% of their time with text-based
tools to manage data, systems and so on. Text-based databases have the peculiarity of interfacing well with an important element of a computer: human beings.

You're kidding yourself if you think that compressed proofs are "readable". Currently an .mm file is this odd mix of human readable and unreadable data. The basic idea here is to take the unreadable part out of an .mm file so it becomes pure "source code", then make the human-unreadable part *really* unreadable, i.e. full binary, taking advantage of all the unutilized byte values to achieve a significant savings in proof size (~30% decrease), and making it possible to jump around the file more efficiently than just loading the whole thing into memory.

Also, a binary format like this one is naturally paired with the aforementioned "proof mode" software, which will unpack a proof into a human readable format like .mmp style. That way you get the best of both worlds.
 
The errors introduced by a non-syntax-aware system are easily corrected, I think, especially with metamath, where an error in the data is always
captured by the checker (at least for the proofs and formulas).

 Just because an error is recognized doesn't mean it's easy to fix. It is quite easy to cause thousands of errors with a simple change in one spot, and syntax aware editors can automatically resolve these downstream errors in many cases.



On Sun, Apr 17, 2016 at 7:51 PM, Norman Megill <n...@alum.mit.edu> wrote:
On 4/17/16 6:24 AM, Mario Carneiro wrote:
A simple short term solution would be a tool to separate set.mm
<http://set.mm> by section, and daisy-chain the files together with

include statements. If this can be done in a perfectly reversible
manner, then we could pass between each form painlessly. That is, you
could take:

$( =-=- section 1 =-=- $)
Do you mean the existing section headers?  If so, how would the user put multiple tiny sections in one file or alternately break a huge section into 2 files, or is that not possible/desirable with your scheme?  What is the algorithm for mapping the string "section 1" to "section1.mm"? Or if you don't mean that, wouldn't it be clearer just to put the file name instead of "section 1"?

Yes, those are the existing section headers. Splitting and joining files is done just like you would in a software project: make a new file, copy the theorems of interest over. I don't think it is necessary to specifically support this action just because it happens to be simple with the current approach. Also, you could just add a second section to one of the single-section files, then round trip it and it will get broken up.

You are right that there is no easy way to build section names appropriate as file names, since currently we don't have any short naming system for sections that would naturally have already developed if we used multiple files (notice that existing split attempts have such short names, like logic.mm, dedekind.mm instead of "Dedekind-cut construction of real and complex numbers"). We may want to consider a format for annotating short names in the section headers. Other options include using the section numbering (which is sensitive to insertion/deletion), or using the news website approach: take the English title, lowercase everything, replace all non-alphanumerics with hyphen, then collapse multiple consecutive hyphens and cut it off if it is longer than some upper bound, say 50 chars.
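As a sketch, the news-website rule is only a couple of lines of Python (the 50-character cutoff is just the number floated above, not a settled choice):

import re

def section_slug(title, max_len=50):
    # Lowercase, squash runs of non-alphanumerics to hyphens, trim, truncate.
    slug = re.sub(r'[^a-z0-9]+', '-', title.lower()).strip('-')
    return slug[:max_len].rstrip('-')

# section_slug("Dedekind-cut construction of real and complex numbers")
#   -> 'dedekind-cut-construction-of-real-and-complex-numb'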
 

thm1 $p ...
$( =-=- section 2 =-=- $)
thm2 $p ...
$( =-=- section 3 =-=- $)
thm3 $p ...

from set.mm <http://set.mm>, and form the files section1.mm
<http://section1.mm>:


$( =-=- section 1 =-=- $)
thm1 $p ...


(Aside:  Is there any way to turn off Google's insertion of spurious http links on file names?)

Haha, you have it bad because Google turns these names into links, which your plaintext email client turns into "a <b>" format, which Google turns into links... I bet if we keep passing this quote around it will blow up exponentially.

I totally agree. I think that full backwards compatibility is very
important for Metamath in general, so that proofs are supported long
into the future, and we retain the current simplicity that has
encouraged the existing proliferation of verifiers.

However, I am not certain that all features can be supported with the
current mm format, in particular the separation of proofs and theorem
statements, so I would relax this restriction to having a (fairly
simple) translation from 1.0 to and from 2.0.

Can you clarify this?  I'm not opposed to hiding new 2.0 features in special 1.0 comments so that 2.0 could be recovered from it exactly.  If we have perfect (lossless) translations back and forth between 1.0 and 2.0 (which is what I was trying to suggest), what are you relaxing?

Oh, I just meant it may not be backwards or forwards compatible, in the sense that one spec is a subset or superset of the other, but there is an embedding and maybe a bijection between formats.

I gathered statistics on how often each number, marked or not, is used
in the compressed proofs, and tested what choices {a,b,c} lead to the
shortest overall size. The largest number in set.mm <http://set.mm> is

905, and 10% of steps are marked, with a nice exponential decay for both
groups.

Surprisingly, the best choices have c being incredibly small; the best
solution was {208, 46, 2} with 6084 KB to store all the compressed
numbers in set.mm <http://set.mm>. (For comparison, the current

compressed proof format uses roughly {20, 20, 5} which gives 9713 KB.)
My instinct is to go with a more balanced choice, such as {192, 56, 8} =
6118 KB, but I guess the fact that there is a loose upper bound on proof
lengths currently in force in set.mm <http://set.mm> means that numbers

never get so large that more than 3 digits is needed.
Interesting.  I'm curious - how well do these formats compress further with say bzip2 or xz?

There is probably still some room for compression, since the label lists are still English-like. But a proper answer would require implementing the algorithm (those KB numbers are projections based on the statistics of set.mm; I never actually made a 6084 KB file). I'll let you know if/when 2.0 is actually built.

Mario

Stefan O'Rear

unread,
Apr 17, 2016, 8:53:13 PM4/17/16
to Metamath
On Sunday, April 17, 2016 at 3:24:41 AM UTC-7, Mario Carneiro wrote:
On Sat, Apr 16, 2016 at 10:16 PM, Norman Megill <n...@alum.mit.edu> wrote:
On 4/16/16 6:53 AM, Mario Carneiro wrote:
Actually, I've been brainstorming how to fix this, what the perfect
metamath editor should look like. It's quite clear to me, at least, that
we are not there, and having one 16MB file is a symptom of this. Most
file editors are just not designed for this, and it is not good to need
to read it all and store it in memory in order to get to a single
theorem.

I'm not sure why it's not good.  On today's computers, 28MB is nothing compared to the available 4 to 8GB of memory.  If an editor has trouble with a file of that size, maybe the editor design is not good.  While I haven't checked https://en.wikipedia.org/wiki/Comparison_of_text_editors#Basic_features carefully, the currently supported ones I recognize handle several GB or more.  Emacs supposedly handles 2EB (10^9 GB).

Being able to load a 28MB file is not the same thing as being able to handle a 28MB file _quickly_.  Most files encountered in programming are much smaller (look at the C files in the Metamath distribution), and most specialized programmer's editors don't place a high priority on being able to edit responsively with a large file loaded, even if that file can theoretically be loaded.  I believe I reported a long time ago that vim was having 1/4 - 1/2 seconds of lag to move around in the 17MB set.mm.  It's great that your editor of choice has made tradeoffs that allow it to efficiently handle large files, but other people have editors of choice that make different tradeoffs, and keeping set.mm in smaller pieces will help make them happy.

The version control systems I use also get noticeably sluggish with files that large.

Moreover, it's important to plan for the future as well as the present.  If we ever succeed in attracting a significant number of new contributors, that file will increase in size much faster than it has done in the past.  I consider the fact that set.mm is growing slower than Moore's law to be a bug, not a feature. 

The thing is that it's not just 28MB (wow, it's grown a lot since I last checked). An editor will likely need several times that in order to store not just the raw text, but also the internal processed form of the data. Remember that the "perfect Metamath editor" will not be a plain text editor, because that doesn't correctly reflect the structure of Metamath data.

Exactly, and this goes double for my zany idea of the browser-based proof editor, because Javascript implementations give you very little control over the heap layout (security reasons) and many data structures will require a lot more memory than they would in a C program.  (Unless we do the asm.js "everything in a big Int8Array" approach).
 
One interesting data structure I'd like to focus on is the information needed by proof mode. Roughly, this is the MM file with proofs omitted. I just did a test, stubbing out all the proofs, and the result is about 12MB which is alarmingly large. Removing comments brings it down to 5.8MB. I think there is a good argument for separating this to a different file from the proofs, so that an editor need not read the proofs if one is only interested in the "context" of one proof. This structure is also useful as the context needed for proof HTML generation.

My plan for SMM-PA was to keep the proofs in text format when they weren't being edited, because that's denser than almost any other possible data structure.  Still, maintaining the parse trees for all statements took ~ 120 MB last time I tried it.  Developing a denser format for the parse trees was on my TODO list.
 
Edits to an .mm file almost always are local, touching only a
tiny tiny part of that giant file, and splitting it up is one way to
focus development a bit more.

Could you repeat the reasons why you think having set.mm <http://set.mm>
in one file is a good thing?

Simplicity.  Speed.  Lowest common denominator.  Convenience.

I will certainly grant that it is easy to use a text editor for maintenance activities, when it can be done. The major drawback is that there are certain maintenance activities that can't be done with find/replace, and others that "almost" work but need some cleanup afterwards. A proper tool to do these refactorings in a correctness-preserving manner is warranted.

Simplicity is a big deal for me.  I made up a proposal for a complicated namespacing system when I first joined; I forget if I ever shared it but I no longer support it.

Straw proposal that I _could_ get behind today: replace the single .mm file with a directory of files, which are read in alphabetical order.  Tools which don't understand the directory design can be supported with a very simple concatenation and splitting program; the concatenation, in particular, is a 1-liner in UNIX shell.
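For illustration, the concatenation side is only a few lines of Python (and, as noted, a one-liner in the shell):

import os

def concat_directory(db_dir, out_path):
    # Concatenate a directory of .mm fragments in alphabetical order, which is
    # all a directory-unaware verifier would need to see a single database.
    names = sorted(n for n in os.listdir(db_dir) if n.endswith('.mm'))
    with open(out_path, 'w') as out:
        for name in names:
            with open(os.path.join(db_dir, name)) as f:
                out.write(f.read())
            out.write('\n')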

Important: This proposal DOES NOT use $[ $].  The directory structure is king.  I regard $[ $] as a mistake because it gives the database author too much flexibility; the database should be a database, not a program.

This gives us a large number of small files (although Construction_of_a_vector_space_from_a_Hilbert_lattice could still stand to be split up :) ), making our tree similar to other trees and play nicer with the widest variety of existing text manipulation tools.  (A current Linux kernel checkout is 624 MB, or 12 KB per source file...)

Some editors do shared undo, and usually a theorem move only involves one or two files anyway. But I think the correct solution here is just to commit often, and use the commits as cross-file undo checkpoints.

Agreed on commits and the value of integrating version control into our workflow.  Again, git and other systems I've used will be happier with hundreds of small files than one largish one.
 
Here's a simple proposal for separating the proofs: take the proofs out of the .mm file, so that a $p statement takes the simple form:

Not a big fan of separating the proofs.  Seems to add a lot of complexity for a gain that will be erased by just a year or two of growth.  Splitting the database into _many_ files can scale.

If the file were split into sections, the mythical Metamath Two could automatically create interface cache files for each section.  Pre-tokenizing the _statements_ would be a substantial space win, and cache files that are automatically regenerated based on timestamps would add no complexity whatsoever to the maintenance and version control stories (after all, your compiler + make already does exactly that).
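A sketch of the timestamp rule being described, in Python; parse, read_cache, and write_cache here are placeholders for whatever the real formats turn out to be:

import os

def cache_is_fresh(source_path, cache_path):
    # Usable iff the cache exists and is at least as new as its source --
    # the same rule make applies to object files.
    return (os.path.exists(cache_path) and
            os.path.getmtime(cache_path) >= os.path.getmtime(source_path))

def load_section(source_path, cache_path, parse, read_cache, write_cache):
    # Reparse a section only when its cache is stale.
    if cache_is_fresh(source_path, cache_path):
        return read_cache(cache_path)
    data = parse(source_path)
    write_cache(cache_path, data)
    return data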

-sorear 

Stefan O'Rear

unread,
Apr 17, 2016, 9:02:19 PM4/17/16
to Metamath
On Sunday, April 17, 2016 at 4:51:21 PM UTC-7, Norman Megill wrote: 
> I believe Stefan has already used such a tool for the set.mm-history
> project.

Could Stefan report the format he uses, and any other opinions he might
have on what our "official" convention should be?


A newer version of essentially the same logic: https://github.com/sorear/smm/blob/master/lib/smm/outline.js

I have already stated my opposition to $[ $].
 
Interesting.  I'm curious - how well do these formats compress further
with say bzip2 or xz?

I'd really encourage us to _not_ spend much time on optimizing compression with databases at rest.  I think I could make something half the size of .mm.xz, given my experience with format-specific data compressors (the general-purpose compressors are a lot worse for specific formats than people realize), but at least for me 28MB on disk is very little.  I care much more about 120MB of parse trees in RAM, occupying space in a browser heap already split between N tabs...

This might be US city-dweller bias.  FL might have choice words on the importance of optimizing databases for transmission over very slow unreliable links.

-sorear

Stefan O'Rear

unread,
Apr 17, 2016, 9:05:20 PM4/17/16
to Metamath
On Sunday, April 17, 2016 at 5:16:24 PM UTC-7, Mario Carneiro wrote:
You are right that there is no easy way to build section names appropriate as file names, since currently we don't have any short naming system for sections that would naturally have already developed if we used multiple files (notice that existing split attempts have such short names, like logic.mm, dedekind.mm instead of "Dedekind-cut construction of real and complex numbers"). We may want to consider a format for annotating short names in the section headers. Other options include using the section numbering (which is sensitive to insertion/deletion), or using the news website approach: take the English title, lowercase everything, replace all non-alphanumerics with hyphen, then collapse multiple consecutive hyphens and cut it off if it is longer than some upper bound, say 50 chars.

I posted two implementations of the "news website approach" above.  It needs a little tweaking for Metamath but works well.
 
Haha, you have it bad because Google turns these names into links, which your plaintext email client turns into "a <b>" format, which Google turns into links... I bet if we keep passing this quote around it will blow up exponentially.

Two-letter file extensions are best avoided for new projects because of the ccTLD issue...
 
-sorear

David A. Wheeler

unread,
Apr 17, 2016, 10:18:31 PM4/17/16
to meta...@googlegroups.com, Mario Carneiro
I do like the readability of text files. If seekability is important, perhaps they could be stored as a bunch of small text files, with one proof per file.

--- David A.Wheeler

Norman Megill

unread,
Apr 17, 2016, 11:35:35 PM4/17/16
to meta...@googlegroups.com
On 4/17/16 8:16 PM, Mario Carneiro wrote:
>
>
> On Sun, Apr 17, 2016 at 11:15 AM, 'fl' via Metamath
> <meta...@googlegroups.com <mailto:meta...@googlegroups.com>> wrote:
...

> > In fact, I question whether there are *any* maintenance activities that are appropriate for a plain text editor, with find and replace, which don't sometimes cause
> > unforeseen mistakes.
>
> And frankly readable, text-based databases are never superseded by
> binary databases. Binary data can be useful for speed or other
> efficiency issues
> but it is always at the cost of poorer interfacing. I think computer
> engineers must spend I would say 80% of their time with text-based
> tools to manage data, systems and so on. Text-based databases have
> the peculiarity of interfacing well with important element of a
> computer: human beings.
>
>
> You're kidding yourself if you think that compressed proofs are
> "readable". Currently an .mm file is this odd mix of human readable and
> unreadable data. The basic idea here is to take the unreadable part out
> of an .mm file so it becomes pure "source code", then make the
> human-unreadable part *really* unreadable, i.e. full binary, taking
> advantage of all the unutilized byte values to achieve a significant
> savings in proof size (~30% decrease), and making it possible to jump
> around the file more efficiently than just loading the whole thing into
> memory.
>
> Also, a binary format like this one is naturally paired with the
> aforementioned "proof mode" software, which will unpack a proof into a
> human readable format like .mmp style. That way you get the best of both
> worlds.

I'll mention that I am somewhat uncomfortable with binary formats, but
as long as we can convert to ASCII 1.0 I guess it's tolerable. For some
things like label renames, a syntax-aware Metamath-specific tool is
certainly preferred. But no such tool can anticipate all the possible
manipulations we may want to do to a .mm file. Having the ability to
process it with text-manipulating tools from sed to powerful text editor
scripting has been extremely useful in the past, and that ability goes
away with binary formats. Unlike a computer program where editing can
introduce hidden bugs, the ability to verify proofs makes .mm files
relatively robust against editing mistakes.

A computer program source file 28MB in size is hard to even imagine, and
I wouldn't want to be responsible for its maintenance. The fact that
set.mm flourishes with no significant problems in spite of its size
suggests that it has a fundamentally different character from a computer
program. The same principles of structure and modularity, intended to
reduce bugs and increase maintainability, may or may not apply. set.mm
is also quite different from an ordinary SQL database, which could be
riddled with unverifiable data entry errors that can only partially be
checked for consistency (such as correct zip code). I don't know what
category of data object set.mm would be classified as, but it seems
different from the usual ones of our experience.

As for file size, of course smaller files are preferable as a general
principle, but I don't think it is worth giving up flexibility. I still
don't understand the concern with our present file sizes, where set.mm
takes less than 1% of available memory on most computers. (BTW my text
editor uses an additional 13MB to work with the 28MB set.mm, growing up
to a user-specified limit for the undo stack.) I know you are trying to
anticipate future scalability, and theoretical studies of different
proof compression methods is certainly interesting, but it will be quite
a while before saving 30% of the size will make a noticeable difference.

> The quantity of errors introduce by a non aware-based system are
> easily corrected I think especially with metamath where an error in
> the data is always
> captured by the checker (at least for the proofs an formulas).
>
>
> Just because an error is recognized doesn't mean it's easy to fix. It
> is quite easy to cause thousands of errors with a simple change in one
> spot, and syntax aware editors can automatically resolve these
> downstream errors in many cases.

What is an example of an error introduced by a careless edit that is
hard to fix? (Personally, during an editing session I usually save the
file frequently and run the fast 'verify proof */syntax_only'.)

Norm

Mario Carneiro

unread,
Apr 18, 2016, 1:31:23 AM4/18/16
to metamath
On Sun, Apr 17, 2016 at 8:53 PM, Stefan O'Rear <stef...@cox.net> wrote:
Simplicity is a big deal for me.  I made up a proposal for a complicated namespacing system when I first joined; I forget if I ever shared it but I no longer support it.

Straw proposal that I _could_ get behind today: replace the single .mm file with a directory of files, which are read in alphabetical order.  Tools which don't understand the directory design can be supported with a very simple concatenation and splitting program; the concatenation, in particular, is a 1-liner in UNIX shell.

I like this general approach, although the alphabetic thing will require indexing in section names, which means a lot of shuffling around for insertions.

A variant of this that is forward compatible is to have a folder full of .mm file fragments, not expected to compile on their own, and one big index file, like so:

index.mm:
$[ fol\prop\axioms.mm $]  <-- thru sec 1.2.2
$[ fol\prop\imp.mm $]  <-- sec 1.2.3
$[ fol\prop\neg.mm $]  <-- sec 1.2.4
...
$[ zf\ax-ext.mm $]   <-- sec 2.1.1
$[ zf\cab.mm $]   <-- sec 2.1.2
...

This way, we have a single place to go for section reordering, and there is a lot of flexibility with file naming and folder structure. In fact, it might also be useful to put the section comments themselves in this index file, so that section organization (the stuff that Gerard Lang likes to work with) can be separated from the theorems.

the database should be a database, not a program.

I think this bears repeating. A good portion of the complexity of manipulating an .mm file is caused by all the possible changes to the parser state that can occur while reading the file. That is, it is difficult to check a proof in a second pass because you have to remember all the changes to global state that have "since" happened from directives later in the file, and undo them using sequence number filters and so on. This design is easy for dynamic work, i.e. verifiers, but makes static analysis (syntax aware maintenance mode) more complex.

A simple "optimization" that simplifies static analysis:

* All $c, $v, $f statements, and syntax axioms come at the beginning of the file. In other words, the entire database works under a single grammar. (It's not necessary that this actually be true, but rather that transplanting all the statements to the start doesn't break anything.)

Luckily we don't have to worry about theorems being hidden to the point of inaccessibility, followed by name reuse, or else that would also be on this list.

Here's a simple proposal for separating the proofs: take the proofs out of the .mm file, so that a $p statement takes the simple form:

Not a big fan of separating the proofs.  Seems to add a lot of complexity for a gain that will be erased by just a year or two of growth.  Splitting the database into _many_ files can scale.

It may not have been obvious, but the proposal was actually suggesting one .mmc proof index file for *each* .mm file in the project. So if the .mm file is broken up, so too will the .mmc files.

The gain here is not just the ~30% file size savings, although that is nice. The main point is that this file need not be stored in memory _at all_, regardless of how the proof data is compressed. Whenever a proof record is needed, you open the file, seek to the location of the data, then read the record and close the file. So that's a 100% savings on in-memory proof data storage (although if your memory statistics are correct, this may still not be a very big savings, since it's only about 70% of set.mm, compared to your 1000% data structure).
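
For concreteness, the access pattern is just the following (a sketch in Rust; the label-to-(offset, length) index is assumed to already exist in memory, and the on-disk layout of the .mmc records is left open):

use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

/// Hypothetical index entry: where one proof record lives in the .mmc file.
struct ProofLoc {
    offset: u64,
    len: usize,
}

/// Fetch a single proof record on demand; nothing else is kept in memory.
fn read_proof(mmc_path: &str, loc: &ProofLoc) -> std::io::Result<Vec<u8>> {
    let mut f = File::open(mmc_path)?;
    f.seek(SeekFrom::Start(loc.offset))?;
    let mut buf = vec![0u8; loc.len];
    f.read_exact(&mut buf)?;
    Ok(buf) // the file is closed when `f` goes out of scope
}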
 
If the file were split into sections, the mythical Metamath Two could automatically create interface cache files for each section.  Pre-tokenizing the _statements_ would be a substantial space win, and cache files that are automatically regenerated based on timestamps would add no complexity whatsoever to the maintenance and version control stories (after all, your compiler + make already does exactly that).

I think caching is a good idea, particularly if it can somehow be segmented so that the entire cache is not invalidated when any part changes. (This is yet another argument for splitting the file, so that you have multiple timestamps.) If you could get something similar to the in-memory conception of the database onto the disk as a cache, then you could speed up loading the database by an order of magnitude.


On second thought, I think the .mmc format should not contain the header, which is too brittle for version control. Instead, it would only have what I labeled "part 2", and the program would construct the binary tree or something like it in memory, maybe also outputting it as part of the cache, but not checked in to version control.

On Sun, Apr 17, 2016 at 11:35 PM, Norman Megill <n...@alum.mit.edu> wrote:
I'll mention that I am somewhat uncomfortable with binary formats, but as long as we can convert to ASCII 1.0 I guess it's tolerable.  For some things like label renames, a syntax-aware Metamath-specific tool is certainly preferred.  But no such tool can anticipate all the possible manipulations we may want to do to a .mm file.  Having the ability to process it with text-manipulating tools from sed to powerful text editor scripting has been extremely useful in the past, and that ability goes away with binary formats.  Unlike a computer program where editing can introduce hidden bugs, the ability to verify proofs makes .mm files relatively robust against editing mistakes.

I will emphasize that I am only talking about the stuff that is meaningless, sed-unprocessable "ASFAKLHRAWHLA" garbage already. I don't have any proposal for the .mm format any easier to use than the current one (besides taking the proof blocks out). Compressed proofs *are* binary data, even if they are currently being stored in a text file.
 
A computer program source file 28MB in size is hard to even imagine, and I wouldn't want to be responsible for its maintenance.  The fact that set.mm flourishes with no significant problems in spite of its size suggests that it has a fundamentally different character from a computer program.  The same principles of structure and modularity, intended to reduce bugs and increase maintainability, may or may not apply.

I don't buy this argument. It is quite possible to use a 28MB C file, it's not all that different from 28MB worth of separate C files as far as the compiler is concerned. And if you had to, and the tools did not have an issue with large files, it wouldn't be that bad for day-to-day work. But it's terrible code style. Not to put too fine a point on it, but I'm pretty sure that the main reason set.mm is one file today is because you think it's a good idea, which isn't actually an argument. Of course I presume that you have reasons for this decision, which are perfectly good arguments, but the mere fact that this is how it has been is not proof that it is the best way.
 
set.mm is also quite different from an ordinary SQL database, which could be riddled with unverifiable data entry errors that can only partially be checked for consistency (such as correct zip code).  I don't know what category of data object set.mm would be classified as, but it seems different from the usual ones of our experience.

As we've seen, not all data errors in set.mm can be checked. The definition checker, grammar parsing, and theorem scoping are all examples of data integrity checks that are not checkable by the average verifier. Furthermore, testing that theorems and definitions mean what we want will never be automatically checkable. This is exactly the same sort of thing as an SQL database with unverifiable data entry errors, or a C program which might compile but might still have logical bugs.

The biggest difference between set.mm and an SQL database is instead the online nature of the data (online as in https://en.wikipedia.org/wiki/Online_algorithm). If all the grammar was defined up front, it actually wouldn't be all that bad to store a metamath file in a DBMS.

As for file size, of course smaller files are preferable as a general principle, but I don't think it is worth giving up flexibility.  I still don't understand the concern with our present file sizes, where set.mm takes less than 1% of available memory on most computers.  (BTW my text editor uses an additional 13MB to work with the 28MB set.mm, growing up to a user-specified limit for the undo stack.)  I know you are trying to anticipate future scalability, and theoretical studies of different proof compression methods are certainly interesting, but it will be quite a while before saving 30% of the size makes a noticeable difference.

I think Stefan had the right idea here. This view forgets two important facts: (1) The size on disk is often quite different from the size in memory when read by a *syntax aware* tool. A file editor will probably not need much more space than the file itself because it makes no attempt to understand the data it is manipulating (which leads to subtle or not-so-subtle errors in maintenance mode). (2) Just because a file "fits" in memory (itself a fuzzy concept when you consider paging) does not mean that it can be *quickly* manipulated. This is essential for proof assistants like mmj2; I am starting to feel the pressure of scanning the database on each unify (which I do once every 3 seconds or so), where a 0.5 - 1 second delay is a pain point.
 
Just because an error is recognized doesn't mean it's easy to fix. It
is quite easy to cause thousands of errors with a simple change in one
spot, and syntax aware editors can automatically resolve these
downstream errors in many cases.

What is an example of an error introduced by a careless edit that is hard to fix?  (Personally, during an editing session I usually save the file frequently and run the fast 'verify proof */syntax_only'.)

An almost trivial example: Change syl to $e |- ( ps -> ch ) $e |- ( ch -> ph ) $p |- ( ps -> ph ) $. By "fix" here I don't mean "undo the change", but "fix the file so that it's not broken, but keep the change".

Mario

fl

unread,
Apr 18, 2016, 6:08:17 AM4/18/16
to Metamath


> But a syntax aware tool exists: it is mmj2. Its parser is very efficient and well designed. Now it is simple: use the syntax tree delivered by mmj2 to add

Frankly I doubt it is so difficult to transform mmj2 into an editor. Add another buffer
where the whole database is displayed, allow saving the data into a file. And here you are.

You can also externalize the parser and make an external filter in the "find" spirit. I think it would
allow to do think.

By the way, it could add automatic periodic backups.

--
FL

fl

unread,
Apr 18, 2016, 6:13:11 AM4/18/16
to Metamath

>  I think it would allow to do think.

To do things.

--
FL

Stefan O'Rear

unread,
Apr 19, 2016, 12:51:00 AM4/19/16
to Metamath
On Sunday, April 17, 2016 at 10:31:23 PM UTC-7, Mario Carneiro wrote:
On Sun, Apr 17, 2016 at 8:53 PM, Stefan O'Rear <stef...@cox.net> wrote:
Simplicity is a big deal for me.  I made up a proposal for a complicated namespacing system when I first joined; I forget if I ever shared it but I no longer support it.

Straw proposal that I _could_ get behind today: replace the single .mm file with a directory of files, which are read in alphabetical order.  Tools which don't understand the directory design can be supported with a very simple concatenation and splitting program; the concatenation, in particular, is a 1-liner in UNIX shell.

I like this general approach, although the alphabetic thing will require indexing in section names, which means a lot of shuffling around for insertions.

That was a very quick idea, and sure to have a number of problems… this is a fairly hard problem to solve, and I still don't have all the answers yet.  Definitely interested in future exploration and discussion of the "how do we make a _good_ dual-mode editor" space.

What do we want to do in terms of platform issues?  I still feel like the JVM is a major liability, but I see a shortage of time continuing for the next few months so I'm not very excited to maintain my own system.
 
A variant of this that is forward compatible is to have a folder full of .mm file fragments, not expected to compile on their own, and one big index file, like so:

index.mm:
$[ fol\prop\axioms.mm $]  <-- thru sec 1.2.2
$[ fol\prop\imp.mm $]  <-- sec 1.2.3
$[ fol\prop\neg.mm $]  <-- sec 1.2.4
...
$[ zf\ax-ext.mm $]   <-- sec 2.1.1
$[ zf\cab.mm $]   <-- sec 2.1.2
...

This way, we have a single place to go for section reordering, and there is a lot of flexibility with file naming and folder structure. In fact, it might also be useful to put the section comments themselves in this index file, so that section organization (the stuff that Gerard Lang likes to work with) can be separated from the theorems.

That could work well.  At that point it would essentially be a project file; I feel like there's something _almost_ here that would be a big help to local work / handling subsets of the database, but I'm not quite sure what it is.

Vaguely I feel like the Database should always be, abstractly, a singular entity, but it need not be worked with _as_ a singular entity, if that makes any sense at all.
 
the database should be a database, not a program.

I think this bears repeating. A good portion of the complexity of manipulating an .mm file is caused by all the possible changes to the parser state that can occur while reading the file. That is, it is difficult to check a proof in a second pass because you have to remember all the changes to global state that have "since" happened from directives later in the file, and undo them using sequence number filters and so on. This design is easy for dynamic work, i.e. verifiers, but makes static analysis (syntax aware maintenance mode) more complex.

A simple "optimization" that simplifies static analysis:

* All $c, $v, $f statements, and syntax axioms come at the beginning of the file. In other words, the entire database works under a single grammar. (It's not necessary that this actually be true, but rather that transplanting all the statements to the start doesn't break anything.)

Isn't this already true?
 
Luckily we don't have to worry about theorems being hidden to the point of inaccessibility, followed by name reuse, or else that would also be on this list.

Here's a simple proposal for separating the proofs: take the proofs out of the .mm file, so that a $p statement takes the simple form:

Not a big fan of separating the proofs.  Seems to add a lot of complexity for a gain that will be erased by just a year or two of growth.  Splitting the database into _many_ files can scale.

It may not have been obvious, but the proposal was actually suggesting one .mmc proof index file for *each* .mm file in the project. So if the .mm file is broken up, so too will the .mmc files.

Sounds good, assuming we have a splitting mechanism that is rationalized enough that we can actually handle files in isolation.  (Even C compiler authors are turning against textual inclusion these days.  http://clang.llvm.org/docs/Modules.html — I have my hopes.)
 
The gain here is not just the ~30% file size savings, although that is nice. The main point is that this file need not be stored in memory _at all_, regardless of how the proof data is compressed. Whenever a proof record is needed, you open the file, seek to the location of the data, then read the record and close the file. So that's a 100% savings on in-memory proof data storage (although if your memory statistics are correct, this may still not be a very big savings, since it's only about 70% of set.mm, compared to your 1000% data structure).

When I did the memory profiling on smm that came up with the 120MB number, the worst offender was large numbers of short string objects, specifically the "(" "A" "+" "B" ")" strings that came from first-stage tokenizing the math strings.  I'm pretty sure the memory usage could be cut substantially, first by using an intern table for tokens, second by more restricted caching of things like parse trees that can be locally constructed.  There are tradeoffs here; I was trying to cut every last centisecond from the verification time, and I reasoned that interning would add quite a few centiseconds (although to be honest I never measured it; I could do that).
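
For reference, an intern table of the kind described is only a few lines. A sketch (the u32 Token index is an arbitrary choice here; a production version would probably intern slices into the source buffer rather than owned strings):

use std::collections::HashMap;

/// Interned token: a small copyable index instead of a heap-allocated string.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct Token(u32);

#[derive(Default)]
struct Interner {
    map: HashMap<String, Token>,
    names: Vec<String>,
}

impl Interner {
    /// Return the existing token for `s`, or allocate a new one.
    fn intern(&mut self, s: &str) -> Token {
        if let Some(&t) = self.map.get(s) {
            return t;
        }
        let t = Token(self.names.len() as u32);
        self.names.push(s.to_owned());
        self.map.insert(s.to_owned(), t);
        t
    }

    /// Look up the spelling of a token.
    fn name(&self, t: Token) -> &str {
        &self.names[t.0 as usize]
    }
}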

"Memory" versus "disk" is not necessarily the best way to conceptualize this, since we're really dealing with a hierarchy of caches.  We want to minimize the size of the auxiliary data structures that are constructed (to an extent), and the amount of ongoing work to maintain them.  We can't go around rebuilding all of the indexes on load and on every edit.  A very important part of the smm maintenance mode work will be developing a vocabulary of editing commands that can do useful changes without requiring all indexes to be rebuilt; I've got a few half-baked ideas here.  They'll have to respect tree structure and scope structure in a few ways.
 
If the file were split into sections, the mythical Metamath Two could automatically create interface cache files for each section.  Pre-tokenizing the _statements_ would be a substantial space win, and cache files that are automatically regenerated based on timestamps would add no complexity whatsoever to the maintenance and version control stories (after all, your compiler + make already does exactly that).

I think caching is a good idea, particularly if it can somehow be segmented so that the entire cache is not invalidated when any part changes. (This is yet another argument for splitting the file, so that you have multiple timestamps.) If you could get something similar to the in-memory conception of the database onto the disk as a cache, then you could speed up loading the database by an order of magnitude.

That's what I had in mind - splitting the database into 100 or so files with separate timestamps.  Don't go crazy though as nobody ever added transitive directory timestamps to UNIX and stat overhead can be a killer when you have 50,000+ files (see also: doing almost anything with said Linux checkout).

On second thought, I think the .mmc format should not contain the header, which is too brittle for version control. Instead, it would only have what I labeled "part 2", and the program would construct the binary tree or something like it in memory, maybe also outputting it as part of the cache, but not checked in to version control.

Absolutely version control should have a single point of truth.  If we decide to store MMC files on disk (absolute loading speed may not be important once we have a mature maintenance mode, so I could go either way on this), they should be .gitignored.
 
On Sun, Apr 17, 2016 at 11:35 PM, Norman Megill <n...@alum.mit.edu> wrote:
I'll mention that I am somewhat uncomfortable with binary formats, but as long as we can convert to ASCII 1.0 I guess it's tolerable.  For some things like label renames, a syntax-aware Metamath-specific tool is certainly preferred.  But no such tool can anticipate all the possible manipulations we may want to do to a .mm file.  Having the ability to process it with text-manipulating tools from sed to powerful text editor scripting has been extremely useful in the past, and that ability goes away with binary formats.  Unlike a computer program where editing can introduce hidden bugs, the ability to verify proofs makes .mm files relatively robust against editing mistakes.

I will emphasize that I am only talking about the stuff that is meaningless, sed-unprocessable "ASFAKLHRAWHLA" garbage already. I don't have any proposal for the .mm format any easier to use than the current one (besides taking the proof blocks out). Compressed proofs *are* binary data, even if they are currently being stored in a text file.

Once we have the chunking and indexing procedure resolved, I don't see major advantages to formats other than the current one.  Certainly there are better entropy coders that could be used for the compressed proof data, and we can get rid of most of the syntax steps while we're at it (I need to write and benchmark a unifying proof verifier.  I suspect that with a grammatical .mm file, it may be "fast enough", as the trees give unification a lot more to bite into.); but that doesn't save us much more than size at rest, which isn't that important to begin with.  A metamath editor won't need to maintain data structures for the contents of most proofs most of the time.
 
A computer program source file 28MB in size is hard to even imagine, and I wouldn't want to be responsible for its maintenance.  The fact that set.mm flourishes with no significant problems in spite of its size suggests that it has a fundamentally different character from a computer program.  The same principles of structure and modularity, intended to reduce bugs and increase maintainability, may or may not apply.

I don't buy this argument. It is quite possible to use a 28MB C file, it's not all that different from 28MB worth of separate C files as far as the compiler is concerned. And if you had to, and the tools did not have an issue with large files, it wouldn't be that bad for day-to-day work. But it's terrible code style. Not to put too fine a point on it, but I'm pretty sure that the main reason set.mm is one file today is because you think it's a good idea, which isn't actually an argument. Of course I presume that you have reasons for this decision, which are perfectly good arguments, but the mere fact that this is how it has been is not proof that it is the best way.

The largest legitimate (not a firmware dump, etc) single source file in my Linux checkout is 534 KB.  It's a matter of degree, not kind.
 
I think Stefan had the right idea here. This view forgets two important facts: (1) The size on disk is often quite different from the size in memory when read by a *syntax aware* tool. A file editor will probably not need much more space than the file itself because it makes no attempt to understand the data it is manipulating (which leads to subtle or not-so-subtle errors in maintenance mode). (2) Just because a file "fits" in memory (itself a fuzzy concept when you consider paging) does not mean that it can be *quickly* manipulated. This is essential for proof assistants like mmj2; I am starting to feel the pressure of scanning the database on each unify (which I do once every 3 seconds or so), where a 0.5 - 1 second delay is a pain point.

Worth noting that one of my big research plans for a proof tool is a fast highly-indexed search function.  I haven't worked out very many of the details yet but I've spent a fair amount of time looking at the algorithms used by Lucene et al, and I do think something can be usefully done, *but* managing memory use is going to be one of the hardest parts.
 
-sorear

Stefan O'Rear

unread,
Apr 19, 2016, 12:55:52 AM4/19/16
to Metamath
On Monday, April 18, 2016 at 3:08:17 AM UTC-7, fl wrote:
Frankly I doubt it is so difficult to transform mmj2 into an editor. Add another buffer
where the whole database is displayed, allow saving the data into a file. And here you are.

I suspect if you did this you would discover that, as bad as commodity programmer's editors are at handling 28MB files, JTextArea is worse.

And don't even think about enabling the syntax highlighter. 

Mario Carneiro

unread,
Apr 19, 2016, 4:47:50 AM4/19/16
to metamath
On Tue, Apr 19, 2016 at 12:51 AM, Stefan O'Rear <stef...@cox.net> wrote:
What do we want to do in terms of platform issues?  I still feel like the JVM is a major liability, but I see a shortage of time continuing for the next few months so I'm not very excited to maintain my own system.

I've been investigating a possible replacement for metamath.exe written in Rust (which targets the same platforms as C). As long as you are running local, it seems useful to just run as a command line tool, like a compiler, which is invoked by actual text editors to do the heavy lifting. The biggest problem with this model is the large startup cost, which would seem to suggest a more server-oriented approach in which the compiler runs as a service and updates its memory based on commands from the editor.
 
A variant of this that is forward compatible is to have a folder full of .mm file fragments, not expected to compile on their own, and one big index file, like so:

index.mm:
$[ fol\prop\axioms.mm $]  <-- thru sec 1.2.2
$[ fol\prop\imp.mm $]  <-- sec 1.2.3
$[ fol\prop\neg.mm $]  <-- sec 1.2.4
...
$[ zf\ax-ext.mm $]   <-- sec 2.1.1
$[ zf\cab.mm $]   <-- sec 2.1.2
...

This way, we have a single place to go for section reordering, and there is a lot of flexibility with file naming and folder structure. In fact, it might also be useful to put the section comments themselves in this index file, so that section organization (the stuff that Gerard Lang likes to work with) can be separated from the theorems.

That could work well.  At that point it would essentially be a project file; I feel like there's something _almost_ here that would be a big help to local work / handling subsets of the database, but I'm not quite sure what it is.

One idea I've been toying with is snapshotting the database after each file, so that you have a restart point when some file in the middle changes. Not sure if it's not simpler to just filter the full database and throw out everything after a given sequence number.

* All $c, $v, $f statements, and syntax axioms come at the beginning of the file. In other words, the entire database works under a single grammar. (It's not necessary that this actually be true, but rather that transplanting all the statements to the start doesn't break anything.)

Isn't this already true?

It's almost true: moving local $v's to the start causes duplicate $v errors, which you can fix by merging the duplicates; but local $f's might cause an irreparable duplicate problem if the type of the variable is different in the two $f statements. A recent suggestion of mine for ensuring that local $f assignments always use the same type will resolve this.

But the other half of the story is that moving the grammar up means that more statements will be accepted, for example a propositional calc statement might make reference to a constant declared later (like CHOICE), which would parse "correctly" even though it is not valid. I'm not sure how to resolve this; probably it is best to just load the full grammar and then postprocess the parse to make sure it does not use any forward references. (The same problem applies to user scripts, which may have hardcoded references to theorems from later on.)
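
One way to implement that postprocessing step: parse against the full grammar, then walk the parse tree and reject any node whose syntax axiom was declared later in the database than the statement being checked. A rough sketch, with the tree and sequence-number types invented for illustration:

/// Hypothetical parse tree node: the syntax axiom used at this node
/// and its sequence number in the database.
struct ParseNode {
    axiom_label: String,
    axiom_seq: u32,
    children: Vec<ParseNode>,
}

/// Reject parses that reference syntax axioms declared after `stmt_seq`.
fn check_no_forward_refs(node: &ParseNode, stmt_seq: u32) -> Result<(), String> {
    if node.axiom_seq > stmt_seq {
        return Err(format!(
            "statement uses syntax axiom {} declared later in the database",
            node.axiom_label
        ));
    }
    node.children
        .iter()
        .try_for_each(|c| check_no_forward_refs(c, stmt_seq))
}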
 
"Memory" versus "disk" is not necessarily the best way to conceptualize this, since we're really dealing with a hierarchy of caches.  We want to minimize the size of the auxiliary data structures that are constructed (to an extent), and the amount of ongoing work to maintain them.  We can't go around rebuilding all of the indexes on load and on every edit.  A very important part of the smm maintenance mode work will be developing a vocabulary of editing commands that can do useful changes without requiring all indexes to be rebuilt; I've got a few half-baked ideas here.  They'll have to respect tree structure and scope structure in a few ways.

Alternatively, you can present the user with the raw .mm file, and diff any changes against the indexed version, then attempt to reverse engineer which statements are untouched and only rebuild the index for changed statements. Doing this quickly is tough, though, even assuming you are working with a reasonably sized .mm file fragment.

On second thought, I think the .mmc format should not contain the header, which is too brittle for version control. Instead, it would only have what I labeled "part 2", and the program would construct the binary tree or something like it in memory, maybe also outputting it as part of the cache, but not checked in to version control.

Absolutely version control should have a single point of truth.  If we decide to store MMC files on disk (absolute loading speed may not be important once we have a mature maintenance mode, so I could go either way on this), they should be .gitignored.

Well, you do have to version control the MMC files, because that's where the proofs are. (In this conception the proofs are stubbed out of all the $p statements.) It would even diff well, in the sense that a change to one proof affects only one subsection of the binary file. It's really not all that different from an archive of one-proof files concatenated together; the reason I'm avoiding that is because most proofs are small, and having that many files makes it hard to work with, as you have mentioned.
 
On Sun, Apr 17, 2016 at 11:35 PM, Norman Megill <n...@alum.mit.edu> wrote:
I'll mention that I am somewhat uncomfortable with binary formats, but as long as we can convert to ASCII 1.0 I guess it's tolerable.  For some things like label renames, a syntax-aware Metamath-specific tool is certainly preferred.  But no such tool can anticipate all the possible manipulations we may want to do to a .mm file.  Having the ability to process it with text-manipulating tools from sed to powerful text editor scripting has been extremely useful in the past, and that ability goes away with binary formats.  Unlike a computer program where editing can introduce hidden bugs, the ability to verify proofs makes .mm files relatively robust against editing mistakes.

I will emphasize that I am only talking about the stuff that is meaningless, sed-unprocessable "ASFAKLHRAWHLA" garbage already. I don't have any proposal for the .mm format any easier to use than the current one (besides taking the proof blocks out). Compressed proofs *are* binary data, even if they are currently being stored in a text file.

Once we have the chunking and indexing procedure resolved, I don't see major advantages to formats other than the current one.  Certainly there are better entropy coders that could be used for the compressed proof data, and we can get rid of most of the syntax steps while we're at it (I need to write and benchmark a unifying proof verifier.  I suspect that with a grammatical .mm file, it may be "fast enough", as the trees give unification a lot more to bite into.); but that doesn't save us much more than size at rest, which isn't that important to begin with.  A metamath editor won't need to maintain data structures for the contents of most proofs most of the time.

I think it will be a win to be able to separate maintenance mode from proof mode as separate files. That is, if you change the .mmc file but not the .mm file, there are no changes needed to the index because you only changed a proof, and not any statements. Also, you can present the .mm file to the user as a pure source code file, rather than fiddling with collapsing markers to hide the proofs so that you can quickly look over the development at a high level.

I really really don't want to reinvent the wheel here. The UI of mmj2 simply was not built with large files in mind, and many of my attempts to add functionality have hit a performance wall. Syntax highlighting was a lot of work and was ultimately a failure because it is not fast enough to be usable in practice. I don't want to repeat the experience. I'm looking into an Eclipse or Emacs plugin, because text editing that scales well is too hard to get right without writing a lot of code that has nothing to do with Metamath.

Mario

fl

unread,
Apr 19, 2016, 5:35:42 AM4/19/16
to Metamath

I suspect if you did this you would discover that, as bad as commodity programmer's editors are at handling 28MB files, JTextArea is worse.


Consider using real tools. Emacs manages a 28M file as if it had 100 bytes.

But anyway the filter I was speaking about above may also be a good idea.
 
And don't even think about enabling the syntax highlighter. 


Incremental parsers are your friends. (And you need a deterministic LR. You will be happy.)

-- 
FL 

Stefan O'Rear

unread,
Apr 19, 2016, 11:28:56 PM4/19/16
to meta...@googlegroups.com
On Tue, Apr 19, 2016 at 04:47:49AM -0400, Mario Carneiro wrote:
> On Tue, Apr 19, 2016 at 12:51 AM, Stefan O'Rear <stef...@cox.net> wrote:
>
> > What do we want to do in terms of platform issues? I still feel like the
> > JVM is a major liability, but I see a shortage of time continuing for the
> > next few months so I'm not very excited to maintain my own system.
>
>
> I've been investigating a possible replacement for metamath.exe written in
> Rust (which targets the same platforms as C). As long as you are running
> local, it seems useful to just run as a command line tool, like a compiler,
> which is invoked by actual text editors to do the heavy lifting. The
> biggest problem with this model is the large startup cost, which would seem
> to suggest a more server-oriented approach in which the compiler runs as a
> service and updates its memory based on commands from the editor.

I've been playing with Rust for a few months now, and I've *already* been
thinking it'd be an excellent fit for a new metamath program. Also provides
nicer multithreading than C (and *much* nicer than any current Javascript), and
there are quite a few places we can leverage that for improved responsiveness
in the editor... Was going to try writing a one-shot verifier in Rust in a
week or so.

Given the low startup overhead of a pre-compiled program and the ability to
keep a cache file (see below), I think startup under 200ms is feasible. Will
need to do experiments.

There would be a lot of problems to solve relative to integrating it into an
editor / constructing a GUI shell.

(BTW, I am presently running a Linux desktop)

> > That could work well. At that point it would essentially be a project
> > file; I feel like there's something _almost_ here that would be a big help
> > to local work / handling subsets of the database, but I'm not quite sure
> > what it is.
> >
>
> One idea I've been toying with is snapshotting the database after each
> file, so that you have a restart point when some file in the middle
> changes. Not sure if it's not simpler to just filter the full database and
> throw out everything after a given sequence number.

I don't like restart points.

I spent some time thinking about this last night and I think the reason I was
hitting a wall with smm was that I was trying to handle it at the wrong
granularity. smm wanted to support swapping out individual metamath
*statements*, which could be things like `${`; it'd work much better if we
forced the reparsing granularity to be a much larger segment.

Proposal for SMM 0.3:

* The Database is divided up into segments, which are coterminous with .mm
files. If proofs are split from statements, then each segment has a .mms
and a .mmc file.

It is possible to consider shorter segments, but some way to stably
identify them would become necessary.

* Each database has a "project file" which lists the used Database segments.
Details TBD.

* A segment must end with a newline, must consist of complete Metamath
statements only, and may not contain unbalanced ${ $}. Thus, the lexer
state and the nesting state are both automatically known at the top of a
segment.

* $[ $] is forbidden if there are multiple segments. Rationale: handling $[ $]
  requires state (the include list). The exception is for compatibility.

 * Top level $e statements are forbidden if there is more than one segment.
(They create difficulty for frame construction).

* Given a single segment in isolation, it can be parsed into Metamath
statements and blocks. For every assertion, we can identify the `$e`
hypotheses (but not the `$f`s, because we still allow those at top level and
they could be in scope from a former segment). Record all used global names
in an index.

 * Once we've scanned all segments, we can detect errors from name collisions.

* Check all segments again. Now that we have the full `$c`/`$v`/`$f` index,
we can construct frames for all assertions.

* At this point we have frames for all `$a` statements, so we can construct
the grammar.

* Check all segments again. We have a grammar and can do a grammatical check;
we also have frames for everything and can do a proof validity check. This
phase works locally within assertions so it can be incremental at that
level.

I am fairly confident that all of this can be done in a strongly incremental
way: the only changes that will require scanning the full active database are
the ones that change the definition or parsing rules of a math token that is
actually used everywhere.

> It's almost true; moving local $v's to the start causes duplicate $v
> errors, but you can fix that by merging the duplicates, but local $f's
> might cause an irreparable duplicate problem, if the type of the variable
> is different in the two $f statements. A recent suggestion of mine for
> ensuring that local $f assignments always use the same type will resolve
> this.
>
> But the other half of the story is that moving the grammar up means that
> more statements will be accepted, for example a propositional calc
> statement might make reference to a constant declared later (like CHOICE),
> which would parse "correctly" even though it is not valid. I'm not sure how
> to resolve this; probably it is best to just load the full grammar and then
> postprocess the parse to make sure it does not use any forward references.
> (The same problem applies to user scripts, which may have hardcoded
> references to theorems from later on.)

$f statements can be handled by adding valid ranges to the grammar rules,
e.g. https://github.com/sorear/smm/blob/master/lib/smm/parser.js#L310

> > Absolutely version control should have a single point of truth. If we
> > decide to store MMC files on disk (absolute loading speed may not be
> > important once we have a mature maintenance mode, so I could go either way
> > on this), they should be .gitignored.
> >
>
> Well, you do have to version control the MMC files, because that's where
> the proofs are. (In this conception the proofs are stubbed out of all the
> $p statements.) It would even diff well, in the sense that a change to one
> proof affects only one subsection of the binary file. It's really not all
> that different from an archive of one-proof files concatenated together;
> the reason I'm avoiding that is because most proofs are small, and having
> that many files makes it hard to work with, as you have mentioned.

I'd slightly rather keep the proofs with the $p statements. To use a SQL
database analogy, I'm currently advocating for horizontal partitioning only.
For the database size and incremental processing use case, horizontal
partitioning is a much bigger win.

I'd also rather not do anything that requires binary diffing; the tools exist,
but they're not widely used and it would contribute to the barrier to entry if
we restricted usable tools in that way.

> I think it will be a win to be able to separate maintenance mode from proof
> mode as separate files. That is, if you change the .mmc file but not the
> .mm file, there are no changes needed to the index because you only changed
> a proof, and not any statements. Also, you can present the .mm file to the
> user as a pure source code file, rather than fiddling with collapsing
> markers to hide the proofs so that you can quickly look over the
> development at a high level.

> I really really don't want to reinvent the wheel here. The UI of mmj2
> simply was not built with large files in mind, and many of my attempts to
> add functionality have hit a performance wall. Syntax highlighting was a
> lot of work and was ultimately a failure because it is not fast enough to
> be usable in practice. I don't want to repeat the experience. I'm looking
> into an Eclipse or Emacs plugin, because text editing that scales well is
> too hard to get right without writing a lot of code that has nothing to do
> with Metamath.

I think I finally see what you're getting at here. You want to remove proofs
from the text that needs to be edited while maintaining the database, because
it's useless visual noise. I want to keep proofs interleaved in the .mm
file(s) that we store and distribute because the proof file and the statement
file would otherwise have to be kept closely in sync, which is doable but a bit
annoying. Undecided.

If you want to do this, I suggest a line-based "text" format, with one line per
proof, separated by equals signs, sorted, with the current format on the right,
e.g.:

id = ( wi ax-1 mpd ) AAABZAAACAECD
idd = ( wi id a1i ) BBCABDE

This should keep diff happy, even if the line length is a bit long.
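
For what it's worth, the splitting pass that produces this layout is short. A rough sketch (the whitespace-token scan stands in for a real .mm parser, so comments containing $p, $= or $. would confuse it, and the "( ... )" test assumes compressed proofs):

use std::fs;

/// Split proofs out of a .mm file into sorted "label = proof" lines.
fn extract_proofs(mm_text: &str) -> Vec<String> {
    let toks: Vec<&str> = mm_text.split_whitespace().collect();
    let mut out = Vec::new();
    let mut i = 0;
    while i < toks.len() {
        if toks[i] == "$p" && i > 0 {
            let label = toks[i - 1];
            while i < toks.len() && toks[i] != "$=" {
                i += 1; // skip the statement body
            }
            let start = i + 1;
            while i < toks.len() && toks[i] != "$." {
                i += 1; // collect the proof tokens
            }
            let proof = &toks[start.min(i)..i];
            // Rejoin the letter blob of a compressed proof without the spaces
            // introduced by line wrapping; uncompressed proofs keep their spaces.
            let body = match proof.iter().position(|&t| t == ")") {
                Some(p) => format!("{} {}", proof[..=p].join(" "), proof[p + 1..].concat()),
                None => proof.join(" "),
            };
            out.push(format!("{} = {}", label, body));
        }
        i += 1;
    }
    out.sort();
    out
}

fn main() -> std::io::Result<()> {
    let text = fs::read_to_string("set.mm")?;
    for line in extract_proofs(&text) {
        println!("{}", line);
    }
    Ok(())
}

Unsplitting is the reverse: substitute each line's proof back between the $= and $. of the matching $p statement.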

Secondary proposal: The stripped Metamath files should be called .mms, to solve
our problem with Myanmar.

-sorear

Norman Megill

unread,
Apr 19, 2016, 11:47:10 PM4/19/16
to meta...@googlegroups.com
On 4/18/16 1:31 AM, Mario Carneiro wrote:
...

> What is an example of an error introduced by a careless edit that is
> hard to fix? (Personally, during an editing session I usually save
> the file frequently and run the fast 'verify proof */syntax_only'.)
>
> An almost trivial example: Change syl to $e |- ( ps -> ch ) $e |- ( ch
> -> ph ) $p |- ( ps -> ph ) $. By "fix" here I don't mean "undo the
> change", but "fix the file so that it's not broken, but keep the change".

As with any computer readable language, one needs to have a mental model
to know which changes can be done in a text editor without side effects.
Some changes are better done with assistance from syntax-aware tools.
A novice will quickly discover that this isn't a casual edit and undo
the change, or perhaps ask someone how to keep the change if it seems
sufficiently important.

This is certainly an example of something that would be best done with a
syntax-aware tool and not in a text editor. We don't have a tool with
such a command at this point, and the ability to do it would be a nice
feature to have. However, for the record in case anyone is wondering,
here is the way I would do this with the present tools.

1. Rename the original syl and all of its references to sylOLD.
2. Add the new syl.
3. Create and run a script that does 'minimize_with syl/allow_growth' in
MM-PA for each theorem using sylOLD.
4. Delete statement sylOLD.

Steps 1, 2, and 4 are done with a text editor. I have a script-building
script to semi-automate step 3, making it reasonably quick and easy for
me to do.

And for completeness:

5. Correct any references to the string sylOLD (as in ~ sylOLD).
6. Document the revision in the syl comment, date, and 'Recent label
changes'.

Norm

Norman Megill

unread,
Apr 20, 2016, 12:02:49 AM4/20/16
to meta...@googlegroups.com
On 4/19/16 11:28 PM, Stefan O'Rear wrote:
...
> I think I finally see what you're getting at here. You want to remove proofs
> from the text that needs to be edited while maintaining the database, because
> it's useless visual noise. I want to keep proofs interleaved in the .mm
> file(s) that we store and distribute because the proof file and the statement
> file would otherwise have to be kept closely in sync, which is doable but a bit
> annoying. Undecided.

Couldn't a syntax-aware editor simply have a mode that hides proofs?

Norm

Stefan O'Rear

unread,
Apr 20, 2016, 12:14:11 AM4/20/16
to meta...@googlegroups.com
I believe that is precisely what Mario was aiming at with "rather than fiddling
with collapsing markers to hide the proofs".

-sorear

Stefan O'Rear

unread,
Apr 20, 2016, 1:30:58 AM4/20/16
to meta...@googlegroups.com
On Tue, Apr 19, 2016 at 04:47:49AM -0400, Mario Carneiro wrote:
> On Tue, Apr 19, 2016 at 12:51 AM, Stefan O'Rear <stef...@cox.net> wrote:
>
> > What do we want to do in terms of platform issues? I still feel like the
> > JVM is a major liability, but I see a shortage of time continuing for the
> > next few months so I'm not very excited to maintain my own system.
>
>
> I've been investigating a possible replacement for metamath.exe written in
> Rust (which targets the same platforms as C). As long as you are running
> local, it seems useful to just run as a command line tool, like a compiler,
> which is invoked by actual text editors to do the heavy lifting. The
> biggest problem with this model is the large startup cost, which would seem
> to suggest a more server-oriented approach in which the compiler runs as a
> service and updates its memory based on commands from the editor.

After a more careful consideration of the algorithmic constraints, I now
believe it is technically feasible to target an ex nihilo (hot OS
caches, but no other saved state) verification of the current set.mm in
500 milliseconds on my dual core laptop. I'm going to start that now.

-sorear

Stefan O'Rear

unread,
Apr 20, 2016, 1:10:39 PM4/20/16
to meta...@googlegroups.com
On second thought, I've been rather thoroughly missing the forest for
the trees this week. What the things I've been doing would benefit most
from is a new proof editor with a MM-PA style UI but featuring
compressed proof handling, a grammar-aware unifier, and several other
odds and ends; I can do that with smm 0.2, no change to the database
format is required.

Parallel and incremental verification is still technically interesting
and I might do something along those ends, but it's not something I need
right now in any sense.

smm 0.2 will be fairly weak for maintenance mode operations, but I can
keep using a text editor for most of those.

I'd kind of like to split up set.mm to make tools happier, but one of
the few ways set.mm is *not* like a program is that it either works
(verifies) or it doesn't. With a program it's useful to have very
frequent commits because it makes it easier to understand "why" when a
regression is discovered, but set.mm has no regressions and so commits
every couple days are good enough for history-keeping.

Mario, as far as I can tell you want to split out the proofs to make it
easier to manage maintenance mode operations in Eclipse.  I think that
can be done using very simple tools to convert between split files and
not-split files, while avoiding ecosystem fragmentation by only sharing
the not-split files. I think smm 0.2 could handle the splitting and
unsplitting with a <100 line program; I could do that if you'd like.

-sorear

Mario Carneiro

unread,
Apr 21, 2016, 8:07:36 AM4/21/16
to metamath
On Tue, Apr 19, 2016 at 5:35 AM, 'fl' via Metamath <meta...@googlegroups.com> wrote:

I suspect if you did this you would discover that, as bad as commodity programmer's editors are at handling 28MB files, JTextArea is worse.


Consider using real tools. Emacs manages a 28M file as if it had 100 bytes.

I'm glad we've come to an agreement that a proper tool is needed (and mmj2 does not fully fill this role). I'm personally not a fan of Emacs, but it has great customization options and a large programmer base, so I can't dismiss it out of hand. (But those crazy shortcuts...) Also, Lean is usually used with an Emacs plugin, which I will likely be working a lot with during my PhD, so I'll need to learn it eventually.



On Tue, Apr 19, 2016 at 11:28 PM, Stefan O'Rear <sor...@gmail.com> wrote:
On Tue, Apr 19, 2016 at 04:47:49AM -0400, Mario Carneiro wrote:
> On Tue, Apr 19, 2016 at 12:51 AM, Stefan O'Rear <stef...@cox.net> wrote:
>
> > What do we want to do in terms of platform issues?  I still feel like the
> > JVM is a major liability, but I see a shortage of time continuing for the
> > next few months so I'm not very excited to maintain my own system.
>
>
> I've been investigating a possible replacement for metamath.exe written in
> Rust (which targets the same platforms as C). As long as you are running
> local, it seems useful to just run as a command line tool, like a compiler,
> which is invoked by actual text editors to do the heavy lifting. The
> biggest problem with this model is the large startup cost, which would seem
> to suggest a more server-oriented approach in which the compiler runs as a
> service and updates its memory based on commands from the editor.

I've been playing with Rust for a few months now, and I've *already* been
thinking it'd be an excellent fit for a new metamath program.  Also provides
nicer multithreading than C (and *much* nicer than any current Javascript), and
there are quite a few places we can leverage that for improved responsiveness
in the editor...  Was going to try writing a one-shot verifier in Rust in a
week or so.

That's great. I'll open source my project soon, and then we can collaborate on this. (I'm calling it rmm.)

Given the low startup overhead of a pre-compiled program and the ability to
keep a cache file (see below), I think startup under 200ms is feasible.  Will
need to do experiments.

What does this "startup" entail? Is it reading the whole .mm file (or equivalent)? I suppose the minimum necessary for "ready to go" state is to have parses and frames for all the statements. Is this what you mean?
 
Proposal for SMM 0.3:

 * The Database is divided up into segments, which are coterminous with .mm
   files.  If proofs are split from statements, then each segment has a .mms
   and a .mmc file.

   It is possible to consider shorter segments, but some way to stably
   identify them would become necessary.

 * Each database has a "project file" which lists the used Database segments.
   Details TBD.

 * A segment must end with a newline, must consist of complete Metamath
   statements only, and may not contain unbalanced ${ $}.  Thus, the lexer
   state and the nesting state are both automatically known at the top of a
   segment.

 * $[ $] is forbidden if there are multiple segments.  Rationale: handling $[ $]
   requires state (the include list).  The exception is for compatibility.

What is this "multiple segments" exception? I notice it a few more times below. If there is only one "segment", are you trying to make it the same as parsing a single .mm file?
 
  * Top level $e statements are forbidden if there is more than one segment.
   (They create difficulty for frame construction).

 * Given a single segment in isolation, it can be parsed into Metamath
   statements and blocks.  For every assertion, we can identify the `$e`
   hypotheses (but not the `$f`s, because we still allow those at top level and
   they could be in scope from a former segment).  Record all used global names
   in an index.

 * Once we've scanned all segments, we can detect errors from name collisions.

 * Check all segments again.  Now that we have the full `$c`/`$v`/`$f` index,
   we can construct frames for all assertions.

 * At this point we have frames for all `$a` statements, so we can construct
   the grammar.

 * Check all segments again.  We have a grammar and can do a grammatical check;
   we also have frames for everything and can do a proof validity check.  This
   phase works locally within assertions so it can be incremental at that
   level.

I am fairly confident that all of this can be done in a strongly incremental
way: the only changes that will require scanning the full active database are
the ones that change the definition or parsing rules of a math token that is
actually used everywhere.

The key, I think, is figuring out how to do incremental changes in this setup. Let's number your phases, (1) gathering names (2) building frames (3) grammar check (4) proof check. We can make a cache file for each phase and each segment.

* The phase 1 cache is never invalidated by changes to other segments.
* The phase 2 cache is invalidated only by changes or removal of global $v and $f statements in previous segments.
* The phase 3 cache is invalidated by changes or removal of syntax $a statements in previous segments.
* The phase 4 cache is invalidated by changes or removal of $a and $p statements in previous segments.

Since the last two phases have no sequential dependencies, their caches can instead be kept at a per-statement level, where they are more discriminating:

* A single-statement phase 3 cache is invalidated by changes or removal of statements used in the parse.
* A single-statement phase 4 cache is invalidated by changes or removal of statements used in the proof.

At this point, the hard part is figuring out what caches to invalidate after a change, because of the complexity of the test. To that end I would introduce another cache, let's call it phase 5, which computes the used-by list for each statement. This is a straight inversion of the proof reference list, which depends only on the .mmc file, so it can be done in parallel with phase 1-4, although it may not make sense if one of the other phases fails. We have:

* A single-statement phase 5 cache is invalidated by addition or removal of statements using it.

The nice thing is that this cache is never affected by changes to .mms files, and it allows the efficient location of the phase 3,4 caches.
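
The phase 5 inversion itself is trivial. A sketch, assuming the .mmc data has already been decoded into a map from each label to the labels its proof references:

use std::collections::HashMap;

/// Invert "the proof of X references Y" into "Y is used by X".
fn build_used_by(
    references: &HashMap<String, Vec<String>>,
) -> HashMap<String, Vec<String>> {
    let mut used_by: HashMap<String, Vec<String>> = HashMap::new();
    for (label, refs) in references {
        for r in refs {
            used_by.entry(r.clone()).or_default().push(label.clone());
        }
    }
    used_by
}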

I think I finally see what you're getting at here.  You want to remove proofs
from the text that needs to be edited while maintaining the database, because
it's useless visual noise.  I want to keep proofs interleaved in the .mm
file(s) that we store and distribute because the proof file and the statement
file would otherwise have to be kept closely in sync, which is doable but a bit
annoying.  Undecided.

If you want to do this, I suggest a line-based "text" format, with one line per
proof, separated by equals signs, sorted, with the current format on the right,
e.g.:

id = ( wi ax-1 mpd ) AAABZAAACAECD
idd = ( wi id a1i ) BBCABDE

This should keep diff happy, even if the line length is a bit long.

I can back this proposal, since it's basically the same as the binary .mmc format, but presented in text form which should make Norm happy. However, I'm sure you noticed that I took the opportunity of a new format to subtly change the algorithm to eliminate "Z" markers, and I still want to try that out. Also, without the need for dodging text searches, I'd prefer to open up the range of valid characters in the compressed proof to base 64 (which does not contain spaces or newlines as data).

I redid the analysis with 64 and 26 characters maximum, and threw in the original algorithm as well (which uses two regions and an extra byte for marked steps), for some choices of parameters (for the new algorithm it is the triple a+b+c = n, for the old it is a+b = n-1):

Alg. | chars | params   | result
new  | 256   | 208,46,2 | 6085 K
old  | 256   | 253,2    | 6463 K
old  | 64    | 57,6     | 8198 K
old  | 64    | 56,7     | 8208 K
new  | 64    | 48,9,7   | 8300 K
old  | 26    | 19,6     | 10260 K
old  | 26    | 20,5     | 10282 K
new  | 26    | 16,2,8   | 10428 K

Interestingly, the gains made by the 'new' algorithm are lost in the smaller character sets. Note that the old algorithm beats the new in each case given a constant character set. (Also, it is interesting to note that 20,5 is not the optimal cutoff for the current setup, 19,6 beats it by a small margin, at least on set.mm's data.) Anyone willing to get behind 57,6 encoding (or 56,7 which falls on a power of 2) for compressed proofs in .mmc format?
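
For concreteness, here is the 'old' number encoding generalized to an (a, b) split over an arbitrary alphabet: the final character is one of the first a symbols (the least significant digit, values 1..a), and any preceding characters are bijective base-b digits drawn from the next b symbols, most significant first. With alphabet A..Y and (a, b) = (20, 5) this reproduces the current scheme; the actual 64-character alphabet for an .mmc variant is left open. A sketch:

/// Encode a 1-based step number with an (a, b) character split.
fn encode_number(n: u64, alphabet: &[u8], a: u64, b: u64) -> String {
    assert!(n >= 1 && (a + b) as usize <= alphabet.len());
    let mut out = Vec::new();
    out.push(alphabet[((n - 1) % a) as usize]); // least significant digit, 1..a
    let mut m = (n - 1) / a;
    while m > 0 {
        out.push(alphabet[(a + (m - 1) % b) as usize]); // bijective base-b digits
        m = (m - 1) / b;
    }
    out.reverse();
    String::from_utf8(out).unwrap()
}

fn main() {
    let abc = b"ABCDEFGHIJKLMNOPQRSTUVWXY";
    // 1 -> A, 20 -> T, 21 -> UA, 120 -> YT, 121 -> UUA
    for &n in &[1u64, 20, 21, 120, 121] {
        println!("{} -> {}", n, encode_number(n, abc, 20, 5));
    }
}

The Z back-reference marker and the choice of which steps to mark are left out; only the digit encoding changes between the parameter choices in the table.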
 
Secondary proposal: The stripped Metamath files should be called .mms, to solve
our problem with Myanmar.

Sounds good.

On Wed, Apr 20, 2016 at 12:14 AM, Stefan O'Rear <sor...@gmail.com> wrote:
On Wed, Apr 20, 2016 at 12:02:40AM -0400, Norman Megill wrote:
> Couldn't a syntax-aware editor simply have a mode that hides proofs?

I believe that is precisely what Mario was aiming at with "rather than fiddling

with collapsing markers to hide the proofs".

Whenever you present the user with a text area that almost-but-not-quite matches the actual text being saved, there is an increase in complexity regarding the handling of these "hidden regions". What happens when you type in the vicinity of a hidden region? How do you make sure that the data stays with the statement as the user types around it? By exporting this to another file, there is a clearly defined relationship between the two, which is unaffected by manual edits.

This is one of my targets with this whole business: good maintenance mode editing, with a strong interrelation with proof mode. Here's an example edit which I just did manually, for which a program would have been great:

Theorem A is going away, to be replaced by theorem B. But it's not a straight replacement. Instead, A is usually applied as a subtree of the form syl2anc[x, y, A] where x,y are proof subtrees, and I want to replace this whole tree fragment with some other tree fragment T[x, y] which matches the original (and T contains B somewhere, say T[x, y] = syl[syl2anc[x, y, B], C]). In other words, there is a straight replacement, but it is at the proof tree level instead of the theorem level.

I'm sure Norm knows some clever trick for doing this with minimize/allow_growth. A simple primitive operation which would accomplish the above is "inline theorem" and "extract theorem". Extract theorem T would look for subtrees in a theorem proof that match the proof of T (or perhaps just match the signature of T), and uses T instead. "Minimize with T" from MM-PA does essentially this. Conversely, "inline theorem T" would go to each use of T and insert the proof of T instead of referencing T; afterward T will not be referenced and can be deleted. There are some subtleties in choosing dummy variables in the proof, if there are any.
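
A sketch of the tree-level rewrite, with the ProofTree shape and the literal syl2anc/A/B/C labels as stand-ins for the real data structures and whatever replacement fragment is wanted (matching against the statement of T rather than a fixed pattern, and handling dummy variables, are the hard parts and are not shown):

/// Hypothetical uncompressed proof tree: a label applied to subproofs.
#[derive(Clone)]
struct ProofTree {
    label: String,
    args: Vec<ProofTree>,
}

/// Rewrite every subtree of the form syl2anc[x, y, A] into
/// syl[syl2anc[x, y, B], C], bottom-up, leaving everything else alone.
fn rewrite(t: &ProofTree) -> ProofTree {
    let args: Vec<ProofTree> = t.args.iter().map(rewrite).collect();
    if t.label == "syl2anc" && args.len() == 3 && args[2].label == "A" {
        let inner = ProofTree {
            label: "syl2anc".into(),
            args: vec![args[0].clone(), args[1].clone(),
                       ProofTree { label: "B".into(), args: vec![] }],
        };
        return ProofTree {
            label: "syl".into(),
            args: vec![inner, ProofTree { label: "C".into(), args: vec![] }],
        };
    }
    ProofTree { label: t.label.clone(), args }
}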
 
I'd kind of like to split up set.mm to make tools happier, but one of
the few ways set.mm is *not* like a program is that it either works
(verifies) or it doesn't.  With a program it's useful to have very
frequent commits because it makes it easier to understand "why" when a
regression is discovered, but set.mm has no regressions and so commits
every couple days are good enough for history-keeping.

I'm not sure I understand this. A program can fail in many places, and set.mm can also fail to compile in many places. If you want an example of a "run-time error", it's true that they are much less common in Metamath due to the stringent compilation, but it's not impossible. For example someone might try to use a theorem and discover that it no longer exists, or has the wrong signature, because some crazy guy is periodically reorganizing the library. They would then search the history for the theorem name, to find out what happened to it and what replaces it. The relationship to run-time errors here is that the database "works", it just doesn't work the way you think it does.

Mario, as far as I can tell you want to split out the proofs to make it
easier to manage maintenance mode operations in Eclipse.  I think that
can be done using very simple tools to convert between split files and
not-split files, while avoiding ecosystem fragmentation by only sharing
the not-split files.  I think smm 0.2 could handle the splitting and
unsplitting with a <100 line program; I could do that if you'd like.

If you have a self-contained program to do this, please share it, I'm sure I could make use of it.

I beg to differ regarding sharing not-split, though. The major benefits of splitting are in git maintenance and day to day work, so keeping it local largely defeats the purpose. Plus, I want to be able to choose the split file names, which isn't possible if the "default" is not-split. Instead, we should unsplit the git repo for producing deliverables such as the set.mm download on the website.

Stefan O'Rear

unread,
Apr 21, 2016, 12:17:59 PM4/21/16
to meta...@googlegroups.com
On Thu, Apr 21, 2016 at 08:07:35AM -0400, Mario Carneiro wrote:
> On Tue, Apr 19, 2016 at 11:28 PM, Stefan O'Rear <sor...@gmail.com> wrote:
> > I've been playing with Rust for a few months now, and I've *already* been
> > thinking it'd be an excellent fit for a new metamath program. Also
...
>
> That's great. I'll open source my project soon, and then we can collaborate
> on this. (I'm calling it rmm.)

The main disadvantage of something like Rust is the difficulty of
handling user extensions. Rust's learning curve is not the shallowest,
and the 100 MB compiler download and the need for frequent slow
recompiles will create a high barrier of entry for things like macros.
Most likely I'd wind up embedding Lua; now we have a different problem,
namely two parallel ecosystems of support code.

> > Given the low startup overhead of a pre-compiled program and the ability to
> > keep a cache file (see below), I think startup under 200ms is feasible.
> > Will need to do experiments.
>
> What does this "startup" entail? Is it reading the whole .mm file (or
> equivalent)? I suppose the minimum necessary for "ready to go" state is to
> have parses and frames for all the statements. Is this what you mean?

Yes. "md5sum set.mm" completes in about 50ms, so I think it'll take
about that long to segment the set.mm file, generate checksums of the
segments, and verify in a database that they haven't changed. There's a
lot of fiddliness in that design; see "forest for the trees" below.
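As a rough sketch of the check itself (hypothetical code; a real implementation would want a hash that is stable across runs and Rust versions):

use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

fn fingerprint(text: &str) -> u64 {
    let mut h = DefaultHasher::new();
    text.hash(&mut h);
    h.finish()
}

// `cache` maps segment index -> fingerprint recorded on the previous run;
// return the indices whose text no longer matches and must be re-checked.
fn changed_segments(segments: &[String], cache: &HashMap<usize, u64>) -> Vec<usize> {
    segments
        .iter()
        .enumerate()
        .filter(|&(i, text)| cache.get(&i) != Some(&fingerprint(text)))
        .map(|(i, _)| i)
        .collect()
}

fn main() {
    let segments = vec!["first segment text".to_string(), "second segment text".to_string()];
    let mut cache = HashMap::new();
    cache.insert(0usize, fingerprint(&segments[0]));
    // segment 1 has no cached fingerprint, so only it needs re-verification
    println!("{:?}", changed_segments(&segments, &cache));
}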

> > Proposal for SMM 0.3:
...
> > * $[ $] is forbidden if there are multiple segments. Rationale: handle $[ $]
> > requires state (the include list). The exception is for compatibility.
>
> What is this "multiple segments" exception? I notice it a few more times
> below. If there is only one "segment", are you trying to make it the same
> as parsing a single .mm file?

Yes. I'm reluctant to write a verifier that _cannot_ handle legal .mm
files just because they do things like declare `$e` statements in the
top-level scope.

> The key, I think, is figuring out how to do incremental changes in this
> setup. Let's number your phases, (1) gathering names (2) building frames
> (3) grammar check (4) proof check. We can make a cache file for each phase
> and each segment.
>
> * The phase 1 cache is never invalidated by changes to other segments.
> * The phase 2 cache is invalidated only by changes or removal of global $v
> and $f statements in previous segments.

No. The key to this entire programme is to maintain reverse mappings;
"all previous segments" will *never appear* in the final algorithm.
Remember, this is planning for a world where set.mm is 200+ MB, a world
where memories and disks have gotten larger but CPUs haven't gotten much
faster.

The phase 2 cache is invalidated when any math symbol which is used in
the segment is the subject of a changed global $v or $f in any other
segment.

> * The phase 3 cache is invalidated by changes or removal of syntax $a
> statements in previous segments.

Grammar check for a statement is invalidated if it contains all of the
`$c` symbols from an `$a` that was added or removed. (Yes, this means
that if you mess with `cv` you'll have a small problem.)

> * The phase 4 cache is invalidated by changes or removal of $a and $p
> statements in previous segments.

Proof check for a statement is invalidated by any change to the frames
of the statements to which the proof refers.
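
A toy sketch of the reverse-mapping idea (invented names; this is not smm's actual API): rather than scanning "all previous segments", each math symbol carries the set of segments that use it, and an edit only touches that set.

use std::collections::{HashMap, HashSet};

type SegmentId = usize;

// Reverse index: which segments use each math symbol.  Built during
// phases 1/2 and updated as segments are re-read.
struct ReverseIndex {
    users: HashMap<String, HashSet<SegmentId>>,
}

impl ReverseIndex {
    fn record_use(&mut self, symbol: &str, seg: SegmentId) {
        self.users.entry(symbol.to_string()).or_default().insert(seg);
    }

    // A global $v/$f for `symbol` changed in some segment: return the
    // segments whose phase 2 cache must be invalidated.
    fn invalidate_for(&self, symbol: &str) -> HashSet<SegmentId> {
        self.users.get(symbol).cloned().unwrap_or_default()
    }
}

fn main() {
    let mut idx = ReverseIndex { users: HashMap::new() };
    idx.record_use("ph", 3);
    idx.record_use("ph", 7);
    idx.record_use("A", 7);
    // the $f for "ph" changed somewhere: only segments 3 and 7 are touched
    println!("{:?}", idx.invalidate_for("ph"));
}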

I suspect you'll find that the differences in alphabet size are
essentially moot after running through any modern data compressor with a
Huffman coding pass. This email, after all, is an example of a (mostly)
27-coded file.

At this point in time I view substantial changes to the compressed proof
format itself as churn for churn's sake, and I'm still not sold.

> On Wed, Apr 20, 2016 at 12:14 AM, Stefan O'Rear <sor...@gmail.com> wrote:
>
> > On Wed, Apr 20, 2016 at 12:02:40AM -0400, Norman Megill wrote:
> > > Couldn't a syntax-aware editor simply have a mode that hides proofs?
> >
> > I believe that is precisely what Mario was aiming at with "rather than
> > fiddling with collapsing markers to hide the proofs".
>
> Whenever you present the user with a text area that almost-but-not-quite
> matches the actual text being saved, there is an increase in complexity
> regarding the handling of these "hidden regions". What happens when you
> type in the vicinity of a hidden region? How do you make sure that the data
> stays with the statement as the user types around it? By exporting this to
> another file, there is a clearly defined relationship between the two,
> which is unaffected by manual edits.

Agree.

> This is one of my targets with this whole business: good maintenance mode
> editing, with a strong interrelation with proof mode. Here's an example
> edit which I just did manually, for which a program would have been great:
>
> Theorem A is going away, to be replaced by theorem B. But it's not a
> straight replacement. Instead, A is usually applied as a subtree of the
> form syl2anc[x, y, A] where x,y are proof subtrees, and I want to replace
> this whole tree fragment with some other tree fragment T[x, y] which
> matches the original (and T contains B somewhere, say T[x, y] =
> syl[syl2anc[x, y, B], C]. In other words, there is a straight replacement,
> but it is at the proof tree level instead of the theorem level.
>
> I'm sure Norm knows some clever trick for doing this with
> minimize/allow_growth. A simple primitive operation which would accomplish
> the above is "inline theorem" and "extract theorem". Extract theorem T
> would look for subtrees in a theorem proof that match the proof of T (or
> perhaps just match the signature of T), and uses T instead. "Minimize with
> T" from MM-PA does essentially this. Conversely, "inline theorem T" would
> go to each use of T and insert the proof of T instead of referencing T;
> afterward T will not be referenced and can be deleted. There are some
> subtleties in choosing dummy variables in the proof, if there are any.

If smm 0.2 had a proof parser it could have handled that. We don't need
a new architecture for that.

> > I'd kind of like to split up set.mm to make tools happier, but one of
> > the few ways set.mm is *not* like a program is that it either works
> > (verifies) or it doesn't. With a program it's useful to have very
> > frequent commits because it makes it easier to understand "why" when a
> > regression is discovered, but set.mm has no regressions and so commits
> > every couple days are good enough for history-keeping.
> >
>
> I'm not sure I understand this. A program can fail in many places, and
> set.mm can also fail to compile in many places. If you want an example of a
> "run-time error", it's true that they are much less common in Metamath due
> to the stringent compilation, but it's not impossible. For example someone
> might try to use a theorem and discover that it no longer exists, or has
> the wrong signature, because some crazy guy is periodically reorganizing
> the library. They would then search the history for the theorem name, to
> find out what happened to it and what replaces it. The relationship to
> run-time errors here is that the database "works", it just doesn't work the
> way you think it does.

You could still do that with fairly large commits.

> > Mario, as far as I can tell you want to split out the proofs to make it
> > easier to manage maintainance mode operations in Eclipse. I think that
> > can be done using very simple tools to convert between split files and
> > not-split files, while avoiding ecosystem fragmentation by only sharing
> > the not-split files. I think smm 0.2 could handle the splitting and
> > unsplitting with a <100 line program; I could do that if you'd like.
> >
>
> If you have a self-contained program to do this, please share it, I'm sure
> I could make use of it.

I don't have one now, but I could write one in less time than I've spent
on this reply.

> I beg to differ regarding sharing not-split, though. The major benefits of
> splitting are in git maintenance and day to day work, so keeping it local
> largely defeats the purpose. Plus, I want to be able to choose the split
> file names, which isn't possible if the "default" is not-split. Instead, we
> should unsplit the git repo for producing deliverables such as the set.mm
> download on the website.

I think we're having a vocabulary problem because of the two kinds of
splitting.

If we go down the route of section splitting, then we should keep the
section files in git, for many reasons including "being able to choose
the split file names".

For binary splitting into set.mms and set.mmc, there's no particular
advantage to having two files in git rather than one, so I'd recommend
one.

-sorear

Stefan O'Rear

unread,
Apr 22, 2016, 12:50:15 AM4/22/16
to meta...@googlegroups.com
On Tue, Apr 19, 2016 at 08:28:52PM -0700, Stefan O'Rear wrote:
> Proposal for SMM 0.3:

I've made more progress on the API design for SMM0.3. Segments aren't
just a performance hack; some form of them is semantically necessary if
you have multiple .mm files because we need to keep track of what gets
saved where. SMM0.2's flat array of statements won't cut it here.

File inclusion (with the restrictions added in
330680035.12827.14329...@alum.mit.edu) and top level
$e are no longer a problem in the design; in fact, my preferred outcome
for set.mm is a top level file which includes a few dozen small files.

-sorear

Mario Carneiro

unread,
Apr 22, 2016, 1:18:59 AM4/22/16
to metamath
On Thu, Apr 21, 2016 at 12:17 PM, Stefan O'Rear <sor...@gmail.com> wrote:
On Thu, Apr 21, 2016 at 08:07:35AM -0400, Mario Carneiro wrote:
> On Tue, Apr 19, 2016 at 11:28 PM, Stefan O'Rear <sor...@gmail.com> wrote:
> > I've been playing with Rust for a few months now, and I've *already* been
> > thinking it'd be an excellent fit for a new metamath program.  Also
...
>
> That's great. I'll open source my project soon, and then we can collaborate
> on this. (I'm calling it rmm.)

The main disadvantage of something like Rust is the difficulty of
handling user extensions.  Rust's learning curve is not the shallowest,
and the 100 MB compiler download and the need for frequent slow
recompiles will create a high barrier of entry for things like macros.
Most likely I'd wind up embedding Lua; now we have a different problem,
namely two parallel ecosystems of support code.

One way to handle this is if the Rust verifier, which acts as a command line tool, accepts commands passed on stdin, and a separate program, written in another language (like node.js), produces those commands and is piped through. That way the verifier acts as the "kernel" and the macro language can support higher-level stuff. (The Rust verifier will also support higher level operations like atomic maintenance mode operations, but the set.mm-specific stuff can be farmed out.) The reason for using node.js instead of Lua is mainly to keep the macro language common with smm and mmj2.
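For illustration, the kernel's command loop could be as simple as the following sketch (every command name here is invented; the real protocol is entirely undecided):

use std::io::{self, BufRead, Write};

fn main() -> io::Result<()> {
    let stdin = io::stdin();
    let stdout = io::stdout();
    let mut out = stdout.lock();
    // line-oriented protocol: the macro process writes one command per
    // line on our stdin and reads the replies from our stdout
    for line in stdin.lock().lines() {
        let line = line?;
        let mut words = line.split_whitespace();
        match words.next() {
            Some("verify") => writeln!(out, "ok: would verify {:?}", words.next())?,
            Some("frame") => writeln!(out, "ok: would print frame of {:?}", words.next())?,
            Some("quit") | None => break,
            Some(cmd) => writeln!(out, "error: unknown command {}", cmd)?,
        }
    }
    Ok(())
}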

> > Given the low startup overhead of a pre-compiled program and the ability to
> > keep a cache file (see below), I think startup under 200ms is feasible.
> > Will need to do experiments.
>
> What does this "startup" entail? Is it reading the whole .mm file (or
> equivalent)? I suppose the minimum necessary for "ready to go" state is to
> have parses and frames for all the statements. Is this what you mean?

Yes.  "md5sum set.mm" completes in about 50ms, so I think it'll take
about that long to segment the set.mm file, generate checksums of the
segments, and verify in a database that they haven't changed.  There's a
lot of fiddliness in that design; see "forest for the trees" below.

> > Proposal for SMM 0.3:
...
> >  * $[ $] is forbidden if there are multiple segments.  Rationale: handle $[ $]
> >    requires state (the include list).  The exception is for compatibility.
>
> What is this "multiple segments" exception? I notice it a few more times
> below. If there is only one "segment", are you trying to make it the same
> as parsing a single .mm file?

Yes.  I'm reluctant to write a verifier that _cannot_ handle legal .mm
files just because they do things like declare `$e` statements in the
top-level scope.

Part of the idea behind labeling this as "Metamath 2.0" is so that we don't feel any particular draw to maintain perfect backward compatibility if it gets in the way. I have no compunctions about disallowing global $e statements; it seems entirely logical. (There's no way you can convince me that global $e is something you would ever want to do.) A more subtle question is global $d's, because I can actually imagine someone building a set.mm-like database and deciding to make all set variables disjoint from each other, once and for all.

> The key, I think, is figuring out how to do incremental changes in this
> setup. Let's number your phases, (1) gathering names (2) building frames
> (3) grammar check (4) proof check. We can make a cache file for each phase
> and each segment.
>
> * The phase 1 cache is never invalidated by changes to other segments.
> * The phase 2 cache is invalidated only by changes or removal of global $v
> and $f statements in previous segments.

No.  The key to this entire programme is to maintain reverse mappings;
"all previous segments" will *never appear* in the final algorithm.
Remember, this is planning for a world where set.mm is 200+ MB, a world
where memories and disks have gotten larger but CPUs haven't gotten much
faster.

The phase 2 cache is invalidated when any math symbol which is used in
the segment is the subject of a changed global $v or $f in any other
segment.

Fair enough. (Also, reordering of $f's will invalidate the phase 2 cache.) Actually, we don't even need $v's to invalidate the cache; if there are any problems with undefined variables they will already be caught when checking the corresponding $f.
 
> * The phase 3 cache is invalidated by changes or removal of syntax $a
> statements in previous segments.

Grammar check for a statement is invalidated if it contains all of the
`$c` symbols from an `$a` that was added or removed.  (Yes, this means
that if you mess with `cv` you'll have a small problem.)

I don't see the advantage of this over looking at the RPN parse string instead, which is much more accurate. If we assume the grammar is unambiguous, as long as the statements in the RPN are not changed, the parse will still be the correct one. (This does not do "empirical unambiguity testing" though, and relies on either trusting the user or proving unambiguity by means of a KLR parser or something similar.)
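A toy sketch of that check (invented types; this is neither mmj2's nor smm's code): store, next to each cached parse, a fingerprint of every syntax axiom it referenced, and revalidation is just a lookup.

use std::collections::HashMap;

struct CachedParse {
    // the RPN parse string itself would be stored here too; for the
    // validity check all we need is a fingerprint of each syntax axiom
    // the parse referenced when it was computed
    deps: HashMap<String, u64>,
}

// The parse stays valid as long as every syntax axiom it used still
// exists with the same form (same fingerprint).
fn still_valid(cache: &CachedParse, current: &HashMap<String, u64>) -> bool {
    cache.deps.iter().all(|(label, fp)| current.get(label) == Some(fp))
}

fn main() {
    // current fingerprints of the syntax axioms, indexed by label
    let current: HashMap<String, u64> =
        [("wi".to_string(), 11), ("wph".to_string(), 22)].into_iter().collect();
    let cached = CachedParse { deps: current.clone() };
    println!("{}", still_valid(&cached, &current)); // true until wi or wph changes
}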

> I'm sure Norm knows some clever trick for doing this with
> minimize/allow_growth. A simple primitive operation which would accomplish
> the above is "inline theorem" and "extract theorem". Extract theorem T
> would look for subtrees in a theorem proof that match the proof of T (or
> perhaps just match the signature of T), and uses T instead. "Minimize with
> T" from MM-PA does essentially this. Conversely, "inline theorem T" would
> go to each use of T and insert the proof of T instead of referencing T;
> afterward T will not be referenced and can be deleted. There are some
> subtleties in choosing dummy variables in the proof, if there are any.

If smm 0.2 had a proof parser it could have handled that.  We don't need
a new architecture for that.

Well, I think smm *is* the new architecture, provided it makes it to the big time. Although currently I don't really know how to use it for anything except one-shot verification (if that).

> I beg to differ regarding sharing not-split, though. The major benefits of
> splitting are in git maintenance and day to day work, so keeping it local
> largely defeats the purpose. Plus, I want to be able to choose the split
> file names, which isn't possible if the "default" is not-split. Instead, we
> should unsplit the git repo for producing deliverables such as the set.mm
> download on the website.

I think we're having a vocabulary problem because of the two kinds of
splitting.

If we go down the route of section splitting, then we should keep the
section files in git, for many reasons including "being able to choose
the split file names".

For binary splitting into set.mms and set.mmc, there's no particular
advantage to having two files in git rather than one, so I'd recommend
one.

By "splitting" above I was referring to segmenting the .mm file by section or similar. For separating the proofs from the statements, I agree that since the translation in each direction is so simple, it is probably easier to keep them all together, unless we get to the point where split-proof form is by far the most commonly used.

On Fri, Apr 22, 2016 at 12:50 AM, Stefan O'Rear <sor...@gmail.com> wrote:

On Tue, Apr 19, 2016 at 08:28:52PM -0700, Stefan O'Rear wrote:
> Proposal for SMM 0.3:

I've made more progress on the API design for SMM0.3.  Segments aren't
just a performance hack; some form of them is semantically necessary if
you have multiple .mm files because we need to keep track of what gets
saved where.  SMM0.2's flat array of statements won't cut it here.

File inclusion (with the restrictions added in
330680035.12827.14329...@alum.mit.edu) and top level
$e are no longer a problem in the design; in fact, my preferred outcome
for set.mm is a top level file which includes a few dozen small files.

Your link looks corrupted. Are there changes to the design from the earlier email?

I lost you at "my preferred outcome for set.mm is a top level file which includes a few dozen small files": does this differ from your earlier desire for a smaller upper bound on .mms file length, on the order of 200K?

Mario

Stefan O'Rear

unread,
Apr 22, 2016, 1:47:51 AM4/22/16
to meta...@googlegroups.com
On Fri, Apr 22, 2016 at 01:18:58AM -0400, Mario Carneiro wrote:
> On Thu, Apr 21, 2016 at 12:17 PM, Stefan O'Rear <sor...@gmail.com> wrote:
...
> One way to handle this is if the Rust verifier, which acts as a command
> line tool, accepts commands passed on stdin, and a separate program,
> written in another language (like node.js) produces those commands and is
> piped through. That way the verifier acts as the "kernel" and the macro
> language can support higher-level stuff. (The Rust verifier will also
> support higher level operations like atomic maintenance mode operations,
> but the set.mm-specific stiff can be farmed out.) The reason for using
> node.js instead of Lua is mainly to keep the macro language common with smm
> and mmj2.

That sounds really, really painful. What's the advantage at this point
over keeping the entire stack in Javascript?

> > Yes. I'm reluctant to write a verifier that _cannot_ handle legal .mm
> > files just because they do things like declare `$e` statements in the
> > top-level scope.
> >
>
> Part of the idea behind labeling this as "Metamath 2.0" is so that we don't
> feel any particular draw to maintain perfect backward compatibility if it
> gets in the way. I have no compunctions about disallowing global $e
> statements, it seems to be totally logical. (There's no way you can
> convince me that global $e is something you would ever want to do.) A more
> subtle question is global $d's, because I can actually imagine someone
> building a set.mm like database and deciding to make all set variables
> disjoint from each other, once and for all.

Fair point. I'm in favor of outlawing global $e and $d.

> > The phase 2 cache is invalidated when any math symbol which is used in
> > the segment is the subject of a changed global $v or $f in any other
> > segment.
>
> Fair enough. (Also, reordering of $f's will also invalidate the phase 2
> cache.) Actually, we don't even need $v's to invalidate the cache; if there
> are any problems with undefined variables they will already be caught when
> checking the corresponding $f.

Will have to think about that.

> > > * The phase 3 cache is invalidated by changes or removal of syntax $a
> > > statements in previous segments.
> >
> > Grammar check for a statement is invalidated if it contains all of the
> > `$c` symbols from an `$a` that was added or removed. (Yes, this means
> > that if you mess with `cv` you'll have a small problem.)
> >
>
> I don't see the advantage of this, over looking at the RPN parse string
> instead, which is much more accurate. If we assume the grammar is
> unambiguous, as long as the statements in the RPN are not changed, the
> parse will still be the correct one. (This does not do "empirical
> unambiguity testing" though, and relies on either trusting the user or
> proving unambiguity by means of a KLR parser or something similar.)

That would also preclude caching parse failures, which may or may not be
a problem. Not sure, will defer that decision until it's relevant.

> > If smm 0.2 had a proof parser it could have handled that. We don't need
> > a new architecture for that.
>
> Well, I think smm *is* the new architecture, provided it makes it to the
> big time. Although currently I don't really know how to use it for anything
> except one-shot verification (if that).

I don't follow "is the new architecture".

smm 0.2 doesn't have the proper plumbing to track which proofs came from
which files, so it's going to be much less useful in a many-small-proofs
world. I think I can fix that in smm 0.3, but it will break all API
consumers.

smm (any version) also does not currently have a proof object model.
There's been some discussion of that in the past; we had a rough
consensus that something like a MMP file was desired, but many details
were TBD and I never got around to implementing it.

> By "splitting" above I was referring to segmenting the .mm file by section
> or similar. For separating the proofs from the statements, I agree that
> since the translation in each direction is so simple, it is probably easier
> to keep them all together, unless we get to the point where split-proof
> form is by far the most commonly used.

In that case we are in complete agreement.

> On Fri, Apr 22, 2016 at 12:50 AM, Stefan O'Rear <sor...@gmail.com> wrote:
>
> > On Tue, Apr 19, 2016 at 08:28:52PM -0700, Stefan O'Rear wrote:
> > > Proposal for SMM 0.3:
> >
> > I've made more progress on the API design for SMM0.3. Segments aren't
> > just a performance hack; some form of them is semantically necessary if
> > you have multiple .mm files because we need to keep track of what gets
> > saved where. SMM0.2's flat array of statements won't cut it here.
> >
> > File inclusion (with the restrictions added in
> > 330680035.12827.14329...@alum.mit.edu) and top level
> > $e are no longer a problem in the design; in fact, my preferred outcome
> > for set.mm is a top level file which includes a few dozen small files.
> >
>
> Your link looks corrupted. Are there changes to the design from the earlier
> email?

That's not a Web link, it's the Message-ID of an email Norm sent to you
and me on 2015-05-29.

> I lost you at "my preferred outcome for set.mm is a top level file which
> includes a few dozen small files": does this differ from your earlier
> desire for a smaller upper bound on .mms file length, on the order of 200K?

No, it fully supports it.

I'm saying I want set.mm to become a long list of file inclusion
statements, able to be read and verified (but not rewritten!) by today's
metamath.exe.
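Something along these lines, in other words (the file names are placeholders, not a proposal for the actual split):

$( set.mm: top-level database; all content lives in the included files $)
$[ mm-logic.mm $]
$[ mm-zf.mm $]
$[ mm-numbers.mm $]
$[ mm-structures.mm $]
$[ mm-mathboxes.mm $]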

-sorear

Stefan O'Rear

unread,
Apr 22, 2016, 3:57:47 AM4/22/16
to meta...@googlegroups.com
On Fri, Apr 22, 2016 at 01:18:58AM -0400, Mario Carneiro wrote:
> One way to handle this is if the Rust verifier, which acts as a command
> line tool, accepts commands passed on stdin, and a separate program,
> written in another language (like node.js) produces those commands and is
> piped through. That way the verifier acts as the "kernel" and the macro
> language can support higher-level stuff. (The Rust verifier will also
> support higher level operations like atomic maintenance mode operations,
> but the set.mm-specific stiff can be farmed out.) The reason for using
> node.js instead of Lua is mainly to keep the macro language common with smm
> and mmj2.

I now understand that resolving this issue is not a blocker. I can
proceed with design and implementation of smm03-rust, and if we need a
JS version one can be written with much less effort than the initial
smm3 design.

-sorear

Mario Carneiro

unread,
Apr 22, 2016, 4:24:35 AM4/22/16
to metamath
On Fri, Apr 22, 2016 at 1:47 AM, Stefan O'Rear <sor...@gmail.com> wrote:
On Fri, Apr 22, 2016 at 01:18:58AM -0400, Mario Carneiro wrote:
> On Thu, Apr 21, 2016 at 12:17 PM, Stefan O'Rear <sor...@gmail.com> wrote:
...
> One way to handle this is if the Rust verifier, which acts as a command
> line tool, accepts commands passed on stdin, and a separate program,
> written in another language (like node.js) produces those commands and is
> piped through. That way the verifier acts as the "kernel" and the macro
> language can support higher-level stuff. (The Rust verifier will also
> support higher level operations like atomic maintenance mode operations,
> but the set.mm-specific stiff can be farmed out.) The reason for using
> node.js instead of Lua is mainly to keep the macro language common with smm
> and mmj2.

That sounds really, really painful.  What's the advantage at this point
over keeping the entire stack in Javascript?

Reciprocally, what is the advantage of not just keeping everything in Rust (or more generally, the same language as the verifier)? As you know, I added macros to mmj2, and in hindsight there isn't that much gain if the user isn't writing his own macros, and even if he is, this often overlaps with a need to add features or hooks to the main codebase, so there is low gain and high loss regarding multi-language development. Doing it all over again in Rust, I would instead have a whole new project for set-rmm as a specialization of rmm, written in the same language and reusing the library, with specific set.mm features built in (and with a configuration file for minor end-user tweaking). This would be a verifier that only works on set.mm-like databases. I think this better reflects the intent of separation of concerns between general metamath and set.mm, and skips the complexities of macro development.

If you still want it, it is possible to add JS/Lua macros on top of this, if you can get Rust to read an interpreted language. But in most of the immediate cases, I think just having a place to put set.mm-specific information is sufficient.

Why would you want to cache a parse failure? mmj2 just fatal-errors on the first bad parse, although I would extend this to a fatal error with a list of all bad parses. (More generally, if some stage fails, it is probably best to leave the caches that are blocked as a result alone, so that later segments will use the last-good cache. This seems preferable to having a whole cascade of failures.)

> > If smm 0.2 had a proof parser it could have handled that.  We don't need
> > a new architecture for that.
>
> Well, I think smm *is* the new architecture, provided it makes it to the
> big time. Although currently I don't really know how to use it for anything
> except one-shot verification (if that).

I don't follow "is the new architecture".

I mean that smm is the first new verifier since mmj2 to seriously consider these issues during the design phase, which I believe will yield a better final product. I also have some designs of my own, but smm is much further developed.
 
smm 0.2 doesn't have the proper plumbing to track which proofs came from
which files, so it's going to be much less useful in a many-small-proofs
world.  I think I can fix that in smm 0.3, but it will break all API
consumers.

That's okay, I don't think there are any, besides you. ;)

smm (any version) also does not currently have a proof object model.
There's been some discussion of that in the past; we had a rough
consensus that something like a MMP file was desired, but many details
were TBD and I never got around to implementing it.

Here is my conception of a proof object model. A proof file is a command stream for the command-line verifier, as in:

apply syl = |- ( ph -> ch )
apply thm1 = |- ( ph -> ps )
apply thm2 = |- ( ps -> ch )

The symbol streams after the = are optional, and are produced by proof object generation. This could be presented to the user as a script file, and they could modify the file, then execute it, and the editor would display the errors that result.

The prover state is an unordered collection of steps, in two types, "proven" (having a complete proof subtree) and "unproven" (incomplete or depending on an incomplete step). It also keeps a list of all the incomplete steps with no proof, and these are presented to the user as a message at each stage. The "apply thm" or "apply thm = ..." command, like MM-PA's, performs reverse proving, applying the given theorem to the last incomplete step, and producing new subgoals. For forward proving, the "have thm" or "have thm = ..." command will unify the given theorem and add it to the step list. A new unproven step can be created by "have ? = ...".

Steps can be labeled, by "name: apply thm" etc. To refer to a named step in an application, write "have thm step1 step2". Like mmj2, it will attempt to reorder the steps if necessary to unify, and it will also look for matching proofs from the "proven" list. (It will never automatically unify an unproven step, you have to name these explicitly.) All steps are aggressively deduplicated, so no statement will appear more than once in the list (it takes the one with the shortest proof if a duplication occurs later due to work variable specification). A proven step will of course take priority over an unproven step during deduplication.

Finally, there is a command "set W = ..." to set the value of a work variable. As in mmj2, this is not usually necessary, but may help to limit the possibilities for auto-generation.

Steps are automatically numbered if they are not given names, so in this respect the overall structure ends up quite similar to mmj2's. But the user will not work directly with an .mmp file that is being simultaneously edited by the computer; instead, it is closer to a human-written script file, using an editor with syntax checking for live feedback on the current machine state.
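As a purely illustrative fragment (hypothetical syntax, just combining the commands described above, and reusing the thm1/thm2 names from the earlier example), the same little proof written forward-style with named steps might read:

h1: have thm1 = |- ( ph -> ps )
h2: have thm2 = |- ( ps -> ch )
have syl h1 h2 = |- ( ph -> ch )

where the "= ..." annotations are again optional and would be filled in by proof object generation.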

At any point, the current state can be exported as a script file, which will insert annotated "have" and "apply" commands to get the right state. (Perhaps there can be an option for whether to prefer forward or reverse proving; for mmp-like formatting, with the last step at the bottom, you would want forward proving.)

This setup is quite obviously inspired by the "tactic mode" of Lean, Isabelle, Coq, HOL Light. What do addons look like here? The power of this setup is of course that one can implement commands other than "have", "apply", "set". More command ideas:

* "hyp": the same as "have", but only matches theorem hypotheses
* "by <prover>": invoke a specific named algorithm (which might be built in or supplied by JS if available) which does its best to do something with the goal.
* "back": These algorithms will invariably be too aggressive, and the "back" command will undo the proof of the parent (or all direct parents) of the current step. Alternatively, using the other commands but using the same step name as an existing step will replace that step with a new proof.

One set.mm-specific feature which I would like to investigate is deduction mode, which interprets Horn clauses as deductions. That is, if the current state is

a: hyp = |- ( ph -> B e. RR )
b: have peano2re = |- ( B e. RR -> ( B + 1 ) e. RR )
c: have ? = |- ( ( ph /\ A e. RR ) -> ( A + B ) e. RR )

it detects that the current context is ( ph /\ A e. RR ), which directly contains "ph" and "A e. RR". At the next reachability level, it sees that all the hypotheses of step 'a' are in the context, so "B e. RR" is also true, and at one further level, we have that "( B + 1 ) e. RR" is true. Thus, the effective proof state at c is:

  |- ph
  |- A e. RR
a: |- B e. RR
b: |- ( B + 1 ) e. RR
=> |- ( A + B ) e. RR

The first two elements are the "hypotheses" (with automatic names), and the other two are "proven steps", which are named after the steps that produce them (there are usually a lot of these so they are not shown). They are all accessible to proof via the deduction-aware "apply" command, so that we can write

apply readdcl

and it will look at the structure of the theorem, |- ( ( ?A e. RR /\ ?B e. RR ) -> ( ?A + ?B ) e. RR ), as a Horn clause, producing the decomposition |- ?A e. RR & |- ?B e. RR => |- ( ?A + ?B ) e. RR. (It first tries to match the full implication against |- ( ( ph /\ A e. RR ) -> ( A + B ) e. RR ) and then |- ( A + B ) e. RR but those fail, and this is next.) It unifies A = ?A and B = ?B so it attempts to find |- A e. RR and |- B e. RR in the list of proven steps, and it will find them and finish the proof.

> On Fri, Apr 22, 2016 at 12:50 AM, Stefan O'Rear <sor...@gmail.com> wrote:
>
> > On Tue, Apr 19, 2016 at 08:28:52PM -0700, Stefan O'Rear wrote:
> > > Proposal for SMM 0.3:
> >
> > I've made more progress on the API design for SMM0.3.  Segments aren't
> > just a performance hack; some form of them is semantically necessary if
> > you have multiple .mm files because we need to keep track of what gets
> > saved where.  SMM0.2's flat array of statements won't cut it here.
> >
> > File inclusion (with the restrictions added in
> > 330680035.12827.14329...@alum.mit.edu) and top level
> > $e are no longer a problem in the design; in fact, my preferred outcome
> > for set.mm is a top level file which includes a few dozen small files.
> >
>
> Your link looks corrupted. Are there changes to the design from the earlier
> email?

That's not a Web link, it's the Message-ID of an email Norm sent to you
and me on 2015-05-29.

Since this is a public discussion and I don't think I will get into trouble posting this, here is the conversation Stefan is referring to:

On Fri, May 29, 2015 at 6:19 PM, Norman Megill <n...@alum.mit.edu> wrote:
On 5/29/15 3:42 PM, Stefan O'Rear wrote:
> On Fri, May 29, 2015 at 11:34:47AM -0400, Norman Megill wrote:
>> To summarize, your edge cases are all flagged as errors in the current
>> metamath program, and the actual internal behavior ignoring the messages
>> may or may not be what was intended.
>>
>> Certainly I need to clarify the spec.  If you think the behavior should
>> be different from what I described, let me know.  My goal is to keep
>> both the spec and the implementation as simple as possible.
>>
>> Another issue is how the other half-dozen verifiers handle these cases;
>> I don't know.
>
> Soon to be a half-dozen plus one.  My goals are for the behavior to be simple
> and fully specified.  The code I have so far will flag an error if a comment or
> $[ trails off the end of a file, so it seems like that much is agreed on.  I
> kind of like Mario's suggestion to only allow $[ $] where $c is currently
> allowed, although I don't beleive any verifier uses that rule currently.

I agree.  I'll make it official that $[ $] is allowed only in the outermost
scope i.e. not between ${ $}.

I'll also add that comments have highest parse precedence and that an
incomplete comment may not occur in an included file.

I'll also add that $[ $] may not occur inside of a statement (e.g. may
not occur between the label of a $a statement and its $.) and that an
incomplete $[ $] may not occur in an included file.

I'll also add that an incomplete statement (e.g. $a but no $.) may not
occur in an included file.  This might be implied by the above
depending on how you read it (since the end of the included file
would be "inside" of a statement), but better to make it explicit.

Have I missed anything?

Norm


 

fl

unread,
Apr 22, 2016, 7:37:07 AM4/22/16
to Metamath
 
Reciprocally, what is the advantage of not just keeping everything in Rust


I don't want to cool down your sudden affection for a new brand of language, but I'd like to remind you that your main
problem is to get a GUI good enough to manage large files efficiently. I doubt the GUI provided by Rust is better than the one
provided by Java.

-- 
FL

Mario Carneiro

unread,
Apr 22, 2016, 8:40:09 AM4/22/16
to metamath
I don't appreciate your omission of the remark immediately after that quote. My "sudden affection" is not principally for a new language, but for a new architecture; frankly the language is largely irrelevant. The "GUI provided by Rust" is of course not better than java's, because it doesn't have a GUI. That thinking is what got us into this mess in the first place. I have no interest in reinventing the GUI wheel, and would much prefer to plug in to someone else's architecture. With the GUI removed, all that remains is a strong backend, and the fastest application programming languages are those that compile down to actual machine code, without a lot of translation - and it just so happens that Rust focuses on this side of things. (C would also work for this purpose, but I've never really liked writing C code, speaking only for myself. I'm sure there are plenty of Rust vs. C comparisons if you want to look at this carefully.)

But again, my main interest is in a better architecture; small differences between languages are dwarfed by efficiencies from a better approach. I would be happy to accept smm in node.js or any other language if it can deliver the performance and usability benefits over existing tools that seem possible.

Mario

Norman Megill

unread,
Apr 22, 2016, 8:44:39 AM4/22/16
to meta...@googlegroups.com
On 4/22/16 1:47 AM, Stefan O'Rear wrote:
> On Fri, Apr 22, 2016 at 01:18:58AM -0400, Mario Carneiro wrote:
>> On Thu, Apr 21, 2016 at 12:17 PM, Stefan O'Rear <sor...@gmail.com> wrote:
...
>> Part of the idea behind labeling this as "Metamath 2.0" is so that we don't
>> feel any particular draw to maintain perfect backward compatibility if it
>> gets in the way. I have no compunctions about disallowing global $e
>> statements, it seems to be totally logical. (There's no way you can
>> convince me that global $e is something you would ever want to do.) A more
>> subtle question is global $d's, because I can actually imagine someone
>> building a set.mm like database and deciding to make all set variables
>> disjoint from each other, once and for all.
>
> Fair point. I'm in favor of outlawing global $e and $d.

I'm fine with disallowing global $e but think we should keep global
$d's. See
https://groups.google.com/d/msg/metamath/A2p5GPDwXxw/JIn_Z3YVCAAJ for
how this can be used to emulate individual variables (instead of
metavariables ranging over them).

Here is a collection of proposed changes and clarifications to the
Metamath 1.0 spec collected from various emails.

1. Comments have highest parse precedence e.g. "$( $[ $)" is a comment
and does not violate item 4 below. Nested comments will continue to be
disallowed e.g. "$( $( $)" is not allowed.

2. $[ ... $] is allowed only in the outermost scope i.e. not between ${ $}.

3. $[ ... $] may not occur inside of a statement (e.g. may not occur
between the label of a $a statement and its $.).

4. $( without a matching $) or $[ without a matching $] may not occur in
an included file.

5. An incomplete statement (e.g. $a but no $.) may not occur in an
included file.

6. The type declared for a variable by a $f statement is effectively
global, even if the variable is not (e.g. "wff P" in one local scope
and "class P" in another is not allowed).

7. A $e statement may not occur in the outermost scope.

Norm

fl

unread,
Apr 23, 2016, 6:35:17 AM4/23/16
to Metamath
Hi Norm,


> Here is a collection of proposed changes and clarifications to the
> Metamath 1.0 spec collected from various emails.


I've not followed the debate carefully, and I hope I'm not wide of
the mark, but here is an evolution I would like.

${  <- beginning of a group. Only one is possible. No nested curly braces.

   aa $e |- A = ....
   ab $e |- B = ....
   ac $e |- C = ....
   ad $e |- D = ....

    $d .... $. <- refers to ba only.
    $( $)
    ba $p |- .... uses only aa and ac ....

    $d .... $. <- refers to bb only.
    $( $)
    bb $p |- .... uses only ad

    $( $)
    bc $p |- .... uses only ac and ad

$}


#1 On an HTML page only the premises used by a theorem are shown.

#2 In a proof only the premises really used are referenced.

#3 $d statements only concern the next $p statement.

--
FL

Mario Carneiro

unread,
Apr 23, 2016, 6:50:48 AM4/23/16
to metamath
On Sat, Apr 23, 2016 at 6:35 AM, 'fl' via Metamath <meta...@googlegroups.com> wrote:
Hi Norm,

> Here is a collection of proposed changes and clarifications to the
> Metamath 1.0 spec collected from various emails.


I've not followed carefully the debate and I hope I'm not wide of
the mark but here is an evolution I would like.

#1 On a html page only the premises used by a theorem are showed.

#2 In a proof only the  premises  really used are referenced.

I am strongly opposed to this. This means that you have to read and verify the proof in order to figure out its frame, which totally kills parallel processing of proofs and separation of maintenance and proof mode.

Without "automatic premise selection", limiting groups to one nesting level becomes too restrictive.
 
#3 $d statements only concern the next $p statement.

This isn't a big deal, but I don't see any advantage. Particularly if you are using a group to share $e's, you will often want DV pairs in the variables of these shared $e's, which will then be repeated for each proof. In fact, I tend to go in the complete opposite direction, collecting all $d's in a group right at the start (of the outermost level), unless there is a specific $d I need to avoid that would be over-shared by this. This always gives the shortest specification, when it is possible. Conversely, totally unsharing $d's will make the specification longer for no apparent benefit.
 
Mario

Mario Carneiro

unread,
Apr 23, 2016, 6:54:06 AM4/23/16