Modularity for set.mm; interface design for the "perfect editor"


Mario Carneiro

Apr 16, 2016, 6:53:12 AM4/16/16
to metamath


On Sat, Apr 16, 2016 at 1:09 AM, Norman Megill <n...@alum.mit.edu> wrote:
On 4/15/16 12:25 AM, Marnix Klooster wrote:
Hi all,

I'm not invested in any design direction here, but here is just one
remark from my side.

It seems to me that the goal here is to have some modularity and
information hiding, and for that one normally uses mechanisms like
scoping and modules, instead of human readable comments (in English even).

Therefore IMHO the current design should either be really lightweight,
to be replaced by a real 'module system' later; or the current design
should already be a step in the direction of said module system.

(Goals of such a module system would be to break up set.mm
<http://set.mm> in more decently sized pieces; and to allow people to
use, e.g., the logic part of set.mm <http://set.mm> without any set
theory or anything else.)

[...]

Getting back to Marnix's comment, our general philosophy (outside of the special cases above) has been to make available everything that came previously in order to achieve efficient proofs and provide the user with the most flexibility.  If someone wishes to add to the logic theorems in set.mm, there is no reason not to make it part of set.mm itself, say in a mathbox, so that everyone will benefit.

[...]

As for modularity, in set.mm you will find commented out file includes for things like logic.mm.  At one point I had several pieces, but they were a pain to keep in sync (without special software that didn't exist and still doesn't).  I think anyone who has done major work on set.mm will agree that it is very convenient to have everything in one file for global label/symbol changes, moving sections around, etc. with a text editor.  Eventually set.mm will have to be broken up, but we aren't at that point yet.

Actually, I've been brainstorming how to fix this: what should the perfect Metamath editor look like? It's quite clear to me, at least, that we are not there, and having one 16MB file is a symptom of this. Most file editors are just not designed for this, and it is not good to need to read the whole file and store it in memory in order to get to a single theorem. Edits to an .mm file are almost always local, touching only a tiny part of that giant file, and splitting it up is one way to focus development a bit more.

Could you repeat the reasons why you think having set.mm in one file is a good thing?

1. Moving sections around is trivial if sections are in separate files: it just means moving a file and possibly changing the dependencies. Moving a theorem or range of theorems works the same with multiple files: just cut and paste.

2. Global renames work just as well with multiple files; almost all file editors have a "find/replace in files" feature, which would do the trick. Ideally this should be done by the aforementioned "perfect metamath editor", though, so that it can be syntax-aware and only do the replacement where it is syntactically valid.


If you could start over from scratch, that is, some kind of "Metamath 2.0", how would you design the interface? What is the core structure of MM file data? Some thoughts:

* The basic design is a sequence of theorems with possibly shared environments. Within each theorem, though, the data structure is something complex, which we have decided not to store in plain text in the present version, necessitating interaction with a proof editor in any case. (That is, a user cannot practically write a complete mm file by hand, and must resort to a program to write the proof parts.)

* There is a logical two-level interaction pattern with data in the file:
1. Maintenance mode: Adding and removing theorems, changing theorem statements, changing hypotheses and disjoint variable conditions.
2. Proof mode: Writing proof bodies.

* The raw MM file presentation is suboptimal: on the one hand, users must interact with it directly for maintenance mode yet are presented with human-unreadable data in the proofs, while on the other hand, existing tools support proof mode work but do not handle any maintenance mode activities.

* An easy presentation format for maintenance mode is the raw MM file, but with proofs collapsed or omitted. The operations on such a representation might also be restricted beyond arbitrary text edits, to a more well-formed set of operations such as adding and removing groups, and moving statements around as atomic elements. A graphical interface for this might work, but it would need a lot of usability testing to get right.

* For proof mode, the best current contender is the MMP file format, but mmj2 is the only program that can read it, and there is no connection between this and maintenance mode in mmj2 (in fact mmj2 does not edit the MM file at all). I don't think this is the best direction in order to support proof scripts in the future, and I would support something more like a sequence of commands. (Tactic modes exist in Lean, Coq, HOL Light, Isabelle and I can explain them in more detail if anyone doesn't know how they work.) An MMP style output could be written as a sequence of ` apply xxx "|- math" ` commands like those used in MM-PA (the math strings are optional for user input).

* Linking in to an existing IDE like Eclipse is an easy way to get a lot of nice features like syntax highlighting out of the box. But any attempt to do this in maintenance mode is severely restricted by the sheer size of the file, even if proofs are omitted. Eclipse assumes that a software project has a modular structure, and I daresay that most other IDEs do as well. Even simple section-by-section splits would mostly resolve this issue.

* A primary problem that is not currently addressed by mmj2 is the ability to change the state of the database while the program is running (that is, maintenance mode changes affecting proof mode). MM-PA is able to handle this well, although it does not allow many maintenance mode changes in the first place. A good editor would have both modes and could use each to affect the other.


I'd like to collect opinions on what a good editor should look like, and establish known shortcomings of existing methods, to get a roadmap for future editors. What does your "perfect Metamath editor" look like?

Mario

fl

Apr 16, 2016, 7:21:30 AM4/16/16
to Metamath

Actually, I've been brainstorming how to fix this, what the perfect metamath editor should look like. It's quite clear to me, at least, that we are not there, and having one 16MB file is a symptom of this.

A normal database is big. Having only one file has its benefits. You can download it easily. You can
use an editor to search for proofs. Interfacing Metamath with powerful existing text tools is very good.
Parceling out the file would make everything more restrictive.

If you want to change the format, I suggest that the theorems themselves be kept in one place. Maybe
the proofs could be placed somewhere else.

--
FL

fl

Apr 16, 2016, 7:40:10 AM4/16/16
to Metamath

If you want to change the format I suggest that the theorems themselves are kept in one place. Maybe you
may place the proofs somewhere else.

I think that mathematics is linear in its essence. This is currently the way Metamath
manages it: one theorem after the other. A structure in the form of a net would make everything
more difficult to understand and manage. With a net you never know where things are.

But a simple db2 database could be considered. I'm not sure, however, that the potential benefits are worth the trouble.

The most serious current drawbacks in Metamath are the time required to minimize proofs and to generate
HTML pages. Maybe additional data structures and caches could be added to reduce those times.

--
FL

fl

Apr 16, 2016, 7:43:08 AM4/16/16
to Metamath
And for the perfect editor, I think that better-designed algorithms and routines can help to achieve
such a goal.

(In particular, the ability to add keywords to theorems so as to find them more easily.)

--
FL

Norman Megill

Apr 16, 2016, 10:17:04 PM4/16/16
to meta...@googlegroups.com
> logic theorems in set.mm <http://set.mm>, there is no reason not to
> make it part of set.mm <http://set.mm> itself, say in a mathbox, so
> that everyone will benefit.
>
> [...]
>
> As for modularity, in set.mm <http://set.mm> you will find commented
> out file includes for things like logic.mm <http://logic.mm>. At
> one point I had several pieces, but they were a pain to keep in sync
> (without special software that didn't exist and still doesn't). I
> think anyone who has done major work on set.mm <http://set.mm> will
> agree that it is very convenient to have everything in one file for
> global label/symbol changes, moving sections around, etc. with a
> text editor. Eventually set.mm <http://set.mm> will have to be
> broken up, but we aren't at that point yet.
>
>
> Actually, I've been brainstorming how to fix this, what the perfect
> metamath editor should look like. It's quite clear to me, at least, that
> we are not there, and having one 16MB file is a symptom of this. Most
> file editors are just not designed for this, and it is not good to need
> to read it all and store it in memory in order to get to a single
> theorem.

I'm not sure why it's not good. On today's computers, 28MB is nothing
compared to the available 4 to 8GB of memory. If an editor has trouble
with a file of that size, maybe the editor design is not good. While I
haven't checked
https://en.wikipedia.org/wiki/Comparison_of_text_editors#Basic_features
carefully, the currently supported ones I recognize handle several GB or
more. Emacs supposedly handles 2EB (10^9 GB).

The main editor I use reads set.mm in well under 1 sec. When I search
for a string at the end of the file it gets there essentially instantly.
I just tried a search and replace of the 27K instances of '$p' with a
larger string, and it completed before I could lift my finger from the
mouse. When I tried browsing a 2.1GB ASCII file of a human genome, it
became a little sluggish, taking 6 sec to find a string at the end of
the file.

> Edits to an .mm file almost always are local, touching only a
> tiny tiny part of that giant file, and splitting it up is one way to
> focus development a bit more.
>
> Could you repeat the reasons why you think having set.mm <http://set.mm>
> in one file is a good thing?

Simplicity. Speed. Lowest common denominator. Convenience.

> 1. Moving sections around is trivial if sections are in separate files,
> it just means moving a file and possibly changing the dependencies,
> while moving a theorem or range of theorems is just the same with
> multiple files, just cut and paste.

How do you prevent undo/redo confusion when switching among various files?

> 2. Global renames are completely identical with multiple files; almost
> all file editors have a "find/replace in files" feature, which would do
> the trick.

What if we forget to load one of them into the editor?

Anyway, the main editor I use doesn't have this feature. I could change
editors, but it would be asking me to sacrifice something I'm
comfortable with.

> Ideally this should be done by the aforementioned "perfect
> metamath editor", though, so it can be more syntax-aware and only do the
> replacement where it is syntactically valid.

The multiple-module feature you want could be done now with the $[ $]
includes.

For those wanting a single file, the includes would be commented out.
For those wanting multiple files, the includes would be uncommented.

We'd have to agree on the commenting convention, but I imagine it would
be similar to what is in set.mm now if you search for '$( $['.
Conversion between single and multiple would of course be automated in
metamath.exe or another program.

I think something like this should satisfy everyone.
FYI the current plans on my to-do list for metamath.exe include a change
to its data structure to allow any part of a statement to be modified
internally.

Applications I have in mind are global label and math token renames,
checking for conflicts before applying and also updating comment markup
and htmldefs. This will prevent accidental changes to English words
that might happen in an editor. Maybe I'll update the Recent Label
Changes automatically.

I don't have plans at this point to provide for insertion, deletion, or
changing the order of statements, or to provide for editing comments.

> I'd like to collect opinions on what a good editor should look like, and
> establish known shortcomings of existing methods, to get a roadmap for
> future editors. What does your "perfect Metamath editor" look like?

These are interesting ideas for a Metamath 2.0, and I'd like to hear
other opinions too.

As a general comment, I think it would be useful for Metamath 2.0 to be
translatable to and from the current version. If 2.0 needs additional
keywords for its features, maybe they could be hidden in special
comments in the 1.0 version so the versions could be completely
compatible. If 2.0 has say local scoping of things like labels, the
translator could rename them for the 1.0 version with some standard
suffix to make them unique. The tools we have for 1.0 (metamath.exe,
numerous verifiers) may or may not be practical to rewrite for 2.0, but
with perfect translatability they could continue to be used.

And ideally 2.0 could be verified with a 74-line program. :)

Norm

Mario Carneiro

Apr 17, 2016, 6:24:41 AM4/17/16
to metamath
On Sat, Apr 16, 2016 at 10:16 PM, Norman Megill <n...@alum.mit.edu> wrote:
On 4/16/16 6:53 AM, Mario Carneiro wrote:
Actually, I've been brainstorming how to fix this, what the perfect
metamath editor should look like. It's quite clear to me, at least, that
we are not there, and having one 16MB file is a symptom of this. Most
file editors are just not designed for this, and it is not good to need
to read it all and store it in memory in order to get to a single
theorem.

I'm not sure why it's not good.  On today's computers, 28MB is nothing compared to the available 4 to 8GB of memory.  If an editor has trouble with a file of that size, maybe the editor design is not good.  While I haven't checked https://en.wikipedia.org/wiki/Comparison_of_text_editors#Basic_features carefully, the currently supported ones I recognize handle several GB or more.  Emacs supposedly handles 2EB (10^9 GB).

The thing is that it's not just 28MB (wow, it's grown a lot since I last checked). An editor will likely need several times that in order to store not just the raw text, but also the internal processed form of the data. Remember that the "perfect Metamath editor" will not be a plain text editor, because that doesn't correctly reflect the structure of Metamath data.

One interesting data structure I'd like to focus on is the information needed by proof mode. Roughly, this is the MM file with proofs omitted. I just did a test, stubbing out all the proofs, and the result is about 12MB which is alarmingly large. Removing comments brings it down to 5.8MB. I think there is a good argument for separating this to a different file from the proofs, so that an editor need not read the proofs if one is only interested in the "context" of one proof. This structure is also useful as the context needed for proof HTML generation.

The main editor I use reads set.mm in well under 1 sec.  When I search for a string at the end of the file it gets there essentially instantly.  I just tried a search and replace of the 27K instances of '$p' with a larger string, and it completed before I could lift my finger from the mouse.  When I tried browsing a 2.1GB ASCII file of a human genome, it became a little sluggish, taking 6 sec to find a string at the end of the file.

You have managed to get impressively far using only plain text tools, and the mm file format is somewhat tailored for use this way (e.g. spaces around label names everywhere). But I think that Metamath 2.0 should not rely on this, and use instead a more structured format, with tools that respect this format. (This might still basically be the current mm format, but the tool would not just be a text editor.)

Edits to an .mm file almost always are local, touching only a
tiny tiny part of that giant file, and splitting it up is one way to
focus development a bit more.

Could you repeat the reasons why you think having set.mm <http://set.mm>
in one file is a good thing?

Simplicity.  Speed.  Lowest common denominator.  Convenience.

I will certainly grant that it is easy to use a text editor for maintenance activities, when it can be done. The major drawback is that there are certain maintenance activities that can't be done with find/replace, and others that "almost" work but need some cleanup afterwards. A proper tool to do these refactorings in a correctness-preserving manner is warranted.
 
1. Moving sections around is trivial if sections are in separate files,
it just means moving a file and possibly changing the dependencies,
while moving a theorem or range of theorems is just the same with
multiple files, just cut and paste.

How do you prevent undo/redo confusion when switching among various files?

Some editors do shared undo, and usually a theorem move only involves one or two files anyway. But I think the correct solution here is just to commit often, and use the commits as cross-file undo checkpoints.
 
2. Global renames are completely identical with multiple files; almost
all file editors have a "find/replace in files" feature, which would do
the trick.

What if we forget to load one of them into the editor?

Usually you don't have to; my editor allows you to select a directory to search, and it will automatically open all the files that need to change, or you can just change them in place without opening them. A good maintenance tool for Metamath will search for and update all downstream files under a global rename (because otherwise it wouldn't be correctness-preserving).

Anyway, the main editor I use doesn't have this feature.  I could change editors, but it would be asking me to sacrifice something I'm comfortable with.

Like I said, my goal is to figure out what features we need in an editor of the future, so that you don't have to sacrifice anything.
 
Ideally this should be done by the aforementioned "perfect
metamath editor", though, so it can be more syntax-aware and only do the
replacement where it is syntactically valid.

The multiple-module feature you want could be done now with the $[ $] includes.

For those wanting a single file, the includes would be commented out.
For those wanting multiple files, the includes would be uncommented.

We'd have to agree on the commenting convention, but I imagine it would be similar to what is in set.mm now if you search for '$( $['. Conversion between single and multiple would of course be automated in metamath.exe or another program.

I think something like this should satisfy everyone.

A simple short term solution would be a tool to separate set.mm by section, and daisy-chain the files together with include statements. If this can be done in a perfectly reversible manner, then we could pass between each form painlessly. That is, you could take:

$( =-=- section 1 =-=- $)
thm1 $p ...
$( =-=- section 2 =-=- $)
thm2 $p ...
$( =-=- section 3 =-=- $)
thm3 $p ...

from set.mm, and form the files section1.mm:

$( =-=- section 1 =-=- $)
thm1 $p ...

section2.mm:

$[ section1.mm $]

$( =-=- section 2 =-=- $)
thm2 $p ...

section3.mm:

$[ section2.mm $]

$( =-=- section 3 =-=- $)
thm3 $p ...
 

I believe Stefan has already used such a tool for the set.mm-history project.
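For concreteness, here is a rough sketch (in Python, not any existing tool) of the kind of reversible splitter I have in mind. It keys on the toy "=-=-" headers from the example above rather than the real set.mm section-header decoration, and it leaves out choosing file names and generating the $[ ... $] include chain:

import re

# Section headers as written in the example above (a real tool would match
# the actual set.mm header decoration instead).
SECTION = re.compile(r"^\$\( =-=- (.*?) =-=- \$\)")

def split_sections(text):
    """Yield (section_title, section_body) pairs from a single .mm source."""
    title, body = "preamble", []
    for line in text.splitlines(keepends=True):
        m = SECTION.match(line)
        if m:
            if body:
                yield title, "".join(body)
            title, body = m.group(1), []
        body.append(line)
    if body:
        yield title, "".join(body)

def join_sections(bodies):
    """Concatenating the pieces in order recovers the original file exactly."""
    return "".join(bodies)

Round-tripping is then just checking that join_sections(body for _, body in split_sections(text)) equals the original text; the only extra work for the multi-file form is writing each body to its own file and prepending the include of the previous one.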

FYI the current plans on my to-do list for metamath.exe include a change to its data structure to allow any part of a statement to be modified internally.

Applications I have in mind are global label and math token renames, checking for conflicts before applying and also updating comment markup and htmldefs.  This will prevent accidental changes to English words that might happen in an editor.  Maybe I'll update the Recent Label Changes automatically.

Right, this is exactly what I have in mind. Doing this with a syntax aware tool allows you to perform these kinds of edits more reliably than a plain text editor. In fact, I question whether there are *any* maintenance activities that are appropriate for a plain text editor, with find and replace, which don't sometimes cause unforeseen mistakes.

As a general comment, I think it would be useful for Metamath 2.0 to be translatable to and from the current version. If 2.0 needs additional keywords for its features, maybe they could be hidden in special comments in the 1.0 version so the versions could be completely compatible. If 2.0 has say local scoping of things like labels, the translator could rename them for the 1.0 version with some standard suffix to make them unique.  The tools we have for 1.0 (metamath.exe, numerous verifiers) may or may not be practical to rewrite for 2.0, but with perfect translatability they could continue to be used.

I totally agree. I think that full backwards compatibility is very important for Metamath in general, so that proofs are supported long into the future, and we retain the current simplicity that has encouraged the existing proliferation of verifiers.

However, I am not certain that all features can be supported with the current mm format, in particular the separation of proofs and theorem statements, so I would relax this restriction to having a (fairly simple) translation from 1.0 to and from 2.0.


Here's a simple proposal for separating the proofs: take the proofs out of the .mm file, so that a $p statement takes the simple form:

label $p math $.

For forward compatibility, the label $p math $= ... $. form is also valid, and in this case a separate proof is not needed. Then, have a parallel file, say set.mmc to go with set.mm, which contains any proofs in binary format. Details of this format are up for discussion, but let me point out a few facts:

* It only needs to have the structure {{"thm1", "proof of thm1"}, {"thm2", "proof of thm2"}, ...} where the entries for each theorem can come in any order convenient for lookup or update (like a sorted binary tree or something).

* The file can be binary because all the human-readable information has been removed and placed in the other file. The file should be optimized for quick seeks, so that single-proof examination can operate without reading the whole file.

* Proof checking can be done in parallel (this has actually always been true, but it is harder to mix reading the file with proof checking at the same time), provided that care is taken with sequence numbers from the mm file to ensure that a theorem does not reference later theorems.

Here's an actual binary format which satisfies the above constraints:

Part 1: Header

  - 4D4D4301, 4 bytes = [MMC<hex 1>]: magic number (format specifier)
  - Length of header, 4 bytes (big endian)
  - List of records of the form:
      - Label string, zero-terminated. This is the search key, the first label in alphabetical order in the right subtree.
      - Length of left subtree, 4 bytes (skip this many bytes to get to the right subtree)
      - Length of all proofs in the left subtree, 4 bytes (skip this many bytes in part 2 to jump over all the proofs for the left subtree)
  - A zero byte (empty label string) indicates an empty subtree.

Part 2: Proof data

  - List of records of the form:
      - Length of record, 4 bytes
      - Label string, zero-terminated (this and the record length are not necessary if you use the header, but they make it easier to traverse all proofs)
      - Proof data, in a format reminiscent of the compressed proof format but without the printable-text restriction. It has the following structure:
          - List of label strings, separated by zero bytes (the referenced theorems)
          - A second zero byte denoting the end of the label list
          - The "letters" part of the compressed proof (discussed below)


That's all that is needed for efficient storage and retrieval. In order to find a specific record, one needs to read log n entries in the header, then skip straight to the proof location. Traversal over all records is easily accomplished by skipping the header, then reading each proof data record. Insertion or proof data replacement is also efficient, requiring updates to log n entries in the header. Renames are most easily accomplished as a deletion followed by an insertion. Rebalancing may be necessary, but only requires adjustment to the header.

Is it over-engineered? Probably, but the idea here is that an editor would not store this data in memory at all, and would instead just open the file whenever it needed to consult a new proof, and just use the mm file for contextual information otherwise. This, I think, can scale much better than the single giant .mm file we currently have.
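To make the lookup procedure concrete, here is a rough Python sketch under one possible reading of the proposal above (it assumes the header length counts the bytes after the length field, that the record length counts the whole part 2 record, and that part 2 stores records in alphabetical order by label; none of this is a finalized format):

import struct

def read_cstr(f):
    """Read a zero-terminated label string; b'' means an empty subtree."""
    out = bytearray()
    while (ch := f.read(1)) not in (b"", b"\0"):
        out += ch
    return bytes(out)

def find_proof(path, label):
    label = label.encode()
    with open(path, "rb") as f:
        assert f.read(4) == b"MMC\x01"                  # magic number 4D4D4301
        header_len = struct.unpack(">I", f.read(4))[0]
        part2_start = 8 + header_len                    # first byte of proof data
        offset = 0                                      # proof bytes to skip in part 2
        while True:
            key = read_cstr(f)                          # first label of the right subtree
            if not key:                                 # zero byte: empty subtree, stop
                break
            left_len, left_proofs = struct.unpack(">II", f.read(8))
            if label >= key:                            # go right:
                f.seek(left_len, 1)                     #   skip the left subtree in the header...
                offset += left_proofs                   #   ...and its proofs in part 2
            # otherwise go left, which is simply the next bytes of the header
        f.seek(part2_start + offset)
        rec_len = struct.unpack(">I", f.read(4))[0]
        rec_label = read_cstr(f)
        if rec_label != label:
            return None                                 # no proof stored for this label
        return f.read(rec_len - 4 - len(rec_label) - 1) # the remaining proof data

The point is only that a lookup is a short header walk plus a single seek, with nothing kept in memory between lookups.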

For the "letter list" part, I'll start by describing the general scheme, without picking the cutoffs yet. Divide the 256 possible byte values into three regions, A,B,C. A single number takes the form CC...CA or CC...CB (that is, all but the last byte have values in region C, and the last is in region A or B). If there are a,b,c numbers in each region (with a+b+c = 256), then sequence (a+b+cn),...,(a+b+c2),(a+b+c1),a0 translates to (((cn*c+...)*c+c2)*c+c1)*a+a0, and the sequence (a+b+cn),...,(a+b+c2),(a+b+c1),a+b0 translates to (((cn*c+...)*c+c2)*c+c1)*b+b0.

This is identical to the mixed-base representation in the standard compressed proof format, except there are three regions instead of two. The difference between the A numbers and the B numbers is that B numbers are marked, in the same sense as the "Z" mark in a compressed proof. This way, a single marked step can fit in only one byte instead of two, if the number is small enough.
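Here is a minimal sketch of the encoder and decoder for this scheme (the {a, b} = {208, 46} defaults are just the best split found below; nothing here is an implemented format yet):

def encode(n, marked, a=208, b=46):
    """Encode one step number n >= 0; 'marked' plays the role of the Z tag."""
    c = 256 - a - b
    if marked:
        out = [a + n % b]          # terminal byte in region B marks the step
        n //= b
    else:
        out = [n % a]              # terminal byte in region A
        n //= a
    while n > 0:                   # higher digits go in region C
        out.append(a + b + n % c)
        n //= c
    return bytes(reversed(out))    # most significant C digit first

def decode(data, pos, a=208, b=46):
    """Decode one number starting at data[pos]; return (value, marked, next_pos)."""
    c = 256 - a - b
    acc = 0
    while data[pos] >= a + b:      # region C bytes accumulate the high digits
        acc = acc * c + (data[pos] - (a + b))
        pos += 1
    t = data[pos]
    if t < a:                      # region A terminator: unmarked step
        return acc * a + t, False, pos + 1
    return acc * b + (t - a), True, pos + 1   # region B terminator: marked step

With these defaults, for example, encode(905, False) gives the four bytes (255, 254, 254, 73), and decode on those bytes returns (905, False, 4).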

I gathered statistics on how often each number, marked or not, is used in the compressed proofs, and tested what choices {a,b,c} lead to the shortest overall size. The largest number in set.mm is 905, and 10% of steps are marked, with a nice exponential decay for both groups.

Surprisingly, the best choices have c being incredibly small; the best solution was {208, 46, 2} with 6084 KB to store all the compressed numbers in set.mm. (For comparison, the current compressed proof format uses roughly {20, 20, 5}, which gives 9713 KB.) My instinct is to go with a more balanced choice, such as {192, 56, 8} = 6118 KB, but I guess the fact that there is a loose upper bound on proof lengths currently in force in set.mm means that numbers never get so large that more than 3 digits are needed.

Mario

fl

Apr 17, 2016, 11:15:41 AM4/17/16
to Metamath

Right, this is exactly what I have in mind. Doing this with a syntax aware tool allows you to perform these kinds of edits more reliably than a plain text editor. 

But a syntax-aware tool exists: it is mmj2. Its parser is very efficient and well designed. Now it is simple: use the syntax tree delivered by mmj2 to add
syntax-aware routines.

Currently the search/replace functions in mmj2 are poor in my opinion. They deserve to be completely redesigned.

And all those problems are not related to the way the database is stored on a disk anyway.

> In fact, I question whether there are *any* maintenance activities that are appropriate for a plain text editor, with find and replace, which don't sometimes cause 
> unforeseen mistakes.

And frankly, readable, text-based databases are never superseded by binary databases. Binary data can be useful for speed or other efficiency issues,
but it is always at the cost of poorer interfacing. I think computer engineers spend, I would say, 80% of their time with text-based
tools to manage data, systems and so on. Text-based databases have the peculiarity of interfacing well with an important element of a computer: human beings.

The errors introduced by a non-syntax-aware system are easily corrected, I think, especially with Metamath, where an error in the data is always
caught by the checker (at least for the proofs and formulas).

-- 
FL

Norman Megill

Apr 17, 2016, 7:51:21 PM4/17/16
to meta...@googlegroups.com
On 4/17/16 6:24 AM, Mario Carneiro wrote:
>
>
>
> On Sat, Apr 16, 2016 at 10:16 PM, Norman Megill <n...@alum.mit.edu
> <mailto:n...@alum.mit.edu>> wrote:
>
> On 4/16/16 6:53 AM, Mario Carneiro wrote:
...
>
> The multiple-module feature you want could be done now with the $[
> $] includes.
>
> For those wanting a single file, the includes would be commented out.
> For those wanting multiple files, the includes would be uncommented.
>
> We'd have to agree on the commenting convention, but I imagine it
> would be similar to what is in set.mm <http://set.mm> now if you
> search for '$( $['. Conversion between single and multiple would of
> course be automated in metamath.exe or another program.
>
> I think something like this should satisfy everyone.
>
>
> A simple short term solution would be a tool to separate set.mm
> <http://set.mm> by section, and daisy-chain the files together with
> include statements. If this can be done in a perfectly reversible
> manner, then we could pass between each form painlessly. That is, you
> could take:
>
> $( =-=- section 1 =-=- $)

Do you mean the existing section headers? If so, how would the user put
multiple tiny sections in one file or alternately break a huge section
into 2 files, or is that not possible/desirable with your scheme? What
is the algorithm for mapping the string "section 1" to "section1.mm"?
Or if you don't mean that, wouldn't it be clearer just to put the file
name instead of "section 1"?

> thm1 $p ...
> $( =-=- section 2 =-=- $)
> thm2 $p ...
> $( =-=- section 3 =-=- $)
> thm3 $p ...
>
> from set.mm <http://set.mm>, and form the files section1.mm
> <http://section1.mm>:
>
> $( =-=- section 1 =-=- $)
> thm1 $p ...
>
> section2.mm <http://section2.mm>:

(Aside: Is there any way to turn off Google's insertion of spurious
http links on file names?)

>
> $[ section1.mm <http://section1.mm> $]
>
> $( =-=- section 2 =-=- $)
> thm2 $p ...
>
> section3.mm <http://section3.mm>:
>
> $[ section2.mm <http://section2.mm> $]
>
> $( =-=- section 3 =-=- $)
> thm3 $p ...
>
>
> I believe Stefan has already used such a tool for the set.mm-history
> project.

Could Stefan report the format he uses, and any other opinions he might
have on what our "official" convention should be?


...

>
> As a general comment, I think it would be useful for Metamath 2.0 to
> be translatable to and from the current version. If 2.0 needs
> additional keywords for its features, maybe they could be hidden in
> special comments in the 1.0 version so the versions could be
> completely compatible. If 2.0 has say local scoping of things like
> labels, the translator could rename them for the 1.0 version with
> some standard suffix to make them unique. The tools we have for 1.0
> (metamath.exe, numerous verifiers) may or may not be practical to
> rewrite for 2.0, but with perfect translatability they could
> continue to be used.
>
>
> I totally agree. I think that full backwards compatibility is very
> important for Metamath in general, so that proofs are supported long
> into the future, and we retain the current simplicity that has
> encouraged the existing proliferation of verifiers.
>
> However, I am not certain that all features can be supported with the
> current mm format, in particular the separation of proofs and theorem
> statements, so I would relax this restriction to having a (fairly
> simple) translation from 1.0 to and from 2.0.

Can you clarify this? I'm not opposed to hiding new 2.0 features in
special 1.0 comments so that 2.0 could be recovered from it exactly. If
we have perfect (lossless) translations back and forth between 1.0 and
2.0 (which is what I was trying to suggest), what are you relaxing?

>
>
> Here's a simple proposal for separating the proofs: take the proofs out
> of the .mm file, so that a $p statement takes the simple form:
>
> label $p math $.
>
> For forward compatibility, the label $p math $= ... $. form is also
> valid, and in this case a separate proof is not needed. Then, have a
> parallel file, say set.mmc to go with set.mm <http://set.mm>, which
> shortest overall size. The largest number in set.mm <http://set.mm> is
> 905, and 10% of steps are marked, with a nice exponential decay for both
> groups.
>
> Surprisingly, the best choices have c being incredibly small; the best
> solution was {208, 46, 2} with 6084 KB to store all the compressed
> numbers in set.mm <http://set.mm>. (For comparison, the currrent
> compressed proof format uses roughly {20, 20, 5} which gives 9713 KB.)
> My instinct is to go with a more balanced choice, such as {192, 56, 8} =
> 6118 KB, but I guess the fact that there is a loose upper bound on proof
> lengths currently in force in set.mm <http://set.mm> means that numbers
> never get so large that more than 3 digits is needed.

Interesting. I'm curious - how well do these formats compress further
with say bzip2 or xz?

Norm

Mario Carneiro

Apr 17, 2016, 8:16:24 PM4/17/16
to metamath
On Sun, Apr 17, 2016 at 11:15 AM, 'fl' via Metamath <meta...@googlegroups.com> wrote:

Right, this is exactly what I have in mind. Doing this with a syntax aware tool allows you to perform these kinds of edits more reliably than a plain text editor. 

But a syntax aware tool exists: it is mmj2. Its parser is very efficient and well designed. Now it is simple: use the syntax tree delivered the by mmj2 to add
syntax-aware routines.

But it's not an *editor*. It was explicitly designed not to modify the .mm file, which was IMO a mistake. The task of simply reading an .mm file and converting it to an internal data structure is not that hard, and is performed by every verifier. It is true that mmj2 handles "proof mode" better than any other existing solutions, but it does not do "maintenance mode" at all, nor can it handle the interplay between the two (I have a script to restart mmj2 whenever I do a maintenance mode edit, which I do using a text editor open to the .mm file).

Currently the search/replace functions in mmj2 are poor in my opinion. They deserve to be completely redesigned.

True. Ideas are welcome.
 
And all those problems are not related to the way the database is stored on a disk anyway.

The disk storage question is an attempt to address the scalability of the full library going forward. Almost all activities you want to do with the library are "local", and as set.mm becomes more modular this will become even more so. You can't "seek" in a text file.
 
> In fact, I question whether there are *any* maintenance activities that are appropriate for a plain text editor, with find and replace, which don't sometimes cause 
> unforeseen mistakes.

And frankly readable, text-based databases are never superseded by binary databases. Binary data can be useful for speed or other efficiency issues 
but it is always at the cost of poorer interfacing. I think computer engineers must spend I would say  80% of their time with text-based
tools to manage data, systems and so on. Text-based databases have the peculiarity of interfacing well with important element of a computer: human beings.

You're kidding yourself if you think that compressed proofs are "readable". Currently an .mm file is this odd mix of human readable and unreadable data. The basic idea here is to take the unreadable part out of an .mm file so it becomes pure "source code", then make the human-unreadable part *really* unreadable, i.e. full binary, taking advantage of all the unutilized byte values to achieve a significant savings in proof size (~30% decrease), and making it possible to jump around the file more efficiently than just loading the whole thing into memory.

Also, a binary format like this one is naturally paired with the aforementioned "proof mode" software, which will unpack a proof into a human readable format like .mmp style. That way you get the best of both worlds.
 
The quantity of errors introduce by a non aware-based system are easily corrected I think especially with metamath where an error in the data is always
captured by the checker (at least for the proofs an formulas).

 Just because an error is recognized doesn't mean it's easy to fix. It is quite easy to cause thousands of errors with a simple change in one spot, and syntax aware editors can automatically resolve these downstream errors in many cases.



On Sun, Apr 17, 2016 at 7:51 PM, Norman Megill <n...@alum.mit.edu> wrote:
On 4/17/16 6:24 AM, Mario Carneiro wrote:
A simple short term solution would be a tool to separate set.mm
<http://set.mm> by section, and daisy-chain the files together with

include statements. If this can be done in a perfectly reversible
manner, then we could pass between each form painlessly. That is, you
could take:

$( =-=- section 1 =-=- $)
Do you mean the existing section headers?  If so, how would the user put multiple tiny sections in one file or alternately break a huge section into 2 files, or is that not possible/desirable with your scheme?  What is the algorithm for mapping the string "section 1" to "section1.mm"? Or if you don't mean that, wouldn't it be clearer just to put the file name instead of "section 1"?

Yes, those are the existing section headers. Splitting and joining files is done just like you would in a software project: make a new file, copy the theorems of interest over. I don't think it is necessary to specifically support this action just because it happens to be simple with the current approach. Also, you could just add a second section to one of the single-section files, then round trip it and it will get broken up.

You are right that there is no easy way to build section names appropriate as file names, since currently we don't have any short naming system for sections that would naturally have already developed if we used multiple files (notice that existing split attempts have such short names, like logic.mm, dedekind.mm instead of "Dedekind-cut construction of real and complex numbers"). We may want to consider a format for annotating short names in the section headers. Other options include using the section numbering (which is sensitive to insertion/deletion), or using the news website approach: take the English title, lowercase everything, replace all non-alphanumerics with hyphen, then collapse multiple consecutive hyphens and cut it off if it is longer than some upper bound, say 50 chars.
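For what it's worth, that last approach is only a couple of lines; a sketch (the cleanup rules and the 50-character cutoff here are just the suggestion above, not an agreed convention):

import re

def section_slug(title, max_len=50):
    """Lowercase, map non-alphanumeric runs to hyphens, trim, and truncate."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return slug[:max_len].rstrip("-")

# section_slug("Dedekind-cut construction of real and complex numbers")
#   -> "dedekind-cut-construction-of-real-and-complex-numb"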
 

thm1 $p ...
$( =-=- section 2 =-=- $)
thm2 $p ...
$( =-=- section 3 =-=- $)
thm3 $p ...

from set.mm <http://set.mm>, and form the files section1.mm
<http://section1.mm>:


$( =-=- section 1 =-=- $)
thm1 $p ...


(Aside:  Is there any way to turn off Google's insertion of spurious http links on file names?)

Haha, you have it bad because Google turns these names into links, which your plaintext email client turns into "a <b>" format, which Google turns into links... I bet if we keep passing this quote around it will blow up exponentially.

I totally agree. I think that full backwards compatibility is very
important for Metamath in general, so that proofs are supported long
into the future, and we retain the current simplicity that has
encouraged the existing proliferation of verifiers.

However, I am not certain that all features can be supported with the
current mm format, in particular the separation of proofs and theorem
statements, so I would relax this restriction to having a (fairly
simple) translation from 1.0 to and from 2.0.

Can you clarify this?  I'm not opposed to hiding new 2.0 features in special 1.0 comments so that 2.0 could be recovered from it exactly.  If we have perfect (lossless) translations back and forth between 1.0 and 2.0 (which is what I was trying to suggest), what are you relaxing?

Oh, I just meant it may not be backwards or forwards compatible, in the sense that one spec is a subset or superset of the other, but there is an embedding and maybe a bijection between formats.

I gathered statistics on how often each number, marked or not, is used
in the compressed proofs, and tested what choices {a,b,c} lead to the
shortest overall size. The largest number in set.mm <http://set.mm> is

905, and 10% of steps are marked, with a nice exponential decay for both
groups.

Surprisingly, the best choices have c being incredibly small; the best
solution was {208, 46, 2} with 6084 KB to store all the compressed
numbers in set.mm <http://set.mm>. (For comparison, the currrent

compressed proof format uses roughly {20, 20, 5} which gives 9713 KB.)
My instinct is to go with a more balanced choice, such as {192, 56, 8} =
6118 KB, but I guess the fact that there is a loose upper bound on proof
lengths currently in force in set.mm <http://set.mm> means that numbers

never get so large that more than 3 digits is needed.
Interesting.  I'm curious - how well do these formats compress further with say bzip2 or xz?

There is probably still some room for compression, since the label lists are still English-like. But a proper answer would require implementing the algorithm (those KB numbers are projections based on the statistics of set.mm; I never actually made a 6084 KB file). I'll let you know if/when 2.0 is actually built.

Mario

Stefan O'Rear

Apr 17, 2016, 8:53:13 PM4/17/16
to Metamath
On Sunday, April 17, 2016 at 3:24:41 AM UTC-7, Mario Carneiro wrote:
On Sat, Apr 16, 2016 at 10:16 PM, Norman Megill <n...@alum.mit.edu> wrote:
On 4/16/16 6:53 AM, Mario Carneiro wrote:
Actually, I've been brainstorming how to fix this, what the perfect
metamath editor should look like. It's quite clear to me, at least, that
we are not there, and having one 16MB file is a symptom of this. Most
file editors are just not designed for this, and it is not good to need
to read it all and store it in memory in order to get to a single
theorem.

I'm not sure why it's not good.  On today's computers, 28MB is nothing compared to the available 4 to 8GB of memory.  If an editor has trouble with a file of that size, maybe the editor design is not good.  While I haven't checked https://en.wikipedia.org/wiki/Comparison_of_text_editors#Basic_features carefully, the currently supported ones I recognize handle several GB or more.  Emacs supposedly handles 2EB (10^9 GB).

Being able to load a 28MB file is not the same thing as being able to handle a 28MB file _quickly_.  Most files encountered in programming are much smaller (look at the C files in the Metamath distribution), and most specialized programmer's editors don't place a high priority on being able to edit responsively with a large file loaded, even if that file can theoretically be loaded.  I believe I reported a long time ago that vim was having 1/4 - 1/2 seconds of lag to move around in the 17MB set.mm.  It's great that your editor of choice has made tradeoffs that allow it to efficiently handle large files, but other people have editors of choice that make different tradeoffs, and keeping set.mm in smaller pieces will help make them happy.

The version control systems I use also get noticeably sluggish with files that large.

Moreover, it's important to plan for the future as well as the present.  If we ever succeed in attracting a significant number of new contributors, that file will increase in size much faster than it has done in the past.  I consider the fact that set.mm is growing slower than Moore's law to be a bug, not a feature. 

The thing is that it's not just 28MB (wow, it's grown a lot since I last checked). An editor will likely need several times that in order to store not just the raw text, but also the internal processed form of the data. Remember that the "perfect Metamath editor" will not be a plain text editor, because that doesn't correctly reflect the structure of Metamath data.

Exactly, and this goes double for my zany idea of the browser-based proof editor, because Javascript implementations give you very little control over the heap layout (security reasons) and many data structures will require a lot more memory than they would in a C program.  (Unless we do the asm.js "everything in a big Int8Array" approach).
 
One interesting data structure I'd like to focus on is the information needed by proof mode. Roughly, this is the MM file with proofs omitted. I just did a test, stubbing out all the proofs, and the result is about 12MB which is alarmingly large. Removing comments brings it down to 5.8MB. I think there is a good argument for separating this to a different file from the proofs, so that an editor need not read the proofs if one is only interested in the "context" of one proof. This structure is also useful as the context needed for proof HTML generation.

My plan for SMM-PA was to keep the proofs in text format when they weren't being edited, because that's denser than almost any other possible data structure.  Still, maintaining the parse trees for all statements took ~ 120 MB last time I tried it.  Developing a denser format for the parse trees was on my TODO list.
 
Edits to an .mm file almost always are local, touching only a
tiny tiny part of that giant file, and splitting it up is one way to
focus development a bit more.

Could you repeat the reasons why you think having set.mm <http://set.mm>
in one file is a good thing?

Simplicity.  Speed.  Lowest common denominator.  Convenience.

I will certainly grant that it is easy to use a text editor for maintenance activities, when it can be done. The major drawback is that there are certain maintenance activities that can't be done with find/replace, and others that "almost" work but need some cleanup afterwards. A proper tool to do these refactorings in a correctness-preserving manner is warranted.

Simplicity is a big deal for me.  I made up a proposal for a complicated namespacing system when I first joined; I forget if I ever shared it but I no longer support it.

Straw proposal that I _could_ get behind today: replace the single .mm file with a directory of files, which are read in alphabetical order.  Tools which don't understand the directory design can be supported with a very simple concatenation and splitting program; the concatenation, in particular, is a 1-liner in UNIX shell.

Important: This proposal DOES NOT use $[ $].  The directory structure is king.  I regard $[ $] as a mistake because it gives the database author too much flexibility; the database should be a database, not a program.

This gives us a large number of small files (although Construction_of_a_vector_space_from_a_Hilbert_lattice could still stand to be split up :) ), making our tree similar to other trees and play nicer with the widest variety of existing text manipulation tools.  (A current Linux kernel checkout is 624 MB, or 12 KB per source file...)

Some editors do shared undo, and usually a theorem move only involves one or two files anyway. But I think the correct solution here is just to commit often, and use the commits as cross-file undo checkpoints.

Agreed on commits and the value of integrating version control into our workflow.  Again, git and other systems I've used will be happier with hundreds of small files than one largish one.
 
Here's a simple proposal for separating the proofs: take the proofs out of the .mm file, so that a $p statement takes the simple form:

Not a big fan of separating the proofs.  Seems to add a lot of complexity for a gain that will be erased by just a year or two of growth.  Splitting the database into _many_ files can scale.

If the file were split into sections, the mythical Metamath Two could automatically create interface cache files for each section.  Pre-tokenizing the _statements_ would be a substantial space win, and cache files that are automatically regenerated based on timestamps would add no complexity whatsoever to the maintenance and version control stories (after all, your compiler + make already does exactly that).

-sorear 

Stefan O'Rear

Apr 17, 2016, 9:02:19 PM4/17/16
to Metamath
On Sunday, April 17, 2016 at 4:51:21 PM UTC-7, Norman Megill wrote: 
> I believe Stefan has already used such a tool for the set.mm-history
> project.

Could Stefan report the format he uses, and any other opinions he might
have on what our "official" convention should be?


A newer version of essentially the same logic: https://github.com/sorear/smm/blob/master/lib/smm/outline.js

I have already stated my opposition to $[ $].
 
Interesting.  I'm curious - how well do these formats compress further
with say bzip2 or xz?

I'd really encourage us to _not_ spend much time on optimizing compression with databases at rest.  I think I could make something half the size of .mm.xz, given my experience with format-specific data compressors (the general-purpose compressors are a lot worse for specific formats than people realize), but at least for me 28MB on disk is very little.  I care much more about 120MB of parse trees in RAM, occupying space in a browser heap already split between N tabs...

This might be US city-dweller bias.  FL might have choice words on the importance of optimizing databases for transmission over very slow unreliable links.

-sorear

Stefan O'Rear

Apr 17, 2016, 9:05:20 PM4/17/16
to Metamath
On Sunday, April 17, 2016 at 5:16:24 PM UTC-7, Mario Carneiro wrote:
You are right that there is no easy way to build section names appropriate as file names, since currently we don't have any short naming system for sections that would naturally have already developed if we used multiple files (notice that existing split attempts have such short names, like logic.mm, dedekind.mm instead of "Dedekind-cut construction of real and complex numbers"). We may want to consider a format for annotating short names in the section headers. Other options include using the section numbering (which is sensitive to insertion/deletion), or using the news website approach: take the English title, lowercase everything, replace all non-alphanumerics with hyphen, then collapse multiple consecutive hyphens and cut it off if it is longer than some upper bound, say 50 chars.

I posted two implementations of the "news website approach" above.  It needs a little tweaking for Metamath but works well.
 
Haha, you have it bad because Google turns these names into links, which your plaintext email client turns into "a <b>" format, which Google turns into links... I bet if we keep passing this quote around it will blow up exponentially.

Two-letter file extensions are best avoided for new projects because of the ccTLD issue...
 
-sorear

David A. Wheeler

Apr 17, 2016, 10:18:31 PM4/17/16
to meta...@googlegroups.com, Mario Carneiro
I do like the readability of text files. If seekability is important, perhaps they could be stored as a bunch of small text files, with one proof per file.

--- David A.Wheeler

Norman Megill

Apr 17, 2016, 11:35:35 PM4/17/16
to meta...@googlegroups.com
On 4/17/16 8:16 PM, Mario Carneiro wrote:
>
>
> On Sun, Apr 17, 2016 at 11:15 AM, 'fl' via Metamath
> <meta...@googlegroups.com <mailto:meta...@googlegroups.com>> wrote:
...

> > In fact, I question whether there are *any* maintenance activities that are appropriate for a plain text editor, with find and replace, which don't sometimes cause
> > unforeseen mistakes.
>
> And frankly readable, text-based databases are never superseded by
> binary databases. Binary data can be useful for speed or other
> efficiency issues
> but it is always at the cost of poorer interfacing. I think computer
> engineers must spend I would say 80% of their time with text-based
> tools to manage data, systems and so on. Text-based databases have
> the peculiarity of interfacing well with important element of a
> computer: human beings.
>
>
> You're kidding yourself if you think that compressed proofs are
> "readable". Currently an .mm file is this odd mix of human readable and
> unreadable data. The basic idea here is to take the unreadable part out
> of an .mm file so it becomes pure "source code", then make the
> human-unreadable part *really* unreadable, i.e. full binary, taking
> advantage of all the unutilized byte values to achieve a significant
> savings in proof size (~30% decrease), and making it possible to jump
> around the file more efficiently than just loading the whole thing into
> memory.
>
> Also, a binary format like this one is naturally paired with the
> aforementioned "proof mode" software, which will unpack a proof into a
> human readable format like .mmp style. That way you get the best of both
> worlds.

I'll mention that I am somewhat uncomfortable with binary formats, but
as long as we can convert to ASCII 1.0 I guess it's tolerable. For some
things like label renames, a syntax-aware Metamath-specific tool is
certainly preferred. But no such tool can anticipate all the possible
manipulations we may want to do to a .mm file. Having the ability to
process it with text-manipulating tools from sed to powerful text editor
scripting has been extremely useful in the past, and that ability goes
away with binary formats. Unlike a computer program where editing can
introduce hidden bugs, the ability to verify proofs makes .mm files
relatively robust against editing mistakes.

A computer program source file 28MB in size is hard to even imagine, and
I wouldn't want to be responsible for its maintenance. The fact that
set.mm flourishes with no significant problems in spite of its size
suggests that it has a fundamentally different character from a computer
program. The same principles of structure and modularity, intended to
reduce bugs and increase maintainability, may or may not apply. set.mm
is also quite different from an ordinary SQL database, which could be
riddled with unverifiable data entry errors that can only partially be
checked for consistency (such as correct zip code). I don't know what
category of data object set.mm would be classified as, but it seems
different from the usual ones of our experience.

As for file size, of course smaller files are preferable as a general
principle, but I don't think it is worth giving up flexibility. I still
don't understand the concern with our present file sizes, where set.mm
takes less than 1% of available memory on most computers. (BTW my text
editor uses an additional 13MB to work with the 28MB set.mm, growing up
to a user-specified limit for the undo stack.) I know you are trying to
anticipate future scalability, and theoretical studies of different
proof compression methods are certainly interesting, but it will be quite
a while before saving 30% of the size will make a noticeable difference.

> The quantity of errors introduce by a non aware-based system are
> easily corrected I think especially with metamath where an error in
> the data is always
> captured by the checker (at least for the proofs an formulas).
>
>
> Just because an error is recognized doesn't mean it's easy to fix. It
> is quite easy to cause thousands of errors with a simple change in one
> spot, and syntax aware editors can automatically resolve these
> downstream errors in many cases.

What is an example of an error introduced by a careless edit that is
hard to fix? (Personally, during an editing session I usually save the
file frequently and run the fast 'verify proof */syntax_only'.)

Norm

Mario Carneiro

Apr 18, 2016, 1:31:23 AM4/18/16
to metamath
On Sun, Apr 17, 2016 at 8:53 PM, Stefan O'Rear <stef...@cox.net> wrote:
Simplicity is a big deal for me.  I made up a proposal for a complicated namespacing system when I first joined; I forget if I ever shared it but I no longer support it.

Straw proposal that I _could_ get behind today: replace the single .mm file with a directory of files, which are read in alphabetical order.  Tools which don't understand the directory design can be supported with a very simple concatenation and splitting program; the concatenation, in particular, is a 1-liner in UNIX shell.

I like this general approach, although reading in alphabetical order will require index numbers in the section file names, which means a lot of shuffling around for insertions.

A variant of this that is forward compatible is to have a folder full of .mm file fragments, not expected to compile on their own, and one big index file, like so:

index.mm:
$[ fol\prop\axioms.mm $]  <-- thru sec 1.2.2
$[ fol\prop\imp.mm $]  <-- sec 1.2.3
$[ fol\prop\neg.mm $]  <-- sec 1.2.4
...
$[ zf\ax-ext.mm $]   <-- sec 2.1.1
$[ zf\cab.mm $]   <-- sec 2.1.2
...

This way, we have a single place to go for section reordering, and there is a lot of flexibility with file naming and folder structure. In fact, it might also be useful to put the section comments themselves in this index file, so that section organization (the stuff that Gerard Lang likes to work with) can be separated from the theorems.

the database should be a database, not a program.

I think this bears repeating. A good portion of the complexity of manipulating an .mm file is caused by all the possible changes to the parser state that can occur while reading the file. That is, it is difficult to check a proof in a second pass because you have to remember all the changes to global state that have "since" happened from directives later in the file, and undo them using sequence number filters and so on. This design is easy for dynamic work, i.e. verifiers, but makes static analysis (syntax aware maintenance mode) more complex.

A simple "optimization" that simplifies static analysis:

* All $c, $v, $f statements, and syntax axioms come at the beginning of the file. In other words, the entire database works under a single grammar. (It's not necessary that this actually be true, but rather that transplanting all the statements to the start doesn't break anything.)

Luckily we don't have to worry about theorems being hidden to the point of inaccessibility, followed by name reuse, or else that would also be on this list.
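
As a rough test of the single-grammar property above on an existing database, something like the following Python sketch could report $c/$v/$f statements that occur after the first $p. It strips comments but ignores ${ $} scoping, so it is only an approximate check, and the helper name is made up.

grammar_order_check.py:
import re
import sys

def check(path):
    text = open(path).read()
    # Strip $( ... $) comments so tokens inside them are not counted.
    text = re.sub(r'\$\(.*?\$\)', ' ', text, flags=re.S)
    tokens = text.split()
    seen_proof = False
    for i, tok in enumerate(tokens):
        if tok == '$p':
            seen_proof = True
        elif tok in ('$c', '$v', '$f') and seen_proof:
            # This declaration would have to be hoisted (and possibly merged)
            # for the single-grammar property to hold.
            print('%s at token %d appears after the first $p' % (tok, i))

if __name__ == '__main__':
    check(sys.argv[1])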

Here's a simple proposal for separating the proofs: take the proofs out of the .mm file, so that a $p statement takes the simple form:

Not a big fan of separating the proofs.  Seems to add a lot of complexity for a gain that will be erased by just a year or two of growth.  Splitting the database into _many_ files can scale.

It may not have been obvious, but the proposal was actually suggesting one .mmc proof index file for *each* .mm file in the project. So if the .mm file is broken up, so too will the .mmc files.

The gain here is not just the ~30% file size savings, although that is nice. The main point is that this file need not be stored in memory _at all_, regardless of how the proof data is compressed. Whenever a proof record is needed, you open the file, seek to the location of the data, then read the record and close the file. So that's a 100% savings on in-memory proof data storage (although if your memory statistics are correct, this may still not be a very big savings, since it's only about 70% of set.mm, compared to your 1000% data structure).
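
A minimal sketch of that access pattern, assuming a hypothetical .mmc layout in which a small in-memory index maps each $p label to an (offset, length) pair while the proof bytes themselves stay on disk:

mmc_reader.py:
class ProofStore:
    def __init__(self, path, index):
        self.path = path      # e.g. 'fol/prop/imp.mmc' (hypothetical)
        self.index = index    # {'syl': (0, 123), 'a1i': (123, 87), ...}

    def proof(self, label):
        # Open, seek, read one record, close; nothing else is held in memory.
        offset, length = self.index[label]
        with open(self.path, 'rb') as f:
            f.seek(offset)
            return f.read(length)   # the raw (compressed) proof record

# usage with made-up data:
#   store = ProofStore('fol/prop/imp.mmc', {'syl': (0, 123)})
#   blob = store.proof('syl')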
 
If the file were split into sections, the mythical Metamath Two could automatically create interface cache files for each section.  Pre-tokenizing the _statements_ would be a substantial space win, and cache files that are automatically regenerated based on timestamps would add no complexity whatsoever to the maintenance and version control stories (after all, your compiler + make already does exactly that).

I think caching is a good idea, particularly if it can somehow be segmented so that the entire cache is not invalidated when any part changes. (This is yet another argument for splitting the file, so that you have multiple timestamps.) If you could get something similar to the in-memory conception of the database onto the disk as a cache, then you could speed up loading the database by an order of magnitude.
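
In the same make-like spirit, a small Python sketch of per-fragment cache regeneration by timestamp; the cache naming convention and the build_index helper are placeholders:

cache_refresh.py:
import os
import pickle

def load_with_cache(fragment, build_index):
    cache = fragment + '.cache'
    if os.path.exists(cache) and \
            os.path.getmtime(cache) >= os.path.getmtime(fragment):
        with open(cache, 'rb') as f:
            return pickle.load(f)       # the fragment has not changed
    index = build_index(fragment)       # expensive: parse and index the fragment
    with open(cache, 'wb') as f:
        pickle.dump(index, f)           # regenerate, like make does for .o files
    return index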


On second thought, I think the .mmc format should not contain the header, which is too brittle for version control. Instead, it would only have what I labeled "part 2", and the program would construct the binary tree or something like it in memory, maybe also outputting it as part of the cache, but not checked in to version control.

On Sun, Apr 17, 2016 at 11:35 PM, Norman Megill <n...@alum.mit.edu> wrote:
I'll mention that I am somewhat uncomfortable with binary formats, but as long as we can convert to ASCII 1.0 I guess it's tolerable.  For some things like label renames, a syntax-aware Metamath-specific tool is certainly preferred.  But no such tool can anticipate all the possible manipulations we may want to do to a .mm file.  Having the ability to process it with text-manipulating tools from sed to powerful text editor scripting has been extremely useful in the past, and that ability goes away with binary formats.  Unlike a computer program where editing can introduce hidden bugs, the ability to verify proofs makes .mm files relatively robust against editing mistakes.

I will emphasize that I am only talking about the stuff that is meaningless, sed-unprocessable "ASFAKLHRAWHLA" garbage already. I don't have any proposal for the .mm format any easier to use than the current one (besides taking the proof blocks out). Compressed proofs *are* binary data, even if they are currently being stored in a text file.
 
A computer program source file 28MB in size is hard to even imagine, and I wouldn't want to be responsible for its maintenance.  The fact that set.mm flourishes with no significant problems in spite of its size suggests that it has a fundamentally different character from a computer program.  The same principles of structure and modularity, intended to reduce bugs and increase maintainability, may or may not apply.

I don't buy this argument. It is quite possible to use a 28MB C file, it's not all that different from 28MB worth of separate C files as far as the compiler is concerned. And if you had to, and the tools did not have an issue with large files, it wouldn't be that bad for day-to-day work. But it's terrible code style. Not to put too fine a point on it, but I'm pretty sure that the main reason set.mm is one file today is because you think it's a good idea, which isn't actually an argument. Of course I presume that you have reasons for this decision, which are perfectly good arguments, but the mere fact that this is how it has been is not proof that it is the best way.
 
set.mm is also quite different from an ordinary SQL database, which could be riddled with unverifiable data entry errors that can only partially be checked for consistency (such as correct zip code).  I don't know what category of data object set.mm would be classified as, but it seems different from the usual ones of our experience.

As we've seen, not all data errors in set.mm can be checked. The definition checker, grammar parsing, and theorem scoping are all examples of data integrity checks that the average verifier does not perform. Furthermore, testing that theorems and definitions mean what we want will never be automatically checkable. This is exactly the same sort of thing as an SQL database with unverifiable data entry errors, or a C program which might compile but still have logical bugs.

The biggest difference between set.mm and an SQL database is instead the online nature of the data (online as in https://en.wikipedia.org/wiki/Online_algorithm). If all the grammar was defined up front, it actually wouldn't be all that bad to store a metamath file in a DBMS.
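
For what it's worth, a toy sqlite sketch of that idea, with statements as rows ordered by a sequence number so that the declaration-before-use structure survives; the schema is purely illustrative:

mm_db.py:
import sqlite3

conn = sqlite3.connect(':memory:')
conn.executescript('''
    CREATE TABLE statement (
        seq    INTEGER PRIMARY KEY,   -- global order in the database
        label  TEXT,                  -- NULL for $c / $v declarations
        kind   TEXT NOT NULL,         -- '$c', '$v', '$f', '$e', '$a', '$p'
        math   TEXT NOT NULL,         -- the math symbol string
        proof  BLOB                   -- compressed proof, only for $p rows
    );
''')
conn.execute("INSERT INTO statement VALUES "
             "(1, 'wi', '$a', 'wff ( ph -> ps )', NULL)")
for row in conn.execute("SELECT label, kind, math FROM statement ORDER BY seq"):
    print(row)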

As for file size, of course smaller files are preferable as a general principle, but I don't think it is worth giving up flexibility.  I still don't understand the concern with our present file sizes, where set.mm takes less than 1% of available memory on most computers.  (BTW my text editor uses an additional 13MB to work with the 28MB set.mm, growing up to a user-specified limit for the undo stack.)  I know you are trying to anticipate future scalability, and theoretical studies of different proof compression methods is certainly interesting, but it will be quite a while before saving 30% of the size will make a noticeable difference.

I think Stefan had the right idea here. This view forgets two important facts: (1) The size on disk is often quite different from the size in memory when read by a *syntax aware* tool. A file editor will probably not need much more space than the file itself because it makes no attempt to understand the data it is manipulating (which leads to subtle or not-so-subtle errors in maintenance mode). (2) Just because a file "fits" in memory (itself a fuzzy concept when you consider paging) does not mean that it can be *quickly* manipulated. This is essential for proof assistants like mmj2; I am starting to feel the pressure of scanning the database on each unify (which I do once every 3 seconds or so), where a 0.5 - 1 second delay is a pain point.
 
Just because an error is recognized doesn't mean it's easy to fix. It
is quite easy to cause thousands of errors with a simple change in one
spot, and syntax aware editors can automatically resolve these
downstream errors in many cases.

What is an example of an error introduced by a careless edit that is hard to fix?  (Personally, during an editing session I usually save the file frequently and run the fast 'verify proof */syntax_only'.)

An almost trivial example: Change syl to $e |- ( ps -> ch ) $e |- ( ch -> ph ) $p |- ( ps -> ph ) $. By "fix" here I don't mean "undo the change", but "fix the file so that it's not broken, but keep the change".

Mario

fl

unread,
Apr 18, 2016, 6:08:17 AM4/18/16
to Metamath


> But a syntax aware tool exists: it is mmj2. Its parser is very efficient and well designed. Now it is simple: use the syntax tree delivered by mmj2 to add

Frankly, I doubt it would be that difficult to transform mmj2 into an editor. Add another buffer
where the whole database is displayed, and allow saving the data to a file. And there you are.

You can also externalize the parser and make an external filter in the "find" spirit. I think it would
allow to do think.

By the way, it could add automatic periodic backups.

--
FL

fl

unread,
Apr 18, 2016, 6:13:11 AM4/18/16
to Metamath

>  I think it would allow to do think.

To do things.

--
FL

Stefan O'Rear

unread,
Apr 19, 2016, 12:51:00 AM4/19/16
to Metamath
On Sunday, April 17, 2016 at 10:31:23 PM UTC-7, Mario Carneiro wrote:
On Sun, Apr 17, 2016 at 8:53 PM, Stefan O'Rear <stef...@cox.net> wrote:
Simplicity is a big deal for me.  I made up a proposal for a complicated namespacing system when I first joined; I forget if I ever shared it but I no longer support it.

Straw proposal that I _could_ get behind today: replace the single .mm file with a directory of files, which are read in alphabetical order.  Tools which don't understand the directory design can be supported with a very simple concatenation and splitting program; the concatenation, in particular, is a 1-liner in UNIX shell.

I like this general approach, although the alphabetical ordering will require numeric indexes in the section file names, which means a lot of shuffling around for insertions.

That was a very quick idea, and sure to have a number of problems… this is a fairly hard problem to solve, and I still don't have all the answers yet.  Definitely interested in future exploration and discussion of the "how do we make a _good_ dual-mode editor" space.

What do we want to do in terms of platform issues?  I still feel like the JVM is a major liability, but I see a shortage of time continuing for the next few months so I'm not very excited to maintain my own system.
 
A variant of this that is forward compatible is to have a folder full of .mm file fragments, not expected to compile on their own, and one big index file, like so:

index.mm:
$[ fol\prop\axioms.mm $]  <-- thru sec 1.2.2
$[ fol\prop\imp.mm $]  <-- sec 1.2.3
$[ fol\prop\neg.mm $]  <-- sec 1.2.4
...
$[ zf\ax-ext.mm $]   <-- sec 2.1.1
$[ zf\cab.mm $]   <-- sec 2.1.2
...

This way, we have a single place to go for section reordering, and there is a lot of flexibility with file naming and folder structure. In fact, it might also be useful to put the section comments themselves in this index file, so that section organization (the stuff that Gerard Lang likes to work with) can be separated from the theorems.

That could work well.  At that point it would essentially be a project file; I feel like there's something _almost_ here that would be a big help to local work / handling subsets of the database, but I'm not quite sure what it is.

Vaguely I feel like the Database should always be, abstractly, a singular entity, but it need not be worked with _as_ a singular entity, if that makes any sense at all.
 
the database should be a database, not a program.

I think this bears repeating. A good portion of the complexity of manipulating an .mm file is caused by all the possible changes to the parser state that can occur while reading the file. That is, it is difficult to check a proof in a second pass because you have to remember all the changes to global state that have "since" happened from directives later in the file, and undo them using sequence number filters and so on. This design is easy for dynamic work, i.e. verifiers, but makes static analysis (syntax aware maintenance mode) more complex.

A simple "optimization" that simplifies static analysis:

* All $c, $v, $f statements, and syntax axioms come at the beginning of the file. In other words, the entire database works under a single grammar. (It's not necessary that this actually be true, but rather that transplanting all the statements to the start doesn't break anything.)

Isn't this already true?
 
Luckily we don't have to worry about theorems being hidden to the point of inaccessibility, followed by name reuse, or else that would also be on this list.

Here's a simple proposal for separating the proofs: take the proofs out of the .mm file, so that a $p statement takes the simple form:

Not a big fan of separating the proofs.  Seems to add a lot of complexity for a gain that will be erased by just a year or two of growth.  Splitting the database into _many_ files can scale.

It may not have been obvious, but the proposal was actually suggesting one .mmc proof index file for *each* .mm file in the project. So if the .mm file is broken up, so too will the .mmc files.

Sounds good, assuming we have a splitting mechanism that is rationalized enough that we can actually handle files in isolation.  (Even C compiler authors are turning against textual inclusion these days.  http://clang.llvm.org/docs/Modules.html — I have my hopes.)
 
The gain here is not just the ~30% file size savings, although that is nice. The main point is that this file need not be stored in memory _at all_, regardless of how the proof data is compressed. Whenever a proof record is needed, you open the file, seek to the location of the data, then read the record and close the file. So that's a 100% savings on in-memory proof data storage (although if your memory statistics are correct, this may still not be a very big savings, since it's only about 70% of set.mm, compared to your 1000% data structure).

When I did the memory profiling on smm that came up with the 120MB number, the worst offender was large numbers of short string objects, specifically the "(" "A" "+" "B" ")" strings that came from first-stage tokenizing the math strings.  I'm pretty sure the memory usage could be cut substantially, first by using an intern table for tokens, second by more restricted caching of things like parse trees that can be locally constructed.  There are tradeoffs here; I was trying to cut every last centisecond from the verification time, and I reasoned that would add quite a few (although to be honest I never measured it, I could do that).
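
A tiny sketch of the intern-table idea: store each distinct math token once and let every statement hold references to the shared copy. (CPython already interns many short strings on its own, so the point here is the pattern, which matters more in a lower-level implementation.)

intern.py:
class Interner:
    def __init__(self):
        self.table = {}

    def intern(self, token):
        # Return the canonical copy of this token, creating it if needed.
        return self.table.setdefault(token, token)

interner = Interner()
stmt1 = [interner.intern(t) for t in '( A + B )'.split()]
stmt2 = [interner.intern(t) for t in '( B + A )'.split()]
# Both statements now share one object per distinct symbol.
assert stmt1[0] is stmt2[0]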

"Memory" versus "disk" is not necessarily the best way to conceptualize this, since we're really dealing with a hierarchy of caches.  We want to minimize the size of the auxiliary data structures that are constructed (to an extent), and the amount of ongoing work to maintain them.  We can't go around rebuilding all of the indexes on load and on every edit.  A very important part of the smm maintenance mode work will be developing a vocabulary of editing commands that can do useful changes without requiring all indexes to be rebuilt; I've got a few half-baked ideas here.  They'll have to respect tree structure and scope structure in a few ways.
 
If the file were split into sections, the mythical Metamath Two could automatically create interface cache files for each section.  Pre-tokenizing the _statements_ would be a substantial space win, and cache files that are automatically regenerated based on timestamps would add no complexity whatsoever to the maintenance and version control stories (after all, your compiler + make already does exactly that).

I think caching is a good idea, particularly if it can somehow be segmented so that the entire cache is not invalidated when any part changes. (This is yet another argument for splitting the file, so that you have multiple timestamps.) If you could get something similar to the in-memory conception of the database onto the disk as a cache, then you could speed up loading the database by an order of magnitude.

That's what I had in mind - splitting the database into 100 or so files with separate timestamps.  Don't go crazy though as nobody ever added transitive directory timestamps to UNIX and stat overhead can be a killer when you have 50,000+ files (see also: doing almost anything with said Linux checkout).

On second thought, I think the .mmc format should not contain the header, which is too brittle for version control. Instead, it would only have what I labeled "part 2", and the program would construct the binary tree or something like it in memory, maybe also outputting it as part of the cache, but not checked in to version control.

Absolutely version control should have a single point of truth.  If we decide to store MMC files on disk (absolute loading speed may not be important once we have a mature maintenance mode, so I could go either way on this), they should be .gitignored.
 
On Sun, Apr 17, 2016 at 11:35 PM, Norman Megill <n...@alum.mit.edu> wrote:
I'll mention that I am somewhat uncomfortable with binary formats, but as long as we can convert to ASCII 1.0 I guess it's tolerable.  For some things like label renames, a syntax-aware Metamath-specific tool is certainly preferred.  But no such tool can anticipate all the possible manipulations we may want to do to a .mm file.  Having the ability to process it with text-manipulating tools from sed to powerful text editor scripting has been extremely useful in the past, and that ability goes away with binary formats.  Unlike a computer program where editing can introduce hidden bugs, the ability to verify proofs makes .mm files relatively robust against editing mistakes.

I will emphasize that I am only talking about the stuff that is meaningless, sed-unprocessable "ASFAKLHRAWHLA" garbage already. I don't have any proposal for the .mm format any easier to use than the current one (besides taking the proof blocks out). Compressed proofs *are* binary data, even if they are currently being stored in a text file.

Once we have the chunking and indexing procedure resolved, I don't see major advantages to formats other than the current one.  Certainly there are better entropy coders that could be used for the compressed proof data, and we can get rid of most of the syntax steps while we're at it (I need to write and benchmark a unifying proof verifier.  I suspect that with a grammatical .mm file, it may be "fast enough", as the trees give unification a lot more to bite into.); but that doesn't save us much more than size at rest, which isn't that important to begin with.  A metamath editor won't need to maintain data structures for the contents of most proofs most of the time.
 
A computer program source file 28MB in size is hard to even imagine, and I wouldn't want to be responsible for its maintenance.  The fact that set.mm flourishes with no significant problems in spite of its size suggests that it has a fundamentally different character from a computer program.  The same principles of structure and modularity, intended to reduce bugs and increase maintainability, may or may not apply.

I don't buy this argument. It is quite possible to use a 28MB C file, it's not all that different from 28MB worth of separate C files as far as the compiler is concerned. And if you had to, and the tools did not have an issue with large files, it wouldn't be that bad for day-to-day work. But it's terrible code style. Not to put too fine a point on it, but I'm pretty sure that the main reason set.mm is one file today is because you think it's a good idea, which isn't actually an argument. Of course I presume that you have reasons for this decision, which are perfectly good arguments, but the mere fact that this is how it has been is not proof that it is the best way.

The largest legitimate (not a firmware dump, etc) single source file in my Linux checkout is 534 KB.  It's a matter of degree, not kind.
 
I think Stefan had the right idea here. This view forgets two important facts: (1) The size on disk is often quite different from the size in memory when read by a *syntax aware* tool. A file editor will probably not need much more space than the file itself because it makes no attempt to understand the data it is manipulating (which leads to subtle or not-so-subtle errors in maintenance mode). (2) Just because a file "fits" in memory (itself a fuzzy concept when you consider paging) does not mean that it can be *quickly* manipulated. This is essential for proof assistants like mmj2; I am starting to feel the pressure of scanning the database on each unify (which I do once every 3 seconds or so), where a 0.5 - 1 second delay is a pain point.

Worth noting that one of my big research plans for a proof tool is a fast highly-indexed search function.  I haven't worked out very many of the details yet but I've spent a fair amount of time looking at the algorithms used by Lucene et al, and I do think something can be usefully done, *but* managing memory use is going to be one of the hardest parts.
 
-sorear

Stefan O'Rear

unread,
Apr 19, 2016, 12:55:52 AM4/19/16
to Metamath
On Monday, April 18, 2016 at 3:08:17 AM UTC-7, fl wrote:
Frankly, I doubt it would be that difficult to transform mmj2 into an editor. Add another buffer
where the whole database is displayed, and allow saving the data to a file. And there you are.

I suspect if you did this you would discover that, as bad as commodity programmer's editors are at handling 28MB files, JTextArea is worse.

And don't even think about enabling the syntax highlighter. 

Mario Carneiro

unread,
Apr 19, 2016, 4:47:50 AM4/19/16
to metamath
On Tue, Apr 19, 2016 at 12:51 AM, Stefan O'Rear <stef...@cox.net> wrote:
What do we want to do in terms of platform issues?  I still feel like the JVM is a major liability, but I see a shortage of time continuing for the next few months so I'm not very excited to maintain my own system.

I've been investigating a possible replacement for metamath.exe written in Rust (which targets the same platforms as C). As long as you are running locally, it seems useful to just run as a command-line tool, like a compiler, which is invoked by actual text editors to do the heavy lifting. The biggest problem with this model is the large startup cost, which suggests a more server-oriented approach, in which the compiler runs as a service and updates its memory based on commands from the editor.
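
A bare-bones sketch of that service model, assuming a hypothetical line-oriented command protocol on stdin/stdout; load_database and verify are placeholders for the real work, and a real tool would want something richer (JSON-RPC, say):

mm_server.py:
import sys

def load_database(path):
    # Placeholder: parse and index the real .mm database here.
    return {'path': path}

def verify(db, label):
    # Placeholder: check the named proof against the in-memory database.
    return 'ok'

def main():
    db = None
    for line in sys.stdin:
        cmd, _, arg = line.strip().partition(' ')
        if cmd == 'load':
            db = load_database(arg)      # pay the startup cost once
            print('loaded', arg)
        elif cmd == 'verify':
            print(arg, verify(db, arg))  # later requests reuse the loaded db
        elif cmd == 'quit':
            break
        sys.stdout.flush()

if __name__ == '__main__':
    main()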
 
A variant of this that is forward compatible is to have a folder full of .mm file fragments, not expected to compile on their own, and one big index file, like so:

index.mm:
$[ fol\prop\axioms.mm $]  <-- thru sec 1.2.2
$[ fol\prop\imp.mm $]  <-- sec 1.2.3
$[ fol\prop\neg.mm $]  <-- sec 1.2.4
...
$[ zf\ax-ext.mm $]   <-- sec 2.1.1
$[ zf\cab.mm $]   <-- sec 2.1.2
...

This way, we have a single place to go for section reordering, and there is a lot of flexibility with file naming and folder structure. In fact, it might also be useful to put the section comments themselves in this index file, so that section organization (the stuff that Gerard Lang likes to work with) can be separated from the theorems.

That could work well.  At that point it would essentially be a project file; I feel like there's something _almost_ here that would be a big help to local work / handling subsets of the database, but I'm not quite sure what it is.

One idea I've been toying with is snapshotting the database after each file, so that you have a restart point when some file in the middle changes. Not sure if it's not simpler to just filter the full database and throw out everything after a given sequence number.
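
A sketch of the snapshot idea, assuming the in-memory database can be copied cheaply enough (which is of course the real question); read_file stands in for whatever actually parses one fragment:

snapshots.py:
import copy

def load_all(files, read_file):
    db, snapshots = {}, []
    for f in files:
        read_file(db, f)                     # mutate db with this file's statements
        snapshots.append(copy.deepcopy(db))  # restart point after each file
    return db, snapshots

def reload_from(files, read_file, snapshots, k):
    # File index k changed: restart from the snapshot taken after file k-1
    # and re-read only files k..n instead of the whole database.
    db = copy.deepcopy(snapshots[k - 1]) if k > 0 else {}
    for f in files[k:]:
        read_file(db, f)
    return db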

* All $c, $v, $f statements, and syntax axioms come at the beginning of the file. In other words, the entire database works under a single grammar. (It's not necessary that this actually be true, but rather that transplanting all the statements to the start doesn't break anything.)

Isn't this already true?

It's almost true; moving local $v's to the start causes duplicate $v errors, which you can fix by merging the duplicates, but local $f's might cause an irreparable conflict if the type of the variable is different in the two $f statements. A recent suggestion of mine for ensuring that local $f assignments always use the same type will resolve this.

But the other half of the story is that moving the grammar up means that more statements will be accepted, for example a propositional calc statement might make reference to a constant declared later (like CHOICE), which would parse "correctly" even though it is not valid. I'm not sure how to resolve this; probably it is best to just load the full grammar and then postprocess the parse to make sure it does not use any forward references. (The same problem applies to user scripts, which may have hardcoded references to theorems from later on.)
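
The postprocessing step could look something like this sketch: walk the parse tree and reject a statement if any syntax axiom it uses has a higher sequence number than the statement itself. The tuple-based tree shape and the example numbers are just assumptions.

forward_ref_check.py:
def uses_only_prior(node, seq_of_axiom, stmt_seq):
    # node is ('syntax_axiom_label', [children]) for nonterminals,
    # or a plain token string for leaves.
    if isinstance(node, str):
        return True
    label, children = node
    if seq_of_axiom[label] >= stmt_seq:
        return False    # forward reference, e.g. to a constant like CHOICE
    return all(uses_only_prior(c, seq_of_axiom, stmt_seq) for c in children)

# usage with made-up sequence numbers:
#   seq = {'wi': 10, 'wceq': 500}
#   tree = ('wi', ['ph', ('wceq', ['x', 'y'])])
#   uses_only_prior(tree, seq, stmt_seq=300)   # False: wceq is declared later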
 
"Memory" versus "disk" is not necessarily the best way to conceptualize this, since we're really dealing with a hierarchy of caches.  We want to minimize the size of the auxiliary data structures that are constructed (to an extent), and the amount of ongoing work to maintain them.  We can't go around rebuilding all of the indexes on load and on every edit.  A very important part of the smm maintenance mode work will be developing a vocabulary of editing commands that can do useful changes without requiring all indexes to be rebuilt; I've got a few half-baked ideas here.  They'll have to respect tree structure and scope structure in a few ways.

Alternatively, you can present the user with the raw .mm file, and diff any changes against the indexed version, then attempt to reverse engineer which statements are untouched and only rebuild the index for changed statements. Doing this quickly is tough, though, even assuming you are working with a reasonably sized .mm file fragment.
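
A rough sketch of that diff-and-reindex idea using difflib at statement granularity; splitting on '$.' ignores comments and ${ $} blocks, so a real implementation would need a proper statement splitter:

reindex_diff.py:
import difflib

def statements(text):
    # Crude statement splitter: good enough to illustrate the diff step.
    return [s.strip() + ' $.' for s in text.split('$.') if s.strip()]

def changed_statements(old_text, new_text):
    old, new = statements(old_text), statements(new_text)
    sm = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changed = []
    for tag, i1, i2, j1, j2 in sm.get_opcodes():
        if tag != 'equal':
            changed.extend(new[j1:j2])   # only these need fresh index entries
    return changed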