On 4/15/16 12:25 AM, Marnix Klooster wrote:
Hi all,
I'm not invested in any design direction here, but here is just one
remark from my side.
It seems to me that the goal here is to have some modularity and
information hiding, and for that one normally uses mechanisms like
scoping and modules, instead of human readable comments (in English even).
Therefore IMHO the current design should either be really lightweight,
to be replaced by a real 'module system' later; or the current design
should already be a step in the direction of said module system.
(Goals of such a module system would be to break up set.mm
<http://set.mm> in more decently sized pieces; and to allow people to
use, e.g., the logic part of set.mm <http://set.mm> without any set
theory or anything else.)
[...]
Getting back to Marnix's comment, our general philosophy (outside of the special cases above) has been to make available everything that came previously in order to achieve efficient proofs and provide the user with the most flexibility. If someone wishes to add to the logic theorems in set.mm, there is no reason not to make it part of set.mm itself, say in a mathbox, so that everyone will benefit.
[...]
As for modularity, in set.mm you will find commented out file includes for things like logic.mm. At one point I had several pieces, but they were a pain to keep in sync (without special software that didn't exist and still doesn't). I think anyone who has done major work on set.mm will agree that it is very convenient to have everything in one file for global label/symbol changes, moving sections around, etc. with a text editor. Eventually set.mm will have to be broken up, but we aren't at that point yet.
Actually, I've been brainstorming how to fix this, what the perfect metamath editor should look like. It's quite clear to me, at least, that we are not there, and having one 16MB file is a symptom of this.
If you want to change the format, I suggest that the theorems themselves be kept in one place. Maybe you may place the proofs somewhere else.
On 4/16/16 6:53 AM, Mario Carneiro wrote:
Actually, I've been brainstorming how to fix this, what the perfect
metamath editor should look like. It's quite clear to me, at least, that
we are not there, and having one 16MB file is a symptom of this. Most
file editors are just not designed for this, and it is not good to need
to read it all and store it in memory in order to get to a single
theorem.
I'm not sure why it's not good. On today's computers, 28MB is nothing compared to the available 4 to 8GB of memory. If an editor has trouble with a file of that size, maybe the editor design is not good. While I haven't checked https://en.wikipedia.org/wiki/Comparison_of_text_editors#Basic_features carefully, the currently supported ones I recognize handle several GB or more. Emacs supposedly handles 2EB (10^9 GB).
The main editor I use reads set.mm in well under 1 sec. When I search for a string at the end of the file it gets there essentially instantly. I just tried a search and replace of the 27K instances of '$p' with a larger string, and it completed before I could lift my finger from the mouse. When I tried browsing a 2.1GB ASCII file of a human genome, it became a little sluggish, taking 6 sec to find a string at the end of the file.
Edits to an .mm file almost always are local, touching only a
tiny tiny part of that giant file, and splitting it up is one way to
focus development a bit more.
Could you repeat the reasons why you think having set.mm <http://set.mm>
in one file is a good thing?
Simplicity. Speed. Lowest common denominator. Convenience.
1. Moving sections around is trivial if sections are in separate files;
it just means moving a file and possibly changing the dependencies.
Moving a theorem or range of theorems is just the same with
multiple files: just cut and paste.
How do you prevent undo/redo confusion when switching among various files?
2. Global renames are completely identical with multiple files; almost
all file editors have a "find/replace in files" feature, which would do
the trick.
What if we forget to load one of them into the editor?
Anyway, the main editor I use doesn't have this feature. I could change editors, but it would be asking me to sacrifice something I'm comfortable with.
Ideally this should be done by the aforementioned "perfect
metamath editor", though, so it can be more syntax-aware and only do the
replacement where it is syntactically valid.
The multiple-module feature you want could be done now with the $[ $] includes.
For those wanting a single file, the includes would be commented out.
For those wanting multiple files, the includes would be uncommented.
We'd have to agree on the commenting convention, but I imagine it would be similar to what is in set.mm now if you search for '$( $['. Conversion between single and multiple would of course be automated in metamath.exe or another program.
I think something like this should satisfy everyone.
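For concreteness (the file name here is purely illustrative), the single-file form would keep each include wrapped in a comment,

$( $[ logic.mm $] $)

and the multi-file form would uncomment it:

$[ logic.mm $]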
FYI the current plans on my to-do list for metamath.exe include a change to its data structure to allow any part of a statement to be modified internally.
Applications I have in mind are global label and math token renames, which would check for conflicts before applying and also update comment markup and htmldefs. This will prevent accidental changes to English words that might happen in an editor. Maybe I'll update the Recent Label Changes automatically.
As a general comment, I think it would be useful for Metamath 2.0 to be translatable to and from the current version. If 2.0 needs additional keywords for its features, maybe they could be hidden in special comments in the 1.0 version so the versions could be completely compatible. If 2.0 has, say, local scoping of things like labels, the translator could rename them for the 1.0 version with some standard suffix to make them unique. The tools we have for 1.0 (metamath.exe, numerous verifiers) may or may not be practical to rewrite for 2.0, but with perfect translatability they could continue to be used.
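For example (the syntax here is purely hypothetical), a 2.0-only directive might ride inside an ordinary comment that a 1.0 verifier skips over and the 2.0 translator recognizes:

$( $mm2 module logic $)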
Right, this is exactly what I have in mind. Doing this with a syntax aware tool allows you to perform these kinds of edits more reliably than a plain text editor.
Right, this is exactly what I have in mind. Doing this with a syntax aware tool allows you to perform these kinds of edits more reliably than a plain text editor.
But a syntax-aware tool already exists: mmj2. Its parser is very efficient and well designed. Now it is simple: use the syntax tree delivered by mmj2 to add syntax-aware routines.
Currently the search/replace functions in mmj2 are poor in my opinion. They deserve to be completely redesigned.
And all those problems are not related to the way the database is stored on a disk anyway.
> In fact, I question whether there are *any* maintenance activities that are appropriate for a plain text editor, with find and replace, which don't sometimes cause
> unforeseen mistakes.
And frankly, readable, text-based databases are never superseded by binary databases. Binary data can be useful for speed or other efficiency reasons, but it is always at the cost of poorer interfacing. I think computer engineers must spend, I would say, 80% of their time with text-based tools to manage data, systems and so on. Text-based databases have the peculiarity of interfacing well with an important element of a computer: human beings.
The quantity of errors introduced by a non-syntax-aware system is easily corrected, I think, especially with Metamath, where an error in the data is always caught by the checker (at least for the proofs and formulas).
On 4/17/16 6:24 AM, Mario Carneiro wrote:
A simple short term solution would be a tool to separate set.mm
<http://set.mm> by section, and daisy-chain the files together with
include statements. If this can be done in a perfectly reversible
manner, then we could pass between each form painlessly. That is, you
could take:
$( =-=- section 1 =-=- $)
Do you mean the existing section headers? If so, how would the user put multiple tiny sections in one file or alternately break a huge section into 2 files, or is that not possible/desirable with your scheme? What is the algorithm for mapping the string "section 1" to "section1.mm"? Or if you don't mean that, wouldn't it be clearer just to put the file name instead of "section 1"?
thm1 $p ...
$( =-=- section 2 =-=- $)
thm2 $p ...
$( =-=- section 3 =-=- $)
thm3 $p ...
from set.mm <http://set.mm>, and form the files section1.mm
<http://section1.mm>:
$( =-=- section 1 =-=- $)
thm1 $p ...
(Aside: Is there any way to turn off Google's insertion of spurious http links on file names?)
I totally agree. I think that full backwards compatibility is very
important for Metamath in general, so that proofs are supported long
into the future, and we retain the current simplicity that has
encouraged the existing proliferation of verifiers.
However, I am not certain that all features can be supported with the
current mm format, in particular the separation of proofs and theorem
statements, so I would relax this restriction to having a (fairly
simple) translation from 1.0 to and from 2.0.
Can you clarify this? I'm not opposed to hiding new 2.0 features in special 1.0 comments so that 2.0 could be recovered from it exactly. If we have perfect (lossless) translations back and forth between 1.0 and 2.0 (which is what I was trying to suggest), what are you relaxing?
I gathered statistics on how often each number, marked or not, is used in the compressed proofs, and tested what choices {a,b,c} lead to the
shortest overall size. The largest number in set.mm <http://set.mm> is
905, and 10% of steps are marked, with a nice exponential decay for both
groups.
Surprisingly, the best choices have c being incredibly small; the best
solution was {208, 46, 2} with 6084 KB to store all the compressed
numbers in set.mm <http://set.mm>. (For comparison, the current
compressed proof format uses roughly {20, 20, 5} which gives 9713 KB.)
My instinct is to go with a more balanced choice, such as {192, 56, 8} =
6118 KB, but I guess the fact that there is a loose upper bound on proof
lengths currently in force in set.mm <http://set.mm> means that numbers
never get so large that more than 3 digits is needed.
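(For reference, here is the rough accounting in Python, under the reading that a counts the terminal symbols for unmarked numbers, b the terminal symbols for marked numbers, and c the continuation symbols used bijectively like the current U-Y digits; it ignores the extra Z marker of the current format, which is part of why {20, 20, 5} is only "roughly" the current scheme. The step-number lists are placeholders.)

    def encoded_length(n, terminal, cont):
        # Characters needed for step number n >= 1: one terminal digit,
        # preceded by as many base-`cont` continuation digits as required.
        length, capacity = 1, terminal
        while n > capacity:
            n -= capacity
            length += 1
            capacity = terminal * cont ** (length - 1)
        return length

    def total_chars(unmarked, marked, a, b, c):
        # unmarked/marked are lists of the step numbers appearing in the proofs.
        return (sum(encoded_length(n, a, c) for n in unmarked)
                + sum(encoded_length(n, b, c) for n in marked))

    # e.g. compare total_chars(unmarked, marked, 208, 46, 2)
    #      with    total_chars(unmarked, marked, 20, 20, 5)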
Interesting. I'm curious - how well do these formats compress further with say bzip2 or xz?
On Sat, Apr 16, 2016 at 10:16 PM, Norman Megill <n...@alum.mit.edu> wrote:
On 4/16/16 6:53 AM, Mario Carneiro wrote:
Actually, I've been brainstorming how to fix this, what the perfect
metamath editor should look like. It's quite clear to me, at least, that
we are not there, and having one 16MB file is a symptom of this. Most
file editors are just not designed for this, and it is not good to need
to read it all and store it in memory in order to get to a single
theorem.
I'm not sure why it's not good. On today's computers, 28MB is nothing compared to the available 4 to 8GB of memory. If an editor has trouble with a file of that size, maybe the editor design is not good. While I haven't checked https://en.wikipedia.org/wiki/Comparison_of_text_editors#Basic_features carefully, the currently supported ones I recognize handle several GB or more. Emacs supposedly handles 2EB (10^9 GB).
The thing is that it's not just 28MB (wow, it's grown a lot since I last checked). An editor will likely need several times that in order to store not just the raw text, but also the internal processed form of the data. Remember that the "perfect Metamath editor" will not be a plain text editor, because that doesn't correctly reflect the structure of Metamath data.
One interesting data structure I'd like to focus on is the information needed by proof mode. Roughly, this is the MM file with proofs omitted. I just did a test, stubbing out all the proofs, and the result is about 12MB which is alarmingly large. Removing comments brings it down to 5.8MB. I think there is a good argument for separating this to a different file from the proofs, so that an editor need not read the proofs if one is only interested in the "context" of one proof. This structure is also useful as the context needed for proof HTML generation.
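(A quick-and-dirty way to reproduce that measurement, using regex substitution rather than a real .mm parser:)

    import re

    src = open("set.mm").read()

    # Stub out the proof bodies: everything between $= and $. becomes "?".
    no_proofs = re.sub(r"\$=.*?\$\.", "$= ? $.", src, flags=re.DOTALL)

    # Also drop comments $( ... $) to estimate the bare "context" size.
    no_comments = re.sub(r"\$\(.*?\$\)", "", no_proofs, flags=re.DOTALL)

    print(len(no_proofs) // 1024, "KB without proofs")
    print(len(no_comments) // 1024, "KB without proofs or comments")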
Edits to an .mm file almost always are local, touching only a
tiny tiny part of that giant file, and splitting it up is one way to
focus development a bit more.
Could you repeat the reasons why you think having set.mm <http://set.mm>
in one file is a good thing?
Simplicity. Speed. Lowest common denominator. Convenience.
I will certainly grant that it is easy to use a text editor for maintenance activities, when it can be done. The major drawback is that there are certain maintenance activities that can't be done with find/replace, and others that "almost" work but need some cleanup afterwards. A proper tool to do these refactorings in a correctness-preserving manner is warranted.
Some editors do shared undo, and usually a theorem move only involves one or two files anyway. But I think the correct solution here is just to commit often, and use the commits as cross-file undo checkpoints.
Here's a simple proposal for separating the proofs: take the proofs out of the .mm file, so that a $p statement takes the simple form:
> I believe Stefan has already used such a tool for the set.mm-history
> project.
Could Stefan report the format he uses, and any other opinions he might
have on what our "official" convention should be?
Interesting. I'm curious - how well do these formats compress further
with say bzip2 or xz?
You are right that there is no easy way to build section names appropriate as file names, since currently we don't have any short naming system for sections that would naturally have already developed if we used multiple files (notice that existing split attempts have such short names, like logic.mm, dedekind.mm instead of "Dedekind-cut construction of real and complex numbers"). We may want to consider a format for annotating short names in the section headers. Other options include using the section numbering (which is sensitive to insertion/deletion), or using the news website approach: take the English title, lowercase everything, replace all non-alphanumerics with hyphen, then collapse multiple consecutive hyphens and cut it off if it is longer than some upper bound, say 50 chars.
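(The last option is easy to mechanize; a sketch, with the 50-character cap as the arbitrary bound mentioned:)

    import re

    def section_slug(title, max_len=50):
        # Lowercase, replace runs of non-alphanumerics with a single hyphen,
        # trim stray hyphens, and truncate to the length bound.
        slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
        return slug[:max_len]

    # section_slug("Dedekind-cut construction of real and complex numbers")
    #   -> "dedekind-cut-construction-of-real-and-complex-numb"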
Haha, you have it bad because Google turns these names into links, which your plaintext email client turns into "a <b>" format, which Google turns into links... I bet if we keep passing this quote around it will blow up exponentially.
Simplicity is a big deal for me. I made up a proposal for a complicated namespacing system when I first joined; I forget if I ever shared it but I no longer support it.
Straw proposal that I _could_ get behind today: replace the single .mm file with a directory of files, which are read in alphabetical order. Tools which don't understand the directory design can be supported with a very simple concatenation and splitting program; the concatenation, in particular, is a 1-liner in UNIX shell.
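(The concatenation direction, assuming a hypothetical sections/ directory, is essentially `cat sections/*.mm > set.mm`; in Python, preserving the alphabetical read order:)

    from pathlib import Path

    # Stitch the fragments back into a single database, in alphabetical order.
    parts = sorted(Path("sections").glob("*.mm"))
    Path("set.mm").write_text("".join(p.read_text() for p in parts))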
the database should be a database, not a program.
Here's a simple proposal for separating the proofs: take the proofs out of the .mm file, so that a $p statement takes the simple form:
Not a big fan of separating the proofs. Seems to add a lot of complexity for a gain that will be erased by just a year or two of growth. Splitting the database into _many_ files can scale.
If the file were split into sections, the mythical Metamath Two could automatically create interface cache files for each section. Pre-tokenizing the _statements_ would be a substantial space win, and cache files that are automatically regenerated based on timestamps would add no complexity whatsoever to the maintenance and version control stories (after all, your compiler + make already does exactly that).
I'll mention that I am somewhat uncomfortable with binary formats, but as long as we can convert to ASCII 1.0 I guess it's tolerable. For some things like label renames, a syntax-aware Metamath-specific tool is certainly preferred. But no such tool can anticipate all the possible manipulations we may want to do to a .mm file. Having the ability to process it with text-manipulating tools from sed to powerful text editor scripting has been extremely useful in the past, and that ability goes away with binary formats. Unlike a computer program where editing can introduce hidden bugs, the ability to verify proofs makes .mm files relatively robust against editing mistakes.
A computer program source file 28MB in size is hard to even imagine, and I wouldn't want to be responsible for its maintenance. The fact that set.mm flourishes with no significant problems in spite of its size suggests that it has a fundamentally different character from a computer program. The same principles of structure and modularity, intended to reduce bugs and increase maintainability, may or may not apply.
set.mm is also quite different from an ordinary SQL database, which could be riddled with unverifiable data entry errors that can only partially be checked for consistency (such as correct zip code). I don't know what category of data object set.mm would be classified as, but it seems different from the usual ones of our experience.
As for file size, of course smaller files are preferable as a general principle, but I don't think it is worth giving up flexibility. I still don't understand the concern with our present file sizes, where set.mm takes less than 1% of available memory on most computers. (BTW my text editor uses an additional 13MB to work with the 28MB set.mm, growing up to a user-specified limit for the undo stack.) I know you are trying to anticipate future scalability, and theoretical studies of different proof compression methods are certainly interesting, but it will be quite a while before saving 30% of the size will make a noticeable difference.
Just because an error is recognized doesn't mean it's easy to fix. It
is quite easy to cause thousands of errors with a simple change in one
spot, and syntax aware editors can automatically resolve these
downstream errors in many cases.
What is an example of an error introduced by a careless edit that is hard to fix? (Personally, during an editing session I usually save the file frequently and run the fast 'verify proof */syntax_only'.)
On Sun, Apr 17, 2016 at 8:53 PM, Stefan O'Rear <stef...@cox.net> wrote:
Simplicity is a big deal for me. I made up a proposal for a complicated namespacing system when I first joined; I forget if I ever shared it but I no longer support it.
Straw proposal that I _could_ get behind today: replace the single .mm file with a directory of files, which are read in alphabetical order. Tools which don't understand the directory design can be supported with a very simple concatenation and splitting program; the concatenation, in particular, is a 1-liner in UNIX shell.
I like this general approach, although the alphabetic thing will require indexing in section names, which means a lot of shuffling around for insertions.
A variant of this that is forward compatible is to have a folder full of .mm file fragments, not expected to compile on their own, and one big index file, like so:
This way, we have a single place to go for section reordering, and there is a lot of flexibility with file naming and folder structure. In fact, it might also be useful to put the section comments themselves in this index file, so that section organization (the stuff that Gerard Lang likes to work with) can be separated from the theorems.
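(Hypothetically, with made-up file names, such an index could just carry the section comments and chain the includes, staying readable by existing verifiers:)

$( =-=- Propositional calculus =-=- $)
$[ logic/prop.mm $]
$( =-=- Predicate calculus =-=- $)
$[ logic/pred.mm $]
$( =-=- ZF set theory =-=- $)
$[ settheory/zf.mm $]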
the database should be a database, not a program.
I think this bears repeating. A good portion of the complexity of manipulating an .mm file is caused by all the possible changes to the parser state that can occur while reading the file. That is, it is difficult to check a proof in a second pass because you have to remember all the changes to global state that have "since" happened from directives later in the file, and undo them using sequence number filters and so on. This design is easy for dynamic work, i.e. verifiers, but makes static analysis (syntax aware maintenance mode) more complex.
A simple "optimization" that simplifies static analysis:
* All $c, $v, $f statements, and syntax axioms come at the beginning of the file. In other words, the entire database works under a single grammar. (It's not necessary that this actually be true, but rather that transplanting all the statements to the start doesn't break anything.)
Luckily we don't have to worry about theorems being hidden to the point of inaccessibility, followed by name reuse, or else that would also be on this list.
Here's a simple proposal for separating the proofs: take the proofs out of the .mm file, so that a $p statement takes the simple form:
Not a big fan of separating the proofs. Seems to add a lot of complexity for a gain that will be erased by just a year or two of growth. Splitting the database into _many_ files can scale.
It may not have been obvious, but the proposal was actually suggesting one .mmc proof index file for *each* .mm file in the project. So if the .mm file is broken up, so too will the .mmc files.
The gain here is not just the ~30% file size savings, although that is nice. The main point is that this file need not be stored in memory _at all_, regardless of how the proof data is compressed. Whenever a proof record is needed, you open the file, seek to the location of the data, then read the record and close the file. So that's a 100% savings on in-memory proof data storage (although if your memory statistics are correct, this may still not be a very big savings, since it's only about 70% of set.mm, compared to your 1000% data structure).
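(The access pattern, sketched in Python with a hypothetical layout where a small in-memory table maps each label to a byte offset and length in the .mmc file:)

    def read_proof(mmc_path, offsets, label):
        # offsets: {label: (byte_offset, byte_length)}, built once from the index.
        start, length = offsets[label]
        with open(mmc_path, "rb") as f:
            f.seek(start)
            return f.read(length)   # only this one record ever touches memory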
If the file were split into sections, the mythical Metamath Two could automatically create interface cache files for each section. Pre-tokenizing the _statements_ would be a substantial space win, and cache files that are automatically regenerated based on timestamps would add no complexity whatsoever to the maintenance and version control stories (after all, your compiler + make already does exactly that).
I think caching is a good idea, particularly if it can somehow be segmented so that the entire cache is not invalidated when any part changes. (This is yet another argument for splitting the file, so that you have multiple timestamps.) If you could get something similar to the in-memory conception of the database onto the disk as a cache, then you could speed up loading the database by an order of magnitude.
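(Per-section, make-style regeneration is simple enough; a sketch with hypothetical helper names:)

    import os, pickle

    def load_section_cache(src, cache, build):
        # Rebuild the cache only if the .mm fragment is newer than the cache file.
        if not os.path.exists(cache) or os.path.getmtime(cache) < os.path.getmtime(src):
            data = build(src)          # e.g. pre-tokenized statements for this section
            with open(cache, "wb") as f:
                pickle.dump(data, f)
            return data
        with open(cache, "rb") as f:
            return pickle.load(f)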
On second thought, I think the .mmc format should not contain the header, which is too brittle for version control. Instead, it would only have what I labeled "part 2", and the program would construct the binary tree or something like it in memory, maybe also outputting it as part of the cache, but not checked in to version control.
On Sun, Apr 17, 2016 at 11:35 PM, Norman Megill <n...@alum.mit.edu> wrote:
I'll mention that I am somewhat uncomfortable with binary formats, but as long as we can convert to ASCII 1.0 I guess it's tolerable. For some things like label renames, a syntax-aware Metamath-specific tool is certainly preferred. But no such tool can anticipate all the possible manipulations we may want to do to a .mm file. Having the ability to process it with text-manipulating tools from sed to powerful text editor scripting has been extremely useful in the past, and that ability goes away with binary formats. Unlike a computer program where editing can introduce hidden bugs, the ability to verify proofs makes .mm files relatively robust against editing mistakes.
I will emphasize that I am only talking about the stuff that is meaningless, sed-unprocessable "ASFAKLHRAWHLA" garbage already. I don't have any proposal for the .mm format any easier to use than the current one (besides taking the proof blocks out). Compressed proofs *are* binary data, even if they are currently being stored in a text file.
A computer program source file 28MB in size is hard to even imagine, and I wouldn't want to be responsible for its maintenance. The fact that set.mm flourishes with no significant problems in spite of its size suggests that it has a fundamentally different character from a computer program. The same principles of structure and modularity, intended to reduce bugs and increase maintainability, may or may not apply.
I don't buy this argument. It is quite possible to use a 28MB C file; it's not all that different from 28MB worth of separate C files as far as the compiler is concerned. And if you had to, and the tools did not have an issue with large files, it wouldn't be that bad for day-to-day work. But it's terrible code style. Not to put too fine a point on it, but I'm pretty sure that the main reason set.mm is one file today is because you think it's a good idea, which isn't actually an argument. Of course I presume that you have reasons for this decision, which are perfectly good arguments, but the mere fact that this is how it has been is not proof that it is the best way.
I think Stefan had the right idea here. This view forgets two important facts: (1) The size on disk is often quite different from the size in memory when read by a *syntax aware* tool. A file editor will probably not need much more space than the file itself because it makes no attempt to understand the data it is manipulating (which leads to subtle or not-so-subtle errors in maintenance mode). (2) Just because a file "fits" in memory (itself a fuzzy concept when you consider paging) does not mean that it can be *quickly* manipulated. This is essential for proof assistants like mmj2; I am starting to feel the pressure of scanning the database on each unify (which I do once every 3 seconds or so), where a 0.5 - 1 second delay is a pain point.
Frankly, I doubt it is so difficult to transform mmj2 into an editor. Add another buffer
where the whole database is displayed, allow saving the data into a file, and there you are.
What do we want to do in terms of platform issues? I still feel like the JVM is a major liability, but I see a shortage of time continuing for the next few months so I'm not very excited to maintain my own system.
A variant of this that is forward compatible is to have a folder full of .mm file fragments, not expected to compile on their own, and one big index file, like so:
This way, we have a single place to go for section reordering, and there is a lot of flexibility with file naming and folder structure. In fact, it might also be useful to put the section comments themselves in this index file, so that section organization (the stuff that Gerard Lang likes to work with) can be separated from the theorems.
That could work well. At that point it would essentially be a project file; I feel like there's something _almost_ here that would be a big help to local work / handling subsets of the database, but I'm not quite sure what it is.
* All $c, $v, $f statements, and syntax axioms come at the beginning of the file. In other words, the entire database works under a single grammar. (It's not necessary that this actually be true, but rather that transplanting all the statements to the start doesn't break anything.)
Isn't this already true?
"Memory" versus "disk" is not necessarily the best way to conceptualize this, since we're really dealing with a hierarchy of caches. We want to minimize the size of the auxiliary data structures that are constructed (to an extent), and the amount of ongoing work to maintain them. We can't go around rebuilding all of the indexes on load and on every edit. A very important part of the smm maintenance mode work will be developing a vocabulary of editing commands that can do useful changes without requiring all indexes to be rebuilt; I've got a few half-baked ideas here. They'll have to respect tree structure and scope structure in a few ways.