On 4/15/16 12:25 AM, Marnix Klooster wrote:
Hi all,
I'm not invested in any design direction here, but here is just one
remark from my side.
It seems to me that the goal here is to have some modularity and
information hiding, and for that one normally uses mechanisms like
scoping and modules, instead of human readable comments (in English even).
Therefore IMHO the current design should either be really lightweight,
to be replaced by a real 'module system' later; or the current design
should already be a step in the direction of said module system.
(Goals of such a module system would be to break up set.mm
into more decently sized pieces; and to allow people to
use, e.g., the logic part of set.mm without any set
theory or anything else.)
[...]
Getting back to Marnix's comment, our general philosophy (outside of the special cases above) has been to make available everything that came previously in order to achieve efficient proofs and provide the user with the most flexibility. If someone wishes to add to the logic theorems in set.mm, there is no reason not to make it part of set.mm itself, say in a mathbox, so that everyone will benefit.
[...]
As for modularity, in set.mm you will find commented out file includes for things like logic.mm. At one point I had several pieces, but they were a pain to keep in sync (without special software that didn't exist and still doesn't). I think anyone who has done major work on set.mm will agree that it is very convenient to have everything in one file for global label/symbol changes, moving sections around, etc. with a text editor. Eventually set.mm will have to be broken up, but we aren't at that point yet.
Actually, I've been brainstorming how to fix this, what the perfect metamath editor should look like. It's quite clear to me, at least, that we are not there, and having one 16MB file is a symptom of this.
If you want to change the format, I suggest that the theorems themselves are kept in one place. Maybe you may place the proofs somewhere else.
On 4/16/16 6:53 AM, Mario Carneiro wrote:
Actually, I've been brainstorming how to fix this, what the perfect
metamath editor should look like. It's quite clear to me, at least, that
we are not there, and having one 16MB file is a symptom of this. Most
file editors are just not designed for this, and it is not good to need
to read it all and store it in memory in order to get to a single
theorem.
I'm not sure why it's not good. On today's computers, 28MB is nothing compared to the available 4 to 8GB of memory. If an editor has trouble with a file of that size, maybe the editor design is not good. While I haven't checked https://en.wikipedia.org/wiki/Comparison_of_text_editors#Basic_features carefully, the currently supported ones I recognize handle several GB or more. Emacs supposedly handles 2EB (10^9 GB).
The main editor I use reads set.mm in well under 1 sec. When I search for a string at the end of the file it gets there essentially instantly. I just tried a search and replace of the 27K instances of '$p' with a larger string, and it completed before I could lift my finger from the mouse. When I tried browsing a 2.1GB ASCII file of a human genome, it became a little sluggish, taking 6 sec to find a string at the end of the file.
Edits to an .mm file almost always are local, touching only a
tiny tiny part of that giant file, and splitting it up is one way to
focus development a bit more.
Could you repeat the reasons why you think having set.mm
in one file is a good thing?
Simplicity. Speed. Lowest common denominator. Convenience.
1. Moving sections around is trivial if sections are in separate files:
it just means moving a file and possibly changing the dependencies.
Moving a theorem or range of theorems works the same with multiple
files: just cut and paste.
How do you prevent undo/redo confusion when switching among various files?
2. Global renames are completely identical with multiple files; almost
all file editors have a "find/replace in files" feature, which would do
the trick.
What if we forget to load one of them into the editor?
Anyway, the main editor I use doesn't have this feature. I could change editors, but it would be asking me to sacrifice something I'm comfortable with.
Ideally this should be done by the aforementioned "perfect
metamath editor", though, so it can be more syntax-aware and only do the
replacement where it is syntactically valid.
The multiple-module feature you want could be done now with the $[ $] includes.
For those wanting a single file, the includes would be commented out.
For those wanting multiple files, the includes would be uncommented.
We'd have to agree on the commenting convention, but I imagine it would be similar to what is in set.mm now if you search for '$( $['. Conversion between single and multiple would of course be automated in metamath.exe or another program.
I think something like this should satisfy everyone.
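For concreteness, here is a rough sketch in Python of how the automated single-file/multi-file conversion might toggle the includes. The comment convention assumed here (`$( $[ logic.mm $] $)` for a commented-out include) is an assumption based on what currently appears in set.mm, not an agreed standard:

```python
import re

def uncomment_includes(text):
    """Turn commented-out includes like '$( $[ logic.mm $] $)'
    into active ones: '$[ logic.mm $]' (multi-file form)."""
    return re.sub(r'\$\(\s*(\$\[\s*\S+\s*\$\])\s*\$\)', r'\1', text)

def comment_includes(text):
    """Inverse direction: wrap active includes back into comments
    (single-file form). Assumes includes are not already commented."""
    return re.sub(r'(\$\[\s*\S+\s*\$\])', r'$( \1 $)', text)
```

A real tool (in metamath.exe or elsewhere) would also need to splice the included file contents in and out; this only shows the comment-toggling convention itself.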
FYI the current plans on my to-do list for metamath.exe include a change to its data structure to allow any part of a statement to be modified internally.
Applications I have in mind are global label and math token renames, checking for conflicts before applying and also updating comment markup and htmldefs. This will prevent accidental changes to English words that might happen in an editor. Maybe I'll update the Recent Label Changes automatically.
As a general comment, I think it would be useful for Metamath 2.0 to be translatable to and from the current version. If 2.0 needs additional keywords for its features, maybe they could be hidden in special comments in the 1.0 version so the versions could be completely compatible. If 2.0 has say local scoping of things like labels, the translator could rename them for the 1.0 version with some standard suffix to make them unique. The tools we have for 1.0 (metamath.exe, numerous verifiers) may or may not be practical to rewrite for 2.0, but with perfect translatability they could continue to be used.
Right, this is exactly what I have in mind. Doing this with a syntax aware tool allows you to perform these kinds of edits more reliably than a plain text editor.
Right, this is exactly what I have in mind. Doing this with a syntax aware tool allows you to perform these kinds of edits more reliably than a plain text editor.
But a syntax-aware tool exists: it is mmj2. Its parser is very efficient and well designed. Now it is simple: use the syntax tree delivered by mmj2 to add syntax-aware routines.
Currently the search/replace functions in mmj2 are poor in my opinion. They deserve to be completely redesigned.
And all those problems are not related to the way the database is stored on a disk anyway.
> In fact, I question whether there are *any* maintenance activities that are appropriate for a plain text editor, with find and replace, which don't sometimes cause
> unforeseen mistakes.
And frankly, readable, text-based databases are never superseded by binary databases. Binary data can be useful for speed or other efficiency issues, but it is always at the cost of poorer interfacing. I think computer engineers must spend, I would say, 80% of their time with text-based tools to manage data, systems and so on. Text-based databases have the peculiarity of interfacing well with an important element of a computer: human beings.
The quantity of errors introduced by a non-syntax-aware system is easily corrected, I think, especially with Metamath, where an error in the data is always caught by the checker (at least for the proofs and formulas).
On 4/17/16 6:24 AM, Mario Carneiro wrote:
A simple short term solution would be a tool to separate set.mm
by section, and daisy-chain the files together with
include statements. If this can be done in a perfectly reversible
manner, then we could pass between each form painlessly. That is, you
could take:
$( =-=- section 1 =-=- $)
Do you mean the existing section headers? If so, how would the user put multiple tiny sections in one file or alternately break a huge section into 2 files, or is that not possible/desirable with your scheme? What is the algorithm for mapping the string "section 1" to "section1.mm"? Or if you don't mean that, wouldn't it be clearer just to put the file name instead of "section 1"?
thm1 $p ...
$( =-=- section 2 =-=- $)
thm2 $p ...
$( =-=- section 3 =-=- $)
thm3 $p ...
from set.mm, and form the files section1.mm:
$( =-=- section 1 =-=- $)
thm1 $p ...
(Aside: Is there any way to turn off Google's insertion of spurious http links on file names?)
I totally agree. I think that full backwards compatibility is very
important for Metamath in general, so that proofs are supported long
into the future, and we retain the current simplicity that has
encouraged the existing proliferation of verifiers.
However, I am not certain that all features can be supported with the
current mm format, in particular the separation of proofs and theorem
statements, so I would relax this restriction to having a (fairly
simple) translation from 1.0 to and from 2.0.
Can you clarify this? I'm not opposed to hiding new 2.0 features in special 1.0 comments so that 2.0 could be recovered from it exactly. If we have perfect (lossless) translations back and forth between 1.0 and 2.0 (which is what I was trying to suggest), what are you relaxing?
I gathered statistics on how often each number, marked or not, is used in the compressed proofs, and tested what choices {a,b,c} lead to the
shortest overall size. The largest number in set.mm is
905, and 10% of steps are marked, with a nice exponential decay for both
groups.
Surprisingly, the best choices have c being incredibly small; the best
solution was {208, 46, 2} with 6084 KB to store all the compressed
numbers in set.mm. (For comparison, the current
compressed proof format uses roughly {20, 20, 5} which gives 9713 KB.)
My instinct is to go with a more balanced choice, such as {192, 56, 8} =
6118 KB, but I guess the fact that there is a loose upper bound on proof
lengths currently in force in set.mm means that numbers
never get so large that more than 3 digits is needed.
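For reference, the current compressed proof format writes each step number with a final base-20 digit from A-T, preceded by base-5 digits from U-Y (Z marks a step for later reuse); my reading is that the {a,b,c} parameters above generalize this alphabet split. A sketch of the standard encoding:

```python
def encode(n):
    """Encode a positive integer in the current Metamath
    compressed-proof style: leading base-5 digits drawn from U-Y,
    then a final base-20 digit drawn from A-T."""
    assert n >= 1
    s = chr(ord('A') + (n - 1) % 20)          # low (final) digit
    n = (n - 1) // 20
    while n > 0:
        s = chr(ord('U') + (n - 1) % 5) + s   # high digits, most significant first
        n = (n - 1) // 5
    return s
```

So steps 1-20 take one character, 21-120 take two, and so on; tuning the alphabet split changes where those breakpoints fall.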
Interesting. I'm curious - how well do these formats compress further with say bzip2 or xz?
On Sat, Apr 16, 2016 at 10:16 PM, Norman Megill <n...@alum.mit.edu> wrote:
On 4/16/16 6:53 AM, Mario Carneiro wrote:
Actually, I've been brainstorming how to fix this, what the perfect
metamath editor should look like. It's quite clear to me, at least, that
we are not there, and having one 16MB file is a symptom of this. Most
file editors are just not designed for this, and it is not good to need
to read it all and store it in memory in order to get to a single
theorem.
I'm not sure why it's not good. On today's computers, 28MB is nothing compared to the available 4 to 8GB of memory. If an editor has trouble with a file of that size, maybe the editor design is not good. While I haven't checked https://en.wikipedia.org/wiki/Comparison_of_text_editors#Basic_features carefully, the currently supported ones I recognize handle several GB or more. Emacs supposedly handles 2EB (10^9 GB).
The thing is that it's not just 28MB (wow, it's grown a lot since I last checked). An editor will likely need several times that in order to store not just the raw text, but also the internal processed form of the data. Remember that the "perfect Metamath editor" will not be a plain text editor, because that doesn't correctly reflect the structure of Metamath data.
One interesting data structure I'd like to focus on is the information needed by proof mode. Roughly, this is the MM file with proofs omitted. I just did a test, stubbing out all the proofs, and the result is about 12MB which is alarmingly large. Removing comments brings it down to 5.8MB. I think there is a good argument for separating this to a different file from the proofs, so that an editor need not read the proofs if one is only interested in the "context" of one proof. This structure is also useful as the context needed for proof HTML generation.
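The proof-stubbing experiment above can be approximated with a one-regex sketch. Note the stated assumption: this naive version would also munge a `$=` that happened to appear inside a `$( ... $)` comment, which a robust tool must avoid:

```python
import re

def stub_proofs(mm_text):
    """Replace each proof body ($= ... $.) with '$= ? $.', leaving
    statements and everything else untouched. Assumes '$=' never
    occurs inside comments (unsafe in general)."""
    return re.sub(r'\$=.*?\$\.', '$= ? $.', mm_text, flags=re.DOTALL)
```

The result is roughly the "context" file described: everything proof mode needs, minus the proofs themselves.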
Edits to an .mm file almost always are local, touching only a
tiny tiny part of that giant file, and splitting it up is one way to
focus development a bit more.
Could you repeat the reasons why you think having set.mm
in one file is a good thing?
Simplicity. Speed. Lowest common denominator. Convenience.
I will certainly grant that it is easy to use a text editor for maintenance activities, when it can be done. The major drawback is that there are certain maintenance activities that can't be done with find/replace, and others that "almost" work but need some cleanup afterwards. A proper tool to do these refactorings in a correctness-preserving manner is warranted.
Some editors do shared undo, and usually a theorem move only involves one or two files anyway. But I think the correct solution here is just to commit often, and use the commits as cross-file undo checkpoints.
Here's a simple proposal for separating the proofs: take the proofs out of the .mm file, so that a $p statement takes the simple form:
> I believe Stefan has already used such a tool for the set.mm-history
> project.
Could Stefan report the format he uses, and any other opinions he might
have on what our "official" convention should be?
Interesting. I'm curious - how well do these formats compress further
with say bzip2 or xz?
You are right that there is no easy way to build section names appropriate as file names, since currently we don't have any short naming system for sections that would naturally have already developed if we used multiple files (notice that existing split attempts have such short names, like logic.mm, dedekind.mm instead of "Dedekind-cut construction of real and complex numbers"). We may want to consider a format for annotating short names in the section headers. Other options include using the section numbering (which is sensitive to insertion/deletion), or using the news website approach: take the English title, lowercase everything, replace all non-alphanumerics with hyphen, then collapse multiple consecutive hyphens and cut it off if it is longer than some upper bound, say 50 chars.
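The news-website naming rule just described is easy to pin down in code; this sketch follows the steps as stated (lowercase, replace non-alphanumerics with hyphens, collapse runs, cut at 50 characters):

```python
import re

def slugify(title, maxlen=50):
    """Derive a short file-name-friendly slug from a section title."""
    s = title.lower()
    s = re.sub(r'[^a-z0-9]+', '-', s)  # also collapses hyphen runs
    s = s.strip('-')
    return s[:maxlen].rstrip('-')
```

For example, "Dedekind-cut construction of real and complex numbers" comes out as a single hyphenated string truncated at the 50-character bound.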
Haha, you have it bad because Google turns these names into links, which your plaintext email client turns into "a <b>" format, which Google turns into links... I bet if we keep passing this quote around it will blow up exponentially.
Simplicity is a big deal for me. I made up a proposal for a complicated namespacing system when I first joined; I forget if I ever shared it but I no longer support it.
Straw proposal that I _could_ get behind today: replace the single .mm file with a directory of files, which are read in alphabetical order. Tools which don't understand the directory design can be supported with a very simple concatenation and splitting program; the concatenation, in particular, is a 1-liner in UNIX shell.
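The shell one-liner would be something like `cat db/*.mm > set.mm`; the same alphabetical-order concatenation as a small Python sketch (directory and file names here are hypothetical):

```python
import glob

def concat_database(dirname, out):
    """Concatenate .mm fragments in alphabetical order into one file,
    mirroring 'cat dirname/*.mm > out'."""
    parts = sorted(glob.glob(dirname + "/*.mm"))
    with open(out, "w") as f:
        for p in parts:
            with open(p) as frag:
                f.write(frag.read())
```

The splitting direction needs slightly more thought (where to cut), which is where the section-header naming question above comes in.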
the database should be a database, not a program.
Here's a simple proposal for separating the proofs: take the proofs out of the .mm file, so that a $p statement takes the simple form:
Not a big fan of separating the proofs. Seems to add a lot of complexity for a gain that will be erased by just a year or two of growth. Splitting the database into _many_ files can scale.
If the file were split into sections, the mythical Metamath Two could automatically create interface cache files for each section. Pre-tokenizing the _statements_ would be a substantial space win, and cache files that are automatically regenerated based on timestamps would add no complexity whatsoever to the maintenance and version control stories (after all, your compiler + make already does exactly that).
I'll mention that I am somewhat uncomfortable with binary formats, but as long as we can convert to ASCII 1.0 I guess it's tolerable. For some things like label renames, a syntax-aware Metamath-specific tool is certainly preferred. But no such tool can anticipate all the possible manipulations we may want to do to a .mm file. Having the ability to process it with text-manipulating tools from sed to powerful text editor scripting has been extremely useful in the past, and that ability goes away with binary formats. Unlike a computer program where editing can introduce hidden bugs, the ability to verify proofs makes .mm files relatively robust against editing mistakes.
A computer program source file 28MB in size is hard to even imagine, and I wouldn't want to be responsible for its maintenance. The fact that set.mm flourishes with no significant problems in spite of its size suggests that it has a fundamentally different character from a computer program. The same principles of structure and modularity, intended to reduce bugs and increase maintainability, may or may not apply.
set.mm is also quite different from an ordinary SQL database, which could be riddled with unverifiable data entry errors that can only partially be checked for consistency (such as correct zip code). I don't know what category of data object set.mm would be classified as, but it seems different from the usual ones of our experience.
As for file size, of course smaller files are preferable as a general principle, but I don't think it is worth giving up flexibility. I still don't understand the concern with our present file sizes, where set.mm takes less than 1% of available memory on most computers. (BTW my text editor uses an additional 13MB to work with the 28MB set.mm, growing up to a user-specified limit for the undo stack.) I know you are trying to anticipate future scalability, and theoretical studies of different proof compression methods is certainly interesting, but it will be quite a while before saving 30% of the size will make a noticeable difference.
Just because an error is recognized doesn't mean it's easy to fix. It
is quite easy to cause thousands of errors with a simple change in one
spot, and syntax aware editors can automatically resolve these
downstream errors in many cases.
What is an example of an error introduced by a careless edit that is hard to fix? (Personally, during an editing session I usually save the file frequently and run the fast 'verify proof */syntax_only'.)
On Sun, Apr 17, 2016 at 8:53 PM, Stefan O'Rear <stef...@cox.net> wrote:
Simplicity is a big deal for me. I made up a proposal for a complicated namespacing system when I first joined; I forget if I ever shared it but I no longer support it.
Straw proposal that I _could_ get behind today: replace the single .mm file with a directory of files, which are read in alphabetical order. Tools which don't understand the directory design can be supported with a very simple concatenation and splitting program; the concatenation, in particular, is a 1-liner in UNIX shell.
I like this general approach, although the alphabetic thing will require indexing in section names, which means a lot of shuffling around for insertions.
A variant of this that is forward compatible is to have a folder full of .mm file fragments, not expected to compile on their own, and one big index file, like so:
This way, we have a single place to go for section reordering, and there is a lot of flexibility with file naming and folder structure. In fact, it might also be useful to put the section comments themselves in this index file, so that section organization (the stuff that Gerard Lang likes to work with) can be separated from the theorems.
the database should be a database, not a program.
I think this bears repeating. A good portion of the complexity of manipulating an .mm file is caused by all the possible changes to the parser state that can occur while reading the file. That is, it is difficult to check a proof in a second pass because you have to remember all the changes to global state that have "since" happened from directives later in the file, and undo them using sequence number filters and so on. This design is easy for dynamic work, i.e. verifiers, but makes static analysis (syntax aware maintenance mode) more complex.
A simple "optimization" that simplifies static analysis:
* All $c, $v, $f statements, and syntax axioms come at the beginning of the file. In other words, the entire database works under a single grammar. (It's not necessary that this actually be true, but rather that transplanting all the statements to the start doesn't break anything.)
Luckily we don't have to worry about theorems being hidden to the point of inaccessibility, followed by name reuse, or else that would also be on this list.
Here's a simple proposal for separating the proofs: take the proofs out of the .mm file, so that a $p statement takes the simple form:
Not a big fan of separating the proofs. Seems to add a lot of complexity for a gain that will be erased by just a year or two of growth. Splitting the database into _many_ files can scale.
It may not have been obvious, but the proposal was actually suggesting one .mmc proof index file for *each* .mm file in the project. So if the .mm file is broken up, so too will the .mmc files.
The gain here is not just the ~30% file size savings, although that is nice. The main point is that this file need not be stored in memory _at all_, regardless of how the proof data is compressed. Whenever a proof record is needed, you open the file, seek to the location of the data, then read the record and close the file. So that's a 100% savings on in-memory proof data storage (although if your memory statistics are correct, this may still not be a very big savings, since it's only about 70% of set.mm, compared to your 1000% data structure).
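To illustrate the open/seek/read access pattern being described, here is a toy version: proofs live on disk in a record file, and only a small {label: (offset, length)} index stays in memory. The record layout (label line, 4-byte length, body) is invented for illustration and is not the actual .mmc proposal:

```python
import struct

def write_mmc(path, proofs):
    """Write proofs to a record file and return an in-memory index
    mapping each label to (offset, length) of its proof body."""
    index = {}
    with open(path, "wb") as f:
        for label, body in proofs.items():
            f.write(label.encode() + b"\n")
            f.write(struct.pack(">I", len(body)))
            index[label] = (f.tell(), len(body))  # body starts here
            f.write(body.encode())
    return index

def read_proof(path, index, label):
    """Fetch a single proof without reading the rest of the file."""
    offset, length = index[label]
    with open(path, "rb") as f:
        f.seek(offset)                  # jump straight to the record
        return f.read(length).decode()  # read only this proof body
```

The point of the sketch is that memory cost scales with the number of proofs (a few bytes of index each), not with the total proof text.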
If the file were split into sections, the mythical Metamath Two could automatically create interface cache files for each section. Pre-tokenizing the _statements_ would be a substantial space win, and cache files that are automatically regenerated based on timestamps would add no complexity whatsoever to the maintenance and version control stories (after all, your compiler + make already does exactly that).
I think caching is a good idea, particularly if it can somehow be segmented so that the entire cache is not invalidated when any part changes. (This is yet another argument for splitting the file, so that you have multiple timestamps.) If you could get something similar to the in-memory conception of the database onto the disk as a cache, then you could speed up loading the database by an order of magnitude.
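The make-style regeneration rule amounts to one timestamp comparison per segment; a minimal sketch (file naming is hypothetical):

```python
import os

def cache_is_fresh(source, cache):
    """make-style staleness check: reuse the cache for a segment only
    if it exists and is at least as new as its source .mm file."""
    return (os.path.exists(cache)
            and os.path.getmtime(cache) >= os.path.getmtime(source))
```

With one cache file per segment, editing one segment invalidates only that segment's cache, which is the whole argument for multiple timestamps.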
On second thought, I think the .mmc format should not contain the header, which is too brittle for version control. Instead, it would only have what I labeled "part 2", and the program would construct the binary tree or something like it in memory, maybe also outputting it as part of the cache, but not checked in to version control.
On Sun, Apr 17, 2016 at 11:35 PM, Norman Megill <n...@alum.mit.edu> wrote:
I'll mention that I am somewhat uncomfortable with binary formats, but as long as we can convert to ASCII 1.0 I guess it's tolerable. For some things like label renames, a syntax-aware Metamath-specific tool is certainly preferred. But no such tool can anticipate all the possible manipulations we may want to do to a .mm file. Having the ability to process it with text-manipulating tools from sed to powerful text editor scripting has been extremely useful in the past, and that ability goes away with binary formats. Unlike a computer program where editing can introduce hidden bugs, the ability to verify proofs makes .mm files relatively robust against editing mistakes.
I will emphasize that I am only talking about the stuff that is meaningless, sed-unprocessable "ASFAKLHRAWHLA" garbage already. I don't have any proposal for the .mm format any easier to use than the current one (besides taking the proof blocks out). Compressed proofs *are* binary data, even if they are currently being stored in a text file.
A computer program source file 28MB in size is hard to even imagine, and I wouldn't want to be responsible for its maintenance. The fact that set.mm flourishes with no significant problems in spite of its size suggests that it has a fundamentally different character from a computer program. The same principles of structure and modularity, intended to reduce bugs and increase maintainability, may or may not apply.
I don't buy this argument. It is quite possible to use a 28MB C file; it's not all that different from 28MB worth of separate C files as far as the compiler is concerned. And if you had to, and the tools did not have an issue with large files, it wouldn't be that bad for day-to-day work. But it's terrible code style. Not to put too fine a point on it, but I'm pretty sure that the main reason set.mm is one file today is because you think it's a good idea, which isn't actually an argument. Of course I presume that you have reasons for this decision, which are perfectly good arguments, but the mere fact that this is how it has been is not proof that it is the best way.
I think Stefan had the right idea here. This view forgets two important facts: (1) The size on disk is often quite different from the size in memory when read by a *syntax aware* tool. A file editor will probably not need much more space than the file itself because it makes no attempt to understand the data it is manipulating (which leads to subtle or not-so-subtle errors in maintenance mode). (2) Just because a file "fits" in memory (itself a fuzzy concept when you consider paging) does not mean that it can be *quickly* manipulated. This is essential for proof assistants like mmj2; I am starting to feel the pressure of scanning the database on each unify (which I do once every 3 seconds or so), where a 0.5 - 1 second delay is a pain point.
Frankly I doubt it is so difficult to transform mmj2 into an editor. Add another buffer
where the whole database is displayed, allow saving the data into a file. And here you are.
What do we want to do in terms of platform issues? I still feel like the JVM is a major liability, but I see a shortage of time continuing for the next few months so I'm not very excited to maintain my own system.
A variant of this that is forward compatible is to have a folder full of .mm file fragments, not expected to compile on their own, and one big index file, like so:
This way, we have a single place to go for section reordering, and there is a lot of flexibility with file naming and folder structure. In fact, it might also be useful to put the section comments themselves in this index file, so that section organization (the stuff that Gerard Lang likes to work with) can be separated from the theorems.
That could work well. At that point it would essentially be a project file; I feel like there's something _almost_ here that would be a big help to local work / handling subsets of the database, but I'm not quite sure what it is.
* All $c, $v, $f statements, and syntax axioms come at the beginning of the file. In other words, the entire database works under a single grammar. (It's not necessary that this actually be true, but rather that transplanting all the statements to the start doesn't break anything.)
Isn't this already true?
"Memory" versus "disk" is not necessarily the best way to conceptualize this, since we're really dealing with a hierarchy of caches. We want to minimize the size of the auxiliary data structures that are constructed (to an extent), and the amount of ongoing work to maintain them. We can't go around rebuilding all of the indexes on load and on every edit. A very important part of the smm maintenance mode work will be developing a vocabulary of editing commands that can do useful changes without requiring all indexes to be rebuilt; I've got a few half-baked ideas here. They'll have to respect tree structure and scope structure in a few ways.
On second thought, I think the .mmc format should not contain the header, which is too brittle for version control. Instead, it would only have what I labeled "part 2", and the program would construct the binary tree or something like it in memory, maybe also outputting it as part of the cache, but not checked in to version control.
Absolutely version control should have a single point of truth. If we decide to store MMC files on disk (absolute loading speed may not be important once we have a mature maintenance mode, so I could go either way on this), they should be .gitignored.
On Sun, Apr 17, 2016 at 11:35 PM, Norman Megill <n...@alum.mit.edu> wrote:
I'll mention that I am somewhat uncomfortable with binary formats, but as long as we can convert to ASCII 1.0 I guess it's tolerable. For some things like label renames, a syntax-aware Metamath-specific tool is certainly preferred. But no such tool can anticipate all the possible manipulations we may want to do to a .mm file. Having the ability to process it with text-manipulating tools from sed to powerful text editor scripting has been extremely useful in the past, and that ability goes away with binary formats. Unlike a computer program where editing can introduce hidden bugs, the ability to verify proofs makes .mm files relatively robust against editing mistakes.
I will emphasize that I am only talking about the stuff that is meaningless, sed-unprocessable "ASFAKLHRAWHLA" garbage already. I don't have any proposal for the .mm format any easier to use than the current one (besides taking the proof blocks out). Compressed proofs *are* binary data, even if they are currently being stored in a text file.
Once we have the chunking and indexing procedure resolved, I don't see major advantages to formats other than the current one. Certainly there are better entropy coders that could be used for the compressed proof data, and we can get rid of most of the syntax steps while we're at it (I need to write and benchmark a unifying proof verifier. I suspect that with a grammatical .mm file, it may be "fast enough", as the trees give unification a lot more to bite into.); but that doesn't save us much more than size at rest, which isn't that important to begin with. A metamath editor won't need to maintain data structures for the contents of most proofs most of the time.
I suspect if you did this you would discover that, as bad as commodity programmer's editors are at handling 28MB files, JTextArea is worse.
And don't even think about enabling the syntax highlighter.
I suspect if you did this you would discover that, as bad as commodity programmer's editors are at handling 28MB files, JTextArea is worse.
Consider using real tools. Emacs manages a 28M file as if it had 100 bytes.
On Tue, Apr 19, 2016 at 04:47:49AM -0400, Mario Carneiro wrote:
> On Tue, Apr 19, 2016 at 12:51 AM, Stefan O'Rear <stef...@cox.net> wrote:
>
> > What do we want to do in terms of platform issues? I still feel like the
> > JVM is a major liability, but I see a shortage of time continuing for the
> > next few months so I'm not very excited to maintain my own system.
>
>
> I've been investigating a possible replacement for metamath.exe written in
> Rust (which targets the same platforms as C). As long as you are running
> local, it seems useful to just run as a command line tool, like a compiler,
> which is invoked by actual text editors to do the heavy lifting. The
> biggest problem with this model is the large startup cost, which would seem
> to suggest a more server-oriented approach in which the compiler runs as a
> service and updates its memory based on commands from the editor.
I've been playing with Rust for a few months now, and I've *already* been
thinking it'd be an excellent fit for a new metamath program. Also provides
nicer multithreading than C (and *much* nicer than any current Javascript), and
there are quite a few places we can leverage that for improved responsiveness
in the editor... Was going to try writing a one-shot verifier in Rust in a
week or so.
Given the low startup overhead of a pre-compiled program and the ability to
keep a cache file (see below), I think startup under 200ms is feasible. Will
need to do experiments.
Proposal for SMM 0.3:
* The Database is divided up into segments, which are coterminous with .mm
files. If proofs are split from statements, then each segment has a .mms
and a .mmc file.
It is possible to consider shorter segments, but some way to stably
identify them would become necessary.
* Each database has a "project file" which lists the used Database segments.
Details TBD.
* A segment must end with a newline, must consist of complete Metamath
statements only, and may not contain unbalanced ${ $}. Thus, the lexer
state and the nesting state are both automatically known at the top of a
segment.
* $[ $] is forbidden if there are multiple segments. Rationale: handling $[ $]
requires state (the include list). The single-segment exception is for compatibility.
* Top level $e statements are forbidden if there is more than one segment.
(They create difficulty for frame construction.)
* Given a single segment in isolation, it can be parsed into Metamath
statements and blocks. For every assertion, we can identify the `$e`
hypotheses (but not the `$f`s, because we still allow those at top level and
they could be in scope from a former segment). Record all used global names
in an index.
* Once we've scanned all segments, we can detect errors from name collisions.
* Check all segments again. Now that we have the full `$c`/`$v`/`$f` index,
we can construct frames for all assertions.
* At this point we have frames for all `$a` statements, so we can construct
the grammar.
* Check all segments again. We have a grammar and can do a grammatical check;
we also have frames for everything and can do a proof validity check. This
phase works locally within assertions so it can be incremental at that
level.
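The segment invariants above (terminating newline, complete statements only, balanced ${ $}) can be verified with a single scan. Here is a minimal sketch in Python, not the proposed Rust implementation; the keyword list and comment handling are simplifications of the full spec:

```python
import re

STMT_KEYWORDS = {'$c', '$v', '$f', '$e', '$d', '$a', '$p'}

def segment_ok(text):
    """Check the proposed segment invariants: the segment must end with a
    newline, contain only complete statements, and have balanced ${ $}."""
    if not text.endswith('\n'):
        return False
    # Strip comments first, so ${ $} tokens inside $( ... $) are ignored.
    text = re.sub(r'\$\(.*?\$\)', ' ', text, flags=re.S)
    depth, in_stmt = 0, False
    for tok in text.split():
        if tok == '${':
            depth += 1
        elif tok == '$}':
            depth -= 1
            if depth < 0:
                return False          # unbalanced close
        elif tok in STMT_KEYWORDS:
            in_stmt = True            # statement opened...
        elif tok == '$.':
            in_stmt = False           # ...and completed
    return depth == 0 and not in_stmt
```

With these invariants enforced, the lexer state and the nesting state are trivially known at the top of every segment, which is what makes the per-segment passes possible.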
I am fairly confident that all of this can be done in a strongly incremental
way: the only changes that will require scanning the full active database are
the ones that change the definition or parsing rules of a math token that is
actually used everywhere.
I think I finally see what you're getting at here. You want to remove proofs
from the text that needs to be edited while maintaining the database, because
it's useless visual noise. I want to keep proofs interleaved in the .mm
file(s) that we store and distribute because the proof file and the statement
file would otherwise have to be kept closely in sync, which is doable but a bit
annoying. Undecided.
If you want to do this, I suggest a line-based "text" format, with one line per
proof, separated by equals signs, sorted, with the current format on the right,
e.g.:
id = ( wi ax-1 mpd ) AAABZAAACAECD
idd = ( wi id a1i ) BBCABDE
This should keep diff happy, even if the line length is a bit long.
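For reference, the letter stream on the right of each `=` is just an integer encoding, per the compressed proof appendix of the Metamath book: U-Y are high-order base-5 digits, A-T is a final base-20 digit, and Z tags a step for reuse. A decoder sketch (function name is hypothetical):

```python
def decode_compressed(letters):
    """Decode a compressed-proof letter stream into 1-based integer
    references plus 'Z' reuse markers. U-Y are high-order base-5 digits
    (values 1-5); A-T is the final base-20 digit (values 1-20); 'Z' tags
    the previous step for later reuse."""
    out, acc = [], 0
    for ch in letters:
        if 'U' <= ch <= 'Y':
            acc = acc * 5 + (ord(ch) - ord('U') + 1)
        elif 'A' <= ch <= 'T':
            out.append(acc * 20 + (ord(ch) - ord('A') + 1))
            acc = 0
        elif ch == 'Z':
            out.append('Z')
        # '?' (incomplete proofs) and whitespace are ignored in this sketch
    return out
```

For the `id` line above, `decode_compressed("AAABZAAACAECD")` yields `[1, 1, 1, 2, 'Z', 1, 1, 1, 3, 1, 5, 3, 4]`: indices into the mandatory hypotheses and the parenthesized label list, plus tagged-step backreferences.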
Secondary proposal: The stripped Metamath files should be called .mms, to solve
our problem with Myanmar.
On Wed, Apr 20, 2016 at 12:02:40AM -0400, Norman Megill wrote:
> Couldn't a syntax-aware editor simply have a mode that hides proofs?
I believe that is precisely what Mario was aiming at with "rather than fiddling
with collapsing markers to hide the proofs".
I'd kind of like to split up set.mm to make tools happier, but one of
the few ways set.mm is *not* like a program is that it either works
(verifies) or it doesn't. With a program it's useful to have very
frequent commits because it makes it easier to understand "why" when a
regression is discovered, but set.mm has no regressions and so commits
every couple days are good enough for history-keeping.
Mario, as far as I can tell you want to split out the proofs to make it
easier to manage maintenance mode operations in Eclipse. I think that
can be done using very simple tools to convert between split files and
not-split files, while avoiding ecosystem fragmentation by only sharing
the not-split files. I think smm 0.2 could handle the splitting and
unsplitting with a <100 line program; I could do that if you'd like.
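The converter really is small. A rough Python sketch of the splitting direction, under the line-based `label = proof` format proposed earlier in the thread (regex-based, so it assumes well-formed `$p ... $= ... $.` statements and does not special-case comments):

```python
import re

PROOF_RE = re.compile(r'(\S+)\s+\$p\s+(.*?)\$=\s*(.*?)\$\.', re.S)

def split_proofs(mm_text):
    """Return (statements_text, proof_table). Each proof body is replaced
    by '?' in the statement file and collected as a sorted
    'label = proof' line in the proof table."""
    proofs = []
    def stash(m):
        label, math, proof = m.groups()
        proofs.append(f"{label} = {' '.join(proof.split())}")
        return f"{label} $p {math.strip()} $= ? $."
    stripped = PROOF_RE.sub(stash, mm_text)
    return stripped, '\n'.join(sorted(proofs))
```

Unsplitting is the inverse: look each `?` placeholder's label up in the proof table and substitute the proof back in.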
On Thu, Apr 21, 2016 at 08:07:35AM -0400, Mario Carneiro wrote:
> On Tue, Apr 19, 2016 at 11:28 PM, Stefan O'Rear <sor...@gmail.com> wrote:
> > I've been playing with Rust for a few months now, and I've *already* been
> > thinking it'd be an excellent fit for a new metamath program. Also
...
>
> That's great. I'll open source my project soon, and then we can collaborate
> on this. (I'm calling it rmm.)
The main disadvantage of something like Rust is the difficulty of
handling user extensions. Rust's learning curve is not the shallowest,
and the 100 MB compiler download and the need for frequent slow
recompiles will create a high barrier of entry for things like macros.
Most likely I'd wind up embedding Lua; now we have a different problem,
namely two parallel ecosystems of support code.
> > Given the low startup overhead of a pre-compiled program and the ability to
> > keep a cache file (see below), I think startup under 200ms is feasible.
> > Will need to do experiments.
>
> What does this "startup" entail? Is it reading the whole .mm file (or
> equivalent)? I suppose the minimum necessary for "ready to go" state is to
> have parses and frames for all the statements. Is this what you mean?
Yes. "md5sum set.mm" completes in about 50ms, so I think it'll take
about that long to segment the set.mm file, generate checksums of the
segments, and verify in a database that they haven't changed. There's a
lot of fiddliness in that design; see "forest for the trees" below.
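The startup fast path being described might look like the following sketch (names are hypothetical, and a real design would also have to cope with segment boundaries moving):

```python
import hashlib

def changed_segments(segments, cache):
    """Given {name: text} and a cache of {name: md5 hexdigest} from the
    previous run, return the segments that need re-verification and
    update the cache in place. Unchanged segments are skipped entirely."""
    dirty = []
    for name, text in segments.items():
        digest = hashlib.md5(text.encode()).hexdigest()
        if cache.get(name) != digest:
            dirty.append(name)
            cache[name] = digest
    return dirty
```

On a warm start with an unchanged database, this reduces the work to one hashing pass, which is the basis for the sub-200ms estimate.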
> > Proposal for SMM 0.3:
...
> > * $[ $] is forbidden if there are multiple segments. Rationale: handle $[ $]
> > requires state (the include list). The exception is for compatibility.
>
> What is this "multiple segments" exception? I notice it a few more times
> below. If there is only one "segment", are you trying to make it the same
> as parsing a single .mm file?
Yes. I'm reluctant to write a verifier that _cannot_ handle legal .mm
files just because they do things like declare `$e` statements in the
top-level scope.
> The key, I think, is figuring out how to do incremental changes in this
> setup. Let's number your phases, (1) gathering names (2) building frames
> (3) grammar check (4) proof check. We can make a cache file for each phase
> and each segment.
>
> * The phase 1 cache is never invalidated by changes to other segments.
> * The phase 2 cache is invalidated only by changes or removal of global $v
> and $f statements in previous segments.
No. The key to this entire programme is to maintain reverse mappings;
"all previous segments" will *never appear* in the final algorithm.
Remember, this is planning for a world where set.mm is 200+ MB, a world
where memories and disks have gotten larger but CPUs haven't gotten much
faster.
The phase 2 cache is invalidated when any math symbol which is used in
the segment is the subject of a changed global $v or $f in any other
segment.
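The reverse mapping can be as simple as a symbol-to-segments index. A sketch (segment ids and helper names are hypothetical):

```python
from collections import defaultdict

uses = defaultdict(set)   # math symbol -> segments whose statements use it

def index_segment(seg_id, symbols):
    """Record which math symbols a segment's statements use."""
    for sym in symbols:
        uses[sym].add(seg_id)

def invalidate_phase2(changed_symbols):
    """Return the segments whose phase-2 (frame) caches must be rebuilt
    because a global $v or $f touching one of these symbols changed."""
    dirty = set()
    for sym in changed_symbols:
        dirty |= uses[sym]
    return dirty
```

With this index, invalidation cost is proportional to the number of affected segments, never to the total database size.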
> * The phase 3 cache is invalidated by changes or removal of syntax $a
> statements in previous segments.
Grammar check for a statement is invalidated if it contains all of the
`$c` symbols from an `$a` that was added or removed. (Yes, this means
that if you mess with `cv` you'll have a small problem.)
> I'm sure Norm knows some clever trick for doing this with
> minimize/allow_growth. A simple primitive operation which would accomplish
> the above is "inline theorem" and "extract theorem". Extract theorem T
> would look for subtrees in a theorem proof that match the proof of T (or
> perhaps just match the signature of T), and uses T instead. "Minimize with
> T" from MM-PA does essentially this. Conversely, "inline theorem T" would
> go to each use of T and insert the proof of T instead of referencing T;
> afterward T will not be referenced and can be deleted. There are some
> subtleties in choosing dummy variables in the proof, if there are any.
If smm 0.2 had a proof parser it could have handled that. We don't need
a new architecture for that.
> I beg to differ regarding sharing not-split, though. The major benefits of
> splitting are in git maintenance and day to day work, so keeping it local
> largely defeats the purpose. Plus, I want to be able to choose the split
> file names, which isn't possible if the "default" is not-split. Instead, we
> should unsplit the git repo for producing deliverables such as the set.mm
> download on the website.
I think we're having a vocabulary problem because of the two kinds of
splitting.
If we go down the route of section splitting, then we should keep the
section files in git, for many reasons including "being able to choose
the split file names".
For binary splitting into set.mms and set.mmc, there's no particular
advantage to having two files in git rather than one, so I'd recommend
one.
On Tue, Apr 19, 2016 at 08:28:52PM -0700, Stefan O'Rear wrote:
> Proposal for SMM 0.3:
I've made more progress on the API design for SMM 0.3. Segments aren't
just a performance hack; some form of them is semantically necessary if
you have multiple .mm files because we need to keep track of what gets
saved where. SMM 0.2's flat array of statements won't cut it here.
File inclusion (with the restrictions added in
330680035.12827.14329...@alum.mit.edu) and top level
$e are no longer a problem in the design; in fact, my preferred outcome
for set.mm is a top level file which includes a few dozen small files.
On Fri, Apr 22, 2016 at 01:18:58AM -0400, Mario Carneiro wrote:
> On Thu, Apr 21, 2016 at 12:17 PM, Stefan O'Rear <sor...@gmail.com> wrote:
...
> One way to handle this is if the Rust verifier, which acts as a command
> line tool, accepts commands passed on stdin, and a separate program,
> written in another language (like node.js) produces those commands and is
> piped through. That way the verifier acts as the "kernel" and the macro
> language can support higher-level stuff. (The Rust verifier will also
> support higher level operations like atomic maintenance mode operations,
> but the set.mm-specific stuff can be farmed out.) The reason for using
> node.js instead of Lua is mainly to keep the macro language common with smm
> and mmj2.
That sounds really, really painful. What's the advantage at this point
over keeping the entire stack in Javascript?
> > If smm 0.2 had a proof parser it could have handled that. We don't need
> > a new architecture for that.
>
> Well, I think smm *is* the new architecture, provided it makes it to the
> big time. Although currently I don't really know how to use it for anything
> except one-shot verification (if that).
I don't follow "is the new architecture".
smm 0.2 doesn't have the proper plumbing to track which proofs came from
which files, so it's going to be much less useful in a many-small-proofs
world. I think I can fix that in smm 0.3, but it will break all API
consumers.
smm (any version) also does not currently have a proof object model.
There's been some discussion of that in the past; we had a rough
consensus that something like a MMP file was desired, but many details
were TBD and I never got around to implementing it.
> On Fri, Apr 22, 2016 at 12:50 AM, Stefan O'Rear <sor...@gmail.com> wrote:
>
> > On Tue, Apr 19, 2016 at 08:28:52PM -0700, Stefan O'Rear wrote:
> > > Proposal for SMM 0.3:
> >
> > I've made more progress on the API design for SMM0.3. Segments aren't
> > just a performance hack; some form of them is semantically necessary if
> > you have multiple .mm files because we need to keep track of what gets
> > saved where. SMM0.2's flat array of statements won't cut it here.
> >
> > File inclusion (with the restrictions added in
> > 330680035.12827.14329...@alum.mit.edu) and top level
> > $e are no longer a problem in the design; in fact, my preferred outcome
> > for set.mm is a top level file which includes a few dozen small files.
> >
>
> Your link looks corrupted. Are there changes to the design from the earlier
> email?
That's not a Web link, it's the Message-ID of an email Norm sent to you
and me on 2015-05-29.
On 5/29/15 3:42 PM, Stefan O'Rear wrote:
> On Fri, May 29, 2015 at 11:34:47AM -0400, Norman Megill wrote:
>> To summarize, your edge cases are all flagged as errors in the current
>> metamath program, and the actual internal behavior ignoring the messages
>> may or may not be what was intended.
>>
>> Certainly I need to clarify the spec. If you think the behavior should
>> be different from what I described, let me know. My goal is to keep
>> both the spec and the implementation as simple as possible.
>>
>> Another issue is how the other half-dozen verifiers handle these cases;
>> I don't know.
>
> Soon to be a half-dozen plus one. My goals are for the behavior to be simple
> and fully specified. The code I have so far will flag an error if a comment or
> $[ trails off the end of a file, so it seems like that much is agreed on. I
> kind of like Mario's suggestion to only allow $[ $] where $c is currently
allowed, although I don't believe any verifier uses that rule currently.
I agree. I'll make it official that $[ $] is allowed only in the outermost
scope i.e. not between ${ $}.
I'll also add that comments have highest parse precedence and that an
incomplete comment may not occur in an included file.
I'll also add that $[ $] may not occur inside of a statement (e.g. may
not occur between the label of a $a statement and its $.) and that an
incomplete $[ $] may not occur in an included file.
I'll also add that an incomplete statement (e.g. $a but no $.) may not
occur in an included file. This might be implied by the above
depending on how you read it (since the end of the included file
would be "inside" of a statement), but better to make it explicit.
Have I missed anything?
Norm
Reciprocally, what is the advantage at this point of not just keeping everything in Rust?
Hi Norm,
> Here is a collection of proposed changes and clarifications to the
> Metamath 1.0 spec collected from various emails.

I've not followed the debate carefully, and I hope I'm not wide of the mark, but here is an evolution I would like.
#1 On an HTML page, only the premises used by a theorem are shown.
#2 In a proof, only the premises really used are referenced.
#3 $d statements only concern the next $p statement.
On Sat, Apr 23, 2016 at 6:35 AM, 'fl' via Metamath <meta...@googlegroups.com> wrote:
> #2 In a proof only the premises really used are referenced.

I am strongly opposed to this. This means that you have to read and verify the proof in order to figure out its frame, which totally kills parallel processing of proofs and separation of maintenance and proof mode.
> > #2 In a proof only the premises really used are referenced.
>
> I am strongly opposed to this. This means that you have to read and verify the proof in order to figure out its frame, which totally kills parallel processing of proofs and separation of maintenance and proof mode.

By the way, I made a very similar suggestion to this in Lean (look at the proof to figure out which optional hypotheses are needed),
> #2 In a proof only the premises really used are referenced.

Obviously the intent is to avoid repeating the same premises again and again.
But you don't need to repeat $e's, that's the whole point of the scoping system for building frames.
> This logic won't work for a lot of lemmas, which define abbreviations used in the proof but not otherwise appearing in the
> premises or conclusion.

OK, but you add a bit of recursion, obviously.
I've lost you. Recursion on what? Assuming you can't look at the proof, how would you deduce that those other abbreviations in metnrmlem3 are needed?
If we're talking Metamath 2.0 here, perhaps what you actually want is a *true* abbreviation mechanism,
My $b counter-proposal has this property.
On Sat, Apr 16, 2016 at 1:09 AM, Norman Megill <n...@alum.mit.edu> wrote:
[...]
Actually, I've been brainstorming about how to fix this and what the perfect Metamath editor should look like. It's quite clear to me, at least, that we are not there yet, and having one 16MB file is a symptom of this. Most file editors are just not designed for this, and it is not good to have to read the whole file and store it in memory in order to get to a single theorem. Edits to an .mm file are almost always local, touching only a tiny part of that giant file, and splitting it up is one way to focus development a bit more.
Could you repeat the reasons why you think having set.mm in one file is a good thing?
1. Moving sections around is trivial if sections are in separate files: it just means moving a file and possibly changing the dependencies. Moving a theorem or range of theorems works the same with multiple files: just cut and paste.
2. Global renames are completely identical with multiple files; almost all file editors have a "find/replace in files" feature, which would do the trick. Ideally this should be done by the aforementioned "perfect metamath editor", though, so it can be more syntax-aware and only do the replacement where it is syntactically valid.
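Since Metamath tokens are whitespace-delimited, even a non-syntax-aware global rename can be made safe against substring collisions (e.g. `ax-1` occurring inside `ax-17`). A sketch of the idea, not what the "perfect Metamath editor" would do (it skips neither comments nor math strings):

```python
import re

def rename_label(mm_text, old, new):
    """Whole-token find/replace: matches `old` only when it is bounded by
    whitespace (or the ends of the text), never inside another token."""
    return re.sub(r'(?<!\S)' + re.escape(old) + r'(?!\S)', new, mm_text)
```

A syntax-aware tool would additionally restrict the replacement to label positions, leaving comments and math symbols untouched.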
If you could start over from scratch, that is, some kind of "Metamath 2.0", how would you design the interface? What is the core structure of MM file data? Some thoughts:
* The basic design is a sequence of theorems with possibly shared environments. Within each theorem, though, the data structure is something complex, which we have decided not to store in plain text in the present version, necessitating interaction with a proof editor in any case. (That is, a user cannot practically write a complete .mm file by hand, and must resort to a program to write the proof parts.)
* There is a logical two-level interaction pattern with data in the file:
1. Maintenance mode: Adding and removing theorems, changing theorem statements, changing hypotheses and disjoint variable conditions.
2. Proof mode: Writing proof bodies.
* The raw MM file presentation is suboptimal in that on the one hand users must interact with it directly for maintenance mode, yet are presented with human-unreadable data in the proofs, while on the other hand existing tools allow proof mode work but do not handle any maintenance mode activities.
* An easy presentation format for maintenance mode is the raw MM file, but with proofs collapsed or omitted. The operations on such a representation may also be limited beyond arbitrary text edits, to a more well-formed approach such as adding and removing groups, and moving statements around as atomic elements. A graphical interface for this might work, but would need a lot of usability testing to get right.
* For proof mode, the best current contender is the MMP file format, but mmj2 is the only program that can read it, and there is no connection between this and maintenance mode in mmj2 (in fact mmj2 does not edit the MM file at all). I don't think this is the best direction in order to support proof scripts in the future, and I would support something more like a sequence of commands. (Tactic modes exist in Lean, Coq, HOL Light, Isabelle and I can explain them in more detail if anyone doesn't know how they work.) An MMP style output could be written as a sequence of ` apply xxx "|- math" ` commands like those used in MM-PA (the math strings are optional for user input).
* Linking in to an existing IDE like Eclipse is an easy way to get a lot of nice features like syntax highlighting out of the box. But any attempt to do this in maintenance mode is severely restricted by the sheer size of the file, even if proofs are omitted. Eclipse assumes that a software project has a modular structure, and I daresay that most other IDEs do as well. Even simple section-by-section splits would mostly resolve this issue.
* A primary problem that is not currently addressed by mmj2 is the ability to change the state of the database while the program is running (that is, maintenance mode changes affecting proof mode). MM-PA is able to handle this well, although it does not allow many maintenance mode changes in the first place. A good editor would have both modes and could use each to affect the other.
I'd like to collect opinions on what a good editor should look like, and establish known shortcomings of existing methods, to get a roadmap for future editors. What does your "perfect Metamath editor" look like?
Mario
http://www.arnoux.ch/thierry/eclipse
http://www.arnoux.ch/thierry/metamath
> I'd like to collect opinions on what a good editor should look like, and establish known shortcomings of existing methods, to get a roadmap for future editors. What does your "perfect Metamath editor" look like?
Hi Mel,
That's great!
If you have not done so yet, I suggest you have a look at http://emetamath.tirix.org/ and https://github.com/tirix/Emetamath

Running mmj2 inside Eclipse is exactly what "emetamath" does: it's an Eclipse plugin for editing Metamath (.MM) and Metamath Proof (.MMP) files, which embeds the core of mmj2.
Most recently, I've added features such as highlighting of matching parentheses
...that's still far away from VR-powered proof building!
BTW, let me take this opportunity to thank you for the excellent
MMJ2!
_
Thierry
The proofs are entered exactly in the same way as in MMJ2: the same proof file format is used (.MMP). In the end, it’s an ASCII text editor with special features.
Hi Marnix,
I've updated the links on the page, and added the build for the
latest changes, from April 2018.
The plugin also works at least with Eclipse Oxygen, so I've
updated this mention (I've not updated my own version to Photon,
so I'm not sure if it works with Photon).
I think until now only Mario tried, so if anything does not go smoothly don't hesitate to ask!
_
Thierry
Hi Frédéric,
Unfortunately I don't have a Linux distribution available right
now. I suggest you install Eclipse and then just download and copy
the plugin into the Eclipse "plugins" directory.
If this tool becomes popular I might create an upload site (you
can then install and update the plugin directly from Eclipse), or
even register it in the "Eclipse Marketplace".
_
Thierry
Hi Marnix,
I'm glad to count you among the users of EMetamath! To be honest,
I think it's just you and me :-)
Congratulations on getting the plugin working and please keep
comments coming!
The GitHub issues list is fine to track any issues, but informal emails will do too.
I'm surprised the editor does line wrapping at 80 characters; this is
not how it works for me. Did you try going to the
"Preference" menu, the "Metamath/TMFF" choice, and setting e.g.
format number 7? (this is the one I use, single line per step, no
limit)
For proofs in set.mm, please also make sure you have
auto-transformations activated (in "Project/Properties", choosing
"Metamath Properties"). This is really a big enhancement for
writing proofs.
The credit for the proof assistant really goes to Mel O'Cat and Mario; I've just done the integration into Eclipse!
BR,
_
Thierry