Lexing/parsing a Metamath database


Benoit

May 29, 2022, 11:24:40 AM
to Metamath
I am writing a lexer and parser for Metamath databases, and I do not know if it's better to check "string compliance" (e.g., questions like: is the string "az%e" allowed to be a label?) at lexing or at parsing:

At parsing: the lexer (after skipping whitespace and comments) returns tokens which are either Metamath keywords or Ident(string) for anything else.  Then the parser checks whether Ident(string) satisfies the rules; e.g., when the lexer passes it the sequence '$[', Ident(file), '$]', it calls a function is_valid_filename(file) which checks whether 'file' contains characters forbidden in a filename, like '%'.  This is not very satisfactory since there is some "rereading" of the characters of the string 'file' (once at lexing and once at parsing).
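
To make the idea concrete, here is a minimal Rust sketch of what I have in mind (the names `Token` and `is_valid_filename` are placeholders of mine, and the exact set of forbidden characters would of course have to follow the spec):

```rust
// Minimal sketch of the "check at parsing" approach (illustrative names only).
#[derive(Debug, PartialEq)]
enum Token {
    Keyword(&'static str), // "$[", "$]", "$a", "$p", "$.", ...
    Ident(String),         // anything else: could be a label, math symbol, filename, ...
}

/// Called by the parser once the context ('$[' ... '$]') says this Ident must be
/// a filename; the precise character restrictions are those of the spec, '%' and
/// whitespace are only examples here.
fn is_valid_filename(s: &str) -> bool {
    !s.is_empty() && s.chars().all(|c| !c.is_whitespace() && c != '$' && c != '%')
}

// In the parser, after seeing Token::Keyword("$["):
// match next_token {
//     Token::Ident(file) if is_valid_filename(&file) => { /* record the include */ }
//     _ => { /* report an error */ }
// }
```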

At lexing: one can tell if a string of characters is:
* a Metamath keyword;
* something which can be (and only be):
   * a math symbol;
   * a math symbol or label or filename;
   * a math symbol or label or filename or complete compressed proof block;
   * a math symbol or compressed proof block (complete or incomplete).
So the tokens given by the lexer are of one of these five kinds.  This ensures "unique reading" but is probably more cumbersome (even if it is not directly coded from an automaton, the fact that the automaton has more states will be reflected in the code).
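
In code, such a lexer would return something like the following (hypothetical names, just to spell out the five kinds):

```rust
// Hypothetical sketch of the five token kinds a "classifying" lexer would return.
enum LexedToken {
    Keyword(&'static str),                       // $c, $v, $f, $a, $p, $[, ...
    MathSymbolOnly(String),                      // contains characters forbidden in labels/filenames
    SymbolOrLabelOrFilename(String),
    SymbolOrLabelOrFilenameOrProofBlock(String), // also a complete compressed proof block
    SymbolOrProofBlock(String),                  // compressed proof block, complete or incomplete
}
```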

Do you have any advice?  How is it done in metamath.c and in the Rust verifier (metamath-knife)?  (I'm not very proficient in these languages.)  In mmverify.py, very few checks are done, so the problem does not arise.

Benoît

Mario Carneiro

May 30, 2022, 1:59:58 AM
to metamath
On Sun, May 29, 2022 at 11:24 AM Benoit <benoit...@gmail.com> wrote:
I am writing a lexer and parser for Metamath databases, and I do not know if it's better to check "string compliance" (e.g., questions like: is the string "az%e" allowed to be a label?) at lexing or at parsing:

I think it is preferable to do the work during parsing.
 
At parsing: the lexer (after skipping whitespace and comments) returns tokens which are either Metamath keywords or Ident(string) for anything else.  Then the parser checks whether Ident(string) satisfies the rules; e.g., when the lexer passes it the sequence '$[', Ident(file), '$]', it calls a function is_valid_filename(file) which checks whether 'file' contains characters forbidden in a filename, like '%'.  This is not very satisfactory since there is some "rereading" of the characters of the string 'file' (once at lexing and once at parsing).

The rereading is only necessary for labels (in declaration position), which are a relatively small proportion of tokens. Furthermore, it is only a "check" and does not need to create a new string: once you already know what class of token you expect you can check the rules for that token or list of tokens all at once.
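
For instance (just a sketch, not the actual code of metamath.c or metamath-knife), once the parser knows it expects a label, it can validate the already-lexed slice in place:

```rust
/// Sketch only: validate an already-lexed token as a label without allocating a
/// new string. Per the Metamath spec, a label consists of letters, digits, and
/// the characters '-', '_' and '.'.
fn check_label(tok: &str) -> bool {
    !tok.is_empty()
        && tok
            .bytes()
            .all(|b| b.is_ascii_alphanumeric() || b == b'-' || b == b'_' || b == b'.')
}
```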

At lexing: one can tell if a string of characters is:
* a Metamath keyword;
* something which can be (and only be):
   * a math symbol;
   * a math symbol or label or filename;
   * a math symbol or label or filename or complete compressed proof block;
   * a math symbol or compressed proof block (complete or incomplete).
So the tokens given by the lexer are of one of these five kinds.  This ensures "unique reading" but is probably more cumbersome (even if it is not directly coded from an automaton, the fact that the automaton has more states will be reflected in the code).

It will most likely slow the lexer down as well to have more classes, and you have to eliminate most of these classes most of the time.
 
Do you have any advice?  How is it done in metamath.c and in the Rust verifier (metamath-knife)?  (I'm not very proficient in these languages.)  In mmverify.py, very few checks are done, so the problem does not arise.

It looks like metamath.c first parses all the statements without validation, meaning that the token preceding a $p gets shoved into an array whatever it is, and then there is a second pass which goes over all these tokens (which are actually pointers into the original text), re-parses up to the space to get the length, and re-checks all the characters in between to ensure they are valid label characters. I'm not sure I would recommend this method, because of the late validation, not to mention the triple scan due to not storing token lengths (which is a classic C-ism).

The read_labels() function is called first at statement position, before we even get to the keyword token, so it is accepting the grammar `label* $[acdefpv]`. Conceivably you could make this a one-pass implementation, but it just gets the token and then checks every character in it: if a '$' is present we abort (it is a keyword, not a label), and any character that is not a valid label character is an error. The number of labels is later restricted from "any number" to just 0 or 1, depending on the keyword, when the keyword is parsed (in get_statement()).

Mario

Benoit

May 30, 2022, 6:19:05 PM
to Metamath
Thanks Mario, very clear and useful!  I'll do that at parsing, as you advise.  Do you happen to remember in which file it is done in mmj2?  I skimmed through some files at https://github.com/digama0/mmj2/tree/master/src/mmj but got lost in the huge codebase.

I have a second question concerning the "second-stage parsing" that mmj2 (and metamath-knife?) do, but metamath.c does not: namely, the parsing of the math expressions themselves.  Do mmj2 and metamath-knife code their own "parser generators" like Yacc?  As I see it, the program would look for syntactic axioms (i.e., label $a w ..... $.  where w has *already* appeared in an $f statement), and from each of them it would construct a production rule; for instance, $a wff ( ph /\ ps ) $. yields the rule wff -> '(' wff '/\' wff ')'.  This is described in your "Models for Metamath".  Then, you hope that your CFG is not too complex and you give these rules (all at once at the end of parsing, or progressively as you read the file?) to a Yacc-like program, either an external one or one that you built (as is the case for mmj2 and metamath-knife?).  This would be pretty impressive.  Can you point me to the Rust file or function that does that part?
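
As a sketch of what I imagine (made-up types, not any existing code), each syntax axiom would become a production whose right-hand side replaces every variable by the typecode of its $f statement:

```rust
// Hypothetical sketch of turning a syntax axiom into a production rule.
enum GrammarSymbol {
    Constant(String), // a terminal, e.g. "(" or "/\"
    Typecode(String), // a nonterminal, e.g. "wff", "class", "setvar"
}

struct Production {
    lhs: String,             // the axiom's typecode, e.g. "wff"
    rhs: Vec<GrammarSymbol>, // e.g. [ "(", wff, "/\", wff, ")" ]
    label: String,           // the syntax axiom's label, used to build the parse tree
}

/// `float_typecode(tok)` is assumed to return Some(typecode) if `tok` is a
/// variable with an active $f statement, and None if it is a constant.
fn production_from_axiom(
    label: &str,
    typecode: &str,
    math_string: &[String],
    float_typecode: impl Fn(&str) -> Option<String>,
) -> Production {
    let rhs = math_string
        .iter()
        .map(|tok| match float_typecode(tok.as_str()) {
            Some(tc) => GrammarSymbol::Typecode(tc),
            None => GrammarSymbol::Constant(tok.clone()),
        })
        .collect();
    Production { lhs: typecode.to_string(), rhs, label: label.to_string() }
}
```

So for $a wff ( ph /\ ps ) $. this would produce lhs = wff and rhs = [ '(', wff, '/\', wff, ')' ], tagged with the axiom's label.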

Thanks,
Benoît

Thierry Arnoux

May 30, 2022, 11:51:43 PM
to meta...@googlegroups.com, Benoit

Hi Benoît,

For metamath-knife, I wrote that stage myself. I've called it "statement parsing" and everything is self-contained in `grammar.rs`. I basically started with a naive Moore machine, and then modified it to handle missing cases and cover the range of syntax found in set.mm. I can give more explanation about how it works if you need it.

For mmj2, I think Mel O'Cat wrote it. It is an Earley parser, and you can find it in `verify/EarleyParser.java`.

Both do two passes: a first pass where the grammar is constructed based on the syntax axioms, and a second where each statement is parsed to build parse trees (or ASTs, abstract syntax trees). So that's roughly the "parser generator" model you mention, even though no code is generated as in Yacc. It might be possible to do everything in a single pass, since Metamath only refers to previous definitions; however, for metamath-knife, the "parse statement" pass can be run in parallel threads which all need access to the grammar. While several threads can share read-only access, that would not work with read-write (mutable) access.

The grammar-building phase in metamath-knife is blazingly fast: it runs in 4-6 ms on my 7-year-old laptop. The reason is that it only cares about syntactic axioms, and there are only a few hundred of them. Statement parsing then takes around 600 ms on my laptop. Memory for every parse tree needs to be allocated, and then possibly grown, so I'm sure there is room for improvement here.

Glauco said he was using a third-party Earley parser, `nearley.js`. If it can cover set.mm, that's probably also your best option, and in hindsight I probably should have done the same in Rust!

BR,
_
Thierry

Mario Carneiro

May 31, 2022, 6:07:36 AM
to metamath
On Mon, May 30, 2022 at 6:19 PM Benoit <benoit...@gmail.com> wrote:
Thanks Mario, very clear and useful!  I'll do that at parsing, as you advise.  Do you happen to remember in which file it is done in mmj2?  I skimmed through some files at https://github.com/digama0/mmj2/tree/master/src/mmj but got lost in the huge codebase.

At this point in the code, it has decided to read the next token: it gets back a string, checks that it is not a keyword, and then validates the token as a label (which checks both the label character restrictions and the banned statement names like 'aux', 'prn', etc., which is kind of interesting since this is only a "verify markup" lint in metamath-exe).

I have a second question concerning the "second-stage parsing" that mmj2 (and metamath-knife?) do, but metamath.c does not: namely, the parsing of the math expressions themselves.  Do mmj2 and metamath-knife code their own "parser generators" like Yacc?  As I see it, the program would look for syntactic axioms (i.e., label $a w ..... $.  where w has *already* appeared in an $f statement), and from each of them it would construct a production rule; for instance, $a wff ( ph /\ ps ) $. yields the rule wff -> '(' wff '/\' wff ')'.  This is described in your "Models for Metamath".  Then, you hope that your CFG is not too complex and you give these rules (all at once at the end of parsing, or progressively as you read the file?) to a Yacc-like program, either an external one or one that you built (as is the case for mmj2 and metamath-knife?).  This would be pretty impressive.  Can you point me to the Rust file or function that does that part?

The mechanisms for doing this have evolved over time. Most systems that support math parsing implement what I would call a "dynamic general backtracking CFG parser", since this is definitely the easiest to code and understand and also has good enough performance. MMJ2 has the EarleyParser for this; it is the original parser that was used before I started working on it. I think the cleanest implementation of this kind of parser is the one in mm-scala (https://github.com/digama0/mm-scala/blob/master/src/main/scala/org/metamath/scala/Grammar.scala) or the Haskell version which is part of mm0-hs (https://github.com/digama0/mm0/blob/master/mm0-hs/src/MM0/FromMM/Parser.hs#L573-L610). Basically, rather than putting all the productions into a BNF and handing it to Yacc to generate a state machine, the compilation to a state machine is done in the code itself.

It's not so hard to do: each state has a map from constant token to next state, a map from variable sort to next state, and a possible reduction (proof tree fragment) to apply if this is a terminal node (or a list of reductions, although you should never have more than one reduction on a node if the grammar is unambiguous). Adding a new rule entails starting at the root and traversing all the symbols to get to the next state, creating a new empty state if needed, and then putting the provided reduction in the last node. This results in a tree of nodes (a trie), and you can navigate this trie to parse formulas. The backtracking part happens during "runtime": you have to consider all the ways that you can apply variable transitions to parse subformulas, which can be done by a recursive function.
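
Concretely, such a trie of states might look roughly like this in Rust (illustrative only, not the actual mm-scala / mm0-hs / metamath-knife types):

```rust
use std::collections::HashMap;

// Sketch of the trie of parser states described above (hypothetical types).
enum Sym {
    Const(String), // constant math token, e.g. "(" or "/\"
    Var(String),   // variable sort (typecode), e.g. "wff"
}

#[derive(Default)]
struct State {
    by_const: HashMap<String, State>, // transition on a constant token
    by_sort: HashMap<String, State>,  // transition on a parsed subformula of this sort
    reduction: Option<String>,        // syntax-axiom label to reduce with at this node
}

impl State {
    /// Adding a rule: walk from the root over the rule's symbols, creating empty
    /// states as needed, and store the reduction in the final node.
    fn add_rule(&mut self, rhs: &[Sym], axiom_label: &str) {
        let mut node = self;
        for sym in rhs {
            node = match sym {
                Sym::Const(c) => node.by_const.entry(c.clone()).or_default(),
                Sym::Var(sort) => node.by_sort.entry(sort.clone()).or_default(),
            };
        }
        node.reduction = Some(axiom_label.to_string());
    }
}
```

The parse itself is then a recursive walk of this trie: follow `by_const` on the next input token, and try each `by_sort` transition by recursively parsing a subformula of that sort, which is where the backtracking comes in.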

The other way to implement math parsing is using a variation on an LR(0) linear-time parser. This is more complex (assuming you at least want it to work on set.mm, which is not actually LR(0)), and it only works on a subset of unambiguous grammars, but it has the advantage of being O(n) instead of O(n^3) worst case for the backtracking parser. There are two implementations of this kind of parser: the LRParser in mmj2 and tirix's grammar parser in mm-knife. The disadvantage is that it doesn't really work very well as a dynamic parser: mmj2 has to parse the whole file, construct the parse tables, and then use them on everything else. It's a lot closer to the "build a Yacc file" approach you mentioned. (As far as I know, no one has actually implemented something like that, although I have certainly considered it before.) It is broadly similar to the general parser approach with the nodes and transitions, but there is a deterministic rule for selecting the next state given each input token, and the state graph is expanded in certain places in order to ensure the correctness of the rule.

Mario