I am writing a lexer and parser for Metamath databases, and I do not know if it's better to check "string compliance" (e.g., questions like: is the string "az%e" allowed to be a label?) at lexing or at parsing:
At parsing: the lexer (which skips whitespace and comments) returns tokens that are either Metamath keywords or Ident(string) for anything else. The parser then checks whether Ident(string) satisfies the rules; e.g., when the lexer hands it the sequence '$[', Ident(file), '$]', it calls a function is_valid_filename(file) which checks whether 'file' contains characters forbidden in a filename, like '%'. This is not very satisfactory, since the characters of the string 'file' get read twice (once at lexing and once at parsing).
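For concreteness, here is a minimal Python sketch of such parsing-time checks. The label character class is the one from the Metamath spec (letters, digits, '-', '_', '.'); the filename rule is only an assumption modeled on the '%' example above, not the spec's exact wording:

```python
import re

# Labels may only use letters, digits, '-', '_' and '.' (Metamath spec).
LABEL_RE = re.compile(r'[A-Za-z0-9._-]+')

def is_valid_label(s):
    return LABEL_RE.fullmatch(s) is not None

def is_valid_filename(s):
    # Assumed rule, following the '%' example above: non-empty,
    # printable ASCII, no whitespace, and no '$' or '%'.
    # (Check the spec for the exact forbidden set.)
    return s != '' and all(33 <= ord(c) <= 126 and c not in '$%' for c in s)
```

So is_valid_label("az%e") is False, matching the example above, but the '%' is only rejected after the lexer has already scanned it once.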
At lexing: one can tell whether a string of characters is:
* a Metamath keyword;
* something which can be (and only be):
  * a math symbol;
  * a math symbol or label or filename;
  * a math symbol or label or filename or complete compressed proof block;
  * a math symbol or compressed proof block (complete or incomplete).
So the tokens returned by the lexer are of one of these five kinds. This ensures a "unique reading" but is probably more cumbersome (even if the lexer is not coded directly from an automaton, the fact that the automaton has more states will be reflected in the code).
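One way to picture this lexing-time alternative: instead of a single Ident kind, the lexer reports the set of roles a token's characters still allow. A sketch, where the character classes are assumptions to be checked against the spec (math symbols taken as printable ASCII minus '$', compressed-proof blocks as upper-case letters and '?'):

```python
import string

# Assumed character classes (to be checked against the Metamath spec):
MATH_CHARS = {chr(c) for c in range(33, 127)} - {'$'}  # printable ASCII, no '$'
LABEL_CHARS = set(string.ascii_letters + string.digits + '._-')
COMPRESSED_CHARS = set(string.ascii_uppercase) | {'?'}

def possible_roles(tok):
    """Return every role this token could still play, by character set."""
    chars = set(tok)
    roles = set()
    if chars <= MATH_CHARS:
        roles.add('math-symbol')
    if chars <= LABEL_CHARS:
        roles.add('label-or-filename')
    if chars <= COMPRESSED_CHARS:
        roles.add('compressed-proof')
    return roles
```

For instance, "az%e" can only be a math symbol, while "ABZ" could still be any of the three, which is exactly why the token kinds in the list above overlap.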
Do you have any advice? How is it done in metamath.c and in metamath-knife (Rust)? (I'm not very proficient in these languages.) In mmverify.py, very few checks are done, so the problem does not arise.
Hi Benoît,
For metamath-knife, I wrote that stage myself. I've called it
"statement parsing" and everything is self-contained in
`grammar.rs`. I basically started with a naive Moore machine, and
then modified it to handle missing cases and cover the family of
syntax of set.mm. I can give more explanation about how it works
if you need.
For mmj2, I think Mel O'Cat wrote it. It is an Earley parser, and
you can find it in `verify/EarleyParser.java`.
Both do two passes: a first pass where the grammar is constructed
based on the syntax axioms, and a second where each statement is
parsed to build parse trees (ASTs, abstract syntax trees). So
that's roughly the "parser generator" model you mention, even
though no code is generated as in Yacc. It might be possible to
do everything in a single pass, since Metamath only refers to
previous definitions; however, for metamath-knife, the "parse
statement" pass can be run in parallel threads, which all need
access to the grammar. Several threads can share read-only
access, but that would not work with read-write (mutable) access.
The grammar-building phase in metamath-knife is blazingly fast: it
runs in 4-6 ms on my 7-year-old laptop. The reason is that it only
cares about syntactic axioms, and there are only a few hundred of
them. Statement parsing then takes around 600 ms on my laptop.
Memory for every parse tree needs to be allocated, and then
possibly grown, so I'm sure there is room for improvement here.
Glauco said he was using a third-party Earley parser, `nearley.js`.
If it can cover set.mm, that's probably also your best option, and
in hindsight, I should probably have done the same in Rust!
BR,
Thierry
--
You received this message because you are subscribed to the Google Groups "Metamath" group.
To view this discussion on the web visit https://groups.google.com/d/msgid/metamath/b8db089f-e235-41e8-bfbd-20539de5a9d5n%40googlegroups.com.
Thanks Mario, very clear and useful! I'll do the checks at parsing, as you advise. Do you happen to remember in which file it is done in mmj2? I skimmed through some files at https://github.com/digama0/mmj2/tree/master/src/mmj but got lost in the huge codebase.
I have a second question concerning the "second-stage parsing" that mmj2 (and metamath-knife?) do, but metamath.c does not: namely, the parsing of the math expressions themselves. Do mmj2 and metamath-knife implement their own "parser generators", like Yacc? As I see it, the program would look for syntactic axioms (i.e., label $a w ..... $. where w has *already* appeared in an $f statement), and from each of them construct a production rule; for instance, $a wff ( ph /\ ps ) $. yields the rule wff -> '(' wff '/\' wff ')'. This is described in your "Models for Metamath". Then, you hope that your CFG is not too complex, and you feed these rules (all at once at the end of parsing, or progressively as you read the file?) to a Yacc-like program, either external or one that you built (as is the case for mmj2 and metamath-knife?). This would be pretty impressive. Can you point me to the Rust file or function that does that part?
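To make the rule-extraction step described above concrete, here is a sketch in Python (the function name is hypothetical, not taken from any of the tools discussed): each variable in a syntactic axiom is replaced by the typecode of its $f statement, and every other token becomes a terminal.

```python
def production_from_axiom(typecode, expression, var_typecodes):
    """Build a CFG production (lhs, rhs) from a syntactic $a statement.

    `var_typecodes` maps each variable to the typecode of its $f
    statement; any token not in it is treated as a terminal.
    """
    rhs = [var_typecodes.get(tok, tok) for tok in expression]
    return (typecode, rhs)

# $a wff ( ph /\ ps ) $.   with   $f wff ph $.   and   $f wff ps $.
lhs, rhs = production_from_axiom(
    'wff', ['(', 'ph', '/\\', 'ps', ')'], {'ph': 'wff', 'ps': 'wff'}
)
# yields the rule  wff -> '(' wff '/\' wff ')'
```

Collecting one such production per syntactic axiom gives exactly the grammar that an Earley-style parser can then consume.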