How should I approach comment parsing?


Philip White

unread,
May 12, 2022, 6:38:17 PM
to meta...@googlegroups.com
I'm working on implementing comment parsing in my verifier. It doesn't
seem to lend itself well to elegant implementations, however, since the
token structure of comments differs from that of non-comment text. In
addition, the token rules of $t comments and ordinary comments seem to
differ (for example, ordinary comments have no C-style /* ... */
comments, but $t comments do).

I can certainly figure this out by handling all the special cases and
bolting together a few different parsers, but I'm wondering if anyone
has found a nice technique for this.

- Philip

Benoit

unread,
May 12, 2022, 7:21:11 PM
to Metamath
The EBNF in the Metamath book differs slightly from metamath.c in that "$( $( $)" raises an error in metamath.c but is accepted by the EBNF grammar and by mmverify.py. I opened an issue about this (https://github.com/metamath/metamath-book/issues/233).

I clarified readc() in mmverify.py a bit, and I wrote a similar function for my OCaml verifier, which can reject the above example as metamath.c does. I copy both below:
-----------------------------------
    def readc(self) -> StringOption:
        """Read the next token once included files have been expanded and
        comments have been skipped.
        """
        tok = self.readf()
        while tok == '$(':
            # Note that we use 'read' in this while-loop, and not 'readf',
            # since inclusion statements within a comment are still comments
            # so should be skipped.
            # The following line is not necessary but makes things clearer;
            # note the similarity with the first three lines of 'readf'.
            tok = self.read()
            while tok and tok != '$)':
                tok = self.read()
            if not tok:
                raise MMError("Unclosed comment at end of file.")
            assert(tok == '$)')
            tok = self.read()
        vprint(70, "Token once comments skipped:", tok)
        return tok
-----------------------------------
(** Read a comment (and ignore its content, so does not need to be passed a
    [parse_env] as argument).  Comments should not nest, so opening-comment
    tokens are ignored. *)
let read_comment ic =
  Printf.printf "Reading a comment.\n";
  let rec read_comment_content ic =
    match read_token ic with
    | Some CloseComm -> Printf.printf "End of comment\n"
    | Some OpenComm ->
        Printf.printf "Comments should not nest.  Ignoring.\n";
        read_comment_content ic
    | Some _ -> read_comment_content ic
    | None -> Printf.printf "File ended while reading comment.  Ignoring.\n"
  in
  read_comment_content ic
-----------------------------------

Benoît

Mario Carneiro

unread,
May 13, 2022, 7:16:08 PM
to metamath
Parsing comments in a metamath verifier is not very hard and does not require any special tokenization; this is what Benoit is referring to. However, if you want to parse $t or $j comments, you will need a specialized parser for those, and it is true that the tokenization rules are different, somewhat C-style. The most accurate $t/$j parser I am aware of is https://github.com/david-a-wheeler/metamath-knife/blob/360e7b1/src/parser.rs#L809-L933 , but I think you can do the same thing in a more traditional parser-generator style. The summary is: fancy comments are a subset of the regular comment grammar. (Regular comments, parsed with the standard metamath lexer (i.e. whitespace-separated everything), have the form '$(' token* '$)' where no token contains an embedded occurrence of '$)' or '$('.) The first token in a fancy comment must be '$t' or '$j'. The remainder of the comment is lexed as follows:

fancy-comment ::= '$(' ws ('$t' | '$j') (ws token)* ws '$)'
ws ::= (' ' | '\t' | '\n' | '\r' | '/*' .* '*/')*
string ::= '[^']*' | "[^"]*"
keyword ::= [^;]+
element ::= string | keyword
token ::= element | ';'

In other words, inside fancy comments there are C-style comments /* like this */, counted as part of the whitespace that can appear between anything; what remains is separated into strings (which can be delimited by ' or " but have no escapes), "keywords" (any other unquoted thing), and semicolons (which delimit commands). The output of the lexing phase is the initial '$t' or '$j' and the sequence of tokens.
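
Here is a minimal sketch of that lexing phase in Python, in the style of mmverify.py (the regular expression and the function name are mine, not taken from any existing tool; the keyword class below also excludes quote characters, which is slightly stricter than the grammar above):
-----------------------------------
import re

# One alternative per token class; /* ... */ comments count as whitespace.
_FANCY_TOKEN = re.compile(r"""
    (?:[ \t\r\n]|/\*.*?\*/)+      # whitespace, including C-style comments
  | '(?P<sq>[^']*)'               # single-quoted string, no escapes
  | "(?P<dq>[^"]*)"               # double-quoted string, no escapes
  | (?P<semi>;)                   # command delimiter
  | (?P<kw>[^ \t\r\n;'"]+)        # keyword: any other unquoted run
""", re.VERBOSE | re.DOTALL)

def lex_fancy_comment(body):
    """Yield ('string', s), ('keyword', s) and ('semi', ';') tokens from
    the text of a fancy comment (between the '$t'/'$j' and the '$)')."""
    pos = 0
    while pos < len(body):
        m = _FANCY_TOKEN.match(body, pos)
        if m is None:  # e.g. an unterminated quoted string
            raise ValueError("lexing error at offset %d" % pos)
        pos = m.end()
        if m.lastgroup in ('sq', 'dq'):
            yield ('string', m.group(m.lastgroup))
        elif m.lastgroup == 'semi':
            yield ('semi', ';')
        elif m.lastgroup == 'kw':
            yield ('keyword', m.group('kw'))
        # no named group matched: it was whitespace or a /* */ comment
-----------------------------------

The token stream can be further separated into commands using the grammar: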

command ::= keyword (string | keyword)* ';'
fancy-comment ::= ('$t' | '$j') command*
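
A correspondingly hedged sketch of the command-grouping pass over the lexer's output (again, the names are illustrative):
-----------------------------------
def split_commands(tokens):
    """Group lexed (kind, value) tokens into commands; each command is
    the list of tokens before a ';' (the ';' itself is dropped). Per the
    grammar, a command must start with a keyword, and the comment must
    not end in the middle of a command."""
    command = []
    for kind, value in tokens:
        if kind == 'semi':
            if command:
                if command[0][0] != 'keyword':
                    raise ValueError("command must start with a keyword")
                yield command
            command = []
        else:
            command.append((kind, value))
    if command:
        raise ValueError("comment ended in the middle of a command")
-----------------------------------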

At this point you can start interpreting the commands. Each command has an initial keyword which specifies the grammar of the rest. The '$t' and '$j' commands differ in that '$t' is a closed class - there is an explicit list of enumerated valid commands - while '$j' is designed as extensible and parsers should ignore any unrecognized commands. The '$t' commands and their grammars are:

sum ::= string ('+' string)*
typesetting-command ::=
  ('htmldef' | 'althtmldef' | 'latexdef') string 'as' sum ';' |
  'htmlvarcolor' sum ';' |
  'htmltitle' sum ';' |
  'htmlhome' sum ';' |
  'exthtmltitle' sum ';' |
  'exthtmlhome' sum ';' |
  'exthtmllabel' sum ';' |
  'htmldir' sum ';' |
  'althtmldir' sum ';' |
  'htmlbibliography' sum ';' |
  'exthtmlbibliography' sum ';' |
  'htmlcss' sum ';' |
  'htmlfont' sum ';' |
  'htmlexturl' sum ';'
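
Most of these commands are built on the sum production; here is a minimal sketch of parsing it (the function name is mine, and real tools may validate more strictly):
-----------------------------------
def parse_sum(tokens):
    """Parse sum ::= string ('+' string)* from the front of 'tokens',
    a list of (kind, value) pairs; return the concatenated string and
    the remaining tokens."""
    if not tokens or tokens[0][0] != 'string':
        raise ValueError("expected a string")
    parts, rest = [tokens[0][1]], tokens[1:]
    while len(rest) >= 2 and rest[0] == ('keyword', '+') \
            and rest[1][0] == 'string':
        parts.append(rest[1][1])
        rest = rest[2:]
    return ''.join(parts), rest
-----------------------------------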


The list of '$j' commands is open ended as mentioned, but some commands you will find in set.mm are:

extra-command ::=
  'syntax' string ('as' string)? ';' |
  'unambiguous' string ';' |
  'primitive' string* ';' |
  'justification' string 'for' string ';' |
  'equality' string 'from' string string string ';' |
  'definition' string 'for' string ';' |
  'congruence' string* ';' |
  'bound' string ';' |
  'natded_init' string string string ';' |
  ('natded_assume' | 'natded_weak' | 'natded_cut') string* ';' |
  ('natded_true' | 'natded_imp' | 'natded_and' | 'natded_or' | 'natded_not') string+ 'with' string* ';' |
  ('free_var' | 'free_var_in') string 'with' string* ';' |
  ('condequality' | 'notfree') string 'from' string ';' |
  'condcongruence' string* ';' |
  'restatement' string 'of' string ';' |
  'garden_path' element+ '=>' element+ ';' |
  ...


I don't know of any verifier or proof assistant that recognizes all of these; they were added at different times for different tools, and set.mm has historically been pretty okay with adding various kinds of metadata to support specialized analyses.




Philip White

unread,
May 13, 2022, 8:15:46 PM
to meta...@googlegroups.com
Thanks for the depth of the responses!

A way that I'm used to writing parsers is to first write a lexer that
operates on the part of the syntax that falls into the "regular" class
of grammars. The main metamath syntax fits this neatly, including
comments. However, once you try to understand the contents of those
comments, the only part of the syntax that remains regular is single
characters as tokens, I think. To illustrate this, note that "latexdef"
is an ordinary symbol token in the main metamath syntax, but it is a
keyword in $t comment syntax.

I'm glad you wrote down the grammar for comments; I was looking for
that at one point. The grammar does use characters as tokens, so it
doesn't solve the issue (I know you weren't saying that it did). I was
able to figure out the $t definition types through reading the book and
trial and error, but I haven't even tried to think about $j comments
yet (not sure if I will).

My first pass at this has been to parse the main metamath syntax, and
then run a totally separate parser on the string contents of each
comment. However, this won't hold up now that I'm starting to parse
the markup of ordinary comments, since math mode requires switching
back to the ordinary metamath tokens.

A solution that seems to me more elegant (though I haven't tried it
yet) is to have a single parser but multiple lexing functions, where
the parser calls the appropriate lexer. Each lexer reads the regular
part of its respective "sub-syntax", and the parser decides when to
switch between sub-languages.

At several points in this process, I've found myself wishing that the
minimalism of the main metamath syntax was preserved in the comment
syntax. For example, the $t comment section could have used $latexdef
and $htmldef as keywords, and instead of C-style comments, allow actual
nested comments. The property of the syntax being a sequence of
space-delimited symbols plus keywords that start with $ is refreshingly
simple, and it would have been nice to work with it in comments.
Anyway, I don't mean to criticize from an armchair the folks who
implemented this system; it would be rather difficult to change at this
point, I would think.

Another question about comments: what is a good heuristic for deciding
when a comment is attached to a statement? My first guess has been to
attach the immediately preceding comment to each theorem or axiom, if
there is one. This seems to yield good results, but I noticed that
set.mm sometimes has comments on the same line, directly after the
statement (this is especially common for constant and variable
declarations). I doubt that attaching a comment immediately after a
statement would be good, unless I make the parser whitespace-sensitive
so that I can distinguish when it is on the same line as the
declaration.

- Philip

Mario Carneiro

unread,
May 14, 2022, 12:55:52 AM
to metamath
On Fri, May 13, 2022 at 8:15 PM 'Philip White' via Metamath <meta...@googlegroups.com> wrote:
Thanks for the depth of the responses!

A way that I'm used to writing parsers is to first write a lexer that
operates on the part of the syntax that falls into the "regular" class
of grammars. The main metamath syntax fits this neatly, including
comments. However, once you try to understand the contents of those
comments, the only part of the syntax that remains regular is single
characters as tokens, I think. To illustrate this, note that "latexdef"
is an ordinary symbol token in the main metamath syntax, but it is a
keyword in $t comment syntax.

You should parse the comment twice with two entirely separate parsers. Using a single multi-mode parser is going to be very messy, especially because one parser has absolute veto power over the other.
 
I'm glad you wrote down the grammar for comments; I was looking for
that at one point. The grammar does use characters as tokens, so it
doesn't solve the issue (I know you weren't saying that it did).

I'm not sure what you mean by "solve the issue". That's what the grammar is; it plainly allows semicolons (at least) to appear with no surrounding spaces, so the grammar has to account for that.

My first pass at this has been to parse the main metamath syntax, and
then run a totally separate parser on the string contents of each
comment.
 
Good. Do that. It will be good for your sanity.

However, this won't hold up now that I'm starting to parse
the markup of ordinary comments, since math mode requires switching
back to the ordinary metamath tokens.

Oh... the math markup syntax is a whole separate mess. The issue there is that the parser is a bunch of layered regexes, so it has no nice grammar at all, at least if you care about the edge cases. The recent discussion https://groups.google.com/g/metamath/c/kdbrroGzJnc/m/7PCMZGgSAQAJ chronicles my attempt to write a fully accurate markup parser.

Ignoring the edge cases, I think the most reasonable way to understand markup parsing is just as having a string-quoting operator. The body text is parsed at the character level, but `foo` starts a quoted math section, and you use the metamath parser (split on space) to get the tokens out (after replacing `` with `). I don't think using a CFG here would help.
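
A minimal sketch of that quoting behavior (the function name and segment representation are invented for illustration; real markup has more constructs, e.g. ~ labels and [bib] references, which are ignored here):
-----------------------------------
def split_math(comment):
    """Split markup text into ('text', str) and ('math', [token])
    segments. A ` starts math mode; inside it, `` stands for a literal
    ` and the content is split on whitespace."""
    segments, pos = [], 0
    while True:
        start = comment.find('`', pos)
        if start == -1:
            if pos < len(comment):
                segments.append(('text', comment[pos:]))
            return segments
        if start > pos:
            segments.append(('text', comment[pos:start]))
        chars, i = [], start + 1
        while i < len(comment):
            if comment[i] == '`':
                if comment[i + 1 : i + 2] == '`':
                    chars.append('`')  # `` is an escaped backtick
                    i += 2
                    continue
                break  # closing backtick
            chars.append(comment[i])
            i += 1
        else:
            raise ValueError("unclosed math mode")
        segments.append(('math', ''.join(chars).split()))
        pos = i + 1
-----------------------------------
For example, split_math("so ` x = 5 ` holds") yields [('text', 'so '), ('math', ['x', '=', '5']), ('text', ' holds')].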
 
A solution that seems to me more elegant (though I haven't tried it
yet) is to have a single parser but multiple lexing functions, where
the parser calls the appropriate lexer. Each lexer reads the regular
part of its respective "sub-syntax", and the parser decides when to
switch between sub-languages.

The different parsers really have nothing in common. It's just completely separate languages with one hiding out in the legal comment syntax of the other.

At several points in this process, I've found myself wishing that the
minimalism of the main metamath syntax was preserved in the comment
syntax. For example, the $t comment section could have used $latexdef
and $htmldef as keywords, and instead of C-style comments, allow actual
nested comments. The property of the syntax being a sequence of
space-delimited symbols plus keywords that start with $ is refreshingly
simple, and it would have been nice to work with it in comments.
Anyway, I don't mean to criticize from an armchair the folks who
implemented this system; it would be rather difficult to change at this
point, I would think.

I wasn't involved in the original decision-making here, but if I had to guess, the reason is that we don't want the $ to confuse a metamath parser. Keep in mind that for the vast majority of metamath parsers, comments are just comments; no parsing is required. In particular, nested comments would impose a burden on simplistic verifiers, so using a different style of comment seems like a good idea to me. (I wouldn't mind spacing things out a bit more, though, to keep the same simple lexer.)
 
Another question about comments: what is a good heuristic for deciding
when a comment is attached to a statement? My first guess has been to
attach the immediately preceding comment to each theorem or axiom, if
there is one. This seems to yield good results, but I noticed that
set.mm sometimes has comments on the same line, directly after the
statement (this is especially common for constant and variable
declarations). I doubt that attaching a comment immediately after a
statement would be good, unless I make the parser whitespace-sensitive
so that I can distinguish when it is on the same line as the
declaration.

The official definition is that a markup comment immediately precedes a labeled statement ($f, $a, and $p; I don't think you can comment $e). $c and $v statements don't have tooling-recognized comments, so you should read those comments as analogous to single-line non-doc comments. If you would like to parse these comments, one reasonable definition is that the comment should come immediately after the $c statement without an intervening newline. (Again, "unless I make the parser whitespace-sensitive so that I can distinguish when it is on the same line as the declaration" sounds like you are trying to make one parser to rule them all, and this will not end well. Markup parsing is substantially more heuristic than everything else, likely because it arose organically from unofficial formatting conventions.)
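
For what it's worth, here is one way to sketch that heuristic in Python (the item tuples and names are invented; they don't correspond to any particular verifier's API):
-----------------------------------
def attach_doc_comments(items):
    """'items' is a stream of ('comment', text, line) and
    ('stmt', keyword, label, line) tuples from a hypothetical parser
    that records the line each item starts on. A comment immediately
    preceding a $f/$a/$p becomes its doc comment; a comment starting on
    the same line as a $c/$v is attached to that declaration."""
    docs = {}           # label -> doc comment ($f/$a/$p)
    decl_notes = {}     # line number of a $c/$v -> trailing comment
    pending = None      # most recent comment, not yet consumed
    last_decl_line = None
    for item in items:
        if item[0] == 'comment':
            _, text, line = item
            if line == last_decl_line:
                decl_notes[line] = text
            else:
                pending = text
        else:
            _, keyword, label, line = item
            if keyword in ('$f', '$a', '$p') and pending is not None:
                docs[label] = pending
            pending = None
            last_decl_line = line if keyword in ('$c', '$v') else None
    return docs, decl_notes
-----------------------------------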

Mario

Philip White

unread,
May 18, 2022, 9:31:44 PM
to meta...@googlegroups.com
Maybe I'm running into issues that were mentioned in the email thread
you linked to (https://groups.google.com/g/metamath/c/kdbrroGzJnc/m/7PCMZGgSAQAJ),
but what's the right thing to do about this comment text that shows up
in set.mm:

@( The cardinality of a set is a cardinal number. @)
cardel @p |- ( card ` A ) e. Card @=
( ccrd cfv ccdn wcel wceq cardidm elcard mpbir ) ABCZDEJBCJFAGJHI @.
@( [16-Jan-2005] @) @( [23-Sep-2004] @)

It's a theorem with backticks in it that got commented out, but the
backticks weren't doubled. My parser thinks that this backtick should
start math mode, and that seems right? I wouldn't mind sending a PR to
fix the places that break this rule, if that is desirable.

Mario Carneiro

unread,
May 18, 2022, 9:35:32 PM
to metamath
Not all comments are markup comments. It is up to you how to handle this: you could parse all comments as markup and give up if it doesn't go well, or only try to parse comments in particular positions as markup. The latter is what mm-knife does: comments immediately before a labeled statement and header comments are parsed as markup and others are ignored.
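
In terms of the earlier sketches, that policy might look like this (illustrative only; header-comment detection is omitted):
-----------------------------------
def markup_comments(items):
    """Parse as markup only the comments attached to labeled
    statements (via attach_doc_comments above); other comments are
    left uninterpreted."""
    docs, _ = attach_doc_comments(items)
    return {label: split_math(text) for label, text in docs.items()}
-----------------------------------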
