Thanks for the depth of the responses!
A way that I'm used to writing parsers is to first write a lexer that
operates on the part of the syntax that falls into the "regular" class
of grammars. The main metamath syntax fits this neatly, including
comments. However, once you try to understand the contents of those
comments, the only part of the syntax that remains regular is single
characters as tokens, I think. To illustrate this, note that "latexdef"
is an ordinary symbol token in the main metamath syntax, but it is a
keyword in $t comment syntax.
I'm glad you wrote down the grammar for comments; I was looking for
that at one point. The grammar does use characters as tokens, so it
doesn't solve the issue (I know you weren't saying that it did). I was
able to figure out the $t definition types through reading the book and
trial and error, but I haven't even tried to think about $j comments
yet (not sure if I will).
My first pass at this has been to parse the main metamath syntax, and
then run a totally separate parser on the string contents of each
comment. However, this won't hold up now that I'm starting to parse
the markup of ordinary comments, since math mode requires switching
back to the ordinary metamath tokens.
A solution that seems to me more elegant (though I haven't tried it
yet), is to have a single parser, but multiple lexing functions that
operate. The parser calls the appropriate lexer. Each lexer reads the
regular part of its respective "sub-syntax", and the parser decides
when to switch between sub-languages.
At several points in this process, I've found myself wishing that the
minimalism of the main metamath syntax was preserved in the comment
syntax. For example, the $t comment section could have used $latexdef
and $htmldef as keywords, and instead of C-style comments, allow actual
nested comments. The property of the syntax being a sequence of
space-delimited symbols plus keywords that start with $ is refreshingly
simple, and it would have been nice to work with it in comments.
Anyway, I don't mean to criticize from an armchair the folks who
implemented this system; it would be rather difficult to change at this
point, I would think.
Another question about comments: what is a good heuristic for deciding
when a comment is attached to a statement? My first guess has been to
attach the immediately preceding comment to each theorem or axiom, if
there is one. This seems to yield good results, but I noticed that
sometimes has comments on the same line, directly after the
the statement (this is especially common for constant and variable
declarations). I doubt that attaching an comments immediately after a
statement would be good, unless I make the parser be
whitespace-sensitive so that I can distinguish when it is on the same
line as the declaration.
> > <https://groups.google.com/d/msgid/metamath/2770d75f-6d32-4ab5-be5f-0634296feeaan%40googlegroups.com?utm_medium=email&utm_source=footer
> > .