set.mm logic syntax is not LR(n)


Jens-Wolfhard Schicke-Uffmann

Jan 5, 2016, 6:36:39 PM1/5/16
to Metamath
Hi,

I tried to convert Igor's parser to LALR(1), but the new parser failed to
parse some syntactically valid theorems, alerting me to the following
problem.

Please consider:

case1 $p |- ( x e. { 1 } /\ x e. { 1 } ) $= ? $.
case2 $p |- ( x e. { 1 } |-> 0 ) = ( x e. { 1 } |-> 0 ) $= ? $.

After the initial ` ( x ` , the lookahead is ` e. `
and the parser needs to either
* reduce ` x ` via cv and later via wcel (case1)
* shift the ` e. ` to ultimately reduce via cmpt (case2)

The correct action cannot be determined from the lookahead, and since an
arbitrarily long class expression follows next, there is not even a finite
lookahead horizon; hence the math syntax of set.mm is not LR(n) for any n.

Given that set.mm makes various comments about not complicating
parser implementations, this might be undesirable.

An obvious solution would be to use different parentheses for cmpt and
cmpt2. I currently do not know whether other instances of this problem exist,
which would then also require fixing. If you think the current syntax is
aesthetically sufficiently superior, I'll move to a GLR parser for Igor.


Regards,
Drahflow

Mario Carneiro

Jan 5, 2016, 7:14:07 PM1/5/16
to metamath
At first glance, this seems not to be a big problem. You can parse the class expression { 1 } after reading "( x e." without needing to disambiguate the two cases, and once you return from parsing a class expression the next token will be "/\" or "|->" to distinguish the two cases (or "," for cmpt2). I am not a grammar expert, but I have a few quotes from Norm (quoting someone else, probably Raph Levien or Mel O'Cat) saying that set.mm's grammar is LALR(7) (the number 7 comes from the parsing of coprab).

Mario



Mario Carneiro

Jan 5, 2016, 8:00:32 PM1/5/16
to metamath
For the specific shift-reduce conflict you have presented, it looks like the correct move is to shift, so that the parse stack looks like:

-> "(" -> shift
"(" -> "x" -> shift
"(" "x" -> "e." -> reduce vx
"(" [vx] -> "e." -> shift
"(" [vx] "e." -> "{" -> shift
"(" [vx] "e." "{"-> "1" -> shift
"(" [vx] "e." "{" "1" -> "}" -> reduce c1
"(" [vx] "e." "{" [c1] -> "}" -> shift
"(" [vx] "e." "{" [c1] "}" -> ? -> reduce csn
"(" [vx] "e." [c1 csn] -> ?

At this point, the next move depends on whether the lookahead ? is "/\" (case1) or "|->" (case2). In case2, we can shift and proceed as normal, eventually reducing with cmpt, but in case1 this leads to a problem since we need to reduce with wcel, which requires the stack to look like [vx cv] "e." [c1 csn] instead. This can be manually fixed by adding a rule to the parser to construct "class x e. A" in one step (either literally by adding a theorem to this effect and using it in the parser, or by producing a composite RPN "? cv ? wcel" whenever this comes up).
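One way to realize the composite-rule idea is to attach a small template tree to the fused production and substitute the popped operands into its slots at reduce time. A minimal sketch (the Node class, slot encoding, and template name are invented here and are not mmj2's API):

```python
class Node:
    """A tiny parse-tree node."""
    def __init__(self, label, kids=()):
        self.label, self.kids = label, list(kids)

    def rpn(self):
        # postorder serialization: operands first, then the rule label
        return [s for k in self.kids for s in k.rpn()] + [self.label]

def substitute(tmpl, args):
    """Copy `tmpl`, replacing slot nodes "$i" with args[i]."""
    if tmpl.label.startswith("$"):
        return args[int(tmpl.label[1:])]
    return Node(tmpl.label, [substitute(k, args) for k in tmpl.kids])

# "wff x e. A" as a single fused rule: wcel applied to (cv $0) and $1
WEL_TEMPLATE = Node("wcel", [Node("cv", [Node("$0")]), Node("$1")])

# Reducing with the composite rule: the two operands popped from the
# parse stack are the variable x (vx) and the class { 1 } (c1 csn)
tree = substitute(WEL_TEMPLATE, [Node("vx"), Node("csn", [Node("c1")])])
print(tree.rpn())  # ['vx', 'cv', 'c1', 'csn', 'wcel']
```

Note that the cv step lands between the two operands in the output RPN, which is why a flat textual substitution of the composite symbol would not suffice: the expansion has to happen on the tree.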

It is possible that the LALR(7) claim predates the wel/wcel change, which I think is the source of the problem; let me know if you see any other similar problems. Note that the same technique can be used to resolve "{ <. <. x , y >. , z >. | ..." vs "{ <. <. x , y >. , z >. } ..." with a single lookahead, if theorems are added for "class <. x , y >." and "class <. <. x , y >. , z >.".

Mario

Mario Carneiro

Jan 6, 2016, 7:53:20 AM1/6/16
to metamath
In order to handle "{ <. <. x , A" and "{ <. <. x , y >. , A" without backtracking, you also need the theorems

wff x e. A
class <. x , A >.

class <. x , y >.
class <. <. x , y >. , A >.

class <. <. x , y >. , z >.

I have not found any other cases that need such composite rules, and it is not clear to me if these rules can be automatically discovered based on the known shift/reduce conflicts.

To be clear, this is not a failure of unambiguity of the grammar, but only of the simplicity of parsing the grammar. Indeed this may be a way to simplify http://us.metamath.org/downloads/grammar-ambiguity.txt by a "constructive" proof of unambiguity using a parser, although ironically adding the above theorems (or any derived syntax theorems for that matter) makes the resulting grammar ambiguous, since any proof using a theorem can be rewritten using the derivation of the theorem.

Now for some examples. Here is a simplified set.mm grammar that retains the issue:

W -> ( W /\ W )
W -> C e. C
W -> 'ph'
C -> ( V e. C |-> C )
C -> V
C -> 'A'
V -> 'x'

We want to distinguish ( x e. A /\ ph ) from ( x e. A |-> A ) e. A.
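The closure and goto steps used to build the states below follow the standard LR(0) construction; here is a minimal sketch in Python (the function names are mine, not from any Metamath tool):

```python
# LR(0) items are (head, body, dot); the toy grammar from above.
GRAMMAR = {
    "W": [("(", "W", "/\\", "W", ")"), ("C", "e.", "C"), ("ph",)],
    "C": [("(", "V", "e.", "C", "|->", "C", ")"), ("V",), ("A",)],
    "V": [("x",)],
}

def closure(items):
    """For every item with the dot before a nonterminal, add all of
    that nonterminal's productions with the dot at position 0."""
    items = set(items)
    changed = True
    while changed:
        changed = False
        for head, body, dot in list(items):
            if dot < len(body) and body[dot] in GRAMMAR:
                for prod in GRAMMAR[body[dot]]:
                    if (body[dot], prod, 0) not in items:
                        items.add((body[dot], prod, 0))
                        changed = True
    return items

def goto(items, symbol):
    """Advance the dot over `symbol` in every item, then close."""
    return closure({(h, b, d + 1) for h, b, d in items
                    if d < len(b) and b[d] == symbol})

state0 = closure({("W'", ("W",), 0)})  # augmented start production
state1 = goto(state0, "(")             # matches "state 1" below
```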

State 0:
W -> * ( W /\ W )
W -> * C e. C
W -> * 'ph'
C -> * ( V e. C |-> C )
C -> * V
C -> * 'A'
V -> * 'x'

From this state we read a '(' and move to state 1:
W -> ( * W /\ W )
C -> ( * V e. C |-> C )
W -> * ( W /\ W )
W -> * C e. C
W -> * 'ph'
C -> * ( V e. C |-> C )
C -> * V
C -> * 'A'
V -> * 'x'

then we read an 'x' and move to state 2:
V -> 'x' *

Currently the stack has states [0,1,2], and we pop 2 and reduce with V -> 'x', reaching state 3*:
C -> ( V * e. C |-> C )
C -> V *

Here we have a shift-reduce conflict, as you said, since we want to shift 'e.' and also reduce C -> V. To see if we can discover the necessary repair to the table (to shift here and add the special derived rule W -> V e. C), we backtrack to look at the stack more carefully. We arrived at C -> * V by closure from W -> C e. C, so we composite these two rules to get W -> V e. C and pretend that state 1 was actually:
W -> ( * W /\ W )
C -> ( * V e. C |-> C )
W -> * ( W /\ W )
W -> * V e. C
W -> * 'ph'

(which was created by suppressing W -> C e. C and adding W -> V e. C when taking the closure of {W -> ( * W /\ W ), C -> ( * V e. C |-> C )} ) so that the reduction actually leads to state 3, which looks like:
C -> ( V * e. C |-> C )
W -> V * e. C

Shifting the 'e.', we reach state 4:
C -> ( V e. * C |-> C )
W -> V e. * C
C -> * ( V e. C |-> C )
C -> * V
C -> * 'A'
V -> * 'x'

and shift 'A' to get to state 5:
C -> 'A' *

and pop 5 and reduce with C -> 'A'. At this point the stack has states [0,1,3,4,5], and the reduce takes us to state 6:
C -> ( V e. C * |-> C )
W -> V e. C *

This is a shift-reduce conflict for LR(0), but the lookahead resolves it: since |-> is not in the follow set of W, we should reduce W -> V e. C only if the next token is not |->. Since the stack is now [0,1,3,4,6], the reduce would pop 3,4,6 and take us to state 7:
W -> ( W * /\ W )

The rest continues as you would expect. The reduction by W -> V e. C entails unfolding its composite definition - to keep it linear time you could output a special theorem to the RPN here and then expand it in a second pass.


One problem with this approach is that in the final step we rely on |-> not being in the follow set of W. Although there are currently no rules of the form C -> & W where & is some terminal, this is by no means a prohibited syntax extension (it is a prefix axiom in the language of grammar-ambiguity.txt), and it would add |-> to the follow set of W by the derivation ( V e. & W |-> C ). Nevertheless it is still impossible for state 6 to allow reducing W -> V e. C because the other derivation C -> ( V e. C * |-> C ) implies that the token before W is '(', not '&'.
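The FOLLOW computation appealed to in the last step is the standard textbook construction; here is a sketch for the toy grammar (simplified by the fact that this grammar has no empty productions; names are mine):

```python
GRAMMAR = {
    "W": [("(", "W", "/\\", "W", ")"), ("C", "e.", "C"), ("ph",)],
    "C": [("(", "V", "e.", "C", "|->", "C", ")"), ("V",), ("A",)],
    "V": [("x",)],
}

def first(sym, grammar, seen=None):
    """Terminals that can begin a string derived from `sym`."""
    if sym not in grammar:
        return {sym}            # a terminal begins with itself
    seen = set() if seen is None else seen
    if sym in seen:
        return set()
    seen.add(sym)
    out = set()
    for body in grammar[sym]:
        out |= first(body[0], grammar, seen)
    return out

def follow_sets(grammar, start):
    """Terminals that can appear immediately after each nonterminal."""
    follow = {nt: set() for nt in grammar}
    follow[start].add("$")      # end-of-input marker
    changed = True
    while changed:              # iterate to a fixed point
        changed = False
        for head, prods in grammar.items():
            for body in prods:
                for i, sym in enumerate(body):
                    if sym not in grammar:
                        continue
                    new = (first(body[i + 1], grammar) if i + 1 < len(body)
                           else follow[head])
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow
```

For this grammar, FOLLOW(W) comes out as { $, /\, ) }, confirming that |-> is not in it.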

For another example, let's try to derive the aforementioned theorems "class <. x , A >." and "class <. x , y >.". A minimal example grammar:

C -> { <. V , V >. | W }
C -> <. C , C >.
C -> { C }
C -> V
C -> 'A'
W -> 'ph'
V -> 'x'

We will parse { <. x , x >. | ph } against { <. x , x >. }. State 0:
C -> * { <. V , V >. | W }
C -> * <. C , C >.
C -> * { C }
C -> * V
C -> * 'A'
V -> * 'x'

shift '{', state 1:
C -> { * <. V , V >. | W }
C -> { * C }
C -> * { <. V , V >. | W }
C -> * <. C , C >.
C -> * { C }
C -> * V
C -> * 'A'
V -> * 'x'

shift '<.', state 2:
C -> { <. * V , V >. | W }
C -> <. * C , C >.
C -> * { <. V , V >. | W }
C -> * <. C , C >.
C -> * { C }
C -> * V
C -> * 'A'
V -> * 'x'

shift 'x', state 3:
V -> 'x' *

pop 3 and reduce V -> 'x', state 4*:
C -> { <. V * , V >. | W }
C -> V *

A shift-reduce conflict is obtained. We reached state 4* from state 2, so we composite C -> V with C -> <. C , C >. so that the correct state 4 is:
C -> { <. V * , V >. | W }
C -> <. V * , C >.

shift ',', state 5:
C -> { <. V , * V >. | W }
C -> <. V , * C >.
C -> * { <. V , V >. | W }
C -> * <. C , C >.
C -> * { C }
C -> * V
C -> * 'A'
V -> * 'x'

shift 'x', taking us back to state 3, then reduce V -> 'x' to get to state 6*:
C -> { <. V , V * >. | W }
C -> V *


Again, a shift-reduce conflict, so we backtrack from state 6* to state 5; the offending reduction is C -> V through C -> <. V , C >. so we composite again to get state 6:
C -> { <. V , V * >. | W }
C -> <. V , V * >.

shift '>.', state 7:
C -> { <. V , V >. * | W }
C -> <. V , V >. *

And now the lookahead tells the difference between shift and reduce. Note that the two constructed rules C -> <. V , C >. and C -> <. V , V >. match the previously given rules
class <. x , A >.

class <. x , y >.

as we wanted to show.

I'll see if I can mock up a shift-reduce parser for set.mm's grammar to see if there are any other such generated rules, but this definitely shows that it is possible to parse set.mm's grammar using a standard shift-reduce parser, with some extra tricks in the parse table generation phase.

Mario

fl

Jan 6, 2016, 8:14:12 AM1/6/16
to Metamath
> If you think the current syntax is
> aesthetically sufficiently superior, I'll move to a GLR parser for Igor.


mmj2 uses an Earley parser. It works remarkably well despite the
size of set.mm, perhaps because the grammar is essentially linear in structure.
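For readers unfamiliar with it, a minimal Earley recognizer fits in a few dozen lines; this generic textbook sketch (not mmj2's implementation, and assuming no empty productions) runs on the toy grammar from earlier in the thread:

```python
GRAMMAR = {
    "W": [("(", "W", "/\\", "W", ")"), ("C", "e.", "C"), ("ph",)],
    "C": [("(", "V", "e.", "C", "|->", "C", ")"), ("V",), ("A",)],
    "V": [("x",)],
}

def earley_recognize(grammar, start, tokens):
    # chart[i] holds items (head, body, dot, origin)
    chart = [set() for _ in range(len(tokens) + 1)]
    for body in grammar[start]:
        chart[0].add((start, body, 0, 0))
    for i in range(len(tokens) + 1):
        added = True
        while added:
            added = False
            for head, body, dot, origin in list(chart[i]):
                if dot < len(body):
                    sym = body[dot]
                    if sym in grammar:                          # predict
                        for prod in grammar[sym]:
                            if (sym, prod, 0, i) not in chart[i]:
                                chart[i].add((sym, prod, 0, i))
                                added = True
                    elif i < len(tokens) and tokens[i] == sym:  # scan
                        chart[i + 1].add((head, body, dot + 1, origin))
                else:                                           # complete
                    for h2, b2, d2, o2 in list(chart[origin]):
                        if d2 < len(b2) and b2[d2] == head:
                            if (h2, b2, d2 + 1, o2) not in chart[i]:
                                chart[i].add((h2, b2, d2 + 1, o2))
                                added = True
    return any(h == start and d == len(b) and o == 0
               for h, b, d, o in chart[len(tokens)])
```

Unlike an LR parser, it carries all viable parses forward in the chart, so the shift/reduce conflicts discussed above never arise.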

-- 
FL

Jens-Wolfhard Schicke-Uffmann

Jan 6, 2016, 11:23:14 AM1/6/16
to meta...@googlegroups.com
On Wed, Jan 06, 2016 at 07:53:19AM -0500, Mario Carneiro wrote:
> To be clear, this is not a failure of unambiguity of the grammar, but only of
> the simplicity of parsing the grammar. Indeed this may be a way to simplify
> http://us.metamath.org/downloads/grammar-ambiguity.txt by a "constructive"
> proof of unambiguity using a parser, although ironically adding the above
> theorems (or any derived syntax theorems for that matter) makes the resulting
> grammar ambiguous, since any proof using a theorem can be rewritten using the
> derivation of the theorem.
Yes. So "just" adding the theorems to set.mm will actually make the
situation worse.

> I'll see if I can mock up a shift-reduce parser for set.mm's grammar to see if
> there are any other such generated rules, but this definitely shows that it is
> possible to parse set.mm's grammar using a standard shift-reduce parser, with
> some extra tricks in the parse table generation phase.
Unless one writes a parser generator specifically for metamath, doing
fancy stuff in the parse table generation phase is rather hard - usually
this is hidden somewhere inside the generator library. But for showing
constructively that the syntax is unambiguous, this could be useful.


I converted my LALR(1) library to a nondeterministic LALR(1) parser and
everything works fine (and fast) now for Igor.

Jens

Mario Carneiro

Jan 6, 2016, 12:23:35 PM1/6/16
to metamath
On Wed, Jan 6, 2016 at 11:23 AM, Jens-Wolfhard Schicke-Uffmann <drah...@gmx.de> wrote:
>> I'll see if I can mock up a shift-reduce parser for set.mm's grammar to see if
>> there are any other such generated rules, but this definitely shows that it is
>> possible to parse set.mm's grammar using a standard shift-reduce parser, with
>> some extra tricks in the parse table generation phase.
> Unless one writes a parser generator specifically for metamath, doing
> fancy stuff in the parse table generation phase is rather hard - usually
> this is hidden somewhere inside the generator library. But for showing
> constructively that the syntax is unambiguous, this could be useful.
 
Actually, the example derivation was intended to show that the tricks are all automatically generated and not specific to metamath, so in essence it seems I've defined a new class of LR(k)-like grammars somewhere between LR(1) and LALR(1). I'm going to try to make an LRParser plugin to mmj2 and see whether it all works out, but there might even be a paper in here somewhere if it all pans out.

> I converted my LALR(1) library to a nondeterministic LALR(1) parser and
> everything works fine (and fast) now for Igor.

That's great. In mmj2 everything is done manually anyway so it doesn't make a big difference, but if you are using a library for the parsing that's probably the best option.

Mario

Mario Carneiro

Jan 6, 2016, 5:47:03 PM1/6/16
to metamath
I have the results from the initial LR parse tree generation on the full set.mm syntax. Output:

shift/reduce conflict: {class -> <. set , set >. *, class -> { <. set , set >. * | wff }} => "shift |" / "class -> <. set , set >."
shift/reduce conflict: {class -> <. set , set >. *, class -> { <. <. set , set >. * , set >. | wff }} => "shift ," / "class -> <. set , set >."
shift/reduce conflict: {wff -> class class class *, class -> ( class class class * )} => "shift )" / "wff -> class class class"
shift/reduce conflict: {wff -> class class set *, class -> ( class class set * )} => "shift )" / "wff -> class class set"
shift/reduce conflict: {wff -> class set class *, class -> ( class set class * )} => "shift )" / "wff -> class set class"
shift/reduce conflict: {wff -> class set set *, class -> ( class set set * )} => "shift )" / "wff -> class set set"
shift/reduce conflict: {wff -> set e. class *, class -> ( set e. class * |-> class ), class -> ( set e. class * |-> set ), class -> ( set e. class * , set e. class |-> class ), class -> ( set e. class * , set e. set |-> class ), class -> ( set e. class * , set e. class |-> set ), class -> ( set e. class * , set e. set |-> set )} => "shift |->" / "wff -> set e. class"
shift/reduce conflict: {wff -> set e. class *, class -> ( set e. class * |-> class ), class -> ( set e. class * |-> set ), class -> ( set e. class * , set e. class |-> class ), class -> ( set e. class * , set e. set |-> class ), class -> ( set e. class * , set e. class |-> set ), class -> ( set e. class * , set e. set |-> set )} => "shift ," / "wff -> set e. class"
shift/reduce conflict: {wff -> set e. set *, class -> ( set e. set * |-> class ), class -> ( set e. set * |-> set ), class -> ( set e. set * , set e. class |-> class ), class -> ( set e. set * , set e. set |-> class ), class -> ( set e. set * , set e. class |-> set ), class -> ( set e. set * , set e. set |-> set )} => "shift |->" / "wff -> set e. set"
shift/reduce conflict: {wff -> set e. set *, class -> ( set e. set * |-> class ), class -> ( set e. set * |-> set ), class -> ( set e. set * , set e. class |-> class ), class -> ( set e. set * , set e. set |-> class ), class -> ( set e. set * , set e. class |-> set ), class -> ( set e. set * , set e. set |-> set )} => "shift ," / "wff -> set e. set"
shift/reduce conflict: {wff -> set class class *, class -> ( set class class * )} => "shift )" / "wff -> set class class"
shift/reduce conflict: {wff -> set class set *, class -> ( set class set * )} => "shift )" / "wff -> set class set"
shift/reduce conflict: {wff -> set set class *, class -> ( set set class * )} => "shift )" / "wff -> set set class"
shift/reduce conflict: {wff -> set set set *, class -> ( set set set * )} => "shift )" / "wff -> set set set"

This is using an LR(0) algorithm; most of the conflicts should go away once it uses the lookahead to eliminate spurious reduces. Note that there are no reduce/reduce conflicts. The conflicts are treated as warnings and resolved by always shifting whenever a shift is an option. Amazingly, this simple LR(0) approach is able to correctly parse all formulas in set.mm except for two:

dfmpt2 $p |- ( x e. A , y e. B |-> C ) = U_ x e. A U_ y e. B { <. <. x , y >. , C >. } $=
ghomgrplem.2 $e |- S = { <. <. z , z >. , z >. } $.

The issue is of course with parsing { <. <. x , y >. , C >. } correctly. It turns out that mmj2 is already set up to automatically expand cv (there are special rules to normalize away type conversion (A -> B) and nulls-permitted (A -> "") productions), so the only real problem is in making sure that the composite rules class <. <. x , y >. , A >. and class <. <. x , y >. , z >. are created.

Mario

David A. Wheeler

Jan 6, 2016, 6:23:51 PM1/6/16
to metamath
On Wed, 6 Jan 2016 17:47:02 -0500, Mario Carneiro <di....@gmail.com> wrote:
> I have the results from the initial LR parse tree generation on the full
> set.mm syntax...

That's cool, thanks for sharing that.

If you're interested in parsing, ANTLR version 4
uses an interesting newer parsing approach called ALL(*), described here:
http://www.antlr.org/papers/allstar-techreport.pdf
ALL(*) is O(n^4) in theory, but in practice seems to be basically linear, and
seems to be far faster than other flexible approaches such as GLL and GLR
(with speed similar to hand-written recursive descent LL(1) parsers).

ANTLR 4 itself is available here:
http://www.antlr.org/
ANTLR is widely used for generating parsers (I happily used ANTLR 3
on a different project), and ALL(*) looks like a really promising advance in
making it easier to create reasonably efficient parsers with good
error detection and recovery.

I imagine this might be a fine way to create a Metamath-related parser (if ANTLR 4 supports
your target language).

--- David A. Wheeler

Mario Carneiro

Jan 6, 2016, 8:03:24 PM1/6/16
to metamath
I ran a one-step backtracking algorithm on the listed shift/reduce conflicts, to find out whether the proposed reduction can in fact be followed by the known lookahead. This reduces the list of conflicts to only one:


shift/reduce conflict: {class -> <. set , set >. *, class -> { <. <. set , set >. * , set >. | wff }} => "shift ," / "class -> <. set , set >."

which proves that the "shift if possible" algorithm only fails in the case { <. <. x , y >. , A >. }. To eliminate this last case using helper theorems will need more work, but this is the closest we have yet come to a rigorous computer proof of the unambiguity of set.mm's grammar.

Mario Carneiro

Jan 7, 2016, 2:51:41 AM1/7/16
to metamath
The LR parser is now (apparently) working, including the backtracking and production splitting techniques mentioned here. You can get it on the lr-parser branch on mmj2 github, although since it's not a very glamorous feature I'll let interested parties build it themselves.

There is a fairly heavy (~20s on my machine) startup cost, during which the parsing tables are compiled (compared to 0.5~5.5s with the standard EarleyParser). After this, though, parses are very fast - I was able to parse quartfull in ~0.001 s, compared to EarleyParser's 0.08 s. And of course, the hard work at the beginning is simply to construct a lookup table, which can easily be serialized so that it doesn't need to be generated on the fly as long as the grammar stays put. The final table is LR(1), as promised, although it includes a number of composite productions, represented as parse trees with VarHyp slots in them (and we already have a lot of support for substitutions into trees like this because that's how metamath works).
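Serializing the table could be as simple as pairing it with a hash of the grammar and regenerating only on a mismatch; a sketch of that idea (the file name and layout are invented here, not what mmj2 actually does):

```python
import hashlib
import json

def save_table(table, grammar_text, path="parse-table.json"):
    """Write the parse table alongside a hash of the grammar it was built from."""
    key = hashlib.sha256(grammar_text.encode()).hexdigest()
    with open(path, "w") as f:
        json.dump({"grammar_hash": key, "table": table}, f)

def load_table(grammar_text, path="parse-table.json"):
    """Return the cached table, or None if it is missing or stale."""
    try:
        with open(path) as f:
            data = json.load(f)
    except FileNotFoundError:
        return None
    key = hashlib.sha256(grammar_text.encode()).hexdigest()
    if data["grammar_hash"] != key:
        return None  # grammar changed: caller must regenerate
    return data["table"]
```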

To test it out, I added a dynamic loading of parsers, just add

SetParser,mmj.verify.LRParser

before the Parse,* line. The other options are mmj.verify.EarleyParser (the default) and mmj.verify.BottomUpParser (which pukes on set.mm). This of course also generalizes to ANTLRParser if we ever get there.


Mario

fl

Jan 7, 2016, 9:16:53 AM1/7/16
to Metamath
 
 
> There is a fairly heavy (~20s on my machine) startup cost, during which the parsing tables are compiled (compared to 0.5~5.5s with the standard EarleyParser).
 
A new achievement of Mario Carneiro: the only man in the world who manages to get a worse result with a deterministic parser than with a backtracking one.
 
--
FL

Mario Carneiro

Jan 7, 2016, 1:03:07 PM1/7/16
to metamath
On Thu, Jan 7, 2016 at 9:16 AM, fl <freder...@laposte.net> wrote:
 
 
>> There is a fairly heavy (~20s on my machine) startup cost, during which the parsing tables are compiled (compared to 0.5~5.5s with the standard EarleyParser).
>
> A new achievement of Mario Carneiro: the only man in the world who manages to get a worse result with a deterministic parser than with a backtracking one.

This is very much an expected result. It is much more difficult to explore the complete state space of the grammar than just to handle it adaptively based on the actual strings being parsed. The former approach has the benefit that you can make claims about the whole grammar such as "this grammar is unambiguous" while the try-as-you-go method only allows you to claim that none of the strings in the corpus are ambiguous. Also although it is nominally a linear-time algorithm, the generation is potentially exponential or worse in the size of the grammar.

The parse states are labeled by sets of grammar productions with a head marker in them, leading to a rough upper bound of 2^(kn) states where n is the number of productions and k is the length of a production. In practice it is much less, and I think the current version generates about 2000 states which is manageable, but it can potentially be much worse. Add on to this that the backtracking algorithm adds new productions to the grammar and I don't know if I can give any upper bound on the number of states. I think that it should be able to handle any unambiguous grammar, so since this is an undecidable problem the lack of an upper bound should not come as a surprise.

Finally, EarleyParser has been aggressively optimized to the point of unmaintainability, which has been a major factor in my not wanting to futz with the parsing code in the first place. The new parser is written with lots of java collections API and does not shy away from object creation (while O'Cat would seem to prefer lots of variable reuse and manual array management to decrease the amount of object overhead). It will be a long time before mmj2 stabilizes to the point that I can focus less on maintainable code and more on speed improvements. (If you say "but mmj2 is already stable", that only makes sense if there is someone on the team who knows the code well enough to stay the course. Eventually mmj2 will reach a new equilibrium where the code appears locally optimal to me, at least until someone else takes up the maintenance role.)

Mario

fl

Jan 8, 2016, 8:15:04 AM1/8/16
to Metamath
 
 
[long]
 
 
All those words to explain that it is in accordance with reason, morals and religion that we will have to wait a very long time before an editor starts.

It is congruent.

(And some people still expect the climate can be saved. They don't know that Mario Carneiro takes part in the debate.)
 
--
FL

Mario Carneiro

Jan 13, 2016, 9:57:58 PM1/13/16
to metamath
The latest version of mmj2 includes automatic saving and loading of settings, which was on O'Cat's "wishlist" before he disappeared. In particular, the LRParser will save its parse tables, so the startup is much quicker now (informally timed at around 3 seconds to parse set.mm). But since it is "questionable" as you say, EarleyParser will remain the default. As soon as I get to a good point with my current work (even more new features) I will finally be releasing a NEW OFFICIAL VERSION for the first time since Feb 2014 (it's about time to stop requiring people to build their own mmj2 releases from github).

Mario
