Handling rejection

Thomas Weigert

Feb 9, 2015, 7:21:37 PM

to marpa-...@googlegroups.com

No, this is not about relationship troubles.

I am struggling to work with rejection events. I am trying to deal with constructs like preprocessing statements or meaningful comments in programming languages. These (i) can go anywhere in the grammar and (ii) need to be propagated into the parse tree and (iii) may affect the parse itself and (iv) cannot be easily parsed with a grammar or an internal lexer.

My idea to parse such constructs was to create lexemes invoked by fake G1 productions which would be tried when the relevant text is encountered and would create a rejection event. I would then parse the text of these constructs in an external recognizer upon handling the rejection event and insert the proper text back into the input string and set the continuation of the parse to the start of the replacement text. If the replacement text is legal at the inserted point, parsing should continue just fine, thanks to the great infrastructure provided by Marpa.

However, things did not go as planned. Please look at the attached example for detail. In this example, I try to handle preprocessor statements (#ifdef).

I created a very simple grammar, and added these productions:

fakecpp ::= cpp
cpp ~ '#'

The fakecpp production is not actually reachable. However, given an input string such as
       abc\n#ifdef A\n=\n#else\n+\n#endif\n12
we get a rejection event when we hit the "#ifdef", and in the handler I thought I could clean it up:
            $pos = $pos + $len - $newlen + 1;
            substr($string, $pos, $newlen) = $cpp2;
($string is the original string, $pos is the current position, $len is the total length of the ifdef, $newlen is the length of the replacement text, and $cpp2 is the replacement text). I insert the replacement text at the end of the ifdef and set the position to before the replacement text. Now I hoped that upon resume the parser would get the replacement text and be happy.

No such luck. Please note that I got the following to work: Find out what lexeme was expected and read it with the external parser (lexeme_read), and proceed with the text after it.
                $pos = $pos + $len + 1;
                $recce->lexeme_read('OP', $pos, 1, '=');
But this approach only works because this grammar is so simple and I can easily deal with all cases of possible rejections by looking at the expected lexemes.

Note that if I put the "=" into the input string and try to continue parsing from before it, I get another rejection event at this very point. This is really strange because the grammar expects an OP, I give it an OP, but it cannot parse it.

Intuitively, there is something I must be doing wrong as it seems there should be a way of getting this to work.

Any suggestions would be greatly appreciated.

Thanks, Th.

simple.pl

Jeffrey Kegler

Feb 9, 2015, 8:40:38 PM

to Marpa Parser Mailing List

A quick response. [ I'm having Internet connectivity problems. ] The input string passed to read is copied into the bowels of Marpa in a digested form. On my first quick reading, it's looking like you are relying on changes to the original string being seen by Marpa. This does not happen.
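To illustrate the point about copying, here is a toy sketch in plain Perl. `ToyReader` and its methods are hypothetical stand-ins, not Marpa's API or its implementation; the sketch only shows why an edit made to the caller's string after the read is invisible to a reader that took its own copy.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a recognizer that copies its input when it reads it.
package ToyReader;

sub new { my ($class) = @_; return bless { text => q{} }, $class }

# Takes a reference to the input, as Marpa's read() does, but stores a copy.
sub read_input {
    my ( $self, $string_ref ) = @_;
    $self->{text} = ${$string_ref};
    return length $self->{text};
}

# Returns a slice of the stored copy, never of the caller's string.
sub literal {
    my ( $self, $start, $length ) = @_;
    return substr $self->{text}, $start, $length;
}

package main;

my $string = "abc\n#ifdef A\n=\n#else\n+\n#endif\n12";
my $reader = ToyReader->new;
$reader->read_input( \$string );

# Replace the whole 25-character #ifdef block in the caller's string.
substr( $string, 4, 25 ) = '=';

# The reader still sees the text it copied at read time.
my $seen_by_reader = $reader->literal( 4, 6 );
print "caller: $string\nreader: $seen_by_reader\n";
```

The caller's string shrinks to "abc\n=\n12", while the reader still finds "#ifdef" at offset 4.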

--
You received this message because you are subscribed to the Google Groups "marpa parser" group.
To unsubscribe from this group and stop receiving emails from it, send an email to marpa-parser...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Thomas Weigert

Feb 9, 2015, 8:57:47 PM

to marpa-...@googlegroups.com

Thank you for this tip. Indeed, this was an assumption that I was making. I got the idea from the R.pod:

"The resume() method uses a new input string to scan the physical input stream that was specified to the read() method. resume()'s input string may be specified explicitly or implicitly. By default, the new input string runs from the current location to the end of the physical input stream."

As one cannot give an input string parameter to resume, I assumed that if it is called, it works off the original input stream.

I saw some hints in the manual that it is possible to change the physical input stream that Marpa is looking at. If you have some pointers to where I can find that described in the documentation, that would be great. Or if there is documentation on how I can access the "bowels of Marpa", that would be lovely also.

Thanks, Th.

Thomas Weigert

Feb 9, 2015, 9:08:28 PM

to marpa-...@googlegroups.com

Maybe some more here:

In the R.pod it further says: "The virtual input stream is a series of input strings. An input string is a substring of the physical input stream. By default the virtual input stream consists of exactly one input string, one which begins at location 0 in the physical input stream and whose length is the length of the physical input stream."

This together with below and the description of resume() makes me think that when I call resume with offsets into the physical input stream (I believe that is the original input that is being parsed), then it creates a new input stream to be appended to the virtual input stream.

Ron Savage

Feb 9, 2015, 9:19:13 PM

to marpa-...@googlegroups.com

On Tuesday, 10 February 2015 12:57:47 UTC+11, Thomas Weigert wrote:

> Thank you for this tip. Indeed, this was an assumption that I was making. I got the idea from the R.pod:
>
> "The resume() method uses a new input string to scan the physical input stream that was specified to the read() method. resume()'s input string may be specified explicitly or implicitly. By default, the new input string runs from the current location to the end of the physical input stream."

I take that to be a doc bug. It can only mean that the $pos (and optional $length) passed to resume() tell Marpa where within the original input stream to resume (and how long a string to consider, i.e. how much more of the original input stream to consider).

There are 2 courses of action in these situations:

1: Since you know $pos, use your own - called external - scanner to process the tokens starting at the one which $start points to (presumably #), and when finished, set $pos to the end of this substring, and call resume($pos).

2: Off-the-top-of-my-head: Consider changing the rule 'word ::= W' to 'word ::= W | fakecpp', so Marpa finds the # without error, and extend the grammar to handle all things which follow a #. In this case you deliberately sidetrack Marpa to process #ifdef and then let it (with resume($pos) ) naturally return to process the next token, here operator.

Ron Savage

Feb 9, 2015, 9:23:18 PM

to marpa-...@googlegroups.com

On Tuesday, 10 February 2015 13:08:28 UTC+11, Thomas Weigert wrote:

> Maybe some more here:
>
> In the R.pod it further says: "The virtual input stream is a series of input strings. An input string is a substring of the physical input stream. By default the virtual input stream consists of exactly one input string, one which begins at location 0 in the physical input stream and whose length is the length of the physical input stream."
>
> This together with below and the description of resume() makes me think that when I call resume with offsets into the physical input stream (I believe that is the original input that is being parsed), then it creates a new input stream to be appended to the virtual input stream.

Fair enough. Just a misreading of awkward docs.

I read it as overly complicated text, since it's not obvious why 'virtual' was brought into the discussion. It seems the unstated intent was that these virtual strings are laid end-to-end over the real input string. So each one begins at wherever you tell Marpa $pos is.

Certainly I don't believe new strings are fabricated on-the-fly.

I'm tempted to log a request ticket re the docs, but I'll let Jeffrey choose the wording here.

Thomas Weigert

Feb 9, 2015, 9:35:31 PM

to marpa-...@googlegroups.com

Ron,

please see below....

I guess in the end it comes back to what Jeff said earlier. He implied that one cannot simply change the content of the input string.

I guess that maybe somehow the recognizing is happening in the C code and that the input is copied into C space and the position information is maintained between the string I gave Marpa and the actual physical string that sits in C space?

Th.

On Monday, February 9, 2015 at 8:19:13 PM UTC-6, Ron Savage wrote:

> There are 2 courses of action in these situations:
>
> 1: Since you know $pos, use your own - called external - scanner to process the tokens starting at the one which $start points to (presumably #), and when finished, set $pos to the end of this substring, and call resume($pos).

Right, but I need to inject some other text into the input string to be parsed afterwards.

> 2: Off-the-top-of-my-head: Consider changing the rule 'word ::= W' to 'word ::= W | fakecpp', so Marpa finds the # without error, and extend the grammar to handle all things which follow a #. In this case you deliberately sidetrack Marpa to process #ifdef and then let it (with resume($pos) ) naturally return to process the next token, here operator.

I think this strategy would be completely infeasible for a real grammar.

Ron Savage

Feb 9, 2015, 9:54:25 PM

to marpa-...@googlegroups.com

On Tuesday, 10 February 2015 13:35:31 UTC+11, Thomas Weigert wrote:

> Ron,
>
> please see below....
>
> I guess in the end it comes back to what Jeff said earlier. He implied that one cannot simply change the content of the input string.

No, you can't change it, but using the Ruby Slippers concept, you can tell Marpa to parse another part of the input string, in which you've secreted the Ruby Slippers. That's what happens in the sample code I mentioned in my email to you. It was MarpaX::Demo::SampleScripts and scripts/match.parentheses.02.pl. After that, you reset $pos back to the place you want Marpa to continue from.

> I guess that maybe somehow the recognizing is happening in the C code and that the input is copied into C space and the position information is maintained between the string I gave Marpa and the actual physical string that sits in C space?

Yep.

More below.

[snip]

> > There are 2 courses of action in these situations:
> >
> > 1: Since you know $pos, use your own - called external - scanner to process the tokens starting at the one which $start points to (presumably #), and when finished, set $pos to the end of this substring, and call resume($pos).
>
> Right, but I need to inject some other text into the input string to be parsed afterwards.
>
> > 2: Off-the-top-of-my-head: Consider changing the rule 'word ::= W' to 'word ::= W | fakecpp', so Marpa finds the # without error, and extend the grammar to handle all things which follow a #. In this case you deliberately sidetrack Marpa to process #ifdef and then let it (with resume($pos) ) naturally return to process the next token, here operator.
>
> I think this strategy would be completely infeasible for a real grammar.

Not at all. I do it :-).

Ron Savage

Feb 9, 2015, 9:59:15 PM

to marpa-...@googlegroups.com

> I think this strategy would be completely infeasible for a real grammar.

Perhaps you meant that for the grammar you have to contend with, but I still disagree. There are 2 courses of action here too (which is always better than having 1 or none):

1: Extend the original grammar so Marpa parses everything with 1 grammar

2: Have a second recognizer on stand-by, with its own grammar for the strings which trigger a rejection event in your original code.

I've done both, which is why I seriously recommend them to you.

For an example of the latter, see GraphViz2::Marpa, and search for the HTML-specific grammar. It's the second - external - grammar of the 2. Search GraphViz2/Marpa.pm for the string 'bnf4html'.

Thomas Weigert

Feb 9, 2015, 10:24:56 PM

to marpa-...@googlegroups.com

Thanks, Ron.

I see how the Ruby Slippers can work if you know ahead of time what the text will be that we need to insert into the input, so we can "secret it away" before calling read. I guess I could get this to work by prescanning the input for all the text strings, then putting them all into "Ruby Slippers Shoeboxes" at the end of the input string and when hitting one of these tokens doing the Ruby Slipper thing, just like you did in match.parentheses.02.pl
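A rough sketch of that prescanning idea in plain Perl follows. It assumes un-nested two-branch #ifdef/#else/#endif blocks only, and `%defined`, `%shoebox` and `$padded` are hypothetical names used for illustration; nothing here is Marpa API. The scan appends each live branch to the end of a padded copy of the input and records where it was stored.

```perl
use strict;
use warnings;

my $input   = "abc\n#ifdef A\n=\n#else\n+\n#endif\n12";
my %defined = ( A => 1 );    # hypothetical set of defined macros

my $source_length = length $input;    # real source ends here
my $padded        = $input;           # copy that receives the "shoeboxes"
my %shoebox;    # offset of '#ifdef' => [ offset, length ] of stored text

# Scan the untouched $input; append only to $padded so pos() is not disturbed.
while ( $input
    =~ m{^ \#ifdef [ ]+ (\w+) [ ]* \n (.*?) \n \#else [ ]* \n (.*?) \n \#endif}gmxs )
{
    my $replacement = $defined{$1} ? $2 : $3;
    $shoebox{ $-[0] } = [ length($padded), length($replacement) ];
    $padded .= $replacement;
}

for my $offset ( sort { $a <=> $b } keys %shoebox ) {
    my ( $start, $length ) = @{ $shoebox{$offset} };
    print "ifdef at $offset: replacement stored at $start, length $length\n";
}
```

The padded string would be the one handed to the parser, with the main pass limited to the first `$source_length` characters so the shoeboxes are only visited on demand.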

Ruslan Shvedov

Feb 9, 2015, 10:26:27 PM

to marpa-...@googlegroups.com

I'd suggest 2 code examples: one for parsing preprocessor statements [1] and another for parsing a length-prefixed format with prediction and completion events [2], showing how to use lexeme_read() with pos() and resume().

Hope this helps.

[1] https://gist.github.com/rns/3b2f48477fc23d0ab0f7

[2] https://gist.github.com/rns/ba250ed6a5ed1c82ce7b

Ron Savage

Feb 9, 2015, 11:49:05 PM

to marpa-...@googlegroups.com

On Tuesday, 10 February 2015 14:24:56 UTC+11, Thomas Weigert wrote:

> Thanks, Ron.
>
> I see how the Ruby Slippers can work if you know ahead of time what the text will be that we need to insert into the input, so we can "secret it away" before calling read. I guess I could get this to work by prescanning the input for all the text strings, then putting them all into "Ruby Slippers Shoeboxes" at the end of the input string and when hitting one of these tokens doing the Ruby Slipper thing, just like you did in match.parentheses.02.pl

But that prescanning is just what we want Marpa to do! Hmmm. The 4 points in your first post are incredibly restrictive. I suggest you revisit that topic. For example, if they are all prefixed with #, then life is simple(r). If they all end with \n, then that helps too. I still think the only way to handle a lot of flexibility in such introns (to use DNA-style terminology) is to accept that you'll have to write the grammar for all such cases. We can help you simplify it, but ATM it seems far too vague (in the nicest possible way, of course :-) to work with.

Jeffrey Kegler

Feb 9, 2015, 11:59:17 PM

to Marpa Parser Mailing List

Still having Internet troubles.

Two notes: First, the text associated with a lexeme does *NOT* have to be its value, if you use the lexeme_read() method to read it. That's how you can do the Ruby Slippers, for example.

Second, the text is fixed, but you can change how you move around in it -- you don't have to read it in order. That allows some of the same effects of changing the text, although not all.

A major reason that the text is fixed, is for error messages. An error message has to report an error as occurring *somewhere* in the input stream. But if you provide your own lexeme values, error messages will be the only place where the lexeme-text association makes a difference.
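The "move around in fixed text" idea can be pictured with a small plain-Perl sketch. The window list is an assumption chosen for the example input from the first post; in Marpa the corresponding start/length pairs would be given to calls such as resume() rather than joined by hand.

```perl
use strict;
use warnings;

# The physical input is never modified.
my $physical = "abc\n#ifdef A\n=\n#else\n+\n#endif\n12";

# Each window is [ start, length ] into the physical input:
# the text before the #ifdef, the live branch, and the text after #endif.
my @windows = ( [ 0, 4 ], [ 13, 1 ], [ 29, 3 ] );

# Reading the windows in order yields the text the parser effectively sees.
my $virtual = join q{}, map { substr $physical, $_->[0], $_->[1] } @windows;

print $virtual, "\n";
```

Reading those three windows in order gives the same text as the in-place replacement was meant to produce, without touching the original string.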


Thomas Weigert

Feb 10, 2015, 12:21:37 AM

to marpa-...@googlegroups.com

Thank you all for helping.

Putting Jeff and Ron's insights together I was able to come up with a simple solution. I know exactly where in the input string my replacement text is. So I need not inject anything, I can just go there and recognize that text only as my ruby slippers, and then continue at the end of the skipped text.

The insight here provided by Jeff is that it does not matter where in the string the ruby slippers are sitting. As long as I can find them, I can use them.
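The final script is not shown in the thread, so the following is only a sketch of how the offsets for that approach might be found. `locate_branch` is a hypothetical helper, and it assumes an un-nested #ifdef/#else/#endif block starting exactly at the given offset.

```perl
use strict;
use warnings;

# Hypothetical helper: given the input, the offset of a '#ifdef', and a hash
# of defined macros, return the offset and length of the live branch plus
# the first offset past '#endif'. Returns an empty list if nothing matches.
sub locate_branch {
    my ( $string, $pos, $defined ) = @_;
    pos($string) = $pos;    # \G anchors the match at this offset
    $string
        =~ m{\G \#ifdef [ ]+ (\w+) [ ]* \n (.*?) \n \#else [ ]* \n (.*?) \n \#endif}gcxs
        or return;
    my $group  = $defined->{$1} ? 2 : 3;    # capture group holding the live text
    my $start  = $-[$group];
    my $length = $+[$group] - $-[$group];
    my $after  = $+[0];                     # first offset past '#endif'
    return ( $start, $length, $after );
}

my $input = "abc\n#ifdef A\n=\n#else\n+\n#endif\n12";
my ( $start, $length, $after ) = locate_branch( $input, 4, { A => 1 } );
print "live branch at $start (length $length), continue at $after\n";
```

With these offsets, the rejection handler could read the live branch where it already sits, for example with `$recce->lexeme_read('OP', $start, $length, substr($input, $start, $length))`, and then continue with `$recce->resume($after)`. The symbol name 'OP' is taken from the first post; a real grammar would need to pick the lexeme that is expected at that point.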

Thanks for the kind guidance.

Th.
