How to "trap" unacceptable lines.


clueless newbie

Feb 21, 2022, 7:36:41 AM2/21/22
to marpa parser
Marpa brings back the feeling of being a child listening to my father lecture a graduate class on partial differential equations: I could see the x's and the y's, but how things worked was far beyond my comprehension. I'm sure my head would be a lot less sore, and Jeffrey richer, if instead of bouncing my head against the wall I were to donate another dollar, but I am hardly the Duke of Brunswick-Lüneburg.
(Maybe I should just say "I'm just too stupid!", give up, and see if I can successfully twiddle my thumbs.)

 
The data consists of (physical) lines terminated by a newline. A line may be:
 1) <name> = <boolean>
 2) '/'<regexp>'/' = <boolean>
Comments begin with '--' and run to the end of the line.
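A hypothetical input file in this format (names and values invented purely for illustration) might look like:

```
-- sample configuration (illustrative only)
foo_enabled = TRUE
"quoted name" = F        -- quoted identifier form
/^(abc|def)$/ = 0        -- regexp form
```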

Shouldn't I be able to say that anything else is an error? I.e.:

    :default ::= action => [values]
    lexeme default = latm => 1
    :start ::= lines
    lines ::= line+
    line  ::= <name> ('=') <boolean> (NEWLINE) action => doName
            | ('/') <regexp> ('/') ('=') <boolean> (NEWLINE) action => doRegexp
    # would like the following to catch everything else
           || <bad stuff> (NEWLINE) action => doError rank => -1
    #

    <name>    ~ <unquoted name> | <quoted name>
    <unquoted name> ~ ALPHA  | ALPHA ALPHANUMERICS
    <quoted name> ~ '"' <quoted name body> '"'
    <quoted name body> ~ [\w]+            # for now

    <regexp> ~ [$(|)\w^]+

    #
    <bad stuff> ~ ANYTHING+
    #

    <boolean>   ~ TRUE | FALSE
    FALSE       ~ 'FALSE':i | 'F':i | '0'
    TRUE        ~ 'TRUE':i | 'T':i |'1'

    ALPHA         ~ [a-z]:i
    ALPHANUMERICS ~ [\w]*

    :discard       ~ COMMENT
    COMMENT        ~ '--' <comment body>
    <comment body> ~ ANYTHING*

    ANYTHING       ~ [^\x{A}\x{B}\x{C}\x{D}\x{2028}\x{2029}]
    :discard       ~ WHITESPACE
    WHITESPACE     ~ [ \t]+

    NEWLINE        ~ [\n]

CAVEAT: <name> is going to be an Oracle identifier and they are weird!

F. Li

Feb 21, 2022, 10:08:44 AM2/21/22
to marpa-...@googlegroups.com
I could of course feed the parser one line at a time, but what I would like to achieve is: "I, Marpa, am parsing a line that doesn't conform to any of the previous alternatives, so I'm calling the sub you designated for this piece of junk."

Jeffrey Kegler

Feb 21, 2022, 12:00:07 PM2/21/22
to marpa-...@googlegroups.com
One matter which requires getting used to with Marpa is that you are working with BNF, so the core logic is non-procedural.  This is why most programmers seem to want to suffer endlessly with recursive descent rather than consider stronger parsers.  You can understand recursive descent with purely procedural thinking.

The idea of "do this on error" is procedural thinking.  Procedural stuff can be added to Marpa via events, but the programmer needs to bear in mind the engine is being driven descriptively, not procedurally.

One solution to your problem might be rule ranking.  See here, here and here.  Rule ranking turns Marpa into a better PEG.

The docs I linked are a bit daunting at first glance, especially if you don't skim past the more technical parts.  The basic idea in your case might be to define a "catch all" line as an error case, ranking it below the non-error alternatives.

People working with ranking can find it tricky because ranking is only applied in very specific circumstances -- the alternatives must be at the same dotted position of a parent rule (which implies they will have the same LHS), have the same start position, and have the same end position.  If any of these three conditions fails, ranking will not be done.  This means, for example, that you can't use rule ranking to choose between alternatives that might differ in length.  But this seems to be OK in your case: all line alternatives, error or non-error, should start at the same position and end at the same position.

What I think might work is to give the error lines a lower rank than the non-error lines.  Then they will be seen only if there is no non-error parse of the line.
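That idea might look roughly like this in SLIF (an untested sketch; the rule and lexeme names are illustrative, not taken from the attached script):

```
# Untested sketch.  All alternatives of <line> share an LHS and
# span one full input line, so rule ranking can choose among them;
# the error alternative gets the lowest rank.
line        ::= <name> ('=') <boolean> (NEWLINE)                rank => 1 action => doName
              | ('/') <regexp> ('/') ('=') <boolean> (NEWLINE)  rank => 1 action => doRegexp
              | <junk line> (NEWLINE)                           rank => 0 action => doError
<junk line> ::= JUNKCHAR+
```

Note that for the ranks to take effect the recognizer must be created with a ranking method, e.g. ranking_method => 'high_rule_only'.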

The docs contain examples, and I hope looking at these will help make things clear.

I hope this helps,

jeffrey

--
You received this message because you are subscribed to the Google Groups "marpa parser" group.
To unsubscribe from this group and stop receiving emails from it, send an email to marpa-parser...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/marpa-parser/d07a1e10-3244-4899-b73f-ba7deb0369e7n%40googlegroups.com.

F. Li

Feb 21, 2022, 2:01:59 PM2/21/22
to marpa-...@googlegroups.com
Thank you for your response.

I tried stripping the example in your third link down to a bare minimum, ending up with the attached, but the test strings get caught as "JUNK". (The first, 'a = 1', is acceptable.) Obviously I'm missing something!


ranking_01.t

Jeffrey Kegler

Feb 21, 2022, 2:22:01 PM2/21/22
to marpa-...@googlegroups.com
I will download this and look it over.


Jeffrey Kegler

Feb 21, 2022, 7:34:50 PM2/21/22
to marpa-...@googlegroups.com
I have attached a reworked version, which I have tested.

A few comments:

1.) In your version you had a long JUNK lexeme which slurped up the entire line.  Marpa uses LATM lexing (Longest Acceptable Token Matching).  JUNK would usually be longest, meaning the parser would usually see nothing else and report almost every line as junk.  I changed it so that "junk" characters are lexed one at a time, meaning that they will at best tie other lexemes.

2.) Your JUNK lexeme would accept spaces, which you were also discarding.  Having spaces both be discarded and be part of other lexemes is possible, but a very tricky technique -- best to go with one or the other.  I set things up so that spaces are not junk characters and discard handles them.  For your full example, you'll have to deal with end-of-lines and should think out carefully what gets handled where.
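Those two fixes might be sketched in SLIF like this (untested, with illustrative names; the attached fli.t is the authoritative version):

```
# Untested sketch of points 1.) and 2.): junk is lexed one
# character at a time, so a junk character can at best tie other
# lexemes under LATM instead of swallowing the whole line, and
# spaces/tabs are excluded from junk and left to :discard.
<junk line> ::= JUNKCHAR+
JUNKCHAR      ~ [^ \t\n]     # a single character; no space, tab, or newline
:discard      ~ WHITESPACE
WHITESPACE    ~ [ \t]+
```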

3.) Very useful, even when the problem is not with the lexing, is the trace_terminals recognizer option.  You may want to try it with both your original and my reworked script, and note what is going on.  A value of 99 turns on everything and for a small case like this is not too verbose.
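Turning the tracing on is a single recognizer option; a fragment along these lines (assuming a Marpa::R2 Scanless grammar already built in $grammar and the input text in $input):

```
use Marpa::R2;    # CPAN module, not in core Perl

# trace_terminals is a Scanless::R named argument;
# 99 enables the most verbose lexer tracing.
my $recce = Marpa::R2::Scanless::R->new(
    {   grammar         => $grammar,
        trace_terminals => 99,
    }
);
$recce->read( \$input );
```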

I had to reread my own doc for this example to refresh myself on how ranking worked.  My description of it a couple of emails ago was (to be honest) somewhat confusing.  On the good side, it's a much more useful technique than I remembered, and I enjoyed working this up.

Thanks, jeffrey
fli.t