Preferring a shorter token match

Dean S

unread,

Dec 2, 2020, 3:27:46 PM12/2/20

to marpa parser

Hello, I'm having trouble figuring out how to express my grammar and was hoping someone could help. I've tried rewriting various ways and looking for some options that might change behavior, but I haven't been able to figure it out.

I have a language with variable assignment and simple commands and doesn't care about whitespace. So,

PAx=42 # variable assignment to "PAx"

PAx # PA command with argument x

I have a grammar, but it insists on spaces after command names. I've tried hiding assignment behind a prioritized rule and tried setting the command lexeme priority, but I always get parse errors when parsing "PAx". I have a simplified grammer which exhibits the issue,

#!/usr/bin/perl
use warnings; use strict; use 5.028;

use Marpa::R2 8.000000;
use Data::Dumper;

my $grammar = 'Marpa::R2::Scanless::G'->new({ source => \(<<'RULES') });
:default ::= action => [name,values]
lexeme default = latm => 1

:start     ::= Program

Program    ::= Statement+

Statement ::= Command terminator
            || Assign terminator

Command    ::= command arg

Assign     ::= variable equal value

# Doesn't help: :lexeme ~ command priority => 2
command      ~ 'PA' | 'PR'
arg          ~ [\w]+
equal        ~ '='
value        ~ [0-9]+
variable     ~ [\w]+
terminator   ~ [;\n]

:discard     ~ whitespace
whitespace   ~ [ \t]+
RULES

# This parses correctly, line 2 is a command, line 3 is assignment.
my $ok = <<TEXT;
x=23
PA x
PAx=42
TEXT

say Dumper($grammar->parse(\$ok));

# I want to generate same tree as above,

# but my grammar wants line 2 to be an assignment.
my $bad = <<TEXT;
x=23
PAx

PAx=42

TEXT

# Error in SLIF parse: No lexeme found at line 2, column 4

say Dumper($grammar->parse(\$bad));

Is there some trick to this? Did I miss someting in the documentation?

Any suggestions?

Thanks!

- Dean

Jeffrey Kegler

unread,

Dec 2, 2020, 4:29:00 PM12/2/20

to Marpa Parser Mailing LIst

I'll first describe your immediate problem, then ask a couple Q's.

The problem: Lexing is LATM -- *Longest* Acceptable Token Matching. The lexeme priority is a tie breaker, used when tokens are the same length. When your grammar fails, "PAx" is your longest token, and the only choice at length 3. "PA" is only 2 chars long, and lexemes of different length are not compared for priority.

(Btw the reason for this is, as implemented, lexeme priorities can be (and are) tested in a few machine instructions. If Marpa needed to look at earlier possibilities, the logic gets vastly very complex, efficiency goes out the window, and you get into the territory when the grammar can often be handled in easier faster ways.)

Now the questions:

1.) I notice statements cannot be multiline. Is that the intent going forward?

2.) In the example, commands always begin with a capital letter, variables never do. Will that continue to be the case? (If so, it points to an easy, fast solution.)

Possible solutions, depending, include finding something that distinguishes commands from variables in the lexer; custom lexers; using events to guide custom lexing; and character-by-character lexing, whereby you handle your own whitespace.

--
You received this message because you are subscribed to the Google Groups "marpa parser" group.
To unsubscribe from this group and stop receiving emails from it, send an email to marpa-parser...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/marpa-parser/32966712-7756-4590-8b13-c9b2decbc3e4n%40googlegroups.com.

Dean S

unread,

Dec 2, 2020, 5:42:41 PM12/2/20

to marpa-...@googlegroups.com

Thank you very much for your response!

1) Multiline: The language does have a "_" line-joining character, but the grammar wouldn't have to support that - it could be done with a trivial preprocessor. Once joined, commands may not span multiple lines.

2) Command/variable upper-case: Commands are always upper case, but there are no case restrictions on variables.

So it sounds, however, like there isn't a straight-forward grammar or option tweak. That's ok. The language has fancy expressions (algebraic expressions, function calls, strings, comments, and arrays), but its statement structure is extremely simplistic. The terminators (newline and semi-colon) are not allowed anywhere except as terminators (no escapes, not in strings, not in comments). So, as a practical solution, I should be able to dumb-split a program on terminators, look at the first characters of each statement, strip off the command or variable assignment part and parse the rest as an expression - which follows more reasonable rules that the LATM will like.

So, I guess this falls to the "handled in easier faster ways" approach which I guess should have been obvious but I failed to think of.

Thank you for your time, and a great library!

- Dean

Jeffrey Kegler

unread,

Dec 2, 2020, 5:55:24 PM12/2/20

to Marpa Parser Mailing LIst

How many commands? One approach that I just thought of and have never tested is to have fixed length variables, prioritized versus the commands of that length.

Off the top of my head:

var2 ~ <varchar><varchar> priority=>1

PA_command ~ 'PA' priority=>2

PR_command ~ 'PR' priority=>2

include a catch-all var for lengths not specifically accounted for

var_catch ~ <varchar>+ priority=>0

<var_catch> will always lose to lexemes of the same length, but will catch those variables whose length is not the same as any command.

A little hack-ish, but should be very fast, and perhaps no more hack-ish than alternatives.

Again, not tested, so you'll be a pioneer! If you try it, let me know!

Hope this helps, jeffrey

--
You received this message because you are subscribed to the Google Groups "marpa parser" group.
To unsubscribe from this group and stop receiving emails from it, send an email to marpa-parser...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/marpa-parser/270ecda3-6917-f717-593b-051ded20629d%40gmail.com.

Jeffrey Kegler

unread,

Dec 2, 2020, 6:41:35 PM12/2/20

to marpa parser

Oops! Ignore that last! It suffers from the same problem -- longest captures the match and all the rest of that hack-ish apparatus I talked about would be ignored.

I've been doing mainly LaTeX these days and obviously it is turning my brain into mush.

Sorry!

Dean S

unread,

Dec 2, 2020, 7:05:37 PM12/2/20

to marpa-...@googlegroups.com

No problem, thanks for trying. I did plug it in and indeed got the same results.

Jeffrey Kegler

unread,

Dec 2, 2020, 7:32:10 PM12/2/20

to marpa parser

One technique that is documented and tested and I believe should work is the Ruby Slippers. See the docs on this, but the basic idea would be, in the default lexer look only for commands. If the parse stops for lack of a lexeme, give it an LHS, and resume the parse. Otherwise, or if it also rejects LHS token, pass the failure on to the user.

Dean S

unread,

Dec 3, 2020, 11:34:21 AM12/3/20

to marpa-...@googlegroups.com

That is looking like it will work great. I set "variable ~ '\0\0\0\0'" so that it will never match but can be an expected terminal. Then, on a rejection event, feed a variable if I can find one.

my $recce = Marpa::R2::Scanless::R->new({ grammar => $grammar, rejection => 'event' });

for ( $recce->read(\$input); $recce->pos < length $input; $recce->resume ) {
if (grep 'variable' eq $_, @{ $recce->terminals_expected }) {
pos($input) = $recce->pos;
if ($input =~ /\G(\w+)/) {
$recce->lexeme_read('variable', $recce->pos, length($1), $1);
} else {
PARSE_ERROR($input, $recce->pos, $recce->terminals_expected);
}
} else {
say $recce->show_progress;
PARSE_ERROR($input, $recce->pos, $recce->terminals_expected);
}
}

Thanks!
- Dean

Jeffrey Kegler

unread,

Dec 3, 2020, 11:47:25 AM12/3/20

to Marpa Parser Mailing LIst

Glad it worked out. The Ruby Slippers tends to be the most fun way to solve issues. :-)

Btw, I call lexemes which should never be found, "unicorns", and usually declare them like this:

unicorn ~ [^\d\D]

The pattern says to find a character which is not a decimal digit, but also not not a decimal digit. Clearly never gonna match.

--
You received this message because you are subscribed to the Google Groups "marpa parser" group.
To unsubscribe from this group and stop receiving emails from it, send an email to marpa-parser...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/marpa-parser/c9c3d95a-e6ab-b41c-c514-26a6121b0d88%40gmail.com.

Reply all

Reply to author

Forward