Scanless lexer doesn't try shorter lexeme

Ruslan Zakirov

Jan 6, 2014, 4:47:18 PM

to marpa-parser

Hi,

After a long period of ignoring my pet projects, I'm attempting to convert one to the scanless interface. I'm stuck and don't understand where to go next.

Full script and full output: https://gist.github.com/ruz/8290356

Here is the error I get:

--
Best regards, Ruslan.

Ruslan Zakirov

Jan 6, 2014, 4:54:43 PM

to marpa-parser

Sorry, fat fingers. Continues at the end...

Error in SLIF parse: No lexemes accepted at line 3, column 1
Lexer "L0" rejected 1 lexeme(s)
Rejected lexeme #1: text; value="UID:urn:uuid:4fbe8971-0bc3-424c-9c26-36c3e1eff6b1"; length = 49

Progress by that time:

P4 @4-4 L2c12 content_name -> . name
P5 @4-4 L2c12 content_name -> . group '.' name

name and group are rules in G0/L0:

G0 R6 name ::= A_D_D
...
G0 R34 A_D_D ::= [A-Za-z0-9-] +
...
G0 R85 :start_lex ::= name

Trace of the terminals:

Lexer "L0" accepted lexeme L2c12: CRLF; value="
"
Lexer "L0" rejected lexeme L3c1-49: text; value="UID:urn:uuid:4fbe8971-0bc3-424c-9c26-36c3e1eff6b1"

Here are my questions:

1) Why doesn't the lexer try name with value="UID"?

2) Should I disambiguate everything at the lexer level?

I tried the most recent dev release, with the same result.

--
Best regards, Ruslan.


Ron Savage

Jan 6, 2014, 5:01:50 PM

to marpa-...@googlegroups.com

The code hangs (or seems to) for me. I'll keep playing with it.

Ron Savage

Jan 6, 2014, 5:05:00 PM

to marpa-...@googlegroups.com

(1) In this line:

QSAFE_CHAR ~ [!\x23-\x7E] | WSP | NON_ASCII

is the '!' meant to be a literal '!' or a negated set, in which case '^' is used?

(2) In this line:

| [\xF0] [\x90-\xBF][\x80-\xBF][\x80-\xBF]

is that '\x90' meant to be '\x80' like the lines above and below it, or really '\x90'?

Ron Savage

Jan 6, 2014, 5:17:41 PM

to marpa-...@googlegroups.com

My typo caused this issue.

Ruslan Zakirov

Jan 6, 2014, 5:19:45 PM

to marpa-parser

On Tue, Jan 7, 2014 at 2:05 AM, Ron Savage <r...@savage.net.au> wrote:

> (1) In this line:
> QSAFE_CHAR ~ [!\x23-\x7E] | WSP | NON_ASCII
> is the '!' meant to be a literal '!' or a negated set, in which case '^' is used?

A literal '!'.

> (2) In this line:
> | [\xF0] [\x90-\xBF][\x80-\xBF][\x80-\xBF]
> is that '\x90' meant to be '\x80' like the lines above and below it, or really '\x90'?

I wrote this part a long time ago, but as far as I recall it should be \x90, to skip some invalid UTF-8 sequences.
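
Both answers can be checked outside Marpa. The sketch below uses Python's re module and its strict UTF-8 decoder as stand-ins; that they agree with Marpa's character classes on these two points is an assumption, and the helper names are made up for illustration. It shows that a leading '!' in a bracketed class is an ordinary character, and that after a 0xF0 lead byte a second byte below 0x90 gives an overlong encoding, which a strict decoder rejects.

```python
import re

# A leading '!' inside a bracketed class is an ordinary character;
# only a leading '^' would negate the class.
qsafe_ascii = re.compile(r"[!\x23-\x7E]")


def in_class(ch: str) -> bool:
    """True if the single character ch belongs to the class above."""
    return qsafe_ascii.fullmatch(ch) is not None


def decodes_as_utf8(raw: bytes) -> bool:
    """True if raw is well-formed UTF-8 according to Python's strict decoder."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(in_class("!"))   # '!' (0x21) is listed explicitly
print(in_class('"'))   # '"' (0x22) sits between '!' and 0x23, so it is left out
print(decodes_as_utf8(b"\xF0\x90\x80\x80"))  # smallest 4-byte sequence, U+10000
print(decodes_as_utf8(b"\xF0\x8F\xBF\xBF"))  # overlong form of U+FFFF
```

So the class is the printable ASCII range minus the double quote, and starting the second byte at \x90 keeps overlong four-byte forms out of NON_ASCII.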

--
You received this message because you are subscribed to the Google Groups "marpa parser" group.
To unsubscribe from this group and stop receiving emails from it, send an email to marpa-parser...@googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.

--
Best regards, Ruslan.

Ron Savage

Jan 6, 2014, 5:31:39 PM

to marpa-...@googlegroups.com

OK

Ron Savage

Jan 6, 2014, 5:57:36 PM

to marpa-...@googlegroups.com

I made some small changes:

ron@zigzag:~/Documents/repos/marpa.papers$ diff ~/bin/vcard.parser.orig.pl ~/bin/vcard.parser.pl
0a1,2
> #!/usr/bin/env perl
>
7c9
< my $syntax = <<'END';
---
> my $syntax = <<'EOS';
15,17c17,19
< group ~ A_D_D
< name ~ A_D_D
< params ::= ';' param_list | empty
---
> group ::= A_D_D
> name ::= A_D_D
> params ::= SEMICOLON param_list | empty
21c23
< any_param_name ~ A_D_D
---
> any_param_name ::= A_D_D
86c88
< END
---
> EOS
89c91
< say "rules L0:\n", $grammar->show_rules(1, 'G0');
---
> #say "rules L0:\n", $grammar->show_rules(1, 'G0');

and I get:

ron@zigzag:~/Documents/repos/marpa.papers$ ~/bin/vcard.parser.pl
Setting trace_terminals option
Lexer "L0" rejected lexeme L1c1-11: text; value="BEGIN:VCARD"
Lexer "L0" accepted lexeme L1c1-11: 'BEGIN:VCARD'; value="BEGIN:VCARD"
Lexer "L0" accepted lexeme L1c12: CRLF; value="
"
Lexer "L0" rejected lexeme L2c1-11: text; value="VERSION:4.0"
Lexer "L0" accepted lexeme L2c1-11: 'VERSION:4.0'; value="VERSION:4.0"
Lexer "L0" accepted lexeme L2c12: CRLF; value="
"
Lexer "L0" rejected lexeme L3c1-49: text; value="UID:urn:uuid:4fbe8971-0bc3-424c-9c26-36c3e1eff6b1"

progress:
P0 @0-0 L1c1 vCards -> . vCard +
P1 @0-0 L1c1 vCard -> . 'BEGIN:VCARD' CRLF 'VERSION:4.0' CRLF content 'END:VCARD'
P36 @0-0 L1c1 :start -> . vCards
R1:1 @0-1 L1c1-11 vCard -> 'BEGIN:VCARD' . CRLF 'VERSION:4.0' CRLF content 'END:VCARD'
R1:2 @0-2 L1c1-12 vCard -> 'BEGIN:VCARD' CRLF . 'VERSION:4.0' CRLF content 'END:VCARD'
R1:3 @0-3 L1c1-L2c11 vCard -> 'BEGIN:VCARD' CRLF 'VERSION:4.0' . CRLF content 'END:VCARD'
R1:4 @0-4 L1c1-L2c12 vCard -> 'BEGIN:VCARD' CRLF 'VERSION:4.0' CRLF . content 'END:VCARD'
P2 @4-4 L2c12 content -> . content_line +
P3 @4-4 L2c12 content_line -> . content_name params ':' value CRLF
P4 @4-4 L2c12 content_name -> . name
P5 @4-4 L2c12 content_name -> . group '.' name
P6 @4-4 L2c12 group -> . A_D_D
P7 @4-4 L2c12 name -> . A_D_D

Error in SLIF parse: No lexemes accepted at line 3, column 1
Lexer "L0" rejected 1 lexeme(s)
Rejected lexeme #1: text; value="UID:urn:uuid:4fbe8971-0bc3-424c-9c26-36c3e1eff6b1"; length = 49
* String before error: BEGIN:VCARD\nVERSION:4.0\n
* The error was at line 3, column 1, and at character 0x0055 'U', ...
* here: UID:urn:uuid:4fbe8971-0bc3-424c-9c26-36c3e1eff6b1\n
Marpa::R2 exception at /home/ron/bin/vcard.parser.pl line 96.

So it is trying A_D_D.

Ruslan Zakirov

Jan 6, 2014, 6:08:39 PM

to marpa-parser

Hi,

A shorter script that demonstrates the problem: https://gist.github.com/ruz/8291475

Comments below:

You can see here that the lexer rejected the text rule but accepted the literal rule of the same length.

> Lexer "L0" accepted lexeme L1c12: CRLF; value="
> "
> Lexer "L0" rejected lexeme L2c1-11: text; value="VERSION:4.0"
> Lexer "L0" accepted lexeme L2c1-11: 'VERSION:4.0'; value="VERSION:4.0"

Once again.

> Lexer "L0" accepted lexeme L2c12: CRLF; value="
> "
> Lexer "L0" rejected lexeme L3c1-49: text; value="UID:urn:uuid:4fbe8971-0bc3-424c-9c26-36c3e1eff6b1"

Here the lexer went for the longer match and never tried A_D_D; value="UID".

> progress:
> P0 @0-0 L1c1 vCards -> . vCard +
> P1 @0-0 L1c1 vCard -> . 'BEGIN:VCARD' CRLF 'VERSION:4.0' CRLF content 'END:VCARD'
> P36 @0-0 L1c1 :start -> . vCards
> R1:1 @0-1 L1c1-11 vCard -> 'BEGIN:VCARD' . CRLF 'VERSION:4.0' CRLF content 'END:VCARD'
> R1:2 @0-2 L1c1-12 vCard -> 'BEGIN:VCARD' CRLF . 'VERSION:4.0' CRLF content 'END:VCARD'
> R1:3 @0-3 L1c1-L2c11 vCard -> 'BEGIN:VCARD' CRLF 'VERSION:4.0' . CRLF content 'END:VCARD'
> R1:4 @0-4 L1c1-L2c12 vCard -> 'BEGIN:VCARD' CRLF 'VERSION:4.0' CRLF . content 'END:VCARD'
> P2 @4-4 L2c12 content -> . content_line +
> P3 @4-4 L2c12 content_line -> . content_name params ':' value CRLF
> P4 @4-4 L2c12 content_name -> . name
> P5 @4-4 L2c12 content_name -> . group '.' name
> P6 @4-4 L2c12 group -> . A_D_D
> P7 @4-4 L2c12 name -> . A_D_D
>
> Error in SLIF parse: No lexemes accepted at line 3, column 1
> Lexer "L0" rejected 1 lexeme(s)
> Rejected lexeme #1: text; value="UID:urn:uuid:4fbe8971-0bc3-424c-9c26-36c3e1eff6b1"; length = 49
> * String before error: BEGIN:VCARD\nVERSION:4.0\n
> * The error was at line 3, column 1, and at character 0x0055 'U', ...
> * here: UID:urn:uuid:4fbe8971-0bc3-424c-9c26-36c3e1eff6b1\n
> Marpa::R2 exception at /home/ron/bin/vcard.parser.pl line 96.
>
> So it is trying A_D_D.

Sure. The recognizer waits for A_D_D, but the lexer never offers it.


--
Best regards, Ruslan.

Ron Savage

Jan 6, 2014, 6:11:32 PM

to marpa-...@googlegroups.com

I assume the problem is that you're using ':' for 2 purposes:

content_line ::= content_name params ':' value CRLF

and

TEXT_CHAR ~

[\\] [\\n,;:]

Ruslan Zakirov

Jan 6, 2014, 6:32:59 PM

to marpa-parser

On Tue, Jan 7, 2014 at 3:11 AM, Ron Savage <r...@savage.net.au> wrote:

> I assume the problem is that you're using ':' for 2 purposes:

Why is this a problem? It is not so rare for the separator between name and value to also appear in the value.

The question is whether the lexer in the scanless interface is able to deal with lexing ambiguity or not.

I see evidence that it should be, but it fails in this particular case. Am I wrong?

> content_line ::= content_name params ':' value CRLF
> and
> TEXT_CHAR ~
> [\\] [\\n,;:]


--
Best regards, Ruslan.

Jeffrey Kegler

Jan 6, 2014, 6:35:11 PM

to marpa-...@googlegroups.com

First off, welcome back. Since you've been away for a while, allow me to let new readers know that you're the founder of this mailing list, and someone whose support and advice have been very valuable to Marpa.

Second, I'm about to dive into the answer, but I'm very open to ideas that would make Marpa easier to use.

Marpa does an old-fashioned longest-tokens-match. It does have information about which tokens are expected, but it imitates traditional parsers in not using that. At the beginning, it looks for the longest token. If there is only one, and it is not acceptable to the grammar, the parse fails. In this case it finds a <value>, and because a <value> is not acceptable first thing, the parse fails.

Longest-tokens-match requires that you contrive it so that the longest token, including those which the grammar will not find acceptable, is always the one you want. Could Marpa do it differently? Yes, and it will in the future. (Aside @amon: perhaps the IRIF already does better?)
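
A toy model may make the failure mode concrete. The Python sketch below is not Marpa's implementation; it only imitates the behaviour described above. It assumes that name is one or more of [A-Za-z0-9-], as in the G0 rules quoted earlier, and approximates the text lexeme from the trace as "anything up to the end of the line", which is a simplification of the real rule. All function names are hypothetical.

```python
from __future__ import annotations

import re

# Hypothetical stand-ins for two of the thread's lexemes.
LEXEMES = {
    "name": re.compile(r"[A-Za-z0-9-]+"),  # mirrors A_D_D ::= [A-Za-z0-9-] +
    "text": re.compile(r"[^\r\n]+"),       # simplified: anything up to end of line
}


def longest_tokens(source: str, pos: int) -> list[tuple[str, str]]:
    """Return every (lexeme, value) pair whose match at pos has the maximum length."""
    matches = []
    for lexeme, pattern in LEXEMES.items():
        m = pattern.match(source, pos)
        if m:
            matches.append((lexeme, m.group()))
    if not matches:
        return []
    longest = max(len(value) for _, value in matches)
    return [(lexeme, value) for lexeme, value in matches if len(value) == longest]


def ltm_step(source: str, pos: int, expected: set[str]) -> list[tuple[str, str]]:
    """Longest-tokens-match: choose by length first, filter by the grammar second."""
    candidates = longest_tokens(source, pos)
    accepted = [(lexeme, value) for lexeme, value in candidates if lexeme in expected]
    if not accepted:
        rejected = ", ".join(lexeme for lexeme, _ in candidates)
        raise ValueError(f"No lexemes accepted at position {pos}; rejected: {rejected}")
    return accepted


line = "UID:urn:uuid:4fbe8971-0bc3-424c-9c26-36c3e1eff6b1"

# name would match "UID", but text matches the whole line and is longer,
# so name is never offered even though it is the only expected lexeme.
try:
    ltm_step(line, 0, expected={"name"})
except ValueError as err:
    print(err)
```

When two lexemes tie on length, as 'VERSION:4.0' and text do in the trace, both are offered and the grammar picks the acceptable one; in this model `ltm_step("UID", 0, expected={"name"})` succeeds for the same reason.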

-- jeffrey

Jeffrey Kegler

Jan 6, 2014, 6:50:05 PM

to marpa-...@googlegroups.com

By the way, essentially this same problem came up on Stack Overflow, and a solution for the current SLIF is there.

-- jeffrey


Ruslan Zakirov

Jan 6, 2014, 6:56:21 PM

to marpa-parser

Hi,

Peter mentioned longest-tokens-match off list an hour ago and I only noticed it 5 minutes ago. This is not what I was expecting from a scannerless interface.

This means Repa is still a valid thing. I should kill all attempts at continuous parsing in it and release it.

Pauses and manual lexing are not "sexy" :)
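
For comparison, here is a toy model of what manual lexing buys: the application tries only the lexemes the parser currently expects, so a shorter match can win. This is a Python sketch, not Marpa code. The lexeme patterns are simplified assumptions, the optional params part of a content line is left out, and the sequence of expected sets is hard-coded where a real recognizer would supply it at each position.

```python
import re

# Hypothetical lexeme patterns, simplified from the grammar discussed in the thread.
LEXEMES = {
    "name": re.compile(r"[A-Za-z0-9-]+"),
    "colon": re.compile(r":"),
    "text": re.compile(r"[^\r\n]+"),
}


def read_expected(source, pos, expected):
    """Try only the expected lexemes at pos and return the longest (lexeme, value)."""
    best = None
    for lexeme in expected:
        m = LEXEMES[lexeme].match(source, pos)
        if m and (best is None or len(m.group()) > len(best[1])):
            best = (lexeme, m.group())
    if best is None:
        raise ValueError(f"none of {sorted(expected)} matches at position {pos}")
    return best


def lex_content_line(line):
    """Lex one 'name : value' line; the expected sets are hard-coded for this sketch."""
    tokens, pos = [], 0
    for expected in ({"name"}, {"colon"}, {"text"}):
        lexeme, value = read_expected(line, pos, expected)
        tokens.append((lexeme, value))
        pos += len(value)
    return tokens


for token in lex_content_line("UID:urn:uuid:4fbe8971-0bc3-424c-9c26-36c3e1eff6b1"):
    print(token)
```

Because text is not in the expected set at the start of the line, it cannot shadow name, and the colons inside the value are harmless once text is the expected lexeme.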

What is IRIF? Is it a new Marpa front end with inline actions?

Jeffrey Kegler

Jan 6, 2014, 7:10:53 PM

to marpa-...@googlegroups.com

On 01/06/2014 03:56 PM, Ruslan Zakirov wrote:
> Pauses and manual lexing are not "sexy" :)

I'd tend to agree, but on that basis it is hard to explain the
popularity of recursive descent. One of my reasons for this approach is
that it's the way other parsers/lexers currently work.

> What is IRIF? Is it new marpa front end with inline actions?

Yes. I wind up having to discuss the virtues/failings of various
interfaces a lot, and the 4-letter abbreviations are handy for me. Amon
describes the IRIF, accurately, as a re-imagining of the SLIF. Among
other things, he threw away my lexing strategy and replaced it with his own.

-- jeffrey

Peter Stuifzand

Jan 6, 2014, 7:11:00 PM

to Ruslan Zakirov, marpa-parser

I didn't know it was off list. It seems I need to learn how to use my new phone.

LTM is pretty useful when parsing programming languages, but in my experience it is hard to work with in data-like or ad hoc formats.

Peter


Ron Savage

Jan 6, 2014, 7:14:29 PM

to marpa-...@googlegroups.com

I'm simply not sure :-).
