Java 8 Parser

Harry

Oct 13, 2016, 11:00:29 AM

to marpa parser

Hello,

I'm very new to Marpa but, from its description, it looks extremely awesome.

I'm also done playing with the beginner's example of the expression calculator; was also able to make small changes to it. So far, so good.

However, now, I'm trying to write a Java 8 Parser using the grammar published here:
https://docs.oracle.com/javase/specs/jls/se8/html/jls-19.html

While I think I'm able to map the above Oracle grammar spec to the G1 rules (if I stub out some of the lexemes referenced by the G1 rules) and create an instance of Marpa::R2::Scanless::G, I'm having a hard time writing the L0 lexer rules in SLIF for the Lexer grammar. Some issues that I will need to (but don't know how to) deal with are:

1. Keyword vs Identifier:

The Java spec defines Identifier thus:

Identifier:
IdentifierChars but not a Keyword or BooleanLiteral or NullLiteral

IdentifierChars:
JavaLetter {JavaLetterOrDigit}

JavaLetter:
any Unicode character that is a "Java letter"

JavaLetterOrDigit:
any Unicode character that is a "Java letter-or-digit"

So, how do I do the "not a Keyword or BooleanLiteral or NullLiteral" part? In Perl regex, one could do a negative lookahead assertion like so...

if (m/ (?! $Keyword | $BooleanLiteral | $NullLiteral ) $IdentifierChars /x) {
    # this is an Identifier
}

... but only if Marpa allowed such a rich, Perl regex syntax. Which it doesn't, apparently, in SLIF.

2. Comment (single- and multi-line versions)
I could write a bunch of G1 rules to handle the multi-line Java comment, but I'm seeing it becoming very verbose. Is there an easier way to handle stuff like this in SLIF?

3. Since Marpa is Perl-based, is it possible to tap the full power of Perl regex engine, especially for lexing?

4. Notice that Java 8 spec for recognizing tokens is in the form of a Lexer grammar... that is written in BNF style instead of a 'flat', regex style. If I were to mechanically replicate the Lexer grammar using G1 rules (instead of L0 rules), would it entail a performance and space overhead by creating unnecessary tree nodes for what would otherwise be a flat lexeme in bison/flex?

5. Would Marpa experts recommend using SLIF (internal scanner) for Java 8, or should I abandon it in favor of a custom / external lexer?

Regards,
/Harry
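
The negative-lookahead idea from question 1 can be tried in plain Perl, outside Marpa. The following is only a minimal sketch under stated assumptions: the reserved-word list is a small hypothetical subset, and the character classes are ASCII-only stand-ins for "Java letter" and "Java letter-or-digit". It also anchors the exclusion to the end of the token, so that a name such as `iffy`, which merely starts with a keyword, still counts as an identifier.

```perl
use strict;
use warnings;

# Hypothetical subset of the reserved words (keywords plus the boolean and
# null literals); the full list is in the Java spec.
my @reserved    = qw(if else while class return true false null);
my $reserved_re = join '|', map { quotemeta } @reserved;

# ASCII-only stand-in for JavaLetter {JavaLetterOrDigit}.
my $identifier_chars = qr/[A-Za-z_\$][A-Za-z0-9_\$]*/;

# \A and \z anchor the whole token. The \z inside the lookahead makes the
# exclusion apply only when the entire token is a reserved word.
my $identifier = qr/\A (?! (?:$reserved_re) \z ) $identifier_chars \z/x;

sub is_identifier {
    my ($token) = @_;
    return $token =~ $identifier ? 1 : 0;
}

for my $token (qw(if iffy nullable null x1 1x)) {
    my $verdict = is_identifier($token) ? 'identifier' : 'not an identifier';
    print "$token: $verdict\n";
}
```

This only shows the matching logic on an isolated token; it says nothing about how the check would be wired into a SLIF grammar.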

Jeffrey Kegler

Oct 13, 2016, 12:24:53 PM

to Marpa Parser Mailing LIst

JavaScript is not Java, I know, but Jean-Damien Durand has written several full language parsers, including ECMAScript: https://github.com/jddurand/MarpaX-Languages-ECMAScript-AST


Jeffrey Kegler

Oct 13, 2016, 12:33:41 PM

to Marpa Parser Mailing LIst

Your specific questions, off the top of my head:

1.) You may want to look at lexeme priorities. If not, yes, external lexing may be what you need.

2.) There are several examples of ways to write multi-line comments. One is in the FAQ: http://savage.net.au/Perl-modules/html/marpa.faq/faq.html#q110

3.) Yes, but only via external lexing.

4.) Not sure this answers your question, but L0 rules allow full Marpa syntax.

5.) For a large language, this can be a very hard call. Note that you *can* switch back and forth -- you can use the SLIF for some lexemes, and use events to switch to external processing for others.

Quick answers, but I hope they help, jeffrey

Harry

Oct 15, 2016, 7:12:49 AM

to marpa parser

Thanks, Jeffrey, for your responses.

I've gone through the documentation of Marpa::R2::Scanless::R but it's not becoming fully clear how to 'connect' the external lexing routine of mine to Marpa's built-in G1 parser. I've looked at some random Marpa code on the Net but that code is looking way too complicated as far as illustrating just the "external lexing" part goes.

My expectation was (and, is) that of a bison/flex type of interface where yyparse() calls yylex() to get the next token following which things automatically work. With Marpa, it seems, you have to do (much?) more than that (sorry, if I'm being inaccurate here).

What I already understand: When doing external lexing, I assume I'll have to create the equivalent of yylex() myself - in my case this function will, e.g., be making heavy use of Perl Regex's.

However, how would I pass the string value returned by my yylex() to Marpa parser?

Could someone please share a simple, "hello world" type of example, or if not that, at least some pseudocode?

On Thursday, October 13, 2016 at 10:03:41 PM UTC+5:30, Jeffrey Kegler wrote:

4.) Not sure this answers your question, but L0 rules allow full Marpa syntax.

May I ask, why the full Perl Regex syntax is not supported in L0 rules? If it were supported, I could've read a Java multi-line comment simply with just this one-liner:

MultilineComment ~ ('/*' .*? '*/')   # parentheses being optional

Regards,
/Harry

Jeffrey Kegler

Oct 15, 2016, 12:53:02 PM

to Marpa Parser Mailing LIst

Here's a tutorial with a small example of external lexing, aka procedural parsing: http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2013/04/procedural.html
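
The Perl side of such an external lexer, the yylex() equivalent asked about above, can be sketched without Marpa at all. This is a rough illustration, not the tutorial's code: the token names, the token table, and the `tokenize` function are all hypothetical, and the table covers only a toy subset of Java. Each token comes back with its start offset and length, which is the kind of information an external scanner typically hands to a parser.

```perl
use strict;
use warnings;

# Hypothetical token table: [ name, regex ]. The first entry that matches at
# the current position wins, so COMMENT precedes PUNCT and KEYWORD precedes
# IDENTIFIER.
my @token_table = (
    [ COMMENT    => qr{/\*.*?\*/|//[^\n]*}s ],
    [ KEYWORD    => qr/(?:if|else|while|return)\b/ ],
    [ IDENTIFIER => qr/[A-Za-z_][A-Za-z0-9_]*/ ],
    [ NUMBER     => qr/[0-9]+/ ],
    [ PUNCT      => qr/[-+*\/=;(){}]/ ],
);

# Return a list of [ name, text, start, length ] records; comments are dropped.
sub tokenize {
    my ($source) = @_;
    my @tokens;
    pos($source) = 0;
    TOKEN: while (1) {
        $source =~ /\G\s+/gc;    # skip whitespace; /c keeps pos() on failure
        last TOKEN if pos($source) >= length($source);
        my $start = pos($source);
        for my $entry (@token_table) {
            my ($name, $regex) = @{$entry};
            if ($source =~ /\G($regex)/gc) {
                push @tokens, [ $name, $1, $start, length($1) ]
                    unless $name eq 'COMMENT';
                next TOKEN;
            }
        }
        die "No token matches at offset $start\n";
    }
    return @tokens;
}

for my $token (tokenize("if (iffy) /* note */ return x1 + 42; // done")) {
    printf "%-10s %-6s start=%d length=%d\n", @{$token};
}
```

Handing each record to Marpa (for instance through the recognizer's lexeme_read method) is not shown here; the linked tutorial covers that part.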

Jeffrey Kegler

Oct 15, 2016, 1:01:26 PM

to Marpa Parser Mailing LIst

Re #4, why not implement Perl regexes? A full syntax of Perl regexes is gruesomely complex, and much of it is symptoms rather than features.

But some of the features *are* useful, including eager matching, which is what your example depends on. The obstacle is that Perl regexes are deterministic, while Marpa is non-deterministic. Determinism imposes severe limits on regexes -- they're committed to the limits and inefficiencies of a deterministic approach. *But* there is a partially compensating advantage -- deterministic thinking can be easier, particularly if you get used to its limits. So, if you are proceeding deterministically, an instruction to "accept the shortest match" is easy to implement.

I hope to add eager matching to Marpa::R3. For now, in Marpa::R2, you have to re-express the idea in BNF, even in cases where the deterministic approach is easier and more natural. Sorry.

Hopefully, the tutorial in my previous answer also shows you how to switch to lexing in Perl, so you can have the best of both worlds.

Hope this helps! -- jeffrey
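
The difference between longest and shortest ("eager") matching, and what re-expressing the idea without a non-greedy quantifier looks like, can be seen in plain Perl. This is an illustrative sketch only; the sample string and variable names are made up, and the third pattern is one common way to spell out a comment body that cannot contain `*/`, not a rule taken from any Marpa grammar.

```perl
use strict;
use warnings;

my $source = "a /* one */ b /* two */ c";

# Longest match: runs from the first opener to the last closer.
my ($longest) = $source =~ m{(/\*.*\*/)}s;

# Shortest ("eager") match: stops at the first closer.
my ($shortest) = $source =~ m{(/\*.*?\*/)}s;

# The same result without a non-greedy quantifier: the body is built from
# pieces that can never contain "*/", which is the shape a rule-by-rule
# (BNF-style) re-expression tends to take.
my ($explicit) = $source =~ m{(/\*[^*]*\*+(?:[^/*][^*]*\*+)*/)};

print "longest:  $longest\n";
print "shortest: $shortest\n";
print "explicit: $explicit\n";
```

The third pattern reads as: an opener, a run of non-stars, one or more stars, then any number of segments that start with something other than `/` or `*` and end in stars, and finally the closing `/`.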


Jeffrey Kegler

Oct 15, 2016, 1:05:46 PM

to Marpa Parser Mailing LIst

And here's another tutorial with a short example of external lexing: http://jeffreykegler.github.io/Ocean-of-Awareness-blog/individual/2013/06/mixing-procedural.html


Paul Bennett

Oct 16, 2016, 12:06:16 AM

to marpa-...@googlegroups.com

On Oct 15, 2016 13:01, "Jeffrey Kegler" <jeffre...@jeffreykegler.com> wrote:
>
> Re #4, why not implement Perl regexes? A full syntax of Perl regexes is gruesomely complex, and much of it is symptoms rather than features.

Somewhere deep within perldoc there's a howto on making your own \p{} named properties, which AFAICT are acceptable to Marpa's regex engine. IIRC, I once had some progress that way.

--
P/PW/PWBENNETT

Jeffrey Kegler

Oct 16, 2016, 12:08:31 AM

to Marpa Parser Mailing LIst

Harry's example depends on eager (shortest match) recognition. IIRC Perl regex named properties won't get him there.


Ruslan Shvedov

Oct 17, 2016, 3:55:19 AM

to marpa-...@googlegroups.com

On Thu, Oct 13, 2016 at 6:00 PM, Harry <simon...@gmail.com> wrote:

1. Keyword vs Identifier:

http://stackoverflow.com/questions/27109840/marpa-can-i-explicitly-disallow-keywords-as-identifiers

https://gist.github.com/rns/d19b40ffc5523659dec9 -- events can be used to analyze the string using Perl regexes and read the context-defined lexeme.

2. Comment (single- and multi-line versions)

Perhaps you can use this gist by Jeffrey: https://gist.github.com/jeffreykegler/5015057

Hope this helps.

Ron Savage

Oct 17, 2016, 8:12:55 PM

to marpa parser

Using this material, I've added 2 new questions to the FAQ: http://savage.net.au/Perl-modules/html/marpa.faq/faq.html. Nos 144 and 145.

Ruslan Shvedov

Oct 18, 2016, 1:06:42 AM

to marpa-...@googlegroups.com

Great, thanks.

BTW, s/How to I/How do I/ on both.

I'd file a PR, but couldn't find those entries at https://github.com/ronsavage/marpa.faq -- missing something perhaps.


Ron Savage

Oct 18, 2016, 9:19:01 PM

to marpa parser

Thanx. Typos fixed.

Durand Jean-Damien

May 1, 2017, 2:44:47 PM

to marpa parser

Hello,

With Marpa::R2 it is possible to do exclusions at the lexeme level using user-defined character classes.

Such an implementation was used in ECMAScript, as indeed mentioned by Jeffrey; cf. https://github.com/jddurand/MarpaX-Languages-ECMAScript-AST/blob/master/lib/MarpaX/Languages/ECMAScript/AST/Grammar/CharacterClasses.pm (which I admit is a bit hard to understand stand-alone without the grammar itself, but these are the lexeme implementations with... exclusions).
For example:

sub IsSourceCharacterButNotStarOrLineTerminator { return <<END;
+MarpaX::Languages::ECMAScript::AST::Grammar::CharacterClasses::IsSourceCharacter
-MarpaX::Languages::ECMAScript::AST::Grammar::CharacterClasses::IsStar
-MarpaX::Languages::ECMAScript::AST::Grammar::CharacterClasses::IsLineTerminator
END
}

Regards, Jean-Damien.
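
The mechanism behind such exclusion classes is Perl's user-defined character properties, and it can be tried in plain Perl first. The sketch below is a simplified illustration, not code from the ECMAScript module: the three property names are hypothetical, they live in package `main`, and the base set is simply every code point rather than a real "source character" definition.

```perl
use strict;
use warnings;

# User-defined properties are plain subs whose names start with "Is" or "In".
# Each returns lines of hex code points or ranges (tab-separated); a "+" or
# "-" line adds or removes another property, named with its package.
sub IsStar           { return "002A\n" }
sub IsLineTerminator { return "000A\n000D\n2028\t2029\n" }

sub IsCharButNotStarOrLineTerminator {
    return "0000\t10FFFF\n"
         . "-main::IsStar\n"
         . "-main::IsLineTerminator\n";
}

# Exactly one character that belongs to the exclusion class.
my $class = qr/\A\p{IsCharButNotStarOrLineTerminator}\z/;

for my $char ('a', '/', '*', "\n") {
    my $verdict = $char =~ $class ? 'in class' : 'excluded';
    printf "U+%04X %s\n", ord($char), $verdict;
}
```

A class built this way is an ordinary `\p{...}` property as far as Perl is concerned, which is what makes it a candidate for use inside a character class in a lexeme rule.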
