Replicate Capture Groups in SableCC's Regular Expressions

fidel....@gmail.com

Aug 11, 2022, 10:38:43 AM
to SableCC
Hi, there!

I am trying to implement multiline Lua strings in SableCC by replicating this PCRE regex: \[((=*)\[(.|\n)*?)\]\2\] . Given that SableCC does not support capture groups, what is the best way to implement this?

My problem is that the run of '=' characters between the first two '[' must match the run between the last two ']'. Something like this:

[==[
This is multiline string
Another line
A third line
]==]
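As an aside, here is a plain java.util.regex illustration (unrelated to SableCC, and with the groups renumbered from the PCRE above) of what the backreference buys: \1 forces the '=' run of the closing marker to match the '=' run captured at the opening marker.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Java transcription of the PCRE idea: group 1 captures the '=' run of
// the opening marker, and the backreference \1 requires the closing
// marker to carry the same run. Class and method names are mine.
public class BackrefDemo {
    private static final Pattern LONG_STRING =
            Pattern.compile("\\[(=*)\\[([\\s\\S]*?)\\]\\1\\]");

    // Returns the body of the first long string found, or null.
    public static String body(String input) {
        Matcher m = LONG_STRING.matcher(input);
        return m.find() ? m.group(2) : null;
    }
}
```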

Basically, it would be similar to this recursive ANTLR rule:

LONGSTRING
    : '[' NESTED_STRING ']'
    ;

fragment
NESTED_STRING
    : '=' NESTED_STRING '='
    | '[' .*? ']'
    ;

Is it possible to replicate this in SableCC without using the filter method?

Thanks in advance!

Fidel H Viegas

Etienne Gagnon

Aug 11, 2022, 11:21:30 AM
to SableCC
Hi Fidel,

I would use lexer states and the filter method to handle such strings. I would define generic multiline_string_start and multiline_string_end tokens and use a filter to match their '=' counts:

...
States
  normal, string_body;
...
Tokens
  {normal}      multiline_string_start     = '[' '='* '[';
  {string_body} multiline_string_end       = ']' '='* ']';
  {string_body} multiline_string_character = any_character;

The filter for multiline_string_start would record the number of '=' characters and change the state to string_body. The filter for multiline_string_end would count its '=' characters; if the count doesn't match the recorded number, it would push the 2nd through last characters back into the PushbackReader (in reverse order, of course) and replace the multiline_string_end token with a multiline_string_character corresponding to the first ']' character. If the count does match the recorded number, the state would be changed back to normal.
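The counting logic described above can be sketched in plain Java, outside the generated SableCC classes (the class and method names here are mine, and a real filter would manipulate the Lexer's token and state instead of returning a String):

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

// Standalone simulation of the filter logic: record the '=' count of the
// opening marker, then consume characters until a closing marker with the
// *same* count is found, pushing mismatched closers back to be re-read.
public class LongStringScanner {

    // Returns the body of a long string starting at '[' '='* '[',
    // or null if the input is not a well-formed long string.
    public static String scan(String input) {
        try {
            // Pushback buffer sized generously for this demo.
            PushbackReader in = new PushbackReader(new StringReader(input), 64);
            if (in.read() != '[') return null;
            int level = 0;                      // '=' count of the opening marker
            int c;
            while ((c = in.read()) == '=') level++;
            if (c != '[') return null;

            StringBuilder body = new StringBuilder();
            while ((c = in.read()) != -1) {
                if (c != ']') {
                    body.append((char) c);
                    continue;
                }
                // Candidate closing marker: count its '=' characters.
                int n = 0;
                int d;
                while ((d = in.read()) == '=') n++;
                if (d == ']' && n == level) {
                    return body.toString();     // counts match: string ends here
                }
                // Mismatch: keep the first ']' as an ordinary character and
                // push the rest back (last-in, first-out) to be re-scanned.
                if (d != -1) in.unread(d);
                for (int i = 0; i < n; i++) in.unread('=');
                body.append(']');
            }
            return null;                        // unterminated string
        } catch (IOException e) {
            return null;
        }
    }
}
```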

A deterministic finite automaton (as used by SableCC lexers) has no memory for matching exact '=' counts. An LR automaton (as used by SableCC parsers) has a single stack; this is sufficient to properly handle recursion, but insufficient to match '=' counts or, similarly, in a programming language with declarations, to detect whether a variable has been declared. Such counting requires a Turing machine (which is equivalent to an automaton with two stacks, or with an infinite tape). SableCC encourages, when possible, delaying semantic analysis (the analysis of everything not already handled by the lexer's DFA and the parser's LR automaton) to a separate phase that follows lexing and parsing, using tree traversals with the DepthFirstAdapter class. But sometimes, as with multiline strings, some semantics must be handled during lexing and parsing and can't be delayed. The filter() methods of the Lexer and Parser classes provide the ability to use a Turing-complete language (Java) to analyze semantics during lexing and parsing, and to affect the output.

In other words, there's no way to escape the use of the filter method to handle the case you provided, which requires counting '=' characters for lexical matching.

Have fun!

Etienne    

fidel....@gmail.com

Aug 11, 2022, 8:13:06 PM
to SableCC
Hi, Etienne!

Thanks for taking the time to reply.

I actually arrived at the same approach you describe, and managed to make it work. However, the rules are more complex than that, which I think is not possible to achieve with SableCC. For example, the following is a valid string:

local s = [==[
[====[
This is a line
this is another line
[====[
this is another line
]==]

Only the first and the last markers must match. If you print this string, you get:

[====[
This is a line
this is another line
[====[
this is another line

If the above code was:

local s = [==[
[====[
This is a line
this is another line
[====[
this is another line
]===]

you'd get an error. I will hand-craft a lexer with the same signature as the one generated by SableCC and use it instead of the generated one. The language is very small, so it isn't that much work.

Thanks a lot for your time!

Regards,

Fidel H Viegas

Etienne Gagnon

Aug 11, 2022, 8:51:48 PM
to SableCC
Hi Fidel,

I think that you meant to write this example instead (note how the marker on line 5 uses ']' characters instead of '[' characters):

1 local s = [==[
2 [====[
3 This is a line
4 this is another line
5 ]====]
6 this is another line
7 ]==]

The solution I outlined in my previous reply properly handles this example. The lexer would first match '[==[' as a multiline_string_start token. This would trigger the filter, which would record that there were 2 '=' characters in the string marker, change the lexer state to string_body, and then let the parser continue its job. In the new string_body lexer state, single characters would then be matched as multiline_string_character tokens until the first closing marker ']====]' is matched. The filter would notice that a multiline_string_end was matched, count its '=' characters, and detect that 4 doesn't match the count of the opening marker (which was 2). As a result, the filter would (1) push back ']', '=', '=', '=', and '=' into the PushbackReader, (2) create a new multiline_string_character with '[' as text, (3) set the lexer's token to this new token, and finally (4) return. Eventually, the second closing marker would be matched, and its count of 2 would correspond to the recorded count. The filter would detect this, change the lexer state back to normal, and return.
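Step (1) above relies on java.io.PushbackReader being last-in, first-out: unreading ']' first, then the four '=' characters, makes the reader deliver "====]" again in text order. A tiny standalone demonstration (names are mine):

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

// Demonstrates the pushback step: after matching the wrong closer
// "]====]", the filter keeps the first ']' as a character token and
// returns "====]" to the reader so those characters are lexed again.
public class PushbackDemo {
    public static String demo() {
        try {
            PushbackReader in = new PushbackReader(new StringReader("rest"), 8);
            // Unread ']' first, then the '=' characters: PushbackReader is
            // LIFO, so they will be read back in the order "====]".
            for (char c : new char[] { ']', '=', '=', '=', '=' }) {
                in.unread(c);
            }
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = in.read()) != -1) sb.append((char) c);
            return sb.toString();
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
}
```

(An equivalent one-shot alternative is `in.unread("====]".toCharArray())`, since the array overload pushes back characters so they are read in array order.)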

This assumes that the grammar has a production for multiline strings like this:

Productions
...
multiline_string = multiline_string_start multiline_string_character* multiline_string_end;
...

Reminder: Section 4.5 of my M.Sc. thesis explains that, in addition to the filter() methods, the Lexer class has 2 protected fields: protected Token token and protected State state. The filter method is called before the token is returned to the parser; it can change the token so that a different token is returned to the parser, or set the token to null so that the lexer directly starts matching the next token instead of returning control to the parser. The filter method can also change the lexer state, effectively changing the set of tokens that will be matched.

Have fun!

Etienne

fidel....@gmail.com

Aug 12, 2022, 11:53:16 AM
to SableCC
Hi, Etienne!

Thank you once again for your input. 

I initially implemented something similar to nested comments, keeping a count of the number of string_start and string_end markers, but also tracking the number of '=' signs of the first string_start ('[' '='* '[') and the last string_end (']' '='* ']'). This works fine when the strings are nested and we only track the first string_start and the last string_end, because that's what needs to match. But your suggestion of handling this during parsing gave me another idea to test.

The code I posted was correct, and it was indeed this one:

1 local s = [==[
2 [====[
3 This is a line
4 this is another line
5 [====[
6 this is another line
7 ]==]

This is a legal Lua string, and if you print it out, you will get this in the output:

-------------- BEGIN OUTPUT --------------------------

[====[
This is a line
this is another line
[====[
this is another line
------------- END OUTPUT ----------------------------

Only the first multiline_string_start and the last multiline_string_end must match in terms of the number of '=' signs. Thanks once again for your suggestion. I had totally neglected passing this on to the parsing stage.

Fidel.