Improving LexerException

Gonzalo Ortiz Jaureguizar

Sep 17, 2012, 5:34:16 AM

to sab...@googlegroups.com

Hi all,

One of the weakest points of SableCC is error handling. At work we are developing some new languages, and we have to handle errors and show them to the user. By default, SableCC lexers throw a LexerException when they find text that does not match any token. This exception has a message like: "[line, pos] unknown token 'unknown_token_text'". We want to change the error message, but it is not easy to get the line and position of the unknown token, because these attributes are private in the Lexer class and LexerException doesn't expose them. Of course, we can extract them by matching a simple regular expression against the exception message, but I think it would be better if we could access them through getters and setters (in the Lexer class or the LexerException class).
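For reference, the regular-expression workaround could look roughly like the sketch below. LexerErrorInfo is a hypothetical helper, not part of SableCC, and the pattern assumes the message has the shape "[line,pos] Unknown token: text" (optional space after the comma); it would need adjusting if a given SableCC version formats the message differently.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of SableCC): recovers line, position and text
// from a message assumed to look like "[line,pos] Unknown token: text".
final class LexerErrorInfo {
    // Assumed message shape; whitespace after the comma is tolerated.
    private static final Pattern MESSAGE = Pattern.compile(
        "\\[(\\d{1,9}),\\s*(\\d{1,9})\\]\\s*Unknown token:\\s?(.*)",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    final int line;
    final int pos;
    final String text;

    private LexerErrorInfo(int line, int pos, String text) {
        this.line = line;
        this.pos = pos;
        this.text = text;
    }

    // Returns null when the message does not have the assumed shape.
    static LexerErrorInfo parse(String message) {
        if (message == null) {
            return null;
        }
        Matcher m = MESSAGE.matcher(message);
        if (!m.matches()) {
            return null;
        }
        return new LexerErrorInfo(
            Integer.parseInt(m.group(1)),
            Integer.parseInt(m.group(2)),
            m.group(3));
    }
}
```

Calling LexerErrorInfo.parse(e.getMessage()) inside a catch block would then give access to the three parts, which also shows why this is fragile: any change to the message text silently breaks it.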

What do you think?

Phuc Luoi

Sep 17, 2012, 3:57:49 PM

to sab...@googlegroups.com

Hi Gonzalo Ortiz,

I used Lexer.peek() to get the last token that the lexer read successfully. Of course, this does not give me the exact position
of the lexical error, but it is good enough for my project. If you take a look at LexerException you will
see that it is very simple. I guess you can use the method LexerException.getMessage() to get the error message.

Hong Phuc


Gonzalo Ortiz Jaureguizar

Sep 17, 2012, 4:04:00 PM

to sab...@googlegroups.com

Yeah, I know that I can parse the error message, but this is not the best solution. It would be very easy to improve the API! For example, ParserException already has attributes like the start token.

My temporary workaround is to add a new "error" token in the last position. This token matches any string, so the Lexer never throws a LexerException, and I can "catch" the error token and show it as an error in the IDE. This approach also lets me show more than one error, because another feature that would be great is the ability to get the next token once you find an error.
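To illustrate the idea outside SableCC, here is a minimal sketch with a tiny hand-written scanner standing in for the generated lexer. Everything in it (ErrorTokenDemo, Tok, the "id" and "error" kinds) is made up for the example; in a real grammar the catch-all would be a token definition placed last so that it only matches when nothing else does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo, not SableCC code: a catch-all "error" token kind lets
// scanning continue after a bad character, so several errors can be reported.
final class ErrorTokenDemo {
    // Simplified stand-in for a generated token (1-based line and pos).
    static final class Tok {
        final String kind;
        final String text;
        final int line;
        final int pos;

        Tok(String kind, String text, int line, int pos) {
            this.kind = kind;
            this.text = text;
            this.line = line;
            this.pos = pos;
        }
    }

    // Recognises runs of letters as "id", skips whitespace, and turns any
    // other single character into an "error" token instead of throwing.
    static List<Tok> scan(String input) {
        List<Tok> out = new ArrayList<>();
        int line = 1;
        int pos = 1;
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\n') {
                line++;
                pos = 1;
                i++;
            } else if (Character.isWhitespace(c)) {
                pos++;
                i++;
            } else if (Character.isLetter(c)) {
                int start = i;
                while (i < input.length() && Character.isLetter(input.charAt(i))) {
                    i++;
                }
                out.add(new Tok("id", input.substring(start, i), line, pos));
                pos += i - start;
            } else {
                out.add(new Tok("error", String.valueOf(c), line, pos));
                pos++;
                i++;
            }
        }
        return out;
    }

    // Formats every error token as a message an IDE could display.
    static List<String> collectErrors(List<Tok> tokens) {
        List<String> errors = new ArrayList<>();
        for (Tok t : tokens) {
            if (t.kind.equals("error")) {
                errors.add("[" + t.line + "," + t.pos + "] Unknown token: " + t.text);
            }
        }
        return errors;
    }
}
```

The point is only that error tokens carry their own position and do not stop the scan, so all of them can be collected in one pass.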

Niklas Matthies

Sep 17, 2012, 4:38:19 PM

to sab...@googlegroups.com

In my own modified version of SableCC I use(d), I patched the Token
type to include the start and end positions (where position includes
line, column, and character index within the input stream). I also
added an UnknownTokenException type as a nested type of the Lexer
class, which provides the 'unknown_token_text' part as a separate
string.

You can make changes like these just by editing lexer.txt within the
SableCC jar file (i.e. no need to recompile SableCC). Of course, it
would be better to have the official API be improved.

Niklas Matthies

Gonzalo Ortiz Jaureguizar

Sep 18, 2012, 3:21:25 AM

to sab...@googlegroups.com, ml_sabl...@nmhq.net

The problem with this approach is, as you said, that every time you recompile your grammar, the changes you made are deleted. I work with a team, on very complex grammars, and it is pretty common that we have to make some changes, so we have to modify the generated code again.
One of the best features of SableCC is that, if you do not change the abstract syntax, it is not necessary to change your DFA... but the lexer and parser are very monolithic! It would be great if the generated lexer and parser delegated their logic to protected methods, so that we could extend the lexer/parser and modify some aspects of its behavior, and those changes would be retained when the code is generated again.
This could be a very good improvement in a new version of SableCC!

Etienne Gagnon

Sep 20, 2012, 1:23:18 PM

to sab...@googlegroups.com

Hi,

For the lexical error position problem, I think that the easiest, backward-compatible solution would be to create an InvalidToken with a position and a single character of text (the character at the error location). Note that there must be such a character; otherwise EOF would have been found. Then, we could attach this token to the LexerException. What do you think?
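A minimal sketch of what this could look like, using simplified stand-in classes rather than the real generated ones. The accessor name getToken() and the constructor shapes are assumptions for illustration only; nothing here is existing SableCC API.

```java
// Hypothetical sketch, not SableCC code: stand-ins for the generated classes.
final class InvalidTokenSketch {
    // Simplified stand-in for the generated Token base class.
    static class Token {
        private final String text;
        private final int line;
        private final int pos;

        Token(String text, int line, int pos) {
            this.text = text;
            this.line = line;
            this.pos = pos;
        }

        String getText() { return text; }
        int getLine() { return line; }
        int getPos() { return pos; }
    }

    // Proposed token type: exactly one offending character plus its position.
    static final class InvalidToken extends Token {
        InvalidToken(char c, int line, int pos) {
            super(String.valueOf(c), line, pos);
        }
    }

    // The exception keeps its message and additionally carries the token.
    static final class LexerException extends Exception {
        private final InvalidToken token;

        LexerException(InvalidToken token, String message) {
            super(message);
            this.token = token;
        }

        // Assumed accessor name.
        InvalidToken getToken() { return token; }
    }

    // Mimics the failure path: same style of message, plus the attached token.
    static void fail(char c, int line, int pos) throws LexerException {
        throw new LexerException(
            new InvalidToken(c, line, pos),
            "[" + line + "," + pos + "] Unknown token: " + c);
    }
}
```

Existing code that only calls getMessage() would keep working, while new code could read the line and position from the attached token instead of parsing the message.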

As for making generated classes modifiable, I think that this is a Java-language problem.

One needs "aspects" or, even better, class refinement (often called "open classes") to be able to add new features to an existing class, without having to modify the original source file. I haven't found a clean and simple way to provide this functionality in Java.

Using class factories does not work, because of static typing. In other words, even if SableCC allowed users to define their own subclasses of Node and use a user-provided factory to create AST nodes, the walker (DepthFirstAdapter) methods would still use the generated classes/interfaces as arguments for [case|in|out]XXX methods. So, one would need to use a type cast to access any new feature of the node class. e.g.

  public void caseAIfStatement(AIfStatement node) {
    MyIfStatement mine = (MyIfStatement) node;
    mine.newMethod();
  }

Class refinement (open classes) allows adding new methods to existing types. e.g.

refine class AIfStatement{
  public void newMethod() { ... }
}

so that you can, later, cleanly write:

  public void caseAIfStatement(AIfStatement node) {
    node.newMethod();
  }

without any typing error.

Etienne

Etienne Gagnon, Ph.D.
http://sablecc.org

Gonzalo Ortiz Jaureguizar

Sep 21, 2012, 3:54:23 AM

to sab...@googlegroups.com

Hi,

Etienne, I agree with you. If we want to extend the Node class, we have to use some casts (although in this case some casts are not too painful). But if we want to extend Lexer (or maybe Parser), it is pretty easy to do so by overriding methods! For example, Lexer#getToken() works this way when a character is not a valid token:

                    if(this.text.length() > 0)
                    {
                        throw new LexerException(
                            "[" + (start_line + 1) + "," + (start_pos + 1) + "]" +
                            " Unknown token: " + this.text);
                    }

If we change that part with this other:

                    if(this.text.length() > 0)
                    {
                        token = invalidToken(start_line, start_pos);
                    }

and we add this method:

    protected Token invalidToken(int start_line, int start_pos) throws LexerException {
        throw new LexerException(
            "[" + (start_line + 1) + "," + (start_pos + 1) + "]" +
            " Unknown token: " + this.text);
    }

Then Lexer will be backward-compatible and "extensible", i.e., we can extend Lexer, override the Lexer#invalidToken method and do whatever we want (for example, generate our own invalid token, or throw a subclass of LexerException). And we could regenerate the lexer without overwriting our changes.
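As a rough sketch of how a user-side subclass could use such a hook: the classes below are simplified stand-ins, not the generated code, and the unknown(...) method only models the spot where getToken() would call the hook. RecoveringLexer is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not SableCC code: models the proposed protected hook.
final class HookSketch {
    static class Token {
        final String text;
        final int line;
        final int pos;

        Token(String text, int line, int pos) {
            this.text = text;
            this.line = line;
            this.pos = pos;
        }
    }

    static class LexerException extends Exception {
        LexerException(String message) {
            super(message);
        }
    }

    // Stand-in for the generated lexer; only the failure path is modelled.
    static class Lexer {
        protected String text = "";

        // Models the place in getToken() where the hook would be called.
        Token unknown(String unmatched, int startLine, int startPos) throws LexerException {
            this.text = unmatched;
            return invalidToken(startLine, startPos);
        }

        // Default behaviour: throw, exactly as the current code does.
        protected Token invalidToken(int startLine, int startPos) throws LexerException {
            throw new LexerException(
                "[" + (startLine + 1) + "," + (startPos + 1) + "]" +
                " Unknown token: " + this.text);
        }
    }

    // User-written subclass kept in its own file, so regeneration leaves it alone.
    static class RecoveringLexer extends Lexer {
        final List<Token> errors = new ArrayList<>();

        @Override
        protected Token invalidToken(int startLine, int startPos) {
            // Record the problem and hand back a token instead of throwing.
            Token t = new Token(this.text, startLine + 1, startPos + 1);
            errors.add(t);
            return t;
        }
    }
}
```

The base class keeps the current behaviour, so existing users see no difference; only code that opts in by subclassing gets the recovery behaviour.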

The same pattern can be applied in other parts of the code. For example, since the NetBeans APIs address characters by offset and not by (line, column) coordinates, I need to translate the SableCC token (line, column) to a document offset. We could apply this pattern where '\n' or '\r' is read (where line is increased and pos is set to 0), so that I can override these new methods to update my own offset attribute and then (in another method, called when each token is built) set the offset attribute on my own token.
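Until such hooks exist, the translation can also be done outside the lexer. The sketch below is a different technique from the hook-based one just described: it builds an index of line starts from the document text. LineIndex is a hypothetical helper; it assumes 1-based line and pos values counted in chars, and that "\n", "\r" and "\r\n" each count as a single line break.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of SableCC or NetBeans: maps (line, pos)
// coordinates to a character offset in the document text.
final class LineIndex {
    // Offset of the first character of each line; index 0 is line 1.
    private final List<Integer> lineStarts = new ArrayList<>();

    LineIndex(CharSequence text) {
        lineStarts.add(0);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                // Treat "\r\n" as one line break by skipping the '\n'.
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
                lineStarts.add(i + 1);
            } else if (c == '\n') {
                lineStarts.add(i + 1);
            }
        }
    }

    // line and pos are assumed to be 1-based; the result is a 0-based offset.
    int offsetOf(int line, int pos) {
        if (line < 1 || line > lineStarts.size() || pos < 1) {
            throw new IllegalArgumentException("bad coordinates: " + line + "," + pos);
        }
        return lineStarts.get(line - 1) + (pos - 1);
    }
}
```

This needs the whole document text up front, which an editor usually has anyway; tracking the offset inside the lexer would avoid the second pass over the text.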

I know that all these changes would take some engineering time, but they would be a great improvement for SableCC users who need to do something that is not supported by the default lexer.
