Getting first/last tokens for AST node

Alexander Ryzhov

unread,

Nov 25, 2012, 3:19:33 PM11/25/12

to sab...@googlegroups.com

I am writing a semantic processor for SableCC-generated AST. I'd like to include the line/position for semantic error messages. I find it difficult to extract the token to get the line/position for an error. It would be really helpful if Node had methods like getFirstToken() / getLastToken(). Then it would be really easy to get the range of text corresponding to any AST node.

Thanks,
Alexander

Etienne Gagnon

unread,

Nov 25, 2012, 4:57:36 PM11/25/12

to sab...@googlegroups.com

Hi Alexander,

You compute (and store) this information easily by walking the tree. See
how I suggest to compute node positions using very little code in
http://lists.sablecc.org/pipermail/sablecc-discussion/msg00144.html .

Have fun!

Etienne

Etienne Gagnon, Ph.D.
http://sablecc.org

Alexander Ryzhov

unread,

Nov 25, 2012, 6:23:25 PM11/25/12

to sab...@googlegroups.com

Hi Etienne,

Thanks for the suggestion. The example needs several changes in order to work:
1) rename defaultCase -> defaultIn;
2) add if(node instanceof Token) since defaultIn is also called for Start which is not a Token;
3) replace two maps with one map where the value is a LineAndPos object.
4) instead of using the token as the key for the map, add a loop over all parent nodes and put line/pos information for all parents of the token node.

That will work but I can't say it's ideal because the complexity of the algorithm is N square where N is the number of nodes in the grammar plus the overhead of HashMap. I will use the suggested approach but I wish AST came with this information pre-computed since the desire to link the AST to the source text is quite typical (error reporting, visualization).

One more reason why I'd like to have position information in the node is because in some cases the position of the tree node is not equal to the position of the first token in its subtree. Let's say I have a block: '{' statements '}'. The position of the block is the first curly bracket, although AST won't contain the curly bracket tokens, and the position will be the position of the first statement which is also not ideal. I believe it's possible to capture the position of the first token during CST/AST conversion.

On a related subject, I'd like to add that line/pos isn't always sufficient. In some cases, it's desirable to get absolute position of the token from the beginning of the file - some editors use absolute positioning in the API. Converting line/pos to absolute position isn't trivial. Also, there is the start line/pos of the token but no end line/pos. Although it can be calculated using token text, it's more reliable to have it -- it doesn't add much memory.

Etienne Gagnon

unread,

Nov 25, 2012, 7:58:58 PM11/25/12

to sab...@googlegroups.com

Hi Alexander,

See below.

On 2012-11-25 18:23, Alexander Ryzhov wrote:

[...]Thanks for the suggestion. The example needs several changes in order to work:

1) rename defaultCase -> defaultIn;

I really meant:

@Override
public void defaultCase(Node node) {
...
}

This method is called for every token (in DepthFirstAdapter). It is not called for other nodes (because of redefinitions).

2) add if(node instanceof Token) since defaultIn is also called for Start which is not a Token;

A simple cast is sufficient, as it is not called for non-token nodes.

For other nodes, you want to use defaultOut, in the reverse adapter, as the last visited token is the first one.

3) replace two maps with one map where the value is a LineAndPos object.

You can play with it as you wish. :)

4) instead of using the token as the key for the map, add a loop over all parent nodes and put line/pos information for all parents of the token node.

You seem to forget about empty alternatives...

Productions

p =
  {alt1} a b c |
  {alt2} q;
q = ;

Notice how q has no children to give us a position. The reverse adapter will give it the location of EOF. Empty alternatives are the motivation for using the reverse adapter.

That will work but I can't say it's ideal because the complexity of the algorithm is N square where N is the number of nodes in the grammar plus the overhead of HashMap.

As far as I can see, it would be O(n), where n is the number of nodes in the AST, ignoring the cost of the HashMap. I don't see where you see quadratic complexity in my proposed code. Unless I missed something?

I will use the suggested approach but I wish AST came with this information pre-computed since the desire to link the AST to the source text is quite typical (error reporting, visualization).

SableCC 4 will have a clean implementation of locations. But this is actually difficult, as the position of a should be observably constant. The problem is that, sometimes, the user can get hold of a node before the AST is completely built using semantic selectors and investigators; like the filter() method of SableCC 2/3.

On a related subject, I'd like to add that line/pos isn't always sufficient. In some cases, it's desirable to get absolute position of the token from the beginning of the file - some editors use absolute positioning in the API. Converting line/pos to absolute position isn't trivial. Also, there is the start line/pos of the token but no end line/pos. Although it can be calculated using token text, it's more reliable to have it -- it doesn't add much memory.

Something like this is planned in SableCC 4 (e.g. start and end location for every node of the AST). For SableCC 3, my proposed tree walker should do the trick.

If you wish to compute end location, you must use the normal forward adapter, but still override defaultOut.

Have fun!

Etienne

Alexander Ryzhov

unread,

Nov 26, 2012, 11:43:34 AM11/26/12

to sab...@googlegroups.com

After giving it a second look, I realized I hadn't quite understood your example the first time. I'll get back to you once I implement it. Thanks!

Alexander Ryzhov

unread,

Dec 8, 2012, 2:14:00 AM12/8/12

to sab...@googlegroups.com

Etienne,

I have finally implemented a slightly modified version of your example and I confirm that it works. Here is the code for anyone who is interested:

    TokenMapper tm = new TokenMapper();
    astRoot.apply(tm);
    tokenMap = tm.getMap();
...

    Token getFirstToken(Node node) {
        if(node instanceof Token)
            return (Token) node;
        else
            return tokenMap.get(node);
    }

public class TokenMapper extends ReversedDepthFirstAdapter {

    private Map<Node, Token> map = new HashMap<Node, Token>();

    private Token lastToken;

    @Override
    public void defaultCase(Node node) {

        lastToken = (Token) node;
    }

    @Override
    public void defaultOut(Node node) {
        if(!(node instanceof Token))
            map.put(node, lastToken);
    }

    public Map<Node, Token> getMap() {
        return Collections.unmodifiableMap(map);
    }

Etienne Gagnon

unread,

Dec 8, 2012, 9:53:43 AM12/8/12

to sab...@googlegroups.com

Hi Alexandre,

I'm glad it worked for you. Thanks for the code!

Etienne

Etienne Gagnon, Ph.D.
http://sablecc.org

On 2012-12-08 02:14, Alexander Ryzhov wrote:
> Etienne,
>
> I have finally implemented a slightly modified version of your example
> and I confirm that it works. Here is the code for anyone who is
> interested:

> [...]

Reply all

Reply to author

Forward