Mining tokens before and after variable uses and declarations?

Andrew Head

Apr 12, 2018, 8:28:00 PM
to Boa Language and Infrastructure User Forum
Hi Boa community and maintainers,

Thank you for making this awesome tool!  It's a big help to have an existing, parsed dataset of thousands of Java files.

I'm trying to extract all variable declarations and uses from 100,000+ Java files, along with ~10 tokens worth of context on either side of each variable declaration or use.

Does Boa support this?  What would be the recommended way to do it?  After taking a look at the docs and trying out a few queries, my best guess is to:

1. Visit every node. If a node does not have a `kind`, it is a terminal: save its value to the output as a context token. If a node has the variable `kind`, save its value to the output with the label "variable".
2. In a second step, offline from Boa, divide the results into variable uses and the tokens around them.

Is there any way to get the context of tokens before and after a node with the Boa API?  And would Boa be able to handle output probably tens of gigabytes in size?  Would love to hear if you have any tips on how I could best take advantage of this cool tool.

Best,
Andrew

Ganesha Upadhyaya Belle

Apr 13, 2018, 1:00:02 AM
to Andrew Head, boa-...@googlegroups.com
Hi Andrew,

In Boa, all nodes are either statements or expressions. Variable declarations are marked with expression kind VARDECL, and variable accesses are marked with expression kind VARACCESS. By tokens, do you mean literals? If so, they are also marked, with expression kind LITERAL.

Once you have the variables and literals, you can simply construct and output a sequence (per project, per class, per method, etc.). You may have to post-process it to prepare a context of tokens before and after each variable. Something like this:

seq := "";
visit(input, visitor {
    before expr: Expression -> {
        if (expr.kind == ExpressionKind.VARDECL) {
            // extract the declared variable from the LHS of the expression
            // (declVar is a placeholder for that name)
            seq = format("%s,DEF_%s", seq, declVar);
        } else if (expr.kind == ExpressionKind.VARACCESS) {
            // extract the used variable (usedVar is a placeholder for that name)
            seq = format("%s,USE_%s", seq, usedVar);
        } else if (expr.kind == ExpressionKind.LITERAL) {
            // store the literal in either a collection or an aggregator
            // (token is a placeholder for the literal's text)
            seq = format("%s,%s", seq, token);
        }
    }
});
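
The post-processing step mentioned above could look roughly like the following offline Python sketch. It assumes the sequence built by the snippet is a single comma-separated string with `DEF_` and `USE_` prefixes, and that no literal itself contains a comma or starts with one of those prefixes. The function name `context_windows` and the `width` parameter are made up for illustration.

```python
from typing import Iterator, List, Tuple


def context_windows(seq: str, width: int = 10) -> Iterator[Tuple[str, str, List[str], List[str]]]:
    """Yield (kind, name, before, after) for every DEF_/USE_ item in seq."""
    # The Boa snippet starts from an empty string, so the sequence begins
    # with a comma; drop the empty item that produces.
    items = [item for item in seq.split(",") if item]
    for i, item in enumerate(items):
        for prefix, kind in (("DEF_", "def"), ("USE_", "use")):
            if item.startswith(prefix):
                before = items[max(0, i - width):i]   # up to `width` items to the left
                after = items[i + 1:i + 1 + width]    # up to `width` items to the right
                yield kind, item[len(prefix):], before, after
                break


for kind, name, before, after in context_windows(",DEF_a,1,USE_b,2,3", width=2):
    print(kind, name, before, after)
```

Neighbouring variables stay in the context with their prefix, so a later pass can tell them apart from literals if needed.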

Boa can support several GB of output.

Hope this is useful.

Thanks
Ganesh

--
More information about Boa: http://boa.cs.iastate.edu/
---
You received this message because you are subscribed to the Google Groups "Boa Language and Infrastructure User Forum" group.
To unsubscribe from this group and stop receiving emails from it, send an email to boa-user+unsubscribe@googlegroups.com.
To post to this group, send email to boa-...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.

Andrew Head

Apr 13, 2018, 12:34:23 PM
to boa-...@googlegroups.com
Awesome, thanks, Ganesh!  This is a big help.  I'm currently playing around with the snippet you shared---thanks for that :)

By tokens, I mean all words in the program before and after that variable use or declaration.

Like given a line like this:
int a = multiply(1 + b, 2) + 3;

My ideal output includes tokens for keywords, operators, function calls, basically everything that's not a variable.  Something like:
TOKEN: int
VARIABLE DEF: a
TOKEN: =
TOKEN: multiply
TOKEN: (
TOKEN: 1
TOKEN: +
VARIABLE USE: b
TOKEN: ,
TOKEN: 2

For typical parse trees, I could get the tokens by printing out the text for each of the terminals in the tree.  Is there a way to visit all terminals?  Like by just visiting all Expressions that don't have child expressions, and printing out their literal value?  Here's what I'm currently trying:

p: Project = input;
sequence: output collection of string;

visit(p, visitor {
    before expr: Expression -> {
        if (expr.kind == ExpressionKind.VARDECL)
            sequence << format("D,%s", expr.variable);
        else if (expr.kind == ExpressionKind.VARACCESS)
            sequence << format("U,%s", expr.variable);
        else if (len(expr.expressions) == 0 && def(expr.literal))
            # Question: Does this rule capture all terminal nodes that aren't variable declarations or uses?
            sequence << format("T,%s", expr.literal);
    }
});
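
Since the output may run to tens of gigabytes, the offline step for `D,` / `U,` / `T,` records like these might be done as a stream rather than by loading everything into memory. The Python sketch below rests on assumptions worth checking: that each output line has the shape `sequence[] = D,a`, and that lines arrive in source order. The name `stream_windows` and the sample lines are made up.

```python
from collections import deque
from typing import Iterable, Iterator, List, Tuple

Window = Tuple[str, str, List[str], List[str]]  # (label, name, before, after)


def stream_windows(lines: Iterable[str], width: int = 10) -> Iterator[Window]:
    before = deque(maxlen=width)  # the last `width` values seen so far
    pending = []                  # variables still collecting following values
    for line in lines:
        # Assumed line shape: "sequence[] = D,a"; keep the part after " = ".
        _, sep, record = line.rstrip("\n").partition(" = ")
        if not sep:
            continue  # not an output record
        # Split on the first comma only, so a "," token survives as a value.
        label, _, value = record.partition(",")
        for entry in pending:
            entry[3].append(value)
        # The oldest pending variable fills up first; emit it once complete.
        while pending and len(pending[0][3]) == width:
            yield tuple(pending.pop(0))
        if label in ("D", "U"):
            pending.append([label, value, list(before), []])
        before.append(value)
    # Variables near the end of the input get a shorter "after" context.
    for entry in pending:
        yield tuple(entry)


sample = ["sequence[] = T,int", "sequence[] = D,a", "sequence[] = T,=",
          "sequence[] = T,multiply", "sequence[] = U,b", "sequence[] = T,,"]
for window in stream_windows(sample, width=2):
    print(window)
```

Memory use stays proportional to `width`, so the same loop can be fed an open file handle of any size.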


Ganesha Upadhyaya Belle

Apr 13, 2018, 1:01:38 PM
to boa-...@googlegroups.com, head.a...@gmail.com
Unfortunately, checking whether an expression has subexpressions will not work. You will have to do it manually: visit each expression and output all of its tokens.
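
To illustrate why the leaf check falls short, here is a toy Python model (not Boa code) of `multiply(1 + b, 2) + 3`. It assumes that operators and method names are carried by inner nodes rather than by leaves. The `Node` class is hypothetical, and the kind names are borrowed loosely, so they should be checked against the Boa documentation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    kind: str
    children: List["Node"] = field(default_factory=list)
    text: str = ""  # literal value, variable name, or method name


def leaf_tokens(node: Node) -> List[str]:
    """Collect text from childless nodes only (the approach that loses tokens)."""
    if not node.children:
        return [node.text]
    return [tok for child in node.children for tok in leaf_tokens(child)]


def all_tokens(node: Node) -> List[str]:
    """Emit tokens per node kind, including those implied by inner nodes."""
    if node.kind in ("LITERAL", "VARACCESS"):
        return [node.text]
    if node.kind == "OP_ADD":
        left, right = node.children
        return all_tokens(left) + ["+"] + all_tokens(right)
    if node.kind == "METHODCALL":
        out = [node.text, "("]
        for i, arg in enumerate(node.children):
            if i:
                out.append(",")
            out += all_tokens(arg)
        return out + [")"]
    raise ValueError(f"unhandled kind: {node.kind}")


inner = Node("OP_ADD", [Node("LITERAL", text="1"), Node("VARACCESS", text="b")])
call = Node("METHODCALL", [inner, Node("LITERAL", text="2")], text="multiply")
tree = Node("OP_ADD", [call, Node("LITERAL", text="3")])

print(leaf_tokens(tree))  # operators, parentheses, and the method name are missing
print(all_tokens(tree))
```

Each additional expression kind needs its own branch in `all_tokens`, which is the manual work referred to above.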