Some samples using the treequery API prototype


Fabien

Sep 29, 2011, 8:19:17 AM
to met...@googlegroups.com
And here comes my first actual mail to the new Metalua mailing list!

A while ago, on the old list, we started a discussion about a proper API to visit syntax trees. Until now, the discussion focused only on what the API should look like, not on implementation details. I've since written a quick and dirty prototype of the API proposed in previous mails, over metalua.walk. It's currently available here:


(I've written it in metalua, but it doesn't use any advanced metalua wizardry for now, so unless this changes, the definitive version will probably be written in plain Lua for wider compatibility.)

It doesn't use all of metalua.walk's features, far from it. I think the definitive version should be written as an alternative to walk rather than as a wrapping layer (I hope to deprecate the old walk library).

Also, I think I've found a clean way to implement scope-aware predicates, although it isn't done yet. More on that later.

Here are some examples of non-trivial operations performed with the API, to give an idea of its current expressiveness. All of the code below runs with the prototype above.


First example
Let's get a list of all expressions, thanks to the is_expr predicate:

    local M = require 'treequery'
    function get_all_expressions(ast)
        return M.treequery(ast) :filter (M.is_expr) :list()
    end


Filtering on node tags
Now let's list all identifier names appearing in an AST. We select the `Id nodes, then put the names in a dictionary:

    function all_names(ast)
        local names = { }
        M.treequery(ast) :filter (M.has_tag ('Id')) :foreach (function(id)
            names[id[1]] = true
        end)
        return names
    end

Since here we want to perform an action on each node rather than return the complete list, we use :foreach(f) rather than :list().


Terser has_tag() syntax
There's a shortcut accepted in all relevant contexts: when a predicate is expected and a string or a sequence of strings is found instead, the string(s) are interpreted as a has_tag() predicate. The two invocations below are therefore equivalent:

    :filter (M.has_tag ('Id'))
    :filter 'Id'

The relevant statement above can therefore be rewritten:

    M.treequery(ast) :filter 'Id' :foreach (|x| names[x[1]]=true)

If I eventually support loop iterators and use metalua's list comprehensions, the whole function will become a one-liner (warning, this code doesn't currently run):

    all_names = |ast| {[x[1]]=true for x in M(ast) :filter 'Id' :foreach()}


Composing several filters
Here's how to get all `Call nodes:

    M.treequery(ast) :filter(M.has_tag('Call'))

Or, more tersely:

    M.treequery(ast) :filter 'Call'

Now, we want only the function calls which are also statements (a call can be an expression or a statement depending on the context):

    M.treequery(ast) :filter 'Call' :filter (M.is_stat)
  

More complex predicates
Find all function call arguments. This requires finding `Call nodes, but keeping their children rather than the call nodes themselves. This is done with the parent() predicate modifier (it transforms a predicate on a node into a predicate on the node's parent). Notice that parent also performs the "string" -> has_tag("string") transformation automatically.

Then, the first child node is the function being called, not an argument. It will be filtered out with the nth() predicate generator.

    function func_args(ast)
        return M.treequery(ast)
            :filter (M.parent 'Call') -- nodes whose parents are Calls
            :filter (M.nth(2,-1))     -- nodes which are #2...last child
            :list()
    end

Notice that it would be nice to also select invocation arguments, but it can't be done properly without a way to denote a form of parallel visit. The only current way would be to concatenate lists together, but it would require two visits, and the resulting list wouldn't be properly ordered.


Nested requests
Here's an example which nests two requests: we want to build a table which associates each function node with the list of its return statements, making sure not to erroneously include returns from nested functions.

    function func_returns(ast)
        local r = { }
        M.treequery(ast) :filter 'Function' :foreach (function (func_node)
            r[func_node] = M.treequery(func_node[2]) -- visit the function's body
                :not_under 'Function'                -- don't visit sub-functions
                :filter    'Return'
                :list()
        end)
        return r
    end

The next examples will hopefully be based on scope-aware variables :)

David Manura

Oct 2, 2011, 12:00:03 AM
to Fabien, met...@googlegroups.com
On Thu, Sep 29, 2011 at 8:19 AM, Fabien <fleut...@gmail.com> wrote:
> Here are some examples of some non-trivial operations performed with the
> API, to give an idea of its current expressiveness.

That's a good starting point and looks generally ok. Real world usage
will say how applicable this is. Eventually some benchmarks can be
worthwhile too (e.g. compared to hand-written searches).

>         :filter (M.parent 'Call') -- nodes whose parents are Calls [...]


> Notice that it would be nice to also select invocation arguments, but it
> can't be done properly without a way to denote a form of parallel visit.

Can't you just replace 'Call' with has_tag{'Call', 'Invoke'} ?

BTW, in has_tag, `node_tag and tags[node_tag]` can be replaced with
`tags[node_tag]`, which should still work even when node_tag is nil.
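That simplification can be illustrated with a self-contained sketch; the function below is a guess at the shape of the prototype's has_tag, not the actual treequery source:

```lua
-- Hypothetical sketch of a has_tag predicate generator; names and
-- structure are assumptions, not the actual treequery code.
local function has_tag(...)
    local tags = { }
    for _, tag in ipairs{ ... } do tags[tag] = true end
    return function(node)
        -- `tags[node.tag]` alone is enough: indexing a table with a
        -- nil key just returns nil, so a `node.tag and` guard would
        -- be redundant.
        return tags[node.tag] or false
    end
end

local is_id = has_tag('Id')
assert(is_id{tag='Id', 'x'} == true)
assert(is_id{tag='Call'} == false)
assert(is_id{ } == false)   -- tagless node (e.g. a block): no error, just false
```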

Fabien

Oct 2, 2011, 9:07:16 AM
to David Manura, met...@googlegroups.com
On Sun, Oct 2, 2011 at 6:00 AM, David Manura <dm....@math2.org> wrote:
> Eventually some benchmarks can be worthwhile too (e.g. compared to hand-written searches).

I wonder why. Sure, it will be slower than carefully hand-written code, but do you foresee many situations where one needs high-speed complex requests?

At the Lua workshop, Roberto made some very interesting remarks about how stressing programmers with irrelevant performance considerations encourages ugly code. I think that the minority who genuinely needs speed will always find its way, but ingraining the idea that readable code might be slower will prevent some less skilled developers from using the maintainable API, out of misguided machismo.

> :filter (M.parent 'Call') -- nodes whose parents are Calls [...]
> Notice that it would be nice to also select invocation arguments, but it
> can't be done properly without a way to denote a form of parallel visit.

> Can't you just replace 'Call' with has_tag{'Call', 'Invoke'} ?

Yes you can, but you'll get the wrong answer :) The arguments are defined differently for the two nodes: for `Call, they're all children except the first one, whereas for `Invoke, they're all children except the *two* first ones.
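The mismatch is visible in the node shapes. A small self-contained sketch in plain Lua 5.1 (the node layouts follow the metalua AST convention; the hand-written keep_from_n here mirrors the metalua lambda version):

```lua
-- Shapes of the two node kinds (metalua AST convention):
--   `Call{ f, arg1, arg2, ... }            -- arguments are children 2..n
--   `Invoke{ obj, `String "m", arg1, ... } -- arguments are children 3..n
local call   = {tag='Call',   {tag='Id','f'}, {tag='Number',1}, {tag='Number',2}}
local invoke = {tag='Invoke', {tag='Id','o'}, {tag='String','m'}, {tag='Number',1}}

-- keep_from_n in plain Lua 5.1: drop the first n-1 children.
local function keep_from_n(n)
    return function(node) return {select(n, unpack(node))} end
end

assert(#keep_from_n(2)(call)   == 2)  -- `Call: two arguments
assert(#keep_from_n(3)(invoke) == 1)  -- `Invoke: one argument
```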

I think I have a correct solution to this, though, which will let one write:

    keep_from_n = |n| |x| select(n, unpack(x))
    t = M(ast) :filter('Call', 'Invoke') :map('Call', keep_from_n(2)) :map('Invoke', keep_from_n(3)) :list()
    return table.flatten(t)

(map applies a transformation function, but also accepts filtering predicates before that function: if a predicate fails, the node isn't removed from the list, but it isn't transformed)

> BTW, in has_tag, `node_tag and tags[node_tag]` can be replaced with `tags[node_tag]`

Indeed.

David Manura

Oct 4, 2011, 11:38:57 PM
to met...@googlegroups.com
On Sun, Oct 2, 2011 at 9:07 AM, Fabien <fleut...@gmail.com> wrote:
>> Eventually some benchmarks can be worthwhile too (e.g. compared to
>> hand-written searches).
> I wonder why. [...] do you foresee many situations where one needs
> high-speed complex requests? [...] I think that the minority who genuinely
> needs speed will always find its way [...]

Speed probably usually doesn't matter. One case where it can matter
is real-time processing, as in an IDE. Last time I profiled
LuaInspect (a while ago), sluggishness on a large file was due not
only to parsing but also to tree walking + decorating (though I don't
recall how much the walking cost in isolation). The multiple
pairs/ipairs traversals per node that I see in your code make me
wonder about that. However, these applications are indeed the
minority, and they will find their own way, as you say, with
appropriate techniques. (Incidentally, people report sluggishness
issues with Visual Studio and Eclipse too, the latter of which you
would know much better than I.)

One possible alternate implementation of this library is to use code
generation techniques: compile a query into a Lua source code string
passed into loadstring (like my ListComprehensions and ExPattern Lua
libraries do). Regular expression matching libraries tend to be
similar in general: you "compile" the regular expression into some
efficient format and then reuse it any number of times. Such an
optimization is not a priority right now but it may be something to
keep in mind.

> some methods are "positional selectors": nodes are selected according to their position,
> relative to the node which satisfied the predicate. [...]
> :filter(p) is also a selector; it only selects nodes which satisfy predicate p.

I think it helps clarify to think of it that way: filter is a special
type of positional selector. In the code, it looked like these were
two distinct concepts. You can probably even express (though maybe
not efficiently) all positional selectors in terms of filter because
each filter is passed the entire path of nodes up to the root. Your
parent and nth filter functions do something like that and thereby
blur the distinction between filter and the positional selectors.

Note: your selectors generally maintain some type of state across node
visits. Something else that maintains state across node visits is
lexical scope. Therefore, these both might share a pattern. Maybe
the maintenance of the "path of nodes up to the root" can likewise
also be a type of walk state that can be moved out of the core
behavior and be made optional for only those queries that need it.

> Another point which hasn't been addressed, but is extremely important, is scoped variable handling.
> [...] The way I see it, variables have a binder which "create" them. For instance,
> "`Function{ {`Id 'a', `Id 'b'}, body }" is the binder of the variables named "a" or "b" which occur
> in "body". Each occurrence of a local variable has one binder; each binder has zero, one,
> or many occurrences of its variable. Global variables are, by definition, the variable occurrences
> without a binder.
> Maybe the binder shouldn't be defined as a node, but as a node with an index. This way,
> each binder represents a single variable. In the example above, the binder of "b" in "body"
> is "{node=`Function{ {`Id 'a', `Id 'b'}, body }; indice=2}". The exact representation of a binder
> will be determined during implementation.

In comparison in LuaInspect, if ast.tag == 'Id', then
ast.localdefinition refers directly to the 'Id' node in the binder
(not to the binder itself). ast.localdefinition is nil for global
variables. It also happens to be the case that ast.localdefinition ==
ast for the 'Id' node in the binder itself. When storing information
about some lexical variable (e.g. a flag indicating whether the
variable is referred to (isused) or mutated (isset) anywhere else),
the ast.localdefinition node is used as the key or the storage
location for this information. This is more efficient as a table key
than trying to use the binder + index position. On the other hand, I
typically don't care about which binder node contains an 'Id' node,
but if I do I can walk up the ancestors (e.g. in my case, ast.parent
links). Highlighting tokens corresponding to `Id nodes of locals
declared as function parameters (`Function binders) differently from
tokens of locals declared via a local statement (`Local binders) is
one example of this.
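The convention described above can be sketched in a few lines of plain Lua (the AST fragments are hypothetical; only the localdefinition field name comes from LuaInspect):

```lua
-- Two occurrences of the same local `a`, plus one global.
local def_a = {tag='Id', 'a'}       -- the `Id inside the binder
local use_a = {tag='Id', 'a'}       -- a later occurrence of that variable
def_a.localdefinition = def_a       -- binder's own `Id points to itself
use_a.localdefinition = def_a       -- occurrence points to the defining `Id
local glob  = {tag='Id', 'print'}   -- global: localdefinition stays nil

-- Per-variable flags keyed directly on the defining node:
local isused = { }
isused[use_a.localdefinition] = true
assert(isused[def_a] == true)       -- one table key per variable, O(1)
assert(glob.localdefinition == nil) -- globals recognized by the nil field
```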

> var.occurence(b)(n)

Here's another class of use cases to consider, one that I've needed in
more than one project: find all local variables that are constant (not
assigned to outside of the binder), are unused (e.g. for dead-code
elimination or typo detection), or mask another lexical of the same
name. In these cases you're dealing with more than one variable per
walk.

Fabien Fleutot

Oct 5, 2011, 5:48:58 AM
to met...@googlegroups.com
On Wed, Oct 5, 2011 at 5:38 AM, David Manura <dm....@math2.org> wrote:
> The multiple pairs/ipairs per node traversals I see in your code make
> me wonder about that.

Each child is iterated over 0 or 1 time, not more. There are several loops in the walker so that stats, exprs etc. are visited intelligently, not as dumb trees, but only one of these loops is run on a given node's children. In treequery, there's a loop on positional filter lists, to keep track of whether they're satisfied, but the complexity grows with the number of predicates, not the tree size.

> One possible alternate implementation of this library is to use code
> generation techniques: compile a query into a Lua source code string
> passed into loadstring (like my ListComprehensions and ExPattern Lua
> libraries do). Regular expression matching libraries tend to be
> similar in general: you "compile" the regular expression into some
> efficient format and then reuse it any number of times. Such an
> optimization is not a priority right now but it may be something to
> keep in mind.

loadstring() hacks cause many problems, most notably of scope, and are doomed to remain brittle. It's possible to do that robustly with a metalua extension, but it seems that we can get something pretty decent in plain Lua as well.

This is a question worth re-investigating once the dynamic version works fine, though, especially if it proves too slow in some contexts.
 
> some methods are "positional selectors": nodes are selected according to their position,
> relative to the node which satisfied the predicate. [...]
> :filter(p) is also a selector; it only selects nodes which satisfy predicate p.

> I think it helps clarify to think of it that way: filter is a special
> type of positional selector. In the code, it looked like these were
> two distinct concepts.

Implementation details happen to differ indeed, mostly because :filter() doesn't affect the traversal strategy, but this shouldn't be the user's business. Conceptually, filters are (very local) positional selectors.
 
> You can probably even express (though maybe
> not efficiently) all positional selectors in terms of filter because
> each filter is passed the entire path of nodes up to the root.

It isn't impossible, but I think the complexity would degrade from O(n) to O(n log n) for :under(), and from O(n) to O(n^2) for :after(). Hardly an improvement :)

> Note: your selectors generally maintain some type of state across node
> visits. Something else that maintains state across node visits is
> lexical scope. Therefore, these both might share a pattern.

They do, and in my latest prototype, scope info is kept in the state.
 
> Maybe
> the maintenance of the "path of nodes up to the root" can likewise
> also be a type of walk state that can be moved out of the core
> behavior and be made optional for only those queries that need it.

I'm not sure that it represents a significant penalty, but it's not hard to check. However, there's a problem: predicates and :foreach arguments might use parent nodes, and I'd need them to somehow advertise whether they need it or not, so the usability cost is significant.
 

> Another point which hasn't been addressed, but is extremely important, is scoped variable handling.
> [...] The way I see it, variables have a binder which "create" them. For instance,
> "`Function{ {`Id 'a', `Id 'b'}, body }" is the binder of the variables named "a" or "b" which occur
> in "body". Each occurrence of a local variable has one binder; each binder has zero, one,
> or many occurrences of its variable. Global variables are, by definition, the variable occurrences
> without a binder.
> Maybe the binder shouldn't be defined as a node, but as a node with an index. This way,
> each binder represents a single variable. In the example above, the binder of "b" in "body"
> is "{node=`Function{ {`Id 'a', `Id 'b'}, body }; indice=2}". The exact representation of a binder
> will be determined during implementation.

> In comparison in LuaInspect, if ast.tag == 'Id', then
> ast.localdefinition refers directly to the 'Id' node in the binder
> (not to the binder itself).

After some experimentation, it seems indeed that the `Id node is a good representation of the binder. In most relevant contexts, it's presented with its parents in a query, so all relevant info can be retrieved.
 
> ast.localdefinition is nil for global
> variables. It also happens to be the case that ast.localdefinition ==
> ast for the 'Id' node in the binder itself. When storing information
> about some lexical variable (e.g. a flag indicating whether the
> variable is referred to (isused) or mutated (isset) anywhere else),
> the ast.localdefinition node is used as the key or the storage
> location for this information.

I'm a bit wary of stored decorations. They might be useful for a given application, but they create more bookkeeping to keep synchronized, so they're dangerous in a generic service.

I think it's important for treequery to be able to recompose such information on the fly. If a user needs to go back repeatedly to that sort of information, on a piece of code which he won't modify, then it's easy for him to take note of whatever information he will need again. Finding is up to treequery, caching is up to the user.
 
> var.occurence(b)(n)
> Here's another class of use cases to consider, one that I've needed in
> more than one project: find all local variables that are constant (not
> assigned to outside of the binder), are unused (e.g. for dead-code
> elimination or typo detection), or mask another lexical of the same
> name. In these cases you're dealing with more than one variable per
> walk.

When walking, var.binder(id) returns a variable's binder in O(1). So by keeping a count, for each binder, of how many times it's used, you get the information you want in O(n). As for variable shadowing, handling it correctly is part of the var.* functions' job.
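As a sketch of that O(n) recipe (not runnable against the prototype: the final treequery API and the M.var.binder accessor are assumptions drawn from this thread):

```lua
-- Hypothetical: find unused locals in a single walk. The names
-- M.treequery, M.var.binder and the foreach calling convention are
-- assumptions about the eventual API, not confirmed signatures.
local function unused_locals(ast)
    local count = { }
    M.treequery(ast) :filter 'Id' :foreach (function(id, ...)
        local binder = M.var.binder(id, ...)  -- assumed nil for globals
        if binder then count[binder] = (count[binder] or 0) + 1 end
    end)
    local unused = { }
    for binder, n in pairs(count) do
        -- the binder's own `Id is itself one occurrence
        if n == 1 then table.insert(unused, binder) end
    end
    return unused
end
```

The same counting table also answers the "constant local" question: count assignments per binder instead of occurrences.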


Stefan Talpalaru

Oct 5, 2011, 5:53:47 AM
to met...@googlegroups.com



David Manura

Oct 5, 2011, 9:32:03 PM
to met...@googlegroups.com
On Wed, Oct 5, 2011 at 5:48 AM, Fabien Fleutot <fle...@gmail.com> wrote:
> loadstring() hacks cause many problems, most notably of scope, and are
> doomed to remain brittle.

I don't agree: the basic concept is, like a compiler, to translate
some mathematical query expression into Lua source as an intermediate
representation (IR) that is then compiled into bytecode via loadstring
(and optionally further to machine code under a JIT). It is possible
to do this all robustly. However, under a JIT, I wonder if the trace
recording obviates this. In the case of Metalua, you go straight to
bytecode, bypassing the Lua source IR, but despite some additional
flexibility, some consider that a disadvantage (e.g. the current state
of Metalua's incompatibility with 5.2 and LuaJIT). The addition of
goto in 5.2beta partly had this topic in mind. True, this is somewhat
beside the point at the moment, for reasons you state below :)

Fabien Fleutot

Oct 6, 2011, 3:12:29 AM
to met...@googlegroups.com
On Thu, Oct 6, 2011 at 3:32 AM, David Manura <dm....@math2.org> wrote:
> On Wed, Oct 5, 2011 at 5:48 AM, Fabien Fleutot <fle...@gmail.com> wrote:
>> loadstring() hacks cause many problems, most notably of scope, and are
>> doomed to remain brittle.
>
> I don't agree: [...] It is possible to do this all robustly.

Then the result won't be much simpler than Metalua.

It is possible if you apply your source transformation to the whole file. And then, if you have a second transformation to apply, they will most likely not compose (call the two extensions A and B: the original file is written in Lua+B+A, and should produce an output in Lua+B; since Lua+B isn't Lua, there's no reason to expect the first preprocessor to handle it correctly).

If you generate+loadstring some source code, the local variables won't pass across the loadstring barrier. You might hack something to pass some of them as fake globals in a hacked fenv, but I'd bet there will be corner cases where this isn't possible, and it precludes the user from using fenv hacks himself.
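The scope barrier is easy to demonstrate in plain Lua 5.1 (the workaround shown is the fake-globals-through-fenv hack mentioned above):

```lua
-- A loadstring'ed chunk doesn't see the enclosing locals, only its
-- environment table (the globals, by default).
local x = 42
local chunk = loadstring("return x")
assert(chunk() == nil)   -- the local x is invisible inside the chunk

-- Passing the local as a fake global through a hacked fenv:
setfenv(chunk, setmetatable({x = x}, {__index = _G}))
assert(chunk() == 42)
```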

You'll quickly end up with a need for a full compiler with modular extensibility. You can make it produce Lua source code so that it's not exactly Metalua, but it's already close enough. Actually, here's a source-to-source compiler based on Lua:

    require 'metalua.compiler'
    require 'metalua.ast_to_string'

    function compile(src)
        local ast = mlc.luastring_to_ast(src)
        return ast_to_string(ast)
    end

What makes a software engineering tool brittle is the introduction of subtle restrictions and other things that require long-term extra care. It might work today, but it can break in a subtle way, in a place seemingly unrelated to where it's used (cue variable capture by C #defines). When it doesn't work, it often won't produce a good error diagnostic (cue C++ templates). When trying to analyze source code with a crude lexer, it will misinterpret any idiom that the designer didn't explicitly test (cue doxygen-like systems which don't embed a complete parser).

David Manura

Oct 6, 2011, 9:26:22 PM
to met...@googlegroups.com
On Thu, Oct 6, 2011 at 3:12 AM, Fabien Fleutot <fle...@gmail.com> wrote:
> It is possible if you apply your source transformation to the whole file.
> And then, if you have a 2nd transformation to apply, they will most likely
> not compose together (call the two extensions A and B: the original file in
> written in Lua+B+A, and should produce an output in Lua+B. Since Lua+B isn't
> Lua, there's no reason to expect the 1st preprocessor to handle it correctly.
> If you generate+loadstring some source code, the local variables won't pass
> across the loadstring barrier.

I follow that, but I had something different in mind, as prototyped here:
https://gist.github.com/4839a28430f3b1ad7a8d .
