Why is recursion not allowed in skylark?

943 views
Skip to first unread message

Paul Johnston

unread,
Aug 3, 2016, 11:04:40 AM8/3/16
to bazel-discuss
As noted in the "Differences with Python" section of http://www.bazel.io/docs/skylark/concepts.html#differences-with-python, skylark does not allow function recursion.  Why is that?  

Jon Brandvein

unread,
Aug 4, 2016, 11:45:45 AM8/4/16
to bazel-discuss
It's part of the general philosophy of keeping .bzl code from becoming overly complicated and computationally expensive. Note that for similar reasons, unbounded iteration is prevented by disallowing While loops, and by making For loops iterate over a copy of the list on their right-hand side (so that the iteration cannot be extended indefinitely once the loop is underway).

If you are using recursion, consider whether you are trying to do too much in Skylark. If you believe your application really warrants recursion, we'd like to hear about it.

Jon

Paul Johnston

unread,
Aug 4, 2016, 3:56:00 PM8/4/16
to bazel-discuss
In writing a function to load workspace dependencies, I wanted to do https://github.com/pubref/rules_protobuf/blob/master/bzl/require.bzl#L64-L71.

I agree with the reasoning though.  Just wanted to know.

Matthew Woehlke

unread,
May 10, 2017, 2:37:47 PM5/10/17
to bazel-discuss
On Thursday, August 4, 2016 at 11:45:45 AM UTC-4, Jon Brandvein wrote:
If you are using recursion, consider whether you are trying to do too much in Skylark. If you believe your application really warrants recursion, we'd like to hear about it.

I'm trying to implement installation as an extension (which is a bad approach for several reasons, but expedient). One of the things I need to do is be able to strip prefixes from paths, where the prefix may contain globs (both "/*/" and "/**/"). It would probably be better, of course, if this was a built-in function, but I managed to *almost* implement it in Skylark.

The "natural" implementation is to split both the path and prefix into components, check if the first component matches, and if so, recursively call the function with the first part of both path and prefix dropped. This works especially well for parsing "**" globs (for which the strategy is to make a recursive call that strips N components of the path, for N == number of components¹ to 1, counted backward).

Alas, I can't do that :'-(.

I've managed to work around it by rewriting the tail-recursion as a for loop and having "**" call a copy of the function, but this means I have to duplicate code. (In this case, the copy just doesn't handle "**", so only at most one "**" is supported. I could support up to N "**"s by having N+1 copies of the function, but for now, only supporting one "**" is enough.)

(¹ And yes, I'm aware this is "wrong", as it will superfluously try some combinations that have no chance of succeeding.)

IMHO, intentionally crippling the build system stinks. The real world has corner cases like this, like needing to read version information from a file to determine the name of a file that will be used for other build rules, etc. Other build systems can handle these sorts of corner cases.

--
Matthew

László Csomor

unread,
May 11, 2017, 2:48:38 AM5/11/17
to Matthew Woehlke, bazel-discuss
Hi Matthew,

If you can rewrite the recursive logic as a for loop, try to find something that is a known upper bound for the number of iterations, such as len(path), and test for and break when the imaginary while-loop's terminating condition is met.

IMHO, intentionally crippling the build system stinks. The real world has corner cases like this, like needing to read version information from a file to determine the name of a file that will be used for other build rules, etc.

I hear you, and I'm sorry to say that we have to make decisions and draw the line somewhere. I'm not on the Skylark team and didn't partake in designing it, but as I see it, there's a long tail of things we could do, but have limited resources and headcount so we have to prioritize and balance features against the amount of effort implementing *and maintaining* them would entail. It's not perfect perhaps, but good enough, and perfect would require disproportionately more effort.

> Other build systems can handle these sorts of corner cases.

Maybe so, but at what cost of performance / scalability / maintainability / correctness / reproducibility? It's always a trade-off.


--
László Csomor | Software Engineer | laszlo...@google.com

Google Germany GmbH | Erika-Mann-Str. 33 | 80636 München | Germany
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Matthew Scott Sucherman, Paul Terence Manicle

--
You received this message because you are subscribed to the Google Groups "bazel-discuss" group.
To unsubscribe from this group and stop receiving emails from it, send an email to bazel-discuss+unsubscribe@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/bazel-discuss/96b2e93f-3e35-40a0-a927-8793c866e5b8%40googlegroups.com.

For more options, visit https://groups.google.com/d/optout.

Matthew Woehlke

unread,
May 11, 2017, 10:05:13 AM5/11/17
to bazel-discuss, László Csomor
On 2017-05-11 02:48, László Csomor wrote:
> If you can rewrite the recursive logic as a for loop, try to find something
> that is a known upper bound for the number of iterations, such as
> len(path), and test for and break when the imaginary while-loop's
> terminating condition is met.

There are two recursive parts. One is the tail-recursive "natural" way
to implement the base function. This I replaced with a for loop.

The other is handling of "**" globs. This one needs to strip the
components that have been matched so far and then attempt a match on the
remainder, starting with the shortest possible match (greedy) and
continuing through the longest possible match. To be "fully correct",
the inner body of this loop needs to match the entire function.

Here's the function:

def __remove_prefix(path, prefix):
# If the prefix has more parts than the path, failure is certain.
if len(prefix) > len(path):
return None

# Iterate over components to determine if a match exists.
for n in range(len(prefix)):
if prefix[n] == path[n]:
continue

elif prefix[n] == "*":
continue

elif prefix[n] == "**":
if n + 1 == len(prefix):
return path[-1]

for t in reversed(range(n, len(path) - (len(prefix) - n - 2))):
# *** RECURSION NEEDS TO HAPPEN HERE ***
x = __remove_prefix(path[t:], prefix[n+1:])
if x != None:
return x

return None

else:
return None

return "/".join(path[len(prefix):])

Do you know some way to implement this in Skylark?

(Note: this is a helper function that takes a path and prefix that have
already been split into component lists.)

--
Matthew

László Csomor

unread,
May 11, 2017, 1:56:45 PM5/11/17
to Matthew Woehlke, bazel-discuss, Laurent Le Brun
Globs are regular expressions, which define regular grammars, so glob matching is equivalent to solving a containment problem for a regular language. This requires a finite state machine, so the problem is solvable without recursion.

To illustrate: the glob "a/b/*/c/**/d" is the same as regex "a/b/[^/]+/c(/.+)+/d"
Unfortunately Skylark doesn't support regex matching on strings, so right now you have to implement glob matching on your own.  Or, if you can, use native.glob() if your input paths are in a filegroup -- that'd be by far the easiest solution.

I admit implementing glob matching in pure Skylark is painfully complicated, I spent a good couple hours on it now but I don't have the tenacity to finish. My naive algorithm is this:

1. Split up the glob patterns at "/"s, then create groups by splitting at the "**"s. Example: "a/b/*/c/**/d/**/e/f" --> [a, b, *, c, **, d, **, e, f] --> [[a,b,*, c], [d], [e,f]]
2. Find all occurrences of the first elements of each group, that gives you a list of lists. These are the potential starting positions of the groups in the input.
3. Try to match each group from each starting position, e.g. if "a" (in the first group) was found at positions [0, 2, 3], then check if the next character ("b") is at positions [1, 3, 4]. Remove starting positions that are dead ends. Do this for the starting positions of all groups. In the end you have the list of starting positions where each group was fully found.
4. Finally you need to create an ascending sequence of the starting positions, and see if the groups could fit fully, leaving at least one element for the "**" inbetween. In other words, if the first group ([a,b,*,c]) was found in positions 0 and 8, then the next group can only start at 0+4+1=5 or at 8+4+1=13, to fulfill the "a/b/*/c/**" fragment of the pattern. So if the second group was found at positions [5, 10], you know that the input cannot possibly match, but if it was found on [5, 10, 15], then you know that it may. You need to do this aligning for all groups.

Good luck!






--
László Csomor | Software Engineer | laszlo...@google.com

Google Germany GmbH | Erika-Mann-Str. 33 | 80636 München | Germany
Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Matthew Scott Sucherman, Paul Terence Manicle

Laurent Le Brun

unread,
May 11, 2017, 2:12:58 PM5/11/17
to László Csomor, Matthew Woehlke, bazel-discuss
Why do you need this?
Where do "/*/" and "/**/" come from?
Can you give a bit more context?
--
Laurent

Matthew Woehlke

unread,
May 11, 2017, 2:19:29 PM5/11/17
to László Csomor, bazel-discuss, Laurent Le Brun
On 2017-05-11 13:56, László Csomor wrote:
> To illustrate: the glob "a/b/*/c/**/d" is the same as regex
> "a/b/[^/]+/c(/.+)+/d"
> Unfortunately Skylark doesn't support regex matching on strings, so right
> now you have to implement glob matching on your own. Or, if you can, use
> native.glob() if your input paths are in a filegroup -- that'd be by far
> the easiest solution.

I don't think that will work? Besides that I don't necessarily have a
file group, the actual problem is not *selecting* files out of a set, it
is *modifying* the file path by removing a prefix which may contain
globs. For example:

strip_prefix("foo/bar/none.h", "**/") -> "none.h"

> My naive algorithm is this:
> [something complicated]

Um... yeah. No, thanks. A recursive algorithm is MUCH easier to
implement (basically, I showed it in my previous mail) and understand.

For now I'm just going to stick with my original "solution", which is to
duplicate the entire helper function once, less "**" matching, and only
support one "**".

The original mail to which I replied was asking "[i]f you believe your
application really warrants recursion". I assert that the relative
complexity and ease of implementation of your algorithm vs. mine is a
clear "yes".

Whether it's *possible* to implement the algorithm without recursion,
and whether it's *desirable* are different questions.

--
Matthew

Matthew Woehlke

unread,
May 11, 2017, 3:31:12 PM5/11/17
to Laurent Le Brun, bazel-discuss
On 2017-05-11 14:12, Laurent Le Brun wrote:
> Why do you need this?
> Where do "/*/" and "/**/" come from?
> Can you give a bit more context?

I'm trying to implement installation for Drake¹. Now, there are a *lot*
of reasons why doing this as extensions is bad, but we need it last week
rather than in six months, and can afford to cut a lot of corners (e.g.
hard-coding the libdir to "lib", stuff that would be totally wrong or
never work on Windows).

For installation (this would be true in a "proper" implementation also,
though the prefix stripping could be implemented in Java in such case),
I need to be able to do stuff like:

install_files(
name = "install_cmake",
dest = "lib/lcm/cmake",
files = [
"lcm-cmake/lcmUtilities.cmake",
"lcmConfig.cmake", # TODO(mwoehlke-kitware) generate me!
"lcmConfigVersion.cmake", # TODO(mwoehlke-kitware) generate me!
],
strip_prefix = ["**/"],
)

...and:

# Thank you Bazel for making it INCREDIBLY difficult to get the path
# of an artifact as specified by the rule that created it
i = _remove_prefix(i, _join_paths("bazel-out/*/*", package_root))

Basically, I need to be able to remove leading path components from a
path, and I don't want to have to specify all possible literal prefixes
when a glob will suffice.

http://drake.mit.edu/)

--
Matthew

Jon Brandvein

unread,
May 15, 2017, 8:03:56 AM5/15/17
to bazel-discuss, laur...@google.com
If you want the path without the leading directories that are based on the host and the configuration, does File.shortpath suffice?

Matthew Woehlke

unread,
May 17, 2017, 11:28:05 AM5/17/17
to Jon Brandvein, bazel-discuss, laur...@google.com
On 2017-05-15 08:03, 'Jon Brandvein' via bazel-discuss wrote:
> If you want the path without the leading directories that are based on the
> host and the configuration, does File.shortpath
> <https://bazel.build/versions/master/docs/skylark/lib/File.html#short_path>
> suffice?

No; I get mysterious ".."s with that, or I still get extra path parts.
(Anyway that's only one use case, and it's actually more complicated
than that; it turns out I need the path relative to a *different*
target. AFAIK there is no built-in for that?)


Examples (`.path`, `.shortpath`, "correct"):

bazel-out/gcc-6-linux-opt/genfiles/tools/drake-config.cmake.
tools/drake-config.cmake
drake-config.cmake.

bazel-out/gcc-6-linux-opt/genfiles/external/lcm/lcmConfig.cmake.
../lcm/lcmConfig.cmake
lcmConfig.cmake.

bazel-out/gcc-6-linux-opt/genfiles/external/lcm/lcm/lcm_export.h.
../lcm/lcm/lcm_export.h
lcm/lcm_export.h.

bazel-out/gcc-6-linux-opt/bin/external/lcm/liblcm.a.
../lcm/liblcm.a
liblcm.a.

--
Matthew
Reply all
Reply to author
Forward
0 new messages