Observation: remote execution services don't deal with Command.arguments[0] properly

Ed Schouten

unread,

Nov 12, 2021, 9:19:10 AM11/12/21

to Remote Execution APIs Working Group, Jeremiah Bonney, George Gensure, Ed Baunton

Hi there,

This is a follow up from what we discussed during the last monthly
meeting. I think it was Jeremiah (cc) who inquired about how
Command.arguments[0] gets interpreted. The latest revision of REv2
says this:

// The arguments to the command. The first argument must be the path to the
// executable, which must be either a relative path, in which case it is
// evaluated with respect to the input root, or an absolute path.
repeated string arguments = 1;

Let's disregard arguments[0] being an absolute path for the moment, as
that case is trivial and doesn't require any discussion. I looked at
how Buildbarn implements arguments[0] being relative, and it currently
gets it wrong:

- If it's a relative path consisting of a single component, it
currently does a $PATH lookup.
- If it consists of multiple components, it gets resolved relative to
the working directory of the action; not the input root.

I am working on a change to get this addressed. At the same time I
looked at some of the other remote execution implementations.

=== Bazel Buildfarm ===

Here's the relevant code from Bazel Buildfarm:

https://github.com/bazelbuild/bazel-buildfarm/blob/fb0aca2db87e035173265a05b1ee2ff85f97cdc1/src/main/java/build/buildfarm/worker/Executor.java#L212-L235
https://github.com/bazelbuild/bazel-buildfarm/blob/fb0aca2db87e035173265a05b1ee2ff85f97cdc1/src/main/java/build/buildfarm/worker/Executor.java#L401-L437

On Windows this code certainly resolves the executable relative to the
working directory; not the input root. On Linux it may do $PATH
lookups, but I'm not 100% sure what Java ProcessBuilder does. It may
be worth investigating.

=== Buildbox (part of Buildgrid) ===

There are various runner commands, but let's look at the one that just
executes stuff on the host:

https://gitlab.com/BuildGrid/buildbox/buildbox-run-hosttools/-/blob/f953eb90e743e21a68a8027e721471a65670d30c/buildboxrun-hosttools/buildboxrun_hosttools.cpp#L131-141
https://gitlab.com/BuildGrid/buildbox/buildbox-run-hosttools/-/blob/f953eb90e743e21a68a8027e721471a65670d30c/buildboxrun-hosttools/buildboxrun_hosttools.cpp#L94-106
https://gitlab.com/BuildGrid/buildbox/buildbox-run-hosttools/-/blob/f953eb90e743e21a68a8027e721471a65670d30c/buildboxrun-hosttools/buildboxrun_hosttools_pathprefixutils.cpp#L23-43
https://gitlab.com/BuildGrid/buildbox/buildbox-common/-/blob/702f50be659c3597eb0cdef99bbd3affc78babb1/buildbox-common/buildboxcommon/buildboxcommon_systemutils.cpp#L56-57

So it looks like there is some code there for prefixing/transforming
pathnames based on the path of the input root, but it isn't exactly
what we need. It's for prefixing absolute paths in arguments[1:] with
some value. My suspicion is that the path is resolved local to the
working directory instead of the input root. Because execv() is used
instead of execvp(), no $PATH resolution is performed, which is good.

=== Summary ===

It looks like Buildbarn, Bazel Buildfarm and Buildgrid all get it
wrong right now. Let's work towards getting this fixed!

--
Ed Schouten <e...@nuxi.nl>

Ed Schouten

unread,

Nov 12, 2021, 10:07:52 AM11/12/21

to Remote Execution APIs Working Group, Jeremiah Bonney, George Gensure, Ed Baunton

Hi there,

Op vr 12 nov. 2021 om 15:18 schreef Ed Schouten <e...@nuxi.nl>:

> It looks like Buildbarn, Bazel Buildfarm and Buildgrid all get it
> wrong right now. Let's work towards getting this fixed!

After I sent out this message, I realized it's a pretty complex topic.
Sure, I can live with not having $PATH lookups in there, but doing
resolution relative to the input root directory instead of the working
directory does have some weirdness to it: the value of argv[0] as
visible to the child process must also be rewritten. For example,
consider this case:

working_directory: "foo"
arguments: ["bar/baz", "someoutputfile.o"]

So this will run ${input_root}/bar/baz inside directory
${input_root}/foo. But what's the value of argv[0] inside the child
process supposed to be here? I hope it's not "bar/baz", as that causes
confusion in case the process wants to find its own location (and
doesn't use /proc/self/exe). Must it be "../bar/baz", or is it
permitted to transform it to an absolute path? If it must remain
relative, what happens if "bar" is a symbolic link to some other place
in the input root? The resolution logic for that may become pretty
hairy.

I'd personally find it a bit more intuitive if resolution was done
relative to the working directory. That way it's always safe to pass
on the value of argv[0] in literal form. Considering that all three
major Open Source implementations already do that I sent out this PR
to get the spec updated accordingly:

https://github.com/bazelbuild/remote-apis/pull/210

--
Ed Schouten <e...@nuxi.nl>

Steven Bergsieker

unread,

Nov 12, 2021, 10:17:59 AM11/12/21

to Ed Schouten, Remote Execution APIs Working Group, Jeremiah Bonney, George Gensure, Ed Baunton

As far as I can tell by inspection (I haven't actually tested), RBE also gets this wrong. It correctly both uses the working directory rather than the input root, and it still does $PATH resolution.

Thinking more about this, I'm not sure that the API really describes the semantic we want--it's possible that $PATH resolution is actually desirable. Excluding it was intended to reduce uncertainty--it requires the user to know enough about the remote environment to be able to specify exactly what executable to use, regardless of the remote image, $PATH, shell environment, etc. This is probably a desirable property for a remote execution system focused on hermetic, cacheable compilation actions, but it's less good for other cases. If the user just wants to execute a command using a common system tool and doesn't care about what type of worker it's run on, then requiring a full path is counterproductive.

I'm now wondering if the API should specify that resolution is dependent on the properties of the remote system. Users who care can still specify an absolute path, but those who don't care aren't forced to jump through extra hoops.

--
You received this message because you are subscribed to the Google Groups "Remote Execution APIs Working Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to remote-execution...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/remote-execution-apis/CABh_MKmiQ6tgm%2BwBPvKG8th8hreMKzgr5k0Px-ARxmrhJCO99A%40mail.gmail.com.

Ed Schouten

unread,

Nov 12, 2021, 11:09:08 AM11/12/21

to Steven Bergsieker, Remote Execution APIs Working Group, Jeremiah Bonney, George Gensure, Ed Baunton

Op vr 12 nov. 2021 om 16:17 schreef Steven Bergsieker <bergs...@google.com>:

>
> As far as I can tell by inspection (I haven't actually tested), RBE also gets this wrong. It correctly both uses the working directory rather than the input root, and it still does $PATH resolution.
>
> Thinking more about this, I'm not sure that the API really describes the semantic we want--it's possible that $PATH resolution is actually desirable.

Then it sounds like RBE, Buildbarn, and Bazel Buildfarm on Linux are
all consistent with each other: lookup relative to the working
directory and have $PATH resolution enabled.

In that case, let me go ahead and rephrase PR 210 to state that $PATH
resolution should be performed (if the operating system in question
does provide such a feature).

Best regards,
--
Ed Schouten <e...@nuxi.nl>

John Millikin

unread,

Nov 12, 2021, 6:40:14 PM11/12/21

to Ed Schouten, Steven Bergsieker, Remote Execution APIs Working Group, Jeremiah Bonney, George Gensure, Ed Baunton

Changing the resolution of non-absolute argv[0] from input_root to workingdir seems like a breaking change for the implementations that are currently in compliance with the spec.

I understand that not all implementations comply at the moment, but that seems like a problem that could be solved through additional testing on their part, rather than modifying the spec to match their incorrect behavior.
If there are implementations that want the ability to resolve relative to the workingdir, maybe changing the handling of commands with prefix "./" would be a viable minimal change? It's still backwards-incompatible, but the clients I'm aware of that send relative paths (such as Bazel) use "foo/bar/baz" rather than "./foo/bar/baz", so the practical disruption would be reduced.

Alternatively, a new enum or boolean flag field somewhere to control the resolution of relative paths.

Regarding the use of $PATH to resolve commands that don't contain a path separator, this seems like it's introducing undefined behavior into a part of the protocol where predictability and stability are important (due to the action caching + difficulty for end users to debug failures).

Different behavior between an argv[0] of "foo" and "bar/foo" would mean that moving a binary within the input_root (either to, or out of, the top level) would have unpredictable consequences on which binary gets run by a command.
Clients that want behavior to be "whatever the shell does" can wrap their command in their preferred shell prior to submitting it to the remote execution service.
Handling the case of an unset or empty $PATH is especially difficult because the system behavior can vary between versions or configurations of the same OS.

GNU glibc changed its behavior in this case a few years ago, which blocked us from upgrading the container image used for build actions until we had identified and fixed all the Bazel rules that were accidentally depending on unspecified $PATH handling.
Some of these were upstream, for example Kubernetes has a call to bare "cp" with no environment set: https://github.com/kubernetes/repo-infra/blob/v0.2.2/defs/go.bzl#L54-L60

--
You received this message because you are subscribed to the Google Groups "Remote Execution APIs Working Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to remote-execution...@googlegroups.com.

To view this discussion on the web visit https://groups.google.com/d/msgid/remote-execution-apis/CABh_MKm2JFX0MB4a7npXZhvaFqd0EYGk3FoPP27kYQg4EipMBA%40mail.gmail.com.

Ed Schouten

unread,

Nov 13, 2021, 11:19:24 AM11/13/21

to John Millikin, Steven Bergsieker, Remote Execution APIs Working Group, Jeremiah Bonney, George Gensure, Ed Baunton

John,

Op za 13 nov. 2021 om 00:40 schreef John Millikin <jmil...@stripe.com>:

> Changing the resolution of non-absolute argv[0] from input_root to workingdir seems like a breaking change for the implementations that are currently in compliance with the spec.

That's why I'm trying to get an inventory of what existing
implementations do. Here's another data point: Goma Server also
resolves relative to the working directory and respects $PATH:

https://chromium.googlesource.com/infra/goma/server/+/refs/heads/main/remoteexec/exec.go#738

So far I haven't been able to find a single implementation that does
resolve the path relative to the input root. I am very interested in
hearing what EngFlow does, for example. It may be the case that "the
implementations that are currently in compliance with the spec" is, in
fact, the empty set. If that is the case, then it would be a complete
waste of time for everyone involved to stick to the spec and require
that everyone follows suit. It simply means that the spec was written
in a non-intuitive way.

I think that discussing whether we should permit PATH resolution is
worth having, considering that Bazel Buildfarm on Windows and
Buildgrid deviate from the rest.

> Some of these were upstream, for example Kubernetes has a call to bare "cp" with no environment set: https://github.com/kubernetes/repo-infra/blob/v0.2.2/defs/go.bzl#L54-L60

I think you are providing an argument here for why we *should* in fact
support PATH resolution: Bazel already does it for local execution. I
wouldn't consider it predictable if local execution (at least through
Bazel) has semantics that differ from how remote execution works. I
disagree with your statement that PATH resolution is unpredictable:

- The Remote Execution protocol allows specifying environment
variables. I think that almost everyone that runs a somewhat serious
remote execution setup enables --incompatible_strict_action_env in
combination with --action_env=PATH=.... If we require that PATH
resolution must consider the value of PATH specified in the Command
message, then the list of paths to consider on a remote worker is
fully predictable.
- There is always a way to prevent PATH lookups: prepend relative
paths with "./"

Sure, this doesn't rule out that PATH resolution doesn't always yield
the same results, but that implies some form of non-deterministic
worker configuration that is no different from running a fleet of
workers that have deviating file system contents, kernel versions, CPU
frequencies or memory sizes. It's up to cluster operators to prevent
this from happening.

> Clients that want behavior to be "whatever the shell does" can wrap their command in their preferred shell prior to submitting it to the remote execution service.

Slight nitpick: Whether we do PATH resolution is unrelated to the
subject of shell interpreters, as shell interpreters aren't the only
tools that process PATH. This is about whether we want to use libc
functions execv() or execvp(). POSIX provides pretty clear semantics
for those functions and how they process the PATH environment
variable:

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08_03

Let's make sure the discussion remains focused on that.

--
Ed Schouten <e...@nuxi.nl>

John Millikin

unread,

Nov 16, 2021, 2:13:22 AM11/16/21

to Ed Schouten, Steven Bergsieker, Remote Execution APIs Working Group, Jeremiah Bonney, George Gensure, Ed Baunton

On Sun, Nov 14, 2021 at 1:19 AM Ed Schouten <e...@nuxi.nl> wrote:

That's why I'm trying to get an inventory of what existing

implementations do. [...]

So far I haven't been able to find a single implementation that does
resolve the path relative to the input root.

It's not clear to me that surveying open-source implementations will provide

accurate data here, considering the main use case of Bazel's remote execution

being secure distributed builds of large-scale codebases found within corporate

environments.

IMO we should assume that the most depended-on implementations of Bazel

remote execution are (1) not published to Github, and (2) are maintained by

teams capable of and used to writing against a stable spec. Changing the spec

would make it more difficult to decide what behavior a client and server ought

to implement (which version is "correct"?).

> Some of these were upstream, for example Kubernetes has a call to bare "cp" with no environment set: https://github.com/kubernetes/repo-infra/blob/v0.2.2/defs/go.bzl#L54-L60

I think you are providing an argument here for why we *should* in fact
support PATH resolution: Bazel already does it for local execution. I
wouldn't consider it predictable if local execution (at least through
Bazel) has semantics that differ from how remote execution works. I
disagree with your statement that PATH resolution is unpredictable:

I think the Kubernetes bug is more of an argument for changing Bazel's

local execution to match the remote execution semantics. At present, it's

too easy for locally executed build actions to depend on files from outside

the sandbox.

However, the behavior of Bazel when executing locally doesn't seem

important to the decision of whether the remote execution protocol should

break backwards compatibility in this case.

- The Remote Execution protocol allows specifying environment
variables. I think that almost everyone that runs a somewhat serious
remote execution setup enables --incompatible_strict_action_env in
combination with --action_env=PATH=.... If we require that PATH
resolution must consider the value of PATH specified in the Command
message, then the list of paths to consider on a remote worker is
fully predictable.

I think it's unlikely that --action_env=PATH= sees much use in remote

execution, because the client should be specifying the precise command

to run in every case. Nearly all build actions run with an empty environment,

and rules that populate environment variables generally do something

tool-specific.

The exception to this, the type of action for which $PATH is relevant, are

actions that run within a shell using a command like:

arguments: "/bin/bash"

arguments: "-c"

arguments: "the core logic of the build action"

But even then, actions with special $PATH requirements will probably embed

them into the top of the shell script. This is why I think the issue of PATH

resolution isn't relevant to the remote execution API.

- There is always a way to prevent PATH lookups: prepend relative
paths with "./"

Sure, this doesn't rule out that PATH resolution doesn't always yield
the same results, but that implies some form of non-deterministic
worker configuration that is no different from running a fleet of
workers that have deviating file system contents, kernel versions, CPU
frequencies or memory sizes. It's up to cluster operators to prevent
this from happening.

I'm a little confused about what this section is claiming, so please forgive me

if I've misunderstood, but it sounds like you're describing a system where the

build actions are running in direct contact with the worker machine's filesystem

and kernel.

I would expect remote build actions to always execute in one of two modes:

"Bring your own FS", where the action specifies a complete input root (including libc).
Specifying a Platform with a content-addressed filesystem archive, such as an OCI container image (for example the RBE "container-image" platform property).

In either case, the filesystem the build action experiences will be contained in a chroot, and

will be completely separate from what the host machine's root filesystem might look like.

> Clients that want behavior to be "whatever the shell does" can wrap their command in their preferred shell prior to submitting it to the remote execution service.

Slight nitpick: Whether we do PATH resolution is unrelated to the
subject of shell interpreters, as shell interpreters aren't the only
tools that process PATH. This is about whether we want to use libc
functions execv() or execvp(). POSIX provides pretty clear semantics
for those functions and how they process the PATH environment
variable:

https://pubs.opengroup.org/onlinepubs/9699919799/basedefs/V1_chap08.html#tag_08_03

Let's make sure the discussion remains focused on that.

As I mentioned in my previous email, the behavior of libc $PATH resolution

is not stable and can vary between libc implementations or versions.

Using this example program:

#include <stdio.h>
#include <unistd.h>
int main() {
char *const chroot_argv[] = {"chroot", "--help", NULL};
execvp("chroot", chroot_argv);
perror("execvp(\"chroot\")");
return 0;
}

Running in Ubuntu Focal and Xenial produces different behavior, because

older GNU libc searches the current directory and newer GNU libc doesn't:

root@focal:/# PATH= /home/user/test_exec
execvp("chroot"): No such file or directory

root@focal:/sbin# PATH= /home/user/test_exec
Usage: chroot [OPTION] NEWROOT [COMMAND [ARG]...]
[...]

root@xenial:/# PATH= /home/user/test_exec
execvp("chroot"): No such file or directory
root@xenial:/sbin# PATH= /home/user/test_exec
execvp("chroot"): No such file or directory

I think it would be unreasonable to expect the remote executor to inspect the

action to decide which libc the action expects to run with. IMO it's much better

to have the API specify a simple and deterministic logic (the current spec), and

have clients with more complex needs run their actions in a wrapper.

Furthermore, baking in a dependency on libc strikes me as unappealing in a

time when most new infrastructure software is moving away from C as an

implementation language. Defining the remote execution API's behavior by

reference to libc (even a specific implementation and version of libc) makes it

more difficult to write conforming remote executors in Go, Rust, or other

languages that may have non-libc semantics for spawning subprocesses.

Ed Schouten

unread,

Jan 12, 2022, 5:37:59 AM1/12/22

to John Millikin, Steven Bergsieker, Remote Execution APIs Working Group, Jeremiah Bonney, George Gensure, Ed Baunton

Hi John,

Sorry for not replying sooner; I first wanted to discuss this topic in
more detail in the working group meetings before continuing.

First and foremost, thanks for raising the issue that certain versions
of glibc are inconsistent in how they handle PATH. That said,
considering the fact that newer versions of glibc do get it right
(i.e., aligning with what's described in POSIX), I think we should
treat this as an outlier; an implementation bug that's out of scope
for the Remote Execution API.

During the last two working group meetings, we concluded the following:

- We are not aware of any implementations that resolve executable
paths relative to the input root. In addition to all Open Source
implementations considered, authors of proprietary implementations
such as EngFlow and BuildBuddy have been able to state that all of
those resolve paths relative to the working directory.

- Most, if not all attendees of the last two meetings seem to think
that providing support for PATH resolution makes sense. Reasons for
this include:
- the majority of the implementations represented already do this,
- clients also tend to do this for local execution,
- PATH resolution can trivially be disabled by prepending "./" to
paths that are already relative,
- the fact that if you copy the contents of Command message into a
shell (e.g., as printed by Bazel's --verbose_failures), it would also
be subject to PATH resolution.

This is why we will work towards altering the spec to require
resolution relative to the working directory, and respecting PATH.

We do realize that this is a backward incompatible change. This is
unfortunate, but the consensus during the meetings was that it is
justified taking all facts into account. One concern that was raised
was that we should not apply this change retroactively, which makes
sense. This is why we will state that the new behavior is only part of
REv2.3 or later.

I have just gone ahead and updated the PR to process most of the
feedback provided:

https://github.com/bazelbuild/remote-apis/pull/210

Please take a look and let me know whether the change is acceptable.
If no major concerns are raised, we should be able to merge this
before the next working group meeting.

Reply all

Reply to author

Forward