Remote Worker protocol: yet another attempt


Ed Schouten

Dec 24, 2019, 5:40:24 AM
to Remote Execution APIs Working Group, Jakob Buchgraber, George Gensure, Richard Dale, Sander Striker
Hi there,

Earlier this year we had some discussions regarding protocols used by
Remote Execution workers. Right now, every build cluster
implementation has its own scheduler <-> worker protocol. In addition
to that, there is the existing Remote Worker API that seems a bit more
complicated than what some of us desire.

I just wanted to let you folks know that earlier today, I published a
patchset for Buildbarn that should get merged to master within the
next week or so. It includes a new scheduler and worker client
implementation that uses a new protocol that I've been refining for
the last half-year:

https://github.com/buildbarn/bb-remote-execution/blob/eschouten/20191212-initial-code-drop/pkg/proto/remoteworker/remoteworker.proto

Some interesting aspects of it:

- There is just a single RPC left: Synchronize(). The idea is that
workers supply their current state and that the scheduler responds
with what the worker should be doing instead. This means that marking
an action as completed and requesting another piece of work is all
done through a single roundtrip.
- By having a single call, the state machine on both the worker and
scheduler remains relatively small.
- Workers are identified not by a string, but by a map<string,string>.
This allows a scheduler to automatically group workers based on
certain properties (data center, rack, physical server, worker thread,
etc.).
- Because the scheduler already needs to load the Action and Command
from the CAS (unfortunately), that information also gets sent from the
scheduler to the worker to prevent unnecessary CAS roundtrips.
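For quick reference, here is a heavily condensed sketch of the service shape. The linked remoteworker.proto is authoritative; the state messages are abridged to stubs here:

```protobuf
syntax = "proto3";

package buildbarn.remoteworker.sketch;

import "build/bazel/remote/execution/v2/remote_execution.proto";

service OperationQueue {
  // The single remaining RPC: the worker reports what it is currently
  // doing, and the scheduler replies with what it should be doing
  // instead. Reporting a completed action and requesting the next one
  // therefore fit in one roundtrip.
  rpc Synchronize(SynchronizeRequest) returns (SynchronizeResponse);
}

message SynchronizeRequest {
  // A map instead of a plain string, so the scheduler can group
  // workers by properties (data center, rack, server, thread, ...).
  map<string, string> worker_id = 1;
  // The REAPI platform this worker provides.
  build.bazel.remote.execution.v2.Platform platform = 2;
  CurrentState current_state = 3;
}

message SynchronizeResponse {
  // For new work, the scheduler inlines the Action and Command it
  // already loaded from the CAS, sparing the worker those roundtrips.
  DesiredState desired_state = 1;
}

// Stubs: the real messages carry the idle/executing state machine.
message CurrentState {}
message DesiredState {}
```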

Here is the client-side logic:
https://github.com/buildbarn/bb-remote-execution/blob/eschouten/20191212-initial-code-drop/pkg/builder/build_client.go
Here is the server-side logic:
https://github.com/buildbarn/bb-remote-execution/blob/eschouten/20191212-initial-code-drop/pkg/builder/in_memory_build_queue.go#L328-L400

My question is, what are your thoughts on this protocol? Could a
protocol like this serve as the basis for a standardized Remote Worker
API?

--
Ed Schouten <e...@nuxi.nl>

Sander Striker

Dec 31, 2019, 11:51:07 AM
to Ed Schouten, Remote Execution APIs Working Group, Jakob Buchgraber, George Gensure, Richard Dale, Sander Striker
Hi Ed,

I wonder if you could write up a proposal around the protocol, outlining the goals you are looking to achieve with it?  If your intention is to propose this as a v2 of the Worker API and to have workers standardize on this protocol, I think we need to weigh approaches to addressing the varying goals.

For example, the worker_id map may be a clever implementation - I don't know. There is very limited supporting motivation for doing it this way.

For later, Happy New Year!

Cheers,

Sander

--
You received this message because you are subscribed to the Google Groups "Remote Execution APIs Working Group" group.
To unsubscribe from this group and stop receiving emails from it, send an email to remote-execution...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/remote-execution-apis/CABh_MKnBbhGCuWz48hbPTXHGFgnwUahnq-pc0pbRrrDM4F28ZQ%40mail.gmail.com.

Eric Burnett

Jan 1, 2020, 10:34:39 PM
to Ed Schouten, Remote Execution APIs Working Group, Jakob Buchgraber, George Gensure, Richard Dale, Sander Striker
Hi Ed,

Thanks for sharing! Always interesting to have more protocol examples to examine. 

> My question is, what are your thoughts on this protocol?

Let's see. Commenting as I go, based on the protocol itself plus contrasting with the RWAPI (which I have most experience with). Off the cuff, so take everything with a grain of salt.
  • Meta-comment: I like the clarity of this API. It's simple and easy to read, which would definitely make new implementations easier.

  • The `Synchronize` method strongly parallels RWAPI's `UpdateBotSession`, including the ability to return a result and get a new work item in a single request, and inlining of the work item itself for performance. FWIW, we've also been very happy with this pattern.

  • You write "This allows a scheduler to automatically group workers based on certain properties (data center, rack, physical server, worker thread, etc.).", but the proto docs state "nor should individual elements be interpreted by the scheduler." I'm curious, how do you find this works out in practice? Does the scheduler truly ignore the semantic context of all properties and simply allow you to group by any key/value combination, or do you find you're interpreting these fields in practice?

  • I see you've not emulated the whole worker/device hierarchy. I think that's a good choice - we haven't leveraged it as much as we expected either. In hindsight, a flat list of key/value pairs as worker properties would have sufficed just fine (and is effectively what we use). I think worker_id could be that, though it conflates identity with properties which might be clearer left separate.

  • ...though we're moving towards running multiple leases per worker, so I think we're still happy with that bit of flexibility.

  • We debated using REAPI concepts (e.g. Platform) directly in the RWAPI for a long time, and ultimately decided against it. We've been fairly happy with that choice. A few examples:
    • SynchronizeRequest.platform being a REAPI platform directly implicitly assumes one worker should match one platform. We don't find that at all - workers have many properties clients don't need to specify explicitly (e.g. OS is usually implicit in something else more specific that the client is keying by, but a key property of the worker), and many properties are runtime-variable inputs from the Worker's perspective (e.g. `container-image` for us), making it highly variable what Platform a client specifies to select a given worker. (Our list for colour.)
    • Future proofing: when REAPI hits V3, what happens?

  • Do you have anywhere to fit in server-provided context? Concrete example, we prefer to turn on new features server-side on a per-action basis, rather than bringing up a new worker with the new logic enabled for all actions it runs. We find it gives us more control, especially in how we roll out or roll back logic changes.

  • One thing I'm not seeing that I think you'd benefit from adding is stats - things like precise timing (how long did input fetching take) and other interesting values (e.g. peak RAM). (See `CommandResult.metadata` for where we squeeze it in). We've found it to be fairly implementation-specific and so changes with the bot itself, but still heavily rely on this metadata for release validation, performance optimizations, etc.

  • I doubt it's necessary for your protocol, but you might be interested in hearing that we (Google) have started to leverage more custom bot types for different workloads, with a common scheduler to dispatch work to them. In practice, that means we do leverage Lease.payload being an Any - as CommandTask is still by far the majority, but no longer the only type of task we run. (We also have a Reservation API we haven't OSS'd yet that allows for essentially creating Leases directly, without intermediating through any REAPI concepts. This gives us the flexibility to actually leverage it, which the REAPI kind of gets in the way for). 

>  Could a protocol like this serve as the basis for a standardized Remote Worker API?

For a standardized example API, I think it's a good foundation - the simplicity and readability are strong positives, and from your experience it sounds like it works pretty well already. I don't think it could absorb all current use-cases as-is though - for RBE, for example, we could probably stand to get rid of Devices and simplify that side of things, but otherwise leverage most of the flexibility points of the RWAPI that you've chosen to drop for simplicity. 

Which isn't a bad thing, to be clear! I think you should use the simplest API that does the trick, and iterating on worker APIs in-place is tractable enough when you control both client and server that adding complexity later is fine. But it's one reason I'm still not yet convinced that everyone should standardize on one Worker API - I see the cons of that approach, but am not yet clear on the Pros.

So let me turn the question around: what server or client implementations do you want to standardize with, and would this protocol suffice for them? If not, what do you need to add? I find it most tractable to start with concrete use-cases and work outwards towards generality, so having two or three concrete examples to consider together and evaluate fit on would be helpful before we get to asking whether this API would be a good base for everyone. That's a very big and scary question :).

Ed Schouten

Jan 3, 2020, 9:59:12 PM
to Eric Burnett, George Gensure, Jakob Buchgraber, Remote Execution APIs Working Group, Richard Dale, Sander Striker
Hi Eric,

On Thu, 2 Jan 2020 at 14:34, Eric Burnett <ericb...@google.com> wrote:
  • Meta-comment: I like the clarity of this API. It's simple and easy to read, which would definitely make new implementations easier.

Thanks! \o/

  • You write "This allows a scheduler to automatically group workers based on certain properties (data center, rack, physical server, worker thread, etc.).", but the proto docs state "nor should individual elements be interpreted by the scheduler." I'm curious, how do you find this works out in practice? Does the scheduler truly ignore the semantic context of all properties and simply allow you to group by any key/value combination, or do you find you're interpreting these fields in practice?

The scheduler currently doesn’t interpret these labels, apart from displaying them to the admin/matching patterns provided by the admin. For example, the scheduler supports installing drains on workers to plan expected downtime. The admin can fill in a freeform pattern to do subset matching (e.g., {“datacenter”: “frankfurt”} to match workers called {“datacenter”: “frankfurt”, “rack”: ..., “hostname”: ...}).

At some point in the future we may add support for locality-aware scheduling to improve cache hit rates/utilization. If we were to support that, the admin would need to specify a list of label names to describe the topology of the system (“datacenter” -> “cluster” -> “node” -> “pod” -> “thread”).

The important aspect remains that all of this is user configurable. It shouldn’t be hardcoded in the design of a scheduler.
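The subset matching described above amounts to checking that every key/value pair in the admin's pattern also appears in the worker's ID map. A minimal sketch in Go (the function name is mine, not Buildbarn's):

```go
package main

import "fmt"

// workerIDMatchesPattern reports whether every key/value pair in the
// drain pattern is also present in the worker's ID map. An empty
// pattern matches every worker.
func workerIDMatchesPattern(pattern, workerID map[string]string) bool {
	for key, value := range pattern {
		if workerID[key] != value {
			return false
		}
	}
	return true
}

func main() {
	worker := map[string]string{
		"datacenter": "frankfurt",
		"rack":       "13",
		"hostname":   "worker-042",
	}
	fmt.Println(workerIDMatchesPattern(map[string]string{"datacenter": "frankfurt"}, worker)) // true
	fmt.Println(workerIDMatchesPattern(map[string]string{"rack": "7"}, worker))               // false
}
```

Because a missing key yields Go's zero value (the empty string), a pattern key absent from the worker's map simply fails to match.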

    • SynchronizeRequest.platform being a REAPI platform directly implicitly assumes one worker should match one platform. We don't find that at all - workers have many properties clients don't need to specify explicitly (e.g. OS is usually implicit in something else more specific that the client is keying by, but a key property of the worker), and many properties are runtime-variable inputs from the Worker's perspective (e.g. `container-image` for us), making it highly variable what Platform a client specifies to select a given worker. (Our list for colour.)

Yeah, that’s something that’s missing in the protocol as it is right now. Right now it does exact matching on the platform properties. Having some kind of wildcard matching in there would be nice (as you pointed out, for container-image).

That said, I have to say I find the way of passing configuration attributes like that through string labels relatively weak. Think of the case where someone wants to run a build action not in a container image that is part of some registry, but one built through the remote execution protocol itself. Or one that is simply stored in a registry to which the worker has no direct access. There is no way a client can attach ‘the bytes’ of the container image to the action.

Alternatively: think of the case where people want to run a cc_test() directly on top of an embedded board/inside of a virtual machine. For those cases you’d want to be able to pass in an actual firmware image.
 
    • Future proofing: when REAPI hits V3, what happens?

When it comes to execution, we should be fine. We could expand the oneofs in the protocol to distinguish between ‘idle’, ‘executing_v2’ and ‘executing_v3’ states. But yes, for platform properties that’s going to be annoying.

Hopefully the REAPI v2 and v3 messages would keep the same field numbering. If not, more complex workarounds would need to be added.
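To make that concrete, the expansion might look something like this - purely hypothetical, since no v3 messages exist today and all names here are mine:

```protobuf
// Purely hypothetical sketch; no REAPI v3 exists, and all names here
// are illustrative only. (Imports of well-known types elided.)
message CurrentState {
  oneof worker_state {
    google.protobuf.Empty idle = 1;
    // Wraps build.bazel.remote.execution.v2 Action/Command messages.
    ExecutingV2 executing_v2 = 2;
    // Would wrap the future v3 equivalents.
    ExecutingV3 executing_v3 = 3;
  }
}

message ExecutingV2 {}  // stub
message ExecutingV3 {}  // stub
```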

  • Do you have anywhere to fit in server-provided context? Concrete example, we prefer to turn on new features server-side on a per-action basis, rather than bringing up a new worker with the new logic enabled for all actions it runs. We find it gives us more control, especially in how we roll out or roll back logic changes.

Right now we’d spin up new workers, but give them different platform properties. That way existing builds will remain unaffected.

  • One thing I'm not seeing that I think you'd benefit from adding is stats - things like precise timing (how long did input fetching take) and other interesting values (e.g. peak RAM). (See `CommandResult.metadata` for where we squeeze it in). We've found it to be fairly implementation-specific and so changes with the bot itself, but still heavily rely on this metadata for release validation, performance optimizations, etc.

It’s interesting that you mention this. There is a change I’m planning on sending out to remote-apis in the near future:

Basically, we’re attaching these kinds of statistics to ActionResult. That way the statistics get stored in the AC, meaning it’s all in one place. bb-browser is capable of displaying these statistics directly when opening up an AC entry.

Buildbarn’s workers already use this mechanism to attach POSIX getrusage(2) info and statistics on temporary file allocation to ActionResults.
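As a rough illustration of the shape (these field names are mine, not the ones from the pending change), the statistics are small messages attached to the ActionResult:

```protobuf
// Illustrative only; field names are not from the actual change.
// A worker fills this in from getrusage(2) and attaches it to the
// ActionResult, so the data is stored in the AC next to the result.
// (Imports of well-known types elided.)
message PosixResourceUsage {
  google.protobuf.Duration user_time = 1;
  google.protobuf.Duration system_time = 2;
  // Peak resident set size ("peak RAM"), in bytes.
  int64 maximum_resident_set_size = 3;
}
```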

So let me turn the question around: what server or client implementations do you want to standardize with, and would this protocol suffice for them? If not, what do you need to add? I find it most tractable to start with concrete use-cases and work outwards towards generality, so having two or three concrete examples to consider together and evaluate fit on would be helpful before we get to asking whether this API would be a good base for everyone. That's a very big and scary question :).

That is a very fair question. So the backstory behind this is that Sander Striker at some point suggested that the common Open Source implementations of REAPI (Build{barn,grid,farm}) should work towards a common worker protocol. In particular, I think he is/was interested in using Buildbarn’s storage/scheduler in combination with Buildgrid’s workers. Is that all right, Sander?

That said, so far I haven’t been contacted by anyone else who is interested in running such heterogeneous setups. Other users of Buildbarn seem to be happy with the custom protocol that’s being used there.

Sander, could you share with us whether this is still relevant to you? Is there still any interest in running a heterogeneous Buildbarn+Buildgrid setup on your end? If so, is there also any preparedness on the Buildgrid side to switch to a different protocol?

Ed
--
Ed Schouten <e...@nuxi.nl>

Sander Striker

Jan 5, 2020, 8:52:26 AM
to Ed Schouten, Eric Burnett, George Gensure, Jakob Buchgraber, Remote Execution APIs Working Group, Richard Dale, Sander Striker
Hi,

On Sat, Jan 4, 2020 at 3:59 AM Ed Schouten <e...@nuxi.nl> wrote:
Hi Eric,

On Thu, 2 Jan 2020 at 14:34, Eric Burnett <ericb...@google.com> wrote:
  • Meta-comment: I like the clarity of this API. It's simple and easy to read, which would definitely make new implementations easier.

Thanks! \o/
 
[...] 
So let me turn the question around: what server or client implementations do you want to standardize with, and would this protocol suffice for them? If not, what do you need to add? I find it most tractable to start with concrete use-cases and work outwards towards generality, so having two or three concrete examples to consider together and evaluate fit on would be helpful before we get to asking whether this API would be a good base for everyone. That's a very big and scary question :).

That is a very fair question. So the backstory behind this is that Sander Striker at some point suggested that the common Open Source implementations of REAPI (Build{barn,grid,farm}) should work towards a common worker protocol. In particular, I think he is/was interested in using Buildbarn’s storage/scheduler in combination with Buildgrid’s workers. Is that all right, Sander?

Let me rephrase this a little bit.  My observation a while ago was that the problem space of workers is different from the problem space of the scheduler and CAS.  Secondly, the emerging pattern I observed was that entire RE stacks were limited in terms of supported platforms.  I think both are still true.
 
That said, so far I haven’t been contacted by anyone else who is interested in running such heterogeneous setups. Other users of Buildbarn seem to be happy with the custom protocol that’s being used there.

Sander, could you share with us whether this is still relevant to you? Is there still any interest in running a heterogeneous Buildbarn+Buildgrid setup on your end? If so, is there also any preparedness on the Buildgrid side to switch to a different protocol?

Just for clarification, BuildGrid implements the RE and Remote Worker APIs.  BuildBox is a toolkit for implementing components of an RE stack; it includes buildbox-worker, which implements the Remote Worker API and leverages other buildbox components to make up a complete worker.  In short, an alternative buildbox-worker could be implemented speaking this new protocol.  I am sure a contribution that would provide a "buildbox-worker-buildbarn" would be welcomed, in lieu of an agreed-upon standard.

My main interest is specifically in ensuring that we think about how we can invite more collaboration in this community.  One obvious way to do so is interoperability through standardization.

Ideally a protocol comes together through collaboration and iteration as well, which incentivises adoption among the collaborators.  As such, I would rather we actually take a step back and start with what we're trying to achieve first, and work on an agenda for iteration to tackle the more difficult problems like platform specification and selection.

Make sense?

Cheers,

Sander
 


Eric Burnett

Jan 5, 2020, 11:16:07 AM
to Ed Schouten, George Gensure, Jakob Buchgraber, Remote Execution APIs Working Group, Richard Dale, Sander Striker
Thanks for the details Ed!

> The scheduler currently doesn’t interpret these labels, apart from displaying them to the admin/matching patterns provided by the admin.
...
> The important aspect remains that all of this is user configurable. It shouldn’t be hardcoded in the design of a scheduler.

Gotcha. We mostly follow the same principle - arbitrary opaque key-value labels for grouping/matching - but we don't have quite as clean a separation, and we do have a few keys the scheduler interprets. IIRC at least "pool" (every worker must belong to one pool, and the scheduler produces metrics based on that), and I think OS (we used it briefly for tuning prefetching logic, though now I think that's all pool-based, so maybe nothing any longer). Possibly one or two more I'm not recalling.

Re, REAPI Platform: 

> That said, I have to say I find the way of passing configuration attributes like that through string labels relatively weak. Think of the case where someone wants to run a build action not in a container image that is part of some registry, but one built through the remote execution protocol itself. Or one that is simply stored in a registry to which the worker has no direct access. There is no way a client can attach ‘the bytes’ of the container image to the action.

Allowing string keys doesn't preclude inlined containers though. At one point this was discussed as a possible feature, but the only interested user at the time was using bazel, and it was determined bazel had no real way to express containers as a special type of input beyond simply passing the container as an input file. Which works today - building a docker container on the fly and running a wrapper script that 'trampolines' into it to execute the actual command is supported and used. It just ended up not taking any special API support. But a key-value "container-image" : "<CAS digest>" would also suffice to express an "inline" image, if you've got a client lined up to use it.

The other reason we didn't make anything docker-specific fully part of our API is that we didn't want to lock ourselves into it. In some cases we run actions without containers (still experimental, but necessary for some docker-incompatible workloads); in other cases we use containers but the keyspace is still different (e.g. Docker on Windows doesn't have the same set of tuning knobs as linux).

So we end up in a hybrid state where the server and worker both need to understand and agree on the keys - server to compose, worker to interpret - but it wouldn't buy us anything for the scheduler in between to also care, so we use flat string pairs to communicate between them, which the scheduler just blindly passes over. (I could also imagine Any'ing a more complicated proto through, but we haven't needed anything like that yet, and perhaps don't want that level of coupling if we can avoid it.)

> Alternatively: think of the case where people want to run a cc_test() directly on top of an embedded board/inside of a virtual machine. For those cases you’d want to be able to pass in an actual firmware image.

This is another example where I'd want the server to first interpret the request from the client, translate it (e.g. maybe the fact that a board is specified implicitly imposes a scheduling constraint on what workers can run it, or the board image should be fetched from somewhere else server-side), give useful error messages if necessary, and then pass over the relevant bits to the worker. Perhaps that'd all get delegated to a server-controlled wrapper script, and to the worker it'd be a standard-looking action with command and inputs, where the wrapper script executed on the worker would do the heavy lifting of running the emulator on the right image. Or perhaps this would get implemented as a bot feature, and "emulator-image" : "AABBFF123" plus a bunch of known keys for tuning knobs (hardware to emulate, etc) would get passed in. But in either case, I personally prefer the flexibility of decoupling the user-facing REAPI from the "internal" server/bot interface.

(Side note: we've liked using server-injected wrapper scripts for prototyping new features easily, but most get re-implemented as bot features eventually, for better performance and/or error message handling. Our docker layer itself went through that progression, for example - server side wrapper script at first, fully baked bot feature later. So I don't think that that's strictly necessary, vs going straight to bot features).

Using the REAPI on the bot is an interesting tradeoff, and one we haven't fully explored ourselves. So I'll be interested in hearing more of your experience with it. I assume it forces you to do all the heavy lifting bot-side (or compose a new "action" if you want to inject anything from the server), but that in turn also allows you to avoid that translation layer if you weren't going to do anything server-side anyways. I'm curious how this weighs out on balance, especially as you add more supported features.

> It’s interesting that you mention this. There is a change I’m planning on sending out to remote-apis in the near future:

> Basically, we’re attaching these kinds of statistics to ActionResult. That way the statistics get stored in the AC, meaning it’s all in one place. bb-browser is capable of displaying these statistics directly when opening up an AC entry.

Huh, interesting. Would clients ever leverage this data directly, do you think, or would it only be for side-channel analysis? (I'm considering bb-browser an analysis tool here, not a "client" in the build tool sense).

I'm definitely in favour of having that sort of information for logging/analysis, but I hadn't thought of putting it into the Action Cache directly. (We use a different proto for logs, and so our analysis tools don't hit the AC directly). I'm slightly torn on the concept - on one hand, making the information more standardly accessible sounds nice; on the other hand, if everyone is going to get the action digest from logs already then having the information inline there is strictly better - no need to join logs against an RPC API that way - and so not polluting the API with data clients won't care about would have my vote. But I'd like to hear more about your usage/analysis pattern, as I doubt I'm thinking far outside the box that is our own current implementation :).

Jakob Buchgraber

Jan 8, 2020, 7:40:35 AM
to Ed Schouten, Remote Execution APIs Working Group, George Gensure, Richard Dale, Sander Striker
Hi Ed,

The API looks good. As you know, the previous worker protocol was slightly different: the worker and scheduler would communicate via a bidirectional streaming call, GetWork(stream ExecuteResponse) returns (stream ExecuteRequest). The new protocol has abandoned this approach and switched to a polling model instead.
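Side by side, the two shapes being compared:

```protobuf
// Previous protocol: one long-lived bidirectional stream per worker.
rpc GetWork(stream ExecuteResponse) returns (stream ExecuteRequest);

// New protocol: a repeated unary call, carrying the worker's current
// state out and the scheduler's desired state back.
rpc Synchronize(SynchronizeRequest) returns (SynchronizeResponse);
```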

I can see the advantages of the new protocol but I am still curious as to what motivated the change?

Best,
Jakob

Ed Baunton

Jan 22, 2020, 8:11:56 AM
to Remote Execution APIs Working Group
Rekindling this thread so that it doesn't fall off our radars.

I would really like for us to be able to come to a conclusion as a community on the next version of the API. The landscape now is very fractured, with three competing APIs, which makes interoperability a non-starter.

BuildGrid and BuildBox go to a lot of effort to support many of the features defined by the worker spec: particularly the platform properties. We leverage the feature that was mentioned earlier to support wildcard container/chroot images: we merge the Merkle trees and fetch from the CAS.

Between us we have a good number of experiences and motivating use cases to drive a solid v2 of the API that would be cleaner, simpler and more flexible.