FIDL API design question: "Destructive" methods and terminal events


Gary Bressler

May 26, 2023, 3:00:42 PM
to api-council, fidl-dev, Yifei Teng, Claire Gonyeo, Adam Barth, Kevin Lindkvist
Greetings,

Claire is designing a new Controller component API that will allow components to directly control the lifecycle of their children, and we have run into a subtle design issue.

Basically, Controller is a two-level API. There is a Controller protocol that represents a component, and an ExecutionController nested inside it that represents a component's execution state. When the component starts, it should spawn an ExecutionController, and when it stops, the ExecutionController should close.

For the purpose of this topic, there are two operations that are relevant: the ability to stop a component, and the ability to be notified that a component stopped (including its return code, if it had one). This notification might arrive due to a Stop(), or spontaneously if the component exited on its own.

Here is the API we currently have. Call it Option 1.

protocol Controller {
    /// Start the component, optionally providing additional handles to be given
    /// to the component. Returns INSTANCE_ALREADY_RUNNING if the instance is
    /// currently running.
    Start(resource struct {
        args StartChildArgs;
        execution_controller server_end:ExecutionController;
    }) -> () error Error;
};

protocol ExecutionController {
    /// Stops the component. This function blocks until the stop action is
    /// complete.
    ///
    /// Note that a component may stop running on its own at any time, so it is
    /// not safe to assume that this error code will not be returned if this
    /// function has not been called yet.
    Stop() -> () error Error;

    /// When the child is stopped due to `Stop` being called, the child exiting
    /// on its own, or for any other reason, `OnStop` is called and then this
    /// channel is closed.
    -> OnStop(struct {
        stopped_payload StoppedPayload;
    });
};


Stop() is a "destructive" method: one of its side effects is to close the channel of the protocol it's attached to. The problem is the following: when Stop() returns, we would like the server to close the ExecutionController, since that is what authoritatively signals to the client that the component has stopped. However, the server cannot deliver the response after closing the channel, or atomically with closing it, so there is necessarily some delay between the response to Stop() and the channel closing.
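
To make the gap concrete, here is a rough sketch of the server side of Option 1 in Rust. This is illustrative only: `ExecutionControllerRequestStream`, `send_on_stop`, and `stop_the_component` are placeholder names for what the generated bindings and the stop action might look like for this in-design protocol, not existing APIs.

use futures::TryStreamExt;

// Placeholder for the bindings fidlc would generate for Option 1.
async fn serve(mut stream: ExecutionControllerRequestStream) -> Result<(), fidl::Error> {
    while let Some(request) = stream.try_next().await? {
        let ExecutionControllerRequest::Stop { responder } = request;
        let payload = stop_the_component().await; // hypothetical: run the stop action

        // (1) Reply to Stop(). This only queues a message on the channel.
        responder.send(Ok(()))?;

        // (2) Send OnStop, then (3) close the channel. Neither can happen
        // atomically with (1): once the channel is closed the reply can no
        // longer be written, so the reply has to go first, and the client can
        // observe it while the channel is still open.
        let control = stream.control_handle();
        control.send_on_stop(&payload)?;
        control.shutdown();
    }
    Ok(())
}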

Option 2 is to make Stop() a void method. That way, there is no problem of giving inconsistent signals about when stop finished. Clients have to listen for PEER_CLOSED (or the event) to wait for the stop to complete.

protocol ExecutionController {
    /// Causes this component to stop. When stop completes, OnStop will be
    /// delivered on the channel and it will be closed.
    Stop();

    /// When the child is stopped due to `Stop` being called, the child exiting
    /// on its own, or for any other reason, `OnStop` is called and then this
    /// channel is closed.
    -> OnStop(struct {
        stopped_payload StoppedPayload;
    });
};

There are some obvious drawbacks to this option, though. In general, "Pipelining" APIs are less user friendly. Also, if we want Stop() to return an error, it's not clear how to do that. Maybe the error could become part of the StoppedPayload, but that's a lot less friendly than it coming straight from the method.

Finally, there's an Option 3 where we lift Stop() to the top-level Controller:

protocol Controller {
    /// Start the component, optionally providing additional handles to be given
    /// to the component. Returns INSTANCE_ALREADY_RUNNING if the instance is
    /// currently running.
    Start(resource struct {
        args StartChildArgs;
        execution_controller server_end:ExecutionController;
    }) -> () error Error;

    /// Stops the component. This function blocks until the stop action is
    /// complete.
    ///
    /// When this returns, the ExecutionController will be closed and contain
    /// the OnStop event.
    Stop() -> () error Error;
};

This would make it possible to close the ExecutionController before returning from Stop(). However, it decouples the method from the entity it's acting on.

Is there guidance for which pattern we should prefer? Or prior art we can refer to?

P.S. @Yifei Teng and I were discussing another hypothetical option if terminal events were available, which would let us have our cake and eat it too:

protocol ExecutionController {
    /// Stops the component. This function blocks until the stop action is
    /// complete.
    ///
    /// Atomically with sending the response, the server will put this channel
    /// into the terminal state, containing the OnStop terminal event.
    @terminal
    Stop() -> (StoppedPayload) error Error;

    @terminal
    -> OnStop(StoppedPayload);
};

The idea here is that the server would have a way to put the channel into the "terminal state", with the terminal event, simultaneously with the Stop response. This would give us everything we want, and there is no race between delivering the response and closing the channel.


Suraj Malhotra

May 26, 2023, 3:10:32 PM
to Gary Bressler, api-council, fidl-dev, Yifei Teng, Claire Gonyeo, Adam Barth, Kevin Lindkvist
We've hit similar issues in the past and rely on epitaphs to signal the termination rather than an event. So the API looks like:


protocol ExecutionController {
    /// Causes this component to stop. When stop completes, the channel will
    /// be closed with an epitaph conveying the result.
    Stop();
};

I don't think this is ideal, and terminal events would improve the ergonomics and fidelity of the API.


Alice Neels

May 26, 2023, 3:21:12 PM
to Gary Bressler, TQ-Scenic-Team, api-council, fidl-dev, Yifei Teng, Claire Gonyeo, Adam Barth, Kevin Lindkvist
+TQ-Scenic-Team since very similar problems have come up when managing view lifecycles.


Adam Barth

May 26, 2023, 3:40:12 PM
to Alice Neels, Gary Bressler, TQ-Scenic-Team, api-council, fidl-dev, Yifei Teng, Claire Gonyeo, Kevin Lindkvist
Can you say a bit more about what problem this timing gap causes? In reality, the PEER_CLOSED signal and the response to Stop() are not synchronized in any way. The PEER_CLOSED signal is asserted on the peer handle immediately, whereas the response to Stop() is queued through the message queue on the channel. It's quite likely the PEER_CLOSED signal will reach the client before the Stop() response. The FIDL bindings are designed to hide this race from their clients because they always drain a channel before reacting to the PEER_CLOSED signal. In any case, there is a spin of the event loop between any two observations, including the Stop() response and the PEER_CLOSED observations. Even if the server could atomically queue the Stop() response and the PEER_CLOSED signal, the receiver would still process them in separate spins of the event loop.

Option 2 is to make Stop() a void method. That way, there is no problem of giving inconsistent signals about when stop finished. Clients have to listen for PEER_CLOSED (or the event) to wait for the stop to complete.

protocol ExecutionController {
    /// Causes this component to stop. When stop completes, OnStop will be
    /// delivered on the channel and it will be closed.
    Stop();

    /// When the child is stopped due to `Stop` being called, the child exiting
    /// on its own, or for any other reason, `OnStop` is called and then this
    /// channel is closed.
    -> OnStop(struct {
        stopped_payload StoppedPayload;
    });
};

There are some obvious drawbacks to this option, though. In general, "Pipelining" APIs are less user friendly. Also, if we want Stop() to return an error, it's not clear how to do that. Maybe the error could become part of the StoppedPayload, but that's a lot less friendly than it coming straight from the method.

Can you say a bit more about these drawbacks?  They're not obvious to me.  For example, can you be more specific about what is less user friendly about this approach?  For example, can you tell me a story about how a user interacts with this protocol and ends up unhappy?

Regarding returning an error from Stop(), can you say a bit more about what kind of error you might want to return from stop?  In general, our approach to close/stop operations in Fuchsia is to make them error-free.  For example, when you call exit(0) in C, there's no error that exit can report because it never returns.  Similarly, libzx swallows all errors from zx_handle_close because there's no sensible thing for the caller to do with them.

Gary Bressler

May 26, 2023, 4:08:20 PM
to Adam Barth, Alice Neels, TQ-Scenic-Team, api-council, fidl-dev, Yifei Teng, Claire Gonyeo, Kevin Lindkvist
In Claire's CL, this is the order of operations:

1. The server sends the response to Stop().
2. The server sends the OnStop event.
3. The server closes the ExecutionController channel.

If my understanding is correct, this means a client could observe the response, then later (in a separate transaction) observe the event and PEER_CLOSED. This seems confusing from a consistency perspective, because when Stop() returned that was supposed to indicate that the stop operation completed, but the client still sees aftereffects.

Basically, I would like (2) and (3) to be concurrent with, or at least precede, (1). But we can't simply move (3) before (1) and (2) because once a process closes its channel handle it can no longer do anything with that channel.
 
Option 2 is to make Stop() a void method. That way, there is no problem of giving inconsistent signals about when stop finished. Clients have to listen for PEER_CLOSED (or the event) to wait for the stop to complete.

protocol ExecutionController {
    /// Causes this component to stop. When stop completes, OnStop will be
    /// delivered on the channel and it will be closed.
    Stop();

    /// When the child is stopped due to `Stop` being called, the child exiting
    /// on its own, or for any other reason, `OnStop` is called and then this
    /// channel is closed.
    -> OnStop(struct {
        stopped_payload StoppedPayload;
    });
};

There are some obvious drawbacks to this option, though. In general, "Pipelining" APIs are less user friendly. Also, if we want Stop() to return an error, it's not clear how to do that. Maybe the error could become part of the StoppedPayload, but that's a lot less friendly than it coming straight from the method.

Can you say a bit more about these drawbacks?  They're not obvious to me.  For example, can you be more specific about what is less user friendly about this approach?  For example, can you tell me a story about how a user interacts with this protocol and ends up unhappy?

Regarding returning an error from Stop(), can you say a bit more about what kind of error you might want to return from stop?  In general, our approach to close/stop operations in Fuchsia is to make them error-free.  For example, when you call exit(0) in C, there's no error that exit can report because it never returns.  Similarly, libzx swallows all errors from zx_handle_close because there's no sensible thing for the caller to do with them.

This is going off anecdotes and personal experience, but users sometimes find it tricky when a FIDL method call doesn't convey the result of the operation. `fuchsia-io.Open` (such as in the context of capability routing) is the classic example of this: users connect to a protocol in their namespace and make a few method calls on it, which appear to succeed, but then something breaks later because the routing failed. That said, I'm open to the possibility that this is a standard pattern users should be expected to understand.
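
As a concrete (if simplified) version of that story in Rust, with `fuchsia.examples.Echo` standing in for whatever routed protocol the user actually wanted:

use fidl_fuchsia_examples::EchoMarker;
use fuchsia_component::client::connect_to_protocol;

async fn demo() {
    // This "succeeds" even when the route is broken: it only opens a channel
    // into the namespace and pipelines the server end toward the provider.
    let echo = connect_to_protocol::<EchoMarker>().expect("failed to connect");

    // A broken route only becomes visible here, as PEER_CLOSED on the first
    // real call, far away from where the misconfiguration happened.
    match echo.echo_string("hello").await {
        Ok(reply) => println!("got: {reply}"),
        Err(e) => eprintln!("the route was broken all along: {e}"),
    }
}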

Regarding the error, in principle StopAction can fail, although I'm not aware of a real scenario where this has happened. It may be that in this case, leaving out the error from `Stop` would be fine and conveying it in the StoppedPayload instead (or simply logging it) would be enough.

This is speculative, but if we introduced something like rights on the controller, I could see an error for insufficient permissions (ACCESS_DENIED).

Adam Barth

May 26, 2023, 4:34:18 PM
to Gary Bressler, Alice Neels, TQ-Scenic-Team, api-council, fidl-dev, Yifei Teng, Claire Gonyeo, Kevin Lindkvist
The client would process three separate events, with gaps for other async work between them, and it would observe (1), (2), and (3) in that order. The kernel provides the ordering for (1) and (2) but lets (3) race. The FIDL bindings provide the ordering that causes (3) to be processed after (1) and (2).
 
This seems confusing from a consistency perspective, because when Stop() returned that was supposed to indicate that the stop operation completed, but the client still sees aftereffects.

Yeah, I don't think you can say that Stop() blocks until the stop action is complete.  That's not a promise that can be kept.  You can say that no further messages will be sent by the server after (2).

Basically, I would like (2) and (3) to be concurrent with, or at least precede, (1).

The way the FIDL bindings are designed, PEER_CLOSED is the last thing that ever happens with a protocol.  It's not possible to deliver events after delivering the PEER_CLOSED event.
 
But we can't simply move (3) before (1) and (2) because once a process closes its channel handle it can no longer do anything with that channel.

You're conflating the sender and the receiver.  It sounds like you're mostly interested in how the receiver processes these events.  You're correct that trying to make the receiver process (3) before (1) and (2) is contrary to the design of the system and isn't feasible.
 
Option 2 is to make Stop() a void method. That way, there is no problem of giving inconsistent signals about when stop finished. Clients have to listen for PEER_CLOSED (or the event) to wait for the stop to complete.

protocol ExecutionController {
    /// Causes this component to stop. When stop completes, OnStop will be
    /// delivered on the channel and it will be closed.
    Stop();

    /// When the child is stopped due to `Stop` being called, the child exiting
    /// on its own, or for any other reason, `OnStop` is called and then this
    /// channel is closed.
    -> OnStop(struct {
        stopped_payload StoppedPayload;
    });
};

There are some obvious drawbacks to this option, though. In general, "Pipelining" APIs are less user friendly. Also, if we want Stop() to return an error, it's not clear how to do that. Maybe the error could become part of the StoppedPayload, but that's a lot less friendly than it coming straight from the method.

Can you say a bit more about these drawbacks?  They're not obvious to me.  For example, can you be more specific about what is less user friendly about this approach?  For example, can you tell me a story about how a user interacts with this protocol and ends up unhappy?

Regarding returning an error from Stop(), can you say a bit more about what kind of error you might want to return from stop?  In general, our approach to close/stop operations in Fuchsia is to make them error-free.  For example, when you call exit(0) in C, there's no error that exit can report because it never returns.  Similarly, libzx swallows all errors from zx_handle_close because there's no sensible thing for the caller to do with them.

This is going off anecdotes and personal experience, but users sometimes find it tricky when a FIDL method call doesn't convey the result of the operation. `fuchsia-io.Open` (such as in the context of capability routing) is the classic example of this: users connect to a protocol in their namespace and make a few method calls on it, which appear to succeed, but then something breaks later because the routing failed. That said, I'm open to the possibility that this is a standard pattern users should be expected to understand.

Regarding the error, in principle StopAction can fail, although I'm not aware of a real scenario where this has happened. It may be that in this case, leaving out the error from `Stop` would be fine and conveying it in the StoppedPayload instead (or simply logging it) would be enough.

This is speculative, but if we introduced something like rights on the controller, I could see an error for insufficient permissions (ACCESS_DENIED).

ACCESS_DENIED is an "early" failure in the sense that it's an error that's generated before the ExecutionController decides to accept the Stop() operation.  It's sensible for the Stop() operation to have a reply that indicates whether the ExecutionController accepted the operation.  It's not sensible for the reply to Stop() to be the last thing the client observes about the ExecutionController because the client will observe PEER_CLOSED afterwards.

Maybe the following design would make sense:

    /// Stops the component.
    ///
    /// Upon success, the component will eventually stop. To observe when the
    /// component stops, wait for the OnStop event, followed by PEER_CLOSED.
    ///
    /// Errors:
    ///
    ///  * ZX_ERR_ACCESS_DENIED: the client has insufficient rights to stop
    ///    the component.
    ///
    /// Note that a component may stop running on its own at any time, so it is
    /// not safe to assume that this error code will not be returned if this
    /// function has not been called yet.
    Stop() -> () error Error;
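
From the client's side this splits the two questions apart: the reply answers "was the request accepted?", while OnStop followed by PEER_CLOSED answers "has the component stopped?". A rough Rust sketch, again against assumed generated bindings (`ExecutionControllerProxy` and `ExecutionControllerEvent` are hypothetical names):

use futures::TryStreamExt;

async fn stop_component(proxy: &ExecutionControllerProxy) -> Result<StoppedPayload, Error> {
    // Take the event stream up front so OnStop cannot be missed.
    let mut events = proxy.take_event_stream();

    // The reply only says whether the request was accepted; e.g. this is where
    // ZX_ERR_ACCESS_DENIED would show up. (A real client also has to tolerate
    // the reply being lost because the component stopped on its own first.)
    proxy.stop().await.expect("transport error")?;

    // The stop itself is observed via OnStop, followed by PEER_CLOSED.
    while let Some(event) = events.try_next().await.expect("transport error") {
        let ExecutionControllerEvent::OnStop { stopped_payload } = event;
        return Ok(stopped_payload);
    }
    panic!("channel closed without an OnStop event");
}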

Gary Bressler

May 26, 2023, 5:58:09 PM
to Adam Barth, Alice Neels, TQ-Scenic-Team, api-council, fidl-dev, Yifei Teng, Claire Gonyeo, Kevin Lindkvist
Wondering if it makes sense for FIDL to provide some abstract notion of "channel has terminated" that doesn't necessarily imply PEER_CLOSED? Most clients won't check for the PEER_CLOSED signal directly; they use the FIDL bindings (`set_error_handler` in C++, or `None` from `stream.next()` in Rust) to tell them when the channel is closed.
Yes, I considered this design, but I was uncertain about it because users who don't read the doc might have the mistaken interpretation that a response from Stop() implies stop has completed. If that is a risk we are willing to accept, I think this design would solve the requirements.

Yifei Teng

May 26, 2023, 6:43:24 PM
to Gary Bressler, Adam Barth, Alice Neels, TQ-Scenic-Team, api-council, fidl-dev, Claire Gonyeo, Kevin Lindkvist
Yes, there's an out-of-band peer-closed handler. In that case what's the sharp edge you would like to guard against? After `Stop` returns, do you consider the `OnStop` event (if any) and the PEER_CLOSED signal (and the out-of-band handler) unwanted aftereffects?

Gary Bressler

May 26, 2023, 7:08:28 PM
to Yifei Teng, Adam Barth, Alice Neels, TQ-Scenic-Team, api-council, fidl-dev, Claire Gonyeo, Kevin Lindkvist
I consider them effects wanted before `Stop` returns, but not after (assuming we're using the Stop response to mean "this stop has completed").

Adam Barth

May 26, 2023, 7:48:35 PM
to Gary Bressler, Yifei Teng, Alice Neels, TQ-Scenic-Team, api-council, fidl-dev, Claire Gonyeo, Kevin Lindkvist
On Fri, May 26, 2023 at 11:08 PM Gary Bressler <g...@google.com> wrote:
I consider them effects wanted before `Stop` returns, but not after (assuming we're using the Stop response to mean "this stop has completed").

You could provide those semantics in a client library, but I don't see a way to provide that invariant with the FIDL interface directly.  Fundamentally, the last event you receive in the FIDL model is the PEER_CLOSED event, which will always be some async gap after the last reply you get to a method.
If we introduced the concept of terminal events to FIDL and we marked the OnStop event as terminal, we could have the client close the connection when it receives the OnStop event. From the client's perspective, the ExecutionController would "go away" simultaneously with receiving the OnStop event.

I guess we could do something similar with a "terminal message" feature in FIDL where the client would close the connection when receiving the reply to the message.  That would similarly make it appear to the client that the ExecutionController "went away" simultaneously with receiving the reply.

If you marked both Stop() -> () error Error and -> OnStop as terminal, the client would receive only one of them, although it would be hard to predict which one.
If you don't have a strong use case for the Error reply, I think it's cleaner to just have the OnStop as the only way of getting notified.  Regardless of how long the server waits to reply to Stop(), there's always a risk that the Stop() message is racing with the ExecutionController going away of its own accord.  In that race, sometimes you won't get a reply to Stop() and instead get a transport error.  The client will need code that's resilient to these multiple termination conditions that arise due to this race.  However, in the model where Stop() has no reply, none of these races are visible to the client.
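
For what it's worth, the "client library" version of this could be a small wrapper like the one below (Rust, assumed generated bindings again): the no-reply Stop() from Option 2, with the OnStop/PEER_CLOSED race folded into a single observation for the caller.

use futures::TryStreamExt;

/// Requests a stop (if the component hasn't already stopped on its own) and
/// resolves exactly once, when the component is gone. Callers never have to
/// reason about the ordering of the Stop call, OnStop, and PEER_CLOSED.
async fn stop_and_wait(proxy: ExecutionControllerProxy) -> Option<StoppedPayload> {
    let mut events = proxy.take_event_stream();

    // Fire-and-forget; if the component already stopped, the write simply
    // fails and we fall through to observing the channel close.
    let _ = proxy.stop();

    while let Ok(Some(event)) = events.try_next().await {
        let ExecutionControllerEvent::OnStop { stopped_payload } = event;
        // The proxy is consumed by this function, so from the caller's point
        // of view the controller "goes away" together with this result.
        return Some(stopped_payload);
    }
    // Channel closed without OnStop (e.g. the server crashed).
    None
}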