[fidl-dev] Use case survey: observing client call cancellation

38 views
Skip to first unread message

Yifei Teng

unread,
Jun 17, 2021, 6:36:27 PM6/17/21
to fidl-dev
Hi FIDL users,

In making the C++ client bindings free of use-after-frees, one tricky aspect is whether to notify the users of cancellation of in-flight calls on a per-call basis when the client is tearing down. Here's an illustrative example:

class Foo {
 public:
  void DoWork();
 private:
  fidl::Client<MyProtocol> client_;
};

void Foo::DoWork() {
  client_.TwoWayCall([this] (Result result) {
    // Do more work based on |result|.
  });
}

When a `Foo` object destructs, it's natural to cancel (i.e. stop monitoring and drop the callback) any in-progress calls that haven't received a reply. Here, there are some API design options:

1. We could drop the callback without invoking it.
2. We could invoke the callback with a |result| value that signifies that the call was cancelled because destruction of |Foo|.

It seems that #2 could be useful if the user would like to link the cancellation to the release of other associated resources, or the cancellation of subsequent asynchronous continuations. However, it is less safe because we could only arrange the callback to be invoked when |Foo| is _destroying_ (re-entrancy, deadlocks, ...), or when |Foo| is destroyed (prone to use-after-free).

We would like to know if there are current use cases you could think of that necessitates a #2 style API. Note that #1 isn't completely free of use-after-free either if the callback pokes at the captured `this` object in its destructor, but it should be much less common in practice.

Cheers,
Yifei

Wez

unread,
Jun 17, 2021, 7:11:53 PM6/17/21
to Yifei Teng, fidl-dev
In this example the Foo object knows when it is being destroyed, so there does not seem to be value in invoking the callback when the Client is destroyed?

It's also the case by the time Client is destructed (which is presumably when the callbacks would be run), Foo is already partially destructed, i.e. it is no longer safe to make calls on the Foo instance.

--
All posts must follow the Fuchsia Code of Conduct https://fuchsia.dev/fuchsia-src/CODE_OF_CONDUCT or may be removed.
---
You received this message because you are subscribed to the Google Groups "fidl-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to fidl-dev+u...@fuchsia.dev.
To view this discussion on the web visit https://groups.google.com/a/fuchsia.dev/d/msgid/fidl-dev/65c0db64-42da-4928-9496-8297871af4d8n%40fuchsia.dev.

Yifei Teng

unread,
Jun 17, 2021, 7:24:59 PM6/17/21
to Wez, fidl-dev
That matches my current thinking and leads to less use-after-free. I'm somewhat concerned about a potential raciness which I'll illustrate below. Would love to hear folks' thoughts. Thanks!

On Thu, Jun 17, 2021 at 4:09 PM Wez <w...@google.com> wrote:
In this example the Foo object knows when it is being destroyed, so there does not seem to be value in invoking the callback when the Client is destroyed?

Makes sense. We could imagine a design where all extrinsic causes for teardown are propagated to the user (e.g. peer closed), but not when the user themselves initiated the teardown.

A corner case (maybe much less of a corner case) is when the user initiates the teardown of |client_| ahead of time before |Foo| destructs, for various reasons. LLCPP supports this via AsyncTeardown. It seems useful to fail future calls on the same client with an error signalling that the client is tearing down, but this error is the same kind of error which we wanted to avoid propagating to the user in the case of in-flight calls. As a result, we end up with a strange API where in-flight calls are silently dropped, but future calls after the teardown are notified with an error.

Perhaps we could silently drop future calls too? It seems to make the error non-obvious.

Wez

unread,
Jun 17, 2021, 7:54:34 PM6/17/21
to Yifei Teng, fidl-dev
If things have been torn-down then there can't be any further calls made through them, surely?

Yifei Teng

unread,
Jun 17, 2021, 8:07:50 PM6/17/21
to Wez, fidl-dev
Pardon, I should've been more precise. There are two related concepts:

- Teardown (i.e. destruction) of the client object. If a client is destroyed, it's not accessible by definition.
- Teardown of the bindings, i.e. the procedure of monitoring new messages and the association of the endpoint and the client. In HLCPP a similar operation is called "unbind", which also returns the channel back to the caller.

It's possible for the client object to be in scope, while the bindings have torn down. For example, a fatal error may happen in the transport that requires tearing down the bindings, in which case we'd want to fail further calls made on the client. Similarly, the user may explicitly initiate the teardown of the bindings (`AsyncTeardown` above).

Different causes of teardown seems to warrant different error notification behaviors:
- If the user requested teardown then it seems desirable to silently drop outstanding callbacks to prevent use-after-free.
- If a transport error caused teardown then it seems possible to propagate the error to outstanding callbacks.

An open question is what happens when the user continues to make more method calls on the client, during/after bindings teardown.

Pascal

unread,
Jun 18, 2021, 1:59:31 AM6/18/21
to Yifei Teng, Wez, fidl-dev
An open question is what happens when the user continues to make more method calls on the client, during/after bindings teardown.

Wouldn't that be a state violation, and lead to either error or panic, i.e. you are using a part of the API which even though exists (methods are there) is not accessible at that moment in time, and should be treated as if it is not there? A little like trying to read from a closed fd.

PL

Yifei Teng

unread,
Jun 18, 2021, 5:46:14 AM6/18/21
to Pascal, Wez, fidl-dev
Makes sense. Some thoughts inline. 

On Thu, Jun 17, 2021 at 10:59 PM Pascal <pasca...@google.com> wrote:
An open question is what happens when the user continues to make more method calls on the client, during/after bindings teardown.

Wouldn't that be a state violation, and lead to either error or panic, i.e. you are using a part of the API which even though exists (methods are there) is not accessible at that moment in time, and should be treated as if it is not there? A little like trying to read from a closed fd.


One trickiness is that the teardown could come from a source that is outside of the user's control, e.g. the server might close their end of the channel oor send invalid data. It seems overly restrictive to penalize the user by panicking if the server closes their endpoint while the user is making a client call.

Using the fd analogy, I think it's similar to reading from a fd whose remote side has closed (e.g. reading from a socket whose peer has terminated). We could return an error reflecting this situation. On the other hand, if the user themselves requested teardown (similar to closing the local fd or closing a local channel), it sounds more reasonable if we chose to panic in this case. Perhaps we could implement distinct behavior based on the cause of the teardown.

A slightly less trickiness is that doing so forces users to add external synchronization between making calls and teardown. Currently in LLCPP, we'd simply return the same error when making a call when teardown is happening/has completed. If we chose to panic in a subset of the cases, users would have to take a lock whenever they're making a call or calling `AsyncTeardown` on the client, to avoid triggering the panic.

We could continue to return the same error, but that doesn't mesh well with moving sync errors to async (fxbug.dev/75324), since that gets us to approach #2 in the first email in the thread which as Wez mentions is prone to use-after-free: we'd surface an asynchronous error saying that the bindings have torn down, but that callback might end up touching destroyed user objects if not careful.
 
PL

Wez

unread,
Jun 21, 2021, 2:03:55 AM6/21/21
to Yifei Teng, Pascal, fidl-dev
On Fri, 18 Jun 2021 at 02:46, Yifei Teng <yif...@google.com> wrote:
Makes sense. Some thoughts inline. 

On Thu, Jun 17, 2021 at 10:59 PM Pascal <pasca...@google.com> wrote:
An open question is what happens when the user continues to make more method calls on the client, during/after bindings teardown.

Wouldn't that be a state violation, and lead to either error or panic, i.e. you are using a part of the API which even though exists (methods are there) is not accessible at that moment in time, and should be treated as if it is not there? A little like trying to read from a closed fd.


One trickiness is that the teardown could come from a source that is outside of the user's control, e.g. the server might close their end of the channel oor send invalid data. It seems overly restrictive to penalize the user by panicking if the server closes their endpoint while the user is making a client call.

That's why the HLCPP bindings don't take action if writing to the channel fails, but instead watch the channel for peer-closed, and teardown & execute the client-supplied callback when that is observed.  The API contract is that callers should never call via an un-bound client.

Using the fd analogy, I think it's similar to reading from a fd whose remote side has closed (e.g. reading from a socket whose peer has terminated). We could return an error reflecting this situation. On the other hand, if the user themselves requested teardown (similar to closing the local fd or closing a local channel), it sounds more reasonable if we chose to panic in this case. Perhaps we could implement distinct behavior based on the cause of the teardown.

This is already implemented in the HLCPP bindings:
1. If an InterfacePtr<> is bound and peer-closed is observed then the error-handler is invoked.
2. If an InterfacePtr<> is explicitly Unbind()ed then the error-handler is not invoked.

There's nothing particularly tricky about that - there's just a simple API guarantee, inherent in having Unbind()ed the InterfacePtr<> - it cannot possibly observe the peer-closed after being unbound. :)

A slightly less trickiness is that doing so forces users to add external synchronization between making calls and teardown. Currently in LLCPP, we'd simply return the same error when making a call when teardown is happening/has completed. If we chose to panic in a subset of the cases, users would have to take a lock whenever they're making a call or calling `AsyncTeardown` on the client, to avoid triggering the panic.

We could continue to return the same error, but that doesn't mesh well with moving sync errors to async (fxbug.dev/75324), since that gets us to approach #2 in the first email in the thread which as Wez mentions is prone to use-after-free: we'd surface an asynchronous error saying that the bindings have torn down, but that callback might end up touching destroyed user objects if not careful.

That these kinds of questions are still coming up suggests that LLCPP bindings still lack a coherent threading model.

If I understand correctly, it is a requirement that it be possible for a single LLCPP binding to be shareable by clients running on multiple threads?
With a per-client single-owner object wrapping a reference to the shared multi-threaded binding object, it should be straightforward to arrange that clients are guaranteed not to be called-back from the moment they teardown that single-owner object.

Is that not something we can/want to do?
 
 
PL

Yifei Teng

unread,
Jun 21, 2021, 6:19:19 PM6/21/21
to fidl-dev, Wez, Pascal Perez, fidl-dev, Yifei Teng, Suraj Malhotra
Thanks for the comments, Wez. LLCPP is under a more challenging set of requirements on the concurrency and safety front, which I'll elaborate towards the end - would appreciate your thoughts on this.

On Sunday, June 20, 2021 at 11:03:55 PM UTC-7 Wez wrote:
On Fri, 18 Jun 2021 at 02:46, Yifei Teng <yif...@google.com> wrote:
Makes sense. Some thoughts inline. 

On Thu, Jun 17, 2021 at 10:59 PM Pascal <pasca...@google.com> wrote:
An open question is what happens when the user continues to make more method calls on the client, during/after bindings teardown.

Wouldn't that be a state violation, and lead to either error or panic, i.e. you are using a part of the API which even though exists (methods are there) is not accessible at that moment in time, and should be treated as if it is not there? A little like trying to read from a closed fd.


One trickiness is that the teardown could come from a source that is outside of the user's control, e.g. the server might close their end of the channel oor send invalid data. It seems overly restrictive to penalize the user by panicking if the server closes their endpoint while the user is making a client call.

That's why the HLCPP bindings don't take action if writing to the channel fails, but instead watch the channel for peer-closed, and teardown & execute the client-supplied callback when that is observed.  The API contract is that callers should never call via an un-bound client.

IIUC, this was referring to the `set_error_handler` callback which notifies the user when tearing down. But not all users attach an error handler to an `InterfacePtr`. There's also a potential race condition if the callback is not attached "immediately" after creating the `InterfacePtr` e.g. if the callback was attached asynchronously. I think it is for these reasons that the HLCPP `InterfacePtr` precisely defined the effect of calling on an un-bound client to mean silently discarding all input and closing the handles.

We could imagine changing the HLCPP `InterfacePtr` invariants to instead require that callers never call via an un-bound client, and have `InterfacePtr` panic if a call was made when un-bound. I think it could make for better guarantees and still be a viable API - users could insert teardown/reconnect logic in their error handlers, or set a boolean to avoid subsequent calls when there is an error.

However, In LLCPP, it is a requirement that FIDL method calls could be made from arbitrary threads. It turns out that it is much harder to make the same guarantees (panic if call on un-bound) under this requirement. When the user is making a call, they could never be sure if the client is bound, because the client may receive a peer-closed signal, process it on another thread, and start the teardown process all while a call is being made from an unrelated thread. Even adding synchronization wouldn't help because teardown (becoming unbound) is an internal state change of the bindings.

Today, LLCPP does the same thing as HLCPP and silently drops the message if the user makes a call on a client whose bindings have torn down. I'm wondering if there are ways to be more explicit, or improve the guarantees, without panicking.
 

Using the fd analogy, I think it's similar to reading from a fd whose remote side has closed (e.g. reading from a socket whose peer has terminated). We could return an error reflecting this situation. On the other hand, if the user themselves requested teardown (similar to closing the local fd or closing a local channel), it sounds more reasonable if we chose to panic in this case. Perhaps we could implement distinct behavior based on the cause of the teardown.

This is already implemented in the HLCPP bindings:
1. If an InterfacePtr<> is bound and peer-closed is observed then the error-handler is invoked.
2. If an InterfacePtr<> is explicitly Unbind()ed then the error-handler is not invoked.

There's nothing particularly tricky about that - there's just a simple API guarantee, inherent in having Unbind()ed the InterfacePtr<> - it cannot possibly observe the peer-closed after being unbound. :)


Following this behavior, coupled with silently dropping messages, would mean that users will not get any error when they send messages on a client that has been explicitly unbound (using `Unbind`, or the LLCPP counterpart `AsyncTeardown`). I don't think this is the best behavior, but I also couldn't think of better alternatives :/
 
A slightly less trickiness is that doing so forces users to add external synchronization between making calls and teardown. Currently in LLCPP, we'd simply return the same error when making a call when teardown is happening/has completed. If we chose to panic in a subset of the cases, users would have to take a lock whenever they're making a call or calling `AsyncTeardown` on the client, to avoid triggering the panic.

We could continue to return the same error, but that doesn't mesh well with moving sync errors to async (fxbug.dev/75324), since that gets us to approach #2 in the first email in the thread which as Wez mentions is prone to use-after-free: we'd surface an asynchronous error saying that the bindings have torn down, but that callback might end up touching destroyed user objects if not careful.

That these kinds of questions are still coming up suggests that LLCPP bindings still lack a coherent threading model.

I've attempted to clarify the threading model in go/llcpp-threading-requirements. A brief summary is:
- It should be possible to implement two-phase shutdown on both client and server side. It should be possible to initiate teardown of the bindings from an arbitrary thread, and then schedule subsequent shutdown logic of user objects after teardown completion.
- Allow making FIDL calls or replies from multiple threads.
 

If I understand correctly, it is a requirement that it be possible for a single LLCPP binding to be shareable by clients running on multiple threads?

IIUC, this assumes the following model:
- Shared FIDL bindings object
- Object (user client) A, running on thread A, making call via shared FIDL bindings
- Object (user client) B, running on thread B, making call via shared FIDL bindings

The requirements LLCPP is working under is slightly different:
- FIDL bindings object
- Object (user client) A, running on thread A, B, C, D, ...., making calls via FIDL bindings

The latter model better matches the driver threading model which is a customer of LLCPP. I've cc-ed +Suraj Malhotra who may be able to speak more to this. A driver may be invoked from many different threads, and would not want to push all FIDL-related work to one particular thread. This means a single client in the former model is instead receiving calls from multiple threads.
 
With a per-client single-owner object wrapping a reference to the shared multi-threaded binding object, it should be straightforward to arrange that clients are guaranteed not to be called-back from the moment they teardown that single-owner object. 

Is that not something we can/want to do?

 I agree. However, at this point I'm not yet sure if the "multiple single-owner object referencing shared binding" would be able to satisfy all multithreading use cases.

 
 
PL
Reply all
Reply to author
Forward
0 new messages