Error space is limited to set of Zircon status values

已查看 6 次
跳至第一个未读帖子

Yaneury Fermin

未读,
2021年8月30日 21:06:382021/8/30
收件人 component-framework-dev
I recently ran into a hurdle during the `fuchsia.sys2/Realm.BindChild` [migration](https://fuchsia.dev/fuchsia-src/contribute/governance/rfcs/0108_component_binder_protocol).  I hope that my roadblock can be used to highlight a limitation of the Component Framework API, and that we can help find a solution for this.

Test Manager uses RealmBuilder to create a hermetic environment for a test suite. It takes the component URL passed in by users (`fx test COMPONENT_URL`) and adds that component to a realm with Archivist and other fancy things. When a user mistypes the component URL, Test Manager gives them a helpful message suggesting that they double-check that the test component is included in the user's package set. Internally, this path is executed when Test Manager fails to construct the realm for a given test. It interprets all failures to create the realm as a case of either unresolvable component URL (read: not in package set) or a typo.

This was fine and dandy until I came in and changed `ScopedInstance` to use `fuchsia.sys2/Realm.OpenExposedDir` instead of `BindChild`. `ScopedInstance` is the class used by RealmBuilder to create the root of the realm. This change caused certain Test Manager tests to fail because unresolvable component URLs were no longer yielding an error during realm construction. This is so because `OpenExposedDir` doesn't try to start the child component provided to it, and so the realm was created successfully since Component Manager doesn't try to resolve the test component URL. This made tests cases for this behavior, e.g. give a Test Manager an unresolvable component URL and assert that the proper error is returned, fail because Test Manager returned different errors than expected.

So faced with the need to change this API, I had to figure out how to keep Test Manager, and its tests, at parity during the migration. To solve this, there were only two options available:

1) Attempt to connect to `fuchsia.component.Binder` of the root of the realm to trigger a start. If the channel to this connection closes with an epitaph, read its epitaph to determine how to handle the error.

2) Test components are expected to expose the `fuchsia.test.Suite` protocol. This is the primary mechanism in which Test Manager communicates with the test component. If connecting to this protocol, or invoking a method on it, fails, then use the epitaph written on channel closure to handle the errors appropriately.

Option #1 is inferior to option #2 because there's no good way to reliably read the epitaph on the connection. Connecting to `fuchsia.component.Binder` is async and there's no way to know if it fails unless you block waiting for an error event. However, if the component URL *is* valid, then no such event will be dispatched, and Test Manager would wait forever. We can add a timeout to hedge against this, but then we'd be artificially delaying the runtime of Test Manager.

The problem with option #2, however, and the point of this email (took a long time to get here) is that too many errors are folded into the Zircon status value UNAVAILABLE. For Test Manager, this is problematic because it can't differentiate on *why* the connection to `fuchsia.test.Suite` fail. Most errors, whether it'd be bad manifest or unresolvable component URL, were yielding this broad status value. So Test Manager wasn't able to provide users with helpful error messages because it wouldn't be able to parse the epitaph value into a meaningful inference.

I went an implemented option #2 (fxr/573463), but beforehand had to change Component Manager's internals to write NOT_FOUND for all component resolution failures (fxr/574055). This change was probably not the best long-term solution, but it helped unblocked the `BindChild` migration, so was deemed good enough for now.

With all that context out of the way, I can now make my point. We shouldn't restrict the CF API to the size of the Zircon status error space. Instead, we should provide a mechanism to allow users to retain or retrieve CF errors when they encounter an error with a FIDL protocol provided to them via capability resolution. I haven't quite come up with a solution, but am eager to hear any ideas and if this problem isn't that big of a deal.

One thing that may work is to take advantage of the large domain of int64 and map CF errors to its values. If the values are set at a low or high enough value, I don't think we'd run the risk of colliding with the Zircon status error space.  For example, we can map CF errors to int64s at 10,000+.

Yegor Pomortsev

未读,
2021年8月30日 22:01:332021/8/30
收件人 Yaneury Fermin、component-framework-dev
There's a related discussion in fidl-dev about error handling that might be of interest.

I believe that returning zx_status_t from non-Zircon code is an anti-pattern. It works for some basic error cases but soon enough you'll want more specific errors, as you describe here. There's also the problem that zx_status_t is too broad, by design. The actual set of possible errors only described in FIDL comments.

Encoding CF errors inside an epitaph is makes an assumption that the server will not use CF's "address space" for something else. The epitaph type is determined by the server, and is not always zx_status_t.

Perhaps there's a possible events-based solution?



--
All posts must follow the Fuchsia Code of Conduct https://fuchsia.dev/fuchsia-src/CODE_OF_CONDUCT or may be removed.
---
You received this message because you are subscribed to the Google Groups "component-framework-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to component-framewo...@fuchsia.dev.
To view this discussion on the web visit https://groups.google.com/a/fuchsia.dev/d/msgid/component-framework-dev/7d932135-9bf4-4195-82cf-cad61a539b17n%40fuchsia.dev.

Shai Barack

未读,
2021年8月31日 01:04:162021/8/31
收件人 Yegor Pomortsev、Yaneury Fermin、component-framework-dev
I agree it's an anti pattern, but I can't find docs that state this.


This is an ok overview of FIDL errors, but is a bit light on best practices. I was hoping for more guidance from the wizards. 🧙‍♂️

Gary Bressler

未读,
2021年8月31日 17:24:032021/8/31
收件人 Shai Barack、Yegor Pomortsev、Yaneury Fermin、component-framework-dev
Here's the question: is it good practice to set a non-zx.status error code on an epitaph? The understanding that I have is, "it depends". If the channel speaks a specific protocol, like ComponentController, it's ok because the epitaph values can be specified as part of the protocol contract. However, for 'generic' channels, the epitaph should be a zx.status because if you set a custom error code then it's impossible for the client to discern what error space the code belongs to.

That said, setting a custom epitaph seems like it's what you want to do for routing failures. This would allow component_manager to return descriptive errors about routing failures, rather than attempt to clumsily map them onto the zx.status error space.

BTW, we may have found a workaround for this particular issue (fxb/83772), but the general problem still stands.

Yifei Teng

未读,
2021年8月31日 17:50:252021/8/31
收件人 Gary Bressler、Shai Barack、Yegor Pomortsev、Yaneury Fermin、component-framework-dev、fidl-dev
+fidl-dev  🧙‍♂️

The FIDL team has plans to improve the error handling story. Though just from reading this thread, I'm not yet certain which particular aspect of error handling is causing friction in this case. My understanding is that two points were raised:

- Not being able to return a more expressive structure in methods with the error syntax (currently only primitives are allowed). Here, we'd need to balance between an expressive error schema and burden on the ABI complexity.
- Not being able to use a custom epitaph type without confusing the client which error space the epitaph belongs to (is it my custom one or is it zx_status_t?).

Is the latter the more directly impacting issue? Would love to hear more from folks.

Cheers,
Yifei


Gary Bressler

未读,
2021年8月31日 18:29:142021/8/31
收件人 Yifei Teng、Shai Barack、Yegor Pomortsev、Yaneury Fermin、component-framework-dev、fidl-dev
Both of those problems are important, but I think this thread is mainly focused on #2. I perceive #1 to be an issue as well, but it's a much more complex problem.

Gabe Schine

未读,
2021年8月31日 18:31:272021/8/31
收件人 Gary Bressler、Yifei Teng、Shai Barack、Yegor Pomortsev、Yaneury Fermin、component-framework-dev、fidl-dev
There's an interesting issue here:

The epitaph is used to issue an error but potentially by two different parties on a single channel, depending on where the channel is in its lifecycle. When one component requests a capability, Component Manager identifies the server end component and hands it the server end handle. If something goes wrong in this routing process, CM will close the channel with an epitaph.

Once the channel handle reaches the server end, it adopts a new role: to speak the protocol the two components have agreed upon. At this point, the epitaph error domain could be different.



--
Gabe Schine
Software Engineer / Manager

Yifei Teng

未读,
2021年8月31日 18:32:402021/8/31
收件人 Gabe Schine、Gary Bressler、Shai Barack、Yegor Pomortsev、Yaneury Fermin、component-framework-dev、fidl-dev
Correct. This protocol switch is part of the reason of the epitaph type confusion/ambiguity.

Yaneury Fermin

未读,
2021年8月31日 23:30:292021/8/31
收件人 Yifei Teng、Gabe Schine、Gary Bressler、Shai Barack、Yegor Pomortsev、component-framework-dev、fidl-dev
Perhaps the switch calls for splitting the error reporting mechanism into two separate entities. Component manager can fold all errors into PEER_CLOSED. Then, iff a client end wishes to discern why Component manager failed to route a capability, it can hook up to the event capability to inspect further. The Events capability already contains a CapabilityRequested event that can be modified to contain a fuchsia.component.Error if need be.

Justin Mattson

未读,
2021年9月1日 00:06:582021/9/1
收件人 Yaneury Fermin、Yifei Teng、Gabe Schine、Gary Bressler、Shai Barack、Yegor Pomortsev、component-framework-dev、fidl-dev
Disclaimer: Sorry if this is naive or off-base.

I think we're talking about fuchsia.test.manager/Query.Enumerate. In this case if the target can't be resolved I'd think we should return a LaunchError. If it can be resolved, but the list of tests can't be enumerated, could we close the `iterator` channel with an appropriate epitaph?

Gary Bressler

未读,
2021年9月1日 12:41:292021/9/1
收件人 Justin Mattson、Yaneury Fermin、Yifei Teng、Gabe Schine、Shai Barack、Yegor Pomortsev、component-framework-dev、fidl-dev、Ankur Mittal
That's a question for the test architecture folks. The original problem that triggered this discussion was how component manager could make it possible for test_manager to make that kind of distinction in the first place.
回复全部
回复作者
转发
0 个新帖子