Test Manager uses RealmBuilder to create a hermetic environment for a test suite. It takes the component URL passed in by users (`fx test COMPONENT_URL`) and adds that component to a realm with Archivist and other fancy things. When a user mistypes the component URL, Test Manager gives them a helpful message suggesting that they double-check that the test component is included in their package set. Internally, this path is executed when Test Manager fails to construct the realm for a given test. It interprets every failure to create the realm as either an unresolvable component URL (read: not in package set) or a typo.
This was fine and dandy until I came in and changed `ScopedInstance` to use `fuchsia.sys2/Realm.OpenExposedDir` instead of `BindChild`. `ScopedInstance` is the class used by RealmBuilder to create the root of the realm. This change caused certain Test Manager tests to fail because unresolvable component URLs were no longer yielding an error during realm construction. This happens because `OpenExposedDir` doesn't try to start the child component provided to it, so Component Manager never tries to resolve the test component URL and the realm is created successfully. As a result, the test cases for this behavior (e.g. give Test Manager an unresolvable component URL and assert that the proper error is returned) failed, because Test Manager returned different errors than expected.
So, faced with the need to change this API, I had to figure out how to keep Test Manager, and its tests, at parity during the migration. I saw only two options:
1) Attempt to connect to `fuchsia.component.Binder` of the root of the realm to trigger a start. If the channel to this connection closes with an epitaph, read its epitaph to determine how to handle the error.
2) Test components are expected to expose the `fuchsia.test.Suite` protocol. This is the primary mechanism through which Test Manager communicates with the test component. If connecting to this protocol, or invoking a method on it, fails, then use the epitaph written on channel closure to handle the error appropriately.
Option #1 is inferior to option #2 because there's no good way to reliably read the epitaph on the connection. Connecting to `fuchsia.component.Binder` is async and there's no way to know if it fails unless you block waiting for an error event. However, if the component URL *is* valid, then no such event will be dispatched, and Test Manager would wait forever. We can add a timeout to hedge against this, but then we'd be artificially delaying the runtime of Test Manager.
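To make the timeout trade-off concrete, here is a rough sketch in plain Rust. It is not Test Manager code: the `std::sync::mpsc` channel only stands in for the `fuchsia.component.Binder` connection, and `BindOutcome` and `wait_for_bind_error` are hypothetical names invented for illustration.

```rust
use std::sync::mpsc::{Receiver, RecvTimeoutError};
use std::time::Duration;

/// Hypothetical outcome of waiting on the Binder connection.
#[derive(Debug, PartialEq)]
pub enum BindOutcome {
    /// The channel closed with an epitaph carrying this status.
    Failed(i32),
    /// The channel closed without delivering an epitaph.
    ClosedWithoutEpitaph,
    /// Nothing arrived before the timeout, so we assume the component started.
    AssumedStarted,
}

/// Blocks until an epitaph status arrives or `timeout` elapses.
/// `epitaphs` is a stand-in for the event stream of the real connection.
pub fn wait_for_bind_error(epitaphs: &Receiver<i32>, timeout: Duration) -> BindOutcome {
    match epitaphs.recv_timeout(timeout) {
        Ok(status) => BindOutcome::Failed(status),
        Err(RecvTimeoutError::Disconnected) => BindOutcome::ClosedWithoutEpitaph,
        // The success path always lands here, after paying the full timeout.
        Err(RecvTimeoutError::Timeout) => BindOutcome::AssumedStarted,
    }
}
```

In this sketch the failure case returns as soon as a status arrives, but every valid component URL waits out the whole timeout before the test can proceed, which is the artificial delay described above.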
The problem with option #2, however, and the point of this email (took a long time to get here), is that too many errors are folded into the Zircon status value UNAVAILABLE. For Test Manager, this is problematic because it can't tell *why* the connection to `fuchsia.test.Suite` failed. Most errors, whether a bad manifest or an unresolvable component URL, were yielding this broad status value. So Test Manager wasn't able to provide users with helpful error messages, because it couldn't infer anything meaningful from the epitaph value.
I went and implemented option #2 (fxr/573463), but beforehand had to change Component Manager's internals to write NOT_FOUND for all component resolution failures (fxr/574055). This change was probably not the best long-term solution, but it helped unblock the `BindChild` migration, so it was deemed good enough for now.
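As a rough illustration of what option #2 boils down to on the Test Manager side, here is a sketch of classifying an epitaph status. The `LaunchError` enum, `classify_epitaph`, and `user_message` are hypothetical names and messages, not the code in the changes referenced above; only the two status values are taken from Zircon's error list.

```rust
/// Zircon status values relevant to this sketch.
pub const ZX_ERR_NOT_FOUND: i32 = -25;
pub const ZX_ERR_UNAVAILABLE: i32 = -28;

/// Hypothetical classification of a failed connection to `fuchsia.test.Suite`.
#[derive(Debug, PartialEq)]
pub enum LaunchError {
    /// The component URL could not be resolved (typo or not in the package set).
    CannotResolve,
    /// The catch-all status: the real cause is not recoverable from the epitaph.
    Ambiguous,
    /// Any other status, passed through unchanged.
    Other(i32),
}

/// Maps an epitaph status to a launch error.
pub fn classify_epitaph(status: i32) -> LaunchError {
    match status {
        ZX_ERR_NOT_FOUND => LaunchError::CannotResolve,
        ZX_ERR_UNAVAILABLE => LaunchError::Ambiguous,
        other => LaunchError::Other(other),
    }
}

/// Produces the message shown to the user (wording is illustrative).
pub fn user_message(err: &LaunchError) -> String {
    match err {
        LaunchError::CannotResolve => {
            "Could not resolve the test; check the URL and your package set.".to_string()
        }
        LaunchError::Ambiguous => "The test failed to launch for an unknown reason.".to_string(),
        LaunchError::Other(status) => format!("The test failed to launch (status {}).", status),
    }
}
```

Only the NOT_FOUND branch can produce an actionable message here. Anything that arrives as UNAVAILABLE ends up in the `Ambiguous` branch, no matter what actually went wrong.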
With all that context out of the way, I can now make my point. We shouldn't restrict the CF API to the size of the Zircon status error space. Instead, we should provide a mechanism that lets users retain or retrieve CF errors when they encounter an error with a FIDL protocol provided to them via capability resolution. I haven't quite come up with a solution, but I'm eager to hear any ideas, or to hear that this problem isn't that big of a deal.
One thing that may work is to take advantage of the large domain of int64 and map CF errors to its values. If the values are set low or high enough, I don't think we'd run the risk of colliding with the Zircon status error space. For example, we can map CF errors to int64s at 10,000+.
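A minimal sketch of that idea, assuming a base offset of 10,000 and a made-up set of CF errors. The `CfError` variants, `CF_ERROR_BASE`, and `decode` are all hypothetical. The sketch relies on Zircon statuses being zero or negative, so positive values are free for CF use.

```rust
/// Hypothetical CF errors; the variants are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum CfError {
    ResolveFailed,
    ManifestInvalid,
    StartFailed,
}

/// Assumed base offset, following the "10,000+" suggestion.
pub const CF_ERROR_BASE: i64 = 10_000;

impl CfError {
    /// Encodes this error as an int64 in the CF range.
    pub fn to_code(self) -> i64 {
        CF_ERROR_BASE
            + match self {
                CfError::ResolveFailed => 0,
                CfError::ManifestInvalid => 1,
                CfError::StartFailed => 2,
            }
    }

    /// Decodes a CF-range value; returns `None` for anything else.
    pub fn from_code(code: i64) -> Option<CfError> {
        match code.checked_sub(CF_ERROR_BASE)? {
            0 => Some(CfError::ResolveFailed),
            1 => Some(CfError::ManifestInvalid),
            2 => Some(CfError::StartFailed),
            _ => None,
        }
    }
}

/// What a client sees after decoding a received value.
#[derive(Debug, PartialEq)]
pub enum Decoded {
    /// Zero or negative: treated as an ordinary Zircon status.
    Zircon(i64),
    /// A recognized CF error.
    Cf(CfError),
    /// Positive but not a known CF error.
    Unknown(i64),
}

pub fn decode(code: i64) -> Decoded {
    if code <= 0 {
        return Decoded::Zircon(code);
    }
    match CfError::from_code(code) {
        Some(err) => Decoded::Cf(err),
        None => Decoded::Unknown(code),
    }
}
```

With this layout, a client that only understands Zircon statuses still sees negative values unchanged, while a CF-aware client can recover the specific error from the positive range.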