Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Adding other reviewers. Please also take a look, thanks!
// receiver_.ReportBadMessage(context_lost_info);
Do we need it?
void MLContext::OnWebNNContextLostCaptured(const String& context_lost_info) {
Should this info be reported to JS code?
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
// TODO(crbug.com/41492165): generate error using context.
Can we handle context lost here as well?
// TODO(crbug.com/41492165): generate error using context.
nit: remove TODO?
void HandleWebNNContextLost(HRESULT hr, const WebNNContextImpl* context) {
Shouldn't this be off the `WebNNContextImpl`?
} else if (hr == DXGI_ERROR_DEVICE_REMOVED) {
It would be helpful to LOG the actual reason too via GetDeviceRemovedReason().
context_provider_->OnConnectionError(this);
If the context gets destroyed, isn't that considered "lost"?
FakeWebNNContextImpl(mojo::PendingReceiver<mojom::WebNNContext> receiver,
Looks like we have no test cases which generate a context lost (and ensure delivery).
What's our plan there?
// TODO(crbug.com/41492165): generate error using context.
Can we handle context lost here as well?
Are we assuming that any failure to create a buffer or graph means the context has been lost? That seems strange. I expect that context loss is a signal we specifically get from the adapter.
// This interface runs in the GPU process and is called from the GPU process.
```suggestion
// This interface runs in the renderer process and is called from the GPU process.
```
interface WebNNContextLostCapturer {
Can we generalize this to `WebNNContextClient` to match the style of other Mojo interfaces where we have a `Foo` interface and a `FooClient` interface for any async events that need to be sent back to the consumer of the `Foo` interface?
context_lost_capturer_remote = mojo::NullRemote());
Remove this default parameter and pass this parameter through all the subclasses.
context_provider_->OnConnectionError(this);
If the context gets destroyed, isn't that considered "lost"?
If this method is being called then the renderer isn't listening anymore.
// receiver_.ReportBadMessage(context_lost_info);
Do we need it?
No, this isn't a bad message.
// Reset the remote of `WebNNContext` and the receiver of
// `WebNNContextLostCapturer`, and then resolve the `lost_property_` which
// indicates that the WebNN context is lost. It will be called in two cases:
// 1. The `WebNNContext` remote is cut off from its receiver, for example,
// when the WebNN Service is crashed.
// 2. We need to release the WebNNContext in the WebNN Service when the the
// WebNN Service captured a WebNN context lost error.
```suggestion
// Closes the `remote_context_` and `context_lost_capturer_receiver_` pipes
// because the context has been lost.
```
&MLContext::OnWebNNContextDisconnected, WrapWeakPersistent(this)));
I would attach this to `context_lost_capturer_receiver_` so that we have the invariant that either we get a call to `OnWebNNContextLostCaptured()` explaining why the context has been lost or we get a call to this handler to indicate we never will (and assume a GPU process crash).
"WebNN context is lost due to connection error."));
Call `OnWebNNContextLostCaptured()` with this error. Put all the logic for reporting errors in that function.
] interface MLContextLostInfo {
Use a dictionary instead of an interface since this type doesn't have any methods. I know there are a number of examples where this sort of type is defined as an interface but most of the ones I'm responsible for (such as the WebUSB transfer result types) I wish I'd defined as dictionaries instead.
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
// TODO(crbug.com/41492165): generate error using context.
Reilly GrantCan we handle context lost here as well?
Are we assuming that any failure to create a buffer or graph means the context has been lost? That seems strange. I expect that context loss is a signal we specifically get from the adapter.
Are we assuming that any failure to create a buffer or graph means the context has been lost?
We could, like WebGPU. If we do not then the OOM skews to some, possibly a unrelated call, which does handle context lost which is hard to debug.
context_provider_->OnConnectionError(this);
Reilly GrantIf the context gets destroyed, isn't that considered "lost"?
If this method is being called then the renderer isn't listening anymore.
I suppose the renderer or `MLContext` could generate "lost" event beforehand. Not exactly certain why this was needed in WebGPU.
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
// TODO(crbug.com/41492165): generate error using context.
Reilly GrantCan we handle context lost here as well?
Bryan BernhartAre we assuming that any failure to create a buffer or graph means the context has been lost? That seems strange. I expect that context loss is a signal we specifically get from the adapter.
Are we assuming that any failure to create a buffer or graph means the context has been lost?
We could, like WebGPU. If we do not then the OOM skews to some, possibly a unrelated call, which does handle context lost which is hard to debug.
I don't think we should be reporting a context loss for errors like this. That's a separate concept.
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
// TODO(crbug.com/41492165): generate error using context.
Reilly GrantCan we handle context lost here as well?
Bryan BernhartAre we assuming that any failure to create a buffer or graph means the context has been lost? That seems strange. I expect that context loss is a signal we specifically get from the adapter.
Reilly GrantAre we assuming that any failure to create a buffer or graph means the context has been lost?
We could, like WebGPU. If we do not then the OOM skews to some, possibly a unrelated call, which does handle context lost which is hard to debug.
I don't think we should be reporting a context loss for errors like this. That's a separate concept.
If this returns E_OUTOFMEMORY, the context will become inoperable and effectively lost. Also, the HR could be any other error result (but unlikely) so I can see it being quite useful to coverage to one error handler.
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Addressed some of your comments, will continue addressing others.
// TODO(crbug.com/41492165): generate error using context.
Mingming1 Xunit: remove TODO?
Done
// This interface runs in the GPU process and is called from the GPU process.
```suggestion
// This interface runs in the renderer process and is called from the GPU process.
```
Done
interface WebNNContextLostCapturer {
Can we generalize this to `WebNNContextClient` to match the style of other Mojo interfaces where we have a `Foo` interface and a `FooClient` interface for any async events that need to be sent back to the consumer of the `Foo` interface?
Good idea! Now I delete this file to move the `WebNNContextClient` to `webnn_context_provider.mojom`, PTAL.
Remove this default parameter and pass this parameter through all the subclasses.
Done
context_provider_->OnConnectionError(this);
Reilly GrantIf the context gets destroyed, isn't that considered "lost"?
Bryan BernhartIf this method is being called then the renderer isn't listening anymore.
I suppose the renderer or `MLContext` could generate "lost" event beforehand. Not exactly certain why this was needed in WebGPU.
Yes, I think this method will only be called when the renderer destroys the WebNNContext pipe.
Reilly GrantDo we need it?
No, this isn't a bad message.
Done
// Reset the remote of `WebNNContext` and the receiver of
// `WebNNContextLostCapturer`, and then resolve the `lost_property_` which
// indicates that the WebNN context is lost. It will be called in two cases:
// 1. The `WebNNContext` remote is cut off from its receiver, for example,
// when the WebNN Service is crashed.
// 2. We need to release the WebNNContext in the WebNN Service when the the
// WebNN Service captured a WebNN context lost error.
```suggestion
// Closes the `remote_context_` and `context_lost_capturer_receiver_` pipes
// because the context has been lost.
```
Done. I move these comments above the `OnWebNNContextLostCaptured` because now `OnWebNNContextDisconnected` will call it. Also rename them as `OnLostCaptured` and `OnDisconnected` to simplify.
&MLContext::OnWebNNContextDisconnected, WrapWeakPersistent(this)));
I would attach this to `context_lost_capturer_receiver_` so that we have the invariant that either we get a call to `OnWebNNContextLostCaptured()` explaining why the context has been lost or we get a call to this handler to indicate we never will (and assume a GPU process crash).
Done
void MLContext::OnWebNNContextLostCaptured(const String& context_lost_info) {
Should this info be reported to JS code?
Now I reported these info related to adapter error in `src\services\webnn\dml\utils.cc`:
`std::string context_lost_info;
if (hr == E_OUTOFMEMORY) {
context_lost_info = "out of memory.";
} else if (hr == DXGI_ERROR_DEVICE_REMOVED) {
context_lost_info = "device removed.";
} else if (hr == DXGI_ERROR_DEVICE_HUNG) {
context_lost_info = "device hung.";
} else if (hr == DXGI_ERROR_DEVICE_RESET) {
context_lost_info = "device reset.";
} else {
return;
}`
Call `OnWebNNContextLostCaptured()` with this error. Put all the logic for reporting errors in that function.
Done
String message_;
Mingming1 Xuconst String?
Now I use `dictionary MLContextLostInfo`, the header file will be generated automatically.
] interface MLContextLostInfo {
Use a dictionary instead of an interface since this type doesn't have any methods. I know there are a number of examples where this sort of type is defined as an interface but most of the ones I'm responsible for (such as the WebUSB transfer result types) I wish I'd defined as dictionaries instead.
Done. I deleted this file and remove the dictionary definition to `ml_context.idl`, PTAL.
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
HandleWebNNContextLost(hr, context);
@rafael....@microsoft.com, can you confirm that all graph creation failures should be considered "context lost" scenarios?
std::string message;
nit: use `std::string_view` and a `switch` statement.
// Called by the GPU process to create a message pipe for
// `WebNNContextClient`, the `message` is the error message
// sent to the renderer process, the renderer process will
// override this method to handle the lost error.
```suggestion
// Called to explain why the `WebNNContextClient` is no longer available.
```
OnLostCaptured(string message);
To match WebGPU and better express that the context is unusable after this.
```suggestion
OnDestroyed(string message);
```
client_remote_->OnLostCaptured(message);
After sending this message we should call `context_provider_->OnConnectionError(this)` so that the context provider destroys this object.
#include "services/webnn/public/mojom/webnn_context_lost_capturer.mojom.h"
This file no longer exists.
context_provider_->OnConnectionError(this);
Reilly GrantIf the context gets destroyed, isn't that considered "lost"?
Bryan BernhartIf this method is being called then the renderer isn't listening anymore.
Mingming1 XuI suppose the renderer or `MLContext` could generate "lost" event beforehand. Not exactly certain why this was needed in WebGPU.
Yes, I think this method will only be called when the renderer destroys the WebNNContext pipe.
Dug into this more. Looks like WebGPU considers destroyed as "lost" to communicate the context/device shouldn't be kept alive, by the web developer, if some transient (but still fatal) failure occurred. A transient error could be OOM, for example.
HandleWebNNContextLost(hr, context);
@rafael....@microsoft.com, can you confirm that all graph creation failures should be considered "context lost" scenarios?
I'm glad we're having this conversation. :-)
Here are the three errors Chromium should handle from DirectML and Direct3D:
1- `DXGI_ERROR_DEVICE_REMOVED`
"Device removed" can happen for reasons Chromium is not able to control such as a driver update, other application on the system behaving poorly or the adapter actually being physically removed.
The latter can happen if you're performing computation on the Surface Book's NVidia GPU (which is under the keyboard) and the user detaches the tablet portion from the keyboard. The Intel GPU, which is behind the monitor, is the only one left you can use.
Since removed devices are basically useless and the D3D12 device is a singleton per process, and per adapter, it is difficult for WebNN, or other individual Chromium system to recover from this on its own. Hence, other code in the GPU process detects the device has become removed (at least on the D3D11 device) and terminates the GPU process with a special error code. The browser process restarts it.
Separately from this change, we should work with the people who own that code in Chromium to make WebNN be a part of it somehow.
For this case, #1, the renderer process will see that the GPU process has been disconnected and we should resolve the "lost" promise.
2- `E_OUTOFMEMORY`
It is possible to handle `E_OUTOFMEMORY` without everything becoming hosed. _However_, when the GPU process reaches the job commit limit, the browser process is notified by the OS. Upon the browser process being notified, it terminates and restarts the GPU process. So, you have a race between the GPU process handling this gracefully and the browser process terminating it.
I think we should handle this case gracefully. In practice, the web developer will see the GPU process connection get terminated, same as the "device removed".
3- <Everything else>
For any other error, we should treat it as Chromium passing invalid parameters to the API and do a `CHECK` so that we can get crash dumps and diagnose.
Here is the [Dawn code](https://source.chromium.org/chromium/chromium/src/+/main:third_party/dawn/src/dawn/native/d3d/D3DError.cpp) which handles errors similar to what I described above.
We should `LOG(ERROR)` any failure we receive from DirectX along with the `logging::SystemErrorCodeToString` friendly string, even ones we CHECK for. This will become critical to root causing failures in the wild from web developers. Pooja has already converted these but we should convert others that might have been missed.
HandleWebNNContextLost(hr, context);
Rafael Cintron@rafael....@microsoft.com, can you confirm that all graph creation failures should be considered "context lost" scenarios?
I'm glad we're having this conversation. :-)
Here are the three errors Chromium should handle from DirectML and Direct3D:
1- `DXGI_ERROR_DEVICE_REMOVED`
"Device removed" can happen for reasons Chromium is not able to control such as a driver update, other application on the system behaving poorly or the adapter actually being physically removed.The latter can happen if you're performing computation on the Surface Book's NVidia GPU (which is under the keyboard) and the user detaches the tablet portion from the keyboard. The Intel GPU, which is behind the monitor, is the only one left you can use.
Since removed devices are basically useless and the D3D12 device is a singleton per process, and per adapter, it is difficult for WebNN, or other individual Chromium system to recover from this on its own. Hence, other code in the GPU process detects the device has become removed (at least on the D3D11 device) and terminates the GPU process with a special error code. The browser process restarts it.
Separately from this change, we should work with the people who own that code in Chromium to make WebNN be a part of it somehow.
For this case, #1, the renderer process will see that the GPU process has been disconnected and we should resolve the "lost" promise.
2- `E_OUTOFMEMORY`
It is possible to handle `E_OUTOFMEMORY` without everything becoming hosed. _However_, when the GPU process reaches the job commit limit, the browser process is notified by the OS. Upon the browser process being notified, it terminates and restarts the GPU process. So, you have a race between the GPU process handling this gracefully and the browser process terminating it.I think we should handle this case gracefully. In practice, the web developer will see the GPU process connection get terminated, same as the "device removed".
3- <Everything else>
For any other error, we should treat it as Chromium passing invalid parameters to the API and do a `CHECK` so that we can get crash dumps and diagnose.Here is the [Dawn code](https://source.chromium.org/chromium/chromium/src/+/main:third_party/dawn/src/dawn/native/d3d/D3DError.cpp) which handles errors similar to what I described above.
We should `LOG(ERROR)` any failure we receive from DirectX along with the `logging::SystemErrorCodeToString` friendly string, even ones we CHECK for. This will become critical to root causing failures in the wild from web developers. Pooja has already converted these but we should convert others that might have been missed.
My concern is mainly that context loss is not an error handling mechanism. If the system is broken in a way that is out of the developer's control then we can report a context loss. If the developer has asked for something unreasonable that's an error we should throw.
In general we should not `CHECK` on OS return values. If we want to collect crash reports to diagnose unexpected errors we can use `base::debug::DumpWithoutCrashing()`.
// TODO(crbug.com/41492165): generate error using context.
Reilly GrantCan we handle context lost here as well?
Bryan BernhartAre we assuming that any failure to create a buffer or graph means the context has been lost? That seems strange. I expect that context loss is a signal we specifically get from the adapter.
Reilly GrantAre we assuming that any failure to create a buffer or graph means the context has been lost?
We could, like WebGPU. If we do not then the OOM skews to some, possibly a unrelated call, which does handle context lost which is hard to debug.
Bryan BernhartI don't think we should be reporting a context loss for errors like this. That's a separate concept.
If this returns E_OUTOFMEMORY, the context will become inoperable and effectively lost. Also, the HR could be any other error result (but unlikely) so I can see it being quite useful to coverage to one error handler.
This discussion should be merged into above one https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/ad47c8af_d3f27f05/: whether we should handle the context lost according to the system error code `HRESULT`.
HandleWebNNContextLost(hr, context);
Rafael Cintron@rafael....@microsoft.com, can you confirm that all graph creation failures should be considered "context lost" scenarios?
Reilly GrantI'm glad we're having this conversation. :-)
Here are the three errors Chromium should handle from DirectML and Direct3D:
1- `DXGI_ERROR_DEVICE_REMOVED`
"Device removed" can happen for reasons Chromium is not able to control such as a driver update, other application on the system behaving poorly or the adapter actually being physically removed.The latter can happen if you're performing computation on the Surface Book's NVidia GPU (which is under the keyboard) and the user detaches the tablet portion from the keyboard. The Intel GPU, which is behind the monitor, is the only one left you can use.
Since removed devices are basically useless and the D3D12 device is a singleton per process, and per adapter, it is difficult for WebNN, or other individual Chromium system to recover from this on its own. Hence, other code in the GPU process detects the device has become removed (at least on the D3D11 device) and terminates the GPU process with a special error code. The browser process restarts it.
Separately from this change, we should work with the people who own that code in Chromium to make WebNN be a part of it somehow.
For this case, #1, the renderer process will see that the GPU process has been disconnected and we should resolve the "lost" promise.
2- `E_OUTOFMEMORY`
It is possible to handle `E_OUTOFMEMORY` without everything becoming hosed. _However_, when the GPU process reaches the job commit limit, the browser process is notified by the OS. Upon the browser process being notified, it terminates and restarts the GPU process. So, you have a race between the GPU process handling this gracefully and the browser process terminating it.I think we should handle this case gracefully. In practice, the web developer will see the GPU process connection get terminated, same as the "device removed".
3- <Everything else>
For any other error, we should treat it as Chromium passing invalid parameters to the API and do a `CHECK` so that we can get crash dumps and diagnose.Here is the [Dawn code](https://source.chromium.org/chromium/chromium/src/+/main:third_party/dawn/src/dawn/native/d3d/D3DError.cpp) which handles errors similar to what I described above.
We should `LOG(ERROR)` any failure we receive from DirectX along with the `logging::SystemErrorCodeToString` friendly string, even ones we CHECK for. This will become critical to root causing failures in the wild from web developers. Pooja has already converted these but we should convert others that might have been missed.
My concern is mainly that context loss is not an error handling mechanism. If the system is broken in a way that is out of the developer's control then we can report a context loss. If the developer has asked for something unreasonable that's an error we should throw.
In general we should not `CHECK` on OS return values. If we want to collect crash reports to diagnose unexpected errors we can use `base::debug::DumpWithoutCrashing()`.
@rafael....@microsoft.com should I handle all the DXGI, D3D11, D3D12 errors just like [Dawn code](https://source.chromium.org/chromium/chromium/src/+/main:third_party/dawn/src/dawn/native/d3d/D3DError.cpp) as context lost errors? Currently I only handle the `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED`, `DXGI_ERROR_DEVICE_HUNG` and `DXGI_ERROR_DEVICE_RESET`.
void HandleWebNNContextLost(HRESULT hr, const WebNNContextImpl* context) {
Shouldn't this be off the `WebNNContextImpl`?
Done
nit: use `std::string_view` and a `switch` statement.
Done
} else if (hr == DXGI_ERROR_DEVICE_REMOVED) {
It would be helpful to LOG the actual reason too via GetDeviceRemovedReason().
choice 1: add `LOG(ERROR) << logging::SystemErrorCodeToString(hr);` for each case.
choice 2: call `ID3D12Device::GetDeviceRemovedReason` and LOG the result of it. But I am not sure which case indicates device removed, is it only the `DXGI_ERROR_DEVICE_REMOVED`?
choice 3: both 1 and 2
Any ideas? @rafael....@microsoft.com
interface WebNNContextLostCapturer {
Mingming1 XuCan we generalize this to `WebNNContextClient` to match the style of other Mojo interfaces where we have a `Foo` interface and a `FooClient` interface for any async events that need to be sent back to the consumer of the `Foo` interface?
Good idea! Now I delete this file to move the `WebNNContextClient` to `webnn_context_provider.mojom`, PTAL.
Done
// Called by the GPU process to create a message pipe for
// `WebNNContextClient`, the `message` is the error message
// sent to the renderer process, the renderer process will
// override this method to handle the lost error.
```suggestion
// Called to explain why the `WebNNContextClient` is no longer available.
```
Done
To match WebGPU and better express that the context is unusable after this.
```suggestion
OnDestroyed(string message);
```
Done
context_provider_->OnConnectionError(this);
Reilly GrantIf the context gets destroyed, isn't that considered "lost"?
Bryan BernhartIf this method is being called then the renderer isn't listening anymore.
Mingming1 XuI suppose the renderer or `MLContext` could generate "lost" event beforehand. Not exactly certain why this was needed in WebGPU.
Bryan BernhartYes, I think this method will only be called when the renderer destroys the WebNNContext pipe.
Dug into this more. Looks like WebGPU considers destroyed as "lost" to communicate the context/device shouldn't be kept alive, by the web developer, if some transient (but still fatal) failure occurred. A transient error could be OOM, for example.
Can we resolve this discussion? @bryan.b...@intel.com
After sending this message we should call `context_provider_->OnConnectionError(this)` so that the context provider destroys this object.
Yes, I intended to trigger the `WebNNContextImpl::OnConnectionError` by the `context_remote_.reset();` in blink side. I am OK to explicitly call it here.
#include "services/webnn/public/mojom/webnn_context_lost_capturer.mojom.h"
This file no longer exists.
Done
FakeWebNNContextImpl(mojo::PendingReceiver<mojom::WebNNContext> receiver,
Looks like we have no test cases which generate a context lost (and ensure delivery).
What's our plan there?
Yes, agree, I should add tests for that, will do in following patchset.
// Reset the remote of `WebNNContext` and the receiver of
// `WebNNContextLostCapturer`, and then resolve the `lost_property_` which
// indicates that the WebNN context is lost. It will be called in two cases:
// 1. The `WebNNContext` remote is cut off from its receiver, for example,
// when the WebNN Service is crashed.
// 2. We need to release the WebNNContext in the WebNN Service when the the
// WebNN Service captured a WebNN context lost error.
Mingming1 Xu```suggestion
// Closes the `remote_context_` and `context_lost_capturer_receiver_` pipes
// because the context has been lost.
```
Done. I move these comments above the `OnWebNNContextLostCaptured` because now `OnWebNNContextDisconnected` will call it. Also rename them as `OnLostCaptured` and `OnDisconnected` to simplify.
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Code-Review | +1 |
LGTM with nits.
I think I'm satisfied that the situations within the DirectML backend where we treat an error as context loss represent errors which are not the developer's responsibility and are instead situations in which the system has gotten in the an unrecoverable state (e.g. the GPU process is nearly at its commit limit) and so the only reasonable course of action is to clean everything up and start again.
We will have to see how this works out in real-world usage.
raw_ptr<WebNNContextImpl> context_;
Why this change?
"Invalid script state.")));
```suggestion
"Context is lost.")));
```
// Remote context gets automatically unbound when the execution context
// destructs.
if (!context_remote_.is_bound()) {
return;
}
This check should move into `MLContext::createBuffer()` and throw a "Context is lost." error. We can then `CHECK()` here that `context_remote_` is bound.
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
DLOG(ERROR) << error_message << " " << logging::SystemErrorCodeToString(hr);
I believe that by the time we get here, we should have already `LOG`ed all of the errors encountered. If so, we may not need this particular `DLOG`.
But, I'm OK with leaving it in and converting to `LOG`. We should prepend the incoming error with a "[WebNN Graph Build Error]" string so people looking at the log know where the error came from.
OK to be a separate change and also fixup the other "HandleX" methods that follow a similar pattern.
HandleWebNNContextLost(hr, context);
Rafael Cintron@rafael....@microsoft.com, can you confirm that all graph creation failures should be considered "context lost" scenarios?
Reilly GrantI'm glad we're having this conversation. :-)
Here are the three errors Chromium should handle from DirectML and Direct3D:
1- `DXGI_ERROR_DEVICE_REMOVED`
"Device removed" can happen for reasons Chromium is not able to control such as a driver update, other application on the system behaving poorly or the adapter actually being physically removed.The latter can happen if you're performing computation on the Surface Book's NVidia GPU (which is under the keyboard) and the user detaches the tablet portion from the keyboard. The Intel GPU, which is behind the monitor, is the only one left you can use.
Since removed devices are basically useless and the D3D12 device is a singleton per process, and per adapter, it is difficult for WebNN, or other individual Chromium system to recover from this on its own. Hence, other code in the GPU process detects the device has become removed (at least on the D3D11 device) and terminates the GPU process with a special error code. The browser process restarts it.
Separately from this change, we should work with the people who own that code in Chromium to make WebNN be a part of it somehow.
For this case, #1, the renderer process will see that the GPU process has been disconnected and we should resolve the "lost" promise.
2- `E_OUTOFMEMORY`
It is possible to handle `E_OUTOFMEMORY` without everything becoming hosed. _However_, when the GPU process reaches the job commit limit, the browser process is notified by the OS. Upon the browser process being notified, it terminates and restarts the GPU process. So, you have a race between the GPU process handling this gracefully and the browser process terminating it.I think we should handle this case gracefully. In practice, the web developer will see the GPU process connection get terminated, same as the "device removed".
3- <Everything else>
For any other error, we should treat it as Chromium passing invalid parameters to the API and do a `CHECK` so that we can get crash dumps and diagnose.Here is the [Dawn code](https://source.chromium.org/chromium/chromium/src/+/main:third_party/dawn/src/dawn/native/d3d/D3DError.cpp) which handles errors similar to what I described above.
We should `LOG(ERROR)` any failure we receive from DirectX along with the `logging::SystemErrorCodeToString` friendly string, even ones we CHECK for. This will become critical to root causing failures in the wild from web developers. Pooja has already converted these but we should convert others that might have been missed.
Mingming1 XuMy concern is mainly that context loss is not an error handling mechanism. If the system is broken in a way that is out of the developer's control then we can report a context loss. If the developer has asked for something unreasonable that's an error we should throw.
In general we should not `CHECK` on OS return values. If we want to collect crash reports to diagnose unexpected errors we can use `base::debug::DumpWithoutCrashing()`.
@rafael....@microsoft.com should I handle all the DXGI, D3D11, D3D12 errors just like [Dawn code](https://source.chromium.org/chromium/chromium/src/+/main:third_party/dawn/src/dawn/native/d3d/D3DError.cpp) as context lost errors? Currently I only handle the `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED`, `DXGI_ERROR_DEVICE_HUNG` and `DXGI_ERROR_DEVICE_RESET`.
In general we should not CHECK on OS return values. If we want to collect crash reports to diagnose unexpected errors we can use base::debug::DumpWithoutCrashing().
I disagree with this thinking, in general. There are many instances in Chromium today where we CHECK that OS APIs return success values or, in limited cases, known errors. The alternative is to gracefully handle (and, of course, have tests for) all of the combinations of failures you can get.
In rendering code, an incorrectly handled error case in a stateful API can result in very difficult to debug "my browser window is black" bugs. Correlating these bug reports with `DumpWithoutCrashing` metrics is not straightforward. Aside from being willing to upload a full memory dump of the browser process to a Chromium engineer to investigate, there's not much an end user can do to help themselves (and you) root cause because, well, their browser is black.
We do need to have provisions for gracefully handling things when Chromium is not at fault. But, in my opinion, errors like `E_INVALIDARG` should always cause a `CHECK`. There is a general trend in Chromium to fail fast in more instances where previously, we would have limped along.
My concern is mainly that context loss is not an error handling mechanism. If the system is broken in a way that is out of the developer's control then we can report a context loss. If the developer has asked for something unreasonable that's an error we should throw.
I share this concern and I like how you and others have championed synchronous errors getting returned to web developers at graph building time rather than compilation or inferencing time. WebGPU/WebGL do not quite have this luxury and, in the case of WebGPU, have an "ErrorScope" mechanism to help you make sense of errors asynchronously.
For WebNN, the only case for "the developer has asked for something unreasonable" that comes to mind is `MLBuffer` creation. We can continue to handle this with "content lost" (similar to WebGL/WebGPU) or we can go further and make buffer creation be asynchronous. I go back and forth on it myself. Might be best to tackle asynchronous buffer creation as a separate discussion thread.
rafael....@microsoft.com should I handle all the DXGI, D3D11, D3D12 errors just like Dawn code as context lost errors? Currently I only handle the E_OUTOFMEMORY, DXGI_ERROR_DEVICE_REMOVED, DXGI_ERROR_DEVICE_HUNG and DXGI_ERROR_DEVICE_RESET.
`DEVICE_HUNG` is the application's fault. `DEVICE_RESET` is usually not.
With that in mind, let's have the final list of errors we handle gracefully be `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED` and `DXGI_ERROR_DEVICE_RESET`. All others, I think we should CHECK.
void HandleWebNNContextLost(HRESULT hr, ContextImplDml* context) {
The "HandleXFailure" methods which call `HandleWebNNContextLost` have code in common.
To better unify them, I think we should structure `HandleWebNNContextLost` as follows:
1-If the `hr` is `E_OUTOFMEMORY`, set the web developer friendly error message to "out of memory", otherwise, set it to "device removed." Having an "internal error" for the non-device removed ones may give an attacker information about whether the error is exploitable but I don't feel strongly about this.
2-Call the `OnDestroyed` callback.
3-`LOG(ERROR) << "[WebNN]: " << logging::SystemErrorCodeToString(hr)`.
4-`CHECK` that the error is either `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED`, or `DXGI_ERROR_DEVICE_RESET`.
We could also have this function run the callback if we templatize it on the callback type and error type. If that's too complicated, we can have it return the developer friendly string to the caller and have the caller use it to run the callback.
WDYT?
if (!context) {
return;
}
`context` is always non-nullptr so we should be able to remove this if statement.
I'm trying to reconcile this with [our official guidance](https://chromium.googlesource.com/chromium/src/+/main/styleguide/c++/checks.md) because I generally agree that some interactions with OS APIs can be considered invariants, like `E_INVALIDARG` errors, if we're confident that the API isn't going to change out from underneath us.
@rei...@chromium.org, which part of the official guidance were you specifically referring to?
// TODO(crbug.com/41492165): generate error using context.
Reilly GrantCan we handle context lost here as well?
Bryan BernhartAre we assuming that any failure to create a buffer or graph means the context has been lost? That seems strange. I expect that context loss is a signal we specifically get from the adapter.
Reilly GrantAre we assuming that any failure to create a buffer or graph means the context has been lost?
We could, like WebGPU. If we do not then the OOM skews to some, possibly a unrelated call, which does handle context lost which is hard to debug.
Bryan BernhartI don't think we should be reporting a context loss for errors like this. That's a separate concept.
Mingming1 XuIf this returns E_OUTOFMEMORY, the context will become inoperable and effectively lost. Also, the HR could be any other error result (but unlikely) so I can see it being quite useful to coverage to one error handler.
This discussion should be merged into above one https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/ad47c8af_d3f27f05/: whether we should handle the context lost according to the system error code `HRESULT`.
I am thinking maybe all the methods that return `HRESULT` inside or outside should be handled by `HandleWebNNContextLost`, if the `HRESULT` is context lost error, we should call `OnDestoryed`. WDYT? @rafael....@microsoft.com
DLOG(ERROR) << error_message << " " << logging::SystemErrorCodeToString(hr);
I believe that by the time we get here, we should have already `LOG`ed all of the errors encountered. If so, we may not need this particular `DLOG`.
But, I'm OK with leaving it in and converting to `LOG`. We should prepend the incoming error with a "[WebNN Graph Build Error]" string so people looking at the log know where the error came from.
OK to be a separate change and also fixup the other "HandleX" methods that follow a similar pattern.
Agree, thanks! https://issues.chromium.org/issues/347822602 tracked.
void HandleWebNNContextLost(HRESULT hr, ContextImplDml* context) {
The "HandleXFailure" methods which call `HandleWebNNContextLost` have code in common.
To better unify them, I think we should structure `HandleWebNNContextLost` as follows:
1-If the `hr` is `E_OUTOFMEMORY`, set the web developer friendly error message to "out of memory", otherwise, set it to "device removed." Having an "internal error" for the non-device removed ones may give an attacker information about whether the error is exploitable but I don't feel strongly about this.
2-Call the `OnDestroyed` callback.
3-`LOG(ERROR) << "[WebNN]: " << logging::SystemErrorCodeToString(hr)`.
4-`CHECK` that the error is either `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED`, or `DXGI_ERROR_DEVICE_RESET`.We could also have this function run the callback if we templatize it on the callback type and error type. If that's too complicated, we can have it return the developer friendly string to the caller and have the caller use it to run the callback.
WDYT?
I am not clear for the step 1,3,4, please help me understand.
Currently I handled `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED`,`DXGI_ERROR_DEVICE_RESET` three ones to call `OnDestroyed`, otherwise I directly return and won't call `OnDestroyed`.
We could also have this function run the callback if we templatize it on the callback type and error type. If that's too complicated, we can have it return the developer friendly string to the caller and have the caller use it to run the callback.
The `OnDestroyed` method is called to send context lost message to renderer, I suppose we needn't LOG(ERROR) the context lost message again and let the caller run the callback (`CreateGraphCallback` and `ComputeCallback`)? As current code base, I only want to get the context lost message, and then call `OnDestroyed` with the message in `HandleWebNNContextLost`.
`context` is always non-nullptr so we should be able to remove this if statement.
Done
} else if (hr == DXGI_ERROR_DEVICE_REMOVED) {
Mingming1 XuIt would be helpful to LOG the actual reason too via GetDeviceRemovedReason().
choice 1: add `LOG(ERROR) << logging::SystemErrorCodeToString(hr);` for each case.
choice 2: call `ID3D12Device::GetDeviceRemovedReason` and LOG the result of it. But I am not sure which case indicates device removed, is it only the `DXGI_ERROR_DEVICE_REMOVED`?
choice 3: both 1 and 2Any ideas? @rafael....@microsoft.com
Think again, we may needn't call `LOG(ERROR) << logging::SystemErrorCodeToString(hr)` again here, it can be called outside, I want to leave this method only responsible for handling specific context lost error message which will be sent by `OnDestroyed` to renderer.
raw_ptr<WebNNContextImpl> context_;
Mingming1 XuWhy this change?
Done
```suggestion
"Context is lost.")));
```
Done
// Remote context gets automatically unbound when the execution context
// destructs.
if (!context_remote_.is_bound()) {
return;
}
This check should move into `MLContext::createBuffer()` and throw a "Context is lost." error. We can then `CHECK()` here that `context_remote_` is bound.
Done
Mingming1 XuUse a dictionary instead of an interface since this type doesn't have any methods. I know there are a number of examples where this sort of type is defined as an interface but most of the ones I'm responsible for (such as the WebUSB transfer result types) I wish I'd defined as dictionaries instead.
Mingming1 XuDone. I deleted this file and remove the dictionary definition to `ml_context.idl`, PTAL.
Done
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
context_provider_->OnConnectionError(this);
Reilly GrantIf the context gets destroyed, isn't that considered "lost"?
Bryan BernhartIf this method is being called then the renderer isn't listening anymore.
Mingming1 XuI suppose the renderer or `MLContext` could generate "lost" event beforehand. Not exactly certain why this was needed in WebGPU.
Bryan BernhartYes, I think this method will only be called when the renderer destroys the WebNNContext pipe.
Mingming1 XuDug into this more. Looks like WebGPU considers destroyed as "lost" to communicate the context/device shouldn't be kept alive, by the web developer, if some transient (but still fatal) failure occurred. A transient error could be OOM, for example.
Can we resolve this discussion? @bryan.b...@intel.com
@mingmi...@intel.com Yes, thanks for following up. Thinking some more, I don't think we need to address this until we need `MLContext.destroy()`.
} else if (hr == DXGI_ERROR_DEVICE_REMOVED) {
Mingming1 XuIt would be helpful to LOG the actual reason too via GetDeviceRemovedReason().
Mingming1 Xuchoice 1: add `LOG(ERROR) << logging::SystemErrorCodeToString(hr);` for each case.
choice 2: call `ID3D12Device::GetDeviceRemovedReason` and LOG the result of it. But I am not sure which case indicates device removed, is it only the `DXGI_ERROR_DEVICE_REMOVED`?
choice 3: both 1 and 2Any ideas? @rafael....@microsoft.com
Think again, we may needn't call `LOG(ERROR) << logging::SystemErrorCodeToString(hr)` again here, it can be called outside, I want to leave this method only responsible for handling specific context lost error message which will be sent by `OnDestroyed` to renderer.
Any HR that results in "lost" could leverage GetDeviceRemovedReason(). So at-least, HUNG, REMOVED, RESET, INTERNAL_ERROR, or NOT_CURRENTLY_AVAILABLE. We could also, perhaps more simply, always append the reason from any HR returned by it. This is what WebGPU does [1].
_Bad/untrusted/changing driver, kernel API, hardware failure:_ A misbehaving GPU driver may cause us to be unable to proceed. This is not an invariant failure. On platforms where we are wary that a kernel API may change without sufficient prior notice we should not CHECK() its result as we expect the rug to be pulled from under our feet. In the case of hardware failure we should not for instance CHECK() that a write succeeded.
} else if (hr == DXGI_ERROR_DEVICE_REMOVED) {
Mingming1 XuIt would be helpful to LOG the actual reason too via GetDeviceRemovedReason().
Mingming1 Xuchoice 1: add `LOG(ERROR) << logging::SystemErrorCodeToString(hr);` for each case.
choice 2: call `ID3D12Device::GetDeviceRemovedReason` and LOG the result of it. But I am not sure which case indicates device removed, is it only the `DXGI_ERROR_DEVICE_REMOVED`?
choice 3: both 1 and 2Any ideas? @rafael....@microsoft.com
Bryan BernhartThink again, we may needn't call `LOG(ERROR) << logging::SystemErrorCodeToString(hr)` again here, it can be called outside, I want to leave this method only responsible for handling specific context lost error message which will be sent by `OnDestroyed` to renderer.
Any HR that results in "lost" could leverage GetDeviceRemovedReason(). So at-least, HUNG, REMOVED, RESET, INTERNAL_ERROR, or NOT_CURRENTLY_AVAILABLE. We could also, perhaps more simply, always append the reason from any HR returned by it. This is what WebGPU does [1].
So we don't need the `switch-case` here to get the contest lost message, we can directly call `ID3D12Device::GetDeviceRemovedReason` to get the message and send it to renderer by `OnDestroyed`. That sounds good!
Two questions:
1. if `ID3D12Device::GetDeviceRemovedReason` handled all the lost types including `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED` and `DXGI_ERROR_DEVICE_RESET` (@rafael....@microsoft.com mentioned)?
2. I find there is another API called `IDMLDevice::GetDeviceRemovedReason`, what's the difference and should we use this?
FakeWebNNContextImpl(mojo::PendingReceiver<mojom::WebNNContext> receiver,
Mingming1 XuLooks like we have no test cases which generate a context lost (and ensure delivery).
What's our plan there?
Yes, agree, I should add tests for that, will do in following patchset.
Added in patchset 10, PTAL.
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
@rei...@chromium.org, thank you for the quote.
I agree with the guidance on not `CHECK`-ing for "hardware failure" or "changing driver". These map, roughly, to `DXGI_ERROR_DEVICE_REMOVED`/`DXGI_ERROR_DEVICE_RESET` and I think we should definitely handle that gracefully by `LOG(ERROR)`-ing and arranging for orderly GPU process shutdown like the rest of the code does.
Windows does not really have the analogous of "... kernel API may change without sufficient prior notice...". Windows components tend to be stable in terms the input they receive and error out on. If anything, my experience has been they're overly generous about continuing to accept bad input that was accepted by a previous version of the API.
void HandleWebNNContextLost(HRESULT hr, ContextImplDml* context) {
Mingming1 XuThe "HandleXFailure" methods which call `HandleWebNNContextLost` have code in common.
To better unify them, I think we should structure `HandleWebNNContextLost` as follows:
1-If the `hr` is `E_OUTOFMEMORY`, set the web developer friendly error message to "out of memory", otherwise, set it to "device removed." Having an "internal error" for the non-device removed ones may give an attacker information about whether the error is exploitable but I don't feel strongly about this.
2-Call the `OnDestroyed` callback.
3-`LOG(ERROR) << "[WebNN]: " << logging::SystemErrorCodeToString(hr)`.
4-`CHECK` that the error is either `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED`, or `DXGI_ERROR_DEVICE_RESET`.We could also have this function run the callback if we templatize it on the callback type and error type. If that's too complicated, we can have it return the developer friendly string to the caller and have the caller use it to run the callback.
WDYT?
I am not clear for the step 1,3,4, please help me understand.
Currently I handled `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED`,`DXGI_ERROR_DEVICE_RESET` three ones to call `OnDestroyed`, otherwise I directly return and won't call `OnDestroyed`.We could also have this function run the callback if we templatize it on the callback type and error type. If that's too complicated, we can have it return the developer friendly string to the caller and have the caller use it to run the callback.
The `OnDestroyed` method is called to send context lost message to renderer, I suppose we needn't LOG(ERROR) the context lost message again and let the caller run the callback (`CreateGraphCallback` and `ComputeCallback`)? As current code base, I only want to get the context lost message, and then call `OnDestroyed` with the message in `HandleWebNNContextLost`.
Currently I handled E_OUTOFMEMORY, DXGI_ERROR_DEVICE_REMOVED, DXGI_ERROR_DEVICE_RESET three ones to call OnDestroyed, otherwise I directly return and won't call OnDestroyed.
By the time we get here, we know we have a failure HRESULT, correct? If so, we should always call `OnDestroyed` so that the web developer can realize that no more WebNN will happen with the context. The difference is whether the error warrants a `CHECK` or graceful termination of the GPU process.
The `OnDestroye`d method is called to send context lost message to renderer, I suppose we needn't `LOG(ERROR)` the context lost message again and let the caller run the callback (`CreateGraphCallback` and `ComputeCallback`)? As current code base, I only want to get the context lost message, and then call OnDestroyed with the message in HandleWebNNContextLost.
The `LOG(ERROR)` gives the about:gpu log additional information about why the error occurred that we are not providing to the renderer process. Also, we can't count on the web developer reporting the error to the end user in a way we can easily diagnose. Having the error in the about:gpu log means we, the WebNN implementation, can control when and what we log, not the web developer.
} else if (hr == DXGI_ERROR_DEVICE_REMOVED) {
Mingming1 XuIt would be helpful to LOG the actual reason too via GetDeviceRemovedReason().
Mingming1 Xuchoice 1: add `LOG(ERROR) << logging::SystemErrorCodeToString(hr);` for each case.
choice 2: call `ID3D12Device::GetDeviceRemovedReason` and LOG the result of it. But I am not sure which case indicates device removed, is it only the `DXGI_ERROR_DEVICE_REMOVED`?
choice 3: both 1 and 2Any ideas? @rafael....@microsoft.com
Bryan BernhartThink again, we may needn't call `LOG(ERROR) << logging::SystemErrorCodeToString(hr)` again here, it can be called outside, I want to leave this method only responsible for handling specific context lost error message which will be sent by `OnDestroyed` to renderer.
Mingming1 XuAny HR that results in "lost" could leverage GetDeviceRemovedReason(). So at-least, HUNG, REMOVED, RESET, INTERNAL_ERROR, or NOT_CURRENTLY_AVAILABLE. We could also, perhaps more simply, always append the reason from any HR returned by it. This is what WebGPU does [1].
So we don't need the `switch-case` here to get the contest lost message, we can directly call `ID3D12Device::GetDeviceRemovedReason` to get the message and send it to renderer by `OnDestroyed`. That sounds good!
Two questions:
1. if `ID3D12Device::GetDeviceRemovedReason` handled all the lost types including `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED` and `DXGI_ERROR_DEVICE_RESET` (@rafael....@microsoft.com mentioned)?
2. I find there is another API called `IDMLDevice::GetDeviceRemovedReason`, what's the difference and should we use this?
1) if `ID3D12Device::GetDeviceRemovedReason` handled all the lost types including E_OUTOFMEMORY, DXGI_ERROR_DEVICE_REMOVED and DXGI_ERROR_DEVICE_RESET (@rafael....@microsoft.com mentioned)?
All DML entrypoints which return an HRESULT first call `ID3D12Device::GetDeviceRemovedReason` and return `DXGI_ERROR_DEVICE_REMOVED` on failure. To be expected since DML is heavily tied to D3D12 for command execution and resource allocation/management.
2) I find there is another API called `IDMLDevice::GetDeviceRemovedReason` what's the difference and should we use this?
In the current DML implementation, `IDMLDevice::GetDeviceRemovedReason` calls `ID3D12Device::GetDeviceRemovedReason`. So in today's implementation, DML cannot become removed when D3D12 is still unremoved or vice versa.
I think we should check the return value of all HRESULT-returning DirectX methods and `LOG(ERROR)` right at the point of failure. This will help us better diagnose problems in the wild. We already largely do this with the macros in `services/webnn/dml/error.h`.
I can't think of a good reason to also use the "GetDeviceRemovedReason" APIs, except maybe as a sanity check in special codepaths.
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
// TODO(crbug.com/41492165): generate error using context.
Reilly GrantCan we handle context lost here as well?
Bryan BernhartAre we assuming that any failure to create a buffer or graph means the context has been lost? That seems strange. I expect that context loss is a signal we specifically get from the adapter.
Reilly GrantAre we assuming that any failure to create a buffer or graph means the context has been lost?
We could, like WebGPU. If we do not then the OOM skews to some, possibly a unrelated call, which does handle context lost which is hard to debug.
Bryan BernhartI don't think we should be reporting a context loss for errors like this. That's a separate concept.
Mingming1 XuIf this returns E_OUTOFMEMORY, the context will become inoperable and effectively lost. Also, the HR could be any other error result (but unlikely) so I can see it being quite useful to coverage to one error handler.
Mingming1 XuThis discussion should be merged into above one https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/ad47c8af_d3f27f05/: whether we should handle the context lost according to the system error code `HRESULT`.
I am thinking maybe all the methods that return `HRESULT` inside or outside should be handled by `HandleWebNNContextLost`, if the `HRESULT` is context lost error, we should call `OnDestoryed`. WDYT? @rafael....@microsoft.com
Updated, PTAL.
void HandleWebNNContextLost(HRESULT hr, ContextImplDml* context) {
Mingming1 XuThe "HandleXFailure" methods which call `HandleWebNNContextLost` have code in common.
To better unify them, I think we should structure `HandleWebNNContextLost` as follows:
1-If the `hr` is `E_OUTOFMEMORY`, set the web developer friendly error message to "out of memory", otherwise, set it to "device removed." Having an "internal error" for the non-device removed ones may give an attacker information about whether the error is exploitable but I don't feel strongly about this.
2-Call the `OnDestroyed` callback.
3-`LOG(ERROR) << "[WebNN]: " << logging::SystemErrorCodeToString(hr)`.
4-`CHECK` that the error is either `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED`, or `DXGI_ERROR_DEVICE_RESET`.We could also have this function run the callback if we templatize it on the callback type and error type. If that's too complicated, we can have it return the developer friendly string to the caller and have the caller use it to run the callback.
WDYT?
Rafael CintronI am not clear for the step 1,3,4, please help me understand.
Currently I handled `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED`,`DXGI_ERROR_DEVICE_RESET` three ones to call `OnDestroyed`, otherwise I directly return and won't call `OnDestroyed`.We could also have this function run the callback if we templatize it on the callback type and error type. If that's too complicated, we can have it return the developer friendly string to the caller and have the caller use it to run the callback.
The `OnDestroyed` method is called to send context lost message to renderer, I suppose we needn't LOG(ERROR) the context lost message again and let the caller run the callback (`CreateGraphCallback` and `ComputeCallback`)? As current code base, I only want to get the context lost message, and then call `OnDestroyed` with the message in `HandleWebNNContextLost`.
Currently I handled E_OUTOFMEMORY, DXGI_ERROR_DEVICE_REMOVED, DXGI_ERROR_DEVICE_RESET three ones to call OnDestroyed, otherwise I directly return and won't call OnDestroyed.
By the time we get here, we know we have a failure HRESULT, correct? If so, we should always call `OnDestroyed` so that the web developer can realize that no more WebNN will happen with the context. The difference is whether the error warrants a `CHECK` or graceful termination of the GPU process.
The `OnDestroye`d method is called to send context lost message to renderer, I suppose we needn't `LOG(ERROR)` the context lost message again and let the caller run the callback (`CreateGraphCallback` and `ComputeCallback`)? As current code base, I only want to get the context lost message, and then call OnDestroyed with the message in HandleWebNNContextLost.
The `LOG(ERROR)` gives the about:gpu log additional information about why the error occurred that we are not providing to the renderer process. Also, we can't count on the web developer reporting the error to the end user in a way we can easily diagnose. Having the error in the about:gpu log means we, the WebNN implementation, can control when and what we log, not the web developer.
Thanks! Please see my updates.
I think we should check the return value of all HRESULT-returning DirectX methods and LOG(ERROR) right at the point of failure. This will help us better diagnose problems in the wild. We already largely do this with the macros in services/webnn/dml/error.h.
Agree, sorry to be unclear, I just mean I don't want to LOG(ERROR) the `logging::SystemErrorCodeToString(hr)` twice. Now I only LOG it in `HandleWebNNContextLost` as you suggested.
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
For DirectML, the question is how transparently driver quirks are exposed to API consumers. I trust Windows components more than vendor-specific drivers.
If history is any indication, I expect we will encounter driver/hardware quirks for incorrect output. I do not expect rogue, new errors to be returned straight from the driver to the application. The reason DML and D3D runtimes exist is to shield applications from this.
Would it make you feel more comfortable if we only `CHECK`-ed certain errors and gracefully handled everything else by declaring "context lost" and emitting a LOG + histogram? With a histogram, we can track the values we get over time to see how bad things are in practice.
void HandleWebNNContextLost(HRESULT hr, ContextImplDml* context) {
Mingming1 XuThe "HandleXFailure" methods which call `HandleWebNNContextLost` have code in common.
To better unify them, I think we should structure `HandleWebNNContextLost` as follows:
1-If the `hr` is `E_OUTOFMEMORY`, set the web developer friendly error message to "out of memory", otherwise, set it to "device removed." Having an "internal error" for the non-device removed ones may give an attacker information about whether the error is exploitable but I don't feel strongly about this.
2-Call the `OnDestroyed` callback.
3-`LOG(ERROR) << "[WebNN]: " << logging::SystemErrorCodeToString(hr)`.
4-`CHECK` that the error is either `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED`, or `DXGI_ERROR_DEVICE_RESET`.We could also have this function run the callback if we templatize it on the callback type and error type. If that's too complicated, we can have it return the developer friendly string to the caller and have the caller use it to run the callback.
WDYT?
Rafael CintronI am not clear for the step 1,3,4, please help me understand.
Currently I handled `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED`,`DXGI_ERROR_DEVICE_RESET` three ones to call `OnDestroyed`, otherwise I directly return and won't call `OnDestroyed`.We could also have this function run the callback if we templatize it on the callback type and error type. If that's too complicated, we can have it return the developer friendly string to the caller and have the caller use it to run the callback.
The `OnDestroyed` method is called to send context lost message to renderer, I suppose we needn't LOG(ERROR) the context lost message again and let the caller run the callback (`CreateGraphCallback` and `ComputeCallback`)? As current code base, I only want to get the context lost message, and then call `OnDestroyed` with the message in `HandleWebNNContextLost`.
Mingming1 XuCurrently I handled E_OUTOFMEMORY, DXGI_ERROR_DEVICE_REMOVED, DXGI_ERROR_DEVICE_RESET three ones to call OnDestroyed, otherwise I directly return and won't call OnDestroyed.
By the time we get here, we know we have a failure HRESULT, correct? If so, we should always call `OnDestroyed` so that the web developer can realize that no more WebNN will happen with the context. The difference is whether the error warrants a `CHECK` or graceful termination of the GPU process.
The `OnDestroye`d method is called to send context lost message to renderer, I suppose we needn't `LOG(ERROR)` the context lost message again and let the caller run the callback (`CreateGraphCallback` and `ComputeCallback`)? As current code base, I only want to get the context lost message, and then call OnDestroyed with the message in HandleWebNNContextLost.
The `LOG(ERROR)` gives the about:gpu log additional information about why the error occurred that we are not providing to the renderer process. Also, we can't count on the web developer reporting the error to the end user in a way we can easily diagnose. Having the error in the about:gpu log means we, the WebNN implementation, can control when and what we log, not the web developer.
Thanks! Please see my updates.
Acknowledged
CHECK(hr == E_OUTOFMEMORY || hr == DXGI_ERROR_DEVICE_REMOVED ||
hr == DXGI_ERROR_DEVICE_REMOVED);
The last two statements are the same.
}
@mingmi...@intel.com, apologies for misleading. What you have here so far is good.
If the HRESULT is a failure HRESULT, please have the browser additionally call `GetDeviceRemovedReason` and `LOG(ERROR)` that as well with the string prefix "[WebNN] Device Removed Reason: ". `GetDeviceRemovedReason` returns more information about the error than just the generic device removed error.
moving myself to CC since this CL already has overlapping reviewers!
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
FakeWebNNContextImpl(mojo::PendingReceiver<mojom::WebNNContext> receiver,
Mingming1 XuLooks like we have no test cases which generate a context lost (and ensure delivery).
What's our plan there?
Mingming1 XuYes, agree, I should add tests for that, will do in following patchset.
Added in patchset 10, PTAL.
What we added is good but I think we'll need to go further and add a CTS. To do that, there needs to be a way to force a context lost from script w/o relying on the OS/driver generating a "lost" HRESULT.
Introducing `MLContext.destroy()` which also loses the context could achieve that. But if we treat destroy as lost, the web developer could mistakenly re-create the context (when in fact, it should remain destroyed). `MLContextLostInfo` will probably need a field to the know the source of the message to avoid that. WDYT?
FakeWebNNContextImpl(mojo::PendingReceiver<mojom::WebNNContext> receiver,
Mingming1 XuLooks like we have no test cases which generate a context lost (and ensure delivery).
What's our plan there?
Mingming1 XuYes, agree, I should add tests for that, will do in following patchset.
Bryan BernhartAdded in patchset 10, PTAL.
What we added is good but I think we'll need to go further and add a CTS. To do that, there needs to be a way to force a context lost from script w/o relying on the OS/driver generating a "lost" HRESULT.
Introducing `MLContext.destroy()` which also loses the context could achieve that. But if we treat destroy as lost, the web developer could mistakenly re-create the context (when in fact, it should remain destroyed). `MLContextLostInfo` will probably need a field to the know the source of the message to avoid that. WDYT?
Here are the [WebGPU device lost tests](https://gpuweb.github.io/cts/standalone/?q=webgpu:api,operation,device,lost:*). They're implemented with a `destroy` method on the device, which I think we should also have in WebNN.
The destroy method can come as a separate change.
HandleWebNNContextLost(hr, context);
That makes sense to me.
The only other thing I want to make sure we are clear on is separating errors by how we expect developers to react to them. As I understand it developers should respond to context loss by simply redoing what they did before (i.e. if the dGPU is removed, create a new context for the iGPU and _maybe_ use a different model for the smaller chip). On the other hand there might be graph or buffer creation errors which mean the developer should not try again, or should take some specific corrective action (e.g. trying to load a different model). These should not be context loss errors and we'll need to use a different method for reporting them (e.g. rejecting from `build()`, `dispatch()` or `readBuffer()`).
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
(I hadn't seen this CR until Rafael brought it to my attention)
Agreed with `E_INVALIDARG`, as Chromium should not be sending invalid parameters to DML in the first place, and DML would report this error due to bad *caller* parameters, not bubbled up from the driver (which become different error messages).
Â
Some others besides `DXGI_ERROR_DEVICE_REMOVED` to gracefully handle include: `DXGI_ERROR_UNSUPPORTED`, `DXGI_DDI_ERR_UNSUPPORTED`, `DXGI_ERROR_DEVICE_RESET`.
Though, `DXGI_ERROR_DRIVER_INTERNAL_ERROR` is more serious - "An internal issue prevented the driver from carrying out the specified operation. The driver's state is probably suspect, and the application should not continue." 🤔
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
// TODO(crbug.com/41492165): generate error using context.
Reilly GrantCan we handle context lost here as well?
Bryan BernhartAre we assuming that any failure to create a buffer or graph means the context has been lost? That seems strange. I expect that context loss is a signal we specifically get from the adapter.
Reilly GrantAre we assuming that any failure to create a buffer or graph means the context has been lost?
We could, like WebGPU. If we do not then the OOM skews to some, possibly a unrelated call, which does handle context lost which is hard to debug.
Bryan BernhartI don't think we should be reporting a context loss for errors like this. That's a separate concept.
Mingming1 XuIf this returns E_OUTOFMEMORY, the context will become inoperable and effectively lost. Also, the HR could be any other error result (but unlikely) so I can see it being quite useful to coverage to one error handler.
Mingming1 XuThis discussion should be merged into above one https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/ad47c8af_d3f27f05/: whether we should handle the context lost according to the system error code `HRESULT`.
Mingming1 XuI am thinking maybe all the methods that return `HRESULT` inside or outside should be handled by `HandleWebNNContextLost`, if the `HRESULT` is context lost error, we should call `OnDestoryed`. WDYT? @rafael....@microsoft.com
Updated, PTAL.
It depends if the HRESULT error will be recoverable or not. In this case, E_OUTOFMEMORY is not recoverable and leaves the context inoperable so a "lost" seems appropriate. In other cases, this is not so [1].
HandleWebNNContextLost(hr, this);
[1] In certain situations, like command recording, it is not expected to generate a "lost" HRESULT error (hence no TODO), because the error was recoverable.
The problem here is that HandleWebNNContextLost() really only needs to be called from `CommandRecorder::Execute()`, instead of `CommandRecorder::CloseAndExecute()`, where HandleRecordingError() only needs to LOG.
void HandleWebNNContextLost(HRESULT hr, ContextImplDml* context) {
Since it's not possible to HandleWebNNContextLost() without a context now and every WebNN object should have access to said context, why must this function be in utils?
As I understand it developers should respond to context loss by simply redoing what they did before
@rei...@chromium.org, both WebGL and WebGPU allow the web developer _one_ context lost of grace. After that, no more WebGL/WebGPU for you. Not a great story but is the best we have with the information we're given by the platform about what's really happening.
That said, WebNN is not like WebGPU/WebGL where the platform ingests arbitrary shaders with infinite loops so I expect the platform will be better able to preempt work the web developer submits.
Of the errors Dwayne quoted, `DXGI_ERROR_UNSUPPORTED` and `DXGI_DDI_ERR_UNSUPPORTED` are a temporary implementation detail specific to NPUS when not all of the ops are supported at compile time. I'm double checking with the DML team but I expect/hope that we don't have to completely tear down and restart DML when this happens. OK with me if we handle those particular ones as a separate change.
HandleWebNNContextLost(hr, this);
[1] In certain situations, like command recording, it is not expected to generate a "lost" HRESULT error (hence no TODO), because the error was recoverable.
The problem here is that HandleWebNNContextLost() really only needs to be called from `CommandRecorder::Execute()`, instead of `CommandRecorder::CloseAndExecute()`, where HandleRecordingError() only needs to LOG.
What do you mean? The `CommandRecorder::CloseAndExecute()` will also call `CommandRecorder::Execute()`. And there is no any bad effect if we call the `HandleWebNNContextLost()` here, right? Just like the `GraphImplDml::HandleDispatchFailure`, `GraphImplDml::HandleComputationFailure` and `HandleGraphCreationFailure`, I handle the failure HRESULT everywhere.
void HandleWebNNContextLost(HRESULT hr, ContextImplDml* context) {
Since it's not possible to HandleWebNNContextLost() without a context now and every WebNN object should have access to said context, why must this function be in utils?
Good points, now move it to `ContextImplDml` as `ContextImplDml::HandleContextLost`
} else if (hr == DXGI_ERROR_DEVICE_REMOVED) {
Done
CHECK(hr == E_OUTOFMEMORY || hr == DXGI_ERROR_DEVICE_REMOVED ||
hr == DXGI_ERROR_DEVICE_REMOVED);
The last two statements are the same.
Done
@mingmi...@intel.com, apologies for misleading. What you have here so far is good.
If the HRESULT is a failure HRESULT, please have the browser additionally call `GetDeviceRemovedReason` and `LOG(ERROR)` that as well with the string prefix "[WebNN] Device Removed Reason: ". `GetDeviceRemovedReason` returns more information about the error than just the generic device removed error.
Updated, PTAL, thanks!
pending_receiver<WebNNContextClient>? context_client_receiver;
@ningx...@intel.com we also need reviewer for this mojom file, could you help invite? Thanks!
FakeWebNNContextImpl(mojo::PendingReceiver<mojom::WebNNContext> receiver,
Mingming1 XuLooks like we have no test cases which generate a context lost (and ensure delivery).
What's our plan there?
Mingming1 XuYes, agree, I should add tests for that, will do in following patchset.
Bryan BernhartAdded in patchset 10, PTAL.
Rafael CintronWhat we added is good but I think we'll need to go further and add a CTS. To do that, there needs to be a way to force a context lost from script w/o relying on the OS/driver generating a "lost" HRESULT.
Introducing `MLContext.destroy()` which also loses the context could achieve that. But if we treat destroy as lost, the web developer could mistakenly re-create the context (when in fact, it should remain destroyed). `MLContextLostInfo` will probably need a field to the know the source of the message to avoid that. WDYT?
Here are the [WebGPU device lost tests](https://gpuweb.github.io/cts/standalone/?q=webgpu:api,operation,device,lost:*). They're implemented with a `destroy` method on the device, which I think we should also have in WebNN.
The destroy method can come as a separate change.
Thanks, open https://issues.chromium.org/issues/348904836 to track.
But if we treat destroy as lost, the web developer could mistakenly re-create the context (when in fact, it should remain destroyed).
Re-creating a new context after lost is as expected, the new context can work after the GPU recovers, right?
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
HandleWebNNContextLost(hr, this);
Mingming1 Xu[1] In certain situations, like command recording, it is not expected to generate a "lost" HRESULT error (hence no TODO), because the error was recoverable.
The problem here is that HandleWebNNContextLost() really only needs to be called from `CommandRecorder::Execute()`, instead of `CommandRecorder::CloseAndExecute()`, where HandleRecordingError() only needs to LOG.
What do you mean? The `CommandRecorder::CloseAndExecute()` will also call `CommandRecorder::Execute()`. And there is no any bad effect if we call the `HandleWebNNContextLost()` here, right? Just like the `GraphImplDml::HandleDispatchFailure`, `GraphImplDml::HandleComputationFailure` and `HandleGraphCreationFailure`, I handle the failure HRESULT everywhere.
And there is no any bad effect if we call the HandleWebNNContextLost() here, right?
If we call HandleWebNNContextLost(), the entire context becomes inoperable, even when the HRESULT may not indicate a 'lost' error. Before this CL, any recording HRESULT error was fully handled by re-creating the command recorder.
Either HandleWebNNContextLost() needs to avoid generating a lost on non-fatal HRESULT errors that are recoverable OR we must call HandleWebNNContextLost() ONLY where a lost HRESULT is expected (ex. `CommandRecorder::Execute`).
FakeWebNNContextImpl(mojo::PendingReceiver<mojom::WebNNContext> receiver,
Mingming1 XuLooks like we have no test cases which generate a context lost (and ensure delivery).
What's our plan there?
Mingming1 XuYes, agree, I should add tests for that, will do in following patchset.
Bryan BernhartAdded in patchset 10, PTAL.
Rafael CintronWhat we added is good but I think we'll need to go further and add a CTS. To do that, there needs to be a way to force a context lost from script w/o relying on the OS/driver generating a "lost" HRESULT.
Introducing `MLContext.destroy()` which also loses the context could achieve that. But if we treat destroy as lost, the web developer could mistakenly re-create the context (when in fact, it should remain destroyed). `MLContextLostInfo` will probably need a field to the know the source of the message to avoid that. WDYT?
Mingming1 XuHere are the [WebGPU device lost tests](https://gpuweb.github.io/cts/standalone/?q=webgpu:api,operation,device,lost:*). They're implemented with a `destroy` method on the device, which I think we should also have in WebNN.
The destroy method can come as a separate change.
Thanks, open https://issues.chromium.org/issues/348904836 to track.
But if we treat destroy as lost, the web developer could mistakenly re-create the context (when in fact, it should remain destroyed).
Re-creating a new context after lost is as expected, the new context can work after the GPU recovers, right?
Re-creating a new context after lost is as expected, the new context can work after the GPU recovers, right?
Correct but only when the context wasn't destroyed/lost on purpose. I am fine with a TODO.
HandleContextLost(hr);
Every occurrence of `HandleContextLost` in this CL is preceded by `LOG(ERROR) << "some error"`.
To better streamline how we do error handling, Consider having `HandleContextLost`, itself, accept an error string that it can combine with the existing strings it already outputs.
LOG(ERROR) << error_message;
Nit: While you're here, please prepend the string "[WebNN] " to this error message so that we know which system it is coming from in logs.
(Same goes for the other places where you're outputting to the error log)
HandleWebNNContextLost(hr, this);
Mingming1 Xu[1] In certain situations, like command recording, it is not expected to generate a "lost" HRESULT error (hence no TODO), because the error was recoverable.
The problem here is that HandleWebNNContextLost() really only needs to be called from `CommandRecorder::Execute()`, instead of `CommandRecorder::CloseAndExecute()`, where HandleRecordingError() only needs to LOG.
Bryan BernhartWhat do you mean? The `CommandRecorder::CloseAndExecute()` will also call `CommandRecorder::Execute()`. And there is no any bad effect if we call the `HandleWebNNContextLost()` here, right? Just like the `GraphImplDml::HandleDispatchFailure`, `GraphImplDml::HandleComputationFailure` and `HandleGraphCreationFailure`, I handle the failure HRESULT everywhere.
And there is no any bad effect if we call the HandleWebNNContextLost() here, right?
If we call HandleWebNNContextLost(), the entire context becomes inoperable, even when the HRESULT may not indicate a 'lost' error. Before this CL, any recording HRESULT error was fully handled by re-creating the command recorder.
Either HandleWebNNContextLost() needs to avoid generating a lost on non-fatal HRESULT errors that are recoverable OR we must call HandleWebNNContextLost() ONLY where a lost HRESULT is expected (ex. `CommandRecorder::Execute`).
@bryan.b...@intel.com, can you give me an example of a non-fatal HRESULT error that is recoverable by re-creating the command recorder?
CHECK(hr == E_OUTOFMEMORY || hr == DXGI_ERROR_DEVICE_REMOVED ||
hr == DXGI_ERROR_DEVICE_REMOVED);
These two || expressions are the same.
Every occurrence of `HandleContextLost` in this CL is preceded by `LOG(ERROR) << "some error"`.
To better streamline how we do error handling, Consider having `HandleContextLost`, itself, accept an error string that it can combine with the existing strings it already outputs.
Good idea,done.
Nit: While you're here, please prepend the string "[WebNN] " to this error message so that we know which system it is coming from in logs.
(Same goes for the other places where you're outputting to the error log)
Done
CHECK(hr == E_OUTOFMEMORY || hr == DXGI_ERROR_DEVICE_REMOVED ||
hr == DXGI_ERROR_DEVICE_REMOVED);
These two || expressions are the same.
Done
void HandleWebNNContextLost(HRESULT hr, ContextImplDml* context) {
Mingming1 XuSince it's not possible to HandleWebNNContextLost() without a context now and every WebNN object should have access to said context, why must this function be in utils?
Good points, now move it to `ContextImplDml` as `ContextImplDml::HandleContextLost`
Done
FakeWebNNContextImpl(mojo::PendingReceiver<mojom::WebNNContext> receiver,
Mingming1 XuLooks like we have no test cases which generate a context lost (and ensure delivery).
What's our plan there?
Mingming1 XuYes, agree, I should add tests for that, will do in following patchset.
Bryan BernhartAdded in patchset 10, PTAL.
Rafael CintronWhat we added is good but I think we'll need to go further and add a CTS. To do that, there needs to be a way to force a context lost from script w/o relying on the OS/driver generating a "lost" HRESULT.
Introducing `MLContext.destroy()` which also loses the context could achieve that. But if we treat destroy as lost, the web developer could mistakenly re-create the context (when in fact, it should remain destroyed). `MLContextLostInfo` will probably need a field to the know the source of the message to avoid that. WDYT?
Mingming1 XuHere are the [WebGPU device lost tests](https://gpuweb.github.io/cts/standalone/?q=webgpu:api,operation,device,lost:*). They're implemented with a `destroy` method on the device, which I think we should also have in WebNN.
The destroy method can come as a separate change.
Bryan BernhartThanks, open https://issues.chromium.org/issues/348904836 to track.
But if we treat destroy as lost, the web developer could mistakenly re-create the context (when in fact, it should remain destroyed).
Re-creating a new context after lost is as expected, the new context can work after the GPU recovers, right?
Re-creating a new context after lost is as expected, the new context can work after the GPU recovers, right?
Correct but only when the context wasn't destroyed/lost on purpose. I am fine with a TODO.
Thinking again, what's the purpose and usecase of introducing `MLContext.destroy()` into WebNN? Is it only for testing against forcing context lost from renderer process? @rafael....@microsoft.com @bryan.b...@intel.com @ningx...@intel.com
void MLContext::OnWebNNContextLostCaptured(const String& context_lost_info) {
Mingming1 XuShould this info be reported to JS code?
Now I reported these info related to adapter error in `src\services\webnn\dml\utils.cc`:
`std::string context_lost_info;
if (hr == E_OUTOFMEMORY) {
context_lost_info = "out of memory.";
} else if (hr == DXGI_ERROR_DEVICE_REMOVED) {
Mingming1 Xucontext_lost_info = "device removed.";
} else if (hr == DXGI_ERROR_DEVICE_HUNG) {
context_lost_info = "device hung.";
} else if (hr == DXGI_ERROR_DEVICE_RESET) {
context_lost_info = "device reset.";
} else {
return;
}`
Discussed a lot in other comments, PTAL, thanks!
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
FakeWebNNContextImpl(mojo::PendingReceiver<mojom::WebNNContext> receiver,
Mingming1 XuLooks like we have no test cases which generate a context lost (and ensure delivery).
What's our plan there?
Mingming1 XuYes, agree, I should add tests for that, will do in following patchset.
Bryan BernhartAdded in patchset 10, PTAL.
Rafael CintronWhat we added is good but I think we'll need to go further and add a CTS. To do that, there needs to be a way to force a context lost from script w/o relying on the OS/driver generating a "lost" HRESULT.
Introducing `MLContext.destroy()` which also loses the context could achieve that. But if we treat destroy as lost, the web developer could mistakenly re-create the context (when in fact, it should remain destroyed). `MLContextLostInfo` will probably need a field to the know the source of the message to avoid that. WDYT?
Mingming1 XuHere are the [WebGPU device lost tests](https://gpuweb.github.io/cts/standalone/?q=webgpu:api,operation,device,lost:*). They're implemented with a `destroy` method on the device, which I think we should also have in WebNN.
The destroy method can come as a separate change.
Bryan BernhartThanks, open https://issues.chromium.org/issues/348904836 to track.
But if we treat destroy as lost, the web developer could mistakenly re-create the context (when in fact, it should remain destroyed).
Re-creating a new context after lost is as expected, the new context can work after the GPU recovers, right?
Mingming1 XuRe-creating a new context after lost is as expected, the new context can work after the GPU recovers, right?
Correct but only when the context wasn't destroyed/lost on purpose. I am fine with a TODO.
Thinking again, what's the purpose and usecase of introducing `MLContext.destroy()` into WebNN? Is it only for testing against forcing context lost from renderer process? @rafael....@microsoft.com @bryan.b...@intel.com @ningx...@intel.com
Destroy serves two purposes. One is for testing. The other is for the web developer to free the memory and resources taken up by the context. In the case of hybrid GPUs, that means powering down the discrete GPU and giving hours of battery life back to the end user.
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
HandleWebNNContextLost(hr, this);
Mingming1 Xu[1] In certain situations, like command recording, it is not expected to generate a "lost" HRESULT error (hence no TODO), because the error was recoverable.
The problem here is that HandleWebNNContextLost() really only needs to be called from `CommandRecorder::Execute()`, instead of `CommandRecorder::CloseAndExecute()`, where HandleRecordingError() only needs to LOG.
Bryan BernhartWhat do you mean? The `CommandRecorder::CloseAndExecute()` will also call `CommandRecorder::Execute()`. And there is no any bad effect if we call the `HandleWebNNContextLost()` here, right? Just like the `GraphImplDml::HandleDispatchFailure`, `GraphImplDml::HandleComputationFailure` and `HandleGraphCreationFailure`, I handle the failure HRESULT everywhere.
Rafael CintronAnd there is no any bad effect if we call the HandleWebNNContextLost() here, right?
If we call HandleWebNNContextLost(), the entire context becomes inoperable, even when the HRESULT may not indicate a 'lost' error. Before this CL, any recording HRESULT error was fully handled by re-creating the command recorder.
Either HandleWebNNContextLost() needs to avoid generating a lost on non-fatal HRESULT errors that are recoverable OR we must call HandleWebNNContextLost() ONLY where a lost HRESULT is expected (ex. `CommandRecorder::Execute`).
@bryan.b...@intel.com, can you give me an example of a non-fatal HRESULT error that is recoverable by re-creating the command recorder?
@rafael....@microsoft.com any HRESULT that doesn't explicitly say the device will be put into a removal state could be recoverable. If the web developer builds N graphs and the N+1 graph build causes `E_INVALIDARG` or `DXGI_ERROR_INVALID_CALL` and so on then the entire app/workload will go down and the web developer won't know what graph caused it.
Code-Review | +1 |
Code change LGTM lone "nit" I just opened.
HandleWebNNContextLost(hr, this);
Mingming1 Xu[1] In certain situations, like command recording, it is not expected to generate a "lost" HRESULT error (hence no TODO), because the error was recoverable.
The problem here is that HandleWebNNContextLost() really only needs to be called from `CommandRecorder::Execute()`, instead of `CommandRecorder::CloseAndExecute()`, where HandleRecordingError() only needs to LOG.
Bryan BernhartWhat do you mean? The `CommandRecorder::CloseAndExecute()` will also call `CommandRecorder::Execute()`. And there is no any bad effect if we call the `HandleWebNNContextLost()` here, right? Just like the `GraphImplDml::HandleDispatchFailure`, `GraphImplDml::HandleComputationFailure` and `HandleGraphCreationFailure`, I handle the failure HRESULT everywhere.
Rafael CintronAnd there is no any bad effect if we call the HandleWebNNContextLost() here, right?
If we call HandleWebNNContextLost(), the entire context becomes inoperable, even when the HRESULT may not indicate a 'lost' error. Before this CL, any recording HRESULT error was fully handled by re-creating the command recorder.
Either HandleWebNNContextLost() needs to avoid generating a lost on non-fatal HRESULT errors that are recoverable OR we must call HandleWebNNContextLost() ONLY where a lost HRESULT is expected (ex. `CommandRecorder::Execute`).
Bryan Bernhart@bryan.b...@intel.com, can you give me an example of a non-fatal HRESULT error that is recoverable by re-creating the command recorder?
@rafael....@microsoft.com any HRESULT that doesn't explicitly say the device will be put into a removal state could be recoverable. If the web developer builds N graphs and the N+1 graph build causes `E_INVALIDARG` or `DXGI_ERROR_INVALID_CALL` and so on then the entire app/workload will go down and the web developer won't know what graph caused it.
My opinion, as I've stated in other comment threads in this CL, is that `E_INVALIDARG` and `DXGI_ERROR_INVALID_CALL` are not errors the browser should try to gracefully recover from. We should `CHECK` those and fail fast the GPU process.
bool IsLost() { return is_lost_; }
Nit: Please `IsLost` a const method.
Mingming1 Xu@mingmi...@intel.com, apologies for misleading. What you have here so far is good.
If the HRESULT is a failure HRESULT, please have the browser additionally call `GetDeviceRemovedReason` and `LOG(ERROR)` that as well with the string prefix "[WebNN] Device Removed Reason: ". `GetDeviceRemovedReason` returns more information about the error than just the generic device removed error.
Updated, PTAL, thanks!
Done
Nit: Please `IsLost` a const method.
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
pending_receiver<WebNNContextClient>? context_client_receiver;
@ningx...@intel.com we also need reviewer for this mojom file, could you help invite? Thanks!
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Adapter's `HandleAdapterFailure()` of https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/adapter.cc;bpv=0;bpt=1 may also need to handle context lost.
void HandleContextLost(std::string_view error_message, HRESULT hr);
This method may crash GPU process. It'd be better to make it explicit: i.e. something like `HandleContextLostOrCrash()`.
It might be worth adding a comment explaining which errors are treated as "context lost".
HandleContextLost("Failed to start recording.", hr);
`StartRecordingIfNecessary()` may already fail and call `HandleRecordingError()` which calls `HandleContextLost()` before. It may crash at `CHECK` before running the callback and this `HandleContextLost()`.
HandleRecordingError("Failed to close and execute the command list.", hr);
`HandleRecordingError()` calls `HandleContextLost()` that may crash at `CHECK` and won't run the following callback.
HandleRecordingError("Failed to download the buffer.", hr);
ditto.
if (FAILED(hr)) {
Why don't we call `HandleContextLost()` for this failure like others?
void ContextImplDml::HandleContextLost(std::string_view error_message,
This is not an error message passed to user JS code but a log message. I feel it is confusing by passing it just for logging. Could it be moved out from this method? Other helpers, like `RETURN_IF_FAILED` in [error.h](https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/error.h;l=14;drc=ae983e94583c5e2d400fa7641439d2e8ca13fd93;bpv=1;bpt=1), already log the error.
OnDestroyed(base::StrCat({"WebNN context is lost due to ", message}));
Do we want to reject `lost` promise for other "internal errors"? Or we just crash the GPU process by `CHECK` an error is not an "internal error"?
LOG(ERROR) << "[WebNN] Device Removed Reason: "
<< logging::SystemErrorCodeToString(
adapter_->d3d12_device()->GetDeviceRemovedReason());
Should this log only for device removal error?
if (hr == E_OUTOFMEMORY) {
std::move(callback).Run(base::unexpected(CreateError(
mojom::Error::Code::kUnknownError,
error_message + " No enough memory resources are available.")));
Should we remove this case? Because "out of memory" is now handled by context lost.
context->HandleContextLost(error_message, hr);
Where does the `context` come from? Did you forget passing it via parameter?
std::move(callback).Run(
base::unexpected(std::move(create_operator_result.error())));
Do we handle context lost here? Given the following code handles graph creation failure (it internally handles context lost) if creating an identity node fails.
HandleGraphCreationFailure("Failed to create identity operator.",
std::move(callback));
This may need to pass `context`.
if (hr == E_OUTOFMEMORY) {
std::move(callback).Run(ComputeResult::NewError(CreateError(
mojom::Error::Code::kUnknownError,
error_message + " No enough memory resources are available.")));
ditto, this case might be unnecessary.
Adapter's `HandleAdapterFailure()` of https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/adapter.cc;bpv=0;bpt=1 may also need to handle context lost.
Open it for comments.
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
ningxin huAdapter's `HandleAdapterFailure()` of https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/adapter.cc;bpv=0;bpt=1 may also need to handle context lost.
Open it for comments.
While the adapter is created before creating a context, if fails, there is no context.
HandleContextLost("Failed to start recording.", hr);
`StartRecordingIfNecessary()` may already fail and call `HandleRecordingError()` which calls `HandleContextLost()` before. It may crash at `CHECK` before running the callback and this `HandleContextLost()`.
Do you mean we should not call `HandleRecordingError()` which calls`HandleContextLost()` in the `StartRecordingIfNecessary()`? We can call the `HandleRecordingError()` outside if `StartRecordingIfNecessary()` fails.
HandleRecordingError("Failed to close and execute the command list.", hr);
`HandleRecordingError()` calls `HandleContextLost()` that may crash at `CHECK` and won't run the following callback.
Do you mean move the `HandleRecordingError()` below the `std::move(callback).Run...` to let the callback run at first?
if (FAILED(hr)) {
Why don't we call `HandleContextLost()` for this failure like others?
Yes, I missed it. Added now.
void ContextImplDml::HandleContextLost(std::string_view error_message,
This is not an error message passed to user JS code but a log message. I feel it is confusing by passing it just for logging. Could it be moved out from this method? Other helpers, like `RETURN_IF_FAILED` in [error.h](https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/error.h;l=14;drc=ae983e94583c5e2d400fa7641439d2e8ca13fd93;bpv=1;bpt=1), already log the error.
This change is to address @rafael....@microsoft.com 's comments: https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/f38abdf6_7c0fbbcd/
OnDestroyed(base::StrCat({"WebNN context is lost due to ", message}));
Do we want to reject `lost` promise for other "internal errors"? Or we just crash the GPU process by `CHECK` an error is not an "internal error"?
LOG(ERROR) << "[WebNN] Device Removed Reason: "
<< logging::SystemErrorCodeToString(
adapter_->d3d12_device()->GetDeviceRemovedReason());
Should this log only for device removal error?
Agree, maybe we should:
` HRESULT device_removed_reason = adapter_->d3d12_device()->GetDeviceRemovedReason();
if (FAILED(device_removed_reason)) {
LOG(ERROR) << "[WebNN] Device Removed Reason: "
<< logging::SystemErrorCodeToString(device_removed_reason);
}`
if (hr == E_OUTOFMEMORY) {
std::move(callback).Run(base::unexpected(CreateError(
mojom::Error::Code::kUnknownError,
error_message + " No enough memory resources are available.")));
Should we remove this case? Because "out of memory" is now handled by context lost.
Done
context->HandleContextLost(error_message, hr);
Where does the `context` come from? Did you forget passing it via parameter?
Done
std::move(callback).Run(
base::unexpected(std::move(create_operator_result.error())));
Do we handle context lost here? Given the following code handles graph creation failure (it internally handles context lost) if creating an identity node fails.
`void HandleGraphCreationFailure(
const std::string& error_message,
WebNNContextImpl::CreateGraphImplCallback callback)` won't handle context lost, only `void HandleGraphCreationFailure(
const std::string& error_message,
HRESULT hr,
WebNNContextImpl::CreateGraphImplCallback callback, ContextImplDml* context)` will handle context lost, it requires the `HRESULT`.
HandleGraphCreationFailure("Failed to create identity operator.",
std::move(callback));
This may need to pass `context`.
There is no `HRESULT` to be handled by `HandleContextLost`.
if (hr == E_OUTOFMEMORY) {
std::move(callback).Run(ComputeResult::NewError(CreateError(
mojom::Error::Code::kUnknownError,
error_message + " No enough memory resources are available.")));
ditto, this case might be unnecessary.
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
pending_receiver<WebNNContextClient>? context_client_receiver;
don't really need _receiver in the name here - but up to you
// Closes the `context_remote_` and `context_client_receiver_` pipes
// because the context has been lost.
void OnDestroyed(const String& message) override;
void OnDisconnected();
I'm not sure of the naming of these - it's not super-obvious that it's about the remote side of the context being lost at the moment. OnDisconnected should perhaps refer to the context client being disconnected (not the /this/) and OnDestroyed should somehow refer to the other side being gone.
HandleWebNNContextLost(hr, this);
Mingming1 Xu[1] In certain situations, like command recording, it is not expected to generate a "lost" HRESULT error (hence no TODO), because the error was recoverable.
The problem here is that HandleWebNNContextLost() really only needs to be called from `CommandRecorder::Execute()`, instead of `CommandRecorder::CloseAndExecute()`, where HandleRecordingError() only needs to LOG.
Bryan BernhartWhat do you mean? The `CommandRecorder::CloseAndExecute()` will also call `CommandRecorder::Execute()`. And there is no any bad effect if we call the `HandleWebNNContextLost()` here, right? Just like the `GraphImplDml::HandleDispatchFailure`, `GraphImplDml::HandleComputationFailure` and `HandleGraphCreationFailure`, I handle the failure HRESULT everywhere.
Rafael CintronAnd there is no any bad effect if we call the HandleWebNNContextLost() here, right?
If we call HandleWebNNContextLost(), the entire context becomes inoperable, even when the HRESULT may not indicate a 'lost' error. Before this CL, any recording HRESULT error was fully handled by re-creating the command recorder.
Either HandleWebNNContextLost() needs to avoid generating a lost on non-fatal HRESULT errors that are recoverable OR we must call HandleWebNNContextLost() ONLY where a lost HRESULT is expected (ex. `CommandRecorder::Execute`).
Bryan Bernhart@bryan.b...@intel.com, can you give me an example of a non-fatal HRESULT error that is recoverable by re-creating the command recorder?
Rafael Cintron@rafael....@microsoft.com any HRESULT that doesn't explicitly say the device will be put into a removal state could be recoverable. If the web developer builds N graphs and the N+1 graph build causes `E_INVALIDARG` or `DXGI_ERROR_INVALID_CALL` and so on then the entire app/workload will go down and the web developer won't know what graph caused it.
My opinion, as I've stated in other comment threads in this CL, is that `E_INVALIDARG` and `DXGI_ERROR_INVALID_CALL` are not errors the browser should try to gracefully recover from. We should `CHECK` those and fail fast the GPU process.
@rafael....@microsoft.com D3D12 does not guarantee where a certain error comes from: CreateCommandAllocator() could generate `E_OUTOFMEMORY`. Is that not recoverable? The problem with a "no error or lost" approach is it assumes every error is now fatal. But if you still feel strongly, I will close as "by design".
Code-Review | +1 |
void ContextImplDml::HandleContextLost(std::string_view error_message,
Mingming1 XuThis is not an error message passed to user JS code but a log message. I feel it is confusing by passing it just for logging. Could it be moved out from this method? Other helpers, like `RETURN_IF_FAILED` in [error.h](https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/error.h;l=14;drc=ae983e94583c5e2d400fa7641439d2e8ca13fd93;bpv=1;bpt=1), already log the error.
This change is to address @rafael....@microsoft.com 's comments: https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/f38abdf6_7c0fbbcd/
I think Ningxin is getting confused by the two, very similarly named strings: one called `error_message` and the other one just called `message`.
The first one is meant to go into logs and provide some context on where the error occurred. The second one is meant to be a simpler error we provide to web developers that avoids revealing too much information about the context of the error to attackers.
Perhaps we can better name these to disambiguate; rename `message` to `message_for_promise` or something like that.
HandleWebNNContextLost(hr, this);
Mingming1 Xu[1] In certain situations, like command recording, it is not expected to generate a "lost" HRESULT error (hence no TODO), because the error was recoverable.
The problem here is that HandleWebNNContextLost() really only needs to be called from `CommandRecorder::Execute()`, instead of `CommandRecorder::CloseAndExecute()`, where HandleRecordingError() only needs to LOG.
Bryan BernhartWhat do you mean? The `CommandRecorder::CloseAndExecute()` will also call `CommandRecorder::Execute()`. And there is no any bad effect if we call the `HandleWebNNContextLost()` here, right? Just like the `GraphImplDml::HandleDispatchFailure`, `GraphImplDml::HandleComputationFailure` and `HandleGraphCreationFailure`, I handle the failure HRESULT everywhere.
Rafael CintronAnd there is no any bad effect if we call the HandleWebNNContextLost() here, right?
If we call HandleWebNNContextLost(), the entire context becomes inoperable, even when the HRESULT may not indicate a 'lost' error. Before this CL, any recording HRESULT error was fully handled by re-creating the command recorder.
Either HandleWebNNContextLost() needs to avoid generating a lost on non-fatal HRESULT errors that are recoverable OR we must call HandleWebNNContextLost() ONLY where a lost HRESULT is expected (ex. `CommandRecorder::Execute`).
Bryan Bernhart@bryan.b...@intel.com, can you give me an example of a non-fatal HRESULT error that is recoverable by re-creating the command recorder?
Rafael Cintron@rafael....@microsoft.com any HRESULT that doesn't explicitly say the device will be put into a removal state could be recoverable. If the web developer builds N graphs and the N+1 graph build causes `E_INVALIDARG` or `DXGI_ERROR_INVALID_CALL` and so on then the entire app/workload will go down and the web developer won't know what graph caused it.
Bryan BernhartMy opinion, as I've stated in other comment threads in this CL, is that `E_INVALIDARG` and `DXGI_ERROR_INVALID_CALL` are not errors the browser should try to gracefully recover from. We should `CHECK` those and fail fast the GPU process.
@rafael....@microsoft.com D3D12 does not guarantee where a certain error comes from: CreateCommandAllocator() could generate `E_OUTOFMEMORY`. Is that not recoverable? The problem with a "no error or lost" approach is it assumes every error is now fatal. But if you still feel strongly, I will close as "by design".
D3D12 does not guarantee where a certain error comes from: CreateCommandAllocator() could generate E_OUTOFMEMORY. Is that not recoverable?
When Chromium runs in the sandbox and the job limit is reached, a message will be sent to the browser process notifying it. The browser process will turnaround and kill the GPU process. So, while it is technically recoverable, the GPU process is doomed. :-) FWIW, WebGPU also makes E_OUTOFMEMORY turn into WebGPU context lost.
The problem with a "no error or lost" approach is it assumes every error is now fatal. But if you still feel strongly, I will close as "by design".
Device removed, device reset and similar HRESULTs should cause a graceful termination of the GPU process. This CL handles them gracefully but doesn't terminate the GPU process yet. We'll need to hook that up in a followon change similar to existing code in the GPU process.
In my opinion, HRESULTS like E_INVALIDARG should result in a CHECK. This is a bug in the DML backend that we should treat similar to CHECK failures that we're counting on the validation layer have dealt before arriving at DML code.
LOG(ERROR) << "[WebNN] Device Removed Reason: "
<< logging::SystemErrorCodeToString(
adapter_->d3d12_device()->GetDeviceRemovedReason());
Mingming1 XuShould this log only for device removal error?
Agree, maybe we should:
` HRESULT device_removed_reason = adapter_->d3d12_device()->GetDeviceRemovedReason();
if (FAILED(device_removed_reason)) {
LOG(ERROR) << "[WebNN] Device Removed Reason: "
<< logging::SystemErrorCodeToString(device_removed_reason);
}`
+1
std::move(callback).Run(
base::unexpected(std::move(create_operator_result.error())));
Mingming1 XuDo we handle context lost here? Given the following code handles graph creation failure (it internally handles context lost) if creating an identity node fails.
`void HandleGraphCreationFailure(
const std::string& error_message,
WebNNContextImpl::CreateGraphImplCallback callback)` won't handle context lost, only `void HandleGraphCreationFailure(
const std::string& error_message,
HRESULT hr,
WebNNContextImpl::CreateGraphImplCallback callback, ContextImplDml* context)` will handle context lost, it requires the `HRESULT`.
@ningx...@intel.com brings up a good point that this particular error path does not result in context lost.
We originally had this error code path be graceful because we didn't want to crash the GPU process for unimplemented operators in the DML backend. Now that we have an implementation for every op, might be good to revisit this sooner rather than later.
OK to solve this as a separate change.
HandleWebNNContextLost(hr, this);
Mingming1 Xu[1] In certain situations, like command recording, it is not expected to generate a "lost" HRESULT error (hence no TODO), because the error was recoverable.
The problem here is that HandleWebNNContextLost() really only needs to be called from `CommandRecorder::Execute()`, instead of `CommandRecorder::CloseAndExecute()`, where HandleRecordingError() only needs to LOG.
Bryan BernhartWhat do you mean? The `CommandRecorder::CloseAndExecute()` will also call `CommandRecorder::Execute()`. And there is no any bad effect if we call the `HandleWebNNContextLost()` here, right? Just like the `GraphImplDml::HandleDispatchFailure`, `GraphImplDml::HandleComputationFailure` and `HandleGraphCreationFailure`, I handle the failure HRESULT everywhere.
Rafael CintronAnd there is no any bad effect if we call the HandleWebNNContextLost() here, right?
If we call HandleWebNNContextLost(), the entire context becomes inoperable, even when the HRESULT may not indicate a 'lost' error. Before this CL, any recording HRESULT error was fully handled by re-creating the command recorder.
Either HandleWebNNContextLost() needs to avoid generating a lost on non-fatal HRESULT errors that are recoverable OR we must call HandleWebNNContextLost() ONLY where a lost HRESULT is expected (ex. `CommandRecorder::Execute`).
Bryan Bernhart@bryan.b...@intel.com, can you give me an example of a non-fatal HRESULT error that is recoverable by re-creating the command recorder?
Rafael Cintron@rafael....@microsoft.com any HRESULT that doesn't explicitly say the device will be put into a removal state could be recoverable. If the web developer builds N graphs and the N+1 graph build causes `E_INVALIDARG` or `DXGI_ERROR_INVALID_CALL` and so on then the entire app/workload will go down and the web developer won't know what graph caused it.
Bryan BernhartMy opinion, as I've stated in other comment threads in this CL, is that `E_INVALIDARG` and `DXGI_ERROR_INVALID_CALL` are not errors the browser should try to gracefully recover from. We should `CHECK` those and fail fast the GPU process.
Rafael Cintron@rafael....@microsoft.com D3D12 does not guarantee where a certain error comes from: CreateCommandAllocator() could generate `E_OUTOFMEMORY`. Is that not recoverable? The problem with a "no error or lost" approach is it assumes every error is now fatal. But if you still feel strongly, I will close as "by design".
D3D12 does not guarantee where a certain error comes from: CreateCommandAllocator() could generate E_OUTOFMEMORY. Is that not recoverable?
When Chromium runs in the sandbox and the job limit is reached, a message will be sent to the browser process notifying it. The browser process will turnaround and kill the GPU process. So, while it is technically recoverable, the GPU process is doomed. :-) FWIW, WebGPU also makes E_OUTOFMEMORY turn into WebGPU context lost.
The problem with a "no error or lost" approach is it assumes every error is now fatal. But if you still feel strongly, I will close as "by design".
Device removed, device reset and similar HRESULTs should cause a graceful termination of the GPU process. This CL handles them gracefully but doesn't terminate the GPU process yet. We'll need to hook that up in a followon change similar to existing code in the GPU process.
In my opinion, HRESULTS like E_INVALIDARG should result in a CHECK. This is a bug in the DML backend that we should treat similar to CHECK failures that we're counting on the validation layer have dealt before arriving at DML code.
Okay, @rafael....@chromium.org I see where you're coming from.
We'll need to hook that up in a followon change similar to existing code in the GPU process.
I agree with you we'll need to fix this at some point. But once we do, E_OUTOFMEMORY will be recoverable and this code shouldn't generate a context lost (due to the reset). But I'm fine with a TODO there if the browser process won't (currently) allow it.
We'll need to hook that up in a followon change similar to existing code in the GPU process.
I am not very clear what's the followon change. Where is the existing code, can you paste link here to help me understand? Thanks!
LOG(ERROR) << "[WebNN] Device Removed Reason: "
<< logging::SystemErrorCodeToString(
adapter_->d3d12_device()->GetDeviceRemovedReason());
Mingming1 XuShould this log only for device removal error?
Rafael CintronAgree, maybe we should:
` HRESULT device_removed_reason = adapter_->d3d12_device()->GetDeviceRemovedReason();
if (FAILED(device_removed_reason)) {
LOG(ERROR) << "[WebNN] Device Removed Reason: "
<< logging::SystemErrorCodeToString(device_removed_reason);
}`
+1
Done
pending_receiver<WebNNContextClient>? context_client_receiver;
don't really need _receiver in the name here - but up to you
I just want to align with the `context_remote`. Maybe the suffix will make codes easier to read. Agree that it's not necessary.
// Closes the `context_remote_` and `context_client_receiver_` pipes
// because the context has been lost.
void OnDestroyed(const String& message) override;
void OnDisconnected();
I'm not sure of the naming of these - it's not super-obvious that it's about the remote side of the context being lost at the moment. OnDisconnected should perhaps refer to the context client being disconnected (not the /this/) and OnDestroyed should somehow refer to the other side being gone.
`WebNNContextClient` remote in GPU process calls the `OnDestroyed` method explicitly when some errors are captured in GPU process, it means the derived `MLContext` in the renderer process will be destroyed (lost).
The `OnDisconnected` method is called when the GPU process is crashed which will cause `WebNNContextClient` receiver disconnected. `OnDisconnected` calls `OnDestroyed` inside to destroy the `MLContext` since the mojom pipe is disconnected.
The name `OnDestroyed` is also to match WebGPU and better express that the context is unusable, from comments: this.https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/60724d4a_a12552ca/. In the future, we may plan to add another corresponding interface `Destroy()` to destroy the context from the renderer process, see discussion here: https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/d1632baa_7bdad5fb/
ningxin huAdapter's `HandleAdapterFailure()` of https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/adapter.cc;bpv=0;bpt=1 may also need to handle context lost.
Mingming1 XuOpen it for comments.
While the adapter is created before creating a context, if fails, there is no context.
You are correct, there is no context to lose. The 'lost' error could be reported by rejecting promise returned by `createContext()`. However, my open is should we be consistent and crash GPU process for non 'lost' errors. OK to add a TODO and follow up in a separate CL.
HandleContextLost("Failed to start recording.", hr);
Mingming1 Xu`StartRecordingIfNecessary()` may already fail and call `HandleRecordingError()` which calls `HandleContextLost()` before. It may crash at `CHECK` before running the callback and this `HandleContextLost()`.
Do you mean we should not call `HandleRecordingError()` which calls`HandleContextLost()` in the `StartRecordingIfNecessary()`? We can call the `HandleRecordingError()` outside if `StartRecordingIfNecessary()` fails.
We can call the HandleRecordingError() outside if StartRecordingIfNecessary() fails.
This might work. My point is we should ensure the behavior is consistent when error happens. The user code should get promise rejection from `readBuffer()` method.
HandleRecordingError("Failed to close and execute the command list.", hr);
Mingming1 Xu`HandleRecordingError()` calls `HandleContextLost()` that may crash at `CHECK` and won't run the following callback.
Do you mean move the `HandleRecordingError()` below the `std::move(callback).Run...` to let the callback run at first?
Yes, I suppose the user code should get promise rejection from `readBuffer()` method.
void ContextImplDml::HandleContextLost(std::string_view error_message,
Mingming1 XuThis is not an error message passed to user JS code but a log message. I feel it is confusing by passing it just for logging. Could it be moved out from this method? Other helpers, like `RETURN_IF_FAILED` in [error.h](https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/error.h;l=14;drc=ae983e94583c5e2d400fa7641439d2e8ca13fd93;bpv=1;bpt=1), already log the error.
Rafael CintronThis change is to address @rafael....@microsoft.com 's comments: https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/f38abdf6_7c0fbbcd/
I think Ningxin is getting confused by the two, very similarly named strings: one called `error_message` and the other one just called `message`.
The first one is meant to go into logs and provide some context on where the error occurred. The second one is meant to be a simpler error we provide to web developers that avoids revealing too much information about the context of the error to attackers.
Perhaps we can better name these to disambiguate; rename `message` to `message_for_promise` or something like that.
Perhaps we can better name these to disambiguate; rename message to message_for_promise or something like that.
This helps. Or we can rename `error_message` to `message_for_log`.
OnDestroyed(base::StrCat({"WebNN context is lost due to ", message}));
Do we want to reject `lost` promise for other "internal errors"? Or we just crash the GPU process by `CHECK` an error is not an "internal error"?
3- <Everything else>
For any other error, we should treat it as Chromium passing invalid parameters to the API and do a CHECK so that we can get crash dumps and diagnose.
We should LOG(ERROR) any failure we receive from DirectX along with the logging::SystemErrorCodeToString friendly string, even ones we CHECK for. This will become critical to root causing failures in the wild from web developers. Pooja has already converted these but we should convert others that might have been missed.
I didn't find the decision that we should report "everything else" errors to JS code by rejecting `lost` promise. Did I miss anything?
std::move(callback).Run(
base::unexpected(std::move(create_operator_result.error())));
Mingming1 XuDo we handle context lost here? Given the following code handles graph creation failure (it internally handles context lost) if creating an identity node fails.
Rafael Cintron`void HandleGraphCreationFailure(
const std::string& error_message,
WebNNContextImpl::CreateGraphImplCallback callback)` won't handle context lost, only `void HandleGraphCreationFailure(
const std::string& error_message,
HRESULT hr,
WebNNContextImpl::CreateGraphImplCallback callback, ContextImplDml* context)` will handle context lost, it requires the `HRESULT`.
@ningx...@intel.com brings up a good point that this particular error path does not result in context lost.
We originally had this error code path be graceful because we didn't want to crash the GPU process for unimplemented operators in the DML backend. Now that we have an implementation for every op, might be good to revisit this sooner rather than later.
OK to solve this as a separate change.
OK to solve this as a separate change.
+1. Worth adding a TODO.
HandleGraphCreationFailure("Failed to create identity operator.",
std::move(callback));
This may need to pass `context`.
There is no `HRESULT` to be handled by `HandleContextLost`.
Same as above, we may need to crash GPU process for operator creation failures. Please add a TODO.
// Closes the `context_remote_` and `context_client_receiver_` pipes
// because the context has been lost.
void OnDestroyed(const String& message) override;
void OnDisconnected();
Mingming1 XuI'm not sure of the naming of these - it's not super-obvious that it's about the remote side of the context being lost at the moment. OnDisconnected should perhaps refer to the context client being disconnected (not the /this/) and OnDestroyed should somehow refer to the other side being gone.
`WebNNContextClient` remote in GPU process calls the `OnDestroyed` method explicitly when some errors are captured in GPU process, it means the derived `MLContext` in the renderer process will be destroyed (lost).
The `OnDisconnected` method is called when the GPU process is crashed which will cause `WebNNContextClient` receiver disconnected. `OnDisconnected` calls `OnDestroyed` inside to destroy the `MLContext` since the mojom pipe is disconnected.
The name `OnDestroyed` is also to match WebGPU and better express that the context is unusable, from comments: this.https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/60724d4a_a12552ca/. In the future, we may plan to add another corresponding interface `Destroy()` to destroy the context from the renderer process, see discussion here: https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/d1632baa_7bdad5fb/
OnDestroyed should somehow refer to the other side being gone
Would `OnContextLost()` or `OnLost()` be more straightforward? Because service-side calls `HandleContextLost()` to send this message and blink-side would resolve the `lost` promise upon receiving this message. "OnDestroyed" sounds like something else.
OnDestroyed(base::StrCat({"WebNN context is lost due to ", message}));
Mingming1 XuDo we want to reject `lost` promise for other "internal errors"? Or we just crash the GPU process by `CHECK` an error is not an "internal error"?
ningxin huThe discussion is here: https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/ad47c8af_d3f27f05/ and https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/7aa27202_25387285/
I only read @rafael....@microsoft.com's [comment](https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/ad47c8af_d3f27f05/)
3- <Everything else>
For any other error, we should treat it as Chromium passing invalid parameters to the API and do a CHECK so that we can get crash dumps and diagnose.We should LOG(ERROR) any failure we receive from DirectX along with the logging::SystemErrorCodeToString friendly string, even ones we CHECK for. This will become critical to root causing failures in the wild from web developers. Pooja has already converted these but we should convert others that might have been missed.
I didn't find the decision that we should report "everything else" errors to JS code by rejecting `lost` promise. Did I miss anything?
Mingming helped me find another [comment](https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/7aa27202_25387285/) that I missed
Currently I handled E_OUTOFMEMORY, DXGI_ERROR_DEVICE_REMOVED, DXGI_ERROR_DEVICE_RESET three ones to call OnDestroyed, otherwise I directly return and won't call OnDestroyed.
By the time we get here, we know we have a failure HRESULT, correct? If so, we should always call OnDestroyed so that the web developer can realize that no more WebNN will happen with the context. The difference is whether the error warrants a CHECK or graceful termination of the GPU process.
I am fine with notifying user JS code with "internal error" via "lost" promise because GPU process is going to crash.
ningxin huAdapter's `HandleAdapterFailure()` of https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/adapter.cc;bpv=0;bpt=1 may also need to handle context lost.
Mingming1 XuOpen it for comments.
ningxin huWhile the adapter is created before creating a context, if fails, there is no context.
You are correct, there is no context to lose. The 'lost' error could be reported by rejecting promise returned by `createContext()`. However, my open is should we be consistent and crash GPU process for non 'lost' errors. OK to add a TODO and follow up in a separate CL.
void HandleContextLost(std::string_view error_message, HRESULT hr);
This method may crash GPU process. It'd be better to make it explicit: i.e. something like `HandleContextLostOrCrash()`.
It might be worth adding a comment explaining which errors are treated as "context lost".
Updated, ptal.
HandleContextLost("Failed to start recording.", hr);
Mingming1 Xu`StartRecordingIfNecessary()` may already fail and call `HandleRecordingError()` which calls `HandleContextLost()` before. It may crash at `CHECK` before running the callback and this `HandleContextLost()`.
ningxin huDo you mean we should not call `HandleRecordingError()` which calls`HandleContextLost()` in the `StartRecordingIfNecessary()`? We can call the `HandleRecordingError()` outside if `StartRecordingIfNecessary()` fails.
We can call the HandleRecordingError() outside if StartRecordingIfNecessary() fails.
This might work. My point is we should ensure the behavior is consistent when error happens. The user code should get promise rejection from `readBuffer()` method.
Updated. Also cc @bryan.b...@intel.com to check the change here.
HandleRecordingError("Failed to close and execute the command list.", hr);
Mingming1 Xu`HandleRecordingError()` calls `HandleContextLost()` that may crash at `CHECK` and won't run the following callback.
ningxin huDo you mean move the `HandleRecordingError()` below the `std::move(callback).Run...` to let the callback run at first?
Yes, I suppose the user code should get promise rejection from `readBuffer()` method.
Done
HandleRecordingError("Failed to download the buffer.", hr);
Mingming1 Xuditto.
Done
void ContextImplDml::HandleContextLost(std::string_view error_message,
Mingming1 XuThis is not an error message passed to user JS code but a log message. I feel it is confusing by passing it just for logging. Could it be moved out from this method? Other helpers, like `RETURN_IF_FAILED` in [error.h](https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/error.h;l=14;drc=ae983e94583c5e2d400fa7641439d2e8ca13fd93;bpv=1;bpt=1), already log the error.
Rafael CintronThis change is to address @rafael....@microsoft.com 's comments: https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/f38abdf6_7c0fbbcd/
ningxin huI think Ningxin is getting confused by the two, very similarly named strings: one called `error_message` and the other one just called `message`.
The first one is meant to go into logs and provide some context on where the error occurred. The second one is meant to be a simpler error we provide to web developers that avoids revealing too much information about the context of the error to attackers.
Perhaps we can better name these to disambiguate; rename `message` to `message_for_promise` or something like that.
Perhaps we can better name these to disambiguate; rename message to message_for_promise or something like that.
This helps. Or we can rename `error_message` to `message_for_log`.
Done
Leave TODO tracked by https://issues.chromium.org/issues/349640008.
std::move(callback).Run(
base::unexpected(std::move(create_operator_result.error())));
Do we handle context lost here? Given the following code handles graph creation failure (it internally handles context lost) if creating an identity node fails.
Rafael Cintron`void HandleGraphCreationFailure(
const std::string& error_message,
WebNNContextImpl::CreateGraphImplCallback callback)` won't handle context lost, only `void HandleGraphCreationFailure(
const std::string& error_message,
HRESULT hr,
WebNNContextImpl::CreateGraphImplCallback callback, ContextImplDml* context)` will handle context lost, it requires the `HRESULT`.
ningxin hu@ningx...@intel.com brings up a good point that this particular error path does not result in context lost.
We originally had this error code path be graceful because we didn't want to crash the GPU process for unimplemented operators in the DML backend. Now that we have an implementation for every op, might be good to revisit this sooner rather than later.
OK to solve this as a separate change.
OK to solve this as a separate change.
+1. Worth adding a TODO.
Leave TODO tracked by https://issues.chromium.org/issues/349649099
HandleGraphCreationFailure("Failed to create identity operator.",
std::move(callback));
Mingming1 XuThis may need to pass `context`.
ningxin huThere is no `HRESULT` to be handled by `HandleContextLost`.
Same as above, we may need to crash GPU process for operator creation failures. Please add a TODO.
Done
FakeWebNNContextImpl(mojo::PendingReceiver<mojom::WebNNContext> receiver,
Mingming1 XuLooks like we have no test cases which generate a context lost (and ensure delivery).
What's our plan there?
Mingming1 XuYes, agree, I should add tests for that, will do in following patchset.
Bryan BernhartAdded in patchset 10, PTAL.
Rafael CintronWhat we added is good but I think we'll need to go further and add a CTS. To do that, there needs to be a way to force a context lost from script w/o relying on the OS/driver generating a "lost" HRESULT.
Introducing `MLContext.destroy()` which also loses the context could achieve that. But if we treat destroy as lost, the web developer could mistakenly re-create the context (when in fact, it should remain destroyed). `MLContextLostInfo` will probably need a field to the know the source of the message to avoid that. WDYT?
Mingming1 XuHere are the [WebGPU device lost tests](https://gpuweb.github.io/cts/standalone/?q=webgpu:api,operation,device,lost:*). They're implemented with a `destroy` method on the device, which I think we should also have in WebNN.
The destroy method can come as a separate change.
Bryan BernhartThanks, open https://issues.chromium.org/issues/348904836 to track.
But if we treat destroy as lost, the web developer could mistakenly re-create the context (when in fact, it should remain destroyed).
Re-creating a new context after lost is as expected, the new context can work after the GPU recovers, right?
Mingming1 XuRe-creating a new context after lost is as expected, the new context can work after the GPU recovers, right?
Correct but only when the context wasn't destroyed/lost on purpose. I am fine with a TODO.
Rafael CintronThinking again, what's the purpose and usecase of introducing `MLContext.destroy()` into WebNN? Is it only for testing against forcing context lost from renderer process? @rafael....@microsoft.com @bryan.b...@intel.com @ningx...@intel.com
Destroy serves two purposes. One is for testing. The other is for the web developer to free the memory and resources taken up by the context. In the case of hybrid GPUs, that means powering down the discrete GPU and giving hours of battery life back to the end user.
Thanks!
// Closes the `context_remote_` and `context_client_receiver_` pipes
// because the context has been lost.
void OnDestroyed(const String& message) override;
void OnDisconnected();
Mingming1 XuI'm not sure of the naming of these - it's not super-obvious that it's about the remote side of the context being lost at the moment. OnDisconnected should perhaps refer to the context client being disconnected (not the /this/) and OnDestroyed should somehow refer to the other side being gone.
ningxin hu`WebNNContextClient` remote in GPU process calls the `OnDestroyed` method explicitly when some errors are captured in GPU process, it means the derived `MLContext` in the renderer process will be destroyed (lost).
The `OnDisconnected` method is called when the GPU process is crashed which will cause `WebNNContextClient` receiver disconnected. `OnDisconnected` calls `OnDestroyed` inside to destroy the `MLContext` since the mojom pipe is disconnected.
The name `OnDestroyed` is also to match WebGPU and better express that the context is unusable, from comments: this.https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/60724d4a_a12552ca/. In the future, we may plan to add another corresponding interface `Destroy()` to destroy the context from the renderer process, see discussion here: https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/d1632baa_7bdad5fb/
OnDestroyed should somehow refer to the other side being gone
Would `OnContextLost()` or `OnLost()` be more straightforward? Because service-side calls `HandleContextLost()` to send this message and blink-side would resolve the `lost` promise upon receiving this message. "OnDestroyed" sounds like something else.
Rename as `OnLost` now.
void MLContext::OnWebNNContextLostCaptured(const String& context_lost_info) {
Should this info be reported to JS code?
Mingming1 XuNow I reported these info related to adapter error in `src\services\webnn\dml\utils.cc`:
`std::string context_lost_info;
if (hr == E_OUTOFMEMORY) {
context_lost_info = "out of memory.";
} else if (hr == DXGI_ERROR_DEVICE_REMOVED) {
context_lost_info = "device removed.";
} else if (hr == DXGI_ERROR_DEVICE_HUNG) {
context_lost_info = "device hung.";
} else if (hr == DXGI_ERROR_DEVICE_RESET) {
context_lost_info = "device reset.";
} else {
return;
}`
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
HandleContextLost("Failed to start recording.", hr);
Mingming1 Xu`StartRecordingIfNecessary()` may already fail and call `HandleRecordingError()` which calls `HandleContextLost()` before. It may crash at `CHECK` before running the callback and this `HandleContextLost()`.
ningxin huDo you mean we should not call `HandleRecordingError()` which calls`HandleContextLost()` in the `StartRecordingIfNecessary()`? We can call the `HandleRecordingError()` outside if `StartRecordingIfNecessary()` fails.
Mingming1 XuWe can call the HandleRecordingError() outside if StartRecordingIfNecessary() fails.
This might work. My point is we should ensure the behavior is consistent when error happens. The user code should get promise rejection from `readBuffer()` method.
Updated. Also cc @bryan.b...@intel.com to check the change here.
@mingmi...@intel.com LGTM
HandleWebNNContextLost(hr, this);
I am not very clear what's the followon change.
Perhaps one way is to assign WebNN a lower job limit, which gives us first chance to handle OOM.
Code-Review | +1 |
pending_receiver<WebNNContextClient>? context_client_receiver;
Mingming1 Xudon't really need _receiver in the name here - but up to you
I just want to align with the `context_remote`. Maybe the suffix will make codes easier to read. Agree that it's not necessary.
totally up to you - feel free to resolve either way.
// Closes the `context_remote_` and `context_client_receiver_` pipes
// because the context has been lost.
void OnDestroyed(const String& message) override;
void OnDisconnected();
Mingming1 XuI'm not sure of the naming of these - it's not super-obvious that it's about the remote side of the context being lost at the moment. OnDisconnected should perhaps refer to the context client being disconnected (not the /this/) and OnDestroyed should somehow refer to the other side being gone.
ningxin hu`WebNNContextClient` remote in GPU process calls the `OnDestroyed` method explicitly when some errors are captured in GPU process, it means the derived `MLContext` in the renderer process will be destroyed (lost).
The `OnDisconnected` method is called when the GPU process is crashed which will cause `WebNNContextClient` receiver disconnected. `OnDisconnected` calls `OnDestroyed` inside to destroy the `MLContext` since the mojom pipe is disconnected.
The name `OnDestroyed` is also to match WebGPU and better express that the context is unusable, from comments: this.https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/60724d4a_a12552ca/. In the future, we may plan to add another corresponding interface `Destroy()` to destroy the context from the renderer process, see discussion here: https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/d1632baa_7bdad5fb/
Mingming1 XuOnDestroyed should somehow refer to the other side being gone
Would `OnContextLost()` or `OnLost()` be more straightforward? Because service-side calls `HandleContextLost()` to send this message and blink-side would resolve the `lost` promise upon receiving this message. "OnDestroyed" sounds like something else.
Rename as `OnLost` now.
Code-Review | +1 |
Still LGTM assuming issues others have raised have been fixed.
I agree with you we'll need to fix this at some point. But once we do, E_OUTOFMEMORY will be recoverable and this code shouldn't generate a context lost (due to the reset).
Perhaps one way is to assign WebNN a lower job limit, which gives us first chance to handle OOM.
@bryan.b...@intel.com. The job limit is for the entire GPU process. There is no way to have a different job limit for WebNN vs. everything else. Attempting to allocate memory with D3D will trigger the job limit (if you're reached it) and cause the browser process to terminate the GPU process. To make large buffer allocations gracefully fail, we'll need to guesstimate whether the process will OOM before making the allocation.
I am not very clear what's the followon change. Where is the existing code, can you paste link here to help me understand? Thanks!
@mingmi...@intel.com the code I had in mind is `GpuServiceImpl::MaybeExitOnContextLost`. If WebNN is the only thing using D3D12 in the GPU process, it can attempt to recover on its own. When several systems use it at the same time, there needs to either be coordination on restarting everything together (difficult) or coordination to self-terminate (easier).
// TODO(crbug.com/41492165): generate error using context.
Reilly GrantCan we handle context lost here as well?
Bryan BernhartAre we assuming that any failure to create a buffer or graph means the context has been lost? That seems strange. I expect that context loss is a signal we specifically get from the adapter.
Reilly GrantAre we assuming that any failure to create a buffer or graph means the context has been lost?
We could, like WebGPU. If we do not then the OOM skews to some, possibly a unrelated call, which does handle context lost which is hard to debug.
Bryan BernhartI don't think we should be reporting a context loss for errors like this. That's a separate concept.
Mingming1 XuIf this returns E_OUTOFMEMORY, the context will become inoperable and effectively lost. Also, the HR could be any other error result (but unlikely) so I can see it being quite useful to coverage to one error handler.
Mingming1 XuThis discussion should be merged into above one https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/ad47c8af_d3f27f05/: whether we should handle the context lost according to the system error code `HRESULT`.
Mingming1 XuI am thinking maybe all the methods that return `HRESULT` inside or outside should be handled by `HandleWebNNContextLost`, if the `HRESULT` is context lost error, we should call `OnDestoryed`. WDYT? @rafael....@microsoft.com
Bryan BernhartUpdated, PTAL.
It depends if the HRESULT error will be recoverable or not. In this case, E_OUTOFMEMORY is not recoverable and leaves the context inoperable so a "lost" seems appropriate. In other cases, this is not so [1].
There are two OOM recovery situations: the case of creating a very large buffer that would otherwise OOM vs needing to free-up existing memory from OOM (ex. command recorder reset).
In latter case, I don't think guestimates help. But if we can recover from large buffer allocations then it'll likely we can recover from other OOM failures too.
Code-Review | +1 |
Adapter's `HandleAdapterFailure()` of https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/adapter.cc;bpv=0;bpt=1 may also need to handle context lost.
Mingming1 XuOpen it for comments.
ningxin huWhile the adapter is created before creating a context, if fails, there is no context.
Mingming1 XuYou are correct, there is no context to lose. The 'lost' error could be reported by rejecting promise returned by `createContext()`. However, my open is should we be consistent and crash GPU process for non 'lost' errors. OK to add a TODO and follow up in a separate CL.
Acknowledged
lgtm % open comments
void HandleContextLost(std::string_view error_message, HRESULT hr);
Mingming1 XuThis method may crash GPU process. It'd be better to make it explicit: i.e. something like `HandleContextLostOrCrash()`.
It might be worth adding a comment explaining which errors are treated as "context lost".
Updated, ptal.
Acknowledged
HandleContextLost("Failed to start recording.", hr);
Mingming1 Xu`StartRecordingIfNecessary()` may already fail and call `HandleRecordingError()` which calls `HandleContextLost()` before. It may crash at `CHECK` before running the callback and this `HandleContextLost()`.
ningxin huDo you mean we should not call `HandleRecordingError()` which calls`HandleContextLost()` in the `StartRecordingIfNecessary()`? We can call the `HandleRecordingError()` outside if `StartRecordingIfNecessary()` fails.
Mingming1 XuWe can call the HandleRecordingError() outside if StartRecordingIfNecessary() fails.
This might work. My point is we should ensure the behavior is consistent when error happens. The user code should get promise rejection from `readBuffer()` method.
Bryan BernhartUpdated. Also cc @bryan.b...@intel.com to check the change here.
@mingmi...@intel.com LGTM
pending_receiver<WebNNContextClient>? context_client_receiver;
Mingming1 Xudon't really need _receiver in the name here - but up to you
Alex GoughI just want to align with the `context_remote`. Maybe the suffix will make codes easier to read. Agree that it's not necessary.
totally up to you - feel free to resolve either way.
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Code-Review | +1 |
context_impl_->OnLost("Context is lost");
context_impl_ = nullptr;
This should let you remove the `DanglingUntriaged` annotation below. A call to `ExtractAsDangling()` is necessary for self-destruct functions like `OnLost()`.
```suggestion
context_impl_.ExtractAsDangling()->OnLost("Context is lost");
```
base::RunLoop().RunUntilIdle();
EXPECT_TRUE(context_client_impl.IsLost());
```suggestion
base::RunUntil([&]() { return context_client_impl.IsLost(); });
```
if (lost_property_->GetState() == LostProperty::kPending) {
We can assert that this is true because `OnLost()` can only be called once.
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Code-Review | +1 |
context_impl_->OnLost("Context is lost");
context_impl_ = nullptr;
This should let you remove the `DanglingUntriaged` annotation below. A call to `ExtractAsDangling()` is necessary for self-destruct functions like `OnLost()`.
```suggestion
context_impl_.ExtractAsDangling()->OnLost("Context is lost");
```
Done
base::RunLoop().RunUntilIdle();
EXPECT_TRUE(context_client_impl.IsLost());
```suggestion
base::RunUntil([&]() { return context_client_impl.IsLost(); });
```
I guess you mean `base::test::RunUntil` from "base/test/run_until.h". PTAL, thanks!
if (lost_property_->GetState() == LostProperty::kPending) {
We can assert that this is true because `OnLost()` can only be called once.
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
context_impl_->OnLost("Context is lost");
context_impl_ = nullptr;
Mingming1 XuThis should let you remove the `DanglingUntriaged` annotation below. A call to `ExtractAsDangling()` is necessary for self-destruct functions like `OnLost()`.
```suggestion
context_impl_.ExtractAsDangling()->OnLost("Context is lost");
```
Done
If I remove `DanglingUntriaged`, following tests `WebNNGraphImplTest.ValidateDispatchTest` and `WebNNGraphImplTest.ValidateInputsTest` will fail due to `Detected dangling raw_ptr` https://ci.chromium.org/ui/p/chromium/builders/try/win-rel/664189/overview, so I have to revert here to patchset 27. PTAL.
Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |