Mingming1 Xu added 1 comment

File-level comment, Patchset 4 (Latest):

Related details

Attention is currently required from:

ningxin hu

Submit Requirements:

Code-Coverage
Code-Review

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

ningxin hu (Gerrit)

unread,

Jun 13, 2024, 8:59:44 AMJun 13

to Mingming1 Xu, Rafael Cintron, Austin Sullivan, Bryan Bernhart, Reilly Grant, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Mingming1 Xu and Rafael Cintron

ningxin hu added 4 comments

Patchset-level comments

ningxin hu . resolved

Adding other reviewers. Please also take a look, thanks!

File services/webnn/webnn_context_impl.cc

Line 98, Patchset 4 (Latest): // receiver_.ReportBadMessage(context_lost_info);

ningxin hu . unresolved

Do we need it?

File third_party/blink/renderer/modules/ml/ml_context.cc

Line 224, Patchset 4 (Latest):void MLContext::OnWebNNContextLostCaptured(const String& context_lost_info) {

ningxin hu . unresolved

Should this info be reported to JS code?

File third_party/blink/renderer/modules/ml/ml_context_lost_info.h

Line 27, Patchset 4 (Latest): String message_;

ningxin hu . unresolved

const String?

Line 154, Patchset 4 (Latest):

Related details

Attention is currently required from:

Austin Sullivan
Mingming1 Xu
Rafael Cintron

Submit Requirements:

Code-Coverage
Code-Review

No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

Bryan Bernhart (Gerrit)

unread,

Jun 13, 2024, 1:45:38 PMJun 13

to Mingming1 Xu, Rafael Cintron, Austin Sullivan, Reilly Grant, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Mingming1 Xu and Rafael Cintron

Bryan Bernhart added 6 comments

File services/webnn/dml/context_impl_dml.cc

Line 225, Patchset 4 (Latest): // TODO(crbug.com/41492165): generate error using context.

Bryan Bernhart . unresolved

Can we handle context lost here as well?

File services/webnn/dml/graph_impl_dml.cc

Line 5755, Patchset 4 (Latest):// TODO(crbug.com/41492165): generate error using context.

Bryan Bernhart . unresolved

nit: remove TODO?

File services/webnn/dml/utils.cc

Line 369, Patchset 4 (Latest):void HandleWebNNContextLost(HRESULT hr, const WebNNContextImpl* context) {

Bryan Bernhart . unresolved

Shouldn't this be off the `WebNNContextImpl`?

Line 377, Patchset 4 (Latest): } else if (hr == DXGI_ERROR_DEVICE_REMOVED) {

Bryan Bernhart . unresolved

It would be helpful to LOG the actual reason too via GetDeviceRemovedReason().

File services/webnn/webnn_context_impl.cc

Line 39, Patchset 4 (Latest): context_provider_->OnConnectionError(this);

Bryan Bernhart . unresolved

If the context gets destroyed, isn't that considered "lost"?

File services/webnn/webnn_graph_impl_unittest.cc

Line 108, Patchset 4 (Latest): FakeWebNNContextImpl(mojo::PendingReceiver<mojom::WebNNContext> receiver,

Bryan Bernhart . unresolved

Looks like we have no test cases which generate a context lost (and ensure delivery).

What's our plan there?

satisfied_requirement

unsatisfied_requirement

open

diffy

Reilly Grant (Gerrit)

unread,

Jun 13, 2024, 5:12:33 PMJun 13

to Mingming1 Xu, Rafael Cintron, Austin Sullivan, Bryan Bernhart, Reilly Grant, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Bryan Bernhart, Mingming1 Xu, Rafael Cintron and ningxin hu

Reilly Grant added 10 comments

File services/webnn/dml/context_impl_dml.cc

Line 225, Patchset 4 (Latest): // TODO(crbug.com/41492165): generate error using context.

Bryan Bernhart . unresolved

Can we handle context lost here as well?

Reilly Grant

Are we assuming that any failure to create a buffer or graph means the context has been lost? That seems strange. I expect that context loss is a signal we specifically get from the adapter.

File services/webnn/public/mojom/webnn_context_lost_capturer.mojom

Line 7, Patchset 4 (Latest):// This interface runs in the GPU process and is called from the GPU process.

Reilly Grant . unresolved

```suggestion
// This interface runs in the renderer process and is called from the GPU process.
```

Line 11, Patchset 4 (Latest):interface WebNNContextLostCapturer {

Reilly Grant . unresolved

Can we generalize this to `WebNNContextClient` to match the style of other Mojo interfaces where we have a `Foo` interface and a `FooClient` interface for any async events that need to be sent back to the consumer of the `Foo` interface?

File services/webnn/webnn_context_impl.h

Line 34, Patchset 4 (Latest): context_lost_capturer_remote = mojo::NullRemote());

Reilly Grant . unresolved

Remove this default parameter and pass this parameter through all the subclasses.

File services/webnn/webnn_context_impl.cc

Line 39, Patchset 4 (Latest): context_provider_->OnConnectionError(this);

Bryan Bernhart . unresolved

If the context gets destroyed, isn't that considered "lost"?

Reilly Grant

If this method is being called then the renderer isn't listening anymore.

Line 98, Patchset 4 (Latest): // receiver_.ReportBadMessage(context_lost_info);

ningxin hu . unresolved

Do we need it?

Reilly Grant

No, this isn't a bad message.

File third_party/blink/renderer/modules/ml/ml_context.h

  // Reset the remote of `WebNNContext` and the receiver of
  // `WebNNContextLostCapturer`, and then resolve the `lost_property_` which
  // indicates that the WebNN context is lost. It will be called in two cases:
  // 1. The `WebNNContext` remote is cut off from its receiver, for example,
  // when the WebNN Service is crashed.
  // 2. We need to release the WebNNContext in the WebNN Service when the the
  // WebNN Service captured a WebNN context lost error.

Reilly Grant . unresolved

```suggestion
  // Closes the `remote_context_` and `context_lost_capturer_receiver_` pipes
  // because the context has been lost.
```

File third_party/blink/renderer/modules/ml/ml_context.cc

Line 214, Patchset 4 (Latest): &MLContext::OnWebNNContextDisconnected, WrapWeakPersistent(this)));

Reilly Grant . unresolved

I would attach this to `context_lost_capturer_receiver_` so that we have the invariant that either we get a call to `OnWebNNContextLostCaptured()` explaining why the context has been lost or we get a call to this handler to indicate we never will (and assume a GPU process crash).

Line 239, Patchset 4 (Latest): "WebNN context is lost due to connection error."));

Reilly Grant . unresolved

Call `OnWebNNContextLostCaptured()` with this error. Put all the logic for reporting errors in that function.

File third_party/blink/renderer/modules/ml/ml_context_lost_info.idl

Line 11, Patchset 4 (Latest):] interface MLContextLostInfo {

Reilly Grant . unresolved

Use a dictionary instead of an interface since this type doesn't have any methods. I know there are a number of examples where this sort of type is defined as an interface but most of the ones I'm responsible for (such as the WebUSB transfer result types) I wish I'd defined as dictionaries instead.

Related details

Attention is currently required from:

Austin Sullivan
Bryan Bernhart
Mingming1 Xu
Rafael Cintron
ningxin hu

Submit Requirements:

Code-Coverage
Code-Review
No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

Bryan Bernhart (Gerrit)

unread,

Jun 13, 2024, 6:48:54 PMJun 13

to Mingming1 Xu, Rafael Cintron, Austin Sullivan, Reilly Grant, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Mingming1 Xu, Rafael Cintron, Reilly Grant and ningxin hu

Bryan Bernhart added 2 comments

File services/webnn/dml/context_impl_dml.cc

Line 225, Patchset 4 (Latest): // TODO(crbug.com/41492165): generate error using context.

Bryan Bernhart . unresolved

Can we handle context lost here as well?

Reilly Grant

Are we assuming that any failure to create a buffer or graph means the context has been lost? That seems strange. I expect that context loss is a signal we specifically get from the adapter.

Bryan Bernhart

Are we assuming that any failure to create a buffer or graph means the context has been lost?

We could, like WebGPU. If we do not then the OOM skews to some, possibly a unrelated call, which does handle context lost which is hard to debug.

File services/webnn/webnn_context_impl.cc

Line 39, Patchset 4 (Latest): context_provider_->OnConnectionError(this);

Bryan Bernhart . unresolved

If the context gets destroyed, isn't that considered "lost"?

Reilly Grant

If this method is being called then the renderer isn't listening anymore.

Bryan Bernhart

I suppose the renderer or `MLContext` could generate "lost" event beforehand. Not exactly certain why this was needed in WebGPU.

Related details

Attention is currently required from:

Austin Sullivan
Mingming1 Xu
Rafael Cintron
Reilly Grant
ningxin hu

Submit Requirements:

Code-Coverage
Code-Review
No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

Reilly Grant (Gerrit)

unread,

Jun 13, 2024, 7:20:46 PMJun 13

to Mingming1 Xu, Rafael Cintron, Austin Sullivan, Bryan Bernhart, Reilly Grant, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Bryan Bernhart, Mingming1 Xu, Rafael Cintron and ningxin hu

Reilly Grant added 1 comment

File services/webnn/dml/context_impl_dml.cc

Line 225, Patchset 4 (Latest): // TODO(crbug.com/41492165): generate error using context.

Bryan Bernhart . unresolved

Can we handle context lost here as well?

Reilly Grant

Are we assuming that any failure to create a buffer or graph means the context has been lost? That seems strange. I expect that context loss is a signal we specifically get from the adapter.

Bryan Bernhart

Are we assuming that any failure to create a buffer or graph means the context has been lost?
We could, like WebGPU. If we do not then the OOM skews to some, possibly a unrelated call, which does handle context lost which is hard to debug.

Reilly Grant

I don't think we should be reporting a context loss for errors like this. That's a separate concept.

Related details

Attention is currently required from:

Austin Sullivan

Bryan Bernhart
Mingming1 Xu
Rafael Cintron

ningxin hu

Submit Requirements:

Code-Coverage
Code-Review
No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

Bryan Bernhart (Gerrit)

unread,

Jun 13, 2024, 7:35:12 PMJun 13

to Mingming1 Xu, Rafael Cintron, Austin Sullivan, Reilly Grant, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Mingming1 Xu, Rafael Cintron, Reilly Grant and ningxin hu

Bryan Bernhart added 1 comment

File services/webnn/dml/context_impl_dml.cc

Line 225, Patchset 4 (Latest): // TODO(crbug.com/41492165): generate error using context.

Bryan Bernhart . unresolved

Can we handle context lost here as well?

Reilly Grant

Are we assuming that any failure to create a buffer or graph means the context has been lost? That seems strange. I expect that context loss is a signal we specifically get from the adapter.

Bryan Bernhart

Are we assuming that any failure to create a buffer or graph means the context has been lost?
We could, like WebGPU. If we do not then the OOM skews to some, possibly a unrelated call, which does handle context lost which is hard to debug.

Reilly Grant

I don't think we should be reporting a context loss for errors like this. That's a separate concept.

Bryan Bernhart

If this returns E_OUTOFMEMORY, the context will become inoperable and effectively lost. Also, the HR could be any other error result (but unlikely) so I can see it being quite useful to coverage to one error handler.

File-level comment, Patchset 6 (Latest):

Related details

Attention is currently required from:

Austin Sullivan

Mingming1 Xu
Rafael Cintron
Reilly Grant

ningxin hu

Submit Requirements:

Code-Coverage
Code-Review
No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

Mingming1 Xu (Gerrit)

unread,

Jun 14, 2024, 5:28:19 AMJun 14

to AyeAye, Rafael Cintron, Austin Sullivan, Bryan Bernhart, Reilly Grant, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Bryan Bernhart, Rafael Cintron, Reilly Grant and ningxin hu

Mingming1 Xu added 13 comments

Patchset-level comments

Mingming1 Xu . resolved

Addressed some of your comments, will continue addressing others.

File services/webnn/dml/graph_impl_dml.cc

Line 5755, Patchset 4:// TODO(crbug.com/41492165): generate error using context.

Bryan Bernhart . resolved

nit: remove TODO?

Mingming1 Xu

Done

File services/webnn/public/mojom/webnn_context_lost_capturer.mojom

Line 7, Patchset 4:// This interface runs in the GPU process and is called from the GPU process.

Reilly Grant . resolved

```suggestion
// This interface runs in the renderer process and is called from the GPU process.
```

Mingming1 Xu

Done

Line 11, Patchset 4:interface WebNNContextLostCapturer {

Reilly Grant . unresolved

Can we generalize this to `WebNNContextClient` to match the style of other Mojo interfaces where we have a `Foo` interface and a `FooClient` interface for any async events that need to be sent back to the consumer of the `Foo` interface?

Mingming1 Xu

Good idea! Now I delete this file to move the `WebNNContextClient` to `webnn_context_provider.mojom`, PTAL.

File services/webnn/webnn_context_impl.h

Line 34, Patchset 4: context_lost_capturer_remote = mojo::NullRemote());

Reilly Grant . resolved

Remove this default parameter and pass this parameter through all the subclasses.

Mingming1 Xu

Done

File services/webnn/webnn_context_impl.cc

Line 39, Patchset 4: context_provider_->OnConnectionError(this);

Bryan Bernhart . unresolved

If the context gets destroyed, isn't that considered "lost"?

Reilly Grant

If this method is being called then the renderer isn't listening anymore.

Bryan Bernhart

I suppose the renderer or `MLContext` could generate "lost" event beforehand. Not exactly certain why this was needed in WebGPU.

Mingming1 Xu

Yes, I think this method will only be called when the renderer destroys the WebNNContext pipe.

Line 98, Patchset 4: // receiver_.ReportBadMessage(context_lost_info);

ningxin hu . resolved

Do we need it?

Reilly Grant

No, this isn't a bad message.

Mingming1 Xu

Done

File third_party/blink/renderer/modules/ml/ml_context.h

Line 154, Patchset 4: // Reset the remote of `WebNNContext` and the receiver of


  // `WebNNContextLostCapturer`, and then resolve the `lost_property_` which
  // indicates that the WebNN context is lost. It will be called in two cases:
  // 1. The `WebNNContext` remote is cut off from its receiver, for example,
  // when the WebNN Service is crashed.
  // 2. We need to release the WebNNContext in the WebNN Service when the the
  // WebNN Service captured a WebNN context lost error.

Reilly Grant . unresolved

```suggestion
  // Closes the `remote_context_` and `context_lost_capturer_receiver_` pipes
  // because the context has been lost.
```

Mingming1 Xu

Done. I move these comments above the `OnWebNNContextLostCaptured` because now `OnWebNNContextDisconnected` will call it. Also rename them as `OnLostCaptured` and `OnDisconnected` to simplify.

File third_party/blink/renderer/modules/ml/ml_context.cc

Line 214, Patchset 4: &MLContext::OnWebNNContextDisconnected, WrapWeakPersistent(this)));

Reilly Grant . resolved

I would attach this to `context_lost_capturer_receiver_` so that we have the invariant that either we get a call to `OnWebNNContextLostCaptured()` explaining why the context has been lost or we get a call to this handler to indicate we never will (and assume a GPU process crash).

Mingming1 Xu

Done

Line 224, Patchset 4:void MLContext::OnWebNNContextLostCaptured(const String& context_lost_info) {

ningxin hu . unresolved

Should this info be reported to JS code?

Mingming1 Xu

Now I reported these info related to adapter error in `src\services\webnn\dml\utils.cc`:

  `std::string context_lost_info;
  if (hr == E_OUTOFMEMORY) {
    context_lost_info = "out of memory.";


  } else if (hr == DXGI_ERROR_DEVICE_REMOVED) {

    context_lost_info = "device removed.";
  } else if (hr == DXGI_ERROR_DEVICE_HUNG) {
    context_lost_info = "device hung.";
  } else if (hr == DXGI_ERROR_DEVICE_RESET) {
    context_lost_info = "device reset.";
  } else {
    return;
  }`

Line 239, Patchset 4: "WebNN context is lost due to connection error."));

Reilly Grant . resolved

Call `OnWebNNContextLostCaptured()` with this error. Put all the logic for reporting errors in that function.

Mingming1 Xu

Done

File third_party/blink/renderer/modules/ml/ml_context_lost_info.h

Line 27, Patchset 4: String message_;

ningxin hu . resolved

const String?

Mingming1 Xu

Now I use `dictionary MLContextLostInfo`, the header file will be generated automatically.

File third_party/blink/renderer/modules/ml/ml_context_lost_info.idl

Line 11, Patchset 4:] interface MLContextLostInfo {

Reilly Grant . unresolved

Use a dictionary instead of an interface since this type doesn't have any methods. I know there are a number of examples where this sort of type is defined as an interface but most of the ones I'm responsible for (such as the WebUSB transfer result types) I wish I'd defined as dictionaries instead.

Mingming1 Xu

Done. I deleted this file and remove the dictionary definition to `ml_context.idl`, PTAL.

Line 113, Patchset 6 (Latest):

Related details

Attention is currently required from:

Austin Sullivan
Bryan Bernhart

Rafael Cintron
Reilly Grant
ningxin hu

Submit Requirements:

Code-Coverage
Code-Review
No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

Reilly Grant (Gerrit)

unread,

Jun 14, 2024, 2:45:43 PMJun 14

to Mingming1 Xu, AyeAye, Rafael Cintron, Austin Sullivan, Bryan Bernhart, Reilly Grant, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Bryan Bernhart, Mingming1 Xu, Rafael Cintron and ningxin hu

Reilly Grant added 6 comments

File services/webnn/dml/graph_impl_dml.cc

Line 4619, Patchset 6 (Latest): HandleWebNNContextLost(hr, context);

Reilly Grant . unresolved

@rafael....@microsoft.com, can you confirm that all graph creation failures should be considered "context lost" scenarios?

File services/webnn/dml/utils.cc

Line 374, Patchset 6 (Latest): std::string message;

Reilly Grant . unresolved

nit: use `std::string_view` and a `switch` statement.

File services/webnn/public/mojom/webnn_context_provider.mojom

  // Called by the GPU process to create a message pipe for
  // `WebNNContextClient`, the `message` is the error message
  // sent to the renderer process, the renderer process will
  // override this method to handle the lost error.

Reilly Grant . unresolved

```suggestion
  // Called to explain why the `WebNNContextClient` is no longer available.
```

Line 117, Patchset 6 (Latest): OnLostCaptured(string message);

Reilly Grant . unresolved

To match WebGPU and better express that the context is unusable after this.

```suggestion
  OnDestroyed(string message);
```

File services/webnn/webnn_context_impl.cc

Line 95, Patchset 6 (Latest): client_remote_->OnLostCaptured(message);

Reilly Grant . unresolved

After sending this message we should call `context_provider_->OnConnectionError(this)` so that the context provider destroys this object.

File services/webnn/webnn_context_provider_impl.cc

Line 14, Patchset 6 (Latest):#include "services/webnn/public/mojom/webnn_context_lost_capturer.mojom.h"

Reilly Grant . unresolved

This file no longer exists.

Related details

Attention is currently required from:

Austin Sullivan
Bryan Bernhart
Mingming1 Xu
Rafael Cintron
ningxin hu

satisfied_requirement

unsatisfied_requirement

open

diffy

Bryan Bernhart (Gerrit)

unread,

Jun 14, 2024, 5:00:07 PMJun 14

to Mingming1 Xu, AyeAye, Rafael Cintron, Austin Sullivan, Reilly Grant, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Mingming1 Xu, Rafael Cintron, Reilly Grant and ningxin hu

Bryan Bernhart added 1 comment

File services/webnn/webnn_context_impl.cc

Line 39, Patchset 4: context_provider_->OnConnectionError(this);

Bryan Bernhart . unresolved

If the context gets destroyed, isn't that considered "lost"?

Reilly Grant

If this method is being called then the renderer isn't listening anymore.

Bryan Bernhart

I suppose the renderer or `MLContext` could generate "lost" event beforehand. Not exactly certain why this was needed in WebGPU.

Mingming1 Xu

Yes, I think this method will only be called when the renderer destroys the WebNNContext pipe.

Bryan Bernhart

Dug into this more. Looks like WebGPU considers destroyed as "lost" to communicate the context/device shouldn't be kept alive, by the web developer, if some transient (but still fatal) failure occurred. A transient error could be OOM, for example.

Related details

Attention is currently required from:

Austin Sullivan

Mingming1 Xu
Rafael Cintron
Reilly Grant

ningxin hu

satisfied_requirement

unsatisfied_requirement

open

diffy

Rafael Cintron (Gerrit)

unread,

Jun 14, 2024, 7:16:20 PMJun 14

to Mingming1 Xu, AyeAye, Austin Sullivan, Bryan Bernhart, Reilly Grant, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Mingming1 Xu, Reilly Grant and ningxin hu

Rafael Cintron added 1 comment

File services/webnn/dml/graph_impl_dml.cc

Line 4619, Patchset 6 (Latest): HandleWebNNContextLost(hr, context);

Reilly Grant . unresolved

@rafael....@microsoft.com, can you confirm that all graph creation failures should be considered "context lost" scenarios?

Rafael Cintron

I'm glad we're having this conversation. :-)

Here are the three errors Chromium should handle from DirectML and Direct3D:

1- `DXGI_ERROR_DEVICE_REMOVED`
"Device removed" can happen for reasons Chromium is not able to control such as a driver update, other application on the system behaving poorly or the adapter actually being physically removed.

The latter can happen if you're performing computation on the Surface Book's NVidia GPU (which is under the keyboard) and the user detaches the tablet portion from the keyboard. The Intel GPU, which is behind the monitor, is the only one left you can use.

Since removed devices are basically useless and the D3D12 device is a singleton per process, and per adapter, it is difficult for WebNN, or other individual Chromium system to recover from this on its own. Hence, other code in the GPU process detects the device has become removed (at least on the D3D11 device) and terminates the GPU process with a special error code. The browser process restarts it.

Separately from this change, we should work with the people who own that code in Chromium to make WebNN be a part of it somehow.

For this case, #1, the renderer process will see that the GPU process has been disconnected and we should resolve the "lost" promise.

2- `E_OUTOFMEMORY`
It is possible to handle `E_OUTOFMEMORY` without everything becoming hosed. _However_, when the GPU process reaches the job commit limit, the browser process is notified by the OS. Upon the browser process being notified, it terminates and restarts the GPU process. So, you have a race between the GPU process handling this gracefully and the browser process terminating it.

I think we should handle this case gracefully. In practice, the web developer will see the GPU process connection get terminated, same as the "device removed".

3- <Everything else>
For any other error, we should treat it as Chromium passing invalid parameters to the API and do a `CHECK` so that we can get crash dumps and diagnose.

Here is the [Dawn code](https://source.chromium.org/chromium/chromium/src/+/main:third_party/dawn/src/dawn/native/d3d/D3DError.cpp) which handles errors similar to what I described above.

We should `LOG(ERROR)` any failure we receive from DirectX along with the `logging::SystemErrorCodeToString` friendly string, even ones we CHECK for. This will become critical to root causing failures in the wild from web developers. Pooja has already converted these but we should convert others that might have been missed.

Related details

Attention is currently required from:

Austin Sullivan
Mingming1 Xu
Reilly Grant
ningxin hu

satisfied_requirement

unsatisfied_requirement

open

diffy

Reilly Grant (Gerrit)

unread,

Jun 14, 2024, 8:46:56 PMJun 14

to Mingming1 Xu, AyeAye, Rafael Cintron, Austin Sullivan, Bryan Bernhart, Reilly Grant, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Mingming1 Xu, Rafael Cintron and ningxin hu

Reilly Grant added 1 comment

File services/webnn/dml/graph_impl_dml.cc

Line 4619, Patchset 6 (Latest): HandleWebNNContextLost(hr, context);

Reilly Grant . unresolved

@rafael....@microsoft.com, can you confirm that all graph creation failures should be considered "context lost" scenarios?

Rafael Cintron

I'm glad we're having this conversation. :-)
Here are the three errors Chromium should handle from DirectML and Direct3D:
1- `DXGI_ERROR_DEVICE_REMOVED`
"Device removed" can happen for reasons Chromium is not able to control such as a driver update, other application on the system behaving poorly or the adapter actually being physically removed.
The latter can happen if you're performing computation on the Surface Book's NVidia GPU (which is under the keyboard) and the user detaches the tablet portion from the keyboard. The Intel GPU, which is behind the monitor, is the only one left you can use.
Since removed devices are basically useless and the D3D12 device is a singleton per process, and per adapter, it is difficult for WebNN, or other individual Chromium system to recover from this on its own. Hence, other code in the GPU process detects the device has become removed (at least on the D3D11 device) and terminates the GPU process with a special error code. The browser process restarts it.
Separately from this change, we should work with the people who own that code in Chromium to make WebNN be a part of it somehow.
For this case, #1, the renderer process will see that the GPU process has been disconnected and we should resolve the "lost" promise.
2- `E_OUTOFMEMORY`
It is possible to handle `E_OUTOFMEMORY` without everything becoming hosed. _However_, when the GPU process reaches the job commit limit, the browser process is notified by the OS. Upon the browser process being notified, it terminates and restarts the GPU process. So, you have a race between the GPU process handling this gracefully and the browser process terminating it.
I think we should handle this case gracefully. In practice, the web developer will see the GPU process connection get terminated, same as the "device removed".
3- <Everything else>
For any other error, we should treat it as Chromium passing invalid parameters to the API and do a `CHECK` so that we can get crash dumps and diagnose.
Here is the [Dawn code](https://source.chromium.org/chromium/chromium/src/+/main:third_party/dawn/src/dawn/native/d3d/D3DError.cpp) which handles errors similar to what I described above.

We should `LOG(ERROR)` any failure we receive from DirectX along with the `logging::SystemErrorCodeToString` friendly string, even ones we CHECK for. This will become critical to root causing failures in the wild from web developers. Pooja has already converted these but we should convert others that might have been missed.

Reilly Grant

My concern is mainly that context loss is not an error handling mechanism. If the system is broken in a way that is out of the developer's control then we can report a context loss. If the developer has asked for something unreasonable that's an error we should throw.

In general we should not `CHECK` on OS return values. If we want to collect crash reports to diagnose unexpected errors we can use `base::debug::DumpWithoutCrashing()`.

Related details

Attention is currently required from:

Austin Sullivan
Mingming1 Xu
Rafael Cintron
ningxin hu

satisfied_requirement

unsatisfied_requirement

open

diffy

Mingming1 Xu (Gerrit)

unread,

Jun 17, 2024, 5:02:22 AM (12 days ago) Jun 17

to AyeAye, Rafael Cintron, Austin Sullivan, Bryan Bernhart, Reilly Grant, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Bryan Bernhart, Rafael Cintron, Reilly Grant and ningxin hu

Mingming1 Xu added 13 comments

File services/webnn/dml/context_impl_dml.cc

Line 225, Patchset 4: // TODO(crbug.com/41492165): generate error using context.

Bryan Bernhart . unresolved

Can we handle context lost here as well?

Reilly Grant

Are we assuming that any failure to create a buffer or graph means the context has been lost? That seems strange. I expect that context loss is a signal we specifically get from the adapter.

Bryan Bernhart

Are we assuming that any failure to create a buffer or graph means the context has been lost?
We could, like WebGPU. If we do not then the OOM skews to some, possibly a unrelated call, which does handle context lost which is hard to debug.

Reilly Grant

I don't think we should be reporting a context loss for errors like this. That's a separate concept.

Bryan Bernhart

If this returns E_OUTOFMEMORY, the context will become inoperable and effectively lost. Also, the HR could be any other error result (but unlikely) so I can see it being quite useful to coverage to one error handler.

Mingming1 Xu

This discussion should be merged into above one https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/ad47c8af_d3f27f05/: whether we should handle the context lost according to the system error code `HRESULT`.

File services/webnn/dml/graph_impl_dml.cc

Line 4619, Patchset 6: HandleWebNNContextLost(hr, context);

Reilly Grant . unresolved

@rafael....@microsoft.com, can you confirm that all graph creation failures should be considered "context lost" scenarios?

Rafael Cintron

I'm glad we're having this conversation. :-)
Here are the three errors Chromium should handle from DirectML and Direct3D:
1- `DXGI_ERROR_DEVICE_REMOVED`
"Device removed" can happen for reasons Chromium is not able to control such as a driver update, other application on the system behaving poorly or the adapter actually being physically removed.
The latter can happen if you're performing computation on the Surface Book's NVidia GPU (which is under the keyboard) and the user detaches the tablet portion from the keyboard. The Intel GPU, which is behind the monitor, is the only one left you can use.
Since removed devices are basically useless and the D3D12 device is a singleton per process, and per adapter, it is difficult for WebNN, or other individual Chromium system to recover from this on its own. Hence, other code in the GPU process detects the device has become removed (at least on the D3D11 device) and terminates the GPU process with a special error code. The browser process restarts it.
Separately from this change, we should work with the people who own that code in Chromium to make WebNN be a part of it somehow.
For this case, #1, the renderer process will see that the GPU process has been disconnected and we should resolve the "lost" promise.
2- `E_OUTOFMEMORY`
It is possible to handle `E_OUTOFMEMORY` without everything becoming hosed. _However_, when the GPU process reaches the job commit limit, the browser process is notified by the OS. Upon the browser process being notified, it terminates and restarts the GPU process. So, you have a race between the GPU process handling this gracefully and the browser process terminating it.
I think we should handle this case gracefully. In practice, the web developer will see the GPU process connection get terminated, same as the "device removed".
3- <Everything else>
For any other error, we should treat it as Chromium passing invalid parameters to the API and do a `CHECK` so that we can get crash dumps and diagnose.
Here is the [Dawn code](https://source.chromium.org/chromium/chromium/src/+/main:third_party/dawn/src/dawn/native/d3d/D3DError.cpp) which handles errors similar to what I described above.

We should `LOG(ERROR)` any failure we receive from DirectX along with the `logging::SystemErrorCodeToString` friendly string, even ones we CHECK for. This will become critical to root causing failures in the wild from web developers. Pooja has already converted these but we should convert others that might have been missed.

Reilly Grant

My concern is mainly that context loss is not an error handling mechanism. If the system is broken in a way that is out of the developer's control then we can report a context loss. If the developer has asked for something unreasonable that's an error we should throw.
In general we should not `CHECK` on OS return values. If we want to collect crash reports to diagnose unexpected errors we can use `base::debug::DumpWithoutCrashing()`.

Mingming1 Xu

@rafael....@microsoft.com should I handle all the DXGI, D3D11, D3D12 errors just like [Dawn code](https://source.chromium.org/chromium/chromium/src/+/main:third_party/dawn/src/dawn/native/d3d/D3DError.cpp) as context lost errors? Currently I only handle the `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED`, `DXGI_ERROR_DEVICE_HUNG` and `DXGI_ERROR_DEVICE_RESET`.

File services/webnn/dml/utils.cc

Line 369, Patchset 4:void HandleWebNNContextLost(HRESULT hr, const WebNNContextImpl* context) {

Bryan Bernhart . resolved

Shouldn't this be off the `WebNNContextImpl`?

Mingming1 Xu

Done

Line 374, Patchset 6: std::string message;

Reilly Grant . resolved

nit: use `std::string_view` and a `switch` statement.

Mingming1 Xu

Done

Line 377, Patchset 4: } else if (hr == DXGI_ERROR_DEVICE_REMOVED) {

Bryan Bernhart . unresolved

It would be helpful to LOG the actual reason too via GetDeviceRemovedReason().

Mingming1 Xu

choice 1: add `LOG(ERROR) << logging::SystemErrorCodeToString(hr);` for each case.
choice 2: call `ID3D12Device::GetDeviceRemovedReason` and LOG the result of it. But I am not sure which case indicates device removed, is it only the `DXGI_ERROR_DEVICE_REMOVED`?
choice 3: both 1 and 2

Any ideas? @rafael....@microsoft.com

File services/webnn/public/mojom/webnn_context_lost_capturer.mojom

Line 11, Patchset 4:interface WebNNContextLostCapturer {

Reilly Grant . resolved

Can we generalize this to `WebNNContextClient` to match the style of other Mojo interfaces where we have a `Foo` interface and a `FooClient` interface for any async events that need to be sent back to the consumer of the `Foo` interface?

Mingming1 Xu

Good idea! Now I delete this file to move the `WebNNContextClient` to `webnn_context_provider.mojom`, PTAL.

Mingming1 Xu

Done

File services/webnn/public/mojom/webnn_context_provider.mojom

Line 113, Patchset 6: // Called by the GPU process to create a message pipe for


  // `WebNNContextClient`, the `message` is the error message
  // sent to the renderer process, the renderer process will
  // override this method to handle the lost error.

Reilly Grant . resolved

```suggestion
  // Called to explain why the `WebNNContextClient` is no longer available.
```

Mingming1 Xu

Done

Line 117, Patchset 6: OnLostCaptured(string message);

Reilly Grant . resolved

To match WebGPU and better express that the context is unusable after this.
```suggestion
  OnDestroyed(string message);
```

Mingming1 Xu

Done

File services/webnn/webnn_context_impl.cc

Line 39, Patchset 4: context_provider_->OnConnectionError(this);

Bryan Bernhart . unresolved

If the context gets destroyed, isn't that considered "lost"?

Reilly Grant

If this method is being called then the renderer isn't listening anymore.

Bryan Bernhart

I suppose the renderer or `MLContext` could generate "lost" event beforehand. Not exactly certain why this was needed in WebGPU.

Mingming1 Xu

Yes, I think this method will only be called when the renderer destroys the WebNNContext pipe.

Bryan Bernhart

Dug into this more. Looks like WebGPU considers destroyed as "lost" to communicate the context/device shouldn't be kept alive, by the web developer, if some transient (but still fatal) failure occurred. A transient error could be OOM, for example.

Mingming1 Xu

Can we resolve this discussion? @bryan.b...@intel.com

Line 95, Patchset 6: client_remote_->OnLostCaptured(message);

Reilly Grant . resolved

After sending this message we should call `context_provider_->OnConnectionError(this)` so that the context provider destroys this object.

Mingming1 Xu

Yes, I intended to trigger the `WebNNContextImpl::OnConnectionError` by the `context_remote_.reset();` in blink side. I am OK to explicitly call it here.

File services/webnn/webnn_context_provider_impl.cc

Line 14, Patchset 6:#include "services/webnn/public/mojom/webnn_context_lost_capturer.mojom.h"

Reilly Grant . resolved

This file no longer exists.

Mingming1 Xu

Done

File services/webnn/webnn_graph_impl_unittest.cc

Line 108, Patchset 4: FakeWebNNContextImpl(mojo::PendingReceiver<mojom::WebNNContext> receiver,

Bryan Bernhart . unresolved

Looks like we have no test cases which generate a context lost (and ensure delivery).
What's our plan there?

Mingming1 Xu

Yes, agree, I should add tests for that, will do in following patchset.

File third_party/blink/renderer/modules/ml/ml_context.h

Line 154, Patchset 4: // Reset the remote of `WebNNContext` and the receiver of


  // `WebNNContextLostCapturer`, and then resolve the `lost_property_` which
  // indicates that the WebNN context is lost. It will be called in two cases:
  // 1. The `WebNNContext` remote is cut off from its receiver, for example,
  // when the WebNN Service is crashed.
  // 2. We need to release the WebNNContext in the WebNN Service when the the

// WebNN Service captured a WebNN context lost error.

Reilly Grant . resolved

```suggestion
  // Closes the `remote_context_` and `context_lost_capturer_receiver_` pipes
  // because the context has been lost.
```

Mingming1 Xu

Done. I move these comments above the `OnWebNNContextLostCaptured` because now `OnWebNNContextDisconnected` will call it. Also rename them as `OnLostCaptured` and `OnDisconnected` to simplify.

Mingming1 Xu

Done

File-level comment, Patchset 8 (Latest):

Related details

Attention is currently required from:

Austin Sullivan

Bryan Bernhart
Rafael Cintron
Reilly Grant

ningxin hu

Submit Requirements:

Code-Coverage
Code-Review
No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

Reilly Grant (Gerrit)

unread,

Jun 17, 2024, 5:51:23 PM (12 days ago) Jun 17

to Mingming1 Xu, Reilly Grant, AyeAye, Rafael Cintron, Austin Sullivan, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Bryan Bernhart, Mingming1 Xu, Rafael Cintron and ningxin hu

Reilly Grant voted and added 4 comments

Votes added by Reilly Grant

Code-Review

+1

4 comments

Patchset-level comments

Reilly Grant . resolved

LGTM with nits.

I think I'm satisfied that the situations within the DirectML backend where we treat an error as context loss represent errors which are not the developer's responsibility and are instead situations in which the system has gotten in the an unrecoverable state (e.g. the GPU process is nearly at its commit limit) and so the only reasonable course of action is to clean everything up and start again.

We will have to see how this works out in real-world usage.

File services/webnn/webnn_graph_impl.h

Line 69, Patchset 8 (Latest): raw_ptr<WebNNContextImpl> context_;

Reilly Grant . unresolved

Why this change?

File third_party/blink/renderer/modules/ml/ml_context.cc

Line 183, Patchset 8 (Latest): "Invalid script state.")));

Reilly Grant . unresolved

```suggestion
            "Context is lost.")));
```

Line 248, Patchset 8 (Latest):

  // Remote context gets automatically unbound when the execution context
  // destructs.
  if (!context_remote_.is_bound()) {
    return;
  }

Reilly Grant . unresolved

This check should move into `MLContext::createBuffer()` and throw a "Context is lost." error. We can then `CHECK()` here that `context_remote_` is bound.

Line 371, Patchset 8 (Latest):

Related details

Attention is currently required from:

Austin Sullivan
Bryan Bernhart
Mingming1 Xu
Rafael Cintron
ningxin hu

Submit Requirements:

Code-Coverage
Code-Review
No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

Rafael Cintron (Gerrit)

unread,

Jun 17, 2024, 6:45:46 PM (12 days ago) Jun 17

to Mingming1 Xu, Reilly Grant, AyeAye, Austin Sullivan, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Bryan Bernhart, Mingming1 Xu and ningxin hu

Rafael Cintron added 4 comments

File services/webnn/dml/graph_impl_dml.cc

Line 4610, Patchset 8 (Latest): DLOG(ERROR) << error_message << " " << logging::SystemErrorCodeToString(hr);

Rafael Cintron . unresolved

I believe that by the time we get here, we should have already `LOG`ed all of the errors encountered. If so, we may not need this particular `DLOG`.

But, I'm OK with leaving it in and converting to `LOG`. We should prepend the incoming error with a "[WebNN Graph Build Error]" string so people looking at the log know where the error came from.

OK to be a separate change and also fixup the other "HandleX" methods that follow a similar pattern.

Line 4619, Patchset 6: HandleWebNNContextLost(hr, context);

Reilly Grant . unresolved

@rafael....@microsoft.com, can you confirm that all graph creation failures should be considered "context lost" scenarios?

Rafael Cintron

I'm glad we're having this conversation. :-)
Here are the three errors Chromium should handle from DirectML and Direct3D:
1- `DXGI_ERROR_DEVICE_REMOVED`
"Device removed" can happen for reasons Chromium is not able to control such as a driver update, other application on the system behaving poorly or the adapter actually being physically removed.
The latter can happen if you're performing computation on the Surface Book's NVidia GPU (which is under the keyboard) and the user detaches the tablet portion from the keyboard. The Intel GPU, which is behind the monitor, is the only one left you can use.
Since removed devices are basically useless and the D3D12 device is a singleton per process, and per adapter, it is difficult for WebNN, or other individual Chromium system to recover from this on its own. Hence, other code in the GPU process detects the device has become removed (at least on the D3D11 device) and terminates the GPU process with a special error code. The browser process restarts it.
Separately from this change, we should work with the people who own that code in Chromium to make WebNN be a part of it somehow.
For this case, #1, the renderer process will see that the GPU process has been disconnected and we should resolve the "lost" promise.
2- `E_OUTOFMEMORY`
It is possible to handle `E_OUTOFMEMORY` without everything becoming hosed. _However_, when the GPU process reaches the job commit limit, the browser process is notified by the OS. Upon the browser process being notified, it terminates and restarts the GPU process. So, you have a race between the GPU process handling this gracefully and the browser process terminating it.
I think we should handle this case gracefully. In practice, the web developer will see the GPU process connection get terminated, same as the "device removed".
3- <Everything else>
For any other error, we should treat it as Chromium passing invalid parameters to the API and do a `CHECK` so that we can get crash dumps and diagnose.
Here is the [Dawn code](https://source.chromium.org/chromium/chromium/src/+/main:third_party/dawn/src/dawn/native/d3d/D3DError.cpp) which handles errors similar to what I described above.

We should `LOG(ERROR)` any failure we receive from DirectX along with the `logging::SystemErrorCodeToString` friendly string, even ones we CHECK for. This will become critical to root causing failures in the wild from web developers. Pooja has already converted these but we should convert others that might have been missed.

Reilly Grant

My concern is mainly that context loss is not an error handling mechanism. If the system is broken in a way that is out of the developer's control then we can report a context loss. If the developer has asked for something unreasonable that's an error we should throw.
In general we should not `CHECK` on OS return values. If we want to collect crash reports to diagnose unexpected errors we can use `base::debug::DumpWithoutCrashing()`.

Mingming1 Xu

@rafael....@microsoft.com should I handle all the DXGI, D3D11, D3D12 errors just like [Dawn code](https://source.chromium.org/chromium/chromium/src/+/main:third_party/dawn/src/dawn/native/d3d/D3DError.cpp) as context lost errors? Currently I only handle the `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED`, `DXGI_ERROR_DEVICE_HUNG` and `DXGI_ERROR_DEVICE_RESET`.

Rafael Cintron

In general we should not CHECK on OS return values. If we want to collect crash reports to diagnose unexpected errors we can use base::debug::DumpWithoutCrashing().

I disagree with this thinking, in general. There are many instances in Chromium today where we CHECK that OS APIs return success values or, in limited cases, known errors. The alternative is to gracefully handle (and, of course, have tests for) all of the combinations of failures you can get.

In rendering code, an incorrectly handled error case in a stateful API can result in very difficult to debug "my browser window is black" bugs. Correlating these bug reports with `DumpWithoutCrashing` metrics is not straightforward. Aside from being willing to upload a full memory dump of the browser process to a Chromium engineer to investigate, there's not much an end user can do to help themselves (and you) root cause because, well, their browser is black.

We do need to have provisions for gracefully handling things when Chromium is not at fault. But, in my opinion, errors like `E_INVALIDARG` should always cause a `CHECK`. There is a general trend in Chromium to fail fast in more instances where previously, we would have limped along.

My concern is mainly that context loss is not an error handling mechanism. If the system is broken in a way that is out of the developer's control then we can report a context loss. If the developer has asked for something unreasonable that's an error we should throw.

I share this concern and I like how you and others have championed synchronous errors getting returned to web developers at graph building time rather than compilation or inferencing time. WebGPU/WebGL do not quite have this luxury and, in the case of WebGPU, have an "ErrorScope" mechanism to help you make sense of errors asynchronously.

For WebNN, the only case for "the developer has asked for something unreasonable" that comes to mind is `MLBuffer` creation. We can continue to handle this with "content lost" (similar to WebGL/WebGPU) or we can go further and make buffer creation be asynchronous. I go back and forth on it myself. Might be best to tackle asynchronous buffer creation as a separate discussion thread.

 rafael....@microsoft.com should I handle all the DXGI, D3D11, D3D12 errors just like Dawn code as context lost errors? Currently I only handle the E_OUTOFMEMORY, DXGI_ERROR_DEVICE_REMOVED, DXGI_ERROR_DEVICE_HUNG and DXGI_ERROR_DEVICE_RESET.

`DEVICE_HUNG` is the application's fault. `DEVICE_RESET` is usually not.

With that in mind, let's have the final list of errors we handle gracefully be `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED` and `DXGI_ERROR_DEVICE_RESET`. All others, I think we should CHECK.

File services/webnn/dml/utils.cc

Line 370, Patchset 8 (Latest):void HandleWebNNContextLost(HRESULT hr, ContextImplDml* context) {

Rafael Cintron . unresolved

The "HandleXFailure" methods which call `HandleWebNNContextLost` have code in common.

To better unify them, I think we should structure `HandleWebNNContextLost` as follows:
1-If the `hr` is `E_OUTOFMEMORY`, set the web developer friendly error message to "out of memory", otherwise, set it to "device removed." Having an "internal error" for the non-device removed ones may give an attacker information about whether the error is exploitable but I don't feel strongly about this.
2-Call the `OnDestroyed` callback.
3-`LOG(ERROR) << "[WebNN]: " << logging::SystemErrorCodeToString(hr)`.
4-`CHECK` that the error is either `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED`, or `DXGI_ERROR_DEVICE_RESET`.

We could also have this function run the callback if we templatize it on the callback type and error type. If that's too complicated, we can have it return the developer friendly string to the caller and have the caller use it to run the callback.

WDYT?

  if (!context) {
    return;
  }

Rafael Cintron . unresolved

`context` is always non-nullptr so we should be able to remove this if statement.

Related details

Attention is currently required from:

Austin Sullivan
Bryan Bernhart
Mingming1 Xu

ningxin hu

satisfied_requirement

unsatisfied_requirement

open

diffy

Reilly Grant (Gerrit)

unread,

Jun 17, 2024, 10:04:57 PM (12 days ago) Jun 17

to Mingming1 Xu, Reilly Grant, AyeAye, Rafael Cintron, Austin Sullivan, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Bryan Bernhart, Mingming1 Xu, Rafael Cintron and ningxin hu

Reilly Grant added 1 comment

File services/webnn/dml/graph_impl_dml.cc

Reilly Grant

I'm trying to reconcile this with [our official guidance](https://chromium.googlesource.com/chromium/src/+/main/styleguide/c++/checks.md) because I generally agree that some interactions with OS APIs can be considered invariants, like `E_INVALIDARG` errors, if we're confident that the API isn't going to change out from underneath us.

Related details

Attention is currently required from:

Austin Sullivan
Bryan Bernhart
Mingming1 Xu

Rafael Cintron
ningxin hu

satisfied_requirement

unsatisfied_requirement

open

diffy

Rafael Cintron (Gerrit)

unread,

Jun 17, 2024, 10:20:43 PM (12 days ago) Jun 17

to Mingming1 Xu, Reilly Grant, AyeAye, Austin Sullivan, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Bryan Bernhart, Mingming1 Xu, Reilly Grant and ningxin hu

Rafael Cintron added 1 comment

File services/webnn/dml/graph_impl_dml.cc

Rafael Cintron

@rei...@chromium.org, which part of the official guidance were you specifically referring to?

Related details

Attention is currently required from:

Austin Sullivan
Bryan Bernhart
Mingming1 Xu

Reilly Grant
ningxin hu

satisfied_requirement

unsatisfied_requirement

open

diffy

Mingming1 Xu (Gerrit)

unread,

Jun 18, 2024, 5:18:45 AM (11 days ago) Jun 18

to Reilly Grant, AyeAye, Rafael Cintron, Austin Sullivan, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Bryan Bernhart, Rafael Cintron, Reilly Grant and ningxin hu

Mingming1 Xu added 9 comments

File services/webnn/dml/context_impl_dml.cc

Line 225, Patchset 4: // TODO(crbug.com/41492165): generate error using context.

Bryan Bernhart . unresolved

Can we handle context lost here as well?

Reilly Grant

Are we assuming that any failure to create a buffer or graph means the context has been lost? That seems strange. I expect that context loss is a signal we specifically get from the adapter.

Bryan Bernhart

Are we assuming that any failure to create a buffer or graph means the context has been lost?
We could, like WebGPU. If we do not then the OOM skews to some, possibly a unrelated call, which does handle context lost which is hard to debug.

Reilly Grant

I don't think we should be reporting a context loss for errors like this. That's a separate concept.

Bryan Bernhart

If this returns E_OUTOFMEMORY, the context will become inoperable and effectively lost. Also, the HR could be any other error result (but unlikely) so I can see it being quite useful to coverage to one error handler.

Mingming1 Xu

This discussion should be merged into above one https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/ad47c8af_d3f27f05/: whether we should handle the context lost according to the system error code `HRESULT`.

Mingming1 Xu

I am thinking maybe all the methods that return `HRESULT` inside or outside should be handled by `HandleWebNNContextLost`, if the `HRESULT` is context lost error, we should call `OnDestoryed`. WDYT? @rafael....@microsoft.com

File services/webnn/dml/graph_impl_dml.cc

Line 4610, Patchset 8: DLOG(ERROR) << error_message << " " << logging::SystemErrorCodeToString(hr);

Rafael Cintron . resolved

I believe that by the time we get here, we should have already `LOG`ed all of the errors encountered. If so, we may not need this particular `DLOG`.
But, I'm OK with leaving it in and converting to `LOG`. We should prepend the incoming error with a "[WebNN Graph Build Error]" string so people looking at the log know where the error came from.
OK to be a separate change and also fixup the other "HandleX" methods that follow a similar pattern.

Mingming1 Xu

Agree, thanks! https://issues.chromium.org/issues/347822602 tracked.

File services/webnn/dml/utils.cc

Line 370, Patchset 8:void HandleWebNNContextLost(HRESULT hr, ContextImplDml* context) {

Rafael Cintron . unresolved

The "HandleXFailure" methods which call `HandleWebNNContextLost` have code in common.
To better unify them, I think we should structure `HandleWebNNContextLost` as follows:
1-If the `hr` is `E_OUTOFMEMORY`, set the web developer friendly error message to "out of memory", otherwise, set it to "device removed." Having an "internal error" for the non-device removed ones may give an attacker information about whether the error is exploitable but I don't feel strongly about this.
2-Call the `OnDestroyed` callback.
3-`LOG(ERROR) << "[WebNN]: " << logging::SystemErrorCodeToString(hr)`.
4-`CHECK` that the error is either `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED`, or `DXGI_ERROR_DEVICE_RESET`.
We could also have this function run the callback if we templatize it on the callback type and error type. If that's too complicated, we can have it return the developer friendly string to the caller and have the caller use it to run the callback.
WDYT?

Mingming1 Xu

I am not clear for the step 1,3,4, please help me understand.
Currently I handled `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED`,`DXGI_ERROR_DEVICE_RESET` three ones to call `OnDestroyed`, otherwise I directly return and won't call `OnDestroyed`.

We could also have this function run the callback if we templatize it on the callback type and error type. If that's too complicated, we can have it return the developer friendly string to the caller and have the caller use it to run the callback.

The `OnDestroyed` method is called to send context lost message to renderer, I suppose we needn't LOG(ERROR) the context lost message again and let the caller run the callback (`CreateGraphCallback` and `ComputeCallback`)? As current code base, I only want to get the context lost message, and then call `OnDestroyed` with the message in `HandleWebNNContextLost`.

Line 371, Patchset 8:

  if (!context) {
    return;
  }

Rafael Cintron . resolved

`context` is always non-nullptr so we should be able to remove this if statement.

Mingming1 Xu

Done

Line 377, Patchset 4: } else if (hr == DXGI_ERROR_DEVICE_REMOVED) {

Bryan Bernhart . unresolved

It would be helpful to LOG the actual reason too via GetDeviceRemovedReason().

Mingming1 Xu

choice 1: add `LOG(ERROR) << logging::SystemErrorCodeToString(hr);` for each case.
choice 2: call `ID3D12Device::GetDeviceRemovedReason` and LOG the result of it. But I am not sure which case indicates device removed, is it only the `DXGI_ERROR_DEVICE_REMOVED`?
choice 3: both 1 and 2
Any ideas? @rafael....@microsoft.com

Mingming1 Xu

Think again, we may needn't call `LOG(ERROR) << logging::SystemErrorCodeToString(hr)` again here, it can be called outside, I want to leave this method only responsible for handling specific context lost error message which will be sent by `OnDestroyed` to renderer.

File services/webnn/webnn_graph_impl.h

Line 69, Patchset 8: raw_ptr<WebNNContextImpl> context_;

Reilly Grant . resolved

Why this change?

Mingming1 Xu

Done

File third_party/blink/renderer/modules/ml/ml_context.cc

Line 183, Patchset 8: "Invalid script state.")));

Reilly Grant . resolved

```suggestion
            "Context is lost.")));
```

Mingming1 Xu

Done

Line 248, Patchset 8: // Remote context gets automatically unbound when the execution context


  // destructs.
  if (!context_remote_.is_bound()) {
    return;
  }

Reilly Grant . resolved

This check should move into `MLContext::createBuffer()` and throw a "Context is lost." error. We can then `CHECK()` here that `context_remote_` is bound.

Mingming1 Xu

Done

File third_party/blink/renderer/modules/ml/ml_context_lost_info.idl

Line 11, Patchset 4:] interface MLContextLostInfo {

Reilly Grant . resolved

Use a dictionary instead of an interface since this type doesn't have any methods. I know there are a number of examples where this sort of type is defined as an interface but most of the ones I'm responsible for (such as the WebUSB transfer result types) I wish I'd defined as dictionaries instead.

Mingming1 Xu

Done. I deleted this file and remove the dictionary definition to `ml_context.idl`, PTAL.

Mingming1 Xu

Done

https://data.norfolk.gov/nominate/364194

Related details

Attention is currently required from:

Austin Sullivan
Bryan Bernhart
Rafael Cintron
Reilly Grant
ningxin hu

Submit Requirements:

Code-Coverage
Code-Review
No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

udantumexico

unread,

Jun 18, 2024, 5:20:00 AM (11 days ago) Jun 18

to Chromium-reviews, Mingming1 Xu (Gerrit), Reilly Grant, AyeAye, Rafael Cintron, Austin Sullivan, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org, change-...@chromium-review.googlesource.com

Bryan Bernhart (Gerrit)

unread,

Jun 18, 2024, 6:54:14 PM (11 days ago) Jun 18

to Mingming1 Xu, Reilly Grant, AyeAye, Rafael Cintron, Austin Sullivan, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Mingming1 Xu, Rafael Cintron, Reilly Grant and ningxin hu

Bryan Bernhart added 1 comment

File services/webnn/webnn_context_impl.cc

Line 39, Patchset 4: context_provider_->OnConnectionError(this);

Bryan Bernhart . resolved

If the context gets destroyed, isn't that considered "lost"?

Reilly Grant

If this method is being called then the renderer isn't listening anymore.

Bryan Bernhart

I suppose the renderer or `MLContext` could generate "lost" event beforehand. Not exactly certain why this was needed in WebGPU.

Mingming1 Xu

Yes, I think this method will only be called when the renderer destroys the WebNNContext pipe.

Bryan Bernhart

Dug into this more. Looks like WebGPU considers destroyed as "lost" to communicate the context/device shouldn't be kept alive, by the web developer, if some transient (but still fatal) failure occurred. A transient error could be OOM, for example.

Mingming1 Xu

Can we resolve this discussion? @bryan.b...@intel.com

Bryan Bernhart

@mingmi...@intel.com Yes, thanks for following up. Thinking some more, I don't think we need to address this until we need `MLContext.destroy()`.

[1] https://source.chromium.org/chromium/chromium/src/+/main:third_party/dawn/src/dawn/native/d3d12/DeviceD3D12.h;drc=eb9fd54dfff30384cac3f208cb1ddaf58afd639a;l=224

Related details

Attention is currently required from:

Austin Sullivan
Mingming1 Xu

satisfied_requirement

unsatisfied_requirement

open

diffy

Bryan Bernhart (Gerrit)

unread,

Jun 18, 2024, 7:06:22 PM (11 days ago) Jun 18

to Mingming1 Xu, Reilly Grant, AyeAye, Rafael Cintron, Austin Sullivan, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Mingming1 Xu, Rafael Cintron, Reilly Grant and ningxin hu

Bryan Bernhart added 1 comment

File services/webnn/dml/utils.cc

Line 377, Patchset 4: } else if (hr == DXGI_ERROR_DEVICE_REMOVED) {

Bryan Bernhart . unresolved

It would be helpful to LOG the actual reason too via GetDeviceRemovedReason().

Mingming1 Xu

choice 1: add `LOG(ERROR) << logging::SystemErrorCodeToString(hr);` for each case.
choice 2: call `ID3D12Device::GetDeviceRemovedReason` and LOG the result of it. But I am not sure which case indicates device removed, is it only the `DXGI_ERROR_DEVICE_REMOVED`?
choice 3: both 1 and 2
Any ideas? @rafael....@microsoft.com

Mingming1 Xu

Think again, we may needn't call `LOG(ERROR) << logging::SystemErrorCodeToString(hr)` again here, it can be called outside, I want to leave this method only responsible for handling specific context lost error message which will be sent by `OnDestroyed` to renderer.

Bryan Bernhart

Any HR that results in "lost" could leverage GetDeviceRemovedReason(). So at-least, HUNG, REMOVED, RESET, INTERNAL_ERROR, or NOT_CURRENTLY_AVAILABLE. We could also, perhaps more simply, always append the reason from any HR returned by it. This is what WebGPU does [1].

satisfied_requirement

unsatisfied_requirement

open

diffy

Reilly Grant (Gerrit)

unread,

Jun 18, 2024, 7:12:46 PM (11 days ago) Jun 18

to Mingming1 Xu, Reilly Grant, AyeAye, Rafael Cintron, Austin Sullivan, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Mingming1 Xu, Rafael Cintron and ningxin hu

Reilly Grant added 1 comment

File services/webnn/dml/graph_impl_dml.cc

Reilly Grant

_Bad/untrusted/changing driver, kernel API, hardware failure:_ A misbehaving GPU driver may cause us to be unable to proceed. This is not an invariant failure. On platforms where we are wary that a kernel API may change without sufficient prior notice we should not CHECK() its result as we expect the rug to be pulled from under our feet. In the case of hardware failure we should not for instance CHECK() that a write succeeded.

Related details

Attention is currently required from:

Austin Sullivan
Mingming1 Xu
Rafael Cintron

ningxin hu

satisfied_requirement

unsatisfied_requirement

open

diffy

Mingming1 Xu (Gerrit)

unread,

Jun 18, 2024, 9:28:47 PM (11 days ago) Jun 18

to Reilly Grant, AyeAye, Rafael Cintron, Austin Sullivan, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Bryan Bernhart, Rafael Cintron and ningxin hu

Mingming1 Xu added 1 comment

File services/webnn/dml/utils.cc

Line 377, Patchset 4: } else if (hr == DXGI_ERROR_DEVICE_REMOVED) {

Bryan Bernhart . unresolved

It would be helpful to LOG the actual reason too via GetDeviceRemovedReason().

Mingming1 Xu

choice 1: add `LOG(ERROR) << logging::SystemErrorCodeToString(hr);` for each case.
choice 2: call `ID3D12Device::GetDeviceRemovedReason` and LOG the result of it. But I am not sure which case indicates device removed, is it only the `DXGI_ERROR_DEVICE_REMOVED`?
choice 3: both 1 and 2
Any ideas? @rafael....@microsoft.com

Mingming1 Xu

Think again, we may needn't call `LOG(ERROR) << logging::SystemErrorCodeToString(hr)` again here, it can be called outside, I want to leave this method only responsible for handling specific context lost error message which will be sent by `OnDestroyed` to renderer.

Bryan Bernhart

Any HR that results in "lost" could leverage GetDeviceRemovedReason(). So at-least, HUNG, REMOVED, RESET, INTERNAL_ERROR, or NOT_CURRENTLY_AVAILABLE. We could also, perhaps more simply, always append the reason from any HR returned by it. This is what WebGPU does [1].
[1] https://source.chromium.org/chromium/chromium/src/+/main:third_party/dawn/src/dawn/native/d3d12/DeviceD3D12.h;drc=eb9fd54dfff30384cac3f208cb1ddaf58afd639a;l=224

Mingming1 Xu

So we don't need the `switch-case` here to get the contest lost message, we can directly call `ID3D12Device::GetDeviceRemovedReason` to get the message and send it to renderer by `OnDestroyed`. That sounds good!
Two questions:
1. if `ID3D12Device::GetDeviceRemovedReason` handled all the lost types including `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED` and `DXGI_ERROR_DEVICE_RESET` (@rafael....@microsoft.com mentioned)?
2. I find there is another API called `IDMLDevice::GetDeviceRemovedReason`, what's the difference and should we use this?

Related details

Attention is currently required from:

Austin Sullivan
Bryan Bernhart
Rafael Cintron
ningxin hu

satisfied_requirement

unsatisfied_requirement

open

diffy

Mingming1 Xu (Gerrit)

unread,

Jun 19, 2024, 3:46:39 AM (11 days ago) Jun 19

to Reilly Grant, AyeAye, Rafael Cintron, Austin Sullivan, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Bryan Bernhart, Rafael Cintron and ningxin hu

Mingming1 Xu added 1 comment

File services/webnn/webnn_graph_impl_unittest.cc

Line 108, Patchset 4: FakeWebNNContextImpl(mojo::PendingReceiver<mojom::WebNNContext> receiver,

Bryan Bernhart . unresolved

Looks like we have no test cases which generate a context lost (and ensure delivery).
What's our plan there?

Mingming1 Xu

Yes, agree, I should add tests for that, will do in following patchset.

Mingming1 Xu

Added in patchset 10, PTAL.

Related details

Attention is currently required from:

Austin Sullivan
Bryan Bernhart
Rafael Cintron
ningxin hu

Submit Requirements:

Code-Coverage
Code-Review
No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

Rafael Cintron (Gerrit)

unread,

Jun 19, 2024, 7:02:01 PM (10 days ago) Jun 19

to Mingming1 Xu, Reilly Grant, AyeAye, Austin Sullivan, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Bryan Bernhart, Mingming1 Xu, Reilly Grant and ningxin hu

Rafael Cintron added 3 comments

File services/webnn/dml/graph_impl_dml.cc

Rafael Cintron

@rei...@chromium.org, thank you for the quote.

I agree with the guidance on not `CHECK`-ing for "hardware failure" or "changing driver". These map, roughly, to `DXGI_ERROR_DEVICE_REMOVED`/`DXGI_ERROR_DEVICE_RESET` and I think we should definitely handle that gracefully by `LOG(ERROR)`-ing and arranging for orderly GPU process shutdown like the rest of the code does.

Windows does not really have the analogous of "... kernel API may change without sufficient prior notice...". Windows components tend to be stable in terms the input they receive and error out on. If anything, my experience has been they're overly generous about continuing to accept bad input that was accepted by a previous version of the API.

File services/webnn/dml/utils.cc

Line 370, Patchset 8:void HandleWebNNContextLost(HRESULT hr, ContextImplDml* context) {

Rafael Cintron . unresolved

The "HandleXFailure" methods which call `HandleWebNNContextLost` have code in common.
To better unify them, I think we should structure `HandleWebNNContextLost` as follows:
1-If the `hr` is `E_OUTOFMEMORY`, set the web developer friendly error message to "out of memory", otherwise, set it to "device removed." Having an "internal error" for the non-device removed ones may give an attacker information about whether the error is exploitable but I don't feel strongly about this.
2-Call the `OnDestroyed` callback.
3-`LOG(ERROR) << "[WebNN]: " << logging::SystemErrorCodeToString(hr)`.
4-`CHECK` that the error is either `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED`, or `DXGI_ERROR_DEVICE_RESET`.
We could also have this function run the callback if we templatize it on the callback type and error type. If that's too complicated, we can have it return the developer friendly string to the caller and have the caller use it to run the callback.
WDYT?

Mingming1 Xu

I am not clear for the step 1,3,4, please help me understand.
Currently I handled `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED`,`DXGI_ERROR_DEVICE_RESET` three ones to call `OnDestroyed`, otherwise I directly return and won't call `OnDestroyed`.
We could also have this function run the callback if we templatize it on the callback type and error type. If that's too complicated, we can have it return the developer friendly string to the caller and have the caller use it to run the callback.
The `OnDestroyed` method is called to send context lost message to renderer, I suppose we needn't LOG(ERROR) the context lost message again and let the caller run the callback (`CreateGraphCallback` and `ComputeCallback`)? As current code base, I only want to get the context lost message, and then call `OnDestroyed` with the message in `HandleWebNNContextLost`.

Rafael Cintron

Currently I handled E_OUTOFMEMORY, DXGI_ERROR_DEVICE_REMOVED, DXGI_ERROR_DEVICE_RESET three ones to call OnDestroyed, otherwise I directly return and won't call OnDestroyed.

By the time we get here, we know we have a failure HRESULT, correct? If so, we should always call `OnDestroyed` so that the web developer can realize that no more WebNN will happen with the context. The difference is whether the error warrants a `CHECK` or graceful termination of the GPU process.

The `OnDestroye`d method is called to send context lost message to renderer, I suppose we needn't `LOG(ERROR)` the context lost message again and let the caller run the callback (`CreateGraphCallback` and `ComputeCallback`)? As current code base, I only want to get the context lost message, and then call OnDestroyed with the message in HandleWebNNContextLost.

The `LOG(ERROR)` gives the about:gpu log additional information about why the error occurred that we are not providing to the renderer process. Also, we can't count on the web developer reporting the error to the end user in a way we can easily diagnose. Having the error in the about:gpu log means we, the WebNN implementation, can control when and what we log, not the web developer.

Line 377, Patchset 4: } else if (hr == DXGI_ERROR_DEVICE_REMOVED) {

Bryan Bernhart . unresolved

It would be helpful to LOG the actual reason too via GetDeviceRemovedReason().

Mingming1 Xu

choice 1: add `LOG(ERROR) << logging::SystemErrorCodeToString(hr);` for each case.
choice 2: call `ID3D12Device::GetDeviceRemovedReason` and LOG the result of it. But I am not sure which case indicates device removed, is it only the `DXGI_ERROR_DEVICE_REMOVED`?
choice 3: both 1 and 2
Any ideas? @rafael....@microsoft.com

Mingming1 Xu

Think again, we may needn't call `LOG(ERROR) << logging::SystemErrorCodeToString(hr)` again here, it can be called outside, I want to leave this method only responsible for handling specific context lost error message which will be sent by `OnDestroyed` to renderer.

Bryan Bernhart

Any HR that results in "lost" could leverage GetDeviceRemovedReason(). So at-least, HUNG, REMOVED, RESET, INTERNAL_ERROR, or NOT_CURRENTLY_AVAILABLE. We could also, perhaps more simply, always append the reason from any HR returned by it. This is what WebGPU does [1].
[1] https://source.chromium.org/chromium/chromium/src/+/main:third_party/dawn/src/dawn/native/d3d12/DeviceD3D12.h;drc=eb9fd54dfff30384cac3f208cb1ddaf58afd639a;l=224

Mingming1 Xu

So we don't need the `switch-case` here to get the contest lost message, we can directly call `ID3D12Device::GetDeviceRemovedReason` to get the message and send it to renderer by `OnDestroyed`. That sounds good!
Two questions:
1. if `ID3D12Device::GetDeviceRemovedReason` handled all the lost types including `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED` and `DXGI_ERROR_DEVICE_RESET` (@rafael....@microsoft.com mentioned)?
2. I find there is another API called `IDMLDevice::GetDeviceRemovedReason`, what's the difference and should we use this?

Rafael Cintron

1) if `ID3D12Device::GetDeviceRemovedReason` handled all the lost types including E_OUTOFMEMORY, DXGI_ERROR_DEVICE_REMOVED and DXGI_ERROR_DEVICE_RESET (@rafael....@microsoft.com mentioned)?

All DML entrypoints which return an HRESULT first call `ID3D12Device::GetDeviceRemovedReason` and return `DXGI_ERROR_DEVICE_REMOVED` on failure. To be expected since DML is heavily tied to D3D12 for command execution and resource allocation/management.

2) I find there is another API called `IDMLDevice::GetDeviceRemovedReason` what's the difference and should we use this?

In the current DML implementation, `IDMLDevice::GetDeviceRemovedReason` calls `ID3D12Device::GetDeviceRemovedReason`. So in today's implementation, DML cannot become removed when D3D12 is still unremoved or vice versa.

I think we should check the return value of all HRESULT-returning DirectX methods and `LOG(ERROR)` right at the point of failure. This will help us better diagnose problems in the wild. We already largely do this with the macros in `services/webnn/dml/error.h`.

I can't think of a good reason to also use the "GetDeviceRemovedReason" APIs, except maybe as a sanity check in special codepaths.

Related details

Attention is currently required from:

Austin Sullivan
Bryan Bernhart
Mingming1 Xu
Reilly Grant
ningxin hu

Submit Requirements:

Code-Coverage
Code-Review
No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

Mingming1 Xu (Gerrit)

unread,

Jun 19, 2024, 10:37:04 PM (10 days ago) Jun 19

to Reilly Grant, AyeAye, Rafael Cintron, Austin Sullivan, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Bryan Bernhart, Rafael Cintron, Reilly Grant and ningxin hu

Mingming1 Xu added 3 comments

File services/webnn/dml/context_impl_dml.cc

Line 225, Patchset 4: // TODO(crbug.com/41492165): generate error using context.

Bryan Bernhart . unresolved

Can we handle context lost here as well?

Reilly Grant

Are we assuming that any failure to create a buffer or graph means the context has been lost? That seems strange. I expect that context loss is a signal we specifically get from the adapter.

Bryan Bernhart

Are we assuming that any failure to create a buffer or graph means the context has been lost?
We could, like WebGPU. If we do not then the OOM skews to some, possibly a unrelated call, which does handle context lost which is hard to debug.

Reilly Grant

I don't think we should be reporting a context loss for errors like this. That's a separate concept.

Bryan Bernhart

If this returns E_OUTOFMEMORY, the context will become inoperable and effectively lost. Also, the HR could be any other error result (but unlikely) so I can see it being quite useful to coverage to one error handler.

Mingming1 Xu

This discussion should be merged into above one https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/ad47c8af_d3f27f05/: whether we should handle the context lost according to the system error code `HRESULT`.

Mingming1 Xu

I am thinking maybe all the methods that return `HRESULT` inside or outside should be handled by `HandleWebNNContextLost`, if the `HRESULT` is context lost error, we should call `OnDestoryed`. WDYT? @rafael....@microsoft.com

Mingming1 Xu

Updated, PTAL.

File services/webnn/dml/utils.cc

Line 370, Patchset 8:void HandleWebNNContextLost(HRESULT hr, ContextImplDml* context) {

Rafael Cintron . unresolved

The "HandleXFailure" methods which call `HandleWebNNContextLost` have code in common.
To better unify them, I think we should structure `HandleWebNNContextLost` as follows:
1-If the `hr` is `E_OUTOFMEMORY`, set the web developer friendly error message to "out of memory", otherwise, set it to "device removed." Having an "internal error" for the non-device removed ones may give an attacker information about whether the error is exploitable but I don't feel strongly about this.
2-Call the `OnDestroyed` callback.
3-`LOG(ERROR) << "[WebNN]: " << logging::SystemErrorCodeToString(hr)`.
4-`CHECK` that the error is either `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED`, or `DXGI_ERROR_DEVICE_RESET`.
We could also have this function run the callback if we templatize it on the callback type and error type. If that's too complicated, we can have it return the developer friendly string to the caller and have the caller use it to run the callback.
WDYT?

Mingming1 Xu

I am not clear for the step 1,3,4, please help me understand.
Currently I handled `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED`,`DXGI_ERROR_DEVICE_RESET` three ones to call `OnDestroyed`, otherwise I directly return and won't call `OnDestroyed`.
We could also have this function run the callback if we templatize it on the callback type and error type. If that's too complicated, we can have it return the developer friendly string to the caller and have the caller use it to run the callback.
The `OnDestroyed` method is called to send context lost message to renderer, I suppose we needn't LOG(ERROR) the context lost message again and let the caller run the callback (`CreateGraphCallback` and `ComputeCallback`)? As current code base, I only want to get the context lost message, and then call `OnDestroyed` with the message in `HandleWebNNContextLost`.

Rafael Cintron

Currently I handled E_OUTOFMEMORY, DXGI_ERROR_DEVICE_REMOVED, DXGI_ERROR_DEVICE_RESET three ones to call OnDestroyed, otherwise I directly return and won't call OnDestroyed.
By the time we get here, we know we have a failure HRESULT, correct? If so, we should always call `OnDestroyed` so that the web developer can realize that no more WebNN will happen with the context. The difference is whether the error warrants a `CHECK` or graceful termination of the GPU process.
The `OnDestroye`d method is called to send context lost message to renderer, I suppose we needn't `LOG(ERROR)` the context lost message again and let the caller run the callback (`CreateGraphCallback` and `ComputeCallback`)? As current code base, I only want to get the context lost message, and then call OnDestroyed with the message in HandleWebNNContextLost.
The `LOG(ERROR)` gives the about:gpu log additional information about why the error occurred that we are not providing to the renderer process. Also, we can't count on the web developer reporting the error to the end user in a way we can easily diagnose. Having the error in the about:gpu log means we, the WebNN implementation, can control when and what we log, not the web developer.

Mingming1 Xu

Thanks! Please see my updates.

Mingming1 Xu

I think we should check the return value of all HRESULT-returning DirectX methods and LOG(ERROR) right at the point of failure. This will help us better diagnose problems in the wild. We already largely do this with the macros in services/webnn/dml/error.h.

Agree, sorry to be unclear, I just mean I don't want to LOG(ERROR) the `logging::SystemErrorCodeToString(hr)` twice. Now I only LOG it in `HandleWebNNContextLost` as you suggested.

Related details

Attention is currently required from:

Austin Sullivan
Bryan Bernhart
Rafael Cintron
Reilly Grant
ningxin hu

Submit Requirements:

Code-Coverage
Code-Review
No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

Reilly Grant (Gerrit)

unread,

Jun 20, 2024, 4:08:03 PM (9 days ago) Jun 20

to Mingming1 Xu, Reilly Grant, AyeAye, Rafael Cintron, Austin Sullivan, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Bryan Bernhart, Mingming1 Xu, Rafael Cintron and ningxin hu

Reilly Grant added 1 comment

File services/webnn/dml/graph_impl_dml.cc

Reilly Grant

For DirectML, the question is how transparently driver quirks are exposed to API consumers. I trust Windows components more than vendor-specific drivers.

Line 389, Patchset 12 (Latest):

Related details

Attention is currently required from:

Austin Sullivan
Bryan Bernhart
Mingming1 Xu
Rafael Cintron
ningxin hu

satisfied_requirement

unsatisfied_requirement

open

diffy

Rafael Cintron (Gerrit)

unread,

Jun 20, 2024, 8:24:23 PM (9 days ago) Jun 20

to Mingming1 Xu, Reilly Grant, AyeAye, Austin Sullivan, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Austin Sullivan, Bryan Bernhart, Mingming1 Xu, Reilly Grant and ningxin hu

Rafael Cintron added 4 comments

File services/webnn/dml/graph_impl_dml.cc

Rafael Cintron

If history is any indication, I expect we will encounter driver/hardware quirks for incorrect output. I do not expect rogue, new errors to be returned straight from the driver to the application. The reason DML and D3D runtimes exist is to shield applications from this.

Would it make you feel more comfortable if we only `CHECK`-ed certain errors and gracefully handled everything else by declaring "context lost" and emitting a LOG + histogram? With a histogram, we can track the values we get over time to see how bad things are in practice.

File services/webnn/dml/utils.cc

Line 370, Patchset 8:void HandleWebNNContextLost(HRESULT hr, ContextImplDml* context) {

Rafael Cintron . resolved

The "HandleXFailure" methods which call `HandleWebNNContextLost` have code in common.
To better unify them, I think we should structure `HandleWebNNContextLost` as follows:
1-If the `hr` is `E_OUTOFMEMORY`, set the web developer friendly error message to "out of memory", otherwise, set it to "device removed." Having an "internal error" for the non-device removed ones may give an attacker information about whether the error is exploitable but I don't feel strongly about this.
2-Call the `OnDestroyed` callback.
3-`LOG(ERROR) << "[WebNN]: " << logging::SystemErrorCodeToString(hr)`.
4-`CHECK` that the error is either `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED`, or `DXGI_ERROR_DEVICE_RESET`.
We could also have this function run the callback if we templatize it on the callback type and error type. If that's too complicated, we can have it return the developer friendly string to the caller and have the caller use it to run the callback.
WDYT?

Mingming1 Xu

I am not clear for the step 1,3,4, please help me understand.
Currently I handled `E_OUTOFMEMORY`, `DXGI_ERROR_DEVICE_REMOVED`,`DXGI_ERROR_DEVICE_RESET` three ones to call `OnDestroyed`, otherwise I directly return and won't call `OnDestroyed`.
We could also have this function run the callback if we templatize it on the callback type and error type. If that's too complicated, we can have it return the developer friendly string to the caller and have the caller use it to run the callback.
The `OnDestroyed` method is called to send context lost message to renderer, I suppose we needn't LOG(ERROR) the context lost message again and let the caller run the callback (`CreateGraphCallback` and `ComputeCallback`)? As current code base, I only want to get the context lost message, and then call `OnDestroyed` with the message in `HandleWebNNContextLost`.

Rafael Cintron

Currently I handled E_OUTOFMEMORY, DXGI_ERROR_DEVICE_REMOVED, DXGI_ERROR_DEVICE_RESET three ones to call OnDestroyed, otherwise I directly return and won't call OnDestroyed.
By the time we get here, we know we have a failure HRESULT, correct? If so, we should always call `OnDestroyed` so that the web developer can realize that no more WebNN will happen with the context. The difference is whether the error warrants a `CHECK` or graceful termination of the GPU process.
The `OnDestroye`d method is called to send context lost message to renderer, I suppose we needn't `LOG(ERROR)` the context lost message again and let the caller run the callback (`CreateGraphCallback` and `ComputeCallback`)? As current code base, I only want to get the context lost message, and then call OnDestroyed with the message in HandleWebNNContextLost.
The `LOG(ERROR)` gives the about:gpu log additional information about why the error occurred that we are not providing to the renderer process. Also, we can't count on the web developer reporting the error to the end user in a way we can easily diagnose. Having the error in the about:gpu log means we, the WebNN implementation, can control when and what we log, not the web developer.

Mingming1 Xu

Thanks! Please see my updates.

Rafael Cintron

Acknowledged

  CHECK(hr == E_OUTOFMEMORY || hr == DXGI_ERROR_DEVICE_REMOVED ||
        hr == DXGI_ERROR_DEVICE_REMOVED);

Rafael Cintron . unresolved

The last two statements are the same.

Line 391, Patchset 12 (Latest):}

Rafael Cintron . unresolved

@mingmi...@intel.com, apologies for misleading. What you have here so far is good.

If the HRESULT is a failure HRESULT, please have the browser additionally call `GetDeviceRemovedReason` and `LOG(ERROR)` that as well with the string prefix "[WebNN] Device Removed Reason: ". `GetDeviceRemovedReason` returns more information about the error than just the generic device removed error.

File-level comment, Patchset 12 (Latest):

Related details

Attention is currently required from:

Austin Sullivan
Bryan Bernhart
Mingming1 Xu

Reilly Grant
ningxin hu

satisfied_requirement

unsatisfied_requirement

open

diffy

Austin Sullivan (Gerrit)

unread,

Jun 21, 2024, 11:25:56 AM (8 days ago) Jun 21

to Mingming1 Xu, Reilly Grant, AyeAye, Rafael Cintron, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Bryan Bernhart, Mingming1 Xu, Reilly Grant and ningxin hu

Austin Sullivan added 1 comment

Patchset-level comments

Austin Sullivan . resolved

moving myself to CC since this CL already has overlapping reviewers!

Related details

Attention is currently required from:

Bryan Bernhart
Mingming1 Xu
Reilly Grant
ningxin hu

Submit Requirements:

Code-Coverage
Code-Review
No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

Bryan Bernhart (Gerrit)

unread,

Jun 21, 2024, 3:42:54 PM (8 days ago) Jun 21

to Mingming1 Xu, Austin Sullivan, Reilly Grant, AyeAye, Rafael Cintron, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Mingming1 Xu, Reilly Grant and ningxin hu

Bryan Bernhart added 1 comment

File services/webnn/webnn_graph_impl_unittest.cc

Line 108, Patchset 4: FakeWebNNContextImpl(mojo::PendingReceiver<mojom::WebNNContext> receiver,

Bryan Bernhart . unresolved

Looks like we have no test cases which generate a context lost (and ensure delivery).
What's our plan there?

Mingming1 Xu

Yes, agree, I should add tests for that, will do in following patchset.

Mingming1 Xu

Added in patchset 10, PTAL.

Bryan Bernhart

What we added is good but I think we'll need to go further and add a CTS. To do that, there needs to be a way to force a context lost from script w/o relying on the OS/driver generating a "lost" HRESULT.

Introducing `MLContext.destroy()` which also loses the context could achieve that. But if we treat destroy as lost, the web developer could mistakenly re-create the context (when in fact, it should remain destroyed). `MLContextLostInfo` will probably need a field to the know the source of the message to avoid that. WDYT?

Related details

Attention is currently required from:

satisfied_requirement

unsatisfied_requirement

open

diffy

Rafael Cintron (Gerrit)

unread,

Jun 21, 2024, 4:41:46 PM (8 days ago) Jun 21

to Mingming1 Xu, Austin Sullivan, Reilly Grant, AyeAye, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Bryan Bernhart, Mingming1 Xu, Reilly Grant and ningxin hu

Rafael Cintron added 1 comment

File services/webnn/webnn_graph_impl_unittest.cc

Line 108, Patchset 4: FakeWebNNContextImpl(mojo::PendingReceiver<mojom::WebNNContext> receiver,

Bryan Bernhart . unresolved

Looks like we have no test cases which generate a context lost (and ensure delivery).
What's our plan there?

Mingming1 Xu

Yes, agree, I should add tests for that, will do in following patchset.

Mingming1 Xu

Added in patchset 10, PTAL.

Bryan Bernhart

What we added is good but I think we'll need to go further and add a CTS. To do that, there needs to be a way to force a context lost from script w/o relying on the OS/driver generating a "lost" HRESULT.
Introducing `MLContext.destroy()` which also loses the context could achieve that. But if we treat destroy as lost, the web developer could mistakenly re-create the context (when in fact, it should remain destroyed). `MLContextLostInfo` will probably need a field to the know the source of the message to avoid that. WDYT?

Rafael Cintron

Here are the [WebGPU device lost tests](https://gpuweb.github.io/cts/standalone/?q=webgpu:api,operation,device,lost:*). They're implemented with a `destroy` method on the device, which I think we should also have in WebNN.

The destroy method can come as a separate change.

Related details

Attention is currently required from:

Bryan Bernhart

satisfied_requirement

unsatisfied_requirement

open

diffy

Reilly Grant (Gerrit)

unread,

Jun 21, 2024, 5:18:07 PM (8 days ago) Jun 21

to Mingming1 Xu, Austin Sullivan, Reilly Grant, AyeAye, Rafael Cintron, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Bryan Bernhart, Mingming1 Xu, Rafael Cintron and ningxin hu

Reilly Grant added 1 comment

File services/webnn/dml/graph_impl_dml.cc

Line 4619, Patchset 6: HandleWebNNContextLost(hr, context);

Reilly Grant . resolved

Reilly Grant

That makes sense to me.

The only other thing I want to make sure we are clear on is separating errors by how we expect developers to react to them. As I understand it developers should respond to context loss by simply redoing what they did before (i.e. if the dGPU is removed, create a new context for the iGPU and _maybe_ use a different model for the smaller chip). On the other hand there might be graph or buffer creation errors which mean the developer should not try again, or should take some specific corrective action (e.g. trying to load a different model). These should not be context loss errors and we'll need to use a different method for reporting them (e.g. rejecting from `build()`, `dispatch()` or `readBuffer()`).

Related details

Attention is currently required from:

Bryan Bernhart
Mingming1 Xu
Rafael Cintron
ningxin hu

Submit Requirements:

Code-Coverage
Code-Review
No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

Dwayne Robinson (Gerrit)

unread,

Jun 21, 2024, 7:00:47 PM (8 days ago) Jun 21

to Mingming1 Xu, Austin Sullivan, Reilly Grant, AyeAye, Rafael Cintron, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Bryan Bernhart, Mingming1 Xu, Rafael Cintron, Reilly Grant and ningxin hu

Dwayne Robinson added 1 comment

File services/webnn/dml/graph_impl_dml.cc

Dwayne Robinson

(I hadn't seen this CR until Rafael brought it to my attention)

Agreed with `E_INVALIDARG`, as Chromium should not be sending invalid parameters to DML in the first place, and DML would report this error due to bad *caller* parameters, not bubbled up from the driver (which become different error messages).

Some others besides `DXGI_ERROR_DEVICE_REMOVED` to gracefully handle include: `DXGI_ERROR_UNSUPPORTED`, `DXGI_DDI_ERR_UNSUPPORTED`, `DXGI_ERROR_DEVICE_RESET`.

Though, `DXGI_ERROR_DRIVER_INTERNAL_ERROR` is more serious - "An internal issue prevented the driver from carrying out the specified operation. The driver's state is probably suspect, and the application should not continue." 🤔

Related details

Attention is currently required from:

Bryan Bernhart
Mingming1 Xu
Rafael Cintron

Reilly Grant
ningxin hu

Submit Requirements:

Code-Coverage
Code-Review
No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

Bryan Bernhart (Gerrit)

unread,

Jun 21, 2024, 7:27:15 PM (8 days ago) Jun 21

to Mingming1 Xu, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, Rafael Cintron, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Mingming1 Xu, Rafael Cintron, Reilly Grant and ningxin hu

Bryan Bernhart added 3 comments

File services/webnn/dml/context_impl_dml.cc

Line 225, Patchset 4: // TODO(crbug.com/41492165): generate error using context.

Bryan Bernhart . unresolved

Can we handle context lost here as well?

Reilly Grant

Are we assuming that any failure to create a buffer or graph means the context has been lost? That seems strange. I expect that context loss is a signal we specifically get from the adapter.

Bryan Bernhart

Are we assuming that any failure to create a buffer or graph means the context has been lost?
We could, like WebGPU. If we do not then the OOM skews to some, possibly a unrelated call, which does handle context lost which is hard to debug.

Reilly Grant

I don't think we should be reporting a context loss for errors like this. That's a separate concept.

Bryan Bernhart

If this returns E_OUTOFMEMORY, the context will become inoperable and effectively lost. Also, the HR could be any other error result (but unlikely) so I can see it being quite useful to coverage to one error handler.

Mingming1 Xu

This discussion should be merged into above one https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/ad47c8af_d3f27f05/: whether we should handle the context lost according to the system error code `HRESULT`.

Mingming1 Xu

I am thinking maybe all the methods that return `HRESULT` inside or outside should be handled by `HandleWebNNContextLost`, if the `HRESULT` is context lost error, we should call `OnDestoryed`. WDYT? @rafael....@microsoft.com

Mingming1 Xu

Updated, PTAL.

Bryan Bernhart

@mingmi...@intel.com

It depends if the HRESULT error will be recoverable or not. In this case, E_OUTOFMEMORY is not recoverable and leaves the context inoperable so a "lost" seems appropriate. In other cases, this is not so [1].

Line 329, Patchset 12 (Latest): HandleWebNNContextLost(hr, this);

Bryan Bernhart . unresolved

[1] In certain situations, like command recording, it is not expected to generate a "lost" HRESULT error (hence no TODO), because the error was recoverable.

The problem here is that HandleWebNNContextLost() really only needs to be called from `CommandRecorder::Execute()`, instead of `CommandRecorder::CloseAndExecute()`, where HandleRecordingError() only needs to LOG.

File services/webnn/dml/utils.cc

Line 370, Patchset 12 (Latest):void HandleWebNNContextLost(HRESULT hr, ContextImplDml* context) {

Bryan Bernhart . unresolved

Since it's not possible to HandleWebNNContextLost() without a context now and every WebNN object should have access to said context, why must this function be in utils?

Related details

Attention is currently required from:

satisfied_requirement

unsatisfied_requirement

open

diffy

Rafael Cintron (Gerrit)

unread,

Jun 21, 2024, 7:49:45 PM (8 days ago) Jun 21

to Mingming1 Xu, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Dwayne Robinson, Mingming1 Xu, Reilly Grant and ningxin hu

Rafael Cintron added 1 comment

File services/webnn/dml/graph_impl_dml.cc

Rafael Cintron

As I understand it developers should respond to context loss by simply redoing what they did before

@rei...@chromium.org, both WebGL and WebGPU allow the web developer _one_ context lost of grace. After that, no more WebGL/WebGPU for you. Not a great story but is the best we have with the information we're given by the platform about what's really happening.

That said, WebNN is not like WebGPU/WebGL where the platform ingests arbitrary shaders with infinite loops so I expect the platform will be better able to preempt work the web developer submits.

Of the errors Dwayne quoted, `DXGI_ERROR_UNSUPPORTED` and `DXGI_DDI_ERR_UNSUPPORTED` are a temporary implementation detail specific to NPUS when not all of the ops are supported at compile time. I'm double checking with the DML team but I expect/hope that we don't have to completely tear down and restart DML when this happens. OK with me if we handle those particular ones as a separate change.

Related details

Attention is currently required from:

Dwayne Robinson
Mingming1 Xu
Reilly Grant
ningxin hu

satisfied_requirement

unsatisfied_requirement

open

diffy

Mingming1 Xu (Gerrit)

unread,

Jun 24, 2024, 3:13:29 AM (6 days ago) Jun 24

to Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, Rafael Cintron, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Bryan Bernhart, Dwayne Robinson, Rafael Cintron, Reilly Grant and ningxin hu

Mingming1 Xu added 7 comments

File services/webnn/dml/context_impl_dml.cc

Line 329, Patchset 12: HandleWebNNContextLost(hr, this);

Bryan Bernhart . unresolved

[1] In certain situations, like command recording, it is not expected to generate a "lost" HRESULT error (hence no TODO), because the error was recoverable.
The problem here is that HandleWebNNContextLost() really only needs to be called from `CommandRecorder::Execute()`, instead of `CommandRecorder::CloseAndExecute()`, where HandleRecordingError() only needs to LOG.

Mingming1 Xu

What do you mean? The `CommandRecorder::CloseAndExecute()` will also call `CommandRecorder::Execute()`. And there is no any bad effect if we call the `HandleWebNNContextLost()` here, right? Just like the `GraphImplDml::HandleDispatchFailure`, `GraphImplDml::HandleComputationFailure` and `HandleGraphCreationFailure`, I handle the failure HRESULT everywhere.

File services/webnn/dml/utils.cc

Line 370, Patchset 12:void HandleWebNNContextLost(HRESULT hr, ContextImplDml* context) {

Bryan Bernhart . unresolved

Since it's not possible to HandleWebNNContextLost() without a context now and every WebNN object should have access to said context, why must this function be in utils?

Mingming1 Xu

Good points, now move it to `ContextImplDml` as `ContextImplDml::HandleContextLost`

Line 377, Patchset 4: } else if (hr == DXGI_ERROR_DEVICE_REMOVED) {

Bryan Bernhart . resolved

Mingming1 Xu

Done

Line 389, Patchset 12:

  CHECK(hr == E_OUTOFMEMORY || hr == DXGI_ERROR_DEVICE_REMOVED ||
        hr == DXGI_ERROR_DEVICE_REMOVED);

Rafael Cintron . resolved

The last two statements are the same.

Mingming1 Xu

Done

Line 391, Patchset 12:}

Rafael Cintron . unresolved

@mingmi...@intel.com, apologies for misleading. What you have here so far is good.
If the HRESULT is a failure HRESULT, please have the browser additionally call `GetDeviceRemovedReason` and `LOG(ERROR)` that as well with the string prefix "[WebNN] Device Removed Reason: ". `GetDeviceRemovedReason` returns more information about the error than just the generic device removed error.

Mingming1 Xu

Updated, PTAL, thanks!

File services/webnn/public/mojom/webnn_context_provider.mojom

Line 68, Patchset 13 (Latest): pending_receiver<WebNNContextClient>? context_client_receiver;

Mingming1 Xu . unresolved

@ningx...@intel.com we also need reviewer for this mojom file, could you help invite? Thanks!

File services/webnn/webnn_graph_impl_unittest.cc

Line 108, Patchset 4: FakeWebNNContextImpl(mojo::PendingReceiver<mojom::WebNNContext> receiver,

Bryan Bernhart . unresolved

Looks like we have no test cases which generate a context lost (and ensure delivery).
What's our plan there?

Mingming1 Xu

Yes, agree, I should add tests for that, will do in following patchset.

Mingming1 Xu

Added in patchset 10, PTAL.

Bryan Bernhart

What we added is good but I think we'll need to go further and add a CTS. To do that, there needs to be a way to force a context lost from script w/o relying on the OS/driver generating a "lost" HRESULT.
Introducing `MLContext.destroy()` which also loses the context could achieve that. But if we treat destroy as lost, the web developer could mistakenly re-create the context (when in fact, it should remain destroyed). `MLContextLostInfo` will probably need a field to the know the source of the message to avoid that. WDYT?

Rafael Cintron

Here are the [WebGPU device lost tests](https://gpuweb.github.io/cts/standalone/?q=webgpu:api,operation,device,lost:*). They're implemented with a `destroy` method on the device, which I think we should also have in WebNN.
The destroy method can come as a separate change.

Mingming1 Xu

Thanks, open https://issues.chromium.org/issues/348904836 to track.

But if we treat destroy as lost, the web developer could mistakenly re-create the context (when in fact, it should remain destroyed).

Re-creating a new context after lost is as expected, the new context can work after the GPU recovers, right?

Related details

Attention is currently required from:

Bryan Bernhart
Dwayne Robinson
Rafael Cintron
Reilly Grant
ningxin hu

Submit Requirements:

Code-Coverage
Code-Review
No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

Bryan Bernhart (Gerrit)

unread,

Jun 24, 2024, 12:45:56 PM (5 days ago) Jun 24

to Mingming1 Xu, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, Rafael Cintron, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Dwayne Robinson, Mingming1 Xu, Rafael Cintron, Reilly Grant and ningxin hu

Bryan Bernhart added 2 comments

File services/webnn/dml/context_impl_dml.cc

Line 329, Patchset 12: HandleWebNNContextLost(hr, this);

Bryan Bernhart . unresolved

[1] In certain situations, like command recording, it is not expected to generate a "lost" HRESULT error (hence no TODO), because the error was recoverable.
The problem here is that HandleWebNNContextLost() really only needs to be called from `CommandRecorder::Execute()`, instead of `CommandRecorder::CloseAndExecute()`, where HandleRecordingError() only needs to LOG.

Mingming1 Xu

What do you mean? The `CommandRecorder::CloseAndExecute()` will also call `CommandRecorder::Execute()`. And there is no any bad effect if we call the `HandleWebNNContextLost()` here, right? Just like the `GraphImplDml::HandleDispatchFailure`, `GraphImplDml::HandleComputationFailure` and `HandleGraphCreationFailure`, I handle the failure HRESULT everywhere.

Bryan Bernhart

And there is no any bad effect if we call the HandleWebNNContextLost() here, right?

If we call HandleWebNNContextLost(), the entire context becomes inoperable, even when the HRESULT may not indicate a 'lost' error. Before this CL, any recording HRESULT error was fully handled by re-creating the command recorder.

Either HandleWebNNContextLost() needs to avoid generating a lost on non-fatal HRESULT errors that are recoverable OR we must call HandleWebNNContextLost() ONLY where a lost HRESULT is expected (ex. `CommandRecorder::Execute`).

File services/webnn/webnn_graph_impl_unittest.cc

Line 108, Patchset 4: FakeWebNNContextImpl(mojo::PendingReceiver<mojom::WebNNContext> receiver,

Bryan Bernhart . unresolved

Looks like we have no test cases which generate a context lost (and ensure delivery).
What's our plan there?

Mingming1 Xu

Yes, agree, I should add tests for that, will do in following patchset.

Mingming1 Xu

Added in patchset 10, PTAL.

Bryan Bernhart

What we added is good but I think we'll need to go further and add a CTS. To do that, there needs to be a way to force a context lost from script w/o relying on the OS/driver generating a "lost" HRESULT.
Introducing `MLContext.destroy()` which also loses the context could achieve that. But if we treat destroy as lost, the web developer could mistakenly re-create the context (when in fact, it should remain destroyed). `MLContextLostInfo` will probably need a field to the know the source of the message to avoid that. WDYT?

Rafael Cintron

Here are the [WebGPU device lost tests](https://gpuweb.github.io/cts/standalone/?q=webgpu:api,operation,device,lost:*). They're implemented with a `destroy` method on the device, which I think we should also have in WebNN.
The destroy method can come as a separate change.

Mingming1 Xu

Thanks, open https://issues.chromium.org/issues/348904836 to track.
But if we treat destroy as lost, the web developer could mistakenly re-create the context (when in fact, it should remain destroyed).
Re-creating a new context after lost is as expected, the new context can work after the GPU recovers, right?

Bryan Bernhart

Re-creating a new context after lost is as expected, the new context can work after the GPU recovers, right?

Correct but only when the context wasn't destroyed/lost on purpose. I am fine with a TODO.

Line 354, Patchset 13 (Latest):

Related details

Attention is currently required from:

Dwayne Robinson
Mingming1 Xu

satisfied_requirement

unsatisfied_requirement

open

diffy

Rafael Cintron (Gerrit)

unread,

Jun 24, 2024, 9:10:26 PM (5 days ago) Jun 24

to Mingming1 Xu, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Bryan Bernhart, Dwayne Robinson, Mingming1 Xu, Reilly Grant and ningxin hu

Rafael Cintron added 4 comments

File services/webnn/dml/context_impl_dml.cc

Line 100, Patchset 13 (Latest): HandleContextLost(hr);

Rafael Cintron . unresolved

Every occurrence of `HandleContextLost` in this CL is preceded by `LOG(ERROR) << "some error"`.

To better streamline how we do error handling, Consider having `HandleContextLost`, itself, accept an error string that it can combine with the existing strings it already outputs.

Line 328, Patchset 13 (Latest): LOG(ERROR) << error_message;

Rafael Cintron . unresolved

Nit: While you're here, please prepend the string "[WebNN] " to this error message so that we know which system it is coming from in logs.

(Same goes for the other places where you're outputting to the error log)

Line 329, Patchset 12: HandleWebNNContextLost(hr, this);

Bryan Bernhart . unresolved

[1] In certain situations, like command recording, it is not expected to generate a "lost" HRESULT error (hence no TODO), because the error was recoverable.
The problem here is that HandleWebNNContextLost() really only needs to be called from `CommandRecorder::Execute()`, instead of `CommandRecorder::CloseAndExecute()`, where HandleRecordingError() only needs to LOG.

Mingming1 Xu

What do you mean? The `CommandRecorder::CloseAndExecute()` will also call `CommandRecorder::Execute()`. And there is no any bad effect if we call the `HandleWebNNContextLost()` here, right? Just like the `GraphImplDml::HandleDispatchFailure`, `GraphImplDml::HandleComputationFailure` and `HandleGraphCreationFailure`, I handle the failure HRESULT everywhere.

Bryan Bernhart

And there is no any bad effect if we call the HandleWebNNContextLost() here, right?
If we call HandleWebNNContextLost(), the entire context becomes inoperable, even when the HRESULT may not indicate a 'lost' error. Before this CL, any recording HRESULT error was fully handled by re-creating the command recorder.
Either HandleWebNNContextLost() needs to avoid generating a lost on non-fatal HRESULT errors that are recoverable OR we must call HandleWebNNContextLost() ONLY where a lost HRESULT is expected (ex. `CommandRecorder::Execute`).

Rafael Cintron

@bryan.b...@intel.com, can you give me an example of a non-fatal HRESULT error that is recoverable by re-creating the command recorder?

  CHECK(hr == E_OUTOFMEMORY || hr == DXGI_ERROR_DEVICE_REMOVED ||
        hr == DXGI_ERROR_DEVICE_REMOVED);

Rafael Cintron . unresolved

These two || expressions are the same.

Related details

Attention is currently required from:

Bryan Bernhart
Dwayne Robinson
Mingming1 Xu
Reilly Grant
ningxin hu

satisfied_requirement

unsatisfied_requirement

open

diffy

Mingming1 Xu (Gerrit)

unread,

Jun 25, 2024, 1:48:03 AM (5 days ago) Jun 25

to Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, Rafael Cintron, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Bryan Bernhart, Rafael Cintron, Reilly Grant and ningxin hu

Mingming1 Xu added 6 comments

File services/webnn/dml/context_impl_dml.cc

Line 100, Patchset 13: HandleContextLost(hr);

Rafael Cintron . resolved

Every occurrence of `HandleContextLost` in this CL is preceded by `LOG(ERROR) << "some error"`.
To better streamline how we do error handling, Consider having `HandleContextLost`, itself, accept an error string that it can combine with the existing strings it already outputs.

Mingming1 Xu

Good idea,done.

Line 328, Patchset 13: LOG(ERROR) << error_message;

Rafael Cintron . resolved

Nit: While you're here, please prepend the string "[WebNN] " to this error message so that we know which system it is coming from in logs.
(Same goes for the other places where you're outputting to the error log)

Mingming1 Xu

Done

Line 354, Patchset 13:

  CHECK(hr == E_OUTOFMEMORY || hr == DXGI_ERROR_DEVICE_REMOVED ||
        hr == DXGI_ERROR_DEVICE_REMOVED);

Rafael Cintron . resolved

These two || expressions are the same.

Mingming1 Xu

Done

File services/webnn/dml/utils.cc

Line 370, Patchset 12:void HandleWebNNContextLost(HRESULT hr, ContextImplDml* context) {

Bryan Bernhart . resolved

Since it's not possible to HandleWebNNContextLost() without a context now and every WebNN object should have access to said context, why must this function be in utils?

Mingming1 Xu

Good points, now move it to `ContextImplDml` as `ContextImplDml::HandleContextLost`

Mingming1 Xu

Done

File services/webnn/webnn_graph_impl_unittest.cc

Line 108, Patchset 4: FakeWebNNContextImpl(mojo::PendingReceiver<mojom::WebNNContext> receiver,

Bryan Bernhart . unresolved

Looks like we have no test cases which generate a context lost (and ensure delivery).
What's our plan there?

Mingming1 Xu

Yes, agree, I should add tests for that, will do in following patchset.

Mingming1 Xu

Added in patchset 10, PTAL.

Bryan Bernhart

What we added is good but I think we'll need to go further and add a CTS. To do that, there needs to be a way to force a context lost from script w/o relying on the OS/driver generating a "lost" HRESULT.
Introducing `MLContext.destroy()` which also loses the context could achieve that. But if we treat destroy as lost, the web developer could mistakenly re-create the context (when in fact, it should remain destroyed). `MLContextLostInfo` will probably need a field to the know the source of the message to avoid that. WDYT?

Rafael Cintron

Here are the [WebGPU device lost tests](https://gpuweb.github.io/cts/standalone/?q=webgpu:api,operation,device,lost:*). They're implemented with a `destroy` method on the device, which I think we should also have in WebNN.
The destroy method can come as a separate change.

Mingming1 Xu

Thanks, open https://issues.chromium.org/issues/348904836 to track.
But if we treat destroy as lost, the web developer could mistakenly re-create the context (when in fact, it should remain destroyed).
Re-creating a new context after lost is as expected, the new context can work after the GPU recovers, right?

Bryan Bernhart

Re-creating a new context after lost is as expected, the new context can work after the GPU recovers, right?
Correct but only when the context wasn't destroyed/lost on purpose. I am fine with a TODO.

Mingming1 Xu

Thinking again, what's the purpose and usecase of introducing `MLContext.destroy()` into WebNN? Is it only for testing against forcing context lost from renderer process? @rafael....@microsoft.com @bryan.b...@intel.com @ningx...@intel.com

File third_party/blink/renderer/modules/ml/ml_context.cc

Line 224, Patchset 4:void MLContext::OnWebNNContextLostCaptured(const String& context_lost_info) {

ningxin hu . unresolved

Should this info be reported to JS code?

Mingming1 Xu

Now I reported these info related to adapter error in `src\services\webnn\dml\utils.cc`:
  `std::string context_lost_info;
  if (hr == E_OUTOFMEMORY) {
    context_lost_info = "out of memory.";


  } else if (hr == DXGI_ERROR_DEVICE_REMOVED) {

    context_lost_info = "device removed.";
  } else if (hr == DXGI_ERROR_DEVICE_HUNG) {
    context_lost_info = "device hung.";
  } else if (hr == DXGI_ERROR_DEVICE_RESET) {
    context_lost_info = "device reset.";
  } else {
    return;
  }`

Mingming1 Xu

Discussed a lot in other comments, PTAL, thanks!

Related details

Attention is currently required from:

Bryan Bernhart
Rafael Cintron
Reilly Grant
ningxin hu

Submit Requirements:

Code-Coverage
Code-Review
No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

Rafael Cintron (Gerrit)

unread,

Jun 25, 2024, 11:48:06 AM (4 days ago) Jun 25

to Mingming1 Xu, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Bryan Bernhart, Mingming1 Xu, Reilly Grant and ningxin hu

Rafael Cintron added 1 comment

File services/webnn/webnn_graph_impl_unittest.cc

Line 108, Patchset 4: FakeWebNNContextImpl(mojo::PendingReceiver<mojom::WebNNContext> receiver,

Bryan Bernhart . unresolved

Looks like we have no test cases which generate a context lost (and ensure delivery).
What's our plan there?

Mingming1 Xu

Yes, agree, I should add tests for that, will do in following patchset.

Mingming1 Xu

Added in patchset 10, PTAL.

Bryan Bernhart

What we added is good but I think we'll need to go further and add a CTS. To do that, there needs to be a way to force a context lost from script w/o relying on the OS/driver generating a "lost" HRESULT.
Introducing `MLContext.destroy()` which also loses the context could achieve that. But if we treat destroy as lost, the web developer could mistakenly re-create the context (when in fact, it should remain destroyed). `MLContextLostInfo` will probably need a field to the know the source of the message to avoid that. WDYT?

Rafael Cintron

Here are the [WebGPU device lost tests](https://gpuweb.github.io/cts/standalone/?q=webgpu:api,operation,device,lost:*). They're implemented with a `destroy` method on the device, which I think we should also have in WebNN.
The destroy method can come as a separate change.

Mingming1 Xu

Thanks, open https://issues.chromium.org/issues/348904836 to track.
But if we treat destroy as lost, the web developer could mistakenly re-create the context (when in fact, it should remain destroyed).
Re-creating a new context after lost is as expected, the new context can work after the GPU recovers, right?

Bryan Bernhart

Re-creating a new context after lost is as expected, the new context can work after the GPU recovers, right?
Correct but only when the context wasn't destroyed/lost on purpose. I am fine with a TODO.

Mingming1 Xu

Thinking again, what's the purpose and usecase of introducing `MLContext.destroy()` into WebNN? Is it only for testing against forcing context lost from renderer process? @rafael....@microsoft.com @bryan.b...@intel.com @ningx...@intel.com

Rafael Cintron

Destroy serves two purposes. One is for testing. The other is for the web developer to free the memory and resources taken up by the context. In the case of hybrid GPUs, that means powering down the discrete GPU and giving hours of battery life back to the end user.

Related details

Attention is currently required from:

Bryan Bernhart
Mingming1 Xu
Reilly Grant
ningxin hu

Submit Requirements:

Code-Coverage
Code-Review
No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

Bryan Bernhart (Gerrit)

unread,

Jun 25, 2024, 12:28:01 PM (4 days ago) Jun 25

to Mingming1 Xu, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, Rafael Cintron, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Mingming1 Xu, Rafael Cintron, Reilly Grant and ningxin hu

Bryan Bernhart added 1 comment

File services/webnn/dml/context_impl_dml.cc

Line 329, Patchset 12: HandleWebNNContextLost(hr, this);

Bryan Bernhart . unresolved

[1] In certain situations, like command recording, it is not expected to generate a "lost" HRESULT error (hence no TODO), because the error was recoverable.
The problem here is that HandleWebNNContextLost() really only needs to be called from `CommandRecorder::Execute()`, instead of `CommandRecorder::CloseAndExecute()`, where HandleRecordingError() only needs to LOG.

Mingming1 Xu

What do you mean? The `CommandRecorder::CloseAndExecute()` will also call `CommandRecorder::Execute()`. And there is no any bad effect if we call the `HandleWebNNContextLost()` here, right? Just like the `GraphImplDml::HandleDispatchFailure`, `GraphImplDml::HandleComputationFailure` and `HandleGraphCreationFailure`, I handle the failure HRESULT everywhere.

Bryan Bernhart

And there is no any bad effect if we call the HandleWebNNContextLost() here, right?
If we call HandleWebNNContextLost(), the entire context becomes inoperable, even when the HRESULT may not indicate a 'lost' error. Before this CL, any recording HRESULT error was fully handled by re-creating the command recorder.
Either HandleWebNNContextLost() needs to avoid generating a lost on non-fatal HRESULT errors that are recoverable OR we must call HandleWebNNContextLost() ONLY where a lost HRESULT is expected (ex. `CommandRecorder::Execute`).

Rafael Cintron

@bryan.b...@intel.com, can you give me an example of a non-fatal HRESULT error that is recoverable by re-creating the command recorder?

Bryan Bernhart

@rafael....@microsoft.com any HRESULT that doesn't explicitly say the device will be put into a removal state could be recoverable. If the web developer builds N graphs and the N+1 graph build causes `E_INVALIDARG` or `DXGI_ERROR_INVALID_CALL` and so on then the entire app/workload will go down and the web developer won't know what graph caused it.

File-level comment, Patchset 14 (Latest):

Related details

Attention is currently required from:

Mingming1 Xu
Rafael Cintron
Reilly Grant
ningxin hu

satisfied_requirement

unsatisfied_requirement

open

diffy

Rafael Cintron (Gerrit)

unread,

Jun 25, 2024, 8:34:48 PM (4 days ago) Jun 25

to Mingming1 Xu, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Bryan Bernhart, Mingming1 Xu, Reilly Grant and ningxin hu

Rafael Cintron voted and added 3 comments

Votes added by Rafael Cintron

Code-Review

+1

3 comments

Patchset-level comments

Rafael Cintron . resolved

Code change LGTM lone "nit" I just opened.

File services/webnn/dml/context_impl_dml.cc

Line 329, Patchset 12: HandleWebNNContextLost(hr, this);

Bryan Bernhart . unresolved

[1] In certain situations, like command recording, it is not expected to generate a "lost" HRESULT error (hence no TODO), because the error was recoverable.
The problem here is that HandleWebNNContextLost() really only needs to be called from `CommandRecorder::Execute()`, instead of `CommandRecorder::CloseAndExecute()`, where HandleRecordingError() only needs to LOG.

Mingming1 Xu

What do you mean? The `CommandRecorder::CloseAndExecute()` will also call `CommandRecorder::Execute()`. And there is no any bad effect if we call the `HandleWebNNContextLost()` here, right? Just like the `GraphImplDml::HandleDispatchFailure`, `GraphImplDml::HandleComputationFailure` and `HandleGraphCreationFailure`, I handle the failure HRESULT everywhere.

Bryan Bernhart

And there is no any bad effect if we call the HandleWebNNContextLost() here, right?
If we call HandleWebNNContextLost(), the entire context becomes inoperable, even when the HRESULT may not indicate a 'lost' error. Before this CL, any recording HRESULT error was fully handled by re-creating the command recorder.
Either HandleWebNNContextLost() needs to avoid generating a lost on non-fatal HRESULT errors that are recoverable OR we must call HandleWebNNContextLost() ONLY where a lost HRESULT is expected (ex. `CommandRecorder::Execute`).

Rafael Cintron

@bryan.b...@intel.com, can you give me an example of a non-fatal HRESULT error that is recoverable by re-creating the command recorder?

Bryan Bernhart

@rafael....@microsoft.com any HRESULT that doesn't explicitly say the device will be put into a removal state could be recoverable. If the web developer builds N graphs and the N+1 graph build causes `E_INVALIDARG` or `DXGI_ERROR_INVALID_CALL` and so on then the entire app/workload will go down and the web developer won't know what graph caused it.

Rafael Cintron

My opinion, as I've stated in other comment threads in this CL, is that `E_INVALIDARG` and `DXGI_ERROR_INVALID_CALL` are not errors the browser should try to gracefully recover from. We should `CHECK` those and fail fast the GPU process.

File services/webnn/webnn_graph_impl_unittest.cc

Line 146, Patchset 14 (Latest): bool IsLost() { return is_lost_; }

Rafael Cintron . unresolved

Nit: Please `IsLost` a const method.

Related details

Attention is currently required from:

Bryan Bernhart
Mingming1 Xu
Reilly Grant
ningxin hu

satisfied_requirement

unsatisfied_requirement

open

diffy

Mingming1 Xu (Gerrit)

unread,

Jun 25, 2024, 11:03:36 PM (4 days ago) Jun 25

to Rafael Cintron, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Bryan Bernhart, Reilly Grant and ningxin hu

Mingming1 Xu added 2 comments

File services/webnn/dml/utils.cc

Line 391, Patchset 12:}

Rafael Cintron . resolved

@mingmi...@intel.com, apologies for misleading. What you have here so far is good.
If the HRESULT is a failure HRESULT, please have the browser additionally call `GetDeviceRemovedReason` and `LOG(ERROR)` that as well with the string prefix "[WebNN] Device Removed Reason: ". `GetDeviceRemovedReason` returns more information about the error than just the generic device removed error.

Mingming1 Xu

Updated, PTAL, thanks!

Mingming1 Xu

Done

File services/webnn/webnn_graph_impl_unittest.cc

Line 146, Patchset 14: bool IsLost() { return is_lost_; }

Rafael Cintron . resolved

Nit: Please `IsLost` a const method.

Mingming1 Xu

Done

Related details

Attention is currently required from:

Bryan Bernhart
Reilly Grant
ningxin hu

Submit Requirements:

Code-Coverage
Code-Review
No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

ningxin hu (Gerrit)

unread,

Jun 25, 2024, 11:04:47 PM (4 days ago) Jun 25

to Mingming1 Xu, Alex Gough, Rafael Cintron, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, Bryan Bernhart, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Alex Gough, Bryan Bernhart and Reilly Grant

ningxin hu added 1 comment

File services/webnn/public/mojom/webnn_context_provider.mojom

Line 68, Patchset 13: pending_receiver<WebNNContextClient>? context_client_receiver;

Mingming1 Xu . resolved

@ningx...@intel.com we also need reviewer for this mojom file, could you help invite? Thanks!

ningxin hu

Done. Add @aj...@chromium.org for reviewing mojo changes.

Related details

Attention is currently required from:

Alex Gough
Bryan Bernhart
Reilly Grant

Submit Requirements:

Code-Coverage
Code-Review
No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

ningxin hu (Gerrit)

unread,

Jun 26, 2024, 2:13:13 AM (4 days ago) Jun 26

to Mingming1 Xu, Alex Gough, Rafael Cintron, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, Bryan Bernhart, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Alex Gough, Bryan Bernhart, Mingming1 Xu and Reilly Grant

ningxin hu added 14 comments

Patchset-level comments

ningxin hu . resolved

Adapter's `HandleAdapterFailure()` of https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/adapter.cc;bpv=0;bpt=1 may also need to handle context lost.

File services/webnn/dml/context_impl_dml.h

Line 43, Patchset 16: void HandleContextLost(std::string_view error_message, HRESULT hr);

ningxin hu . unresolved

This method may crash GPU process. It'd be better to make it explicit: i.e. something like `HandleContextLostOrCrash()`.

It might be worth adding a comment explaining which errors are treated as "context lost".

File services/webnn/dml/context_impl_dml.cc

Line 146, Patchset 16: HandleContextLost("Failed to start recording.", hr);

ningxin hu . unresolved

`StartRecordingIfNecessary()` may already fail and call `HandleRecordingError()` which calls `HandleContextLost()` before. It may crash at `CHECK` before running the callback and this `HandleContextLost()`.

Line 156, Patchset 16: HandleRecordingError("Failed to close and execute the command list.", hr);

ningxin hu . unresolved

`HandleRecordingError()` calls `HandleContextLost()` that may crash at `CHECK` and won't run the following callback.

Line 180, Patchset 16: HandleRecordingError("Failed to download the buffer.", hr);

ningxin hu . unresolved

ditto.

Line 191, Patchset 16: if (FAILED(hr)) {

ningxin hu . unresolved

Why don't we call `HandleContextLost()` for this failure like others?

Line 323, Patchset 16:void ContextImplDml::HandleContextLost(std::string_view error_message,

ningxin hu . unresolved

This is not an error message passed to user JS code but a log message. I feel it is confusing by passing it just for logging. Could it be moved out from this method? Other helpers, like `RETURN_IF_FAILED` in [error.h](https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/error.h;l=14;drc=ae983e94583c5e2d400fa7641439d2e8ca13fd93;bpv=1;bpt=1), already log the error.

Line 341, Patchset 16: OnDestroyed(base::StrCat({"WebNN context is lost due to ", message}));

ningxin hu . unresolved

Do we want to reject `lost` promise for other "internal errors"? Or we just crash the GPU process by `CHECK` an error is not an "internal error"?

  LOG(ERROR) << "[WebNN] Device Removed Reason: "
             << logging::SystemErrorCodeToString(
                    adapter_->d3d12_device()->GetDeviceRemovedReason());

ningxin hu . unresolved

Should this log only for device removal error?

File services/webnn/dml/graph_impl_dml.cc

Line 4793, Patchset 16:

  if (hr == E_OUTOFMEMORY) {
    std::move(callback).Run(base::unexpected(CreateError(
        mojom::Error::Code::kUnknownError,
        error_message + " No enough memory resources are available.")));

ningxin hu . unresolved

Should we remove this case? Because "out of memory" is now handled by context lost.

Line 4801, Patchset 16: context->HandleContextLost(error_message, hr);

ningxin hu . unresolved

Where does the `context` come from? Did you forget passing it via parameter?

      std::move(callback).Run(
          base::unexpected(std::move(create_operator_result.error())));

ningxin hu . unresolved

Do we handle context lost here? Given the following code handles graph creation failure (it internally handles context lost) if creating an identity node fails.

      HandleGraphCreationFailure("Failed to create identity operator.",
                                 std::move(callback));

ningxin hu . unresolved

This may need to pass `context`.

Line 5915, Patchset 16:

  if (hr == E_OUTOFMEMORY) {
    std::move(callback).Run(ComputeResult::NewError(CreateError(
        mojom::Error::Code::kUnknownError,
        error_message + " No enough memory resources are available.")));

ningxin hu . unresolved

ditto, this case might be unnecessary.

Related details

Attention is currently required from:

Alex Gough
Bryan Bernhart
Mingming1 Xu
Reilly Grant

satisfied_requirement

unsatisfied_requirement

open

diffy

ningxin hu (Gerrit)

unread,

Jun 26, 2024, 2:13:51 AM (4 days ago) Jun 26

to Mingming1 Xu, Alex Gough, Rafael Cintron, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, Bryan Bernhart, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Alex Gough, Bryan Bernhart, Mingming1 Xu and Reilly Grant

ningxin hu added 1 comment

Patchset-level comments

ningxin hu . unresolved

Adapter's `HandleAdapterFailure()` of https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/adapter.cc;bpv=0;bpt=1 may also need to handle context lost.

ningxin hu

Open it for comments.

Related details

Attention is currently required from:

Alex Gough
Bryan Bernhart
Mingming1 Xu
Reilly Grant

Submit Requirements:

Code-Coverage
Code-Review
No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

Mingming1 Xu (Gerrit)

unread,

Jun 26, 2024, 4:59:43 AM (3 days ago) Jun 26

to Alex Gough, Rafael Cintron, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Alex Gough, Bryan Bernhart, Reilly Grant and ningxin hu

Mingming1 Xu added 12 comments

Patchset-level comments

ningxin hu . unresolved

Adapter's `HandleAdapterFailure()` of https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/adapter.cc;bpv=0;bpt=1 may also need to handle context lost.

ningxin hu

Open it for comments.

Mingming1 Xu

While the adapter is created before creating a context, if fails, there is no context.

File services/webnn/dml/context_impl_dml.cc

Line 146, Patchset 16: HandleContextLost("Failed to start recording.", hr);

ningxin hu . unresolved

`StartRecordingIfNecessary()` may already fail and call `HandleRecordingError()` which calls `HandleContextLost()` before. It may crash at `CHECK` before running the callback and this `HandleContextLost()`.

Mingming1 Xu

Do you mean we should not call `HandleRecordingError()` which calls`HandleContextLost()` in the `StartRecordingIfNecessary()`? We can call the `HandleRecordingError()` outside if `StartRecordingIfNecessary()` fails.

Line 156, Patchset 16: HandleRecordingError("Failed to close and execute the command list.", hr);

ningxin hu . unresolved

`HandleRecordingError()` calls `HandleContextLost()` that may crash at `CHECK` and won't run the following callback.

Mingming1 Xu

Do you mean move the `HandleRecordingError()` below the `std::move(callback).Run...` to let the callback run at first?

Line 191, Patchset 16: if (FAILED(hr)) {

ningxin hu . resolved

Why don't we call `HandleContextLost()` for this failure like others?

Mingming1 Xu

Yes, I missed it. Added now.

Line 323, Patchset 16:void ContextImplDml::HandleContextLost(std::string_view error_message,

ningxin hu . unresolved

This is not an error message passed to user JS code but a log message. I feel it is confusing by passing it just for logging. Could it be moved out from this method? Other helpers, like `RETURN_IF_FAILED` in [error.h](https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/error.h;l=14;drc=ae983e94583c5e2d400fa7641439d2e8ca13fd93;bpv=1;bpt=1), already log the error.

Mingming1 Xu

This change is to address @rafael....@microsoft.com 's comments: https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/f38abdf6_7c0fbbcd/

Line 341, Patchset 16: OnDestroyed(base::StrCat({"WebNN context is lost due to ", message}));

ningxin hu . unresolved

Do we want to reject `lost` promise for other "internal errors"? Or we just crash the GPU process by `CHECK` an error is not an "internal error"?

Mingming1 Xu

The discussion is here: https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/ad47c8af_d3f27f05/ and https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/7aa27202_25387285/

  LOG(ERROR) << "[WebNN] Device Removed Reason: "
             << logging::SystemErrorCodeToString(
                    adapter_->d3d12_device()->GetDeviceRemovedReason());

ningxin hu . unresolved

Should this log only for device removal error?

Mingming1 Xu

Agree, maybe we should:
`  HRESULT device_removed_reason = adapter_->d3d12_device()->GetDeviceRemovedReason();
  if (FAILED(device_removed_reason)) {


    LOG(ERROR) << "[WebNN] Device Removed Reason: "

               << logging::SystemErrorCodeToString(device_removed_reason);
  }`

File services/webnn/dml/graph_impl_dml.cc

Line 4793, Patchset 16:

  if (hr == E_OUTOFMEMORY) {
    std::move(callback).Run(base::unexpected(CreateError(
        mojom::Error::Code::kUnknownError,
        error_message + " No enough memory resources are available.")));

ningxin hu . resolved

Should we remove this case? Because "out of memory" is now handled by context lost.

Mingming1 Xu

Done

Line 4801, Patchset 16: context->HandleContextLost(error_message, hr);

ningxin hu . resolved

Where does the `context` come from? Did you forget passing it via parameter?

Mingming1 Xu

Done

      std::move(callback).Run(
          base::unexpected(std::move(create_operator_result.error())));

ningxin hu . unresolved

Do we handle context lost here? Given the following code handles graph creation failure (it internally handles context lost) if creating an identity node fails.

Mingming1 Xu

`void HandleGraphCreationFailure(
    const std::string& error_message,
    WebNNContextImpl::CreateGraphImplCallback callback)` won't handle context lost, only `void HandleGraphCreationFailure(
    const std::string& error_message,
    HRESULT hr,
    WebNNContextImpl::CreateGraphImplCallback callback, ContextImplDml* context)` will handle context lost, it requires the `HRESULT`.

      HandleGraphCreationFailure("Failed to create identity operator.",
                                 std::move(callback));

ningxin hu . unresolved

This may need to pass `context`.

Mingming1 Xu

There is no `HRESULT` to be handled by `HandleContextLost`.

Line 5915, Patchset 16:

  if (hr == E_OUTOFMEMORY) {
    std::move(callback).Run(ComputeResult::NewError(CreateError(
        mojom::Error::Code::kUnknownError,
        error_message + " No enough memory resources are available.")));

ningxin hu . resolved

ditto, this case might be unnecessary.

Mingming1 Xu

Done

Line 165, Patchset 19 (Latest):

Related details

Attention is currently required from:

Alex Gough
Bryan Bernhart
Reilly Grant
ningxin hu

Submit Requirements:

Code-Coverage
Code-Review
No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

Alex Gough (Gerrit)

unread,

Jun 26, 2024, 2:01:28 PM (3 days ago) Jun 26

to Mingming1 Xu, Alex Gough, Rafael Cintron, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Bryan Bernhart, Mingming1 Xu, Reilly Grant and ningxin hu

Alex Gough added 2 comments

File services/webnn/public/mojom/webnn_context_provider.mojom

Line 68, Patchset 19 (Latest): pending_receiver<WebNNContextClient>? context_client_receiver;

Alex Gough . unresolved

don't really need _receiver in the name here - but up to you

File third_party/blink/renderer/modules/ml/ml_context.h

  // Closes the `context_remote_` and `context_client_receiver_` pipes
  // because the context has been lost.
  void OnDestroyed(const String& message) override;

  void OnDisconnected();

Alex Gough . unresolved

I'm not sure of the naming of these - it's not super-obvious that it's about the remote side of the context being lost at the moment. OnDisconnected should perhaps refer to the context client being disconnected (not the /this/) and OnDestroyed should somehow refer to the other side being gone.

Related details

Attention is currently required from:

Bryan Bernhart
Mingming1 Xu
Reilly Grant
ningxin hu

satisfied_requirement

unsatisfied_requirement

open

diffy

Bryan Bernhart (Gerrit)

unread,

Jun 26, 2024, 2:12:05 PM (3 days ago) Jun 26

to Mingming1 Xu, Alex Gough, Rafael Cintron, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Mingming1 Xu, Rafael Cintron, Reilly Grant and ningxin hu

Bryan Bernhart added 1 comment

File services/webnn/dml/context_impl_dml.cc

Line 329, Patchset 12: HandleWebNNContextLost(hr, this);

Bryan Bernhart . unresolved

[1] In certain situations, like command recording, it is not expected to generate a "lost" HRESULT error (hence no TODO), because the error was recoverable.
The problem here is that HandleWebNNContextLost() really only needs to be called from `CommandRecorder::Execute()`, instead of `CommandRecorder::CloseAndExecute()`, where HandleRecordingError() only needs to LOG.

Mingming1 Xu

What do you mean? The `CommandRecorder::CloseAndExecute()` will also call `CommandRecorder::Execute()`. And there is no any bad effect if we call the `HandleWebNNContextLost()` here, right? Just like the `GraphImplDml::HandleDispatchFailure`, `GraphImplDml::HandleComputationFailure` and `HandleGraphCreationFailure`, I handle the failure HRESULT everywhere.

Bryan Bernhart

And there is no any bad effect if we call the HandleWebNNContextLost() here, right?
If we call HandleWebNNContextLost(), the entire context becomes inoperable, even when the HRESULT may not indicate a 'lost' error. Before this CL, any recording HRESULT error was fully handled by re-creating the command recorder.
Either HandleWebNNContextLost() needs to avoid generating a lost on non-fatal HRESULT errors that are recoverable OR we must call HandleWebNNContextLost() ONLY where a lost HRESULT is expected (ex. `CommandRecorder::Execute`).

Rafael Cintron

@bryan.b...@intel.com, can you give me an example of a non-fatal HRESULT error that is recoverable by re-creating the command recorder?

Bryan Bernhart

@rafael....@microsoft.com any HRESULT that doesn't explicitly say the device will be put into a removal state could be recoverable. If the web developer builds N graphs and the N+1 graph build causes `E_INVALIDARG` or `DXGI_ERROR_INVALID_CALL` and so on then the entire app/workload will go down and the web developer won't know what graph caused it.

Rafael Cintron

My opinion, as I've stated in other comment threads in this CL, is that `E_INVALIDARG` and `DXGI_ERROR_INVALID_CALL` are not errors the browser should try to gracefully recover from. We should `CHECK` those and fail fast the GPU process.

Bryan Bernhart

@rafael....@microsoft.com D3D12 does not guarantee where a certain error comes from: CreateCommandAllocator() could generate `E_OUTOFMEMORY`. Is that not recoverable? The problem with a "no error or lost" approach is it assumes every error is now fatal. But if you still feel strongly, I will close as "by design".

Related details

Attention is currently required from:

Mingming1 Xu
Rafael Cintron
Reilly Grant
ningxin hu

satisfied_requirement

unsatisfied_requirement

open

diffy

Rafael Cintron (Gerrit)

unread,

Jun 26, 2024, 6:23:26 PM (3 days ago) Jun 26

to Mingming1 Xu, Alex Gough, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Bryan Bernhart, Mingming1 Xu, Reilly Grant and ningxin hu

Rafael Cintron voted and added 4 comments

Votes added by Rafael Cintron

Code-Review

+1

4 comments

File services/webnn/dml/context_impl_dml.cc

Line 323, Patchset 16:void ContextImplDml::HandleContextLost(std::string_view error_message,

ningxin hu . unresolved

This is not an error message passed to user JS code but a log message. I feel it is confusing by passing it just for logging. Could it be moved out from this method? Other helpers, like `RETURN_IF_FAILED` in [error.h](https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/error.h;l=14;drc=ae983e94583c5e2d400fa7641439d2e8ca13fd93;bpv=1;bpt=1), already log the error.

Mingming1 Xu

This change is to address @rafael....@microsoft.com 's comments: https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/f38abdf6_7c0fbbcd/

Rafael Cintron

I think Ningxin is getting confused by the two, very similarly named strings: one called `error_message` and the other one just called `message`.

The first one is meant to go into logs and provide some context on where the error occurred. The second one is meant to be a simpler error we provide to web developers that avoids revealing too much information about the context of the error to attackers.

Perhaps we can better name these to disambiguate; rename `message` to `message_for_promise` or something like that.

Line 329, Patchset 12: HandleWebNNContextLost(hr, this);

Bryan Bernhart . unresolved

[1] In certain situations, like command recording, it is not expected to generate a "lost" HRESULT error (hence no TODO), because the error was recoverable.
The problem here is that HandleWebNNContextLost() really only needs to be called from `CommandRecorder::Execute()`, instead of `CommandRecorder::CloseAndExecute()`, where HandleRecordingError() only needs to LOG.

Mingming1 Xu

What do you mean? The `CommandRecorder::CloseAndExecute()` will also call `CommandRecorder::Execute()`. And there is no any bad effect if we call the `HandleWebNNContextLost()` here, right? Just like the `GraphImplDml::HandleDispatchFailure`, `GraphImplDml::HandleComputationFailure` and `HandleGraphCreationFailure`, I handle the failure HRESULT everywhere.

Bryan Bernhart

And there is no any bad effect if we call the HandleWebNNContextLost() here, right?
If we call HandleWebNNContextLost(), the entire context becomes inoperable, even when the HRESULT may not indicate a 'lost' error. Before this CL, any recording HRESULT error was fully handled by re-creating the command recorder.
Either HandleWebNNContextLost() needs to avoid generating a lost on non-fatal HRESULT errors that are recoverable OR we must call HandleWebNNContextLost() ONLY where a lost HRESULT is expected (ex. `CommandRecorder::Execute`).

Rafael Cintron

@bryan.b...@intel.com, can you give me an example of a non-fatal HRESULT error that is recoverable by re-creating the command recorder?

Bryan Bernhart

@rafael....@microsoft.com any HRESULT that doesn't explicitly say the device will be put into a removal state could be recoverable. If the web developer builds N graphs and the N+1 graph build causes `E_INVALIDARG` or `DXGI_ERROR_INVALID_CALL` and so on then the entire app/workload will go down and the web developer won't know what graph caused it.

Rafael Cintron

My opinion, as I've stated in other comment threads in this CL, is that `E_INVALIDARG` and `DXGI_ERROR_INVALID_CALL` are not errors the browser should try to gracefully recover from. We should `CHECK` those and fail fast the GPU process.

Bryan Bernhart

@rafael....@microsoft.com D3D12 does not guarantee where a certain error comes from: CreateCommandAllocator() could generate `E_OUTOFMEMORY`. Is that not recoverable? The problem with a "no error or lost" approach is it assumes every error is now fatal. But if you still feel strongly, I will close as "by design".

Rafael Cintron

D3D12 does not guarantee where a certain error comes from: CreateCommandAllocator() could generate E_OUTOFMEMORY. Is that not recoverable?

When Chromium runs in the sandbox and the job limit is reached, a message will be sent to the browser process notifying it. The browser process will turnaround and kill the GPU process. So, while it is technically recoverable, the GPU process is doomed. :-) FWIW, WebGPU also makes E_OUTOFMEMORY turn into WebGPU context lost.

The problem with a "no error or lost" approach is it assumes every error is now fatal. But if you still feel strongly, I will close as "by design".

Device removed, device reset and similar HRESULTs should cause a graceful termination of the GPU process. This CL handles them gracefully but doesn't terminate the GPU process yet. We'll need to hook that up in a followon change similar to existing code in the GPU process.

In my opinion, HRESULTS like E_INVALIDARG should result in a CHECK. This is a bug in the DML backend that we should treat similar to CHECK failures that we're counting on the validation layer have dealt before arriving at DML code.

  LOG(ERROR) << "[WebNN] Device Removed Reason: "
             << logging::SystemErrorCodeToString(
                    adapter_->d3d12_device()->GetDeviceRemovedReason());

ningxin hu . unresolved

Should this log only for device removal error?

Mingming1 Xu

Agree, maybe we should:
`  HRESULT device_removed_reason = adapter_->d3d12_device()->GetDeviceRemovedReason();
  if (FAILED(device_removed_reason)) {
    LOG(ERROR) << "[WebNN] Device Removed Reason: "
               << logging::SystemErrorCodeToString(device_removed_reason);
  }`

Rafael Cintron

+1

File services/webnn/dml/graph_impl_dml.cc

      std::move(callback).Run(
          base::unexpected(std::move(create_operator_result.error())));

ningxin hu . unresolved

Do we handle context lost here? Given the following code handles graph creation failure (it internally handles context lost) if creating an identity node fails.

Mingming1 Xu

`void HandleGraphCreationFailure(
    const std::string& error_message,
    WebNNContextImpl::CreateGraphImplCallback callback)` won't handle context lost, only `void HandleGraphCreationFailure(
    const std::string& error_message,
    HRESULT hr,
    WebNNContextImpl::CreateGraphImplCallback callback, ContextImplDml* context)` will handle context lost, it requires the `HRESULT`.

Rafael Cintron

@ningx...@intel.com brings up a good point that this particular error path does not result in context lost.

We originally had this error code path be graceful because we didn't want to crash the GPU process for unimplemented operators in the DML backend. Now that we have an implementation for every op, might be good to revisit this sooner rather than later.

OK to solve this as a separate change.

Related details

Attention is currently required from:

Bryan Bernhart
Mingming1 Xu
Reilly Grant
ningxin hu

satisfied_requirement

unsatisfied_requirement

open

diffy

Bryan Bernhart (Gerrit)

unread,

Jun 26, 2024, 7:20:14 PM (3 days ago) Jun 26

to Mingming1 Xu, Rafael Cintron, Alex Gough, Rafael Cintron, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Mingming1 Xu, Rafael Cintron, Rafael Cintron, Reilly Grant and ningxin hu

Bryan Bernhart added 1 comment

File services/webnn/dml/context_impl_dml.cc

Line 329, Patchset 12: HandleWebNNContextLost(hr, this);

Bryan Bernhart . unresolved

[1] In certain situations, like command recording, it is not expected to generate a "lost" HRESULT error (hence no TODO), because the error was recoverable.
The problem here is that HandleWebNNContextLost() really only needs to be called from `CommandRecorder::Execute()`, instead of `CommandRecorder::CloseAndExecute()`, where HandleRecordingError() only needs to LOG.

Mingming1 Xu

What do you mean? The `CommandRecorder::CloseAndExecute()` will also call `CommandRecorder::Execute()`. And there is no any bad effect if we call the `HandleWebNNContextLost()` here, right? Just like the `GraphImplDml::HandleDispatchFailure`, `GraphImplDml::HandleComputationFailure` and `HandleGraphCreationFailure`, I handle the failure HRESULT everywhere.

Bryan Bernhart

And there is no any bad effect if we call the HandleWebNNContextLost() here, right?
If we call HandleWebNNContextLost(), the entire context becomes inoperable, even when the HRESULT may not indicate a 'lost' error. Before this CL, any recording HRESULT error was fully handled by re-creating the command recorder.
Either HandleWebNNContextLost() needs to avoid generating a lost on non-fatal HRESULT errors that are recoverable OR we must call HandleWebNNContextLost() ONLY where a lost HRESULT is expected (ex. `CommandRecorder::Execute`).

Rafael Cintron

@bryan.b...@intel.com, can you give me an example of a non-fatal HRESULT error that is recoverable by re-creating the command recorder?

Bryan Bernhart

@rafael....@microsoft.com any HRESULT that doesn't explicitly say the device will be put into a removal state could be recoverable. If the web developer builds N graphs and the N+1 graph build causes `E_INVALIDARG` or `DXGI_ERROR_INVALID_CALL` and so on then the entire app/workload will go down and the web developer won't know what graph caused it.

Rafael Cintron

My opinion, as I've stated in other comment threads in this CL, is that `E_INVALIDARG` and `DXGI_ERROR_INVALID_CALL` are not errors the browser should try to gracefully recover from. We should `CHECK` those and fail fast the GPU process.

Bryan Bernhart

@rafael....@microsoft.com D3D12 does not guarantee where a certain error comes from: CreateCommandAllocator() could generate `E_OUTOFMEMORY`. Is that not recoverable? The problem with a "no error or lost" approach is it assumes every error is now fatal. But if you still feel strongly, I will close as "by design".

Rafael Cintron

D3D12 does not guarantee where a certain error comes from: CreateCommandAllocator() could generate E_OUTOFMEMORY. Is that not recoverable?
When Chromium runs in the sandbox and the job limit is reached, a message will be sent to the browser process notifying it. The browser process will turnaround and kill the GPU process. So, while it is technically recoverable, the GPU process is doomed. :-) FWIW, WebGPU also makes E_OUTOFMEMORY turn into WebGPU context lost.
The problem with a "no error or lost" approach is it assumes every error is now fatal. But if you still feel strongly, I will close as "by design".
Device removed, device reset and similar HRESULTs should cause a graceful termination of the GPU process. This CL handles them gracefully but doesn't terminate the GPU process yet. We'll need to hook that up in a followon change similar to existing code in the GPU process.
In my opinion, HRESULTS like E_INVALIDARG should result in a CHECK. This is a bug in the DML backend that we should treat similar to CHECK failures that we're counting on the validation layer have dealt before arriving at DML code.

Bryan Bernhart

Okay, @rafael....@chromium.org I see where you're coming from.

We'll need to hook that up in a followon change similar to existing code in the GPU process.

I agree with you we'll need to fix this at some point. But once we do, E_OUTOFMEMORY will be recoverable and this code shouldn't generate a context lost (due to the reset). But I'm fine with a TODO there if the browser process won't (currently) allow it.

Related details

Attention is currently required from:

Mingming1 Xu
Rafael Cintron
Rafael Cintron
Reilly Grant
ningxin hu

satisfied_requirement

unsatisfied_requirement

open

diffy

Mingming1 Xu (Gerrit)

unread,

Jun 26, 2024, 11:01:48 PM (3 days ago) Jun 26

to Rafael Cintron, Alex Gough, Rafael Cintron, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Alex Gough, Bryan Bernhart, Rafael Cintron, Rafael Cintron, Reilly Grant and ningxin hu

Mingming1 Xu added 4 comments

File services/webnn/dml/context_impl_dml.cc

Mingming1 Xu

We'll need to hook that up in a followon change similar to existing code in the GPU process.

I am not very clear what's the followon change. Where is the existing code, can you paste link here to help me understand? Thanks!

Line 165, Patchset 19 (Latest):

  LOG(ERROR) << "[WebNN] Device Removed Reason: "
             << logging::SystemErrorCodeToString(
                    adapter_->d3d12_device()->GetDeviceRemovedReason());

ningxin hu . resolved

Should this log only for device removal error?

Mingming1 Xu

Agree, maybe we should:
`  HRESULT device_removed_reason = adapter_->d3d12_device()->GetDeviceRemovedReason();
  if (FAILED(device_removed_reason)) {
    LOG(ERROR) << "[WebNN] Device Removed Reason: "
               << logging::SystemErrorCodeToString(device_removed_reason);
  }`

Rafael Cintron

+1

Mingming1 Xu

Done

File services/webnn/public/mojom/webnn_context_provider.mojom

Line 68, Patchset 19 (Latest): pending_receiver<WebNNContextClient>? context_client_receiver;

Alex Gough . unresolved

don't really need _receiver in the name here - but up to you

Mingming1 Xu

I just want to align with the `context_remote`. Maybe the suffix will make codes easier to read. Agree that it's not necessary.

File third_party/blink/renderer/modules/ml/ml_context.h

  // Closes the `context_remote_` and `context_client_receiver_` pipes
  // because the context has been lost.
  void OnDestroyed(const String& message) override;

  void OnDisconnected();

Alex Gough . unresolved

I'm not sure of the naming of these - it's not super-obvious that it's about the remote side of the context being lost at the moment. OnDisconnected should perhaps refer to the context client being disconnected (not the /this/) and OnDestroyed should somehow refer to the other side being gone.

Mingming1 Xu

`WebNNContextClient` remote in GPU process calls the `OnDestroyed` method explicitly when some errors are captured in GPU process, it means the derived `MLContext` in the renderer process will be destroyed (lost).

The `OnDisconnected` method is called when the GPU process is crashed which will cause `WebNNContextClient` receiver disconnected. `OnDisconnected` calls `OnDestroyed` inside to destroy the `MLContext` since the mojom pipe is disconnected.

The name `OnDestroyed` is also to match WebGPU and better express that the context is unusable, from comments: this.https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/60724d4a_a12552ca/. In the future, we may plan to add another corresponding interface `Destroy()` to destroy the context from the renderer process, see discussion here: https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/d1632baa_7bdad5fb/

Related details

Attention is currently required from:

Alex Gough
Bryan Bernhart

satisfied_requirement

unsatisfied_requirement

open

diffy

ningxin hu (Gerrit)

unread,

Jun 27, 2024, 2:33:06 AM (3 days ago) Jun 27

to Mingming1 Xu, Rafael Cintron, Alex Gough, Rafael Cintron, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, Bryan Bernhart, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Alex Gough, Bryan Bernhart, Mingming1 Xu, Rafael Cintron, Rafael Cintron and Reilly Grant

ningxin hu added 8 comments

Patchset-level comments

ningxin hu . unresolved

Adapter's `HandleAdapterFailure()` of https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/adapter.cc;bpv=0;bpt=1 may also need to handle context lost.

ningxin hu

Open it for comments.

Mingming1 Xu

While the adapter is created before creating a context, if fails, there is no context.

ningxin hu

You are correct, there is no context to lose. The 'lost' error could be reported by rejecting promise returned by `createContext()`. However, my open is should we be consistent and crash GPU process for non 'lost' errors. OK to add a TODO and follow up in a separate CL.

File services/webnn/dml/context_impl_dml.cc

Line 146, Patchset 16: HandleContextLost("Failed to start recording.", hr);

ningxin hu . unresolved

`StartRecordingIfNecessary()` may already fail and call `HandleRecordingError()` which calls `HandleContextLost()` before. It may crash at `CHECK` before running the callback and this `HandleContextLost()`.

Mingming1 Xu

Do you mean we should not call `HandleRecordingError()` which calls`HandleContextLost()` in the `StartRecordingIfNecessary()`? We can call the `HandleRecordingError()` outside if `StartRecordingIfNecessary()` fails.

ningxin hu

We can call the HandleRecordingError() outside if StartRecordingIfNecessary() fails.

This might work. My point is we should ensure the behavior is consistent when error happens. The user code should get promise rejection from `readBuffer()` method.

Line 156, Patchset 16: HandleRecordingError("Failed to close and execute the command list.", hr);

ningxin hu . unresolved

`HandleRecordingError()` calls `HandleContextLost()` that may crash at `CHECK` and won't run the following callback.

Mingming1 Xu

Do you mean move the `HandleRecordingError()` below the `std::move(callback).Run...` to let the callback run at first?

ningxin hu

Yes, I suppose the user code should get promise rejection from `readBuffer()` method.

Line 323, Patchset 16:void ContextImplDml::HandleContextLost(std::string_view error_message,

ningxin hu . unresolved

This is not an error message passed to user JS code but a log message. I feel it is confusing by passing it just for logging. Could it be moved out from this method? Other helpers, like `RETURN_IF_FAILED` in [error.h](https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/error.h;l=14;drc=ae983e94583c5e2d400fa7641439d2e8ca13fd93;bpv=1;bpt=1), already log the error.

Mingming1 Xu

This change is to address @rafael....@microsoft.com 's comments: https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/f38abdf6_7c0fbbcd/

Rafael Cintron

I think Ningxin is getting confused by the two, very similarly named strings: one called `error_message` and the other one just called `message`.
The first one is meant to go into logs and provide some context on where the error occurred. The second one is meant to be a simpler error we provide to web developers that avoids revealing too much information about the context of the error to attackers.
Perhaps we can better name these to disambiguate; rename `message` to `message_for_promise` or something like that.

ningxin hu

Perhaps we can better name these to disambiguate; rename message to message_for_promise or something like that.

This helps. Or we can rename `error_message` to `message_for_log`.

Line 341, Patchset 16: OnDestroyed(base::StrCat({"WebNN context is lost due to ", message}));

ningxin hu . unresolved

Do we want to reject `lost` promise for other "internal errors"? Or we just crash the GPU process by `CHECK` an error is not an "internal error"?

Mingming1 Xu

The discussion is here: https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/ad47c8af_d3f27f05/ and https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/7aa27202_25387285/

ningxin hu

I only read @rafael....@microsoft.com's [comment](https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/ad47c8af_d3f27f05/)

3- <Everything else>
For any other error, we should treat it as Chromium passing invalid parameters to the API and do a CHECK so that we can get crash dumps and diagnose.

We should LOG(ERROR) any failure we receive from DirectX along with the logging::SystemErrorCodeToString friendly string, even ones we CHECK for. This will become critical to root causing failures in the wild from web developers. Pooja has already converted these but we should convert others that might have been missed.

I didn't find the decision that we should report "everything else" errors to JS code by rejecting `lost` promise. Did I miss anything?

File services/webnn/dml/graph_impl_dml.cc

      std::move(callback).Run(
          base::unexpected(std::move(create_operator_result.error())));

ningxin hu . unresolved

Do we handle context lost here? Given the following code handles graph creation failure (it internally handles context lost) if creating an identity node fails.

Mingming1 Xu

`void HandleGraphCreationFailure(
    const std::string& error_message,
    WebNNContextImpl::CreateGraphImplCallback callback)` won't handle context lost, only `void HandleGraphCreationFailure(
    const std::string& error_message,
    HRESULT hr,
    WebNNContextImpl::CreateGraphImplCallback callback, ContextImplDml* context)` will handle context lost, it requires the `HRESULT`.

Rafael Cintron

@ningx...@intel.com brings up a good point that this particular error path does not result in context lost.
We originally had this error code path be graceful because we didn't want to crash the GPU process for unimplemented operators in the DML backend. Now that we have an implementation for every op, might be good to revisit this sooner rather than later.
OK to solve this as a separate change.

ningxin hu

OK to solve this as a separate change.

+1. Worth adding a TODO.

Line 165, Patchset 19 (Latest):

      HandleGraphCreationFailure("Failed to create identity operator.",
                                 std::move(callback));

ningxin hu . unresolved

This may need to pass `context`.

Mingming1 Xu

There is no `HRESULT` to be handled by `HandleContextLost`.

ningxin hu

Same as above, we may need to crash GPU process for operator creation failures. Please add a TODO.

File third_party/blink/renderer/modules/ml/ml_context.h

  // Closes the `context_remote_` and `context_client_receiver_` pipes
  // because the context has been lost.
  void OnDestroyed(const String& message) override;

  void OnDisconnected();

Alex Gough . unresolved

I'm not sure of the naming of these - it's not super-obvious that it's about the remote side of the context being lost at the moment. OnDisconnected should perhaps refer to the context client being disconnected (not the /this/) and OnDestroyed should somehow refer to the other side being gone.

Mingming1 Xu

`WebNNContextClient` remote in GPU process calls the `OnDestroyed` method explicitly when some errors are captured in GPU process, it means the derived `MLContext` in the renderer process will be destroyed (lost).
The `OnDisconnected` method is called when the GPU process is crashed which will cause `WebNNContextClient` receiver disconnected. `OnDisconnected` calls `OnDestroyed` inside to destroy the `MLContext` since the mojom pipe is disconnected.
The name `OnDestroyed` is also to match WebGPU and better express that the context is unusable, from comments: this.https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/60724d4a_a12552ca/. In the future, we may plan to add another corresponding interface `Destroy()` to destroy the context from the renderer process, see discussion here: https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/d1632baa_7bdad5fb/

ningxin hu

OnDestroyed should somehow refer to the other side being gone

Would `OnContextLost()` or `OnLost()` be more straightforward? Because service-side calls `HandleContextLost()` to send this message and blink-side would resolve the `lost` promise upon receiving this message. "OnDestroyed" sounds like something else.

Related details

Attention is currently required from:

Alex Gough
Bryan Bernhart
Mingming1 Xu

Rafael Cintron
Rafael Cintron
Reilly Grant

satisfied_requirement

unsatisfied_requirement

open

diffy

ningxin hu (Gerrit)

unread,

Jun 27, 2024, 4:54:53 AM (2 days ago) Jun 27

to Mingming1 Xu, Rafael Cintron, Alex Gough, Rafael Cintron, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, Bryan Bernhart, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Alex Gough, Bryan Bernhart, Mingming1 Xu, Rafael Cintron, Rafael Cintron and Reilly Grant

ningxin hu added 1 comment

File services/webnn/dml/context_impl_dml.cc

Line 341, Patchset 16: OnDestroyed(base::StrCat({"WebNN context is lost due to ", message}));

ningxin hu . resolved

Do we want to reject `lost` promise for other "internal errors"? Or we just crash the GPU process by `CHECK` an error is not an "internal error"?

Mingming1 Xu

The discussion is here: https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/ad47c8af_d3f27f05/ and https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/7aa27202_25387285/

ningxin hu

I only read @rafael....@microsoft.com's [comment](https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/ad47c8af_d3f27f05/)
3- <Everything else>
For any other error, we should treat it as Chromium passing invalid parameters to the API and do a CHECK so that we can get crash dumps and diagnose.
We should LOG(ERROR) any failure we receive from DirectX along with the logging::SystemErrorCodeToString friendly string, even ones we CHECK for. This will become critical to root causing failures in the wild from web developers. Pooja has already converted these but we should convert others that might have been missed.
I didn't find the decision that we should report "everything else" errors to JS code by rejecting `lost` promise. Did I miss anything?

ningxin hu

Mingming helped me find another [comment](https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/7aa27202_25387285/) that I missed

Currently I handled E_OUTOFMEMORY, DXGI_ERROR_DEVICE_REMOVED, DXGI_ERROR_DEVICE_RESET three ones to call OnDestroyed, otherwise I directly return and won't call OnDestroyed.

By the time we get here, we know we have a failure HRESULT, correct? If so, we should always call OnDestroyed so that the web developer can realize that no more WebNN will happen with the context. The difference is whether the error warrants a CHECK or graceful termination of the GPU process.

I am fine with notifying user JS code with "internal error" via "lost" promise because GPU process is going to crash.

satisfied_requirement

unsatisfied_requirement

open

diffy

Mingming1 Xu (Gerrit)

unread,

Jun 27, 2024, 5:08:31 AM (2 days ago) Jun 27

to Rafael Cintron, Alex Gough, Rafael Cintron, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Alex Gough, Bryan Bernhart, Rafael Cintron, Rafael Cintron, Reilly Grant and ningxin hu

Mingming1 Xu added 12 comments

Patchset-level comments

ningxin hu . unresolved

Adapter's `HandleAdapterFailure()` of https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/adapter.cc;bpv=0;bpt=1 may also need to handle context lost.

ningxin hu

Open it for comments.

Mingming1 Xu

While the adapter is created before creating a context, if fails, there is no context.

ningxin hu

You are correct, there is no context to lose. The 'lost' error could be reported by rejecting promise returned by `createContext()`. However, my open is should we be consistent and crash GPU process for non 'lost' errors. OK to add a TODO and follow up in a separate CL.

Mingming1 Xu

https://issues.chromium.org/issues/349640007 tracked.

File services/webnn/dml/context_impl_dml.h

Line 43, Patchset 16: void HandleContextLost(std::string_view error_message, HRESULT hr);

ningxin hu . unresolved

This method may crash GPU process. It'd be better to make it explicit: i.e. something like `HandleContextLostOrCrash()`.

It might be worth adding a comment explaining which errors are treated as "context lost".

Mingming1 Xu

Updated, ptal.

File services/webnn/dml/context_impl_dml.cc

Line 146, Patchset 16: HandleContextLost("Failed to start recording.", hr);

ningxin hu . unresolved

`StartRecordingIfNecessary()` may already fail and call `HandleRecordingError()` which calls `HandleContextLost()` before. It may crash at `CHECK` before running the callback and this `HandleContextLost()`.

Mingming1 Xu

Do you mean we should not call `HandleRecordingError()` which calls`HandleContextLost()` in the `StartRecordingIfNecessary()`? We can call the `HandleRecordingError()` outside if `StartRecordingIfNecessary()` fails.

ningxin hu

We can call the HandleRecordingError() outside if StartRecordingIfNecessary() fails.
This might work. My point is we should ensure the behavior is consistent when error happens. The user code should get promise rejection from `readBuffer()` method.

Mingming1 Xu

Updated. Also cc @bryan.b...@intel.com to check the change here.

Line 156, Patchset 16: HandleRecordingError("Failed to close and execute the command list.", hr);

ningxin hu . resolved

`HandleRecordingError()` calls `HandleContextLost()` that may crash at `CHECK` and won't run the following callback.

Mingming1 Xu

Do you mean move the `HandleRecordingError()` below the `std::move(callback).Run...` to let the callback run at first?

ningxin hu

Yes, I suppose the user code should get promise rejection from `readBuffer()` method.

Mingming1 Xu

Done

Line 180, Patchset 16: HandleRecordingError("Failed to download the buffer.", hr);

ningxin hu . resolved

ditto.

Mingming1 Xu

Done

Line 323, Patchset 16:void ContextImplDml::HandleContextLost(std::string_view error_message,

ningxin hu . resolved

This is not an error message passed to user JS code but a log message. I feel it is confusing by passing it just for logging. Could it be moved out from this method? Other helpers, like `RETURN_IF_FAILED` in [error.h](https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/error.h;l=14;drc=ae983e94583c5e2d400fa7641439d2e8ca13fd93;bpv=1;bpt=1), already log the error.

Mingming1 Xu

This change is to address @rafael....@microsoft.com 's comments: https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/f38abdf6_7c0fbbcd/

Rafael Cintron

I think Ningxin is getting confused by the two, very similarly named strings: one called `error_message` and the other one just called `message`.
The first one is meant to go into logs and provide some context on where the error occurred. The second one is meant to be a simpler error we provide to web developers that avoids revealing too much information about the context of the error to attackers.
Perhaps we can better name these to disambiguate; rename `message` to `message_for_promise` or something like that.

ningxin hu

Perhaps we can better name these to disambiguate; rename message to message_for_promise or something like that.
This helps. Or we can rename `error_message` to `message_for_log`.

Mingming1 Xu

Done

Mingming1 Xu

Leave TODO tracked by https://issues.chromium.org/issues/349640008.

File services/webnn/dml/graph_impl_dml.cc

      std::move(callback).Run(
          base::unexpected(std::move(create_operator_result.error())));

ningxin hu . resolved

Do we handle context lost here? Given the following code handles graph creation failure (it internally handles context lost) if creating an identity node fails.

Mingming1 Xu

`void HandleGraphCreationFailure(
    const std::string& error_message,
    WebNNContextImpl::CreateGraphImplCallback callback)` won't handle context lost, only `void HandleGraphCreationFailure(
    const std::string& error_message,
    HRESULT hr,
    WebNNContextImpl::CreateGraphImplCallback callback, ContextImplDml* context)` will handle context lost, it requires the `HRESULT`.

Rafael Cintron

@ningx...@intel.com brings up a good point that this particular error path does not result in context lost.
We originally had this error code path be graceful because we didn't want to crash the GPU process for unimplemented operators in the DML backend. Now that we have an implementation for every op, might be good to revisit this sooner rather than later.
OK to solve this as a separate change.

ningxin hu

OK to solve this as a separate change.
+1. Worth adding a TODO.

Mingming1 Xu

Leave TODO tracked by https://issues.chromium.org/issues/349649099

      HandleGraphCreationFailure("Failed to create identity operator.",
                                 std::move(callback));

ningxin hu . resolved

This may need to pass `context`.

Mingming1 Xu

There is no `HRESULT` to be handled by `HandleContextLost`.

ningxin hu

Same as above, we may need to crash GPU process for operator creation failures. Please add a TODO.

Mingming1 Xu

Done

File services/webnn/webnn_graph_impl_unittest.cc

Line 108, Patchset 4: FakeWebNNContextImpl(mojo::PendingReceiver<mojom::WebNNContext> receiver,

Bryan Bernhart . resolved

Looks like we have no test cases which generate a context lost (and ensure delivery).
What's our plan there?

Mingming1 Xu

Yes, agree, I should add tests for that, will do in following patchset.

Mingming1 Xu

Added in patchset 10, PTAL.

Bryan Bernhart

What we added is good but I think we'll need to go further and add a CTS. To do that, there needs to be a way to force a context lost from script w/o relying on the OS/driver generating a "lost" HRESULT.
Introducing `MLContext.destroy()` which also loses the context could achieve that. But if we treat destroy as lost, the web developer could mistakenly re-create the context (when in fact, it should remain destroyed). `MLContextLostInfo` will probably need a field to the know the source of the message to avoid that. WDYT?

Rafael Cintron

Here are the [WebGPU device lost tests](https://gpuweb.github.io/cts/standalone/?q=webgpu:api,operation,device,lost:*). They're implemented with a `destroy` method on the device, which I think we should also have in WebNN.
The destroy method can come as a separate change.

Mingming1 Xu

Thanks, open https://issues.chromium.org/issues/348904836 to track.
But if we treat destroy as lost, the web developer could mistakenly re-create the context (when in fact, it should remain destroyed).
Re-creating a new context after lost is as expected, the new context can work after the GPU recovers, right?

Bryan Bernhart

Re-creating a new context after lost is as expected, the new context can work after the GPU recovers, right?
Correct but only when the context wasn't destroyed/lost on purpose. I am fine with a TODO.

Mingming1 Xu

Thinking again, what's the purpose and usecase of introducing `MLContext.destroy()` into WebNN? Is it only for testing against forcing context lost from renderer process? @rafael....@microsoft.com @bryan.b...@intel.com @ningx...@intel.com

Rafael Cintron

Destroy serves two purposes. One is for testing. The other is for the web developer to free the memory and resources taken up by the context. In the case of hybrid GPUs, that means powering down the discrete GPU and giving hours of battery life back to the end user.

Mingming1 Xu

Thanks!

File third_party/blink/renderer/modules/ml/ml_context.h

Line 165, Patchset 19: // Closes the `context_remote_` and `context_client_receiver_` pipes


  // because the context has been lost.
  void OnDestroyed(const String& message) override;

  void OnDisconnected();

Alex Gough . unresolved

I'm not sure of the naming of these - it's not super-obvious that it's about the remote side of the context being lost at the moment. OnDisconnected should perhaps refer to the context client being disconnected (not the /this/) and OnDestroyed should somehow refer to the other side being gone.

Mingming1 Xu

`WebNNContextClient` remote in GPU process calls the `OnDestroyed` method explicitly when some errors are captured in GPU process, it means the derived `MLContext` in the renderer process will be destroyed (lost).
The `OnDisconnected` method is called when the GPU process is crashed which will cause `WebNNContextClient` receiver disconnected. `OnDisconnected` calls `OnDestroyed` inside to destroy the `MLContext` since the mojom pipe is disconnected.
The name `OnDestroyed` is also to match WebGPU and better express that the context is unusable, from comments: this.https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/60724d4a_a12552ca/. In the future, we may plan to add another corresponding interface `Destroy()` to destroy the context from the renderer process, see discussion here: https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/d1632baa_7bdad5fb/

ningxin hu

OnDestroyed should somehow refer to the other side being gone
Would `OnContextLost()` or `OnLost()` be more straightforward? Because service-side calls `HandleContextLost()` to send this message and blink-side would resolve the `lost` promise upon receiving this message. "OnDestroyed" sounds like something else.

Mingming1 Xu

Rename as `OnLost` now.

File third_party/blink/renderer/modules/ml/ml_context.cc

Line 224, Patchset 4:void MLContext::OnWebNNContextLostCaptured(const String& context_lost_info) {

ningxin hu . resolved

Should this info be reported to JS code?

Mingming1 Xu

Now I reported these info related to adapter error in `src\services\webnn\dml\utils.cc`:

  `std::string context_lost_info;
  if (hr == E_OUTOFMEMORY) {
    context_lost_info = "out of memory.";
  } else if (hr == DXGI_ERROR_DEVICE_REMOVED) {
    context_lost_info = "device removed.";
  } else if (hr == DXGI_ERROR_DEVICE_HUNG) {
    context_lost_info = "device hung.";
  } else if (hr == DXGI_ERROR_DEVICE_RESET) {
    context_lost_info = "device reset.";
  } else {
    return;
  }`

Mingming1 Xu

Discussed a lot in other comments, PTAL, thanks!

Mingming1 Xu

Done

Related details

Attention is currently required from:

Alex Gough
Bryan Bernhart

Rafael Cintron
Rafael Cintron
Reilly Grant

ningxin hu

Submit Requirements:

Code-Coverage
Code-Review
No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

Bryan Bernhart (Gerrit)

unread,

Jun 27, 2024, 1:34:41 PM (2 days ago) Jun 27

to Mingming1 Xu, Rafael Cintron, Alex Gough, Rafael Cintron, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Alex Gough, Mingming1 Xu, Rafael Cintron, Rafael Cintron, Reilly Grant and ningxin hu

Bryan Bernhart added 2 comments

File services/webnn/dml/context_impl_dml.cc

Line 146, Patchset 16: HandleContextLost("Failed to start recording.", hr);

ningxin hu . unresolved

`StartRecordingIfNecessary()` may already fail and call `HandleRecordingError()` which calls `HandleContextLost()` before. It may crash at `CHECK` before running the callback and this `HandleContextLost()`.

Mingming1 Xu

Do you mean we should not call `HandleRecordingError()` which calls`HandleContextLost()` in the `StartRecordingIfNecessary()`? We can call the `HandleRecordingError()` outside if `StartRecordingIfNecessary()` fails.

ningxin hu

We can call the HandleRecordingError() outside if StartRecordingIfNecessary() fails.
This might work. My point is we should ensure the behavior is consistent when error happens. The user code should get promise rejection from `readBuffer()` method.

Mingming1 Xu

Updated. Also cc @bryan.b...@intel.com to check the change here.

Bryan Bernhart

@mingmi...@intel.com LGTM

Line 329, Patchset 12: HandleWebNNContextLost(hr, this);

Bryan Bernhart . resolved

Bryan Bernhart

@mingmi...@intel.com SGTM

I think Rafael was referring to: https://source.chromium.org/chromium/chromium/src/+/main:sandbox/policy/win/sandbox_win.cc;drc=1f14cc876cc5bf899d13284a12c451498219bb2d;l=1124

I am not very clear what's the followon change.

Perhaps one way is to assign WebNN a lower job limit, which gives us first chance to handle OOM.

File-level comment, Patchset 20 (Latest):

Related details

Attention is currently required from:

Alex Gough
Mingming1 Xu

satisfied_requirement

unsatisfied_requirement

open

diffy

Alex Gough (Gerrit)

unread,

Jun 27, 2024, 3:56:24 PM (2 days ago) Jun 27

to Mingming1 Xu, Alex Gough, Rafael Cintron, Rafael Cintron, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Mingming1 Xu, Rafael Cintron, Rafael Cintron, Reilly Grant and ningxin hu

Alex Gough voted and added 3 comments

Votes added by Alex Gough

Code-Review

+1

3 comments

Patchset-level comments

Alex Gough . resolved

lgtm mojom

File services/webnn/public/mojom/webnn_context_provider.mojom

Line 68, Patchset 19: pending_receiver<WebNNContextClient>? context_client_receiver;

Alex Gough . unresolved

don't really need _receiver in the name here - but up to you

Mingming1 Xu

I just want to align with the `context_remote`. Maybe the suffix will make codes easier to read. Agree that it's not necessary.

Alex Gough

totally up to you - feel free to resolve either way.

File third_party/blink/renderer/modules/ml/ml_context.h

Line 165, Patchset 19:

  // Closes the `context_remote_` and `context_client_receiver_` pipes
  // because the context has been lost.
  void OnDestroyed(const String& message) override;

  void OnDisconnected();

Alex Gough . resolved

I'm not sure of the naming of these - it's not super-obvious that it's about the remote side of the context being lost at the moment. OnDisconnected should perhaps refer to the context client being disconnected (not the /this/) and OnDestroyed should somehow refer to the other side being gone.

Mingming1 Xu

`WebNNContextClient` remote in GPU process calls the `OnDestroyed` method explicitly when some errors are captured in GPU process, it means the derived `MLContext` in the renderer process will be destroyed (lost).
The `OnDisconnected` method is called when the GPU process is crashed which will cause `WebNNContextClient` receiver disconnected. `OnDisconnected` calls `OnDestroyed` inside to destroy the `MLContext` since the mojom pipe is disconnected.
The name `OnDestroyed` is also to match WebGPU and better express that the context is unusable, from comments: this.https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/60724d4a_a12552ca/. In the future, we may plan to add another corresponding interface `Destroy()` to destroy the context from the renderer process, see discussion here: https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/d1632baa_7bdad5fb/

ningxin hu

OnDestroyed should somehow refer to the other side being gone
Would `OnContextLost()` or `OnLost()` be more straightforward? Because service-side calls `HandleContextLost()` to send this message and blink-side would resolve the `lost` promise upon receiving this message. "OnDestroyed" sounds like something else.

Mingming1 Xu

Rename as `OnLost` now.

Alex Gough

thanks.

File-level comment, Patchset 20 (Latest):

Related details

Attention is currently required from:

satisfied_requirement

unsatisfied_requirement

open

diffy

Rafael Cintron (Gerrit)

unread,

Jun 27, 2024, 4:45:49 PM (2 days ago) Jun 27

to Mingming1 Xu, Alex Gough, Rafael Cintron, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, Bryan Bernhart, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Bryan Bernhart, Mingming1 Xu, Rafael Cintron, Reilly Grant and ningxin hu

Rafael Cintron voted and added 2 comments

Votes added by Rafael Cintron

Code-Review

+1

2 comments

Patchset-level comments

Rafael Cintron . resolved

Still LGTM assuming issues others have raised have been fixed.

File services/webnn/dml/context_impl_dml.cc

Rafael Cintron

I agree with you we'll need to fix this at some point. But once we do, E_OUTOFMEMORY will be recoverable and this code shouldn't generate a context lost (due to the reset).

Perhaps one way is to assign WebNN a lower job limit, which gives us first chance to handle OOM.

@bryan.b...@intel.com. The job limit is for the entire GPU process. There is no way to have a different job limit for WebNN vs. everything else. Attempting to allocate memory with D3D will trigger the job limit (if you're reached it) and cause the browser process to terminate the GPU process. To make large buffer allocations gracefully fail, we'll need to guesstimate whether the process will OOM before making the allocation.

I am not very clear what's the followon change. Where is the existing code, can you paste link here to help me understand? Thanks!

@mingmi...@intel.com the code I had in mind is `GpuServiceImpl::MaybeExitOnContextLost`. If WebNN is the only thing using D3D12 in the GPU process, it can attempt to recover on its own. When several systems use it at the same time, there needs to either be coordination on restarting everything together (difficult) or coordination to self-terminate (easier).

Related details

Attention is currently required from:

Bryan Bernhart
Mingming1 Xu

Rafael Cintron
Reilly Grant
ningxin hu

Submit Requirements:

Code-Coverage
Code-Review

satisfied_requirement

unsatisfied_requirement

open

diffy

Bryan Bernhart (Gerrit)

unread,

Jun 27, 2024, 5:56:35 PM (2 days ago) Jun 27

to Mingming1 Xu, Rafael Cintron, Alex Gough, Rafael Cintron, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Mingming1 Xu, Rafael Cintron, Reilly Grant and ningxin hu

Bryan Bernhart added 1 comment

File services/webnn/dml/context_impl_dml.cc

Line 225, Patchset 4: // TODO(crbug.com/41492165): generate error using context.

Bryan Bernhart . resolved

Can we handle context lost here as well?

Reilly Grant

Are we assuming that any failure to create a buffer or graph means the context has been lost? That seems strange. I expect that context loss is a signal we specifically get from the adapter.

Bryan Bernhart

Are we assuming that any failure to create a buffer or graph means the context has been lost?
We could, like WebGPU. If we do not then the OOM skews to some, possibly a unrelated call, which does handle context lost which is hard to debug.

Reilly Grant

I don't think we should be reporting a context loss for errors like this. That's a separate concept.

Bryan Bernhart

If this returns E_OUTOFMEMORY, the context will become inoperable and effectively lost. Also, the HR could be any other error result (but unlikely) so I can see it being quite useful to coverage to one error handler.

Mingming1 Xu

This discussion should be merged into above one https://chromium-review.googlesource.com/c/chromium/src/+/5602134/comment/ad47c8af_d3f27f05/: whether we should handle the context lost according to the system error code `HRESULT`.

Mingming1 Xu

I am thinking maybe all the methods that return `HRESULT` inside or outside should be handled by `HandleWebNNContextLost`, if the `HRESULT` is context lost error, we should call `OnDestoryed`. WDYT? @rafael....@microsoft.com

Mingming1 Xu

Updated, PTAL.

Bryan Bernhart

@mingmi...@intel.com
It depends if the HRESULT error will be recoverable or not. In this case, E_OUTOFMEMORY is not recoverable and leaves the context inoperable so a "lost" seems appropriate. In other cases, this is not so [1].

Bryan Bernhart

LGTM.

Related details

Attention is currently required from:

satisfied_requirement

unsatisfied_requirement

open

diffy

Bryan Bernhart (Gerrit)

unread,

Jun 27, 2024, 7:25:25 PM (2 days ago) Jun 27

to Mingming1 Xu, Rafael Cintron, Alex Gough, Rafael Cintron, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, ningxin hu, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Mingming1 Xu, Rafael Cintron, Reilly Grant and ningxin hu

Bryan Bernhart added 1 comment

File services/webnn/dml/context_impl_dml.cc

Bryan Bernhart

There are two OOM recovery situations: the case of creating a very large buffer that would otherwise OOM vs needing to free-up existing memory from OOM (ex. command recorder reset).

In latter case, I don't think guestimates help. But if we can recover from large buffer allocations then it'll likely we can recover from other OOM failures too.

satisfied_requirement

unsatisfied_requirement

open

diffy

ningxin hu (Gerrit)

unread,

Jun 27, 2024, 9:49:41 PM (2 days ago) Jun 27

to Mingming1 Xu, Rafael Cintron, Alex Gough, Rafael Cintron, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, Bryan Bernhart, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Bryan Bernhart, Mingming1 Xu, Rafael Cintron and Reilly Grant

ningxin hu voted and added 4 comments

Votes added by ningxin hu

Code-Review

+1

4 comments

Patchset-level comments

File-level comment, Patchset 20 (Latest):

ningxin hu . resolved

Adapter's `HandleAdapterFailure()` of https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/adapter.cc;bpv=0;bpt=1 may also need to handle context lost.

ningxin hu

Open it for comments.

Mingming1 Xu

While the adapter is created before creating a context, if fails, there is no context.

ningxin hu

You are correct, there is no context to lose. The 'lost' error could be reported by rejecting promise returned by `createContext()`. However, my open is should we be consistent and crash GPU process for non 'lost' errors. OK to add a TODO and follow up in a separate CL.

Mingming1 Xu

https://issues.chromium.org/issues/349640007 tracked.

ningxin hu

Acknowledged

ningxin hu . resolved

lgtm % open comments

File services/webnn/dml/context_impl_dml.h

Line 43, Patchset 16: void HandleContextLost(std::string_view error_message, HRESULT hr);

ningxin hu . resolved

This method may crash GPU process. It'd be better to make it explicit: i.e. something like `HandleContextLostOrCrash()`.
It might be worth adding a comment explaining which errors are treated as "context lost".

Mingming1 Xu

Updated, ptal.

ningxin hu

Acknowledged

File services/webnn/dml/context_impl_dml.cc

Line 146, Patchset 16: HandleContextLost("Failed to start recording.", hr);

ningxin hu . resolved

`StartRecordingIfNecessary()` may already fail and call `HandleRecordingError()` which calls `HandleContextLost()` before. It may crash at `CHECK` before running the callback and this `HandleContextLost()`.

Mingming1 Xu

Do you mean we should not call `HandleRecordingError()` which calls`HandleContextLost()` in the `StartRecordingIfNecessary()`? We can call the `HandleRecordingError()` outside if `StartRecordingIfNecessary()` fails.

ningxin hu

We can call the HandleRecordingError() outside if StartRecordingIfNecessary() fails.
This might work. My point is we should ensure the behavior is consistent when error happens. The user code should get promise rejection from `readBuffer()` method.

Mingming1 Xu

Updated. Also cc @bryan.b...@intel.com to check the change here.

Bryan Bernhart

@mingmi...@intel.com LGTM

ningxin hu

lgtm

Related details

Attention is currently required from:

Bryan Bernhart

Mingming1 Xu
Rafael Cintron
Reilly Grant

satisfied_requirement

unsatisfied_requirement

open

diffy

Mingming1 Xu (Gerrit)

unread,

Jun 28, 2024, 2:16:55 AM (yesterday) Jun 28

to ningxin hu, Rafael Cintron, Alex Gough, Rafael Cintron, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, Reilly Grant, AyeAye, Bryan Bernhart, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Bryan Bernhart, Rafael Cintron and Reilly Grant

Mingming1 Xu added 1 comment

File services/webnn/public/mojom/webnn_context_provider.mojom

Line 68, Patchset 19: pending_receiver<WebNNContextClient>? context_client_receiver;

Alex Gough . resolved

don't really need _receiver in the name here - but up to you

Mingming1 Xu

I just want to align with the `context_remote`. Maybe the suffix will make codes easier to read. Agree that it's not necessary.

Alex Gough

totally up to you - feel free to resolve either way.

Mingming1 Xu

Thanks!

File-level comment, Patchset 27 (Latest):

Related details

Attention is currently required from:

Bryan Bernhart
Rafael Cintron
Reilly Grant

Submit Requirements:

Code-Coverage
Code-Review

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

open

diffy

Reilly Grant (Gerrit)

unread,

Jun 28, 2024, 5:30:29 PM (23 hours ago) Jun 28

to Mingming1 Xu, Reilly Grant, ningxin hu, Rafael Cintron, Alex Gough, Rafael Cintron, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, AyeAye, Bryan Bernhart, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Alex Gough, Bryan Bernhart, Mingming1 Xu, Rafael Cintron, Rafael Cintron and ningxin hu

Reilly Grant voted and added 4 comments

Votes added by Reilly Grant

Code-Review

+1

4 comments

Patchset-level comments

Reilly Grant . resolved

LGTM

File services/webnn/webnn_graph_impl_unittest.cc

Line 184, Patchset 27 (Latest):

      context_impl_->OnLost("Context is lost");
      context_impl_ = nullptr;

Reilly Grant . unresolved

This should let you remove the `DanglingUntriaged` annotation below. A call to `ExtractAsDangling()` is necessary for self-destruct functions like `OnLost()`.

```suggestion
      context_impl_.ExtractAsDangling()->OnLost("Context is lost");
```

Line 420, Patchset 27 (Latest):

  base::RunLoop().RunUntilIdle();
  EXPECT_TRUE(context_client_impl.IsLost());

Reilly Grant . unresolved

```suggestion
  base::RunUntil([&]() { return context_client_impl.IsLost(); });
```

File third_party/blink/renderer/modules/ml/ml_context.cc

Line 255, Patchset 27 (Latest): if (lost_property_->GetState() == LostProperty::kPending) {

Reilly Grant . unresolved

We can assert that this is true because `OnLost()` can only be called once.

File-level comment, Patchset 27 (Latest):

Related details

Attention is currently required from:

Alex Gough

Bryan Bernhart
Mingming1 Xu
Rafael Cintron

Rafael Cintron
ningxin hu

Submit Requirements:

Code-Coverage
Code-Review

No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

Rafael Cintron (Gerrit)

unread,

Jun 28, 2024, 8:28:40 PM (20 hours ago) Jun 28

to Mingming1 Xu, Reilly Grant, ningxin hu, Alex Gough, Rafael Cintron, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, AyeAye, Bryan Bernhart, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Alex Gough, Bryan Bernhart, Mingming1 Xu, Rafael Cintron and ningxin hu

Rafael Cintron voted and added 1 comment

Votes added by Rafael Cintron

Code-Review

+1

1 comment

Patchset-level comments

Rafael Cintron . resolved

LGTM sans Reilly issues

Related details

Attention is currently required from:

Alex Gough
Bryan Bernhart
Mingming1 Xu
Rafael Cintron

ningxin hu

Submit Requirements:

Code-Coverage
Code-Review

satisfied_requirement

unsatisfied_requirement

open

diffy

Mingming1 Xu (Gerrit)

unread,

Jun 28, 2024, 9:12:25 PM (20 hours ago) Jun 28

to Rafael Cintron, Reilly Grant, ningxin hu, Alex Gough, Rafael Cintron, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, AyeAye, Bryan Bernhart, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Alex Gough, Bryan Bernhart, Rafael Cintron, Reilly Grant and ningxin hu

Mingming1 Xu added 3 comments

File services/webnn/webnn_graph_impl_unittest.cc

Line 184, Patchset 27:

      context_impl_->OnLost("Context is lost");
      context_impl_ = nullptr;

Reilly Grant . resolved

This should let you remove the `DanglingUntriaged` annotation below. A call to `ExtractAsDangling()` is necessary for self-destruct functions like `OnLost()`.
```suggestion
      context_impl_.ExtractAsDangling()->OnLost("Context is lost");
```

Mingming1 Xu

Done

Line 420, Patchset 27:

  base::RunLoop().RunUntilIdle();
  EXPECT_TRUE(context_client_impl.IsLost());

Reilly Grant . unresolved

```suggestion
  base::RunUntil([&]() { return context_client_impl.IsLost(); });
```

Mingming1 Xu

I guess you mean `base::test::RunUntil` from "base/test/run_until.h". PTAL, thanks!

File third_party/blink/renderer/modules/ml/ml_context.cc

Line 255, Patchset 27: if (lost_property_->GetState() == LostProperty::kPending) {

Reilly Grant . resolved

We can assert that this is true because `OnLost()` can only be called once.

Mingming1 Xu

Done

Related details

Attention is currently required from:

Alex Gough
Bryan Bernhart
Rafael Cintron
Reilly Grant
ningxin hu

Submit Requirements:

Code-Coverage
Code-Review
No-Unresolved-Comments

Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings.

Gerrit

satisfied_requirement

unsatisfied_requirement

open

diffy

Mingming1 Xu (Gerrit)

unread,

Jun 28, 2024, 11:39:34 PM (17 hours ago) Jun 28

to Rafael Cintron, Reilly Grant, ningxin hu, Alex Gough, Rafael Cintron, Chromium LUCI CQ, Dwayne Robinson, Austin Sullivan, AyeAye, Bryan Bernhart, Tricium, chromium...@chromium.org, Kentaro Hara, Jiewei Qian, mac-r...@chromium.org, blink-revie...@chromium.org, blink-...@chromium.org, blundell+...@chromium.org, ipc-securi...@chromium.org, jmedle...@chromium.org

Attention needed from Alex Gough, Bryan Bernhart, Rafael Cintron, Reilly Grant and ningxin hu

Mingming1 Xu added 1 comment

File services/webnn/webnn_graph_impl_unittest.cc

Line 184, Patchset 27:

      context_impl_->OnLost("Context is lost");
      context_impl_ = nullptr;

Reilly Grant . unresolved

This should let you remove the `DanglingUntriaged` annotation below. A call to `ExtractAsDangling()` is necessary for self-destruct functions like `OnLost()`.
```suggestion
      context_impl_.ExtractAsDangling()->OnLost("Context is lost");
```

Mingming1 Xu

Done

Mingming1 Xu

If I remove `DanglingUntriaged`, following tests `WebNNGraphImplTest.ValidateDispatchTest` and `WebNNGraphImplTest.ValidateInputsTest` will fail due to `Detected dangling raw_ptr` https://ci.chromium.org/ui/p/chromium/builders/try/win-rel/664189/overview, so I have to revert here to patchset 27. PTAL.