| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Currently, dispatching and reading tensor are scheduled to aNit: remove this "a"
1. If handling context lost is scheduled after reading output
tensor, renderer process can still complete reading the tensor
with an invaild result before the
`MLTensor::OnConnectionError`[3] is invoked.Can we mark a context "is being lost"? A tensor reading can check that and handle it correctly?
2. If handling context lost is scheduled before reading output
tensor, renderer process will throw an error: "Tensor has been
destroyed or context is lost." in `MLTensor::OnConnectionError`.This is the expected behavior, correct?
broadcasted across all contexts. The broadcast is unnecessary
and may cause a security issue where contexts created on
different web pages could share the context lost reason.Are the context lost reasons non-recoverable system-wide error, e.g. device-removal? Why does broadcasting them cause security issue?
Even before this change, the broadcast was not guaranteed
due to a race condition between killing the GPU process[4]
on the main thread and invoking OnLost on each context's
owning thread.Does gpu scheduler destruction ensure all tasks are scheduled? @bryan.b...@intel.com
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
broadcasted across all contexts. The broadcast is unnecessary
and may cause a security issue where contexts created on
different web pages could share the context lost reason.Are the context lost reasons non-recoverable system-wide error, e.g. device-removal? Why does broadcasting them cause security issue?
It depends on whether contexts share the same device; only contexts on different devices (GPU vs NPU) see independent reasons.
Currently, all contexts share a provider, but this will change to per-renderer providers [1], making broadcasting a lost reason safe.
We could store the reason in the provider and have each context pull it locally, possibly using `RunOrPostTaskAndWaitOnSequence` for sequence safety. Otherwise, keeping the reason only for the context that detects the loss is probably fine for now.
Even before this change, the broadcast was not guaranteed
due to a race condition between killing the GPU process[4]
on the main thread and invoking OnLost on each context's
owning thread.Does gpu scheduler destruction ensure all tasks are scheduled? @bryan.b...@intel.com
No. Destroying the GPU scheduler / killing the GPU process does not ensure any pending tasks have been scheduled or executed.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Currently, dispatching and reading tensor are scheduled to aXu, Mingming1Nit: remove this "a"
Done
1. If handling context lost is scheduled after reading output
tensor, renderer process can still complete reading the tensor
with an invaild result before the
`MLTensor::OnConnectionError`[3] is invoked.Can we mark a context "is being lost"? A tensor reading can check that and handle it correctly?
Currently, the GPU process is possible to be killed by ourselves on main thread before `OnLost` task is executed on WebNN's owning thread. The callback of reading tensor will still have no chance to be executed. But that's fine, because the renderer process will also throw an error: "Tensor has been
destroyed or context is lost." in `MLTensor::OnConnectionError`
2. If handling context lost is scheduled before reading output
tensor, renderer process will throw an error: "Tensor has been
destroyed or context is lost." in `MLTensor::OnConnectionError`.This is the expected behavior, correct?
Correct!
broadcasted across all contexts. The broadcast is unnecessary
and may cause a security issue where contexts created on
different web pages could share the context lost reason.Bernhart, BryanAre the context lost reasons non-recoverable system-wide error, e.g. device-removal? Why does broadcasting them cause security issue?
It depends on whether contexts share the same device; only contexts on different devices (GPU vs NPU) see independent reasons.
Currently, all contexts share a provider, but this will change to per-renderer providers [1], making broadcasting a lost reason safe.
We could store the reason in the provider and have each context pull it locally, possibly using `RunOrPostTaskAndWaitOnSequence` for sequence safety. Otherwise, keeping the reason only for the context that detects the loss is probably fine for now.
For some non-system-wide errors and independent errors from different devices, the lost reasons should not be broadcasted, right?
Currently, all contexts share a provider, but this will change to per-renderer providers [1], making broadcasting a lost reason safe.
Agree, now all contexts even on different web pages share one provider, so I think sharing the lost reason across contexts is not safe.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Code-Review | +1 |
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Add @rafael....@microsoft.com as this CL introduces a behavior change across DML and ORT backends.
DestroyAllContextsAndKillGpuProcess`[2], a race condition canNit: can this part be appended into previous line?
1. If handling context lost is scheduled after reading output
tensor, renderer process can still complete reading the tensor
with an invaild result before the
`MLTensor::OnConnectionError`[3] is invoked.Xu, Mingming1Can we mark a context "is being lost"? A tensor reading can check that and handle it correctly?
Currently, the GPU process is possible to be killed by ourselves on main thread before `OnLost` task is executed on WebNN's owning thread. The callback of reading tensor will still have no chance to be executed. But that's fine, because the renderer process will also throw an error: "Tensor has been
destroyed or context is lost." in `MLTensor::OnConnectionError`
Acknowledged
broadcasted across all contexts. The broadcast is unnecessary
and may cause a security issue where contexts created on
different web pages could share the context lost reason.Bernhart, BryanAre the context lost reasons non-recoverable system-wide error, e.g. device-removal? Why does broadcasting them cause security issue?
Xu, Mingming1It depends on whether contexts share the same device; only contexts on different devices (GPU vs NPU) see independent reasons.
Currently, all contexts share a provider, but this will change to per-renderer providers [1], making broadcasting a lost reason safe.
We could store the reason in the provider and have each context pull it locally, possibly using `RunOrPostTaskAndWaitOnSequence` for sequence safety. Otherwise, keeping the reason only for the context that detects the loss is probably fine for now.
For some non-system-wide errors and independent errors from different devices, the lost reasons should not be broadcasted, right?
Currently, all contexts share a provider, but this will change to per-renderer providers [1], making broadcasting a lost reason safe.
Agree, now all contexts even on different web pages share one provider, so I think sharing the lost reason across contexts is not safe.
For some non-system-wide errors and independent errors from different devices, the lost reasons should not be broadcasted, right?
Agreed broadcasting errors across origins are inappropriate. But after this change, we are not able to notify all contexts in one origin the lost reason, correct?
Currently, all contexts share a provider, but this will change to per-renderer providers [1], making broadcasting a lost reason safe.
This sounds like the intended behavior, could we add a TODO when removing the broadcasting code?
Even before this change, the broadcast was not guaranteed
due to a race condition between killing the GPU process[4]
on the main thread and invoking OnLost on each context's
owning thread.Bernhart, BryanDoes gpu scheduler destruction ensure all tasks are scheduled? @bryan.b...@intel.com
No. Destroying the GPU scheduler / killing the GPU process does not ensure any pending tasks have been scheduled or executed.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
DestroyAllContextsAndKillGpuProcess`[2], a race condition canNit: can this part be appended into previous line?
Done
broadcasted across all contexts. The broadcast is unnecessary
and may cause a security issue where contexts created on
different web pages could share the context lost reason.Bernhart, BryanAre the context lost reasons non-recoverable system-wide error, e.g. device-removal? Why does broadcasting them cause security issue?
Xu, Mingming1It depends on whether contexts share the same device; only contexts on different devices (GPU vs NPU) see independent reasons.
Currently, all contexts share a provider, but this will change to per-renderer providers [1], making broadcasting a lost reason safe.
We could store the reason in the provider and have each context pull it locally, possibly using `RunOrPostTaskAndWaitOnSequence` for sequence safety. Otherwise, keeping the reason only for the context that detects the loss is probably fine for now.
Hu, NingxinFor some non-system-wide errors and independent errors from different devices, the lost reasons should not be broadcasted, right?
Currently, all contexts share a provider, but this will change to per-renderer providers [1], making broadcasting a lost reason safe.
Agree, now all contexts even on different web pages share one provider, so I think sharing the lost reason across contexts is not safe.
For some non-system-wide errors and independent errors from different devices, the lost reasons should not be broadcasted, right?
Agreed broadcasting errors across origins are inappropriate. But after this change, we are not able to notify all contexts in one origin the lost reason, correct?
Currently, all contexts share a provider, but this will change to per-renderer providers [1], making broadcasting a lost reason safe.
This sounds like the intended behavior, could we add a TODO when removing the broadcasting code?
Agreed broadcasting errors across origins are inappropriate. But after this change, we are not able to notify all contexts in one origin the lost reason, correct?
Correct. While even before this change, the broadcast was not guaranteed due to a race condition between [killing the GPU process](https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/webnn_context_provider_impl.cc;l=177) on the main thread and invoking `OnLost` on each context's owning thread.
TODO is added.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Code-Review | +1 |
broadcasted across all contexts. The broadcast is unnecessary
and may cause a security issue where contexts created on
different web pages could share the context lost reason.Bernhart, BryanAre the context lost reasons non-recoverable system-wide error, e.g. device-removal? Why does broadcasting them cause security issue?
Xu, Mingming1It depends on whether contexts share the same device; only contexts on different devices (GPU vs NPU) see independent reasons.
Currently, all contexts share a provider, but this will change to per-renderer providers [1], making broadcasting a lost reason safe.
We could store the reason in the provider and have each context pull it locally, possibly using `RunOrPostTaskAndWaitOnSequence` for sequence safety. Otherwise, keeping the reason only for the context that detects the loss is probably fine for now.
Hu, NingxinFor some non-system-wide errors and independent errors from different devices, the lost reasons should not be broadcasted, right?
Currently, all contexts share a provider, but this will change to per-renderer providers [1], making broadcasting a lost reason safe.
Agree, now all contexts even on different web pages share one provider, so I think sharing the lost reason across contexts is not safe.
Xu, Mingming1For some non-system-wide errors and independent errors from different devices, the lost reasons should not be broadcasted, right?
Agreed broadcasting errors across origins are inappropriate. But after this change, we are not able to notify all contexts in one origin the lost reason, correct?
Currently, all contexts share a provider, but this will change to per-renderer providers [1], making broadcasting a lost reason safe.
This sounds like the intended behavior, could we add a TODO when removing the broadcasting code?
Agreed broadcasting errors across origins are inappropriate. But after this change, we are not able to notify all contexts in one origin the lost reason, correct?
Correct. While even before this change, the broadcast was not guaranteed due to a race condition between [killing the GPU process](https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/webnn_context_provider_impl.cc;l=177) on the main thread and invoking `OnLost` on each context's owning thread.
TODO is added.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
OnLost(error_message);
DestroyAllContextsAndKillGpuProcess();I see that both places which call DestroyAllContextsAndKillGpuProcess also now call OnLost. Would it ever make sense to call one without the other? If not, consider combining these into one, perhaps better named, helper function.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
OnLost(error_message);
DestroyAllContextsAndKillGpuProcess();I see that both places which call DestroyAllContextsAndKillGpuProcess also now call OnLost. Would it ever make sense to call one without the other? If not, consider combining these into one, perhaps better named, helper function.
I am thinking it's possible to only call `OnLost` for some errors which can be recovered by only re-creating contexts instead of killing GPU process.
See https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/context_impl_dml.cc;l=999, here we only call `OnLost` for non-device-removal errors without killing GPU process.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Code-Review | +1 |
OnLost(error_message);
DestroyAllContextsAndKillGpuProcess();Xu, Mingming1I see that both places which call DestroyAllContextsAndKillGpuProcess also now call OnLost. Would it ever make sense to call one without the other? If not, consider combining these into one, perhaps better named, helper function.
I am thinking it's possible to only call `OnLost` for some errors which can be recovered by only re-creating contexts instead of killing GPU process.
See https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/dml/context_impl_dml.cc;l=999, here we only call `OnLost` for non-device-removal errors without killing GPU process.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
WebNN: Schedule `OnLost` task immediately to avoid race conditions
Currently, dispatching and reading tensor are scheduled to each
context's owning thread in sequence. `WebNNContextImpl::OnLost`[1] task
to handle the context lost during dispatching is also scheduled to the
context's owning thread, but since the scheduling is called from the
GPU main thread by
`WebNNContextProviderImpl::DestroyAllContextsAndKillGpuProcess`[2], a
race condition can happen in the following scenario:
dispatch -> handle context lost ->read output tensor
1. If handling context lost is scheduled after reading output tensor,
renderer process can still complete reading the tensor with an invalid
result before the `MLTensor::OnConnectionError`[3] is invoked.
2. If handling context lost is scheduled before reading output tensor,
renderer process will throw an error: "Tensor has been destroyed or
context is lost." in `MLTensor::OnConnectionError`. This is the
expected behavior.
WebNN conformance tests can give flaky results. To address this issue,
this CL decouples `OnLost` from `DestroyAllContextsAndKillGpuProcess`
to schedule `OnLost` task immediately, ensuring the `OnLost` is invoked
before reading an output tensor.
After this change, the context lost reason will not be broadcasted
across all contexts. The broadcast is unnecessary and may cause a
security issue where contexts created on different web pages could
share the context lost reason. Even before this change, the broadcast
was not guaranteed due to a race condition between killing the GPU
process[4] on the main thread and invoking OnLost on each context's
owning thread.
[1] https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/webnn_context_impl.cc;l=366
[2] https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/webnn_context_provider_impl.cc;l=169
[3] https://source.chromium.org/chromium/chromium/src/+/main:third_party/blink/renderer/modules/ml/webnn/ml_tensor.cc;l=327
[4] https://source.chromium.org/chromium/chromium/src/+/main:services/webnn/webnn_context_provider_impl.cc;l=177
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |