I didn't review the detailed CL yet. Trying to see whether we could trigger some corner cases...
media: Report hardware-context-reset as an error if not recoveredThe change looks good overall. I am trying to see whether there are corner cases that we could miss. Do you know whether HCR can ever happen when the player is paused? e.g. user paused the video, then drag the window to a different GPU, or close the lid of the laptop etc... Asking because in a paused state the media time won't change...
Also, I'll add @daoy...@microsoft.com here, to see whether there are other (better) signals we could use to detect this.
within a very short time window, so we should consider this as anHere and in other places in this CL, "within a very short time window" is ambiguous. Does that mean media time or wall time?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
within a very short time window, so we should consider this as anHere and in other places in this CL, "within a very short time window" is ambiguous. Does that mean media time or wall time?
Good point. renamed to `a ver short media time winodw` in the commit msg and code.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Code-Review | +1 |
UMA_HISTOGRAM_BOOLEAN(nit: if this histogram isn't reported very often, it's generally preferable to use a function of a macro
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
nit: if this histogram isn't reported very often, it's generally preferable to use a function of a macro
Fixed (to use `base::UmaHistogramBoolean` instead). Thanks!
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
media: Report hardware-context-reset as an error if not recoveredThe change looks good overall. I am trying to see whether there are corner cases that we could miss. Do you know whether HCR can ever happen when the player is paused? e.g. user paused the video, then drag the window to a different GPU, or close the lid of the laptop etc... Asking because in a paused state the media time won't change...
Also, I'll add @daoy...@microsoft.com here, to see whether there are other (better) signals we could use to detect this.
Since `media_time_on_last_hardware_context_reset` is set only when hardware context reset happens, failing logic with reporting `PIPELINE_ERROR_HARDWARE_CONTEXT_RESET` error for the subsequent error will not be a problem but there is a chance we report the recovery result as NotRecoverd because the media time didn't update due to the pause.
Do you know whether HCR can ever happen when the player is paused? e.g. user paused the video, then drag the window to a different GPU, or close the lid of the laptop etc
Yes, it can happen at any time including during start, suspend, resume etc.
Asking because in a paused state the media time won't change
This is a good point since we could end up report an incorrect recovery result since the media time didn't tick with no error.
One reason why I report the recovery result UMA at the end of WMPI lifecycle was I observed WMPI is stalled after the first HCR error and wanted to consider it as an unrecovered HCR case. But, you're right the HCR case with paused video will fall into this category even if it's recovered internally.
Do I need to track the video pause state? Or should we report immediately the obvious unrecovery HCR case in the OnError() but no record for recovered cases? Any thoughts? @xhw...@chromium.org @daoy...@microsoft.com
media: Report hardware-context-reset as an error if not recoveredThe change looks good overall. I am trying to see whether there are corner cases that we could miss. Do you know whether HCR can ever happen when the player is paused? e.g. user paused the video, then drag the window to a different GPU, or close the lid of the laptop etc... Asking because in a paused state the media time won't change...
Also, I'll add @daoy...@microsoft.com here, to see whether there are other (better) signals we could use to detect this.
I do believe so, due to timing or such as calling UpdateVideo in the dual gpu case, even if MF is paused, it would still return invalid hwdrm state
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
media: Report hardware-context-reset as an error if not recoveredDaoyuan LiThe change looks good overall. I am trying to see whether there are corner cases that we could miss. Do you know whether HCR can ever happen when the player is paused? e.g. user paused the video, then drag the window to a different GPU, or close the lid of the laptop etc... Asking because in a paused state the media time won't change...
Also, I'll add @daoy...@microsoft.com here, to see whether there are other (better) signals we could use to detect this.
I do believe so, due to timing or such as calling UpdateVideo in the dual gpu case, even if MF is paused, it would still return invalid hwdrm state
hmmm if you mean if we should consider the case of HCR in paused state, then yes, I do believe it should be considered.
media: Report hardware-context-reset as an error if not recoveredDaoyuan LiThe change looks good overall. I am trying to see whether there are corner cases that we could miss. Do you know whether HCR can ever happen when the player is paused? e.g. user paused the video, then drag the window to a different GPU, or close the lid of the laptop etc... Asking because in a paused state the media time won't change...
Also, I'll add @daoy...@microsoft.com here, to see whether there are other (better) signals we could use to detect this.
Daoyuan LiI do believe so, due to timing or such as calling UpdateVideo in the dual gpu case, even if MF is paused, it would still return invalid hwdrm state
hmmm if you mean if we should consider the case of HCR in paused state, then yes, I do believe it should be considered.
Regardless of video pause state, failing logic with reporting `PIPELINE_ERROR_HARDWARE_CONTEXT_RESET` is independent since it only gets triggered by the second error. And we need the last media time at the first HCR error to determine whether the second error happened within a short media time window.
Examples (exhausively),
- a) video is paused -> 1st HCR -> video recovered (assuming still in pause state). no subsequent error -> no failure will happen.
- a-1) If user unpaused the video, then `RecoveryResult` UMA will be true in the end.
- a-2) However, if user refreshes the tab or video element is removed somehow, `RecoveryResult` UMA will be false since the media time didn't update.
- b) video is paused -> 1st HCR -> video recovered (assuming still in pause state) -> 2nd HCR (moving the video window or hibernation etc) -> fail with HCR pipeline status error since the media time didn't update after the 1st recovery.
- I think this is okay to consider it as a recovery failure case.
- c) video is paused -> 1st HCR -> 2nd error -> fail with HCR pipeline status error.
- d) video is paused -> 1st HCR -> unpaused by user or automatic (media time ticks) -> 2nd HCR error -> no failure will happen if media time is elaspsed > 100ms but consider the 2nd HCR as a new HCR error so we update the last media time.
- e) video is NOT paused -> 1st HCR -> video recovered but paused by user (how fast user can pause? can they pause within 100ms after seeing the recovered video?) -> 2nd HCR -> no failure will happen but consider the 2nd HCR as a new HCR error so we update the last media time.
- f) video is NOT paused -> 1st HCR -> video recovered but paused by user -> unpaused -> no failure will happen.
One case that would report an incorrect `RecoveryResult` would be a-2) above.
If you agree with that the video pause state doesn't affect the HCR recovery failure detection, my question is how accurate should the `RecoveryResult` UMA be?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
media: Report hardware-context-reset as an error if not recoveredDaoyuan LiThe change looks good overall. I am trying to see whether there are corner cases that we could miss. Do you know whether HCR can ever happen when the player is paused? e.g. user paused the video, then drag the window to a different GPU, or close the lid of the laptop etc... Asking because in a paused state the media time won't change...
Also, I'll add @daoy...@microsoft.com here, to see whether there are other (better) signals we could use to detect this.
Daoyuan LiI do believe so, due to timing or such as calling UpdateVideo in the dual gpu case, even if MF is paused, it would still return invalid hwdrm state
Sangbaek Parkhmmm if you mean if we should consider the case of HCR in paused state, then yes, I do believe it should be considered.
Regardless of video pause state, failing logic with reporting `PIPELINE_ERROR_HARDWARE_CONTEXT_RESET` is independent since it only gets triggered by the second error. And we need the last media time at the first HCR error to determine whether the second error happened within a short media time window.
Examples (exhausively),
- a) video is paused -> 1st HCR -> video recovered (assuming still in pause state). no subsequent error -> no failure will happen.
- a-1) If user unpaused the video, then `RecoveryResult` UMA will be true in the end.
- a-2) However, if user refreshes the tab or video element is removed somehow, `RecoveryResult` UMA will be false since the media time didn't update.
- b) video is paused -> 1st HCR -> video recovered (assuming still in pause state) -> 2nd HCR (moving the video window or hibernation etc) -> fail with HCR pipeline status error since the media time didn't update after the 1st recovery.
- I think this is okay to consider it as a recovery failure case.
- c) video is paused -> 1st HCR -> 2nd error -> fail with HCR pipeline status error.
- d) video is paused -> 1st HCR -> unpaused by user or automatic (media time ticks) -> 2nd HCR error -> no failure will happen if media time is elaspsed > 100ms but consider the 2nd HCR as a new HCR error so we update the last media time.
- e) video is NOT paused -> 1st HCR -> video recovered but paused by user (how fast user can pause? can they pause within 100ms after seeing the recovered video?) -> 2nd HCR -> no failure will happen but consider the 2nd HCR as a new HCR error so we update the last media time.
- f) video is NOT paused -> 1st HCR -> video recovered but paused by user -> unpaused -> no failure will happen.One case that would report an incorrect `RecoveryResult` would be a-2) above.
If you agree with that the video pause state doesn't affect the HCR recovery failure detection, my question is how accurate should the `RecoveryResult` UMA be?
agreed that a-2 would be the issue. although I am also not sure how accurate this needs to be as it is a false positive, I would be curious to see how often this false positive may hit (I guess if we can determine the state of the pipeline eg pause/play, then we can know if a-2 is worth differentiating)
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
I don't think it would be worth differentiating a-2 case since the handling logic itself works as expected regardless of pause state. Any opinion, @xhw...@chromium.org?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Thanks for listing all the cases, which makes the discussion a lot easier!!!
I actually didn't pay attention to the UMA reporting. My original comment was more about what users will experience in production. Case (b) seems to be problematic since throughout the process there's no no-recoverable failure, but we triggered a MediaError. I don't really know how common the case is, that whether we should consider the playing state in the logic, or just ignore it and always trigger an error. Thoughts?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
For the case b), I've tested with simply tester, shaka-player and Canal+:
I can proto-type by adding `paused_on_hardware_context_reset_` to track the last pause state at the time of HCR. For the 2nd error, if `paused_on_hardware_context_reset_` and the current `paused_` are true, then treat it as 1st error by updating the last media time. But, I still doubt if this complexity is worth it? @xhw...@chromium.org WDYT?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Commit-Queue | +1 |
Added handling logic for video pause cases. Since testing manually is tricky, I've added new unit tests to cover all scenarios:
```
PipelineErrorHardwareContextReset_PlayingVideo_NotRecovered
PipelineErrorHardwareContextReset_PlayingVideo_Recovered
PipelineErrorHardwareContextReset_PausedVideo_NotRecovered
PipelineErrorHardwareContextReset_PausedVideo_Recovered
```
@xhw...@chromium.org, PTAL agian, thanks!
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
I like the new approach, which addressed my concern. I have some questions, mostly asking for more clarification. Thanks!
media time has advanced since the reset.It's a bit hard to get a clear understanding of all the cases. Can we combine this with line 9-13 and describe the algorithm as a whole? I guess part of the reason is that 9-13 discusses when it's unrecoverable, but here we discuss when it's recoverable...
return !(time_difference <= time_difference_tolerance);Instead of !<=, just use >?
"Media.PipelineStatus.HardwareContextReset.RecoveryResult", is_recovered);nit: If you follow naming convention, this can just be "IsRecovered".
// of restart. If the video was paused on the first hardware context reset and
// the video is still paused, we cannot conclude that this is anIf you translate this into code, it's
`paused_on_hardware_context_reset_ && paused_`
Bit the check in the code is
`!paused_on_hardware_context_reset_ && !paused_`
whose invert is
`paused_on_hardware_context_reset_ || paused_`
So there's a mismatch here. Could you please help check what the exact behavior should be?
!IsHardwareContextResetRecovered(This function name is a bit misleading. Whether it's recovered depends on not only the media time, but other factors as well (e.g. pause state). If the other comment below makes sense, maybe we don't need this to be a function. Or maybe we can find a better name, e.g. HasMediaTimeSufficentlyElapsed()?
pipeline_controller_->GetMediaTime());Since this is reported in WMPI's dtor; I am not sure what it means to still compare media time...
Does it make sense to just say if we HCR happened (media_time_on_last_hardware_context_reset_.has_value()) but we never send a PIPELINE_ERROR_HARDWARE_CONTEXT_RESET MediaError, then we call it good (recovered)?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
media: Report hardware-context-reset as an error if not recoveredDone
media time has advanced since the reset.It's a bit hard to get a clear understanding of all the cases. Can we combine this with line 9-13 and describe the algorithm as a whole? I guess part of the reason is that 9-13 discusses when it's unrecoverable, but here we discuss when it's recoverable...
Done
return !(time_difference <= time_difference_tolerance);Instead of !<=, just use >?
Done
"Media.PipelineStatus.HardwareContextReset.RecoveryResult", is_recovered);nit: If you follow naming convention, this can just be "IsRecovered".
Done
// of restart. If the video was paused on the first hardware context reset and
// the video is still paused, we cannot conclude that this is anIf you translate this into code, it's
`paused_on_hardware_context_reset_ && paused_`
Bit the check in the code is
`!paused_on_hardware_context_reset_ && !paused_`
whose invert is
`paused_on_hardware_context_reset_ || paused_`
So there's a mismatch here. Could you please help check what the exact behavior should be?
Great finding! Done. Updated the comment and logic.
!IsHardwareContextResetRecovered(This function name is a bit misleading. Whether it's recovered depends on not only the media time, but other factors as well (e.g. pause state). If the other comment below makes sense, maybe we don't need this to be a function. Or maybe we can find a better name, e.g. HasMediaTimeSufficentlyElapsed()?
Done
pipeline_controller_->GetMediaTime());Since this is reported in WMPI's dtor; I am not sure what it means to still compare media time...
Does it make sense to just say if we HCR happened (media_time_on_last_hardware_context_reset_.has_value()) but we never send a PIPELINE_ERROR_HARDWARE_CONTEXT_RESET MediaError, then we call it good (recovered)?
That goes back to the original discussion point where how accurately we want to report this result. PipelineStatus signals come throguh OnError() only. By default, it's PIPELINE_OK. And the problem is once the 1st HCR happened but no subsequent error occurred yet but then we can't tell it is recovered (e.g., the video is stalled and waiting for user action or no streaming data or no new MediaKeySession etc).
Did you mean we want to report the result right away at the time of OnError() call with the 2nd error there and update a flag like `is_hardware_context_reset_recovery_reported=true`? And then dtor() checks this flag to report the recovery result as false if media_time_on_last_hardware_context_reset_.has_value() is true?
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
We had an offline discussion. Other than that there are some nits.
// https://crbug.com/1208618Add the new bug number here as well, since the old bug doesn't catch the new behavior.
const base::TimeDelta& media_time_on_last_hardware_context_reset,This can just be `previous_media_time` now.
const base::TimeDelta& current_media_time) { // we should consider this as an unrecoverable error.The comment belongs to the logic on line 1980. This function only compares time and doesn't need this comment.
const auto time_difference_tolerance = base::Milliseconds(100);Use KTimeDifferenceTolerance style for const names.
// Hardware context reset is not an error. Restart to recover.We had an offline discussion, trying to make this logic easier to follow, and similar to the UMA reporting part. Please let me know when you have an update. Thanks!
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
pipeline_controller_->GetMediaTime());Sangbaek ParkSince this is reported in WMPI's dtor; I am not sure what it means to still compare media time...
Does it make sense to just say if we HCR happened (media_time_on_last_hardware_context_reset_.has_value()) but we never send a PIPELINE_ERROR_HARDWARE_CONTEXT_RESET MediaError, then we call it good (recovered)?
That goes back to the original discussion point where how accurately we want to report this result. PipelineStatus signals come throguh OnError() only. By default, it's PIPELINE_OK. And the problem is once the 1st HCR happened but no subsequent error occurred yet but then we can't tell it is recovered (e.g., the video is stalled and waiting for user action or no streaming data or no new MediaKeySession etc).
Did you mean we want to report the result right away at the time of OnError() call with the 2nd error there and update a flag like `is_hardware_context_reset_recovery_reported=true`? And then dtor() checks this flag to report the recovery result as false if media_time_on_last_hardware_context_reset_.has_value() is true?
Summary from the offline discussion:
1. Report the new UMA as the last pipeline status after the first hardware context reset.
2. OnError(): if two HCRs occurred within the short period of media time and video is not paused etc, we report the last pipeline status as `PIPELINE_ERROR_HARDWARE_CONTEXT_RESET``. For other error codes, we don't force it to `PIPELINE_ERROR_HARDWARE_CONTEXT_RESET`.
3. On WMPI dtor(): We report the last pipeline status if the UMA was never reported by OnError() with the first HCR.
4. Consider reporting the time delta from the first hardware context reset OnError.
5. For unit tests, if the unit test breaks the infinite loop, it should be enough.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Add the new bug number here as well, since the old bug doesn't catch the new behavior.
Done
const base::TimeDelta& media_time_on_last_hardware_context_reset,This can just be `previous_media_time` now.
Done
Done
The comment belongs to the logic on line 1980. This function only compares time and doesn't need this comment.
Done
const auto time_difference_tolerance = base::Milliseconds(100);Use KTimeDifferenceTolerance style for const names.
Done
// Hardware context reset is not an error. Restart to recover.Sangbaek ParkWe had an offline discussion, trying to make this logic easier to follow, and similar to the UMA reporting part. Please let me know when you have an update. Thanks!
Thanks for the discussions! I'll handle these in the other comment: http://crrev.com/c/7609444/comment/a68f7fa3_6876a115/
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Is this CL ready for another round of review? I saw there's one comment outstanding and some bots failing... Thanks!
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
Is this CL ready for another round of review? I saw there's one comment outstanding and some bots failing... Thanks!
Sorry, not ready. The bots failing will be fixed. I'm just trying to add unit tests but very tricky. I'll let you know once it's ready. I might want to merge request as well. Thanks!
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
pipeline_controller_->GetMediaTime());Sangbaek ParkSince this is reported in WMPI's dtor; I am not sure what it means to still compare media time...
Does it make sense to just say if we HCR happened (media_time_on_last_hardware_context_reset_.has_value()) but we never send a PIPELINE_ERROR_HARDWARE_CONTEXT_RESET MediaError, then we call it good (recovered)?
Sangbaek ParkThat goes back to the original discussion point where how accurately we want to report this result. PipelineStatus signals come throguh OnError() only. By default, it's PIPELINE_OK. And the problem is once the 1st HCR happened but no subsequent error occurred yet but then we can't tell it is recovered (e.g., the video is stalled and waiting for user action or no streaming data or no new MediaKeySession etc).
Did you mean we want to report the result right away at the time of OnError() call with the 2nd error there and update a flag like `is_hardware_context_reset_recovery_reported=true`? And then dtor() checks this flag to report the recovery result as false if media_time_on_last_hardware_context_reset_.has_value() is true?
Summary from the offline discussion:
1. Report the new UMA as the last pipeline status after the first hardware context reset.
2. OnError(): if two HCRs occurred within the short period of media time and video is not paused etc, we report the last pipeline status as `PIPELINE_ERROR_HARDWARE_CONTEXT_RESET``. For other error codes, we don't force it to `PIPELINE_ERROR_HARDWARE_CONTEXT_RESET`.
3. On WMPI dtor(): We report the last pipeline status if the UMA was never reported by OnError() with the first HCR.
4. Consider reporting the time delta from the first hardware context reset OnError.
5. For unit tests, if the unit test breaks the infinite loop, it should be enough.
I've reflected our discussion. I've locally tested and verified. But, I've removed newly added tests since I couldn't make them run correctly due to the complexity of handing 2-3 pipeline errors with play/pause/seek. Please review the updated logic first, @xhw...@chromium.org. Thanks!
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
void ReportLastPipelineStatusForHardwareContextResetRecoveryUMA(nit: This reports two UMAs, maybe ReportHardwareContextResetRecoveryUMAs?
If we do this, then rename `is_pipeline_status_reported_after_hardware_context_reset_` to `has_reported_hardware_context_reset_recovery_umas_`.
<< ", media_time_diff=" << media_time_diff.InMicrosecondsF() << "ms";"ms" stands for millisecond while the number is microseconds.
Or you can just do "<< media_time_diff" which will print the duration in seconds in floating point: https://source.chromium.org/chromium/chromium/src/+/main:base/time/time.cc;drc=10b6082ae57b34c1d623275ed00dae987fb2a590;l=91
// time has advanced since the reset.sufficiently
}This can be written as the following which might be easier to read:
```
bool can_conclude_recovery_result = !(paused_on_hardware_context_reset_ && paused_);
```
Then this makes me wonder about this case (we touched on this on Friday but didn't discuss further into it):
If we follow the currently logic, "can_conclude_recovery_result" should be true. But it seems to be a bit questionable, since time won't change if the new state is "paused".
If that makes sense, then we don't need to track `paused_on_hardware_context_reset_`, and only need to check `paused_`, which will actually make the logic a lot simpler.
WDYT?
}What if we don't report here, and just leave the destructor to report the last pipeline status? It seems it can work, but I am probably missing some cases...
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
void ReportLastPipelineStatusForHardwareContextResetRecoveryUMA(nit: This reports two UMAs, maybe ReportHardwareContextResetRecoveryUMAs?
If we do this, then rename `is_pipeline_status_reported_after_hardware_context_reset_` to `has_reported_hardware_context_reset_recovery_umas_`.
Done
<< ", media_time_diff=" << media_time_diff.InMicrosecondsF() << "ms";"ms" stands for millisecond while the number is microseconds.
Or you can just do "<< media_time_diff" which will print the duration in seconds in floating point: https://source.chromium.org/chromium/chromium/src/+/main:base/time/time.cc;drc=10b6082ae57b34c1d623275ed00dae987fb2a590;l=91
Done. Good catch! I'd blaim VSCode auto complete is very aggressive but my mistake.
// time has advanced since the reset.Sangbaek Parksufficiently
Done
This can be written as the following which might be easier to read:
```
bool can_conclude_recovery_result = !(paused_on_hardware_context_reset_ && paused_);
```Then this makes me wonder about this case (we touched on this on Friday but didn't discuss further into it):
- playing (not paused) when 1st HCR happened
- paused when second HCR happened
- media time didn't change.
If we follow the currently logic, "can_conclude_recovery_result" should be true. But it seems to be a bit questionable, since time won't change if the new state is "paused".
If that makes sense, then we don't need to track `paused_on_hardware_context_reset_`, and only need to check `paused_`, which will actually make the logic a lot simpler.
WDYT?
Done. Actually, I prefer this simpler approach. I've removed `paused_on_hardware_context_reset_`.
What if we don't report here, and just leave the destructor to report the last pipeline status? It seems it can work, but I am probably missing some cases...
What if we don't report here, and just leave the destructor to report the last pipeline status?
Yes
How can we report the recovery is successful here? Because this location is OnError() method which is only called when an error occurs. This means we don't know when to conclude the HCR case is recovered until end. All preivous UMAs by WMPI have been only reported in dtor (UKM is the same by dtor of MediaMetricsProvider which is destroyed by WMPI). We might lose some successful recovery cases but other UMAs/UKMs will see the same issues.
pipeline_controller_->GetMediaTime());Sangbaek ParkSince this is reported in WMPI's dtor; I am not sure what it means to still compare media time...
Does it make sense to just say if we HCR happened (media_time_on_last_hardware_context_reset_.has_value()) but we never send a PIPELINE_ERROR_HARDWARE_CONTEXT_RESET MediaError, then we call it good (recovered)?
Sangbaek ParkThat goes back to the original discussion point where how accurately we want to report this result. PipelineStatus signals come throguh OnError() only. By default, it's PIPELINE_OK. And the problem is once the 1st HCR happened but no subsequent error occurred yet but then we can't tell it is recovered (e.g., the video is stalled and waiting for user action or no streaming data or no new MediaKeySession etc).
Did you mean we want to report the result right away at the time of OnError() call with the 2nd error there and update a flag like `is_hardware_context_reset_recovery_reported=true`? And then dtor() checks this flag to report the recovery result as false if media_time_on_last_hardware_context_reset_.has_value() is true?
Sangbaek ParkSummary from the offline discussion:
1. Report the new UMA as the last pipeline status after the first hardware context reset.
2. OnError(): if two HCRs occurred within the short period of media time and video is not paused etc, we report the last pipeline status as `PIPELINE_ERROR_HARDWARE_CONTEXT_RESET``. For other error codes, we don't force it to `PIPELINE_ERROR_HARDWARE_CONTEXT_RESET`.
3. On WMPI dtor(): We report the last pipeline status if the UMA was never reported by OnError() with the first HCR.
4. Consider reporting the time delta from the first hardware context reset OnError.
5. For unit tests, if the unit test breaks the infinite loop, it should be enough.
I've reflected our discussion. I've locally tested and verified. But, I've removed newly added tests since I couldn't make them run correctly due to the complexity of handing 2-3 pipeline errors with play/pause/seek. Please review the updated logic first, @xhw...@chromium.org. Thanks!
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
The new logic looks good and clean! I just have one question on the timedelta UMA, and one minor comment on comments.
media_time_diff);Reporting this timedelta in OnError makes a lot of sense. We are reporting the time delta between HCR and the next error or HCR.
But reporting it in the destructor feels odd. Destruction can happen any time when there's no error. It feels it could pollute the data? WDYT?
// If the video was paused on hardware context reset, we consider the
// recovery successful only if the video is not paused anymore and the media
// time has advanced sufficiently since the reset.
// The same hardware context reset (or a different error) occurred within a
// very short media time window, so we should consider this as an
// unrecoverable hardware context reset and give up to avoid an infinite loop
// of restart. If the video is paused, we don't conclude recovery, so we
// should not give up but keep trying to restart until we recover or the video
// is unpaused by user.Thanks for the detailed comment. I had some difficulty following the logic in these comments. Here's my try based on my understanding. Does it make sense to you?
An error or another hardware context reset (HCR) occurred within a very short media time window from the previous HCR and the video is not paused (playing). In this case we consider the recovery attempt from the previous HCR failed. To avoid infinite loop of retrying, we will trigger an error and will not ScheduleRestart(). We gate the logic on short elapsed media time to have more confidence that the error (or another HCR) are related to the previous HCR. We check the paused state because if the video is paused, the media time can't advance.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
@xhw...@chromium.org, You will see some trybot failures.
I've verified by adding new unit tests to cover the new behavior. However, the newly added unit tests are failing on trybots. Locally running 20 times doesn't have any issue so looks like there is some timing issue when being destroyed or something else. I'll update here.
Reporting this timedelta in OnError makes a lot of sense. We are reporting the time delta between HCR and the next error or HCR.
But reporting it in the destructor feels odd. Destruction can happen any time when there's no error. It feels it could pollute the data? WDYT?
it could pollute the data
I agree with this. Done not to report the time delta in dtor.
// If the video was paused on hardware context reset, we consider the
// recovery successful only if the video is not paused anymore and the media
// time has advanced sufficiently since the reset.
// The same hardware context reset (or a different error) occurred within a
// very short media time window, so we should consider this as an
// unrecoverable hardware context reset and give up to avoid an infinite loop
// of restart. If the video is paused, we don't conclude recovery, so we
// should not give up but keep trying to restart until we recover or the video
// is unpaused by user.Thanks for the detailed comment. I had some difficulty following the logic in these comments. Here's my try based on my understanding. Does it make sense to you?
An error or another hardware context reset (HCR) occurred within a very short media time window from the previous HCR and the video is not paused (playing). In this case we consider the recovery attempt from the previous HCR failed. To avoid infinite loop of retrying, we will trigger an error and will not ScheduleRestart(). We gate the logic on short elapsed media time to have more confidence that the error (or another HCR) are related to the previous HCR. We check the paused state because if the video is paused, the media time can't advance.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
| Code-Review | +1 |
LGTM! Thanks for all the discussions. I didn't review the unit tests in detail and will leave it to you.
base::UmaHistogramTimes(
"Media.PipelineStatus.HardwareContextResetRecovery."
"TimeDeltaSinceLastHardwareContextReset",
media_time_diff.value());optional: This is only reported in one place, it might be simpler to just pull this out of the function and report directly in the code...
Sangbaek ParkWhat if we don't report here, and just leave the destructor to report the last pipeline status? It seems it can work, but I am probably missing some cases...
What if we don't report here, and just leave the destructor to report the last pipeline status?
Yes
How can we report the recovery is successful here? Because this location is OnError() method which is only called when an error occurs. This means we don't know when to conclude the HCR case is recovered until end. All preivous UMAs by WMPI have been only reported in dtor (UKM is the same by dtor of MediaMetricsProvider which is destroyed by WMPI). We might lose some successful recovery cases but other UMAs/UKMs will see the same issues.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
@xhw...@chromium.org, You will see some trybot failures.
I've verified by adding new unit tests to cover the new behavior. However, the newly added unit tests are failing on trybots. Locally running 20 times doesn't have any issue so looks like there is some timing issue when being destroyed or something else. I'll update here.
I found the root cause and fixed it. It was due to the recent change by http://crrev.com/c/chromium/src/+/7609443, which requires to call WMPI's Shutdown() first before reset(). My local repo was behind this change.
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |
29 is the latest approved patch-set.
The change was submitted with unreviewed changes in the following files:
```
The name of the file: third_party/blink/renderer/core/exported/web_media_player_impl_unittest.cc
Insertions: 58, Deletions: 8.
The diff is too large to show. Please review the diff.
```
media: Report hardware-context-reset as an error if not recovered
If the same hardware context reset (or a different error) occurred
within a very short media time window, so we should consider this as an
unrecoverable hardware context reset and give up to avoid an infinite
loop of restart:
- If the video has been playing during the subsequent error, we will
report PIPELINE_ERROR_HARDWARE_CONTEXT_RESET(23) error code via
MediaError.
- If a subsequent error occurs, we can conclude whether it's recovered
nor not only when the video is not paused. In this case, we update the
last media time of hardware context reset.
Example error message:
```
video.error.code: MEDIA_ERR_DECODE
video.error.message: PipelineStatus::PIPELINE_ERROR_HARDWARE_CONTEXT_RESET: MediaFoundationRenderer error: kFailedToSetOutputRect (Error (0x13D) while retrieving error. (0x8004CD12))
```
Also verified error code 23 (PIPELINE_ERROR_HARDWARE_CONTEXT_RESET) is
reported in this case:
```
- Histogram: Media.PipelineStatus.AudioVideo.H264.MediaFoundationRenderer recorded 1 samples, mean = 23.0 (flags = 0x41) [#]
0 ...
23 -O (1 = 100.0%) {0.0%}
24 ...
```
The new UMA
`Media.PipelineStatus.HardwareContextResetRecovery.LastPipelineStatus`
is reported to tell the last pipeline status after hardware context
reset. This will be PIPELINE_OK if recovered. This will be
PIPELINE_ERROR_HARDWARE_CONTEXT_RESET if unrecovered by a subsequent
error within a very short media time window. If a different error
occurs outside of the time window, that error will be reported.
`*.TimeDeltaSinceLastHardwareContextReset` will be reported together,
indicating the media time difference since the last hardware context
reset error. For example,
```
- Histogram: Media.PipelineStatus.HardwareContextResetRecovery.LastPipelineStatus recorded 2 samples, mean = 11.5 (flags = 0x41) [#]
0 -O (1 = 50.0%)
1 ...
23 -O (1 = 50.0%) {50.0%}
24 ...
- Histogram: Media.PipelineStatus.HardwareContextResetRecovery.TimeDeltaSinceLastHardwareContextReset recorded 2 samples, mean = 0.0 (flags = 0x41) [#]
0 --O (2 = 100.0%)
1 ...
```
Updated the existing unit test for the new behavior:
`WebMediaPlayerImplTest.PipelineErrorHardwareContextReset_Twice`
Added new unit tests:
```
PipelineErrorHardwareContextReset_PlayingVideo_NotRecovered
PipelineErrorHardwareContextReset_PlayingVideo_Recovered
PipelineErrorHardwareContextReset_PausedVideo_InconclusiveRecovery
PipelineErrorHardwareContextReset_PausedVideo_Recovered
```
| Inspect html for hidden footers to help with email filtering. To unsubscribe visit settings. |