Message: Xiyuan, this is what I have in mind for this API. I need your advice regarding the TODO in RuntimeAPI::OnRestartWatchdogTimeout() to determine what we should do when we throttle a restart request.
Description: Add a new app API to enable watchdog behavior in kiosk apps
This CL adds chrome.runtime.restartOnWatchdog(int seconds, callback) to enable watchdog behavior in kiosk apps.
Affected files (+300, -3 lines): A + chrome/test/data/extensions/api_test/runtime/restartOnWatchdog/manifest.json A chrome/test/data/extensions/api_test/runtime/restartOnWatchdog/test.js M extensions/browser/api/runtime/runtime_api.h M extensions/browser/api/runtime/runtime_api.cc M extensions/browser/api/runtime/runtime_apitest.cc M extensions/browser/extension_function_histogram_value.h M extensions/common/api/runtime.json M tools/metrics/histograms/histograms.xml
xiy...@chromium.org
unread,
May 13, 2016, 1:56:16 PM5/13/16
Reply to author
Sign in to reply to author
Forward
Sign in to forward
Delete
You do not have permission to delete messages in this group
Copy link
Report message
Show original message
Either email addresses are anonymous for this group or you need the view member email addresses permission to view the original message
to afakhr...@chromium.org, chromium...@chromium.org, chromium-a...@chromium.org, asvitki...@chromium.org, extension...@chromium.org
https://codereview.chromium.org/1970613003/diff/60001/extensions/browser/api/runtime/runtime_api.cc#newcode343 extensions/browser/api/runtime/runtime_api.cc:343: watchdog_timer_.Stop(); |watchdog_timer_| is a per-profile resource. We probably should not allow multiple extensions to use it at the same time. If it is set by extension A, we should not allow extension B to mess with it. That is, the first extension grabs the resource and other extensions should get an error.
We have to use wall clock base::Time for this. We need to be careful since the wall clock could be skewed and changed. We need to filter out insane values such as long in the past or in the future. And document the behavior in API json.
https://codereview.chromium.org/1970613003/diff/60001/extensions/browser/api/runtime/runtime_api.cc#newcode372 extensions/browser/api/runtime/runtime_api.cc:372: // TODO(afakhry): Figure out what to do here. Do we restart later? We should still reboot but after the minimum reboot time is satisfied. Propose to derive a timeout before we set the |watchdog_timer_|, where we take the function's input and use our last reboot record to extend that value to avoid rebooting too fast.
If we response when the timer fires, we need to clean up the pending calls on subsequent calls that cancels previous timeout. Otherwise, we leave a leak in the renderer.
On 2016/05/13 17:56:15, xiyuan wrote: > |watchdog_timer_| is a per-profile resource. We probably should not allow > multiple extensions to use it at the same time. If it is set by extension A, we > should not allow extension B to mess with it. That is, the first extension grabs > the resource and other extensions should get an error.
On 2016/05/13 17:56:15, xiyuan wrote: > nit: There is another Start on OneShotTimer interface that we can pass the > |this| as receive and avoid using base::Unretained if the OneShotTimer is a > member of |this|.
Yes, the problem with that interface is that I won't be able to bind |extension| and |callback| to the created callback. So I had to use a manual Bind().
On 2016/05/13 17:56:15, xiyuan wrote: > Time is always a trick topic. base::TimeTicks is a tick count and makes no sense > when system reboots [1]. We cannot store that and compare against it. > > We have to use wall clock base::Time for this. We need to be careful since the > wall clock could be skewed and changed. We need to filter out insane values such > as long in the past or in the future. And document the behavior in API json. > > [1] https://www.chromium.org/developers/design-documents/sane-time
Thanks for providing the link. Changed it to base::Time.
On 2016/05/13 17:56:15, xiyuan wrote: > Sorry for flipping on my suggestion. Let's store the reboot record in > extension's state store. See AlarmManager::WriteToStorage and its > OnExtensionLoaded.
On 2016/05/13 17:56:15, xiyuan wrote: > We should still reboot but after the minimum reboot time is satisfied. Propose > to derive a timeout before we set the |watchdog_timer_|, where we take the > function's input and use our last reboot record to extend that value to avoid > rebooting too fast.
Done. Rebooting after the minimum reboot time has passed.
On 2016/05/13 17:56:15, xiyuan wrote: > I thought we would invoke the callback when the watchdog timer is set. Maybe we > should clarify with Matt of what he expects? > > If we response when the timer fires, we need to clean up the pending calls on > subsequent calls that cancels previous timeout. Otherwise, we leave a leak in > the renderer.
Done. I made sure any previously scheduled callback is invoked and cleared before we schedule another one.
On 2016/05/14 01:27:35, afakhry wrote: > On 2016/05/13 17:56:15, xiyuan wrote: > > nit: There is another Start on OneShotTimer interface that we can pass the > > |this| as receive and avoid using base::Unretained if the OneShotTimer is a > > member of |this|. > > Yes, the problem with that interface is that I won't be able to bind |extension| > and |callback| to the created callback. So I had to use a manual Bind().
On 2016/05/16 18:25:17, xiyuan wrote: > nit: slightly prefer to combine |now| and |seconds_from_now| into one > base::Time, say restart_time.
I think this might add a bit of redundancy, since passing |restart_time| to ScheduleDelayedRestart() will make me need to do the following in the caller:
On 2016/05/16 18:25:17, xiyuan wrote: > nit: future_restart_time > last_restart_time ? > future_restart_time - last_restart_time : base::TimeDelta(); > > Just in case the wall clock on the device has changed between reboots and makes > |last_restart_time| in the future.
On 2016/05/16 18:25:17, xiyuan wrote: > We might want to make this happen before attempting to reboot. Chrome only has 3 > seconds or so time and the worry is that it is not enough to persist the data to > disk.
Done. My concern was, I didn't want to store "now" unless I make sure that request is successful. Now that we prevent non-kiosk apps from calling this API from the beginning (and this is the only way the request can fail),then it's ok to assume that the request will always be successful.
I added a NOTREACHED() if our assumption is broken.
On 2016/05/16 18:25:17, xiyuan wrote: > RestartDeviceOnWatchdogTimeout() might call Response() synchronously (e.g. on > error). And we should not call this for such cases.
Done! Thanks for catching that. I made RestartDeviceOnWatchdogTimeout() return a status enum |RestartOnWatchdogStatus|, based on which we either RespondNow() or RespondLater().
On 2016/05/17 21:23:55, xiyuan wrote: > nit: const -> constexpr and maybe update the other "const" as well. > > Learned from my other review, constexpr is preferred style.
On 2016/05/17 21:23:55, xiyuan wrote: > nit: remove "g_" prefix since it is not really a global and add a _"for_test" > suffix. > > g_allow_non_kiost_apps_restart_api -> > allow_non_kiost_apps_restart_api_for_test >
https://codereview.chromium.org/1970613003/diff/140001/extensions/browser/api/runtime/runtime_api.cc#newcode375 extensions/browser/api/runtime/runtime_api.cc:375: storage->GetExtensionValue( The fact that we need to store this value implies to me that this is supposed to carry across device power ons/offs - is that correct? If so, don't we need to read it at startup somewhere? Also, should we make sure that we don't restart right after turning on? If it's not supposed to persist, could we just have this as a property of the runtime api?
On 2016/05/20 17:26:32, Devlin wrote: > If we have a min time of 3 hours, does "as much accuracy as possible" really > matter?
The minimum time is for any two successive restarts resulting from this API; i.e. when an app schedules a restart after 2 seconds from now, and succeeds, the following restart that this app can schedule after restart is 3 hours later. Nothing prevents the app from calling this API once every 1 second for example.
The accuracy here really matters in our browsertests, since we're talking about a couple of seconds. :D
On 2016/05/20 17:26:32, Devlin wrote: > The fact that we need to store this value implies to me that this is supposed to > carry across device power ons/offs - is that correct? If so, don't we need to > read it at startup somewhere? Also, should we make sure that we don't restart > right after turning on? If it's not supposed to persist, could we just have > this as a property of the runtime api?
Yes, it is supposed to persist accross device power ons/off.
I don't think restarting after first turning on matters. From what I understand about the purpose of this API is that it's made for kiosk apps so that they can call it repeatedly as a way to tell Chrome OS that the app is functioning and behaving properly.
Let's say we have a device that was just powered on. We have on it a kiosk app that starts running. The app developer decided to call this API every 10 seconds (for instance). It calls it the first time. few seconds later, it hangs forever. It will miss the next API call. Our API will recognize this as an app failure and will restart the device, hopefully so that the app can be updated automatically. After restart however, this app will not be able to schedule any restarts sooner than 3 hours, even if it hangs again after say 15 seconds from the beginning.
Little bit of history: I was told that we're adding this API to avoid a problem that took place with a kiosk app that caused the whole system to enter a bad state to point, someone had to be sent to every single device running that app to restart it manually. (I'm sorry, I don't have the exact details).
We can raise this issue on the bug thread as well, and ask for clarification regarding the expected behavior.
On 2016/05/20 17:26:32, Devlin wrote: > This strikes me as a little scary, since we'll almost always leak this. Is there > a reason we can't just let the extension know if the restart was properly > scheduled, rather than executing a callback later?
Could you please clarify how we leak it? I tried to make sure, we don't schedule any new restart until we're completely done with the previous one. Also we don't schedule anyone (and hence call RespondLater()) until we're quite sure no errors or invalid arguments or state where encountered.
On 2016/05/20 17:26:33, Devlin wrote: > Are we guaranteed to only have one of these?
This API is only allowed to be called by only one extension (the first to invoke it) and it must be running in kiosk mode. Any subsequent calls to this API will cancel the previous one. So yes, at any point in time, we have only a single "current" restart request.
On 2016/05/20 17:26:33, Devlin wrote: > For consistency, success should be indicated by whether or not lastError is set, > not by a separate return result.
I'm sorry, I don't get this comment. Not sure which lastError or which separate return result?
I think I understand the point of it, my question was why we're storing it. From my understood use cases (which align with what you described), there's no reason to store it persistently rather than just as a property on the API. Am I missing something?
On 2016/05/21 01:13:00, afakhry wrote: > On 2016/05/20 17:26:32, Devlin wrote: > > This strikes me as a little scary, since we'll almost always leak this. Is > there > > a reason we can't just let the extension know if the restart was properly > > scheduled, rather than executing a callback later? > > Could you please clarify how we leak it? I tried to make sure, we don't schedule > any new restart until we're completely done with the previous one. Also we don't > schedule anyone (and hence call RespondLater()) until we're quite sure no errors > or invalid arguments or state where encountered. > > This also helps a lot with the browsertests.
For leaking, I was expecting the restart function to be synchronous. So either we restarted (in which case this never returns, and leaks) or didn't restart (in which case we wait indefinitely). I suppose depending on how often we replace the restart request, it may not be all the time. Even so, this pattern seems really weird. I'd much rather invoke the callback immediately and just have a state of success vs failure in scheduling the restart.
On 2016/05/21 01:13:00, afakhry wrote: > On 2016/05/20 17:26:33, Devlin wrote: > > For consistency, success should be indicated by whether or not lastError is > set, > > not by a separate return result. > > I'm sorry, I don't get this comment. Not sure which lastError or which separate > return result?
chrome.runtime.lastError is used to indicate whether an api call succeeded or failed, and is the canonical way of indicating success/failure in an API method. Here, you're returning a separate parameter ('success') to indicate it. We shouldn't deviate from established patterns.
The minimum time 3 hours between API reboots means if last reboot time is T, the next reboot time must be >= T + 3 hours. I think we would have to persist T since the device would be rebooted and lose all in memory states between the two runs. Right?
Right, but I think we should store that when/if we restart the device, not when we schedule the restart, since the latter could be common, while the former should be quite rare.
The code here is to read the persist value to calculate the time for scheduling the reboot. The SetExtensionValue call is around line 455 just before we reboot the device.
Ah, whoops! Misread this. You're right. Though I think we should still store this time as a member on the API to avoid re-reading it each time we schedule a restart.
On 2016/05/23 16:49:39, Devlin wrote: > On 2016/05/21 01:13:00, afakhry wrote: > > On 2016/05/20 17:26:32, Devlin wrote:
> > > Is constexpr preferred by Chromium style now? > > > > Yes. Please, see: http://chromium-cpp.appspot.com/#core-whitelist > > From there "Don't go out of the way to convert existing code." So I'd say make > the new ones constexpr, but leave these.
On 2016/05/23 16:49:39, Devlin wrote: > On 2016/05/21 01:13:00, afakhry wrote:
> > On 2016/05/20 17:26:33, Devlin wrote: > > > For consistency, success should be indicated by whether or not lastError is > > set, > > > not by a separate return result. > > > > I'm sorry, I don't get this comment. Not sure which lastError or which > separate > > return result? > > chrome.runtime.lastError is used to indicate whether an api call succeeded or > failed, and is the canonical way of indicating success/failure in an API method. > Here, you're returning a separate parameter ('success') to indicate it. We > shouldn't deviate from established patterns.
https://codereview.chromium.org/1970613003/diff/180001/extensions/browser/api/runtime/runtime_apitest.cc#newcode241 extensions/browser/api/runtime/runtime_apitest.cc:241: IN_PROC_BROWSER_TEST_F(RestartOnWatchdogApiTest, RestartOnWatchdogTest) { Now that you've simplified some of the tests a bit (yay!), we can actually do all these in a unittest. See extension_function_test_utils for how to run a function without a JS counterpart. Then we get the advantage of unittest speed + not needing to go back and forth between the two test files to understand what's going on. (Sorry for not mentioning this earlier - it didn't occur to me that we really didn't need the JS part here for what we're testing.)
On 2016/05/25 21:53:21, Devlin wrote: > nit: might be a bit easier to read these if the numbers matched the test number > - maybe something like "responseN"?
On 2016/05/25 21:53:21, Devlin wrote: > API question - should we automatically reschedule for the minimum time, or just > throw an error saying "too soon after last restart"?
Me and Xiyuan agreed to schedule a restart after the minimum time has passed.
On 2016/05/25 21:53:21, Devlin wrote: > Given we call this in both cases (here and on line 416), might just make sense > to move it up to 366 and then return here with a comment "// We already stopped > the running timer."
On 2016/05/25 21:53:22, Devlin wrote: > Now that you've simplified some of the tests a bit (yay!), we can actually do > all these in a unittest. See extension_function_test_utils for how to run a > function without a JS counterpart. Then we get the advantage of unittest speed > + not needing to go back and forth between the two test files to understand > what's going on. (Sorry for not mentioning this earlier - it didn't occur to me > that we really didn't need the JS part here for what we're testing.)
For this tricky and sensitive API, I think we should keep these browser tests for the following reasons: - It makes sense to have this test along side the other runtime api tests here. - Isn't a browsertest is the closest test environment to reality? - I also found it very convenient modifying the separate JS part and re-run the test immediately without having to recompile browsertests agin. - Having the JS part will be helpful to the test team, giving them an example of how the API is used so that they can add their own tests for the Test Review bug here: https://bugs.chromium.org/p/chromium/issues/detail?id=607413
Browser tests are orders of magnitude slower and flakier than unit tests. Additionally, all of the "tricky and sensitive" aspects of this API are confined here, in the browser process, which can be in a normal state in a unittest. The renderer component is just a vanilla extension API with no added logic (as is seen here by the lack of renderer/ modified files), and if our extension API system breaks in the renderer, we'll have hundreds of other browser tests fail. Unit tests should almost always be preferred to browser tests, and I don't think this is an exception.
https://codereview.chromium.org/1970613003/diff/220001/extensions/common/api/runtime.json#newcode279 extensions/common/api/runtime.json:279: "description": "A callback to be invoked when a restart request was successfully rescheduled or when an error occurs in which case runtime.lastError can be checked.", the second part of this ("or when an error occurs...") is probably unnecessary, since that's true of every API.
On 2016/06/01 21:29:15, Devlin wrote: > Spinning for 3 seconds is really innefficient. What if we reset the min delay > to 0 and schedule it for 0 seconds, then reset it back to 3 before the next > call? Or, you could use a test clock, if you'd prefer that approach.
Right. Since we reset (using -1) in the previous request, We can now request with 1 second. Which will be accomplished a second later. I also lowered the minimum time between two successive restarts to 2 seconds instead of 3, so the next restart request (with 1 second) will be throttled to 2 seconds instead.
So now the whole test takes 3 seconds instead of 7.
On 2016/06/01 21:29:15, Devlin wrote: > Can we lazily initialize this?
What's the point? It will be needed in the constructor to set it for the |main_context_|, so lazily ends up being right away. Unless you mean something else?
On 2016/06/01 21:29:15, Devlin wrote: > the second part of this ("or when an error occurs...") is probably unnecessary, > since that's true of every API.
On 2016/06/02 01:43:40, afakhry wrote: > On 2016/06/01 21:29:15, Devlin wrote: > > Can we lazily initialize this? > > What's the point? It will be needed in the constructor to set it for the > |main_context_|, so lazily ends up being right away. Unless you mean something > else?
My thought was that we could move the UserPrefs::Set() call into the initialization of the pref service as well. In practice, though, this probably doesn't matter much. This should be fine.
On 2016/06/03 21:26:47, Devlin wrote: > Could we just inherit from TestRuntimeAPIDelegate and override the methods we > care about?
Done. This was needed for the browser tests as I needed to keep using the *real* delegate, which also had to b used to RemoveUpdateObserver() at the end of the test.
On 2016/06/03 21:26:47, Devlin wrote: > Do we need to wait for this restart, if we already check that a) restarts work > (we checked that with line 167) and b) the scheduled restart time is correct > (checked with line 177)? If not, omit this so we shave off a couple of seconds.
Right. We don't have to wait for it. Done. Tests only takes 1 sec now.
On 2016/06/03 21:26:47, Devlin wrote: > cancelled vs canceled. Typically, we use american spelling (even though it's > wrong! ;)), but it looks like in this case, they are pretty dead even split, so > I have no preference. But I'd like to at least be consistent between line 62 > and 63 :)
I wonder if there's a way we can get around this. What if we did something like: In the delayed restart, if we succeed in scheduling it, assume it will succeed, and persist both the last restart time and a bool that says "restarted due to delayed restart". In the ctor (or, maybe better yet, even lazily based on the first restart on delay request) check the "restarted due to delayed" bit, and, if it's present, we were restarted due to the API and we enforce the throttle. Then clear the bit, and any future calls (including across non-delayed restarts) will not have the bit, and we won't throttle.
On 2016/06/10 00:34:45, Devlin wrote: > This worries me for a couple of reasons: > - We typically want to steer clear of NOTIFICATIONs. > - This means we're blocking shutdown on persisting something to disk, which is > something we really like to avoid. > > I wonder if there's a way we can get around this. What if we did something > like: > In the delayed restart, if we succeed in scheduling it, assume it will succeed, > and persist both the last restart time and a bool that says "restarted due to > delayed restart". > In the ctor (or, maybe better yet, even lazily based on the first restart on > delay request) check the "restarted due to delayed" bit, and, if it's present, > we were restarted due to the API and we enforce the throttle. Then clear the > bit, and any future calls (including across non-delayed restarts) will not have > the bit, and we won't throttle. > > WDYT?
This would not work because the restartAfterDelay is supposed to be called periodically by the app to implement a heart beat. Clearing the flag on first call essentially makes the throttling ineffective.
I think the throttling is overly complicated to implement with the API. We should probably take a step back. The purpose of the throttling is to avoid/break reboot loop for bad apps. Instead of doing this in API, how about we take another approach that we keep track of auto launch times and stop launching the app if we do it too fast (say 5 times in last 15 minutes). This way, we break the potential reboot loop. And we do it on device boot instead of during shutdown. The API would be much simpler without the throttling logic.
Sorry, I was unclear. I meant we should have a member did_restart_due_to_delayed_restart_ on the RuntimeAPI, which is set if the preference bit is set (and then clear the preference bit, but the member stays set for the lifetime of the RuntimeAPI, which is also the lifetime of the profile). I believe that would be sufficient, though, as I said, setting in the RuntimeAPI ctor is also an option.
> I think the throttling is overly complicated to implement with the API. We > should probably take a step back. The purpose of the throttling is to > avoid/break reboot loop for bad apps. Instead of doing this in API, how about we > take another approach that we keep track of auto launch times and stop launching > the app if we do it too fast (say 5 times in last 15 minutes). This way, we > break the potential reboot loop. And we do it on device boot instead of during > shutdown. The API would be much simpler without the throttling logic.
I'm fine with this, too, though I'm not as familiar with the code so don't know how complex it will become. I just don't want to slow down shutdown if we can avoid it. Anything that avoids that is fine with me.
On 2016/06/10 00:34:45, Devlin wrote: > This worries me for a couple of reasons: > - We typically want to steer clear of NOTIFICATIONs. > - This means we're blocking shutdown on persisting something to disk, which is > something we really like to avoid. > > I wonder if there's a way we can get around this. What if we did something > like: > In the delayed restart, if we succeed in scheduling it, assume it will succeed, > and persist both the last restart time and a bool that says "restarted due to > delayed restart". > In the ctor (or, maybe better yet, even lazily based on the first restart on > delay request) check the "restarted due to delayed" bit, and, if it's present, > we were restarted due to the API and we enforce the throttle. Then clear the > bit, and any future calls (including across non-delayed restarts) will not have > the bit, and we won't throttle. > > WDYT?
I didn't get this comment until I read your explanation on IM. Thanks for the clarification over there. Done.
I'm sure you know this, and probably would think it's unnecessary, but the reason is alignment and padding. It's really bad to have a less alignment restrictive type objects in the middle of more restrictive ones.
Assuming Linux and 64-bit architecture: bool is 1 byte ---> alignment requirement is 1 byte (so can be allocated in any address).
The previous member |browser_context_| is a pointer. Pointers on 64-bit machines are 8 bytes. ---> alignment requirements is 8 bytes.
The following member |registrar_| has only one member of type std::vector<>. While the implementation of std::vector may vary, but let's assume the one I have locally in /usr/include/. It has three pinters: pointer _M_start; pointer _M_finish; pointer _M_end_of_storage;
Again 8 byte aligned. Let's roughly sketch the memory layout:
+---+---+---+---+---+---+---+---+ | | | | | | | | | |<----- browser_context_ ------>| | | | | | | | | | +---+---+---+---+---+---+---+---+ | | X | X | X | X | X | X | X | | b | X | X | X | X | X | X | X | | | X | X | X | X | X | X | X | +---+---+---+---+---+---+---+---+ | | | | | | | | | |<--------- _M_start ---------->| | | | | | | | | | +---+---+---+---+---+---+---+---+ | | | | | | | | | |<--------- _M_finish --------->| | | | | | | | | | +---+---+---+---+---+---+---+---+ | | | | | | | | | |<---- _M_end_of_storage ------>| | | | | | | | | | +---+---+---+---+---+---+---+---+
'b' is our bool |dispatch_chrome_updated_event_|, and the X's are useless padding bytes.
So I moved it to the end and placed our two new bools after it, in order to fill two of those padding bytes.
I could have just placed the new bools after |dispatch_chrome_updated_event_| while it was still up here, but I have this personal preference of having the bools at the end of the object.
I didn't want our new bools to introduce extra new padding bytes.
Realistically, since this is a singleton per profile, it's probably never gonna matter, but I'm not going to stand in the way of minor optimizations :) Thanks for the detailed answer!
On 2016/06/14 16:27:21, Devlin wrote: > It looks like all occurrences of RunFunctionGetError() and > RunFunctionAssertNoError() just pass in a new > RuntimeRestartAfterDelayFunction(). Can we omit that argument and just > construct it in the respective functions?
Not all calls though, we have one that passes RuntimeRestartFunction.
On 2016/06/14 16:27:21, Devlin wrote: > When would this be false (exempting cases of e.g. pref corruption)? If this is > false, won't it also cause problems on line 461?
Could be corruption, or the device wall clock has changed somehow, making |last_delayed_restart_time_| in the future.
Thanks for spotting this issue! It should be TimeDelta::Max() to avoid throttling in this case.