Re: Too many leftover ConnectJob::OnTimeout delayed tasks

8 views
Skip to first unread message

Randy Smith

unread,
May 1, 2017, 11:49:54 AM5/1/17
to Matt Menke, Helen Li, net-dev, ss...@chromium.org, fdo...@chromium.org, Taiju Tsuiki, schedu...@chromium.org
[+scheduler-dev]

On Mon, May 1, 2017 at 11:47 AM, Randy Smith <rds...@chromium.org> wrote:


On Mon, May 1, 2017 at 11:44 AM, Matt Menke <mme...@chromium.org> wrote:
Randy:  I'm not sure how well that approach works.  With Lucky Luke, I believe the plan is that there will be more SequencedTaskRunners, which means that timers also need to be scoped to those.  So we'd either need per-SequencedTaskRunner lists (For the aforementioned many SequencedTaskRunners, which solve the problem for net, but in general, not sure how shared we can assume SequencedTaskRunners to be), or a global lock that can post things to any SequencedTaskRunner.

Hmmm.  Good point.  I still think that we're in better shape for this current situation having only one task outstanding per STR, but there might well be other contexts in which we'd be in worse shape.  Which might argue for an alternative timer implementation, or just implementing this type structure inside the network stack :-J.  (I don't really think the global lock approach is a good one, but maybe I'm dismissing it too quickly.)

I'd still evaluate some numbers before throwing this one out.  Currently, AIUI, we have one outstanding task per timer; reducing that to one/STR seems like it would be a definite win, even if not as large a win as compared to the current situation.

-- Randy
 

On Mon, May 1, 2017 at 11:09 AM, Helen Li <xunj...@chromium.org> wrote:
Thanks, Randy. Is the wakeup implemented by a post task with a shorter delay? If so, won't we run into the same problem? (failing to cancel the posted task)

On Mon, May 1, 2017 at 10:58 AM Randy Smith <rds...@chromium.org> wrote:
On Mon, May 1, 2017 at 10:47 AM, Helen Li <xunj...@chromium.org> wrote:
Context is at: https://bugs.chromium.org/p/chromium/issues/detail?id=716187. Ssid@ cc-ed discovered that there can be many ConnectJob::OnTimeout delayed tasks (with posted delays on the order of seconds and even minutes) in MessageLoop::|delayed_work_queue_|. 

ConnectJob and its subclasses post a ConnectJob::OnTimeout delayed task so that we can abort any connection attempts that are taking too long. However, in most situations, because ConnectJobs come and go pretty quickly, we have a rather large backlog in MessageLoop::|delayed_work_queue_|.  

We could rework ConnectJob's timeout logic to not post so many tasks or have so many leftover delayed tasks. I don't see a way to do this cleanly. It will be a major undertaking.  mmenke@ pointed out in an earlier thread that other parts of the codebase also follow this pattern (https://cs.chromium.org/search/?q=%26%5Ba-z:%5D*timeout+package:%5Echromium$&type=cs). We should explore if there is a general solution to this problem. Maybe we can rework MessageLoop's delayed tasks logic to also detect "canceled"/"invalid" tasks? I don't know the code well to see how feasible this is. 

Thoughts? 

As noted on the bug, I think the sweet spot for a rework is in the timer implementation.  The timer interface allows for easy cancellation semantics (if the timer is destroyed, that should cancel the wakeup).  If the extra accounting is done inside timers (central object that timers for a given thread coordinate on, that central object only has one outstanding posted task at a time to wakeup the next timer that needs wakeup, timer deletion notifies that central object so that it doesn't actually schedule the wakeup, all complex accounting is contained within timers).  Is there some aspect of the current requirements for which this solution wouldn't work/isn't optimal?

-- Randy
 



François Doray

unread,
May 1, 2017, 12:56:39 PM5/1/17
to Randy Smith, Matt Menke, Daniel Cheng, Helen Li, net-dev, ss...@chromium.org, Taiju Tsuiki, schedu...@chromium.org

There are 2 problems:
  1. base::Timer doesn't cancel tasks from the MessageLoop/TaskScheduler point of view. Stop() just causes the tasks to become no-op.
  2. MessageLoop does not cleanup cancelled tasks.
  3. TaskScheduler does not cleanup cancelled tasks.
We have solutions to these problems:
  1. gab@ is currently refactoring base::Timer and will make sure that MessageLoop/TaskScheduler can see when its tasks are canceled. https://codereview.chromium.org/2491613004/ 
  2. dcheng@ is working on this. https://groups.google.com/a/chromium.org/forum/?utm_medium=email&utm_source=footer#!msg/scheduler-dev/nuaxnw64k0A/TIDAtVYLDgAJ 
  3. This has been on our TODO list for a while. I will take care of making this happen soon.

Helen Li

unread,
May 1, 2017, 1:10:59 PM5/1/17
to François Doray, Randy Smith, Matt Menke, Daniel Cheng, net-dev, ss...@chromium.org, Taiju Tsuiki, schedu...@chromium.org
Thanks François! Great to know this is being worked on. Let's carry on the discussion in the public thread: https://groups.google.com/a/chromium.org/forum/?utm_medium=email&utm_source=footer#!topic/scheduler-dev/HLuCdnfy4Fs

Helen Li

unread,
May 1, 2017, 1:14:39 PM5/1/17
to François Doray, Randy Smith, Matt Menke, Daniel Cheng, net-dev, ss...@chromium.org, Taiju Tsuiki, schedu...@chromium.org, alexc...@chromium.org
(Sorry, I meant to direct response from the internal thread here.)
The number of delayed tasks can easily exceed 1 thousand for cricbuzz https://bugs.chromium.org/p/chromium/issues/detail?id=716187

Alex Clarke

unread,
May 1, 2017, 1:17:43 PM5/1/17
to Helen Li, François Doray, Randy Smith, Matt Menke, Daniel Cheng, net-dev, ss...@chromium.org, Taiju Tsuiki, scheduler-dev, alexc...@chromium.org
We saw similar problems in blink, and landing a fast path for cancelled delayed tasks and a system to periodically sweep away them away completely in idle time seemed to help quite a bit.
Reply all
Reply to author
Forward
0 new messages