Rate limiting and retrying

Mark Mentovai

Apr 7, 2017, 1:48:32 PM
to Crashpad-dev
We currently implement the simplistic one report per hour limit that we inherited from Breakpad on macOS. Chrome’s use of Breakpad on Windows had a limit of 20 reports per day. The only knob that we currently have is --no-rate-limit, which lifts all limits entirely and isn’t really viable (we used it for a while in Chrome on Windows, and then stopped). Siggi did an analysis and found that in not allowing more than one near-contemporaneous report to become eligible for upload, there was surprisingly little loss of actionable data.

Now that we’re closing in on recording crash reports for crashpad_handler on all supported platforms, it’s time to think about what this means for our rate limiting strategy. As crash report upload normally immediately follows a crash, it follows that a crash that occurs in crashpad_handler during upload, while it may be recorded, will not normally become eligible for upload. This leaves us with a blind spot in crash coverage for crashpad_handler.

I’d like to close up this hole and fix bug 23. It seems that implementing a 20 report per day limit would be a good start. A two-tiered approach is also interesting: a dual limit of 20 reports per day and 5 per hour would prevent a flood of a full day’s quota of 20 reports from being uploaded all at once.
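
As a rough illustration of the two-tiered idea, the sketch below admits a report only when both the hourly and the daily window have room. It is not Crashpad code: the class name `TwoTierLimiter`, the `allow` method, and the choice of sliding windows (rather than fixed calendar buckets) are assumptions made for the example. It counts reports that become eligible for upload, not upload attempts.

```python
from collections import deque

HOUR = 3600
DAY = 24 * HOUR


class TwoTierLimiter:
    """Hypothetical sketch: admit a report only if both windows have room."""

    def __init__(self, per_hour=5, per_day=20):
        self.per_hour = per_hour
        self.per_day = per_day
        self._admitted = deque()  # timestamps (seconds) of admitted reports

    def allow(self, now):
        # Forget admissions that have aged out of the 24-hour window.
        while self._admitted and now - self._admitted[0] >= DAY:
            self._admitted.popleft()
        in_last_hour = sum(1 for t in self._admitted if now - t < HOUR)
        if len(self._admitted) >= self.per_day or in_last_hour >= self.per_hour:
            return False
        self._admitted.append(now)
        return True


limiter = TwoTierLimiter()
# Eight crashes within eight seconds: the hourly tier stops the burst at five.
burst = [limiter.allow(now=second) for second in range(8)]
print(burst)
```

With only the daily limit, the same burst could consume a large share of the day's quota within seconds; the hourly tier spreads it out.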

Rate limiting and retrying are related, and there’s a question of how to apply this policy: should we attempt 20 uploads per day, or should we allow 20 crash reports per day to become eligible for upload? I think I’m leaning toward the latter.

The other question is how often to retry failed uploads, and when to give up. An initial delay of an hour, with exponential backoff to 8 hours is probably sufficient. We could give up on trying to upload a report if we haven’t had any success after 2 days or 5 upload attempts, whichever comes first.
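
To make the proposed constants concrete, here is a small sketch that lists when upload attempts would happen. Several details are assumptions not stated above: the delay doubles after each failure, the first attempt happens at crash time, and that first attempt counts toward the limit of 5. The function name `retry_schedule` and its parameters are hypothetical.

```python
HOUR = 3600


def retry_schedule(initial_delay=1 * HOUR, max_delay=8 * HOUR,
                   max_attempts=5, give_up_after=48 * HOUR, factor=2):
    """Return attempt times in seconds after the crash (first attempt at 0)."""
    times = [0]
    delay = initial_delay
    while len(times) < max_attempts:
        next_time = times[-1] + delay
        if next_time > give_up_after:
            break  # the time limit is reached before the attempt limit
        times.append(next_time)
        delay = min(delay * factor, max_delay)  # back off, capped at max_delay
    return times


print([t // HOUR for t in retry_schedule()])  # [0, 1, 3, 7, 15]
```

Under these assumptions the fifth attempt falls 15 hours after the crash, so the 5-attempt limit always triggers before the 2-day limit. The 2-day limit would only matter if attempts were deferred (for example, while the machine is asleep) or if the attempt limit were raised.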

I’d eventually like to get to the point where we’re smart enough to defer an upload attempt if it looks like the network is down, but perhaps try the initial upload regardless of network status (because sometimes when it looks like it’s down, it’s not really). Network-aware upload will become increasingly important as we finish the Android client.

Please let me know what you think of this strategy, and help me tune the constants.

Sigurður Ásgeirsson

Apr 7, 2017, 2:04:23 PM
to Mark Mentovai, Crashpad-dev
On Fri, Apr 7, 2017 at 1:48 PM Mark Mentovai <ma...@chromium.org> wrote:
We currently implement the simplistic one report per hour limit that we inherited from Breakpad on macOS. Chrome’s use of Breakpad on Windows had a limit of 20 reports per day. The only knob that we currently have is --no-rate-limit, which lifts all limits entirely and isn’t really viable (we used it for a while in Chrome on Windows, and then stopped). Siggi did an analysis and found that in not allowing more than one near-contemporaneous report to become eligible for upload, there was surprisingly little loss of actionable data.

The analysis was done on un-throttled upload data, but as there wasn't any retry at the time, we don't really know how upload failures behave. I suspect we have systemic blind spots for such things as crashing on entering and/or exiting low power states, where networking is not, or is only rarely, available at the time of the crash (on desktop platforms).
 
Now that we’re closing in on recording crash reports for crashpad_handler on all supported platforms, it’s time to think about what this means for our rate limiting strategy. As crash report upload normally immediately follows a crash, it follows that a crash that occurs in crashpad_handler during upload, while it may be recorded, will not normally become eligible for upload. This leaves us with a blind spot in crash coverage for crashpad_handler.

I’d like to close up this hole and fix bug 23. It seems that implementing a 20 report per day limit would be a good start. A two-tiered approach is also interesting: a dual limit of 20 reports per day and 5 per hour would prevent a flood of a full day’s quota of 20 reports from being uploaded all at once.

Randomization might be another viable option, which might give us statistically better coverage. I figure you'd want something like exponentially decaying odds for nominating a dump for upload.
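
One possible reading of "exponentially decaying odds", sketched below: the chance of nominating a dump shrinks by a constant factor for each dump already seen in the current window. The factor of 0.5, the per-window counter, and the function names are all assumptions for illustration, not a worked-out proposal.

```python
import random


def nomination_probability(seen_in_window, decay=0.5):
    """Odds of nominating the next dump, given how many were already seen."""
    return decay ** seen_in_window


def nominate(seen_in_window, rng=random, decay=0.5):
    # rng.random() is in [0.0, 1.0), so the first dump in a window
    # (probability 1.0) is always nominated.
    return rng.random() < nomination_probability(seen_in_window, decay)


# Expected number of nominations if 50 dumps arrive in one window.
expected = sum(nomination_probability(n) for n in range(50))
print(expected)
```

With a decay factor of 0.5 the expected number of nominations per window stays below 2 no matter how many dumps arrive, since the probabilities form a geometric series. Unlike a hard cap, later dumps in a crash storm still have a small chance of being uploaded.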
 

Rate limiting and retrying are related, and there’s a question of how to apply this policy: should we attempt 20 uploads per day, or should we allow 20 crash reports per day to become eligible for upload? I think I’m leaning toward the latter.

Yeah, I think making nomination sticky, and then retrying uploads for nominated crashes sounds sane and simple.
 
The other question is how often to retry failed uploads, and when to give up. An initial delay of an hour, with exponential backoff to 8 hours is probably sufficient. We could give up on trying to upload a report if we haven’t had any success after 2 days or 5 upload attempts, whichever comes first.

+1 - sounds good to me.
 

I’d eventually like to get to the point where we’re smart enough to defer an upload attempt if it looks like the network is down, but perhaps try the initial upload regardless of network status (because sometimes when it looks like it’s down, it’s not really). Network-aware upload will become increasingly important as we finish the Android client.

Please let me know what you think of this strategy, and help me tune the constants.

I think starting with something simple, capturing metrics and then refining the approach would be best. Alternatively we can try simulation, but there are enough unknowns that I think that'd be nigh-on nonsensical :).

tomst...@gmail.com

Jan 24, 2019, 8:07:31 PM
to Crashpad-dev, ma...@chromium.org
Hey guys,

Did this ever get implemented? This feature is necessary for us to make use of Crashpad.

Mark Mentovai

Jan 28, 2019, 12:19:50 PM
to Tom Stoffer, Crashpad-dev
Sorry, Tom, it hasn’t been implemented yet. I do have something half-finished that I intend to dust off soon.

Tom Stoffer

Jan 28, 2019, 2:22:15 PM
to Mark Mentovai, Crashpad-dev
Hey Mark,

If you have a branch to share we could look at finishing it off for you?

Tom
