We currently implement the simplistic one-report-per-hour limit that we inherited from Breakpad on macOS. Chrome’s use of Breakpad on Windows had a limit of 20 reports per day. The only knob that we currently have is --no-rate-limit, which lifts all limits entirely and isn’t really viable (we used it for a while in Chrome on Windows, and then stopped). Siggi did an analysis and found that in not allowing more than one near-contemporaneous report to become eligible for upload, there was surprisingly little loss of actionable data.
Now that we’re closing in on recording crash reports for crashpad_handler on all supported platforms, it’s time to think about what this means for our rate limiting strategy. As crash report upload normally immediately follows a crash, it follows that a crash that occurs in crashpad_handler during upload, while it may be recorded, will not normally become eligible for upload. This leaves us with a blind spot in crash coverage for crashpad_handler.
I’d like to close up this hole and fix bug 23. It seems that implementing a 20 report per day limit would be a good start. A two-tiered approach is also interesting: a dual limit of 20 reports per day and 5 per hour would prevent a flood of a full day’s quota of 20 reports from being uploaded all at once.
Rate limiting and retrying are related, and there’s a question of how to apply this policy: should we attempt 20 uploads per day, or should we allow 20 crash reports per day to become eligible for upload? I think I’m leaning toward the latter.
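To make the dual limit concrete, here is a minimal sketch in Python of a limiter that counts reports becoming eligible for upload (the second of the two options above) over sliding windows. The class name TwoTierRateLimiter, the allow() method, and the choice of sliding rather than fixed calendar windows are all assumptions made for illustration; none of this is existing Crashpad code.

```python
from collections import deque

HOUR = 3600        # seconds
DAY = 24 * HOUR    # seconds


class TwoTierRateLimiter:
    """Hypothetical limiter: at most per_day reports may become eligible in
    any 24-hour span, and at most per_hour in any 1-hour span."""

    def __init__(self, per_day=20, per_hour=5):
        self.per_day = per_day
        self.per_hour = per_hour
        self._eligible = deque()  # times (in seconds) at which reports became eligible

    def allow(self, now):
        """Return True and record the report if it may become eligible at `now`."""
        # Forget reports that became eligible a day or more ago.
        while self._eligible and now - self._eligible[0] >= DAY:
            self._eligible.popleft()
        in_last_hour = sum(1 for t in self._eligible if now - t < HOUR)
        if len(self._eligible) >= self.per_day or in_last_hour >= self.per_hour:
            return False
        self._eligible.append(now)
        return True
```

Because the limiter counts eligibility rather than upload attempts, retries of an already-eligible report would not consume any additional quota under this sketch.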
The other question is how often to retry failed uploads, and when to give up. An initial delay of an hour, with exponential backoff to 8 hours is probably sufficient. We could give up on trying to upload a report if we haven’t had any success after 2 days or 5 upload attempts, whichever comes first.
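As an aid to tuning these constants, the following sketch in Python computes when each attempt would run. It assumes that the delay doubles after each failure (the text only says "exponential"), that the first attempt immediately follows the crash and counts toward the limit of 5, and that every retry runs exactly on schedule. The function name retry_schedule and its parameters are hypothetical.

```python
def retry_schedule(initial_delay_h=1, max_delay_h=8,
                   max_attempts=5, give_up_after_h=48):
    """Return the hours after the crash at which upload attempts would run,
    assuming every attempt fails."""
    times = [0]  # assumption: the first attempt immediately follows the crash
    delay = initial_delay_h
    while len(times) < max_attempts:
        next_time = times[-1] + delay
        if next_time > give_up_after_h:
            break  # the time limit is reached before the attempt limit
        times.append(next_time)
        delay = min(delay * 2, max_delay_h)  # assumed doubling, capped at max_delay_h
    return times


print(retry_schedule())  # [0, 1, 3, 7, 15]
```

Under these assumptions the fifth attempt happens 15 hours after the crash, so the 5-attempt limit is always reached well before the 2-day limit. The 2-day limit would only come into play if attempts were delayed, for example because the handler was not running or an upload was deferred.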
I’d eventually like to get to the point where we’re smart enough to defer an upload attempt if it looks like the network is down, but perhaps try the initial upload regardless of network status (because sometimes when it looks like it’s down, it’s not really). Network-aware upload will become increasingly important as we finish the Android client.
Please let me know what you think of this strategy, and help me tune the constants.