CQ retry quota: how do we decide?


Paweł Hajdan, Jr.

Jan 27, 2016, 2:24:30 PM
to infr...@chromium.org
There's been some discussion on https://codereview.chromium.org/1474703002 about possibly changing the CQ retry quota.

The change landed in November last year, and it reduced the number of retries after we failed to meet the 90% CQ cycle time target 4 weeks in a row.

I'm fine with experimenting to change it back if anyone has a strong opinion.

However, I'm not sure if we have well-defined success criteria to evaluate effects of the change.

If there's interest in improving CQ metrics, my suggestion would be to improve our understanding of the system and our ability to simulate the effects of changes, and then to verify such simulations/estimations.

One possible outcome might be that the current CQ metrics are quite noisy and hard to reason about, and that we should come up with different metrics.

WDYT?

Paweł

Andrii Shyshkalov

Jan 27, 2016, 2:33:58 PM
to infra-dev


On Wednesday, January 27, 2016 at 8:24:30 PM UTC+1, Paweł wrote:
There's been some discussion on https://codereview.chromium.org/1474703002 about possibly changing the CQ retry quota.

The change landed in November last year, and it reduced the number of retries after we failed to meet the 90% CQ cycle time target 4 weeks in a row.
Well, AFAIR, it didn't obviously help reduce the cycle time, unless of course without the fix it would have been even worse.
That said, 2->1 clearly increases the number of times a dev has to click the CQ button again. For these two reasons, I'd say let's flip back to 2.

I'm fine with experimenting to change it back if anyone has a strong opinion.

However, I'm not sure if we have well-defined success criteria to evaluate effects of the change.
Same as when we flipped it 2->1. That's why we couldn't decide whether to flip back or not: there was no tangible improvement in any of the monitoring data we collected at the time.

Sergey Berezin

Jan 27, 2016, 3:09:09 PM
to Andrii Shyshkalov, infra-dev
There is plenty of data in the event_mon pipeline nowadays, and some of it is exported to BQ. With a bit of statistics, I'm sure we can estimate the impact one way or another. Specifically, if the number of CLs with >1 retry is below 10%, extra retries are not likely to affect 90th-percentile metrics. And IIRC, we'd only retry infra failures more than once; regular test failures are retried once regardless. So collecting the infra failure rate should give you a good idea of what'll happen. Likewise with the false rejection rate: if you can estimate the infra flakiness of try jobs, you can project the impact of more retries. Feel free to ping me on a private channel if you want to bounce ideas (unfortunately I don't have time to write actual queries right now, but you're all good with SQL :-)
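That projection can be sketched as a small simulation. This is only an illustration: the per-tryjob flake rate, the number of tryjobs per CL, and the independence assumption below are all invented for the sketch, not measured from CQ data.

```python
import random

def false_rejection_rate(n_tryjobs=20, flake_rate=0.02,
                         retry_quota=2, trials=100_000):
    """Monte Carlo estimate of how often a good CL gets falsely
    rejected. Simplified model (not the real CQ logic): each tryjob
    flakes independently with probability `flake_rate`, every flake
    consumes one unit of the global retry quota, and the CL is
    rejected once the quota is exhausted."""
    rejected = 0
    for _ in range(trials):
        retries_used = 0
        over_quota = False
        for _ in range(n_tryjobs):
            # Retry this tryjob while it keeps flaking.
            while random.random() < flake_rate:
                retries_used += 1
                if retries_used > retry_quota:
                    over_quota = True
                    break
            if over_quota:
                rejected += 1
                break
    return rejected / trials

# In this model a higher quota should reduce false rejections.
for quota in (1, 2, 3):
    print(quota, false_rejection_rate(retry_quota=quota))
```

Plugging in a measured infra flake rate instead of the made-up 2% would give a rough estimate of what flipping 1->2 buys us.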

The data you cite indicates to me that the cycle time is not due to flakiness or retries, but more likely due to slow bots. Look at the 90th %-ile for trybot runtimes on our dashboards (Services / CQ / show more build cycle time graphs).

Sergey.

--
You received this message because you are subscribed to the Google Groups "infra-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to infra-dev+...@chromium.org.
To post to this group, send email to infr...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/infra-dev/b2e34469-468d-442a-aded-3fb052e8b980%40chromium.org.

Paweł Hajdan, Jr.

Jan 28, 2016, 4:25:07 PM
to Sergey Berezin, Andrii Shyshkalov, infra-dev
I'd like to tackle the point about slowness being due to slow bots. We do now have graphs for bots, and can also look up past data there.

It's not obvious to me that we just have a slow bot impacting everything. If we do, it should be easy to point to such a bot and its cycle time graphs. Can you?

FWIW, I'm going to do another round of checks in case I missed something obvious.

For flakiness-related data, this is based on last week's CQ stats:

Patches which eventually land percentiles:
 10:  0.2 hrs,  1 attempts,  0 tryjob retries,  0 global retry quota
 25:  0.5 hrs,  1 attempts,  0 tryjob retries,  0 global retry quota
 50:  1.1 hrs,  1 attempts,  0 tryjob retries,  0 global retry quota
 75:  1.5 hrs,  1 attempts,  0 tryjob retries,  0 global retry quota
 90:  2.1 hrs,  1 attempts,  1 tryjob retries,  1 global retry quota
 95:  2.8 hrs,  2 attempts,  1 tryjob retries,  2 global retry quota
 99:  4.3 hrs,  2 attempts,  3 tryjob retries,  4 global retry quota
max: 20.0 hrs,  3 attempts, 14 tryjob retries, 15 global retry quota

This means 90% of patches require at most 1 tryjob retry and 1 global retry quota, and 95% fit within 1 retry and 2 global retry quota.

However, some weeks are worse; e.g. in the week of January 11, the 90th percentile of patches needed 3 global retry quota (https://groups.google.com/a/chromium.org/d/msg/chromium-dev/KhgMIUTpnwg/Q7snlORuAgAJ):

Patches which eventually land percentiles:
 10:  0.1 hrs,  1 attempts,  0 tryjob retries,  0 global retry quota
 25:  0.4 hrs,  1 attempts,  0 tryjob retries,  0 global retry quota
 50:  0.8 hrs,  1 attempts,  0 tryjob retries,  0 global retry quota
 75:  1.3 hrs,  1 attempts,  0 tryjob retries,  1 global retry quota
 90:  1.9 hrs,  2 attempts,  1 tryjob retries,  3 global retry quota
 95:  2.8 hrs,  2 attempts,  2 tryjob retries, 13 global retry quota
 99:  4.3 hrs,  3 attempts,  3 tryjob retries, 56 global retry quota
max:  6.5 hrs,  5 attempts,  9 tryjob retries, 120 global retry quota


Patches which eventually land percentiles:
 10:  0.2 hrs,  1 attempts,  0 tryjob retries,  0 global retry quota
 25:  0.7 hrs,  1 attempts,  0 tryjob retries,  0 global retry quota
 50:  1.3 hrs,  1 attempts,  0 tryjob retries,  0 global retry quota
 75:  2.2 hrs,  1 attempts,  1 tryjob retries,  1 global retry quota
 90:  3.5 hrs,  2 attempts,  2 tryjob retries,  3 global retry quota
 95:  4.8 hrs,  2 attempts,  3 tryjob retries,  4 global retry quota
 99:  6.9 hrs,  3 attempts,  7 tryjob retries, 10 global retry quota
max: 14.1 hrs,  8 attempts, 21 tryjob retries, 21 global retry quota

Paweł

Sergey Berezin

Jan 28, 2016, 5:42:36 PM
to Paweł Hajdan, Jr., Andrii Shyshkalov, infra-dev
On Thu, Jan 28, 2016 at 1:25 PM Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:
I'd like to tackle the point about slowness being due to slow bots. We do now have graphs for bots, and can also look up past data there.

It's not obvious to me that we just have a slow bot impacting everything. If we do, it should be easy to point to such a bot and its cycle time graphs. Can you?

You'd be surprised at how non-obvious this is to figure out :-) See for yourself. The time in queue and max build cycle times sort of correlate, within 10-20 min of each other. I'm not sure, though, where those extra minutes go... It could be that successful builds tend to take longer than failing ones.

Of course, this is only my guess. The actual data work would be to find out the last thing CQ was blocked on for each CL, and do a statistical analysis of those blockers. As I mentioned, I don't have time to generate this right now, but that would be my approach to answering your question.

 

FWIW, I'm going to do another round of checks in case I missed something obvious.

For flakiness-related data, this is based on last week's CQ stats:

Patches which eventually land percentiles:
 10:  0.2 hrs,  1 attempts,  0 tryjob retries,  0 global retry quota
 25:  0.5 hrs,  1 attempts,  0 tryjob retries,  0 global retry quota
 50:  1.1 hrs,  1 attempts,  0 tryjob retries,  0 global retry quota
 75:  1.5 hrs,  1 attempts,  0 tryjob retries,  0 global retry quota
 90:  2.1 hrs,  1 attempts,  1 tryjob retries,  1 global retry quota
 95:  2.8 hrs,  2 attempts,  1 tryjob retries,  2 global retry quota
 99:  4.3 hrs,  2 attempts,  3 tryjob retries,  4 global retry quota
max: 20.0 hrs,  3 attempts, 14 tryjob retries, 15 global retry quota

This means 90% of patches require at most 1 tryjob retry and 1 global retry quota, and 95% fit within 1 retry and 2 global retry quota.

Is that a typical week? If it is, that's your answer right there: more retries will not impact the 90th percentile.
 

However, some weeks are worse; e.g. in the week of January 11, the 90th percentile of patches needed 3 global retry quota (https://groups.google.com/a/chromium.org/d/msg/chromium-dev/KhgMIUTpnwg/Q7snlORuAgAJ):

If we have one bad week a month, how about taking a 5-week window? Would that still keep the 90th %-ile under 1 retry?
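One way to check is to pool the per-CL retry counts over the whole window before taking the percentile, rather than averaging weekly percentiles. A minimal sketch; the weekly numbers below are invented for illustration, not real CQ stats:

```python
import math

def percentile(values, p):
    """Nearest-rank percentile: the smallest sample value such that
    at least p% of the samples are <= it."""
    s = sorted(values)
    k = max(0, math.ceil(p / 100 * len(s)) - 1)
    return s[k]

# Hypothetical per-week lists of "global retry quota used" per
# landed CL (made-up numbers, one bad week in five).
weeks = [
    [0, 0, 0, 0, 1],   # quiet week
    [0, 0, 1, 3, 13],  # bad week, like Jan 11
    [0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 2],
]
pooled = [x for week in weeks for x in week]
# One bad week barely moves the pooled 90th percentile.
print(percentile(pooled, 90))
```

With the real per-CL data from the stats pipeline in place of the toy lists, this would answer the 5-week-window question directly.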

Sergey.

Paweł Hajdan, Jr.

Feb 1, 2016, 12:04:08 PM
to Sergey Berezin, Andrii Shyshkalov, infra-dev
On Thu, Jan 28, 2016 at 11:42 PM, Sergey Berezin <sergey...@chromium.org> wrote:
On Thu, Jan 28, 2016 at 1:25 PM Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:
I'd like to tackle the point about slowness being due to slow bots. We do now have graphs for bots, and can also look up past data there.

It's not obvious to me that we just have a slow bot impacting everything. If we do, it should be easy to point to such a bot and its cycle time graphs. Can you?

You'd be surprised at how non-obvious this is to figure out :-) See for yourself. The time in queue and max build cycle times sort of correlate, within 10-20 min of each other. I'm not sure, though, where those extra minutes go... It could be that successful builds tend to take longer than failing ones.

Hm, it's not obvious to me looking at the link provided. Could you point to a specific pair of graphs, so we can put them next to each other and make the correlation easy to see?

As for build time differing depending on whether the build succeeded: we do store the success status in the metric. One could experiment with a custom query, although my reasoning is that for CQ this shouldn't make a difference (average build time should be sufficient to detect a general slowdown of bots).

In general, I'm referring to your earlier statement that "the data you cite indicates to me that the cycle time is not due to flakiness or retries, but more likely due to slow bots." If that's the case, you should be able to point at which bot is the slow one, right? linux_android_rel_ng seems to be the slowest, but that doesn't necessarily make it a regression: it was also the slowest and exceeded its SLO even when CQ was meeting its SLO.
 
Of course, this is only my guess. The actual data work would be to find out the last thing CQ was blocked on for each CL, and do a statistical analysis of those blockers. As I mentioned, I don't have time to generate this right now, but that would be my approach to answering your question.

Yes, the idea has been around for quite some time.

Suppose we had this (and checking e.g. which trybot completed last for each committed CL is actually easy). How would we know, e.g., whether a bot became slower or whether we have more retries? How would we separate or attribute the effects of bot cycle time, bot success rate, and the CQ retry strategy? While I'm sure good answers to these questions exist, they don't seem obvious to me at the moment.
 
This means 90% of patches only require 1 tryjob retry and 1 global retry quota. 95% fit within 1 retry and 2 global retry quota.

Is that a typical week? If it is, that's your answer right there: more retries will not impact the 90th percentile.

Might be typical, not sure. One thing to note is that the set of CLs corresponding to the 90% cycle time might be different from the set of CLs corresponding to 90% retries: for each metric the percentiles are calculated independently. That's a non-obvious gotcha that makes reasoning about this somewhat more complicated.
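Toy data (entirely made up) shows that gotcha: the CL at the tail of the cycle-time distribution need not be the CL at the tail of the retry distribution, so per-metric percentile rows in the stats tables don't describe a single set of CLs.

```python
# Hypothetical per-CL samples: (name, cycle time in hrs, retries used).
cls = [
    ("slow-but-clean", 5.0, 0),  # slow bots, no flakes
    ("fast-but-flaky", 0.5, 4),  # fast bots, lots of retries
    ("typical-1",      1.0, 0),
    ("typical-2",      1.2, 1),
    ("typical-3",      0.8, 0),
]

# Using max as a stand-in for the top percentile of each metric:
slowest = max(cls, key=lambda c: c[1])[0]
most_retried = max(cls, key=lambda c: c[2])[0]

# The two tail CLs are different, so the "90%" row for cycle time
# and the "90%" row for retries can come from disjoint sets of CLs.
print(slowest, most_retried)
```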
 
However, some weeks are worse; e.g. in the week of January 11, the 90th percentile of patches needed 3 global retry quota (https://groups.google.com/a/chromium.org/d/msg/chromium-dev/KhgMIUTpnwg/Q7snlORuAgAJ):

If we have one bad week a month, how about taking a 5w window? Would that still keep 90th %-ile under 1 retry?

Here's the data for the last month. 90% still shows 1 tryjob retry and 1 global retry quota. Looks good so far...

Statistics for project chromium
excluding paths in the following set:
  third_party/WebKit
from 2016-01-04 15:17:01.734330 till 2016-02-01 15:10:07.558800 (local time).

CQ users:         607 out of    612 total committers  99.18%
  Committed      3942 out of   4033 commits           97.74%. 

3970 issues (5192 patches) were tried by CQ, resulting in 4625 attempts.
3404 patches (65.6% of tried patches, 73.6% of attempts) were committed by CQ.

False Rejections:
 188 attempts (4.1% of 4625 attempts) were false rejections in 162 committed patches
  79 attempts (1.7% of 4625 attempts) were infra false rejections in 69 committed patches

Patches which eventually land percentiles:
 10:  0.1 hrs,  1 attempts,  0 tryjob retries,  0 global retry quota
 25:  0.4 hrs,  1 attempts,  0 tryjob retries,  0 global retry quota
 50:  0.9 hrs,  1 attempts,  0 tryjob retries,  0 global retry quota
 75:  1.4 hrs,  1 attempts,  0 tryjob retries,  0 global retry quota
 90:  1.8 hrs,  1 attempts,  1 tryjob retries,  1 global retry quota
 95:  2.6 hrs,  2 attempts,  1 tryjob retries,  2 global retry quota
 99:  4.2 hrs,  3 attempts,  3 tryjob retries, 14 global retry quota
max: 20.0 hrs,  6 attempts, 14 tryjob retries, 120 global retry quota

Per-week stats:
  week of 2016-01-04: 1055 attempts; 50%  0.7; 90%  1.6; false rejections  1.0% ( 0.1% infra)
  week of 2016-01-11: 1402 attempts; 50%  0.8; 90%  1.9; false rejections  7.3% ( 4.4% infra)
  week of 2016-01-18:  971 attempts; 50%  1.1; 90%  2.1; false rejections  2.6% ( 0.5% infra)
  week of 2016-01-25: 1195 attempts; 50%  0.8; 90%  1.7; false rejections  3.3% ( 0.5% infra)

By the way, these are the top flakes from the last month, categorized by cause; in each row, the largest percentage is the biggest cause of flakiness. Note the high percentage of infra flakes. We could get a list of the builds contributing to that and investigate further.

Top flaky builders (which fail and succeed in the same patch):
Master           Builder                                                 Flakes          |Infra  |Compile|Test   |Invalid|Patch  |Other  
chromium.mac     mac_chromium_rel_ng                                      390/5950 (  7%)|    58%|     3%|    33%|     1%|     5%|     1%
chromium.linux   linux_chromium_rel_ng                                    344/6003 (  6%)|    52%|     2%|    16%|     2%|     5%|    23%
chromium.linux   chromium_presubmit                                       294/6127 (  5%)|     9%|     0%|     0%|     0%|    15%|    77%
chromium.linux   linux_chromium_asan_rel_ng                               213/5693 (  4%)|    64%|     5%|    16%|     0%|     9%|     5%
chromium.android linux_android_rel_ng                                     202/5707 (  4%)|    51%|     9%|    18%|    12%|     8%|     2%
chromium.linux   linux_chromium_chromeos_rel_ng                           194/5691 (  3%)|    63%|     9%|    14%|     0%|    10%|     4%
chromium.linux   linux_chromium_chromeos_ozone_rel_ng                     156/5670 (  3%)|    67%|    13%|     6%|     0%|    12%|     3%
chromium.win     win_chromium_rel_ng                                      129/5656 (  2%)|    29%|     4%|    41%|     9%|    16%|     2%
chromium.android android_clang_dbg_recipe                                 123/5603 (  2%)|    54%|    32%|     0%|     0%|    14%|     1%
chromium.android android_chromium_gn_compile_dbg                          107/5564 (  2%)|    61%|    23%|     0%|     0%|    15%|     1%
chromium.android android_chromium_gn_compile_rel                          106/5574 (  2%)|    64%|    21%|     0%|     0%|    15%|     0%
chromium.android android_arm64_dbg_recipe                                 101/5547 (  2%)|    65%|    16%|     0%|     0%|    19%|     0%
chromium.win     win_chromium_x64_rel_ng                                   91/5603 (  2%)|    20%|     7%|    38%|    11%|    21%|     3%
chromium.android android_compile_dbg                                       90/5552 (  2%)|    69%|    12%|     0%|     0%|    19%|     0%
chromium.android cast_shell_android                                        89/5552 (  2%)|    75%|     3%|     0%|     0%|    20%|     1%
chromium.mac     ios_dbg_simulator_ninja                                   90/5631 (  2%)|    59%|     0%|     0%|     0%|    21%|    20%

Paweł