RFC: further improving robustness of recipe autoroller

10 views
Skip to first unread message

Paweł Hajdan, Jr.

unread,
Sep 13, 2016, 8:42:47 PM9/13/16
to infr...@chromium.org, Erik Staab, Robbie Iannucci, Stephen Martinis
I'd like to make the following changes to recipe autoroller:

1. Keep track of the last trivial and nontrivial roll CL. Always run recipes.py autoroll locally, and land a trivial roll even when there was previously a nontrivial roll CL in flight. This will cover cases when e.g. a revert leads to a trivial roll. Currently this requires manually closing previous nontrivial roll CL, and with this change the roller would automatically handle it.

2. Keep track of spec being rolled. When it's the same as last landed CL, do not upload another one. This would address a race condition with recipe roller generating second roll identical to previous one shortly after it lands.

3. Keep track of when last nontrivial and trivial roll CLs were created. If there's a nontrivial CL still not reviewed+CQ for say 30 mins (exact timing TBD), fail the build.

Any comments/questions before I move forward with a CL?

There would likely be two: one to start recording more state, and one to use it.

Paweł

Stephen Martinis

unread,
Sep 14, 2016, 2:03:06 PM9/14/16
to Paweł Hajdan, Jr., infr...@chromium.org, Erik Staab, Robbie Iannucci
Comments inline.

On Tue, Sep 13, 2016 at 5:42 PM Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:
I'd like to make the following changes to recipe autoroller:

1. Keep track of the last trivial and nontrivial roll CL. Always run recipes.py autoroll locally, and land a trivial roll even when there was previously a nontrivial roll CL in flight. This will cover cases when e.g. a revert leads to a trivial roll. Currently this requires manually closing previous nontrivial roll CL, and with this change the roller would automatically handle it.
I'm fine with solving the problem you're trying to solve (revert lands which makes non-trivial CL obsolete, and makes it so that everything can be rolled with a trivial CL). 

I don't see why you need to track the last trivial and nontrivial roll CL. Shouldn't always running autoroll locally, and uploading if it's trivial and different from the last spec enough?

2. Keep track of spec being rolled. When it's the same as last landed CL, do not upload another one. This would address a race condition with recipe roller generating second roll identical to previous one shortly after it lands.
I don't understand how the roller would cause this race condition to happen....  Do you have any examples of this happening I could see? I think removing race conditions are good, so in general for this change, but I would like to understand why it happens a bit more.

If this is caused by google storage concretely being a bad datastore (we know it's not the best, but I think it hasn't bitten us so far), then I think we should figure out another state storage mechanism. 

Also, don't we already keep track of the last spec? I know we keep track of the spec somewhere, is that not the spec that's being rolled?

3. Keep track of when last nontrivial and trivial roll CLs were created. If there's a nontrivial CL still not reviewed+CQ for say 30 mins (exact timing TBD), fail the build.
I think we should do something like this. I want to have more granularity than just "the autoroller is failing"; I want to be able to see which repo is behind, and by how much time, etc... I'm working on a recipe status app which I think should be able to solve some of these problems.

I think for now failing the build is fine. We'll need more granularity later, and I don't think digging through buildbot logs is that great of a UI if you want to see system status.

Paweł Hajdan, Jr.

unread,
Sep 14, 2016, 6:14:58 PM9/14/16
to Stephen Martinis, infr...@chromium.org, Erik Staab, Robbie Iannucci
On Wed, Sep 14, 2016 at 11:02 AM, Stephen Martinis <mart...@chromium.org> wrote:
On Tue, Sep 13, 2016 at 5:42 PM Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:
I'd like to make the following changes to recipe autoroller:

1. Keep track of the last trivial and nontrivial roll CL. Always run recipes.py autoroll locally, and land a trivial roll even when there was previously a nontrivial roll CL in flight. This will cover cases when e.g. a revert leads to a trivial roll. Currently this requires manually closing previous nontrivial roll CL, and with this change the roller would automatically handle it.
I'm fine with solving the problem you're trying to solve (revert lands which makes non-trivial CL obsolete, and makes it so that everything can be rolled with a trivial CL). 

I don't see why you need to track the last trivial and nontrivial roll CL. Shouldn't always running autoroll locally, and uploading if it's trivial and different from the last spec enough?

Uploaded https://chromium-review.googlesource.com/c/385676/ for more specific discussion.

Yes, we'll be always running autoroll.

Tracking last trivial and nontrivial rolls can be useful. For example, it may make sense to close the nontrivial roll CL automatically when uploading a trivial roll. In an extreme case that nontrivial roll CL might be reopened later in case e.g. the nontrivial change lands again.
 
2. Keep track of spec being rolled. When it's the same as last landed CL, do not upload another one. This would address a race condition with recipe roller generating second roll identical to previous one shortly after it lands.
I don't understand how the roller would cause this race condition to happen....  Do you have any examples of this happening I could see? I think removing race conditions are good, so in general for this change, but I would like to understand why it happens a bit more.

If this is caused by google storage concretely being a bad datastore (we know it's not the best, but I think it hasn't bitten us so far), then I think we should figure out another state storage mechanism. 

Also, don't we already keep track of the last spec? I know we keep track of the spec somewhere, is that not the spec that's being rolled?

We used to keep rack of the spec, but this has changed recently.

The race condition wasn't related to storage as far as I know. It's more like an existing roll CL could land after given recipe roller build has updated local repo, and before it checked status of the CL.
 
3. Keep track of when last nontrivial and trivial roll CLs were created. If there's a nontrivial CL still not reviewed+CQ for say 30 mins (exact timing TBD), fail the build.
I think we should do something like this. I want to have more granularity than just "the autoroller is failing"; I want to be able to see which repo is behind, and by how much time, etc... I'm working on a recipe status app which I think should be able to solve some of these problems.

I think for now failing the build is fine. We'll need more granularity later, and I don't think digging through buildbot logs is that great of a UI if you want to see system status.

Agreed. Are you writing a design doc for the app? More status is better, but there could be ways to make this really delightful. For example, maybe we could display all the info recipe roller has in a very friendly way (roll candidates and why they failed among others).

Paweł
Reply all
Reply to author
Forward
0 new messages