Sheriffing hackability

28 views
Skip to first unread message

Scott Graham

unread,
Sep 3, 2014, 1:17:14 AM9/3/14
to hackability-cy, Dominic Mazzoni, Philippe Beaudoin, tr...@chromium.org
We got lucky a lucky shift because of Labo[u]r Day yesterday and I think people were likely on vacation today so the commit rate was unusually low.

I ran into a bunch of things that made the world not-so-hackable today, so I thought I'd mention them here. I'm not sure if these are all known and being addressed and/or require any postmortems. I filed bugs for some. Maybe my co-sheriffs noticed other things or had better experiences.


- Committing via the "Revert patchset" on Rietveld was not useful. I tried it twice. The first time the revert took longer than 45m, and it didn't land: https://code.google.com/p/chromium/issues/detail?id=409966 . The second time it worked because I de-starved it by having the tree closed. Presumably this isn't what we want.

- Landing a revert via git cl land took ~25 mins to actually get to origin/master. It landed in pending after a few minutes, but no one could actually pull it for quite some time. This added noise (people had to keep asking what was going on with X), and I could imagine it masking other failures on busier days. I think this is related to https://code.google.com/p/chromium/issues/detail?id=407429 but I'm not certain. Similarly, a few people complained that their CL had landed but "wasn't showing up" or similar, which meant we had to spend time time pulling/figuring out if something really was broken, etc.

- Using a local git revert dirties your tree, especially if it's not near the revision you're reverting (i.e. you need to rebuild a lot later). This is more disruptive than drover, especially if you're trying to repro another test failure, etc. locally at the same time. A shallow-clone-drover-ish-magic thing would be really helpful here.

- I had to do some merging for 37 & 38, and had some difficulty figuring out what CLs where in what branches. This might just be me needing to adapt, or documenting/me reading documentation. Determining that data via a web interface rather than a local git clone is beneficial, again when your local boxes are busy trying to repro/fix something else.


The above 4 issues would make me nervous about sheriffing on busier day (e.g. closer to a branch point) for lack of being able to control the situation effectively. Being able to revert quickly (in the "handful of seconds" range) needs to go hand in hand with the effort to not close the tree.



- I didn't find Sheriff-o-matic usable as a primary tool yet. The biggest problem I had today was that the UI hangs (> 15s) when a large number of tests fail. That happened on some of the slooooow cycling lsan/asan bots so it took the UI out of commission for most of the afternoon. I believe Ojan is looking at that one. I also filed a number of other smaller bugs. Overall I still felt more effective using the old build.chromium.org waterfall.

- trungl-bot doesn't announce failures on IRC any more because they're not tree closings. This was generally how I got notified of what to do next when sheriffing, so I didn't notice one set of bots until after they'd been red for a bit.

- There were some git server problems that took down Linux slaves for a while. https://code.google.com/p/chromium/issues/detail?id=410102 I think this was probably just an unusual transient event, but if it becomes a regular thing, we would want to monitor uptime on our side.

- I've noticed troopers generally aren't in #chromium any more. Is that intentional or just a team shift to elsewhere? It makes coordination a bit slower if you're relaying message back and forth with some on irc, some in IM, some in person, etc.

- There was a build problem that wasn't noticed on any main waterfall bots or trybots, but affected local developer builds, LKGR builders, Official builders, and various other memory.fyi bots: https://code.google.com/p/chromium/issues/detail?id=409940

- There were still a "normal" number of flaky tests, even with our ongoing war :(. I think there were 3 that would have previously been tree closers. It wasn't clear "how flaky" they were, if they should be disabled, or if there was really something obscure that had caused them to become flaky. There are links on the waterfall to "Flakiness dashboard" for things that aren't archived there. (I'm not sure if that's new situation.).



(Sorry for the long rambling brain dump, M-A).

Marc Treib

unread,
Sep 3, 2014, 4:16:18 AM9/3/14
to Scott Graham, hackability-cy, Dominic Mazzoni, Philippe Beaudoin
On Wed, Sep 3, 2014 at 7:17 AM, Scott Graham <sco...@chromium.org> wrote:
- Committing via the "Revert patchset" on Rietveld was not useful. I tried it twice. The first time the revert took longer than 45m, and it didn't land: https://code.google.com/p/chromium/issues/detail?id=409966 . The second time it worked because I de-starved it by having the tree closed. Presumably this isn't what we want.

"Revert patchset" did work for me the one time I used it. HOWEVER, the commit never showed up on the build.chromium.org waterfall, even after it got to origin/master. [For reference, it's this one: https://chromium.googlesource.com/chromium/src/+/0b804d0007f4fc6b6d99f699e160819ab90c1798]
 
- trungl-bot doesn't announce failures on IRC any more because they're not tree closings. This was generally how I got notified of what to do next when sheriffing, so I didn't notice one set of bots until after they'd been red for a bit.

+1 for failure notifications on IRC.

Aaron Gable

unread,
Sep 3, 2014, 5:59:06 AM9/3/14
to Scott Graham, hackability-cy, Dominic Mazzoni, Philippe Beaudoin, tr...@chromium.org
On Wed Sep 03 2014 at 7:17:15 AM Scott Graham <sco...@chromium.org> wrote:
We got lucky a lucky shift because of Labo[u]r Day yesterday and I think people were likely on vacation today so the commit rate was unusually low.

I ran into a bunch of things that made the world not-so-hackable today, so I thought I'd mention them here. I'm not sure if these are all known and being addressed and/or require any postmortems. I filed bugs for some. Maybe my co-sheriffs noticed other things or had better experiences.


- Committing via the "Revert patchset" on Rietveld was not useful. I tried it twice. The first time the revert took longer than 45m, and it didn't land: https://code.google.com/p/chromium/issues/detail?id=409966 . The second time it worked because I de-starved it by having the tree closed. Presumably this isn't what we want.

I think the "revert patchset" button isn't intended to be a drover replacement, as it explicitly is supposed to go through the CQ (whereas drover always committed directly). So expecting it to be any faster than a normal CL is a false expectation. Maybe it should put NOTRY=True, but I think we want the Revert button to be more generally useful to people other than Sheriffs too, so we don't want to do that by default.

- Landing a revert via git cl land took ~25 mins to actually get to origin/master. It landed in pending after a few minutes, but no one could actually pull it for quite some time. This added noise (people had to keep asking what was going on with X), and I could imagine it masking other failures on busier days. I think this is related to https://code.google.com/p/chromium/issues/detail?id=407429 but I'm not certain. Similarly, a few people complained that their CL had landed but "wasn't showing up" or similar, which meant we had to spend time time pulling/figuring out if something really was broken, etc.

Can you link to the CL that took 25 minutes to get from pending to master? We can look in the numbering daemon logs to see what the total cause of all that time was.

I'm also not sure what you mean by "it landed in pending after a few minutes"; "git cl land" will not return until the CL has successfully been pushed to pending, so it should have landed in pending "immediately". I'd like to know what made you think that wasn't the case.
 
- Using a local git revert dirties your tree, especially if it's not near the revision you're reverting (i.e. you need to rebuild a lot later). This is more disruptive than drover, especially if you're trying to repro another test failure, etc. locally at the same time. A shallow-clone-drover-ish-magic thing would be really helpful here.

This is a really good point. The fact that the old SVN drover didn't touch your local checkout had a lot of benefits, and not having to rebuild since file timestamps didn't change is one I think we failed to consider. As "man git-drover" says, we plan to implement a pure git drover, it just hasn't happened yet.
 
- I had to do some merging for 37 & 38, and had some difficulty figuring out what CLs where in what branches. This might just be me needing to adapt, or documenting/me reading documentation. Determining that data via a web interface rather than a local git clone is beneficial, again when your local boxes are busy trying to repro/fix something else. 

Can you explain what was difficult? This is exactly the case that the Cr-Commit-Position footer is supposed to help solve.

Also, note that you don't need to have a ref checked out in order to run 'git log' or 'git show' on it, so checking which CLs are in which branches shouldn't touch your actual checkout. 

The above 4 issues would make me nervous about sheriffing on busier day (e.g. closer to a branch point) for lack of being able to control the situation effectively. Being able to revert quickly (in the "handful of seconds" range) needs to go hand in hand with the effort to not close the tree.



- I didn't find Sheriff-o-matic usable as a primary tool yet. The biggest problem I had today was that the UI hangs (> 15s) when a large number of tests fail. That happened on some of the slooooow cycling lsan/asan bots so it took the UI out of commission for most of the afternoon. I believe Ojan is looking at that one. I also filed a number of other smaller bugs. Overall I still felt more effective using the old build.chromium.org waterfall.

- trungl-bot doesn't announce failures on IRC any more because they're not tree closings. This was generally how I got notified of what to do next when sheriffing, so I didn't notice one set of bots until after they'd been red for a bit.

It would be great to get another bot that watches sheriff-o-matic and posts about new failures in IRC. 

- There were some git server problems that took down Linux slaves for a while. https://code.google.com/p/chromium/issues/detail?id=410102 I think this was probably just an unusual transient event, but if it becomes a regular thing, we would want to monitor uptime on our side.

- I've noticed troopers generally aren't in #chromium any more. Is that intentional or just a team shift to elsewhere? It makes coordination a bit slower if you're relaying message back and forth with some on irc, some in IM, some in person, etc.

No clue what's up with this. I'm in IRC 100% of the time. To be fair, I'm *also* in Munich right now, so that won't help you much. But the current trooper should always be in IRC. Maybe this hasn't been made clear to the newest crop of troopers :) I'll make sure the message gets passed along.
 
- There was a build problem that wasn't noticed on any main waterfall bots or trybots, but affected local developer builds, LKGR builders, Official builders, and various other memory.fyi bots: https://code.google.com/p/chromium/issues/detail?id=409940

- There were still a "normal" number of flaky tests, even with our ongoing war :(. I think there were 3 that would have previously been tree closers. It wasn't clear "how flaky" they were, if they should be disabled, or if there was really something obscure that had caused them to become flaky. There are links on the waterfall to "Flakiness dashboard" for things that aren't archived there. (I'm not sure if that's new situation.).



(Sorry for the long rambling brain dump, M-A).

--
You received this message because you are subscribed to the Google Groups "Chromium Hackability Code Yellow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hackability-c...@chromium.org.
To post to this group, send email to hackabi...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/hackability-cy/CANHK6Rb1utOM67La3EsG_XcNFWe-MCM_yMzq%2BNmF%3D5pnjPbitQ%40mail.gmail.com.

Dominic Mazzoni

unread,
Sep 3, 2014, 11:57:45 AM9/3/14
to Aaron Gable, Scott Graham, hackability-cy, Philippe Beaudoin, tr...@chromium.org
On Wed, Sep 3, 2014 at 2:59 AM, Aaron Gable <aga...@chromium.org> wrote:
I think the "revert patchset" button isn't intended to be a drover replacement, as it explicitly is supposed to go through the CQ (whereas drover always committed directly). So expecting it to be any faster than a normal CL is a false expectation. Maybe it should put NOTRY=True, but I think we want the Revert button to be more generally useful to people other than Sheriffs too, so we don't want to do that by default.

http://crbug.com/384129 seems to imply that it should be adding NOTRY and TBR for recent changes, but I'm not sure how "recent" is defined.  Was it not doing that? Maybe reopen that bug.

In many cases I think the biggest issue is just that the UI is confusing. "Revert patchset" could mean a dozen different things. There should be text next to it that explains what it will do right now, if you press it. I filed http://crbug.com/410315 for that.

- Landing a revert via git cl land took ~25 mins to actually get to origin/master. It landed in pending after a few minutes, but no one could actually pull it for quite some time. This added noise (people had to keep asking what was going on with X), and I could imagine it masking other failures on busier days. I think this is related to https://code.google.com/p/chromium/issues/detail?id=407429 but I'm not certain. Similarly, a few people complained that their CL had landed but "wasn't showing up" or similar, which meant we had to spend time time pulling/figuring out if something really was broken, etc.

Can you link to the CL that took 25 minutes to get from pending to master? We can look in the numbering daemon logs to see what the total cause of all that time was.

The changes I landed with "git cl land" took 10+ minutes too. I chatted with Vadim and he said it's definitely 407429. It didn't sound like he needed any more debugging info, but he also doesn't really have a sense of how to fix it yet.

Should I be worried that as we approach the next branch point the project is going to grind to a halt?

I'm also not sure what you mean by "it landed in pending after a few minutes"; "git cl land" will not return until the CL has successfully been pushed to pending, so it should have landed in pending "immediately". I'd like to know what made you think that wasn't the case.

When I tried it, it said "Pushing commit to refs/pending/heads/master... It can take a while." and that line did stay up for several minutes. It wasn't "immediately". But the next major step (Waiting for commit to be landed on refs/heads/master...) took even longer.

- Using a local git revert dirties your tree, especially if it's not near the revision you're reverting (i.e. you need to rebuild a lot later). This is more disruptive than drover, especially if you're trying to repro another test failure, etc. locally at the same time. A shallow-clone-drover-ish-magic thing would be really helpful here.

This is a really good point. The fact that the old SVN drover didn't touch your local checkout had a lot of benefits, and not having to rebuild since file timestamps didn't change is one I think we failed to consider. As "man git-drover" says, we plan to implement a pure git drover, it just hasn't happened yet.

Is it even going to be possible to have pure-git drover that runs on the client side? A sparse checkout could speed up the revert, but I don't see how you could avoid the initial fetch, which means you need something like 10 G free plus a fast net connection?

My solution so far is to have multiple clones of the whole repository. When sheriffing I keep one free just for doing reverts, but in general I keep a checkout free for one-off work so it doesn't force me to keep rebuilding my main development branch.

On the plus side, once you pay the initial cost of cloning a new repository, "git revert" of a rather large change is much faster, and far more reliable, than svn+drover ever was. I'm starting to wonder if the best option for "git drover" will be to keep a complete checkout locally, exclusively for drovering. That'd require quite a lot of disk space, though!

Marc Treib

unread,
Sep 3, 2014, 12:07:44 PM9/3/14
to Dominic Mazzoni, Aaron Gable, Scott Graham, hackability-cy, Philippe Beaudoin
On Wed, Sep 3, 2014 at 5:57 PM, Dominic Mazzoni <dmaz...@chromium.org> wrote:
On Wed, Sep 3, 2014 at 2:59 AM, Aaron Gable <aga...@chromium.org> wrote:
I think the "revert patchset" button isn't intended to be a drover replacement, as it explicitly is supposed to go through the CQ (whereas drover always committed directly). So expecting it to be any faster than a normal CL is a false expectation. Maybe it should put NOTRY=True, but I think we want the Revert button to be more generally useful to people other than Sheriffs too, so we don't want to do that by default.

http://crbug.com/384129 seems to imply that it should be adding NOTRY and TBR for recent changes, but I'm not sure how "recent" is defined.  Was it not doing that? Maybe reopen that bug.

It did add NOTRY and TBR for me, on a CL that was a few (<5) hours old.
 
In many cases I think the biggest issue is just that the UI is confusing. "Revert patchset" could mean a dozen different things. There should be text next to it that explains what it will do right now, if you press it. I filed http://crbug.com/410315 for that.

+1

Scott Graham

unread,
Sep 3, 2014, 12:23:43 PM9/3/14
to Aaron Gable, hackability-cy, Dominic Mazzoni, Philippe Beaudoin, tr...@chromium.org
On Wed, Sep 3, 2014 at 2:59 AM, Aaron Gable <aga...@chromium.org> wrote:
On Wed Sep 03 2014 at 7:17:15 AM Scott Graham <sco...@chromium.org> wrote:
We got lucky a lucky shift because of Labo[u]r Day yesterday and I think people were likely on vacation today so the commit rate was unusually low.

I ran into a bunch of things that made the world not-so-hackable today, so I thought I'd mention them here. I'm not sure if these are all known and being addressed and/or require any postmortems. I filed bugs for some. Maybe my co-sheriffs noticed other things or had better experiences.


- Committing via the "Revert patchset" on Rietveld was not useful. I tried it twice. The first time the revert took longer than 45m, and it didn't land: https://code.google.com/p/chromium/issues/detail?id=409966 . The second time it worked because I de-starved it by having the tree closed. Presumably this isn't what we want.

I think the "revert patchset" button isn't intended to be a drover replacement, as it explicitly is supposed to go through the CQ (whereas drover always committed directly). So expecting it to be any faster than a normal CL is a false expectation. Maybe it should put NOTRY=True, but I think we want the Revert button to be more generally useful to people other than Sheriffs too, so we don't want to do that by default.

It added NOTRY and NOTREESTATUS. This was the CL: https://codereview.chromium.org/531133002/

It's also offered as the first option in the doc the automated sheriffing email sent me to https://docs.google.com/a/chromium.org/document/d/14e0wbAGDbFm5wUnmysE5jonzxbgT2VMOOFWc-WynCQQ/edit so we should decide if it's a sheriff-appropriate button or not, either way.
 

- Landing a revert via git cl land took ~25 mins to actually get to origin/master. It landed in pending after a few minutes, but no one could actually pull it for quite some time. This added noise (people had to keep asking what was going on with X), and I could imagine it masking other failures on busier days. I think this is related to https://code.google.com/p/chromium/issues/detail?id=407429 but I'm not certain. Similarly, a few people complained that their CL had landed but "wasn't showing up" or similar, which meant we had to spend time time pulling/figuring out if something really was broken, etc.

Can you link to the CL that took 25 minutes to get from pending to master? We can look in the numbering daemon logs to see what the total cause of all that time was.


That reminds me of another thing that caused some confusion yesterday too, the link added there: "Committed to pending queue: https://chromium.googlesource.com/chromium/src/+/ecdbf7a". I understand what it's doing, but it's a bit odd that we tagged this CL with information doesn't end up being useful later. i.e. if I try to use it now I get

[a6eeda8...]d:\src\cr3\src>git log ecdbf7a
fatal: ambiguous argument 'ecdbf7a': unknown revision or path not in the working tree.
Use '--' to separate paths from revisions, like this:
'git <command> [<revision>...] -- [<file>...]'

[a6eeda8...]d:\src\cr3\src>git log ecdbf7a5919f5728d23443113e5660f297d6812f
fatal: bad object ecdbf7a5919f5728d23443113e5660f297d6812f

It seems like a lot of this wackiness is due to the Cr-Commit-Position. I'm sure I missed copious discussion on this feature. Is pending planed for things other than Cr-Commit-Position? Writing tools that answer whatever questions the Position is intended to answer instead, would keep us closer to vanilla git and so be better for speed, complexity, etc.
 

I'm also not sure what you mean by "it landed in pending after a few minutes"; "git cl land" will not return until the CL has successfully been pushed to pending, so it should have landed in pending "immediately". I'd like to know what made you think that wasn't the case.

I meant it took a few minutes to actually complete the git cl land command (the "it can take a while" thing), but yes it's in pending right away once git cl land finishes.
 
 
- Using a local git revert dirties your tree, especially if it's not near the revision you're reverting (i.e. you need to rebuild a lot later). This is more disruptive than drover, especially if you're trying to repro another test failure, etc. locally at the same time. A shallow-clone-drover-ish-magic thing would be really helpful here.

This is a really good point. The fact that the old SVN drover didn't touch your local checkout had a lot of benefits, and not having to rebuild since file timestamps didn't change is one I think we failed to consider. As "man git-drover" says, we plan to implement a pure git drover, it just hasn't happened yet.

Yes! I took a look at your CL in progress yesterday but didn't have time to try to push anything along. I would be happy to help with that tool if you're tied up with all the other migration stuff. But anyway, agreed.
 
 
- I had to do some merging for 37 & 38, and had some difficulty figuring out what CLs where in what branches. This might just be me needing to adapt, or documenting/me reading documentation. Determining that data via a web interface rather than a local git clone is beneficial, again when your local boxes are busy trying to repro/fix something else. 

Can you explain what was difficult? This is exactly the case that the Cr-Commit-Position footer is supposed to help solve.

Also, note that you don't need to have a ref checked out in order to run 'git log' or 'git show' on it, so checking which CLs are in which branches shouldn't touch your actual checkout. 

The above 4 issues would make me nervous about sheriffing on busier day (e.g. closer to a branch point) for lack of being able to control the situation effectively. Being able to revert quickly (in the "handful of seconds" range) needs to go hand in hand with the effort to not close the tree.



- I didn't find Sheriff-o-matic usable as a primary tool yet. The biggest problem I had today was that the UI hangs (> 15s) when a large number of tests fail. That happened on some of the slooooow cycling lsan/asan bots so it took the UI out of commission for most of the afternoon. I believe Ojan is looking at that one. I also filed a number of other smaller bugs. Overall I still felt more effective using the old build.chromium.org waterfall.

- trungl-bot doesn't announce failures on IRC any more because they're not tree closings. This was generally how I got notified of what to do next when sheriffing, so I didn't notice one set of bots until after they'd been red for a bit.

It would be great to get another bot that watches sheriff-o-matic and posts about new failures in IRC. 

Is sheriff-o-matic intended to be a data source?
 

- There were some git server problems that took down Linux slaves for a while. https://code.google.com/p/chromium/issues/detail?id=410102 I think this was probably just an unusual transient event, but if it becomes a regular thing, we would want to monitor uptime on our side.

- I've noticed troopers generally aren't in #chromium any more. Is that intentional or just a team shift to elsewhere? It makes coordination a bit slower if you're relaying message back and forth with some on irc, some in IM, some in person, etc.

No clue what's up with this. I'm in IRC 100% of the time. To be fair, I'm *also* in Munich right now, so that won't help you much. But the current trooper should always be in IRC. Maybe this hasn't been made clear to the newest crop of troopers :) I'll make sure the message gets passed along.

Great!

Brett Wilson

unread,
Sep 3, 2014, 12:29:23 PM9/3/14
to Aaron Gable, Scott Graham, hackability-cy, Dominic Mazzoni, Philippe Beaudoin, tr...@chromium.org
On Wed, Sep 3, 2014 at 2:59 AM, Aaron Gable <aga...@chromium.org> wrote:
> On Wed Sep 03 2014 at 7:17:15 AM Scott Graham <sco...@chromium.org> wrote:
>>
>> We got lucky a lucky shift because of Labo[u]r Day yesterday and I think
>> people were likely on vacation today so the commit rate was unusually low.
>>
>> I ran into a bunch of things that made the world not-so-hackable today, so
>> I thought I'd mention them here. I'm not sure if these are all known and
>> being addressed and/or require any postmortems. I filed bugs for some. Maybe
>> my co-sheriffs noticed other things or had better experiences.
>>
>>
>> - Committing via the "Revert patchset" on Rietveld was not useful. I tried
>> it twice. The first time the revert took longer than 45m, and it didn't
>> land: https://code.google.com/p/chromium/issues/detail?id=409966 . The
>> second time it worked because I de-starved it by having the tree closed.
>> Presumably this isn't what we want.
>
>
> I think the "revert patchset" button isn't intended to be a drover
> replacement, as it explicitly is supposed to go through the CQ (whereas
> drover always committed directly). So expecting it to be any faster than a
> normal CL is a false expectation. Maybe it should put NOTRY=True, but I
> think we want the Revert button to be more generally useful to people other
> than Sheriffs too, so we don't want to do that by default.

What is the intended primary use of the "Revert patchset" button then?
The communication I've seen is that it is how sheriffs revert things
instead of drover now. I believe most reverts are from sheriffs trying
to green the tree. What did you have in mind here?

From time to time a TPM will ask somebody to revert something if it's
causing problems on the dev channel or something. But this is much
more rare. And I'm not even sure the current behavior does what I want
in that case anyway. For that case, I think a "Create revert issue"
that doesn't submit it to the queue (so I could manually try it to
control landing, or manually patch it into my tree for testing or
rebasing) would be much more useful.

Brett

Dominic Mazzoni

unread,
Sep 3, 2014, 12:38:37 PM9/3/14
to Brett Wilson, Aaron Gable, Scott Graham, hackability-cy, Philippe Beaudoin, Marc Treib
On Wed, Sep 3, 2014 at 9:29 AM, Brett Wilson <bre...@chromium.org> wrote:
in that case anyway. For that case, I think a "Create revert issue"
that doesn't submit it to the queue (so I could manually try it to
control landing, or manually patch it into my tree for testing or
rebasing) would be much more useful.

Stefan Zager

unread,
Sep 3, 2014, 1:43:58 PM9/3/14
to Brett Wilson, Aaron Gable, Scott Graham, hackability-cy, Dominic Mazzoni, Philippe Beaudoin, tr...@chromium.org
On Wed, Sep 3, 2014 at 9:29 AM, Brett Wilson <bre...@chromium.org> wrote:

What is the intended primary use of the "Revert patchset" button then?
The communication I've seen is that it is how sheriffs revert things
instead of drover now. I believe most reverts are from sheriffs trying
to green the tree. What did you have in mind here?

I don't actually know the official policy (if there is one), but I think it might help clarify the issue to point out that "Revert patchset" (and the CQ generally) ultimately terminate with a call to 'git cl land'.  So, if you're in a hurry to get something committed, running 'git cl land' on your local machine is just a shortcut to the same operation.

If it takes multiple minutes for a commit to appear in the main tree after running 'git cl land', that is still unfortunately the shortest path to landing a change.  We'll be working hard to get the lag time down.

I'd also point out that there are some clever and obscure ways to get drover-like functionality without touching your working checkout, and without running 'git clone'.  git-drover will probably be the most convenient way to do it, but maybe we should write up some docs for doing this in a 'pure git' way, for those who are interested.

Stefan

John Abd-El-Malek

unread,
Sep 3, 2014, 1:50:21 PM9/3/14
to Stefan Zager, Brett Wilson, Aaron Gable, Scott Graham, hackability-cy, Dominic Mazzoni, Philippe Beaudoin, tr...@chromium.org
On Wed, Sep 3, 2014 at 10:43 AM, Stefan Zager <sza...@chromium.org> wrote:
On Wed, Sep 3, 2014 at 9:29 AM, Brett Wilson <bre...@chromium.org> wrote:

What is the intended primary use of the "Revert patchset" button then?
The communication I've seen is that it is how sheriffs revert things
instead of drover now. I believe most reverts are from sheriffs trying
to green the tree. What did you have in mind here?

I don't actually know the official policy (if there is one), but I think it might help clarify the issue to point out that "Revert patchset" (and the CQ generally) ultimately terminate with a call to 'git cl land'.  So, if you're in a hurry to get something committed, running 'git cl land' on your local machine is just a shortcut to the same operation.

I think this misses the point that before we could click the button and have the cl be committed in the svn repo pretty instantaneously.  The experience of sheriffing is that it's very convenient to be able to click a button a cl and have it be reverted.

It seems that if this flow is really slow now, that is a regression and we should prioritize fixing it as part of the git migration.


If it takes multiple minutes for a commit to appear in the main tree after running 'git cl land', that is still unfortunately the shortest path to landing a change.  We'll be working hard to get the lag time down.

I'd also point out that there are some clever and obscure ways to get drover-like functionality without touching your working checkout, and without running 'git clone'.  git-drover will probably be the most convenient way to do it, but maybe we should write up some docs for doing this in a 'pure git' way, for those who are interested.

Stefan

--
You received this message because you are subscribed to the Google Groups "Chromium Hackability Code Yellow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hackability-c...@chromium.org.
To post to this group, send email to hackabi...@chromium.org.

Dominic Mazzoni

unread,
Sep 3, 2014, 1:52:37 PM9/3/14
to Stefan Zager, Brett Wilson, Aaron Gable, Scott Graham, hackability-cy, Philippe Beaudoin, tr...@chromium.org
On Wed, Sep 3, 2014 at 10:43 AM, Stefan Zager <sza...@chromium.org> wrote:
I'd also point out that there are some clever and obscure ways to get drover-like functionality without touching your working checkout, and without running 'git clone'.  git-drover will probably be the most convenient way to do it, but maybe we should write up some docs for doing this in a 'pure git' way, for those who are interested.

+1, would love to see this - could you give a hint or maybe try it interactively in a terminal and paste it in?

Stefan Zager

unread,
Sep 3, 2014, 1:56:28 PM9/3/14
to John Abd-El-Malek, Stefan Zager, Brett Wilson, Aaron Gable, Scott Graham, hackability-cy, Dominic Mazzoni, Philippe Beaudoin, tr...@chromium.org
On Wed, Sep 3, 2014 at 10:50 AM, John Abd-El-Malek <j...@chromium.org> wrote:

On Wed, Sep 3, 2014 at 10:43 AM, Stefan Zager <sza...@chromium.org> wrote:

I don't actually know the official policy (if there is one), but I think it might help clarify the issue to point out that "Revert patchset" (and the CQ generally) ultimately terminate with a call to 'git cl land'.  So, if you're in a hurry to get something committed, running 'git cl land' on your local machine is just a shortcut to the same operation.

I think this misses the point that before we could click the button and have the cl be committed in the svn repo pretty instantaneously.  The experience of sheriffing is that it's very convenient to be able to click a button a cl and have it be reverted.

It seems that if this flow is really slow now, that is a regression and we should prioritize fixing it as part of the git migration.

Yes, I strongly agree with this.  I just wanted to point out that for the time being, if the current "Revert patchset" button is not working satisfactorily, or if it has mysterious behavior, sheriffs might want to run 'git cl land' directly.

We should get "Revert patchset" working well ASAP.

Stefan

Dirk Pranke

unread,
Sep 3, 2014, 4:56:16 PM9/3/14
to Dominic Mazzoni, Stefan Zager, Brett Wilson, Aaron Gable, Scott Graham, hackability-cy, Philippe Beaudoin, tr...@chromium.org
I would also be interested.

-- Dirk

v...@google.com

unread,
Sep 3, 2014, 8:53:05 PM9/3/14
to hackabi...@chromium.org, aga...@chromium.org, dmaz...@chromium.org, beau...@google.com, tr...@chromium.org
Anyone? Is there an API?

Eric Seidel

unread,
Sep 4, 2014, 12:05:35 AM9/4/14
to Viet-Trung Luu, hackability-cy, aga...@chromium.org, Dominic Mazzoni, beau...@google.com, tr...@chromium.org
sheriff-o-matic's doesn't do any computation. It's just a memcache
server for json computed by builder_alerts.

The "API" for the data is:
sheriff-o-matic.appspot.com/alerts

Here is the tool which does the builder crawl to produce said json:
https://chromium.googlesource.com/infra/infra.git/+/master/infra/tools/builder_alerts/
> To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/hackability-cy/d0dd6077-48f5-46a8-82a9-cbd2a01b9c67%40chromium.org.

Viet-Trung Luu

unread,
Sep 4, 2014, 11:20:21 AM9/4/14
to Eric Seidel, hackability-cy, aga...@chromium.org, Dominic Mazzoni, beau...@google.com, tr...@chromium.org
On Wed, Sep 3, 2014 at 9:05 PM, Eric Seidel <ese...@chromium.org> wrote:
sheriff-o-matic's doesn't do any computation.  It's just a memcache
server for json computed by builder_alerts.

The "API" for the data is:
sheriff-o-matic.appspot.com/alerts

Here is the tool which does the builder crawl to produce said json:
https://chromium.googlesource.com/infra/infra.git/+/master/infra/tools/builder_alerts/


Is there anything like a schema for this output?

Eric Seidel

unread,
Sep 4, 2014, 12:33:09 PM9/4/14
to Viet-Trung Luu, hackability-cy, aga...@chromium.org, Dominic Mazzoni, beau...@google.com, tr...@chromium.org
I'm not really sure what a schema would look like.

There is some documentation in the code. Happy to LGTM patches to add more!

Normally when understanding json output, I just keep using dict.keys()
in python until I understand it. :)

Dirk Pranke

unread,
Sep 4, 2014, 5:08:09 PM9/4/14
to Eric Seidel, Viet-Trung Luu, hackability-cy, aga...@chromium.org, Dominic Mazzoni, Philippe Beaudoin, tr...@chromium.org
I would imagine documentation for the format to look something like the documentation I wrote for the json results format:


Or you could, you know, actually use something like a JSON schema:


You could do something less fancy as well, of course :).

-- Dirk


Ojan Vafai

unread,
Sep 10, 2014, 1:47:05 AM9/10/14
to Dirk Pranke, Eric Seidel, Viet-Trung Luu, hackability-cy, aga...@chromium.org, Dominic Mazzoni, Philippe Beaudoin, tr...@chromium.org
There's no documentation for this json. vtl, lets chat offline and find a solution for this? Filed crbug.com/412647.

FWIW, the big hangs Scott saw in sheriff-o-matic are mostly addressed. It was totally unusable that day. We just had never done any perf testing and having 7600 failures is rare enough that we had never hit it. We had a silly n^2 in the code. 7600^2 is kind of large. :) I need to do a bit more perf testing to make sure that we really scale well enough now.

>> >> - There were still a "normal" number of flaky tests, even with our
>> >> ongoing war :(. I think there were 3 that would have previously been tree
>> >> closers. It wasn't clear "how flaky" they were, if they should be disabled,
>> >> or if there was really something obscure that had caused them to become
>> >> flaky. There are links on the waterfall to "Flakiness dashboard" for things
>> >> that aren't archived there. (I'm not sure if that's new situation.).

We're working on fixing this as a high priority for the CY (crbug.com/379005). This is the data that feeds into both sheriff-o-matic and the flakiness dashboard, so we really need to get it working for as many test types and bots as we can. We're also heading in the direction of adding monitoring for this so that it stops regressing all the time (starting with crbug.com/401353). We're making progress here, but we really could use one more person helping to address these issues.
Reply all
Reply to author
Forward
0 new messages