We got a lucky shift because of Labo[u]r Day yesterday, and I think people were likely on vacation today, so the commit rate was unusually low.
I ran into a bunch of things that made the world not-so-hackable today, so I thought I'd mention them here. I'm not sure if these are all known and being addressed and/or require any postmortems. I filed bugs for some. Maybe my co-sheriffs noticed other things or had better experiences.
- Committing via the "Revert patchset" on Rietveld was not useful. I tried it twice. The first time the revert took longer than 45m, and it didn't land: https://code.google.com/p/chromium/issues/detail?id=409966 . The second time it worked because I de-starved it by having the tree closed. Presumably this isn't what we want.
- Landing a revert via git cl land took ~25 mins to actually get to origin/master. It landed in pending after a few minutes, but no one could actually pull it for quite some time. This added noise (people had to keep asking what was going on with X), and I could imagine it masking other failures on busier days. I think this is related to https://code.google.com/p/chromium/issues/detail?id=407429 but I'm not certain. Similarly, a few people complained that their CL had landed but "wasn't showing up" or similar, which meant we had to spend time pulling/figuring out if something really was broken, etc.
- Using a local git revert dirties your tree, especially if it's not near the revision you're reverting (i.e. you need to rebuild a lot later). This is more disruptive than drover, especially if you're trying to repro another test failure, etc. locally at the same time. A shallow-clone-drover-ish-magic thing would be really helpful here.
- I had to do some merging for 37 & 38, and had some difficulty figuring out what CLs were in what branches. This might just be me needing to adapt, or documenting/me reading documentation. Determining that data via a web interface rather than a local git clone is beneficial, again when your local boxes are busy trying to repro/fix something else.
The above 4 issues would make me nervous about sheriffing on a busier day (e.g. closer to a branch point) for lack of being able to control the situation effectively. Being able to revert quickly (in the "handful of seconds" range) needs to go hand in hand with the effort to not close the tree.
- I didn't find Sheriff-o-matic usable as a primary tool yet. The biggest problem I had today was that the UI hangs (> 15s) when a large number of tests fail. That happened on some of the slooooow cycling lsan/asan bots so it took the UI out of commission for most of the afternoon. I believe Ojan is looking at that one. I also filed a number of other smaller bugs. Overall I still felt more effective using the old build.chromium.org waterfall.
- trungl-bot doesn't announce failures on IRC any more because they're not tree closings. This was generally how I got notified of what to do next when sheriffing, so I didn't notice one set of bots until after they'd been red for a bit.
- There were some git server problems that took down Linux slaves for a while. https://code.google.com/p/chromium/issues/detail?id=410102 I think this was probably just an unusual transient event, but if it becomes a regular thing, we would want to monitor uptime on our side.
- I've noticed troopers generally aren't in #chromium any more. Is that intentional or just a team shift to elsewhere? It makes coordination a bit slower if you're relaying messages back and forth with some on IRC, some in IM, some in person, etc.
- There was a build problem that wasn't noticed on any main waterfall bots or trybots, but affected local developer builds, LKGR builders, Official builders, and various other memory.fyi bots: https://code.google.com/p/chromium/issues/detail?id=409940
- There were still a "normal" number of flaky tests, even with our ongoing war :(. I think there were 3 that would have previously been tree closers. It wasn't clear "how flaky" they were, if they should be disabled, or if there was really something obscure that had caused them to become flaky. There are links on the waterfall to "Flakiness dashboard" for things that aren't archived there. (I'm not sure if that's a new situation.)
(Sorry for the long rambling brain dump, M-A).
--
You received this message because you are subscribed to the Google Groups "Chromium Hackability Code Yellow" group.
To unsubscribe from this group and stop receiving emails from it, send an email to hackability-c...@chromium.org.
To post to this group, send email to hackabi...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/hackability-cy/CANHK6Rb1utOM67La3EsG_XcNFWe-MCM_yMzq%2BNmF%3D5pnjPbitQ%40mail.gmail.com.
On Wed, Sep 3, 2014 at 2:59 AM, Aaron Gable <aga...@chromium.org> wrote:
I think the "revert patchset" button isn't intended to be a drover replacement, as it explicitly is supposed to go through the CQ (whereas drover always committed directly). So expecting it to be any faster than a normal CL is a false expectation. Maybe it should put NOTRY=True, but I think we want the Revert button to be more generally useful to people other than Sheriffs too, so we don't want to do that by default.
http://crbug.com/384129 seems to imply that it should be adding NOTRY and TBR for recent changes, but I'm not sure how "recent" is defined. Was it not doing that? Maybe reopen that bug.
In many cases I think the biggest issue is just that the UI is confusing. "Revert patchset" could mean a dozen different things. There should be text next to it that explains what it will do right now, if you press it. I filed http://crbug.com/410315 for that.
On Wed Sep 03 2014 at 7:17:15 AM Scott Graham <sco...@chromium.org> wrote:
We got a lucky shift because of Labo[u]r Day yesterday, and I think people were likely on vacation today, so the commit rate was unusually low.
I ran into a bunch of things that made the world not-so-hackable today, so I thought I'd mention them here. I'm not sure if these are all known and being addressed and/or require any postmortems. I filed bugs for some. Maybe my co-sheriffs noticed other things or had better experiences.
- Committing via the "Revert patchset" on Rietveld was not useful. I tried it twice. The first time the revert took longer than 45m, and it didn't land: https://code.google.com/p/chromium/issues/detail?id=409966 . The second time it worked because I de-starved it by having the tree closed. Presumably this isn't what we want.
I think the "revert patchset" button isn't intended to be a drover replacement, as it explicitly is supposed to go through the CQ (whereas drover always committed directly). So expecting it to be any faster than a normal CL is a false expectation. Maybe it should put NOTRY=True, but I think we want the Revert button to be more generally useful to people other than Sheriffs too, so we don't want to do that by default.
- Landing a revert via git cl land took ~25 mins to actually get to origin/master. It landed in pending after a few minutes, but no one could actually pull it for quite some time. This added noise (people had to keep asking what was going on with X), and I could imagine it masking other failures on busier days. I think this is related to https://code.google.com/p/chromium/issues/detail?id=407429 but I'm not certain. Similarly, a few people complained that their CL had landed but "wasn't showing up" or similar, which meant we had to spend time pulling/figuring out if something really was broken, etc.
Can you link to the CL that took 25 minutes to get from pending to master? We can look in the numbering daemon logs to see what the total cause of all that time was.
I'm also not sure what you mean by "it landed in pending after a few minutes"; "git cl land" will not return until the CL has successfully been pushed to pending, so it should have landed in pending "immediately". I'd like to know what made you think that wasn't the case.
- Using a local git revert dirties your tree, especially if it's not near the revision you're reverting (i.e. you need to rebuild a lot later). This is more disruptive than drover, especially if you're trying to repro another test failure, etc. locally at the same time. A shallow-clone-drover-ish-magic thing would be really helpful here.
This is a really good point. The fact that the old SVN drover didn't touch your local checkout had a lot of benefits, and not having to rebuild since file timestamps didn't change is one I think we failed to consider. As "man git-drover" says, we plan to implement a pure git drover, it just hasn't happened yet.
- I had to do some merging for 37 & 38, and had some difficulty figuring out what CLs were in what branches. This might just be me needing to adapt, or documenting/me reading documentation. Determining that data via a web interface rather than a local git clone is beneficial, again when your local boxes are busy trying to repro/fix something else.
Can you explain what was difficult? This is exactly the case that the Cr-Commit-Position footer is supposed to help solve.
Also, note that you don't need to have a ref checked out in order to run 'git log' or 'git show' on it, so checking which CLs are in which branches shouldn't touch your actual checkout.
The above 4 issues would make me nervous about sheriffing on a busier day (e.g. closer to a branch point) for lack of being able to control the situation effectively. Being able to revert quickly (in the "handful of seconds" range) needs to go hand in hand with the effort to not close the tree.
- I didn't find Sheriff-o-matic usable as a primary tool yet. The biggest problem I had today was that the UI hangs (> 15s) when a large number of tests fail. That happened on some of the slooooow cycling lsan/asan bots so it took the UI out of commission for most of the afternoon. I believe Ojan is looking at that one. I also filed a number of other smaller bugs. Overall I still felt more effective using the old build.chromium.org waterfall.
- trungl-bot doesn't announce failures on IRC any more because they're not tree closings. This was generally how I got notified of what to do next when sheriffing, so I didn't notice one set of bots until after they'd been red for a bit.
It would be great to get another bot that watches sheriff-o-matic and posts about new failures in IRC.
- There were some git server problems that took down Linux slaves for a while. https://code.google.com/p/chromium/issues/detail?id=410102 I think this was probably just an unusual transient event, but if it becomes a regular thing, we would want to monitor uptime on our side.
- I've noticed troopers generally aren't in #chromium any more. Is that intentional or just a team shift to elsewhere? It makes coordination a bit slower if you're relaying messages back and forth with some on IRC, some in IM, some in person, etc.
No clue what's up with this. I'm in IRC 100% of the time. To be fair, I'm *also* in Munich right now, so that won't help you much. But the current trooper should always be in IRC. Maybe this hasn't been made clear to the newest crop of troopers :) I'll make sure the message gets passed along.
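As an illustration of the Cr-Commit-Position footer mentioned in the reply about branches, here is a minimal Python sketch that pulls the ref and number out of a commit message. It assumes the footer looks like `Cr-Commit-Position: refs/heads/master@{#292345}`; that exact format, the sample message, and the names `CommitPosition` and `parse_commit_position` are assumptions made for illustration and are not stated in this thread.

```python
import re
from typing import NamedTuple, Optional

# Assumed footer shape: "Cr-Commit-Position: <ref>@{#<number>}" on its own line.
FOOTER_RE = re.compile(
    r"^Cr-Commit-Position:[ \t]*(?P<ref>\S+)@\{#(?P<number>\d+)\}[ \t]*$",
    re.MULTILINE,
)


class CommitPosition(NamedTuple):
    """Hypothetical container for the two parts of the footer."""
    ref: str
    number: int


def parse_commit_position(message: str) -> Optional[CommitPosition]:
    """Return the last Cr-Commit-Position footer in `message`, or None."""
    matches = list(FOOTER_RE.finditer(message))
    if not matches:
        return None
    last = matches[-1]  # footers sit at the end, so prefer the last match
    return CommitPosition(last.group("ref"), int(last.group("number")))


if __name__ == "__main__":
    sample = (
        "Fix a flaky test.\n"
        "\n"
        "Cr-Commit-Position: refs/heads/master@{#292345}\n"
    )
    print(parse_commit_position(sample))
```

The message text could come from something like `git log -1 --format=%B <ref>`, which reads a ref without checking it out, in line with the point above about 'git log' and 'git show'.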
in that case anyway. For that case, I think a "Create revert issue" that doesn't submit it to the queue (so I could manually try it to
control landing, or manually patch it into my tree for testing or
rebasing) would be much more useful.
What is the intended primary use of the "Revert patchset" button then?
The communication I've seen is that it is how sheriffs revert things
instead of drover now. I believe most reverts are from sheriffs trying
to green the tree. What did you have in mind here?
On Wed, Sep 3, 2014 at 9:29 AM, Brett Wilson <bre...@chromium.org> wrote:
What is the intended primary use of the "Revert patchset" button then?
The communication I've seen is that it is how sheriffs revert things
instead of drover now. I believe most reverts are from sheriffs trying
to green the tree. What did you have in mind here?
I don't actually know the official policy (if there is one), but I think it might help clarify the issue to point out that "Revert patchset" (and the CQ generally) ultimately terminate with a call to 'git cl land'. So, if you're in a hurry to get something committed, running 'git cl land' on your local machine is just a shortcut to the same operation.
If it takes multiple minutes for a commit to appear in the main tree after running 'git cl land', that is still unfortunately the shortest path to landing a change. We'll be working hard to get the lag time down.
I'd also point out that there are some clever and obscure ways to get drover-like functionality without touching your working checkout, and without running 'git clone'. git-drover will probably be the most convenient way to do it, but maybe we should write up some docs for doing this in a 'pure git' way, for those who are interested.
Stefan
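One possible 'pure git' approach, sketched in Python below, is to build the revert commit with git plumbing commands against a throwaway index, so that HEAD, the real index, and the files on disk are never touched. This is only a guess at the kind of technique meant above, not a description of git-drover. The function names `run_git` and `revert_without_checkout` are hypothetical, the sketch handles only non-merge commits whose reverse patch applies cleanly, and publishing the resulting commit is left out.

```python
import os
import subprocess
import tempfile


def run_git(repo, args, env=None, stdin=None):
    """Run git inside `repo` and return its raw stdout as bytes."""
    done = subprocess.run(
        ["git", "-C", repo, *args],
        input=stdin, env=env, check=True, stdout=subprocess.PIPE,
    )
    return done.stdout


def revert_without_checkout(repo, bad_commit, base_ref):
    """Create a commit on top of `base_ref` that undoes `bad_commit`.

    Returns the new commit hash. Nothing is checked out and no ref is
    moved; the caller decides what to do with the commit afterwards.
    """
    with tempfile.TemporaryDirectory() as scratch:
        # Point git at a scratch index so the real one is never modified.
        env = dict(os.environ, GIT_INDEX_FILE=os.path.join(scratch, "index"))
        run_git(repo, ["read-tree", base_ref], env=env)
        # Diffing from the bad commit back to its parent yields the reverse patch.
        patch = run_git(repo, ["diff", "--binary", bad_commit, bad_commit + "^"])
        # --cached applies the patch to the (scratch) index only.
        run_git(repo, ["apply", "--cached"], env=env, stdin=patch)
        tree = run_git(repo, ["write-tree"], env=env).decode().strip()
        message = "Revert " + bad_commit
        commit = run_git(
            repo, ["commit-tree", tree, "-p", base_ref, "-m", message], env=env
        )
        return commit.decode().strip()
```

Because file timestamps in the working tree do not change, this avoids the rebuild cost raised earlier in the thread, at the price of not being able to build or test the revert locally before publishing it.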
On Wed, Sep 3, 2014 at 10:43 AM, Stefan Zager <sza...@chromium.org> wrote:
I don't actually know the official policy (if there is one), but I think it might help clarify the issue to point out that "Revert patchset" (and the CQ generally) ultimately terminate with a call to 'git cl land'. So, if you're in a hurry to get something committed, running 'git cl land' on your local machine is just a shortcut to the same operation.
I think this misses the point that before, we could click the button and have the CL committed in the SVN repo pretty much instantaneously. The experience of sheriffing is that it's very convenient to be able to click a button on a CL and have it reverted.
It seems that if this flow is really slow now, that is a regression and we should prioritize fixing it as part of the git migration.
sheriff-o-matic doesn't do any computation. It's just a memcache
server for json computed by builder_alerts.
The "API" for the data is:
sheriff-o-matic.appspot.com/alerts
Here is the tool which does the builder crawl to produce said json:
https://chromium.googlesource.com/infra/infra.git/+/master/infra/tools/builder_alerts/
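For the idea raised earlier in the thread of a bot that watches sheriff-o-matic and posts new failures to IRC, the core step is comparing two snapshots of the alerts JSON. The Python sketch below shows only that step. The thread does not describe the JSON layout, so the field names `builder`, `step`, and `test`, the helper names, and the message format are all hypothetical; fetching the JSON and talking to IRC are left out.

```python
from typing import Dict, Iterable, List, Tuple

Alert = Dict[str, str]  # hypothetical shape: {"builder": ..., "step": ..., "test": ...}


def alert_id(alert: Alert) -> Tuple[str, str, str]:
    """Identify an alert by builder, step, and test (assumed field names)."""
    return (alert.get("builder", ""), alert.get("step", ""), alert.get("test", ""))


def new_alerts(previous: Iterable[Alert], current: Iterable[Alert]) -> List[Alert]:
    """Return alerts present in `current` but absent from `previous`."""
    seen = {alert_id(a) for a in previous}
    return [a for a in current if alert_id(a) not in seen]


def format_irc_line(alert: Alert) -> str:
    """Render one alert as a single line a bot could post to a channel."""
    return "NEW FAILURE: {} / {} / {}".format(*alert_id(alert))


if __name__ == "__main__":
    before = [{"builder": "Linux ASan", "step": "browser_tests", "test": "Foo.Bar"}]
    after = before + [{"builder": "Mac", "step": "unit_tests", "test": "Baz.Qux"}]
    for alert in new_alerts(before, after):
        print(format_irc_line(alert))
```

Announcing only the difference between polls keeps the channel quiet while a known failure stays red, which is roughly the notification behaviour described for trungl-bot.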