On Tue, Nov 24, 2015 at 3:19 AM, Adrian Kuegel <aku...@chromium.org> wrote:

It seems the problem in this case was that the Sheriffs didn't realize how badly these flaky tests were affecting the Tryserver. I added a link to our Monitoring Page, with a short explanation, to https://sites.google.com/a/chromium.org/dev/developers/tree-sheriffs/sheriffing-bug-queues. We are also considering linking to the monitoring page from Sheriff-o-matic directly.

In any case, for this specific incident our tooling already worked: the monitoring pipeline alerted us that the success rate of linux_chromium_chromeos_rel_ng had dropped (also clearly visible in the builder success rate graph), automatic flakiness bugs were filed, and we got alerts from buildbucket. What didn't work: the trooper didn't notify the Sheriffs about the alerts, so the Sheriff only saw the flakiness bugs, and probably didn't know how to investigate the effect of that flakiness. Also, the instructions for Sheriffs regarding flakiness just said to disable tests after "a couple of hours", which is maybe not clear enough. I added:

"If a builder has dropped to a low success rate because of flaky tests, those tests should be disabled as soon as possible."
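As a concrete reference for what "disabling" means here: in gtest-based suites like Chromium's, a flaky test is typically disabled by prefixing its name with DISABLED_, together with a comment pointing at the tracking bug. A minimal sketch (the suite and test names below are illustrative, not the actual flaky downloads tests from this incident):

```cpp
#include <gtest/gtest.h>

// Flaky; see crbug.com/560329. Re-enable once the culprit CL is found and
// reverted, or the flakiness is otherwise diagnosed.
TEST(DownloadTest, DISABLED_ResumesInterruptedDownload) {
  // Test body left unchanged. gtest compiles but skips any test whose name
  // starts with DISABLED_; it can still be run manually with
  // --gtest_also_run_disabled_tests.
}
```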
On Tue, Nov 24, 2015 at 2:07 PM, Julie Parent <jpa...@chromium.org> wrote:

I don't think it makes sense to ask sheriffs to look at the viceroy graphs. Sheriffs use s-o-m as a one-stop shop.
On Tue, Nov 24, 2015 at 10:52 PM, Kenneth Russell <k...@chromium.org> wrote:

On Tue, Nov 24, 2015 at 3:19 AM, Adrian Kuegel <aku...@chromium.org> wrote:
> [...] "If a builder has dropped to a low success rate because of flaky tests, those tests should be disabled as soon as possible."

If previously-reliable tests suddenly become flaky, then investigation should be done immediately to see whether a recently committed CL is likely the cause, and if one is found, it should be reverted.
On Wed, Nov 25, 2015 at 1:07 AM, Ojan Vafai <oj...@chromium.org> wrote:

+chromium-dev

Context: the blink CQ was >20% flaky for 16 hours yesterday because of flaky downloads gtests.

On Tue, Nov 24, 2015 at 2:07 PM, Julie Parent <jpa...@chromium.org> wrote:
> I don't think it makes sense to ask sheriffs to look at the viceroy graphs. Sheriffs use s-o-m as a one-stop shop.

+1. The sheriff shouldn't care how bad the flake is. The flake should be handled the same way regardless: revert the offending patch if it's straightforward to figure out, and otherwise disable the test and assign appropriate owners to the bug.
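For concreteness, the percentages in this thread (">20%" here, "24.1%" in Paweł's message below) are CQ false-rejection rates - roughly, of the CQ attempts on CLs that were actually fine (they eventually landed), the fraction that the CQ nevertheless rejected. A toy calculation, assuming that definition and with illustrative numbers rather than the actual day's data:

```cpp
#include <cstdio>

int main() {
  // Illustrative numbers, not the actual day's data.
  const double attempts_on_good_cls = 1000.0;  // attempts on CLs that later landed
  const double rejected_good_attempts = 241.0; // rejected anyway, e.g. by flakes
  const double false_rejection_rate =
      rejected_good_attempts / attempts_on_good_cls;
  std::printf("false rejection rate: %.1f%%\n", false_rejection_rate * 100.0);
  return 0;
}
```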
On Wed, Nov 25, 2015 at 1:15 AM, Adrian Kuegel <aku...@chromium.org> wrote:

On Tue, Nov 24, 2015 at 10:52 PM, Kenneth Russell <k...@chromium.org> wrote:
> [...] If previously-reliable tests suddenly become flaky, then investigation should be done immediately to see whether a recently committed CL is likely the cause, and if one is found, it should be reverted.

Of course; but I would argue that if the tryserver is broken, a test should be disabled first, before spending hours investigating what might be causing this. If you look at crbug.com/560329, you can see that the flaky test was known 15 hours before it was disabled. If you think back to the times when we still closed the tree for failing tests, a sheriff would also try to get the tree green again as soon as possible.
On Wed, Nov 25, 2015 at 2:31 PM, Kenneth Russell <k...@chromium.org> wrote:

On Wed, Nov 25, 2015 at 1:15 AM, Adrian Kuegel <aku...@chromium.org> wrote:
> [...] I would argue that if the tryserver is broken, a test should be disabled first, before spending hours investigating what might be causing this.

Agree that tryservers should not be left broken for hours, but an initial attempt should be made to find a CL that made a test flaky rather than reflexively disabling previously working tests. In particular, our team's tests are among the few that launch the entire browser rather than a smaller test harness, and it has been the case several times in the past that intermittent crashes affecting the browser were caught only by our tests. In situations like these an attempt should be made to find the cause of the crashes and revert it, rather than immediately disabling the tests and pushing the effort of diagnosis onto the team owning the tests.
On Tue, Nov 24, 2015 at 10:41 AM, Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:

> I've heard the Flakiness sub-team of CQ FA starts investigating chromium flakiness exceeding 10% and blink exceeding 20%. Today's blink false rejection rate is 24.1%, and so I'm looking forward to your insights.

This seems easy - cycle time has also regressed, and I've noticed we had a series of flakes on linux_chromium_chromeos_rel_ng (https://code.google.com/p/chromium/issues/detail?id=560329). Automated systems did react; see e.g. https://code.google.com/p/chromium/issues/detail?id=560264.

Action seems to have happened with a rather large delay (8-18 hours) - see https://codereview.chromium.org/1467183004 disabling the tests.

I'm working on a CL for the test launcher not to bail out early in the run without the patch (i.e. to calculate the threshold for broken tests differently when we only run a subset of them), as happened in e.g. http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_rel_ng/builds/133462 .

I wouldn't like to exaggerate the importance of this regression. On the other hand, it gives us a specific case to study and improve on.

Paweł
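Background on the launcher behavior Paweł refers to: the test launcher aborts a run early once "too many" tests have failed, on the theory that the build itself is broken; a threshold tuned for the full suite trips almost immediately when a retry step runs only the small subset of previously failing tests. A minimal sketch of the fix he describes - scaling the threshold to the number of tests actually scheduled - under the assumption that this is the shape of the change (names and constants are illustrative, not the actual base/test/launcher code):

```cpp
#include <algorithm>
#include <cstddef>

// How many broken (failed/crashed/timed-out) tests to tolerate before
// aborting the run as "build is broken".
size_t MaxBrokenTestsAllowed(size_t num_tests_scheduled) {
  constexpr size_t kMinAllowed = 5;        // small absolute floor
  constexpr double kAllowedFraction = 0.1; // plus a share of what we run
  return std::max(kMinAllowed, static_cast<size_t>(
                                   num_tests_scheduled * kAllowedFraction));
}

bool ShouldBailOutEarly(size_t num_broken_so_far, size_t num_tests_scheduled) {
  // Keyed to the subset actually scheduled in this run (e.g. a
  // retry-without-patch step), not the size of the full suite.
  return num_broken_so_far > MaxBrokenTestsAllowed(num_tests_scheduled);
}
```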
On Wed, Nov 25, 2015 at 2:43 PM, Dirk Pranke <dpr...@chromium.org> wrote:

On Wed, Nov 25, 2015 at 2:31 PM, Kenneth Russell <k...@chromium.org> wrote:
> [...] In situations like these an attempt should be made to find the cause of the crashes and revert it, rather than immediately disabling the tests and pushing the effort of diagnosis onto the team owning the tests.

Right. If a failure isn't severe enough to close the tree (and test flakes by definition aren't), then standard operating procedure should be to spend at least some time trying to diagnose and fix the problem, rather than simply disabling things. I would tend to define "some time" as something in the 15-60 minute range, depending on the severity of the problem. We should also pretty much always favor reverting a CL over suppressing test failures, and usually you can get a pretty good guess as to which CL might be the culprit within that 15-60 minute window.

Clearly (to me, at least), in this case the sheriffs fell down on the job and should have done something much earlier.

-- Dirk
On Wed, Nov 25, 2015 at 2:43 PM, Dirk Pranke <dpr...@chromium.org> wrote:
> [...] We should also pretty much always favor reverting a CL over suppressing test failures, and usually you can get a pretty good guess as to which CL might be the culprit within that 15-60 minute window. Clearly (to me, at least), in this case the sheriffs fell down on the job and should have done something much earlier.

https://www.chromium.org/developers/tree-sheriffs/sheriff-details-chromium has a shorter, less eloquent summary of the above. Perhaps Dirk/Ken can update it with this.
On Wed, Nov 25, 2015 at 1:07 AM, Ojan Vafai <oj...@chromium.org> wrote:
> [...] +1. The sheriff shouldn't care how bad the flake is. The flake should be handled the same way regardless: revert the offending patch if it's straightforward to figure out, and otherwise disable the test and assign appropriate owners to the bug.

Good point. Even if the flake occurs only infrequently, it still hurts developers who want to get their patches landed. So should I remove the link to the viceroy graphs, or just change the text to say that they can be used as an FYI if a Sheriff is curious?
It appears that the sheriff noticed the bug almost immediately, but no one disabled the tests for 14 hours, until jam@ asked for the tests to be disabled. We need to figure out why this happened. Can you follow up with the Chromium sheriffs on duty at that time to see what went wrong?