I don't think it makes sense to ask sheriffs to look at the viceroy graphs. Sheriffs use s-o-m as a one stop shop.
You mention the trooper failing to notify the sheriff of the builder success rate dropping low as a core problem. I agree, but we don't want to depend on humans to do this. This is why we are building tooling. We should make it either dead simple, or automatic, for it to happen. What if the alert for linux_chromium_chromeos_rel_ng success rate had been added directly to the sheriff's view in s-o-m, no trooper involvement needed? This seems like the sort of issue that should be surfaced for both trooper and sheriff. Or, if we still want troopers to vet it first, make it incredibly easy to verify and send it to s-o-m for the sheriff.

On Tue, Nov 24, 2015 at 1:52 PM, Kenneth Russell <k...@chromium.org> wrote:

On Tue, Nov 24, 2015 at 3:19 AM, Adrian Kuegel <aku...@chromium.org> wrote:

It seems the problem in this case was that the Sheriffs didn't realize how badly these flaky tests were affecting the Tryserver. I added a link to our Monitoring Page with a short explanation to https://sites.google.com/a/chromium.org/dev/developers/tree-sheriffs/sheriffing-bug-queues

We are also considering linking to the monitoring page from Sheriff-o-matic directly.

In any case, for this specific case our tooling already worked: we got alerts from the monitoring pipeline that the success rate of linux_chromium_chromeos_rel_ng had dropped (also clearly visible in the Builder success rate Graph), automatic flakiness bugs were filed, and we got alerts from buildbucket. What didn't work: the trooper didn't notify the Sheriffs about the alerts, so the Sheriff only saw the flakiness bugs, and probably didn't know how to investigate the effect of that flakiness. Also, the instructions for Sheriffs regarding flakiness just said to disable tests after "a couple of hours" (which is maybe not clear enough).
I added:

If a builder has dropped to a low success rate because of flaky tests, those tests should be disabled as soon as possible.

If previously-reliable tests suddenly become flaky, then investigation should be done immediately to see whether a recently committed CL is likely the cause, and if one is found, it should be reverted.

On Tue, Nov 24, 2015 at 10:41 AM, Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:

I've heard the Flakiness sub-team of CQ FA starts investigating chromium flakiness exceeding 10% and blink exceeding 20%. Today's blink false rejection rate is 24.1%, and so I'm looking forward to your insights.

This seems easy - cycle time has also regressed, and I've noticed we had a series of flakes on linux_chromium_chromeos_rel_ng (https://code.google.com/p/chromium/issues/detail?id=560329). Automated systems did react, see e.g. https://code.google.com/p/chromium/issues/detail?id=560264 .

Action seems to have happened with a rather large delay (8-18 hours) - see https://codereview.chromium.org/1467183004 disabling the tests.

I'm working on a CL for the test launcher not to bail out early without the patch (i.e. calculate the threshold for broken tests differently when we only run a subset of them), as happened in e.g. http://build.chromium.org/p/tryserver.chromium.linux/builders/linux_chromium_chromeos_rel_ng/builds/133462 .

I wouldn't like to exaggerate the importance of this regression. On the other hand, it gives us a specific case to study and improve on.

Paweł
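
For reference, the investigation thresholds quoted above (chromium exceeding 10%, blink exceeding 20%) can be sketched as a simple check. This is only an illustration: the names `INVESTIGATION_THRESHOLDS` and `needs_investigation` are hypothetical and are not part of any existing CQ or monitoring tool, and "exceeding" is assumed to mean strictly greater than.

```python
# Hypothetical sketch: thresholds are the figures quoted in this thread,
# expressed as fractions. Names here are illustrative, not a real API.
INVESTIGATION_THRESHOLDS = {
    "chromium": 0.10,
    "blink": 0.20,
}


def needs_investigation(project: str, false_rejection_rate: float) -> bool:
    """Return True when the rate strictly exceeds the project's threshold."""
    return false_rejection_rate > INVESTIGATION_THRESHOLDS[project]


if __name__ == "__main__":
    # The blink false rejection rate reported in the thread was 24.1%.
    print(needs_investigation("blink", 0.241))
```

With the 24.1% blink rate from the message above, the check reports that the threshold is exceeded; a rate exactly at the threshold would not trigger it under the strict-inequality assumption.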
You received this message because you are subscribed to the Google Groups "infra-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to infra-dev+...@chromium.org.
To post to this group, send email to infr...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/infra-dev/CAATLsPZ-%3Day7e0%2BV6y4JigD68wLUBybQKs38Mjq6x8rG1BX%3DaQ%40mail.gmail.com.
+chromium-dev

Context: The blink CQ was >20% flaky for 16 hours yesterday because of flaky downloads gtests.

On Tue, Nov 24, 2015 at 2:07 PM Julie Parent <jpa...@chromium.org> wrote:

I don't think it makes sense to ask sheriffs to look at the viceroy graphs. Sheriffs use s-o-m as a one stop shop.

+1. The sheriff shouldn't care how bad the flake is. The flake should be handled the same way regardless, which is to revert the offending patch if it's straightforward to figure out, and otherwise to disable the test and assign appropriate owners to the bug.
On Wed, Nov 25, 2015 at 1:07 AM, Ojan Vafai <oj...@chromium.org> wrote:

+chromium-dev

Context: The blink CQ was >20% flaky for 16 hours yesterday because of flaky downloads gtests.

On Tue, Nov 24, 2015 at 2:07 PM Julie Parent <jpa...@chromium.org> wrote:

I don't think it makes sense to ask sheriffs to look at the viceroy graphs. Sheriffs use s-o-m as a one stop shop.

+1. The sheriff shouldn't care how bad the flake is. The flake should be handled the same way regardless, which is to revert the offending patch if it's straightforward to figure out, and otherwise to disable the test and assign appropriate owners to the bug.

Good point. Even if the flake does not occur very frequently, it still hurts developers who want to get their patches landed. So should I remove the link to the viceroy graphs, or just change the text to say that it can be used as an FYI if a Sheriff is curious?
It appears that the sheriff noticed the bug almost immediately, but no one disabled the tests for 14 hours, until jam@ asked for them to be disabled. We need to figure out why this happened. Can you follow up with the chromium sheriffs on duty at that time to see what went wrong?