> On Thursday 2012-08-30 18:45 -0400, Zack Weinberg wrote:
>> My process is: only ask for review *after* the patch is green on
>> try. Until then, for all I know I'm going to need major
>> architectural changes just to make the testsuite happy, and there's
>> no point wasting reviewers' time, which is a *far* scarcer resource
>> in this organization than CPU hours.
> But there's also the opposite problem, which is that in some cases
> reviewers might require changes that will invalidate (or make
> unnecessary) the work that's been done to get the patch green.
> I don't think there's a single correct solution here. Testing and
> peer review are both tools we use to improve the quality of our
> code; they don't necessarily belong in a particular order.
Usually when I'm working on a bug, my goal is to get the patch landed as soon as possible and move on to other work. I have a hard time when I have a lot of pending patches (more than 5 really) since they incur a cognitive load for me as I always have to keep thinking about them. On average, the single biggest thing which gets in the way of me landing the patch is waiting for the review. In many cases, all of the other steps in fixing a bug (understanding the bug, thinking of a solution, coding it, testing it, pushing to the try server and waiting for results) take less time than it takes for the reviewer to start looking at my patch. For this reason, I optimize for attaching the patch to the bug and asking for review *as soon as possible*.
> On 08/30/2012 03:27 PM, Jeff Hammel wrote:
>> I believe for our code base that so much depends on so much that this
>> sort of division will not work. While I would love to see our test
>> bucketized, such that if a change in (say) layout would only run
>> layout tests, I think just figuring out what tests would have to be
>> run for what files would be very hard. I also think the answer, with
>> probably a few special cases, is that most changes could, at least in
>> theory, affect most tests.
> Maybe you're right. But one thing I want to make clear -- this doesn't
> require changes in layout to only run layout tests. This is *only* based
> on after-the-fact observation of what tests break from layout changes.
> So if they run tests outside of layout, but those tests never break as a
> result of layout changes, then this wouldn't run them. But if those
> tests *did* break, then we would. There is nothing in this algorithm
> that uses the location of tests for anything. (The source tree-based
> "buckets" I was referring to were only for code.)
> The source location of tests is probably correlated with the test jobs
> that we have defined (M1 vs Moth vs xpcshell tests etc), and those are
> probably not the ideal buckets for this purpose. But they're probably
> not horrible, and that's what I imagine the easy-to-gather data is based
> on, so perhaps they're good enough. If the data is available, it's easy
> enough to test -- generate the matrix, and look for zeroes. If there
> aren't many, then the bucketing on one or the other axis is useless.
> Sorry; maybe that's what you meant in the first place. All I'm saying is
> that "...most changes could, at least in theory, affect most tests"
> doesn't matter. What matters is what happens *in practice*.
> I strongly suspect that *some* particular partitioning would work well
> for this, even if it didn't end up making a whole lot of sense just
> looking at it. If we had data on exact tests that failed, we could
> automatically generate a good partitioning. But that would mean
> shuffling around tests between test suites, which would kinda suck.
To give you a concrete example of why this kind of stuff does not work, when I was working on bug 157681 (which was a layout optimization), I came across a single browser-chrome test failure happening only on Mac, which seemed pretty unrelated to my changes at first, but it turned out that it actually uncovered a subtle bug in my patch which none of the other layout tests that we have managed to catch.
This kind of stuff is rare, true, but it happens frequently enough that it really matters. I don't think we can seriously consider bucketing tests based on which files have changed in a patch without losing this important aspect of catching bugs in patches -- except perhaps for extremely localized components of the code.
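For reference, the "generate the matrix, and look for zeroes" check quoted above could be sketched roughly like this. It is only an illustration: the history format, the bucket names and the job names are made up, and real data would have to come from whatever records which jobs failed on which pushes. A rare cross-bucket failure like the browser-chrome one would show up as a small non-zero cell rather than a zero.

```python
from collections import Counter

def failure_matrix(history):
    """Count observed failures per (code bucket, test job) pair.

    `history` is an iterable of (changed_buckets, failed_jobs) pairs,
    one per push. This input format is an assumption for the sketch.
    """
    counts = Counter()
    for changed_buckets, failed_jobs in history:
        for bucket in changed_buckets:
            for job in failed_jobs:
                counts[(bucket, job)] += 1
    return counts

def zero_cells(counts, buckets, jobs):
    """Return the (bucket, job) pairs that were never seen failing together."""
    return [(b, j) for b in buckets for j in jobs if counts[(b, j)] == 0]

# Made-up history: each entry is (code buckets touched, jobs that failed).
history = [
    ({"layout"}, {"reftest"}),
    ({"layout"}, {"browser-chrome"}),   # the "surprising" kind of failure
    ({"js"}, {"xpcshell"}),
    ({"js", "layout"}, set()),          # a green push
]
buckets = ["js", "layout"]
jobs = ["browser-chrome", "reftest", "xpcshell"]

counts = failure_matrix(history)
zeros = zero_cells(counts, buckets, jobs)
```

If `zeros` comes back nearly empty on real data, the bucketing on one axis or the other is not buying anything.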
> On 08/31/12 12:39 AM, Robert O'Callahan wrote:
>> If we could run Linux functional tests on AWS, then maybe we could keep the
>> Linux build/functional-test backlog at zero and encourage people to try the
>> Linux non-functional tests before every non-trivial commit to inbound. It
>> seems to me that would greatly reduce bustage. (I suppose we have enough
>> data for someone to compute the fraction of bustage-inducing pushes that
>> did not break Linux functional tests.)
> I know this is in the works (sorry, I don't know which bug it is happening in),
> but we can't quite run all of our unit tests on AWS. Anything that
> depends on a GPU (reftest, some crashtests, and even some mochitests
> I've heard) can't run there. We definitely want to move everything we
> can to the cloud, though.
I think that only mochitests which test canvas fall into that category, which means mochitest-1 (and we could bucket up those tests into a separate suite if needed). The rest should be possible to be pushed to the cloud.
> philor (who knows as much about this stuff as anyone) just mentioned
> the following on IRC:
> "did anyone point out that we take 60 minutes to run Win xpcshell,
> when locally it takes 7 minutes, or that we build and test desktop on
> pushes that only touch mobile/ or b2g/?"
> Sounds like two pieces of large, low-hanging fruit.
Except that as I understand things, we don't have a reliable way to handle them, since our infrastructure is only capable of looking at the tip of a push, not every changeset in it.
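To illustrate why the tip-versus-every-changeset distinction matters for skipping desktop jobs, here is a small sketch. The push representation (a list of changesets, each a list of file paths, tip last) is an assumption for the example and not how the infrastructure actually stores pushes.

```python
def files_touched(push):
    """Union of the file paths touched by every changeset in a push."""
    touched = set()
    for changeset in push:
        touched.update(changeset)
    return touched

def needs_desktop_jobs(push, skip_prefixes=("mobile/", "b2g/")):
    """True if any file in the push lives outside the mobile-only prefixes."""
    return any(not path.startswith(skip_prefixes) for path in files_touched(push))

# A two-changeset push whose tip only touches mobile/ (paths are made up).
push = [
    ["browser/base/content/browser.js"],
    ["mobile/android/base/App.java"],
]

whole_push_decision = needs_desktop_jobs(push)        # looks at every changeset
tip_only_decision = needs_desktop_jobs([push[-1]])    # looks at the tip alone
```

Looking only at the tip would wrongly skip desktop here, which is the reliability problem with the low-hanging fruit.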
> On 12-08-30 11:48 PM, Nicholas Nethercote wrote:
>> philor (who knows as much about this stuff as anyone) just mentioned
>> the following on IRC:
>> "did anyone point out that we take 60 minutes to run Win xpcshell,
>> when locally it takes 7 minutes, or that we build and test desktop on
>> pushes that only touch mobile/ or b2g/?"
>> Sounds like two pieces of large, low-hanging fruit.
> Except that as I understand things, we don't have a reliable way to
> handle them, since our infrastructure is only capable of looking at the
> tip of a push, not every changeset in it.
That's just how it's currently implemented; it's certainly changeable with enough effort.
Are the win xpcshell test times something to be concerned about? Are there other tests that are taking unreasonably long?
> On 31/08/12 11:15 AM, Ehsan Akhgari wrote:
>> On 12-08-30 11:48 PM, Nicholas Nethercote wrote:
>>> philor (who knows as much about this stuff as anyone) just mentioned
>>> the following on IRC:
>>> "did anyone point out that we take 60 minutes to run Win xpcshell,
>>> when locally it takes 7 minutes, or that we build and test desktop on
>>> pushes that only touch mobile/ or b2g/?"
>>> Sounds like two pieces of large, low-hanging fruit.
>> Except that as I understand things, we don't have a reliable way to
>> handle them, since our infrastructure is only capable of looking at the
>> tip of a push, not every changeset in it.
> That's just how it's currently implemented; it's certainly changeable
> with enough effort.
Good point! -> bug 787449
> Are the win xpcshell test times something to be concerned about? Are
> there other tests that are taking unreasonably long?
Absolutely! Filed bug 787448 for the investigation on why this happens. I don't know if the same problem happens with other tests as well.
> Looking at the data, we're still at around 2/3 Linux32 on Fx13/14
Are you looking at the "release" channel there or at the "default" channel as well? Distro builds are usually on the "default" channel so that we don't provide updates (as the distro does that).
> On 12-08-30 8:48 PM, Steve Fink wrote:
>> On 08/30/2012 03:27 PM, Jeff Hammel wrote:
>>> I believe for our code base that so much depends on so much that this
>>> sort of division will not work. While I would love to see our test
>>> bucketized, such that if a change in (say) layout would only run
>>> layout tests, I think just figuring out what tests would have to be
>>> run for what files would be very hard. I also think the answer, with
>>> probably a few special cases, is that most changes could, at least in
>>> theory, affect most tests.
>> Maybe you're right. But one thing I want to make clear -- this doesn't
>> require changes in layout to only run layout tests. This is *only* based
>> on after-the-fact observation of what tests break from layout changes.
>> So if they run tests outside of layout, but those tests never break as a
>> result of layout changes, then this wouldn't run them. But if those
>> tests *did* break, then we would. There is nothing in this algorithm
>> that uses the location of tests for anything. (The source tree-based
>> "buckets" I was referring to were only for code.)
>> The source location of tests is probably correlated with the test jobs
>> that we have defined (M1 vs Moth vs xpcshell tests etc), and those are
>> probably not the ideal buckets for this purpose. But they're probably
>> not horrible, and that's what I imagine the easy-to-gather data is based
>> on, so perhaps they're good enough. If the data is available, it's easy
>> enough to test -- generate the matrix, and look for zeroes. If there
>> aren't many, then the bucketing on one or the other axis is useless.
>> Sorry; maybe that's what you meant in the first place. All I'm saying is
>> that "...most changes could, at least in theory, affect most tests"
>> doesn't matter. What matters is what happens *in practice*.
>> I strongly suspect that *some* particular partitioning would work well
>> for this, even if it didn't end up making a whole lot of sense just
>> looking at it. If we had data on exact tests that failed, we could
>> automatically generate a good partitioning. But that would mean
>> shuffling around tests between test suites, which would kinda suck.
> To give you a concrete example of why this kind of stuff does not work, when I was working on bug 157681 (which was a layout optimization), I came across a single browser-chrome test failure happening only on Mac, which seemed pretty unrelated to my changes at first, but it turned out that it actually uncovered a subtle bug in my patch which none of the other layout tests that we have managed to catch.
> This kind of stuff is rare, true, but it happens frequently enough that it really matters. I don't think we can seriously consider bucketing tests based on which files have changed in a patch without losing this important aspect of catching bugs in patches -- except perhaps for extremely localized components of the code.
> Ehsan
I can't argue convincingly without concrete data, but this sounds wrong to me.
You give an example of where the test restrictions would fail due to the bucketing, but you also say "This kind of stuff is rare...". So when something like this happens, you wouldn't get a test build and wouldn't see the failure until several pushes later when the test *did* get run. So we get bad coalescing in rare cases.
In return, we lower the infrastructure load across the board, resulting in less coalescing in the common case.
I think the tradeoff is likely to be worth it, but it totally depends on the numbers. And predicting how much coalescing will be reduced, but only during busy times when it matters, based on a certain reduction in test load, is Hard.
Your example does point out that we'd also want test suppression to be relative to current load -- no need to suppress any tests during off hours, and in fact you'd probably want to set the threshold based on current activity/backlog. Perhaps that makes it more palatable: "we're overloaded and can't run everything, so what jobs would be least harmful if we suppressed them?"
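A rough sketch of what load-relative suppression might look like, purely as an illustration. The per-job failure probabilities, the backlog and capacity numbers, and the "keep a fraction proportional to capacity/backlog" rule are all assumptions made up for the example.

```python
def jobs_to_run(jobs, backlog, capacity):
    """Pick the jobs to run for a push, given the current backlog.

    `jobs` maps a job name to an estimated probability that the job
    fails for this push (hypothetical numbers). When the backlog fits
    within capacity, nothing is suppressed. Otherwise the jobs least
    likely to fail are suppressed first.
    """
    if backlog <= capacity:
        return sorted(jobs)
    keep = max(1, int(len(jobs) * capacity / backlog))
    ranked = sorted(jobs, key=lambda name: jobs[name], reverse=True)
    return sorted(ranked[:keep])

# Made-up failure estimates for one push.
estimates = {
    "reftest": 0.20,
    "mochitest-1": 0.10,
    "browser-chrome": 0.05,
    "xpcshell": 0.01,
}

off_hours = jobs_to_run(estimates, backlog=10, capacity=20)
overloaded = jobs_to_run(estimates, backlog=40, capacity=20)
```

During off hours everything runs; under load only the jobs most likely to tell us something survive.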
> On 08/31/2012 08:04 AM, Ehsan Akhgari wrote:
>> On 12-08-30 8:48 PM, Steve Fink wrote:
>>> On 08/30/2012 03:27 PM, Jeff Hammel wrote:
>>>> I believe for our code base that so much depends on so much that this
>>>> sort of division will not work. While I would love to see our test
>>>> bucketized, such that if a change in (say) layout would only run
>>>> layout tests, I think just figuring out what tests would have to be
>>>> run for what files would be very hard. I also think the answer, with
>>>> probably a few special cases, is that most changes could, at least in
>>>> theory, affect most tests.
>>> Maybe you're right. But one thing I want to make clear -- this doesn't
>>> require changes in layout to only run layout tests. This is *only* based
>>> on after-the-fact observation of what tests break from layout changes.
>>> So if they run tests outside of layout, but those tests never break as a
>>> result of layout changes, then this wouldn't run them. But if those
>>> tests *did* break, then we would. There is nothing in this algorithm
>>> that uses the location of tests for anything. (The source tree-based
>>> "buckets" I was referring to were only for code.)
>>> The source location of tests is probably correlated with the test jobs
>>> that we have defined (M1 vs Moth vs xpcshell tests etc), and those are
>>> probably not the ideal buckets for this purpose. But they're probably
>>> not horrible, and that's what I imagine the easy-to-gather data is based
>>> on, so perhaps they're good enough. If the data is available, it's easy
>>> enough to test -- generate the matrix, and look for zeroes. If there
>>> aren't many, then the bucketing on one or the other axis is useless.
>>> Sorry; maybe that's what you meant in the first place. All I'm saying is
>>> that "...most changes could, at least in theory, affect most tests"
>>> doesn't matter. What matters is what happens *in practice*.
>>> I strongly suspect that *some* particular partitioning would work well
>>> for this, even if it didn't end up making a whole lot of sense just
>>> looking at it. If we had data on exact tests that failed, we could
>>> automatically generate a good partitioning. But that would mean
>>> shuffling around tests between test suites, which would kinda suck.
>> To give you a concrete example of why this kind of stuff does not
>> work, when I was working on bug 157681 (which was a layout
>> optimization), I came across a single browser-chrome test failure
>> happening only on Mac, which seemed pretty unrelated to my changes at
>> first, but it turned out that it actually uncovered a subtle bug in my
>> patch which none of the other layout tests that we have managed to catch.
>> This kind of stuff is rare, true, but it happens frequently enough
>> that it really matters. I don't think we can seriously consider
>> bucketing tests based on which files have changed in a patch without
>> losing this important aspect of catching bugs in patches -- except
>> perhaps for extremely localized components of the code.
>> Ehsan
> I can't argue convincingly without concrete data, but this sounds wrong
> to me.
> You give an example of where the test restrictions would fail due to the
> bucketing, but you also say "This kind of stuff is rare...". So when
> something like this happens, you wouldn't get a test build and wouldn't
> see the failure until several pushes later when the test *did* get run.
> So we get bad coalescing in rare cases.
> In return, we lower the infrastructure load across the board, resulting
> in less coalescing in the common case.
> I think the tradeoff is likely to be worth it, but it totally depends on
> the numbers. And predicting how much coalescing will be reduced, but
> only during busy times when it matters, based on a certain reduction in
> test load, is Hard.
OK, thinking more about this, I see your point now. And I definitely agree that this is the sort of thing which is hard to evaluate without the data.
> Your example does point out that we'd also want test suppression to be
> relative to current load -- no need to suppress any tests during off
> hours, and in fact you'd probably want to set the threshold based on
> current activity/backlog. Perhaps that makes it more palatable: "we're
> overloaded and can't run everything, so what jobs would be least harmful
> if we suppressed them?"
>> 5. Or go the other way, and make more tests runnable in parallel. More
>> efficient than #4 because it avoids the VM overhead, much harder to
>> implement, would also improve testing locally. (Though making it easy to
>> set up test VMs could help local testing too.) Needing window focus will
>> again bite us here.
> I'm not convinced that this is feasible in the short to middle term for any of our graphical test suites.
The <audio> and <video> mochitests have a pretty simple test manager [1] written in JS which can run multiple sub-tests in parallel. The level of parallelism can be cranked up or down by changing a simple parameter [2]. Not all mochitests could be written like this (fullscreen mochitests couldn't for example), but some of the slower running tests may be able to be refactored to use techniques like this.
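The general shape of such a manager, sketched in Python rather than the JS the media tests actually use: a set of sub-tests and a single knob for how many run at once. This is not the media test manager itself; `PARALLEL_TESTS` and `run_subtests` are names invented for the illustration.

```python
from concurrent.futures import ThreadPoolExecutor

PARALLEL_TESTS = 2  # hypothetical knob: how many sub-tests may run at once

def run_subtests(subtests, parallelism=PARALLEL_TESTS):
    """Run named sub-tests with at most `parallelism` in flight.

    `subtests` maps a name to a zero-argument callable; the return value
    maps each name to whatever its callable returned.
    """
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        futures = {name: pool.submit(fn) for name, fn in subtests.items()}
        return {name: future.result() for name, future in futures.items()}

# Trivial stand-ins for real sub-tests.
results = run_subtests({
    "seek": lambda: "pass",
    "play": lambda: "pass",
    "pause": lambda: "pass",
})
```

Setting `parallelism=1` degrades to running the sub-tests one at a time, which is handy when chasing an ordering-dependent failure.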
> On 08/31/2012 08:04 AM, Ehsan Akhgari wrote:
>> To give you a concrete example of why this kind of stuff does not work, when I was working on bug 157681 (which was a layout optimization), I came across a single browser-chrome test failure happening only on Mac, which seemed pretty unrelated to my changes at first, but it turned out that it actually uncovered a subtle bug in my patch which none of the other layout tests that we have managed to catch.
>> This kind of stuff is rare, true, but it happens frequently enough that it really matters. I don't think we can seriously consider bucketing tests based on which files have changed in a patch without losing this important aspect of catching bugs in patches -- except perhaps for extremely localized components of the code.
> I can't argue convincingly without concrete data, but this sounds wrong to me.
> You give an example of where the test restrictions would fail due to the bucketing, but you also say "This kind of stuff is rare...". So when something like this happens, you wouldn't get a test build and wouldn't see the failure until several pushes later when the test *did* get run. So we get bad coalescing in rare cases.
> In return, we lower the infrastructure load across the board, resulting in less coalescing in the common case.
I'd also be perfectly okay with saying that changes someone like Ehsan makes to something like layout are gonna run the full suite every time. Layout pushes in general are likely to touch surprising things. But even granting that, Steve's suggestions could help firefox, toolkit, mobile, js, nss, webgl, &c pushes get out of the way by running subsets.
We could label whole directories as "touching this ends the world, test everything" and be pretty liberal about where we apply that label because at the moment, we effectively apply it to everything.
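As a strawman of that labelling, something like the following would do. The directory prefixes, job names and the "unknown location means run everything" fallback are all assumptions for illustration, not a proposal for the actual table.

```python
# Hypothetical job names and directory labels, for illustration only.
ALL_JOBS = {"reftest", "mochitest", "browser-chrome", "xpcshell", "jsreftest", "mobile-tests"}

# "Touching this ends the world, test everything."
RUN_EVERYTHING = ("layout/", "build/", "xpcom/")

# Directories whose pushes are assumed to need only a subset of jobs.
SUBSETS = {
    "js/": {"jsreftest", "xpcshell"},
    "mobile/": {"mobile-tests"},
}

def jobs_for_push(changed_files):
    """Return the set of jobs to run for the given changed file paths."""
    selected = set()
    for path in changed_files:
        if path.startswith(RUN_EVERYTHING):
            return set(ALL_JOBS)
        for prefix, subset in SUBSETS.items():
            if path.startswith(prefix):
                selected |= subset
                break
        else:
            # No label for this location yet: be liberal and test everything.
            return set(ALL_JOBS)
    return selected
```

With an empty `SUBSETS` table this is exactly what we do today, so the table can start small and grow as data comes in.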
So who's gonna volunteer to do the strawman test-bucket vs code location matrix? :)
J
---
Johnathan Nightingale
Sr. Director of Firefox Engineering
@johnath
This thread has wandered far, far away from the original purpose (surprise) which was to assess whether we still needed/wanted Linux64 as both a build and test platform.
Aside from the expected "OMGCHANGE" reactions, there were valid arguments for keeping Linux64. We should invest the effort to get bug 527907 fixed.
However...
I'm not feeling a lot of love for 32-bit linux. Many people suggested turning off linux32 instead if we needed to make a choice.
Would we consider stopping builds and tests on linux32 instead of linux64, or at least putting some sort of horizon on how long we would plan to support 32-bit linux as a tier 1 platform?
Again, no one is (necessarily) talking in absolutes here. We can continue to run both linux platforms, we could demote linux32 to tier 2, etc.
While it would obviously help unburden release engineering to reduce the number of build/test environments we support, our primary goal here is to make sure we're expending effort on relevant platforms and architecture.
On Thursday, 30 August 2012 21:43:06 UTC+1, Mike Connor wrote:
> If we're advocating increased try use to keep inbound greener, I think it's a sign we've lost sight of the original point of having mozilla-inbound, which was a place to reduce effort for devs by having a tree that was explicitly _allowed_ to break, and didn't carry the heavy individual and collective cost of breaking mozilla-central. If we start treating mozilla-inbound as a "must be protected from bustage" tree, there's little point in having it as an additional step. So I'm completely clear:
> Breaking mozilla-inbound should be 100% acceptable, and trivial for a sheriff or others to fix.
> The entire point is to enable a developer workflow of "this should be good, pushing to -inbound, if it stays green I'm completely done" which we can't have on try, and makes try more focused on "I think this might break" patches rather than routine validation of patches.
This was never the purpose of mozilla-inbound.
The idea was to:
a) Have a tree where people did not have to watch their pushes for 4-6+ hours, since someone would keep an eye on it. (The primary dev incentive).
b) Mean that other branches could confidently pull from mozilla-central, knowing that it would be green.
c) Reduce the number of push races when merging other repos into mozilla-central (which can be more of a pain to rebase than normal sized pushes), since the traffic is lower.
d) Give us a way to not tie up mozilla-central if we end up with extreme bustage on mozilla-inbound. We also gained the ability to reset (mozilla-inbound) to a previous revision without reverting merges from other repos.
If people feel that it would be preferable (either from infra load or workflow) to change this policy, please can they start a dev.{platform,planning} discussion proposing a change - but in the meantime I would prefer it if they don't ignore the tree rules - since it results in very sadfaces sheriffs :-(
> This thread has wandered far, far away from the original purpose (surprise)
> which was to assess whether we still needed/wanted Linux64 as both a build
> and test platform.
> Aside from the expected "OMGCHANGE" reactions, there were valid arguments
> for keeping Linux64. We should invest the effort to get bug 527907 fixed.
> However...
> I'm not feeling a lot of love for 32-bit linux. Many people suggested
> turning off linux32 instead if we needed to make a choice.
> Would we consider stopping builds and tests on linux32 instead of linux64,
> or at least putting some sort of horizon on how long we would plan to
> support 32-bit linux as a tier 1 platform?
> Again, no one is (necessarily) talking in absolutes here. We can continue to
> run both linux platforms, we could demote linux32 to tier 2, etc.
> While it would obviously help unburden release engineering to reduce the
> number of build/test environments we support, our primary goal here is to
> make sure we're expending effort on relevant platforms and architecture.
I tried to make that point in the previous thread:
The problem with dropping a platform is not just that that platform
may be worth keeping, it is also that the fact that you feel the need
to drop a platform is probably a consequence of a deeper problem which
is the right thing to fix: our testing is too expensive. How do we
make it less expensive? People discussed possible ideas in the other
thread, including running tests less often and/or skipping part of the
tests depending on what didn't change.
> The problem with dropping a platform is not just that that platform
> may be worth keeping, it is also that the fact that you feel the need
> to drop a platform is probably a consequence of a deeper problem which
> is the right thing to fix: our testing is too expensive.
I agree.
But if we *were* to consider dropping a platform to tier 2, we should make that decision with data to back it up, which for Linux should also include data regarding the Firefox x86/x64 split in the major distros, since most roll their own Firefox packages which we don't track.
> On Aug 31, 2012, at 12:55 PM, Steve Fink wrote:
>> On 08/31/2012 08:04 AM, Ehsan Akhgari wrote:
>>> To give you a concrete example of why this kind of stuff does not work, when I was working on bug 157681 (which was a layout optimization), I came across a single browser-chrome test failure happening only on Mac, which seemed pretty unrelated to my changes at first, but it turned out that it actually uncovered a subtle bug in my patch which none of the other layout tests that we have managed to catch.
>>> This kind of stuff is rare, true, but it happens frequently enough that it really matters. I don't think we can seriously consider bucketing tests based on which files have changed in a patch without losing this important aspect of catching bugs in patches -- except perhaps for extremely localized components of the code.
>> I can't argue convincingly without concrete data, but this sounds wrong to me.
>> You give an example of where the test restrictions would fail due to the bucketing, but you also say "This kind of stuff is rare...". So when something like this happens, you wouldn't get a test build and wouldn't see the failure until several pushes later when the test *did* get run. So we get bad coalescing in rare cases.
>> In return, we lower the infrastructure load across the board, resulting in less coalescing in the common case.
> I'd also be perfectly okay with saying that changes someone like Ehsan makes to something like layout are gonna run the full suite every time. Layout pushes in general are likely to touch surprising things. But even granting that, Steve's suggestions could help firefox, toolkit, mobile, js, nss, webgl, &c pushes get out of the way by running subsets.
> We could label whole directories as "touching this ends the world, test everything" and be pretty liberal about where we apply that label because at the moment, we effectively apply it to everything.
> So who's gonna volunteer to do the strawman test-bucket vs code location matrix? :)
This makes sense. Do you wanna file a bug in Core::Build Config and assign it to Steve? ;-)
I _suspect_ you both may be saying basically the same thing, although differing on exactly where the line is...
I think it's true that m-i isn't a playground; developers should be surprised if a m-i push fails, and not just expect it to flush out problems. At a _minimum_, developers should have at least built and run relevant tests locally. No push'n'pray.
I also think it's true that using Try is a best-practice. It's easy and helps to spot the unexpected without causing work for other people. It's even essentially _required_ if you're doing things that have a history of being touchy -- C++ magic that various compilers might dislike, invasive build system changes, platform-specific changes that you can't check yourself, etc.
But in-between we're trusting developers to use their best judgement. Trivial, well-understood changes might not need Try at all. When Try is used, we ask them to use TryChooser to limit resource usage by doing what's needed. Not everything needs a full Talos run + debug + opt + all tests + all platforms (+ multiple runs to ensure no new random orange is added or you got lucky with a random green).
Assuming this is all true, it seems what we might really want here are some better guidelines for helping developers tune/improve their "best judgement". Some of the quote-unquote-obvious things raised in this thread would be a good start.
> I _suspect_ you both may be saying basically the same thing,
> although differing on exactly where the line is...
> I think it's true that m-i isn't a playground; developers should be
> surprised if a m-i push fails, and not just expect it to flush out
> problems. At a _minimum_, developers should have at least built and run
> relevant tests locally. No push'n'pray.
> [etc]
Indeed. My understanding has always been that the expected "patch quality" for m-i is the same as for m-c. I wouldn't push a patch to inbound unless I believe that patch is ready for mozilla-central. If I have any significant level of doubt about this, I'd push to tryserver first to verify whatever tests/platforms/etc I'm concerned about.
Of course, I may misjudge this sometimes, in which case our faithful sheriffs will rescue the tree by backing me out. But the primary reason for me to land on inbound rather than m-c is simply that it frees me from tree-watching responsibilities -- not that it lets me push stuff that I feel is too risky for m-c.
> On 12-09-03 12:59 PM, Johnathan Nightingale wrote:
>> I'd also be perfectly okay with saying that changes someone like
>> Ehsan makes to something like layout are gonna run the full suite
>> every time. Layout pushes in general are likely to touch surprising
>> things. But even granting that, Steve's suggestions could help
>> firefox, toolkit, mobile, js, nss, webgl, &c pushes get out of the
>> way by running subsets.
>> We could label whole directories as "touching this ends the world,
>> test everything" and be pretty liberal about where we apply that
>> label because at the moment, we effectively apply it to everything.
>> So who's gonna volunteer to do the strawman test-bucket vs code
>> location matrix? :)
> This makes sense. Do you wanna file a bug in Core::Build Config and
> assign it to Steve? ;-)
I'd be fine with that, though I also wouldn't get to it for a while unless I make it through a couple of other projects faster than I have been so far.
Then again... ok, here's v1, in bash:
echo "run everything"
or in Python
print("run everything")
Now, who can hook this into buildbot? I'll patch it from there. :-)
Except I'm not kidding. I can go to town on some crazy algorithm, but I've no clue about the code or the process for getting anything actually hooked in and deployed.
Btw, upon further reflection, my previously sketched-out algorithm is all wrong. You don't just want to have a per-push trigger that says "what tests should we kick off for this push?" You really want a job completion trigger that says "what could I do with this now-available machine that would give me the most information, given what I currently know?" Or maybe that's unit of information per machine-minute, I'm not sure. But that formulation gives way more possibilities -- it might choose to bisect a past coalesced failure rather than just kicking off an almost-certain-to-be-useless test for the latest push. And as long as you don't hint to it that intermittent *greens* are possible, it'll naturally decay to running everything for every push if resources are available. (If you don't limit it, it'll also use idle resources to rerun every failure forever to make sure it's not intermittent, too.) Welcome to our new overlord, the Robosheriff!
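A toy version of that job-completion trigger, to make the "information per machine-minute" idea concrete. Using the Shannon entropy of the pass/fail outcome as the measure of information is my assumption here, and the candidate jobs, failure probabilities and durations are made up.

```python
import math

def entropy(p):
    """Shannon entropy, in bits, of a pass/fail outcome with failure probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def pick_next_job(candidates):
    """Pick the candidate with the most expected information per machine-minute.

    `candidates` is a list of (name, failure_probability, minutes) tuples.
    """
    return max(candidates, key=lambda c: entropy(c[1]) / c[2])[0]

# Made-up choices for a machine that has just become free.
candidates = [
    ("xpcshell on latest push", 0.01, 60),
    ("bisect coalesced reftest failure", 0.50, 30),
]

choice = pick_next_job(candidates)
```

Here the bisection wins, because a near-certain green on the latest push tells us almost nothing for its 60 minutes.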
> Steve Fink wrote:
>> On Tue 04 Sep 2012 02:20:37 PM PDT, Ehsan Akhgari wrote:
>>> On 12-09-03 12:59 PM, Johnathan Nightingale wrote:
>>>> I'd also be perfectly okay with saying that changes someone like
>>>> Ehsan makes to something like layout are gonna run the full suite
>>>> every time. Layout pushes in general are likely to touch surprising
>>>> things. But even granting that, Steve's suggestions could help
>>>> firefox, toolkit, mobile, js, nss, webgl, &c pushes get out of the
>>>> way by running subsets.
>>>> We could label whole directories as "touching this ends the world,
>>>> test everything" and be pretty liberal about where we apply that
>>>> label because at the moment, we effectively apply it to everything.
>>>> So who's gonna volunteer to do the strawman test-bucket vs code
>>>> location matrix? :)
>>> This makes sense. Do you wanna file a bug in Core::Build Config and
>>> assign it to Steve? ;-)
>> I'd be fine with that, though I also wouldn't get to it for a while
>> unless I make it through a couple of other projects faster than I
>> have been so far.
>> Then again... ok, here's v1, in bash:
>> echo "run everything"
>> or in Python
>> print("run everything")
>> Now, who can hook this into buildbot? I'll patch it from there. :-)
> So, I discussed this idea briefly with catlee today. Here's the
> gist. Doing this is not as easy as I thought it would be, since it is
> the build machine which schedules the test jobs once the build is
> finished, and buildbot is not involved in the decision. However, it
> is the buildbot who knows which files have changed in a given push.
> So, we need to stream that information into the builder somehow so
> that it can make the call on which test suites to run.
How does coalescing happen? Does the build machine always request the full set of tests, and then buildbot ignores the request if it's overloaded? Or does the build machine actually know something about the overload state? If the former, then plainly the build machine can continue doing exactly what it's doing, and whatever is currently aware of the overload would just need to be given information on the changes made so that it could selectively suppress jobs. But I somehow doubt it's that simple.
The pie-in-the-sky optimal interface would integrate more deeply, and might require a bit of rearchitecting. It really wants to be a daemon monitoring these notifications:
- job completion, with status
- new slave available (probably because it completed a job, but also when adding to the pool or rebooting or whatever)
- changes pushed, with a way of knowing what's in that change
- star comment added
The "new slave available" notification might actually be a synchronous call, since it would be the only thing kicking off new jobs. Optionally, this daemon could cancel known-to-be-bad jobs, trigger clobbers, and auto-star in limited cases.
Oh, and it wants to be able to distinguish regular pushes from merges and backouts, because failure probabilities are totally different across those. But a regex match is good enough for that.
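A minimal sketch of what that regex match might look like. The patterns are guesses at common commit-message wording rather than an actual convention, and `classify_push` is a hypothetical name:

```python
import re

# Assumed wording only; real commit messages may phrase backouts and merges differently.
BACKOUT_RE = re.compile(r"\bback(?:ed|ing)?\s*out\b|\brevert", re.IGNORECASE)
MERGE_RE = re.compile(r"\bmerg(?:e|ed|ing)\b", re.IGNORECASE)


def classify_push(message):
    """Return 'backout', 'merge', or 'regular' for a push's commit message."""
    # Check backouts first so that "Backout of a merge" counts as a backout.
    if BACKOUT_RE.search(message):
        return "backout"
    if MERGE_RE.search(message):
        return "merge"
    return "regular"
```

The scheduler could then keep a separate failure-probability estimate per category.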
In other words, it kind of wants to be the global scheduler. It would maintain state. Version 1 would watch incoming pushes and queue up all the build jobs. When a build job completed, it would queue up the test jobs, only it wouldn't be a linear queue because when another build came in it would need to reimplement the current coalescing strategy. When a slave became available, it would throw a job at it. Ignoring the (enormous) buildbot architectural questions, this should be pretty quick and straightforward to implement.
Later versions would be maintaining state to quickly and correctly answer the question, when a new slave is available, "what is the most useful job to run on this machine?" Usually that would be grabbing one of the test jobs from the most recent build, but could be bisecting coalesced failures or retriggering possibly intermittent failures.
To correctly answer the "most useful job" question, it would need to maintain estimates of the probability of any given job failing, as well as an estimate of the current state of every type of job in the tree (eg M1 is (85% probability) failing from one of the last 3 pushes, or (15% probability) is a not-yet-starred intermittent failure; M2 is totally happy with respect to the latest push.) That means it could eventually provide a sheriff's dashboard, enumerating the possible causes of the current horrific breakage and its plan for figuring out what's going on (which of course can be overridden at any time via manual retriggers or whatever.) It could even give its logic for why it picked each upcoming job. It should be written to be reactive, though, so it doesn't depend on anything following its advice.
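As a rough illustration of the "most useful job" choice, here is a sketch that scores each candidate by expected information per machine-minute. Using the entropy of the pass/fail outcome as the measure of information is an assumption made for this sketch, and the function names and the example numbers are made up:

```python
import math


def outcome_entropy(p_fail):
    """Expected bits learned from a job that fails with probability p_fail."""
    if p_fail <= 0.0 or p_fail >= 1.0:
        return 0.0  # a certain outcome tells us nothing new
    return -(p_fail * math.log2(p_fail) + (1 - p_fail) * math.log2(1 - p_fail))


def most_useful_job(candidates):
    """Pick the candidate with the most expected bits per machine-minute.

    candidates: iterable of (name, p_fail, minutes) tuples, all estimates.
    """
    return max(candidates, key=lambda c: outcome_entropy(c[1]) / c[2])


# Hypothetical estimates: an almost-certainly-green job versus a bisection step.
candidates = [
    ("M2 on latest push", 0.02, 30),
    ("bisect M1 on push B", 0.50, 30),
]
best = most_useful_job(candidates)
```

With these numbers the bisection step wins, because its outcome is far less predictable than the near-certain green.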
In fact, an alternative implementation route would be to implement the dashboard with all the crazy estimation stuff first, but not give it any ability to start/stop/star jobs. Then it could be validated on actual data before giving it the reins.
This would not want to live on the builders, though. It needs global visibility.
> How does coalescing happen? Does the build machine always request the
> full set of tests, and then buildbot ignores the request if it's
> overloaded? Or does the build machine actually know something about the
> overload state? If the former, then plainly the build machine can
> continue doing exactly what it's doing, and whatever is currently aware
> of the overload would just need to be given information on the changes
> made so that it could selectively suppress jobs. But I somehow doubt
> it's that simple.
tl;dr - builds and tests are greedy - a machine will grab all pending work of the same type when it starts a build/test
Coalescing happens on the buildbot master at the time when a build starts. Once a machine is available to start a job the default behaviour is to grab all other pending jobs of the same type. The primary exceptions to this are try jobs where coalescing is disabled completely.
For builds, this turns into something like this when we're running at full capacity:
* push A -> pending build requests for win32, linux64, etc.
* push B -> pending build requests for win32, linux64, etc.
* push C -> pending build requests for win32, linux64, etc.
* win32 build slave becomes available. build master coalesces pending requests for win32 A,B,C into a single job, and tells slave to checkout/build the latest code (C).
* push D -> pending build requests for win32, linux64, etc.
* push E -> pending build requests for win32, linux64, etc.
* win32 build slave becomes available. build master coalesces pending requests for win32 D,E into a single job, and tells slave to checkout/build the latest code (E).
At this point, the build master has a lot of information about the changes going into A,B,C,D,E, including which files have changed. This data isn't currently communicated to the build slave, nor does it influence decisions about what should be built in most cases.
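The sequence above can be sketched as a toy model. This is not buildbot code; the class and method names are hypothetical and only mirror the greedy behaviour described:

```python
from collections import defaultdict


class CoalescingMaster:
    """Toy model of greedy coalescing on the build master."""

    def __init__(self):
        # platform -> pushes with a pending build request, oldest first
        self.pending = defaultdict(list)

    def push(self, push_id, platforms):
        """Record a pending build request on every platform for this push."""
        for platform in platforms:
            self.pending[platform].append(push_id)

    def slave_available(self, platform):
        """Grab all pending requests for the platform and build only the newest."""
        requests = self.pending.pop(platform, [])
        if not requests:
            return None
        return {"platform": platform, "builds": requests[-1], "coalesced": requests}


master = CoalescingMaster()
for push_id in "ABC":
    master.push(push_id, ["win32", "linux64"])
job1 = master.slave_available("win32")  # builds C, covering A, B and C
for push_id in "DE":
    master.push(push_id, ["win32", "linux64"])
job2 = master.slave_available("win32")  # builds E, covering D and E
```

In this model the linux64 requests for all five pushes stay pending until a linux64 slave frees up.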
For each build platform, when the builds of C,E finish, they trigger tests by notifying the build master of a few pieces of data: branch, revision, and platform, as well as URLs to the builds, tests, and symbols. This results in the pending queue for tests looking like:
* win32 mozilla-central C mochitests-1 http://....
* win32 mozilla-central C mochitests-2 http://....
...
* win32 mozilla-central E mochitests-1 http://....
These are subject to the same coalescing behaviours as the builds. So all the "mochitests-1" jobs for win32 mozilla-central will be coalesced the next time a slave is free. Note that the pending requests for test jobs only include the revision, not the list of files that were changed for the build. Also note that the test requests give no indication of how many pushes were coalesced into one build. Pushes A,B,D never existed as far as tests are concerned.
This isn't to say that we *can't* change which tests are run in response to which files are changed, rather that it's a significant change from the current implementation.
> The pie-in-the-sky optimal interface would integrate more deeply, and
> might require a bit of rearchitecting. It really wants to be a daemon
> monitoring these notifications:
> - job completion, with status
> - new slave available (probably because it completed a job, but also
> when adding to the pool or rebooting or whatever)
> - changes pushed, with a way of knowing what's in that change
> - star comment added
> The "new slave available" notification might actually be a synchronous
> call, since it would be the only thing kicking off new jobs. Optionally,
> this daemon could cancel known-to-be-bad jobs, trigger clobbers, and
> auto-star in limited cases.
> Oh, and it wants to be able to distinguish regular pushes from merges
> and backouts, because failure probabilities are totally different across
> those. But a regex match is good enough for that.
> In other words, it kind of wants to be the global scheduler. It would
> maintain state. Version 1 would watch incoming pushes and queue up all
> the build jobs. When a build job completed, it would queue up the test
> jobs, only it wouldn't be a linear queue because when another build came
> in it would need to reimplement the current coalescing strategy. When a
> slave became available, it would throw a job at it. Ignoring the
> (enormous) buildbot architectural questions, this should be pretty quick
> and straightforward to implement.
Indeed the architectural issues there are enormous... intelligent scheduling is really tricky to get right, and then trickier to implement in buildbot. We've been working on a few approaches that may help: one is to basically dump out all of the relevant state and events from the buildbot master to make it consumable from external processes. These processes can then inject new work into the system at their own pace.
This should make it easier to implement schedulers that require more state, or that may require some "expensive" operations to figure out what to do next (e.g. looking up past results in a DB, checking starred status, etc.)
One big difference from what you describe is that buildbot doesn't generate work in response to slave availability; it instead keeps a list of pending work, and which slaves are eligible to do it, and assigns the work out when slaves are free.
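To make that difference concrete, here is a toy model of the pending-work approach. It is only an illustration with hypothetical names, not how buildbot is implemented:

```python
from collections import deque


class PendingQueue:
    """Work waits in a queue; a free slave takes the oldest job it is eligible for."""

    def __init__(self):
        self.pending = deque()  # items are (job_name, set_of_eligible_slaves)

    def add(self, job_name, eligible_slaves):
        """Queue a job together with the slaves allowed to run it."""
        self.pending.append((job_name, set(eligible_slaves)))

    def assign(self, slave):
        """Called when a slave is free; return a job name, or None if nothing fits."""
        for i, (job_name, eligible) in enumerate(self.pending):
            if slave in eligible:
                del self.pending[i]  # safe: we return right away, no further iteration
                return job_name
        return None


queue = PendingQueue()
queue.add("win32 build", ["w1", "w2"])
queue.add("linux64 build", ["l1"])
first = queue.assign("l1")   # skips the win32 job, which l1 cannot run
second = queue.assign("l1")  # nothing left that l1 is eligible for
```

The point is that a free slave never creates work here. It only drains what is already queued.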
On Fri, Aug 31, 2012 at 3:16 PM, Ben Hearsum <bhear...@mozilla.com> wrote:
> I know this is in the works (sorry, I don't know which bug it's happening in),
> but we can't quite run all of our unit tests on AWS. Anything that
> depends on a GPU (reftest, some crashtests, and even some mochitests
> I've heard) can't run there. We definitely want to move everything we
> can to the cloud, though.
Would it be possible to use llvmpipe to make a CPU-only config on AWS
that to Firefox looks like a config with a GPU?
> On Fri, Aug 31, 2012 at 3:16 PM, Ben Hearsum <bhear...@mozilla.com> wrote:
>> I know this is in the works (sorry, I don't know which bug it's happening in),
>> but we can't quite run all of our unit tests on AWS. Anything that
>> depends on a GPU (reftest, some crashtests, and even some mochitests
>> I've heard) can't run there. We definitely want to move everything we
>> can to the cloud, though.
> Would it be possible to use llvmpipe to make a CPU-only config on AWS
> that to Firefox looks like a config with a GPU?
Would testing like that constitute a valid test? We've been pretty
insistent that we test on real-world things in the past.
On Tue, Sep 18, 2012 at 3:11 PM, Ben Hearsum <bhear...@mozilla.com> wrote:
> On 09/18/12 07:57 AM, Henri Sivonen wrote:
>> On Fri, Aug 31, 2012 at 3:16 PM, Ben Hearsum <bhear...@mozilla.com> wrote:
>>> I know this is in the works (sorry, I don't know which bug it's happening in),
>>> but we can't quite run all of our unit tests on AWS. Anything that
>>> depends on a GPU (reftest, some crashtests, and even some mochitests
>>> I've heard) can't run there. We definitely want to move everything we
>>> can to the cloud, though.
>> Would it be possible to use llvmpipe to make a CPU-only config on AWS
>> that to Firefox looks like a config with a GPU?
> Would testing like that constitute a valid test? We've been pretty
> insistent that we test on real-world things in the past.
To the extent there are already Linux distros that run Gnome Shell on
llvmpipe when suitable GPU OpenGL drivers are missing and Ubuntu is
moving to running the 3D version of Unity on llvmpipe when suitable
GPU OpenGL drivers are missing, I'd expect running on top of llvmpipe
to correspond to one kind of real-world situation, though I don't
actually know how Firefox sees the OpenGL stack when running with
Gnome Shell/llvmpipe or Unity/llvmpipe.