On Thu 06 Sep 2012 07:16:32 PM PDT, Ehsan Akhgari wrote:
> Steve Fink wrote:
>> On Tue 04 Sep 2012 02:20:37 PM PDT, Ehsan Akhgari wrote:
>>> On 12-09-03 12:59 PM, Johnathan Nightingale wrote:
>>>>
>>>> I'd also be perfectly okay with saying that changes someone like
>>>> Ehsan makes to something like layout are gonna run the full suite
>>>> every time. Layout pushes in general are likely to touch surprising
>>>> things. But even granting that, Steve's suggestions could help
>>>> firefox, toolkit, mobile, js, nss, webgl, &c pushes get out of the
>>>> way by running subsets.
>>>>
>>>> We could label whole directories as "touching this ends the world,
>>>> test everything" and be pretty liberal about where we apply that
>>>> label because at the moment, we effectively apply it to everything.
>>>>
>>>> So who's gonna volunteer to do the strawman test-bucket vs code
>>>> location matrix? :)
>>>
>>> This makes sense. Do you wanna file a bug in Core::Build Config and
>>> assign it to Steve? ;-)
>>
>> I'd be fine with that, though I also wouldn't get to it for a while
>> unless I make it through a couple of other projects faster than I
>> have been so far.
>>
>> Then again... ok, here's v1, in bash:
>>
>> echo "run everything"
>>
>> or in Python
>>
>> print("run everything")
>>
>> Now, who can hook this into buildbot? I'll patch it from there. :-)
>
> So, I discussed this idea briefly with catlee today. Here's the
> gist. Doing this is not as easy as I thought it would be, since it is
> the build machine which schedules the test jobs once the build is
> finished, and buildbot is not involved in the decision. However, it
> is buildbot that knows which files have changed in a given push.
> So, we need to stream that information into the builder somehow so
> that it can make the call on which test suites to run.
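(To make "which test suites to run" concrete: v2 of my script above
could be a dumb path-prefix table. Every directory and suite name
below is made up for illustration; none of this is real buildbot
config.)

# Hypothetical mapping from changed paths to test buckets.
FULL_SUITE = {"mochitest", "reftest", "crashtest", "xpcshell",
              "jsreftest"}

# "Touching this ends the world, test everything" directories.
RUN_EVERYTHING = ("layout/", "gfx/", "xpcom/", "build/")

SUITES_BY_PREFIX = {
    "js/src/": {"xpcshell", "jsreftest"},
    "browser/": {"mochitest"},
    "toolkit/": {"mochitest", "xpcshell"},
    "mobile/": {"mochitest"},
}

def suites_for_push(changed_files):
    suites = set()
    for path in changed_files:
        if path.startswith(RUN_EVERYTHING):
            return FULL_SUITE
        for prefix, wanted in SUITES_BY_PREFIX.items():
            if path.startswith(prefix):
                suites.update(wanted)
                break
        else:
            return FULL_SUITE  # unknown territory: be conservative
    return suites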
How does coalescing happen? Does the build machine always request the
full set of tests, and then buildbot ignores the request if it's
overloaded? Or does the build machine actually know something about the
overload state? If the former, then plainly the build machine can
continue doing exactly what it's doing, and whatever is currently aware
of the overload would just need to be given information on the changes
made so that it could selectively suppress jobs. But I somehow doubt
it's that simple.
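(If it did turn out to be that simple, though, the suppression hook
could be about this thin, reusing suites_for_push from the sketch
above:)

def should_run(job_suite, changed_files, overloaded):
    # Under load, skip suites this push can't plausibly affect.
    return not overloaded or job_suite in suites_for_push(changed_files)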
The pie-in-the-sky optimal interface would integrate more deeply, and
might require a bit of rearchitecting. It really wants to be a daemon
monitoring these notifications:
- job completion, with status
- new slave available (probably because it completed a job, but also
when adding to the pool or rebooting or whatever)
- changes pushed, with a way of knowing what's in that change
- star comment added
The "new slave available" notification might actually be a synchronous
call, since it would be the only thing kicking off new jobs. Optionally,
this daemon could cancel known-to-be-bad jobs, trigger clobbers, and
auto-star in limited cases.
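The skeleton of that daemon is easy to sketch. The event names and
payloads here are invented, and the stub methods stand in for the
real buildbot/hg wiring, which is of course the hard part:

import queue

class SchedulerDaemon:
    def __init__(self):
        self.events = queue.Queue()  # fed by notification bridges
        self.pending = []            # jobs waiting for a slave

    def run(self):
        while True:
            kind, payload = self.events.get()
            if kind == "push":             # includes the changed files
                self.pending.extend(self.jobs_for_push(payload))
            elif kind == "job-complete":   # status feeds our estimates
                self.note_result(payload)
            elif kind == "slave-available":
                self.dispatch(payload)     # the one place jobs start
            elif kind == "star-added":
                self.note_star(payload)    # sheriff says "known orange"

    def jobs_for_push(self, push): return []   # stub
    def note_result(self, result): pass        # stub
    def note_star(self, star): pass            # stub
    def dispatch(self, slave): pass            # stub: pick best job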
Oh, and it wants to be able to distinguish regular pushes from merges
and backouts, because failure probabilities are totally different across
those. But a regex match is good enough for that.
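Something like this, where the patterns are my rough approximation of
the usual commit-message conventions:

import re

MERGE_RE = re.compile(r"^merge\b|\bmerge .* to ", re.I)
BACKOUT_RE = re.compile(r"^back(ed|ing)? ?out\b|^revert\b", re.I)

def push_kind(first_line):
    if BACKOUT_RE.search(first_line):
        return "backout"
    if MERGE_RE.search(first_line):
        return "merge"
    return "regular"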
In other words, it kind of wants to be the global scheduler. It would
maintain state. Version 1 would watch incoming pushes and queue up all
the build jobs. When a build job completed, it would queue up the test
jobs, though not as a simple linear queue: when another build came in,
it would need to reimplement the current coalescing strategy. When a
slave became available, it would throw a job at it. Ignoring the
(enormous) buildbot architectural questions, this should be pretty
quick and straightforward to implement.
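A sketch of that v1 test-queue step, assuming "coalescing" just means
only the newest pending request per suite survives (which is roughly
my understanding of what happens today):

def queue_test_jobs(pending, build_rev, suites):
    # pending is a list of (suite, rev) pairs not yet started. A
    # newer build's test request supersedes older pending ones for
    # the same suite, mirroring the current coalescing behavior.
    pending = [(s, rev) for (s, rev) in pending if s not in suites]
    pending.extend((s, build_rev) for s in suites)
    return pending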
Later versions would maintain state so they could quickly and
correctly answer the question, when a new slave becomes available, of
"what is the most useful job to run on this machine?" Usually that
would mean grabbing one of the test jobs from the most recent build,
but it could also mean bisecting coalesced failures or retriggering
possibly intermittent failures.
To correctly answer the "most useful job" question, it would need to
maintain estimates of the probability of any given job failing, as
well as an estimate of the current state of every type of job in the
tree (e.g. M1 is failing from one of the last 3 pushes with 85%
probability, or is a not-yet-starred intermittent failure with 15%
probability; M2 is totally happy with respect to the latest push).
That means it could eventually provide a sheriff's dashboard,
enumerating the possible causes of the current horrific breakage and
its plan for figuring out what's going on (which of course could be
overridden at any time via manual retriggers or whatever). It could
even explain why it picked each upcoming job. It should be written to
be reactive, though, so it doesn't depend on anything following its
advice.
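Circling back to "most useful": here's a toy scorer. Every weight and
probability is invented; a real version would estimate them from job
history:

from collections import namedtuple

Job = namedtuple("Job", "kind p_failing p_intermittent")

def score(job):
    if job.kind == "tip-test":      # untested suite on the newest push
        return 1.0
    if job.kind == "bisect":        # splits a coalesced failure range
        return 0.8 * job.p_failing  # e.g. the M1-is-85%-busted case
    if job.kind == "retrigger":     # confirm a suspected intermittent
        return 0.5 * job.p_intermittent
    return 0.1

def pick_best_job(candidates):
    return max(candidates, key=score)

# e.g. pick_best_job([Job("bisect", 0.85, 0), Job("retrigger", 0, 0.7)])
# picks the bisection, since 0.8 * 0.85 > 0.5 * 0.7.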
In fact, an alternative implementation route would be to implement
the dashboard with all the crazy estimation stuff first, but not give
it any ability to start/stop/star jobs. Then it could be validated on
actual data before giving it the reins.
This would not want to live on the builders, though. It needs global
visibility.