Changes to the chromium buildbots and build.chromium.org


Nicolas Sylvain

Oct 26, 2010, 9:42:00 PM
to chromium-dev
Hi,

  We will make important changes to build.chromium.org tonight.  Those changes should be
quick to roll out, but we plan to close the tree for up to 1 hour starting around 8 PM (Pacific Time) in
case we run into issues.

  Once this is done, I will update this thread with more information about what changes we've made 
and what to expect from build.chromium.org in the foreseeable future.

  Thank you,

  Nicolas

Nicolas Sylvain

Oct 27, 2010, 2:52:59 AM
to chromium-dev
Hi,

Tonight we have transitioned the chromium buildbots to a new datacenter where we will be able to grow
our infrastructure significantly faster.

This was the second step in a series of improvements we plan to make over the next few months.

A number of changes went live today, and I'd like to mention a few to make sure no one gets confused
by the new layout.

1. Performance Tests

The perf bots have been migrated off the main console page. This does not mean they are not important; it only means that there were enough of them to justify a new waterfall.
The status of this waterfall is now displayed at the top of the main console page (the third line).
Until you hear more from us about this, the sheriffs should continue looking at the perf bots like they do today. Since they are slightly less visible, Chase and I will be monitoring them more closely for the next few weeks. We will then decide whether the current situation is sustainable, or whether we need to create a perf sheriff rotation (like we did for memory).

2. Module bots

We used to have dedicated machines building and testing net and base. This was a waste of resources and took up a lot of valuable space on the console page.  The new console does not have dedicated module bots; base and net now run on the same testers where ui_tests and browser_tests run.  We will monitor the cycle time and adjust as needed.
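
For anyone curious how this looks on the master side, here is a minimal, hypothetical buildbot sketch of the idea (this is not the actual Chromium master config; the step and binary names are assumptions): base and net become extra test steps on the same factory that already runs ui_tests and browser_tests, instead of having builders of their own.

from buildbot.process.factory import BuildFactory
from buildbot.steps.shell import Test

# Illustrative only: one tester factory runs the former "module" suites
# (base_unittests, net_unittests) alongside the existing UI/browser suites.
tester_factory = BuildFactory()
for suite in ['base_unittests', 'net_unittests',   # formerly on module bots
              'ui_tests', 'browser_tests']:
    tester_factory.addStep(Test(
        name=suite,
        command=['./' + suite],                    # hypothetical invocation
        description=['running', suite],
        descriptionDone=[suite]))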


3. New URLs

We've changed considerably how the URLs are constructed, but there should not be any impact yet, because we added redirects from the old URL format to the new one.

If you have any bookmarks or hardcoded links to build.chromium.org, I suggest you update them to the new URLs you get redirected to. We plan to delete the redirects in a few months.
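
If you have a long list of hardcoded links, one way to find the new targets is to follow the redirects with a small script. A sketch along these lines (the URL in the list is just a placeholder for whatever old links you have bookmarked):

import urllib.request

# Old build.chromium.org links to translate; replace with your own bookmarks.
OLD_LINKS = [
    'http://build.chromium.org/buildbot/waterfall/console',  # placeholder
]

for old in OLD_LINKS:
    try:
        # urlopen() follows the redirect; geturl() returns the final (new) URL.
        resp = urllib.request.urlopen(old, timeout=30)
        print('%s -> %s' % (old, resp.geturl()))
    except Exception as err:
        print('%s -> error: %s' % (old, err))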

4. Flakiness. 

We spent the last year aggressively marking tests as flaky when they randomly failed on our old buildbots.  The brand new hardware slightly changes the timing of the tests, and we are starting to see a small increase in flakiness on the new machines.  If you see new flakiness, please add the correct flag to the affected tests.  If you think it might be a problem with the bot, please let me know and I will take a look.

5. Memory waterfall

The team working on the memory tests has been doing a great job helping to make it green, but there is still some redness left.  It should take another day or two before we can update the expectation file correctly.

6. New webkit canary waterfall

The webkit canaries used to be on the FYI waterfall. This is no longer the case; we now have a dedicated waterfall where all the canaries are:

7. New "Google Chrome" waterfall

We used to have a few bots building the "Google Chrome" version of chromium on the FYI waterfall. They have also been moved to their own waterfall:


8. The FYI Waterfall

The FYI waterfall is where we usually put all the bots that don't really impact the team when they go red.  Webkit and Google Chrome were the exceptions, and they are now out of the FYI waterfall.

We will go through and make sure each bot on the FYI waterfall has an owner who cares about keeping it green.  Right now there is a lot of redness, but this is expected. I haven't had time to look at all the failures one by one, so if you see something and you know how to fix it, please let me know.


As always, if you see anything weird, please let us know.

Thanks

Nicolas, and the chrome infrastructure team.

Paweł Hajdan, Jr.

Oct 27, 2010, 3:29:42 AM
to nsyl...@chromium.org, chromium-dev
On Wed, Oct 27, 2010 at 08:52, Nicolas Sylvain <nsyl...@chromium.org> wrote:
2. Module bots

We used to have dedicated machines building and testing net and base. This was a waste of resources and took up a lot of valuable space on the console page.  The new console does not have dedicated module bots; base and net now run on the same testers where ui_tests and browser_tests run.  We will monitor the cycle time and adjust as needed.

Sounds good! Actually, there are fewer bots, and there are groups for webkit and chromeos. I think it's easier to watch now. Thanks!
 
4. Flakiness. 

We spent the last year aggressively marking tests as flaky when they randomly failed on our old buildbots.  The brand new hardware slightly changes the timing of the tests, and we are starting to see a small increase in flakiness on the new machines.  If you see new flakiness, please add the correct flag to the affected tests.  If you think it might be a problem with the bot, please let me know and I will take a look.


Please remember to file a good bug, find an owner, and apply the proper bug label (Tests-Flaky, Tests-Fails, or Tests-Disabled). We are losing coverage for all of those test categories, so it's important that owners of the affected test areas know what's going on.
 
Nicolas, and the chrome infrastructure team.

I have a few more ideas for you. :-)

Can we have a few dedicated bots (FYI waterfall, dedicated "flakiness" waterfall, or something totally different) that would run tests based on the lkgr? Let me explain a bit more.

On the real tree, or with trybots, it's sometimes hard to tell whether a test is flaky, or whether it was a real failure, even if it's not obvious how the patch could cause it. This means changing the prefixes based on the main waterfall or try jobs is risky, and I have avoided doing that.

On the other hand, when Marc-Antoine started submitting daily empty try jobs to the Windows server, it was very easy to spot flakiness. The base revision used by try servers is lkgr, so we expect all tests to pass. The bots reboot between builds, so any failure is very unlikely to be a bot issue.

Moreover, we get a limited number of test runs on the main waterfall bots. The dedicated "flakiness" bots could run tests much more frequently, letting us detect flakiness faster.

As a next step, we could use an out-of-process launcher for all tests on those dedicated bots (to guard against crashing or hanging tests), and run all disabled tests. This would allow us to drop the different test prefixes that confuse people, have detailed information about disabled tests as well (I know that's one of the goals of the Testing Task Force), and finally, we wouldn't be running unstable tests on the main waterfall.
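
To make this concrete, here is a rough, hypothetical sketch of what such a dedicated flakiness bot could do at a fixed LKGR revision: list the tests in a gtest binary, run each one in its own subprocess with a timeout (so a crash or hang only costs that one test), and count failures over many repeated runs. The binary name, run count, and timeout are assumptions, not existing tooling.

import collections
import subprocess

TEST_BINARY = './base_unittests'   # hypothetical gtest binary built at LKGR
RUNS = 20                          # how many times to repeat the whole suite
TIMEOUT_SECONDS = 120              # per-test timeout; hangs count as failures

def list_tests():
    # Parse --gtest_list_tests output into 'Case.Test' names.
    out = subprocess.check_output([TEST_BINARY, '--gtest_list_tests'],
                                  text=True)
    tests, case = [], ''
    for line in out.splitlines():
        if not line.strip():
            continue
        name = line.split()[0]         # drop any trailing gtest comments
        if line.startswith(' '):
            tests.append(case + name)  # indented lines are test names
        else:
            case = name                # unindented lines are test cases
    return tests

failures = collections.Counter()
tests = list_tests()
for _ in range(RUNS):
    for test in tests:
        try:
            # Each test gets its own process, so crashes/hangs are isolated.
            rc = subprocess.call([TEST_BINARY, '--gtest_filter=' + test],
                                 timeout=TIMEOUT_SECONDS)
        except subprocess.TimeoutExpired:
            rc = -1
        if rc != 0:
            failures[test] += 1

for test, count in failures.most_common():
    print('%s failed %d/%d runs' % (test, count, RUNS))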

Finally, it'd be great to enable Slavelastic on the main waterfall, to improve the cycle time. I think there is a very good correlation between a short cycle time and a greener, more open tree.

Ojan Vafai

Oct 27, 2010, 11:59:35 AM
to nsyl...@chromium.org, chromium-dev
On Tue, Oct 26, 2010 at 11:52 PM, Nicolas Sylvain <nsyl...@chromium.org> wrote:
6. New webkit canary waterfall

The webkit canaries used to be on the FYI waterfall. This is no longer the case; we now have a dedicated waterfall where all the canaries are:

Awesome. For webkit gardeners, this means we now have a console page where you can clearly see whether a given revision has completed on all the bots: http://build.chromium.org/p/chromium.webkit/console 