eventually providing coverage for master.chromium waterfall


Paweł Hajdan, Jr.

Nov 3, 2014, 2:00:59 PM
to infr...@chromium.org, hackability-cy
I'm thinking about the actions required to provide CQ coverage for each master on the main waterfall.

Obviously this will be prioritized by data from sheriff-o-matic, but when some bugs are blocked on capacity, we may be able to make progress in other areas, even if it's just building consensus on what the right thing to do is. In most cases it's just providing trybots.

In the case of the master.chromium waterfall (http://build.chromium.org/p/chromium/waterfall) there are some differences from most other waterfalls:

1. It does clobber builds. Some of them can take over an hour. It's not obvious to me yet why they take so long even with goma. I plan to take a closer look.

2. It runs "sizes" step (e.g. http://build.chromium.org/p/chromium/builders/Linux/builds/54958/steps/sizes/logs/stdio). It'd need a bit of work so we could retry it without patch on the trybot.

3. It runs "archive build" step (e.g. http://build.chromium.org/p/chromium/builders/Linux/builds/54958/steps/archive_build/logs/stdio) which we can't do the same way on the tryserver for obvious reasons. We may decide to just skip it or to upload to a different, temporary bucket (mostly to verify that all the files we expect are present). Another approach would be just not to test that step on trybots.

4. Then there are other steps, like "uploading perf_expectations.json", where it's not obvious how to handle them - similar to the one above.

While I'm not suggesting specific solutions yet, how about we rethink the role of the master.chromium waterfall? What is its purpose? Can we merge some of its steps somewhere else, or move them to FYI? What would be the best way to provide tryserver coverage for the remaining bits?
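As a concrete illustration of the temporary-bucket idea for the archive step (point 3), a trybot could upload to a scratch bucket (or skip the upload) while still verifying that every expected file is present. This is only a sketch; the bucket names and helper functions here are made-up placeholders, not real infrastructure:

```python
import os

# Placeholder bucket names - assumptions for illustration only.
PROD_BUCKET = 'gs://chromium-archive-prod'
SCRATCH_BUCKET = 'gs://chromium-archive-try-tmp'  # e.g. with a short TTL

def archive_bucket(is_tryserver):
    """Pick the destination bucket so trybots never touch production."""
    return SCRATCH_BUCKET if is_tryserver else PROD_BUCKET

def missing_files(build_dir, expected):
    """Return the expected archive inputs that are absent from build_dir.

    Even if the trybot skips the upload entirely, running this check
    still catches "file we expect is not present" regressions.
    """
    return [f for f in expected
            if not os.path.isfile(os.path.join(build_dir, f))]
```

The check and the upload are deliberately separated, so the "just verify the files" variant falls out of the same code.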

Paweł

John Abd-El-Malek

Nov 19, 2014, 8:27:59 PM
to Paweł Hajdan, Jr., infr...@chromium.org, hackability-cy
On Mon, Nov 3, 2014 at 11:00 AM, Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:
I'm thinking about the actions required to provide CQ coverage for each master on the main waterfall.

Obviously this will be prioritized by data from sheriff-o-matic, but when some bugs are blocked on capacity, we may be able to make progress in other areas, even if it's just building consensus on what the right thing to do is. In most cases it's just providing trybots.

In the case of the master.chromium waterfall (http://build.chromium.org/p/chromium/waterfall) there are some differences from most other waterfalls:

1. It does clobber builds. Some of them can take over an hour. It's not obvious to me yet why they take so long even with goma. I plan to take a closer look.

Even with goma, Mac/Win clobber builds on VMs are very slow; building everything takes nearly an hour.
 

2. It runs "sizes" step (e.g. http://build.chromium.org/p/chromium/builders/Linux/builds/54958/steps/sizes/logs/stdio). It'd need a bit of work so we could retry it without patch on the trybot.

How often does sizes turn red on the main waterfall for legitimate reasons? 

 

3. It runs "archive build" step (e.g. http://build.chromium.org/p/chromium/builders/Linux/builds/54958/steps/archive_build/logs/stdio) which we can't do the same way on the tryserver for obvious reasons. We may decide to just skip it or to upload to a different, temporary bucket (mostly to verify that all the files we expect are present). Another approach would be just not to test that step on trybots.

Same question as above.

4. Then there are other steps, like "uploading perf_expectations.json", where it's not obvious how to handle them - similar to the one above.

While I'm not suggesting specific solutions yet, how about we rethink the role of the master.chromium waterfall? What is its purpose? Can we merge some of its steps somewhere else, or move them to FYI? What would be the best way to provide tryserver coverage for the remaining bits?

Perhaps it would be good to ask these questions on chromium-dev to see what others feel about this bot?
 

Paweł

--
You received this message because you are subscribed to the Google Groups "infra-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to infra-dev+...@chromium.org.
To post to this group, send email to infr...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/infra-dev/CAATLsPaGLQRu66xth2Njt4sKB_YD%2BpHmhmMbh8P0zP-3g%3DA5EA%40mail.gmail.com.

Paweł Hajdan, Jr.

Nov 20, 2014, 6:11:11 AM
to John Abd-El-Malek, chromium-dev, infr...@chromium.org, hackability-cy
On Thu, Nov 20, 2014 at 2:27 AM, John Abd-El-Malek <j...@chromium.org> wrote:
On Mon, Nov 3, 2014 at 11:00 AM, Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:
1. It does clobber builds. Some of them can take over an hour. It's not obvious to me yet why they take so long even with goma. I plan to take a closer look.
Even with goma, Mac/Win clobber builds on VMs are very slow; building everything takes nearly an hour.

I have yet to take a closer look, but are you suggesting this would be too slow anyway to add such a trybot to the default CQ set?

This would likely mean we'd have to remove master.chromium from the main waterfall, right?
 
2. It runs "sizes" step (e.g. http://build.chromium.org/p/chromium/builders/Linux/builds/54958/steps/sizes/logs/stdio). It'd need a bit of work so we could retry it without patch on the trybot.

How often does sizes turn red on the main waterfall for legitimate reasons? 


I don't think we've had a month without failures, and sometimes the bots are failing for several days.
 
3. It runs "archive build" step (e.g. http://build.chromium.org/p/chromium/builders/Linux/builds/54958/steps/archive_build/logs/stdio) which we can't do the same way on the tryserver for obvious reasons. We may decide to just skip it or to upload to a different, temporary bucket (mostly to verify that all the files we expect are present). Another approach would be just not to test that step on trybots.

Same question as above..

This failed just 23 times in 2014 so far. Most are gsutil/network flakes.

Note this step has side effects that are undesired if run as-is on the tryserver (it would upload to production GS). See the original post for some ideas on how to handle it.


chromium_utils.CopyFileToDir(src/out/Release/remoting-webapp.v2.zip, /b/build/slave/Linux_x64/chrome_staging/, dest_fn=remoting-webapp.v2.zip)
Traceback (most recent call last):
  File "../../../scripts/slave/chromium/archive_build.py", line 760, in <module>
    sys.exit(main(None))
  File "../../../scripts/slave/chromium/archive_build.py", line 756, in main
    return s.ArchiveBuild()
  File "../../../scripts/slave/chromium/archive_build.py", line 571, in ArchiveBuild
    os.path.join(self._staging_dir, stage_subdir), dest_fn=stage_fn)
  File "/b/build/scripts/common/chromium_utils.py", line 485, in CopyFileToDir
    raise PathNotFound('Unable to find file %s' % src_path)
common.chromium_utils.PathNotFound: Unable to find file src/out/Release/remoting-webapp.v2.zip
program finished with exit code 1
elapsedTime=6.219438


Traceback (most recent call last):
  File "..\..\..\scripts\slave\chromium\archive_build.py", line 760, in <module>
    sys.exit(main(None))
  File "..\..\..\scripts\slave\chromium\archive_build.py", line 755, in main
    s = StagerByChromiumRevision(options)
  File "..\..\..\scripts\slave\chromium\archive_build.py", line 667, in __init__
    StagerBase.__init__(self, options, None)
  File "..\..\..\scripts\slave\chromium\archive_build.py", line 177, in __init__
    os.path.dirname(self._chrome_dir)) # src/ instead of src/chrome
  File "C:\b\build\scripts\slave\slave_utils.py", line 120, in GetHashOrRevision
    raise NotAnyWorkingCopy(wc_dir)
slave.slave_utils.NotAnyWorkingCopy: C:\b\build\slave\Win\build\src
program finished with exit code 1
elapsedTime=0.562000


The command line is too long.
program finished with exit code 1
elapsedTime=0.124000
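For reference, the PathNotFound failure in the first traceback boils down to a copy helper that fails loudly when the expected build artifact is missing. A minimal standalone sketch of that behavior (a simplified stand-in, not the actual chromium_utils code):

```python
import os
import shutil

class PathNotFound(Exception):
    """Raised when an expected build artifact is absent (cf. chromium_utils)."""

def copy_file_to_dir(src_path, dest_dir, dest_fn=None):
    # Mirror the CopyFileToDir failure mode from the first traceback:
    # raise rather than silently skipping a missing artifact, so the
    # archive step turns red instead of shipping an incomplete archive.
    if not os.path.isfile(src_path):
        raise PathNotFound('Unable to find file %s' % src_path)
    dest_path = os.path.join(dest_dir, dest_fn or os.path.basename(src_path))
    shutil.copyfile(src_path, dest_path)
    return dest_path
```

So a failure like the remoting-webapp.v2.zip one above means the build produced fewer outputs than the archive step expected, which is exactly the class of breakage a trybot-side "verify expected files" check could catch before landing.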

4. Then there are other steps, like "uploading perf_expectations.json", where it's not obvious how to handle them - similar to the one above.

While I'm not suggesting specific solutions yet, how about we rethink the role of the master.chromium waterfall? What is its purpose? Can we merge some of its steps somewhere else, or move them to FYI? What would be the best way to provide tryserver coverage for the remaining bits?

Perhaps it would be good to ask these questions on chromium-dev to see what others feel about this bot?

Adding chromium-dev then.

Paweł

Scott Graham

Nov 20, 2014, 12:58:01 PM
to Paweł Hajdan, Jr., John Abd-El-Malek, chromium-dev, infr...@chromium.org, hackability-cy
I think we need to keep a clobber bot, as we've had various problems over time that only show up on clobber, and might not be caught for a long time on incremental builds.
 


Paweł Hajdan, Jr.

Nov 21, 2014, 10:11:38 AM
to Scott Graham, John Abd-El-Malek, chromium-dev, infr...@chromium.org, hackability-cy
This is a good point. Do you have examples of these incremental build problems just for reference?

Now obviously people would prefer to test everything everywhere for every change, for CQ-landed changes not to break these tests, and for CQ to be fast. It's not always possible/feasible to accomplish all of that together at a given point in time.

Looks like we'd have to make some kind of tradeoff here. Possibilities I see:

1) add clobber builders to CQ; this actually shouldn't slow down CQ cycle time, which based on https://groups.google.com/a/chromium.org/d/msg/chromium-dev/wbjXEfbNZVE/NI4CXGkk00oJ is 55-57 minutes, while the slowest clobber build, Mac, takes about 40 minutes according to https://build.chromium.org/p/chromium/stats

2) remove clobber builders from main waterfall

3) accept the fact that CQ might occasionally break the clobber builders; explicitly exclude them from "CQ matches main waterfall" CY pillar

Please let me know if you see some other options; I'd like to consider them as well, and let's discuss which one to pick.

Paweł

Dirk Pranke

Nov 21, 2014, 12:45:58 PM
to Paweł Hajdan, Jr., Scott Graham, John Abd-El-Malek, chromium-dev, infr...@chromium.org, hackability-cy
We should bear in mind that we are currently resource-constrained, and that that's keeping us from doing other things we want to do, like the Blink merge.

I would much rather take a machine on every change to run the layout tests than I would to do a clobber build, for example.

On Fri, Nov 21, 2014 at 7:11 AM, Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:
This is a good point. Do you have examples of these incremental build problems just for reference?

Now obviously people would prefer to test everything everywhere for every change, for CQ-landed changes not to break these tests, and for CQ to be fast. It's not always possible/feasible to accomplish all of that together at a given point in time.

Looks like we'd have to make some kind of tradeoff here. Possibilities I see:

1) add clobber builders to CQ; this actually shouldn't slow down CQ cycle time, which based on https://groups.google.com/a/chromium.org/d/msg/chromium-dev/wbjXEfbNZVE/NI4CXGkk00oJ is 55-57 minutes, while the slowest clobber build, Mac, takes about 40 minutes according to https://build.chromium.org/p/chromium/stats

2) remove clobber builders from main waterfall

3) accept the fact that CQ might occasionally break the clobber builders; explicitly exclude them from "CQ matches main waterfall" CY pillar

I strongly vote for 3 :).

-- Dirk
 

Scott Graham

Nov 21, 2014, 1:02:47 PM
to Dirk Pranke, Paweł Hajdan, Jr., John Abd-El-Malek, chromium-dev, infr...@chromium.org, hackability-cy
On Fri, Nov 21, 2014 at 9:45 AM, Dirk Pranke <dpr...@chromium.org> wrote:
We should bear in mind the fact that we are currently resource constrained and that that's keeping us from doing other things we want to do, like the Blink merge.

I would much rather take a machine on every change to run the layout tests than I would to do a clobber build, for example.

On Fri, Nov 21, 2014 at 7:11 AM, Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:
This is a good point. Do you have examples of these incremental build problems just for reference?

Now obviously people would prefer to test everything everywhere for every change, for CQ-landed changes not to break these tests, and for CQ to be fast. It's not always possible/feasible to accomplish all of that together at a given point in time.

Looks like we'd have to make some kind of tradeoff here. Possibilities I see:

1) add clobber builders to CQ; this actually shouldn't slow down CQ cycle time, which based on https://groups.google.com/a/chromium.org/d/msg/chromium-dev/wbjXEfbNZVE/NI4CXGkk00oJ is 55-57 minutes, while the slowest clobber build, Mac, takes about 40 minutes according to https://build.chromium.org/p/chromium/stats

2) remove clobber builders from main waterfall

3) accept the fact that CQ might occasionally break the clobber builders; explicitly exclude them from "CQ matches main waterfall" CY pillar

I strongly vote for 3 :).

I think in this case #3 is OK as far as "clobber is broken".

We should endeavour to include in the CQ any other steps (e.g. "sizes") that happen to be on master.chromium incidentally, though.

John Abd-El-Malek

Nov 21, 2014, 3:28:24 PM
to Scott Graham, Dirk Pranke, Paweł Hajdan, Jr., chromium-dev, infr...@chromium.org, hackability-cy
On Fri, Nov 21, 2014 at 10:02 AM, Scott Graham <sco...@chromium.org> wrote:


On Fri, Nov 21, 2014 at 9:45 AM, Dirk Pranke <dpr...@chromium.org> wrote:
We should bear in mind the fact that we are currently resource constrained and that that's keeping us from doing other things we want to do, like the Blink merge.

I would much rather take a machine on every change to run the layout tests than I would to do a clobber build, for example.

On Fri, Nov 21, 2014 at 7:11 AM, Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:
This is a good point. Do you have examples of these incremental build problems just for reference?

Now obviously people would prefer to test everything everywhere for every change, for CQ-landed changes not to break these tests, and for CQ to be fast. It's not always possible/feasible to accomplish all of that together at a given point in time.

Looks like we'd have to make some kind of tradeoff here. Possibilities I see:

1) add clobber builders to CQ; this actually shouldn't slow down CQ cycle time, which based on https://groups.google.com/a/chromium.org/d/msg/chromium-dev/wbjXEfbNZVE/NI4CXGkk00oJ is 55-57 minutes, while the slowest clobber build, Mac, takes about 40 minutes according to https://build.chromium.org/p/chromium/stats

These builders run on bare-metal machines. Doing clobber builds on VMs, at least with the current VM configuration, will be slower.


2) remove clobber builders from main waterfall

3) accept the fact that CQ might occasionally break the clobber builders; explicitly exclude them from "CQ matches main waterfall" CY pillar

I strongly vote for 3 :).

+1. This happens pretty infrequently, and the cost of adding it to the CQ would be very large.
 

I think in this case #3 is OK as far as "clobber is broken".

We should endeavour to include in the CQ any other steps (e.g. "sizes") that happen to be on master.chromium incidentally, though.

I thought sizes doesn't work correctly with incremental builds, and needs a clobber?
 

-- Dirk
 

Please let me know if you see some other options, I'd like to consider them as well, and let's discuss which one to pick.

Paweł

On Thu, Nov 20, 2014 at 6:58 PM, Scott Graham <sco...@chromium.org> wrote:


On Thu, Nov 20, 2014 at 3:10 AM, Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:

On Thu, Nov 20, 2014 at 2:27 AM, John Abd-El-Malek <j...@chromium.org> wrote:
On Mon, Nov 3, 2014 at 11:00 AM, Paweł Hajdan, Jr. <phajd...@chromium.org> wrote:
1. It does clobber builds. Some of them can take over an hour. It's not obvious to me yet why they take so long even with goma. I plan to take a closer look.
Even with goma, Mac/Win clobber builds on VMs are very slow; building everything takes nearly an hour.

I have yet to take a closer look, but are you suggesting this would be too slow anyway to add such a trybot to the default CQ set?

This would likely mean we'd have to remove master.chromium from the main waterfall, right?
 
2. It runs "sizes" step (e.g. http://build.chromium.org/p/chromium/builders/Linux/builds/54958/steps/sizes/logs/stdio). It'd need a bit of work so we could retry it without patch on the trybot.

How often does sizes turn red on the main waterfall for legitimate reasons? 


I don't think we've had a month without failures, and sometimes the bots are failing for several days.

How many of those failures are legitimate, vs. code growing slowly and then hitting the previous limit?
 
 
3. It runs "archive build" step (e.g. http://build.chromium.org/p/chromium/builders/Linux/builds/54958/steps/archive_build/logs/stdio) which we can't do the same way on the tryserver for obvious reasons. We may decide to just skip it or to upload to a different, temporary bucket (mostly to verify that all the files we expect are present). Another approach would be just not to test that step on trybots.

Same question as above..

This failed just 23 times in 2014 so far. Most are gsutil/network flakes.

Note this step has side effects that are undesired if run as-is on the tryserver (it would upload to production GS). See the original post for some ideas on how to handle it.

Seems like no need to worry about this, then.