Bandwidth issues in Chrome OS Lab (crbug.com/524814)

Don Garrett

Aug 31, 2015, 2:26:24 PM
to tele...@chromium.org
TL;DR: We need help understanding what is causing the bandwidth spikes.
Excessive Details:

The Chrome OS hardware lab is having problems because a substantial increase in bandwidth usage hit our limits and started breaking the lab.
After some effort, we've pretty much proved that it started with a Chrome change in the following range:

  last unmolested build: 46.0.2483.0
  first problematic build: 46.0.2490.3

We have currently pinned Chrome to 46.0.2483.0 as a workaround.
We have no good way to reproduce the problem outside of our production builds, and reproducing it there randomly breaks most of our builds.
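
To make the suspect range concrete, here is a minimal sketch of how a candidate Chrome version could be checked against it. The helper names (parse_version, in_suspect_range) are hypothetical and not part of any lab tooling; only the two version strings come from this thread.

```python
def parse_version(version):
    """Split a dotted Chrome version such as '46.0.2483.0' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


# The two endpoints reported in this thread.
LAST_GOOD = parse_version("46.0.2483.0")
FIRST_BAD = parse_version("46.0.2490.3")


def in_suspect_range(version):
    """Return True if a change that first shipped in `version` could be the cause.

    The range excludes the last good build and includes the first bad build.
    """
    return LAST_GOOD < parse_version(version) <= FIRST_BAD
```

Comparing integer tuples rather than raw strings avoids ordering mistakes when a version component gains a digit.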

It seems like a reasonable guess that this is a telemetry change of some kind, so can anyone make reasonable guesses about the cause?
Any major increase in data transferred between the DUT and the test server, or in communication between the DUT and any external server, could be responsible.

This graph shows the relevant bandwidth usage. We see problems when the bandwidth hits (roughly) 1G in usage. We pinned Chrome on Friday and stopped seeing issues. Before that, they were happening roughly every 8 hours, which is when our canaries run.
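
One way to check the "every 8 hours" pattern against exported interface data is sketched below. It assumes the samples are available as (unix_seconds, bits_per_second) pairs sorted by time, and that "1G" means roughly 1 Gbit/s (the graph in question plots interface traffic in bits per second). The function names and the data format are hypothetical.

```python
THRESHOLD_BPS = 1_000_000_000  # assumption: "1G" means about 1 Gbit/s


def spike_starts(samples, threshold=THRESHOLD_BPS):
    """Return timestamps where usage rises from below the threshold to at or above it.

    `samples` is a list of (unix_seconds, bits_per_second) pairs sorted by time.
    """
    starts = []
    above = False
    for timestamp, bps in samples:
        if bps >= threshold and not above:
            starts.append(timestamp)
        above = bps >= threshold
    return starts


def gaps_hours(starts):
    """Return the spacing, in hours, between consecutive spike starts."""
    return [(later - earlier) / 3600 for earlier, later in zip(starts, starts[1:])]
```

If the gaps cluster around 8 hours, that supports the link to the canary schedule; gaps that do not line up would point at some other trigger.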

Graph

Achuith Bhandarkar

Aug 31, 2015, 2:29:38 PM
to Don Garrett, telemetry, Xiaoqian Dai, Albert Bodenhamer, Cheng-yu Lee
524814 tracks the bug

Don Garrett

Aug 31, 2015, 2:53:23 PM
to Achuith Bhandarkar, Richard Barnette, telemetry, Xiaoqian Dai, Albert Bodenhamer, Cheng-yu Lee

Don Garrett

Aug 31, 2015, 6:05:30 PM
to Achuith Bhandarkar, Richard Barnette, telemetry, Xiaoqian Dai, Albert Bodenhamer, Cheng-yu Lee
Richard has a strong suspicion that we are seeing spikes in the traffic between telemetry on DUTs and devservers, but he doesn't know anything about the nature of that traffic.

Can anyone explain what that traffic is, and how it's used? It doesn't make sense to me that telemetry would be talking to a devserver; I would expect it to talk only to a drone, but I don't know much about how this stuff works.

Richard Barnette

Aug 31, 2015, 6:11:53 PM
to Don Garrett, Achuith Bhandarkar, telemetry, Xiaoqian Dai, Albert Bodenhamer, Cheng-yu Lee, Dan Shi
On 8/31/15 11:53 AM, Don Garrett wrote:
> +Richard Barnette <mailto:jrbar...@google.com>
>
TL;DR: The devservers in the test lab are used during
telemetry testing. There's traffic to and from those
servers that we don't measure, and that I don't fully
understand. I believe that _that_ traffic is a possible
source of the problem, and we need to figure out how to
rule it in or out.


Point #1:
The problem started with the test run for this canary build:

https://uberchromegw.corp.google.com/i/chromeos/builders/Canary%20master/builds/1220
The previous canary build was fine. After the first occurrence,
the problem repeated with every canary build that followed, until
we pinned Chrome back to the version prior to the problem.

Conclusion #1:
The problem was caused by changes in the Chrome source base.
This could include telemetry changes.


Point #2:
The only metric we've been able to observe that clearly shows the
problem is the one showing bandwidth in and out of the
Destiny test lab. The graphs show that we're saturating outgoing
bandwidth from Destiny during every canary run.

We have metrics that measure the total data from the DUTs through
various known channels. Most notably, we know the total size of
test results over time, and that value is largely unchanged before
and after the incident. The "total test results" number includes
Chrome crashes, Chrome logs, and test logs.

Conclusion #2:
Whatever is driving the extra traffic, it's apparently not something
tracked by our existing metrics. In particular, it's probably not
Chrome crashes, Chrome logs on the DUT, or test output.
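
The accounting argument above can be sketched as a small calculation: whatever growth in total link traffic is not explained by growth in the tracked channels must come from something untracked. This is only an illustration; the function names and the dictionary layout are hypothetical, and no real lab numbers are included.

```python
def unaccounted(total_bytes, tracked):
    """Traffic seen on the lab link that no tracked channel explains.

    `tracked` maps a channel name (for example "test_results") to its byte count.
    """
    return total_bytes - sum(tracked.values())


def growth_by_source(before_total, before_tracked, after_total, after_tracked):
    """Split the change in total traffic into per-channel growth plus an untracked remainder."""
    growth = {
        name: after_tracked[name] - before_tracked.get(name, 0)
        for name in after_tracked
    }
    growth["untracked"] = (
        unaccounted(after_total, after_tracked)
        - unaccounted(before_total, before_tracked)
    )
    return growth
```

If the tracked channels are nearly flat while the total jumps, almost all of the growth lands in the "untracked" entry, which is the situation described in Point #2.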


Point #3:
As noted, there's telemetry-related traffic between the devservers
and the DUTs. I don't fully understand that traffic, but I know
it's not tracked. This data clearly falls into the "not covered
by existing metrics" category, and because it's telemetry, it can
change every time telemetry changes.

Conclusion #3:
The telemetry code running on the devserver is necessarily a suspect,
and needs to be ruled in or out of the search.


> On Mon, Aug 31, 2015 at 11:29 AM Achuith Bhandarkar
> <ach...@chromium.org <mailto:ach...@chromium.org>> wrote:
>
> 524814
> <https://code.google.com/p/chromium/issues/detail?id=524814> tracks
> <http://salus.prodmon.global.ls.google.com:3350/nebgua.html#borgmon=0.network.borgmon.netops.ih.borg.google.com&graph_type=graph&title=us-mtv-2081-labsw1-2-1_mtv%3AGigabitEthernet0_29%20%20%20%20%20Interface%20traffic%2C%20Bit%2Fsec&grid=xtics%20ytics&key=top%20left&ar=5m&yformat=%25.1s%25c&yrange=%5B0%3A%5D&hist_staleness=15m&xformat=%25H%3A%25M&duration=7d&with_0=lines&format_0=in&with_1=lines&format_1=out&expr=%7Bvar%3D%22irate_bps_adjusted%22%2Cjob%3D%22interfaceStats%22%2Cinstance%3D%22us-mtv-2081-labsw1-2-1_mtv%22%2Cinterface%3D%22GigabitEthernet0_29%22%2Cshard%3D%22us-mtv-2081%22%7D%3B%7Bvar%3D%22orate_bps_adjusted%22%2Cjob%3D%22interfaceStats%22%2Cinstance%3D%22us-mtv-2081-labsw1-2-1_mtv%22%2Cinterface%3D%22GigabitEthernet0_29%22%2Cshard%3D%22us-mtv-2081%22%7D>
>
>

--
"For your convenience, an elevator is located in CHINA"
seen in Dillard's department store, Omaha, NE