reducing excessive gtest output

Paweł Hajdan, Jr.

Oct 14, 2016, 12:15:37 PM
to chromium-dev, infr...@chromium.org
In https://codereview.chromium.org/2406243004 I introduced a 2.5 MB per-test output limit. This is part of the follow-up to an outage (https://bugs.chromium.org/p/chromium/issues/detail?id=634418).

The way it works is we truncate the output and fail the test.
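For illustration, the behavior is roughly the following (a hypothetical Python sketch of the idea, not the actual C++ launcher code; the names and the failure message are made up):

MAX_OUTPUT_BYTES = 2500 * 1000  # the ~2.5 MB per-test limit from the CL

def apply_output_limit(test_name, output, results):
    # Hypothetical sketch: truncate oversized per-test output and fail the test.
    if len(output) <= MAX_OUTPUT_BYTES:
        return output
    # Keep the tail of the output so the most recent (and usually most
    # relevant) lines survive for debugging.
    truncated = output[-MAX_OUTPUT_BYTES:]
    results[test_name] = 'FAILED: output exceeded %d bytes' % MAX_OUTPUT_BYTES
    return truncated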

This limit would have prevented the above-mentioned outage, but it would not help if every test started printing huge output.

I agree with Dirk's assessment from that CL that no test should need more than a few KB of output during normal operation.

I'm considering doing some work to reduce this limit. I'd look into which tests produce huge output, and either reach out to their authors/maintainers, or send some CLs myself.

Would I have your support for this? It's not really possible to reduce the limit while even one test exceeds it. The good thing is that from that point on, failing such tests acts as a ratchet, preventing noisy ones from being checked in.

Oh, any other feedback about how this output limit works would also be welcome. If the discussion about that gets complex enough, I could write a short doc.

Paweł

Ojan Vafai

Oct 14, 2016, 1:22:35 PM
to phajd...@chromium.org, chromium-dev, infr...@chromium.org
Can you run a script over the last couple of days' worth of log files to generate a histogram of the per-test output sizes? From there it should be pretty easy to pick a reasonable limit.
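Something along these lines would probably do (a rough sketch; it assumes one saved output file per test, which may not match how the logs are actually stored):

import collections
import math
import os
import sys

def size_histogram(log_dir):
    # Bucket per-test output sizes by powers of two so the long tail is visible.
    buckets = collections.Counter()
    for root, _, files in os.walk(log_dir):
        for name in files:
            size = os.path.getsize(os.path.join(root, name))
            bucket = 0 if size == 0 else 1 << int(math.log(size, 2))
            buckets[bucket] += 1
    for bucket in sorted(buckets):
        print('%10d bytes: %d tests' % (bucket, buckets[bucket]))

if __name__ == '__main__':
    size_histogram(sys.argv[1])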

In general, I think having some reasonable upper bound makes sense.

Nico Weber

Oct 14, 2016, 2:35:55 PM
to Paweł Hajdan Jr., chromium-dev, infr...@chromium.org
I think the general goal should be that tests have no output unless they fail, in which case they should only have output related to the failure ("foo failed, <stack trace>"). Tests with lots of output are difficult to work with for devs as well as for bots.

Even in an ideal world this won't be fully achievable because of temporary debug logging, but I agree with dpranke that 10 MB seems pretty high.

Scott Hess

Oct 14, 2016, 4:15:49 PM
to Paweł Hajdan, Jr., chromium-dev, infr...@chromium.org
Ratcheting the limit down is an excellent idea. It is very frustrating to try to debug something only to find tens of thousands of lines of unrelated output, which isn't easy to bisect because it changes content and ordering on every run. Perhaps this relates to the VLOG problem, where we could have a bankruptcy policy on the order of turning down anything older than 180 days. You can turn them back on, you just have to assert ownership by landing a CL to do so.
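As a sketch of what enforcing such a policy could look like (hypothetical; the 180-day cutoff and the VLOG matching are just placeholders), one could lean on git blame:

import datetime
import subprocess
import sys

CUTOFF_DAYS = 180

def stale_vlog_lines(path):
    # "git blame --line-porcelain" emits an "author-time <epoch>" field for
    # each line, followed by the source line itself prefixed with a tab.
    blame = subprocess.check_output(
        ['git', 'blame', '--line-porcelain', path]).decode('utf-8')
    cutoff = datetime.datetime.now() - datetime.timedelta(days=CUTOFF_DAYS)
    timestamp = None
    for line in blame.splitlines():
        if line.startswith('author-time '):
            timestamp = datetime.datetime.fromtimestamp(int(line.split()[1]))
        elif line.startswith('\t') and 'VLOG(' in line:
            if timestamp and timestamp < cutoff:
                print('%s: %s' % (path, line.strip()))

if __name__ == '__main__':
    for path in sys.argv[1:]:
        stale_vlog_lines(path)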

WRT the CL, I would suggest distinguishing the threshold for failure from the truncation amount, so that the failure is easy to debug without needing multiple passes.

Also, it may be useful to post-process logs to distinguish common lines from uncommon lines. Having 10 MB of output from a piece of setup infrastructure distributed across 1000 tests in 10 KB chunks isn't great.
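For example, a rough cross-test counter (again assuming one saved output file per test) could be as simple as:

import collections
import os
import sys

def lines_common_across_tests(log_dir, min_tests=100):
    # Count in how many tests each distinct line appears, so lines emitted
    # by shared setup code stand out from test-specific output.
    seen_in = collections.Counter()
    for root, _, files in os.walk(log_dir):
        for name in files:
            with open(os.path.join(root, name)) as f:
                for line in set(f.read().splitlines()):
                    seen_in[line] += 1
    for line, count in seen_in.most_common():
        if count < min_tests:
            break
        print('%6d tests: %s' % (count, line))

if __name__ == '__main__':
    lines_common_across_tests(sys.argv[1])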

-scott

Paweł Hajdan, Jr.

Oct 18, 2016, 1:12:32 PM
to Scott Hess, chromium-dev, infr...@chromium.org
I've uploaded https://codereview.chromium.org/2428923002 to get the histograms. See the tryjob results for details, and the end of this email for an example. Based on that, a possible target limit could be 10 KB.

Currently, for debugging, we preserve the tail of the output snippet up to the limit, and to get the full output one can run outside of the launcher (--single-process-tests/--single_process). Does that sound good?

For post-processing, I was considering that. I came to the conclusion that some examples would be useful to see how effective such heuristics are. I'm not sure the lines would be identical - timestamps and PIDs would probably vary. In some cases the post-processing might remove something important from the logs - it seems prudent to keep the original version, but that could then defeat the point of deduping.
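For example, one way to make lines comparable across runs before any counting would be to strip the run-specific pieces first (a rough sketch; the exact log prefix format here is an assumption):

import re

def normalize(line):
    # Replace the logging prefix (PID, TID, timestamp) and pointer values,
    # which vary on every run, with fixed placeholders.
    line = re.sub(r'\[\d+:\d+:\d+/\d+\.\d+:', '[PID:TID:TIMESTAMP:', line)
    line = re.sub(r'\b0x[0-9a-fA-F]+\b', '0xADDR', line)
    return line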


Collections of histograms for TestLauncher.TestOutputSize
Histogram: TestLauncher.TestOutputSize recorded 1086 samples, mean = 3710.5
0        ... 
76       O                                                                         (2 = 0.2%) {0.0%}
105      ---O                                                                      (17 = 1.6%) {0.2%}
146      ---O                                                                      (19 = 1.7%) {1.7%}
203      O                                                                         (2 = 0.2%) {3.5%}
282      O                                                                         (3 = 0.3%) {3.7%}
391      O                                                                         (2 = 0.2%) {4.0%}
542      --O                                                                       (12 = 1.1%) {4.1%}
752      ------------------------------------------------------------------------O (454 = 41.8%) {5.2%}
1043     --------------------------------------------------O                       (313 = 28.8%) {47.1%}
1447     ---------------O                                                          (94 = 8.7%) {75.9%}
2008     ---------O                                                                (57 = 5.2%) {84.5%}
2786     ------O                                                                   (36 = 3.3%) {89.8%}
3865     ----O                                                                     (23 = 2.1%) {93.1%}
5362     ---O                                                                      (18 = 1.7%) {95.2%}
7439     ---O                                                                      (22 = 2.0%) {96.9%}
10321    -O                                                                        (7 = 0.6%) {98.9%}
14319    O                                                                         (0 = 0.0%) {99.5%}
19866    O                                                                         (3 = 0.3%) {99.5%}
27562    O                                                                         (1 = 0.1%) {99.8%}
38240    ... 
1945260  O                                                                         (1 = 0.1%) {99.9%}
2698871  ... 
Paweł

Scott Hess

Oct 18, 2016, 1:39:40 PM
to Paweł Hajdan, Jr., chromium-dev, infr...@chromium.org
On Tue, Oct 18, 2016 at 10:12 AM, Paweł Hajdan, Jr.
<phajd...@chromium.org> wrote:
> I've uploaded https://codereview.chromium.org/2428923002 to get the
> histograms. See tryjob results for details, and end of this email for an
> example. A possible suggestion for target limit based on that could be 10
> KB.

Serious eye-rolling at the ~2M case.

> Currently for debugging we preserve the tail of output snippet up to the
> limit, and to get full output one can run outside of the launcher
> (--single-process-tests/--single_process). Does that sound good?

It's fine for cases where the log spew is cross-platform, but sometimes/often it can be a substantial investment to build something on a specific platform if it's not one of the core platforms you routinely build on. If I make a change and it tips a test from just under the threshold to just over it, I'm more likely to do the right thing if it's quick and easy to do immediately, without added barriers.

My understanding is that storing many megabytes of log spew is a scalability issue for the steady state, where we're storing it for the successful runs. There should be orders of magnitude fewer failures due to excessive logging, so keeping more output in that case shouldn't be a huge infrastructure issue.

Also, this is more of an issue when we tighten the screws. If we keep 100 KB of data, in all likelihood there's more than enough there to identify the major culprits. If we keep 10 KB, it's less likely. If we keep 2 KB, it's very possible that the majority offender doesn't even show up, or isn't obvious. You can work around that by commenting out the causes of that 2 KB and repeating until you get traction, but doing it in one pass would be much more pleasant.

> For post-processing, I was considering that. I came to conclusion that some
> examples would be useful to see how effective such heuristics are. I'm not
> sure if the lines would be identical - timestamps and PIDs would probably
> vary. In some cases the postprocessing might remove something important from
> the logs - it seems prudent to keep original version, but then it could
> defeat the point of deduping.

Just to be clear - I don't mean de-duping, which just encourages people to leave things alone. I'm thinking more in terms of cross-test analysis. If someone is logging 15 lines in EVERY test, then we should probably look at whether we can remove it entirely or condense it into one line. We definitely shouldn't elide it in any way.

Maybe another angle on this is to have thresholds for per-test log output, plus additional thresholds per test target. We might need to loosen the per-target threshold as new tests are added, but that loosening could provide the signal to analyze for common log lines again.
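Concretely, something like this (the numbers and names are only placeholders for the idea):

PER_TEST_LIMIT = 10 * 1024           # e.g. 10 KB for any single test
PER_TARGET_LIMIT = 5 * 1024 * 1024   # e.g. 5 MB for the whole test target

def output_budget_violations(per_test_sizes):
    # per_test_sizes: dict mapping test name -> output size in bytes.
    violations = [name for name, size in per_test_sizes.items()
                  if size > PER_TEST_LIMIT]
    if sum(per_test_sizes.values()) > PER_TARGET_LIMIT:
        violations.append('<whole target over budget>')
    return violations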

-scott