On Tuesday, January 16, 2018 at 6:41:00 PM UTC+5, Nico Weber wrote:
> As far as I know our PGO bots were internal-only. (Seb, is that true? Or do we have public profiles somewhere?)
> We didn't see any difference in memory use in our telemetry data. I'm pretty surprised you're seeing this -- since the two compilers are ABI-compatible (i.e. same struct layout etc), can you think of reasons why memory use should be different? I would expect memory use to be identical (and that's what we saw in our data). What are your population sizes? Have you tried comparing two identical PGO builds in an A/B test to make sure that those do report the same memory use in your telemetry?
Well, ABI compatibility does not at all mean that the compilers generate the same code :) So at the very least, code size and layout differences may influence memory consumption and, of course, various timings. I see noticeably larger binaries from clang (our version of chrome.dll is 41M in the msvs build versus 47M in clang, and chrome_child.dll is 55M vs 64M, i.e. roughly 15% larger); the only difference between the builds is flipping the condition in BUILDCONFIG.gn that enables/disables the clang compiler (1).
The picture Vadim posted came from a comparison done across 10 machines with identical software and hardware configurations, each reporting 33 samples of summary data for the system_health.memory_desktop test. The picture combines all of that data into one graph, 330 samples in total for each configuration.
My confidence in this comparison is very high because of:
a) the final p-value obtained from statistical tests on this data: we use the Mann-Whitney U-test to compare individual sample pairs generated on a single machine (we never directly compare results obtained from different machines), and then combine the per-machine p-values with Stouffer's Z-score method (see the sketch after this list). The p-value for the data shown in this picture is less than 5e-5; unfortunately, I made a screenshot which does not include this value in the rightmost column :) Furthermore, we drop results that yield different comparison decisions on different machines (imagine a 5% decrease in median on two of ten machines and a 7% increase on all the others; all of that data would be filtered out as inconsistent);
b) the great stability of system_health.memory_desktop results across different machines (from the same pool and from different pools) and across stories;
c) the results on individual stories, all of which show roughly the same difference in memory consumption;
d) the results from other perftests checking memory consumption (for example, we still use page_cycler, which measured memory in Chrome a couple of years ago before system_health entered the game); the differences in those perftests' metrics are quite similar.
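To make (a) concrete, here is a minimal Python sketch of that procedure (this is not the actual Chromium tooling; the function and data-layout names are mine, only the statistical steps come from the description above):

    from scipy.stats import mannwhitneyu, combine_pvalues

    def compare_builds(per_machine_samples):
        # per_machine_samples: a list of (msvs_samples, clang_samples) pairs,
        # one pair per machine; samples from different machines are never
        # compared against each other directly.
        pvalues = []
        for msvs, clang in per_machine_samples:
            # Two-sided Mann-Whitney U-test on the pair from one machine.
            _, p = mannwhitneyu(msvs, clang, alternative='two-sided')
            pvalues.append(p)
        # Combine the per-machine p-values with Stouffer's Z-score method.
        _, combined_p = combine_pvalues(pvalues, method='stouffer')
        return combined_p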
Regarding A/B testing and false positives: yes, we did a lot of work to lower the false-positive rate, and to confirm it I triggered such a comparison using two identical MSVS PGO builds we had used earlier. The results showed 3 random metrics with p-values sitting around the 0.05 significance level and completely noisy graphs (nothing even close to what we see in the msvs-clang comparison), marking them as clear false positives.
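As an illustration of that A/A sanity check (synthetic numbers, not our real data), running two statistically identical "builds" through the same pipeline should produce an unremarkable combined p-value:

    import numpy as np

    rng = np.random.default_rng(0)
    # 10 machines, 33 samples each, both sides drawn from the same
    # distribution, mirroring the setup described above.
    aa_data = [(rng.normal(100, 5, 33), rng.normal(100, 5, 33))
               for _ in range(10)]
    print(compare_builds(aa_data))  # expect a p-value far from significance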
(1)
https://chromium.googlesource.com/chromium/src.git/+/master/build/config/BUILDCONFIG.gn#141