CFI: new (limited) launch is coming soon

Ivan Krasin

Jul 12, 2016, 4:51:48 PM
to Project TRIM, Annie Sullivan, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman
Hi there,

We plan to launch a subset of CFI (just virtual calls, without cast checks) tomorrow. Preliminarily, the perf impact seems reasonable and is worth the security benefits that come with the launch. Full numbers are attached (we are launching the "ltocficall" subset; "ltocfi" is the full CFI, which is slower).

The bottom line is that with CFI for virtual calls only, the size overhead is about 5%, the median perf overhead for blink_perf.layout is 2.81%, and the perf overhead for smoothness.top_25_smooth is in the noise. Looking at the other suites Peter ran, most of the medians look good, but there are some outliers (most likely due to the tests running on a desktop).

The plan is to launch this subset, then implement a few improvements in LLVM, which should bring the overhead down somewhat, and then launch the remaining part.

In particular, the following experimental work is in progress (but may not be available within the next 2-3 months):

1. CFI with vtable splitting. This will somewhat reduce the impact on binary size (5% -> 3.5%) and slightly speed up layout benchmarks.

2. Relative ABI for vtables (https://crbug.com/589384). This should bring additional binary size savings.

3. Using LLD instead of Gold + LLVM Gold plugin. No runtime impact, but may somewhat speed up the linking process.

4. (ongoing) More automatic devirtualization

5. (long term) Adopting ThinLTO, which should make linking very fast. This one is highly speculative, as ThinLTO is not ready yet, but people are actively working on it.

The objective for the launch attempt is either to turn on vcall checks and be happy, or to find specific micro-benchmarks that are affected more than anticipated. In the latter case, we'll revert the CL and work on reducing the overhead in those specific cases.

This is mostly a friendly FYI, but if you have any upfront objections, please let us know.

krasin
bm-devirt6.zip

Tom Hudson

Jul 12, 2016, 5:05:19 PM
to Ivan Krasin, Project TRIM, Annie Sullivan, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman
This is also limited to Linux x86_64?

--
You received this message because you are subscribed to the Google Groups "Project TRIM" group.
To unsubscribe from this group and stop receiving emails from it, send an email to project-trim...@chromium.org.
To post to this group, send email to projec...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/project-trim/CAOei5RdFVFmL-LphrY%2BpdeVqF7q%2B5cMNux9EwbWo5183YN-Yzw%40mail.gmail.com.

Ivan Krasin

Jul 12, 2016, 5:06:20 PM
to Tom Hudson, Project TRIM, Annie Sullivan, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman
Correct. No Android or Chrome OS either.

Elliott Sprehn

Jul 12, 2016, 6:34:51 PM
to Ivan Krasin, Project TRIM, Annie Sullivan, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman, Nat Duca
We don't have very many Linux users, and since this isn't on Chrome OS or Android, the security benefit to the user population is pretty small, right?

What's the performance impact on html-full-render on a low-end Chromebook? Putting a branch at every virtual call still seems like needless perf overhead, especially compared to our competitors running on lower-end machines. It also severely hurts our ability to refactor the code by adding virtual calls, since now they're more expensive, but the cost doesn't show up locally.

Why do you think 3% is an acceptable regression in the layout benchmarks? :) Can you also run the Animometer benchmarks? We're already 20% slower than Safari at a bunch of things; I'm not very comfortable becoming even slower.


Ivan Krasin

Jul 12, 2016, 6:49:22 PM
to Elliott Sprehn, Project TRIM, Annie Sullivan, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman, Nat Duca
Hi Elliott,

On Tue, Jul 12, 2016 at 3:34 PM, Elliott Sprehn <esp...@chromium.org> wrote:
We don't have very many Linux users, and since this isn't on Chrome OS or Android the security benefit to the user population is pretty small right?
Yes, this is why we use Linux x86-64 as our test bed. While the immediate impact will be limited to this small (still millions) Linux population, it's the only way to expand to Android, Chrome OS, Mac OS, and Windows.
 

What's the performance impact on html-full-render on a low end Chromebook?
How do we measure that? Does the Perf team have any trybots for that?
 
Putting a branch at every virtual call still seems like needless perf overhead, especially compared to our competitors running on lower end machines. It also severely hurts our ability to refactor the code by adding virtual calls since now they're more expensive, but not locally.
It's possible to build Chrome with CFI locally. It's just slower to link. Eventually, ThinLTO should address this.


Why do you think 3% is an acceptable regression in the layout benchmarks? :)
It's not all benchmarks, and we knew there would be some performance impact when we started discussing implementing CFI for Chrome with the team. It was blessed in general terms. As for the specific thresholds: last time we regressed by 20% on some benchmarks, and Peter has since improved things considerably, including speeding up some of the layout benchmarks by up to 7% with aggressive automatic devirtualization at the compiler level: https://crbug.com/580389 and https://crbug.com/617283

So, we're making it somewhat slower, but only after making it faster. That looks like the right sequence of events.
 
Can you also run the Animometer benchmarks? We're already 20% slower than Safari at a bunch of things, I'm not very comfortable becoming even slower.
Yes, we can. How do we do that? Any specific doc pointers? 
 

Ivan Krasin

Jul 12, 2016, 7:12:09 PM
to Elliott Sprehn, Project TRIM, Annie Sullivan, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman, Nat Duca
To add a bit more clarity to my statement about "last time we regressed by 20% on some benchmarks": in December 2015 we made an attempt to launch full CFI, and we had to revert it because some of the layout benchmarks regressed by 20%. Since then, Peter has made the improvements mentioned above (up to 7%), which are already deployed in the official Chrome, and the new attempt is only going to slow down some of the benchmarks by ~3% (at least by our best measurements using the telemetry scripts).

Peter Collingbourne

Jul 12, 2016, 7:20:51 PM
to Ivan Krasin, Elliott Sprehn, Project TRIM, Annie Sullivan, Kostya Serebryany, Nico Weber, Elliott Friedman, Nat Duca
On Tue, Jul 12, 2016 at 3:49 PM, Ivan Krasin <kra...@google.com> wrote:
Hi Elliott,

On Tue, Jul 12, 2016 at 3:34 PM, Elliott Sprehn <esp...@chromium.org> wrote:
We don't have very many Linux users, and since this isn't on Chrome OS or Android the security benefit to the user population is pretty small right?
Yes, this is why we use Linux x86-64 as our test bed. While the immediate impact will be just this small (still, millions) Linux population, it's the only way to expand to Android, Chrome OS, Mac OS and Windows.
 

What's the performance impact on html-full-render on a low end Chromebook?
How do we measure that? Does Perf team has any trybots for that? 
 
Putting a branch at every virtual call still seems like needless perf overhead, especially compared to our competitors running on lower end machines.

Bear in mind that this isn't at every virtual call site; we can avoid a branch in cases where devirtualization or virtual constant propagation are applied.

Regarding our competitors, note that Microsoft is already shipping Edge with their CFI implementation (Control Flow Guard) enabled [0], which is not only less precise but higher overhead, according to one study [1] which was carried out before the bulk of our performance improvements (note however that as the authors mention, Control Flow Guard and the CFI we're planning to roll out here aren't directly comparable because Control Flow Guard also protects calls via function pointers).

Brett Wilson

Jul 12, 2016, 9:00:27 PM
to Ivan Krasin, Project TRIM, Annie Sullivan, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman
Is there a doc describing this launch and the impacts (positive and negative)? I feel like this is the kind of thing that probably already has a design doc, and it should collect the perf stats of the thing we want to launch. This also seems like the kind of thing that should be posted using the new design doc posting process we sent out 2 weeks ago.

It seems like at least some performance-focused people are unhappy. Given that this trades off some amount of speed for some amount of security, I think our decision to trade off two of our top-line project goals should be very explicit. I'm guessing Launch Review will be a bad place for this, since the performance and security subtleties here aren't digestible in 5 minutes in a big meeting. Maybe I'm wrong, or there may be more background I'm not aware of.

It may be worthwhile to set up a separate decision making meeting for this if there isn't obvious consensus on the mailing list about the stats. If there's a plan to get more stats from the wild, I would think it's appropriate to just launch to Linux dev channel to get real world stats, and then look at that to help make the final decision.

I'm not arguing against this launch, I just want everybody to be happy that we explicitly decided to trade off X% of speed for Y% of security and that this is in line with our broader project goals.

Brett



On Tue, Jul 12, 2016 at 1:51 PM, 'Ivan Krasin' via Project TRIM <projec...@chromium.org> wrote:


Ivan Krasin

Jul 12, 2016, 9:26:59 PM
to Brett Wilson, Project TRIM, Annie Sullivan, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman
Hi Brett,

On Tue, Jul 12, 2016 at 6:00 PM, Brett Wilson <bre...@chromium.org> wrote:
Is there a doc describing this launch and the impacts (positive and negative)? I feel like this is the kind of thing that probably already has a design doc, and it should collect the perf stats of the thing we want to launch. This also seems like the kind of thing that should be posted using the new design doc posting process we sent out 2 weeks ago.
CFI is a long-running effort. We went through the old process ('intent to implement') back in October 2015 ([1]).
There's no explicit design doc required on the Chromium side, as in the end it's just enabling a compiler flag, and the LLVM-side design doc is linked in that thread ([1]).

The only update since then is the somewhat changed estimates for performance / size impact, which I provided in the first message of this thread.
 

It seems like at least some performance-focused people are unhappy. Given that this trades off some amount of speed for some amount of security, I think our decision to tradeoff two of our top-line project goals should be very explicit. I'm guessing Launch Review will be a bad place for this since the performance and security subtleties here aren't digestable in 5 minutes in a big meeting. Maybe I'm wrong or there may be more background I'm not aware of.
Back in October 2015, the consensus was that it was good to go. Based on those estimates, it was approved for launch in December 2015, and only after we found micro-benchmarks that regressed by 20% was it decided that the negative impact was too high. So we reverted and did a lot of work to reduce the dynamic number of virtual calls in Chrome ([2], [3]); that part has already launched.

So, I had kind of assumed we don't need yet another launch review. Nothing has changed in the stated goals for CFI.


It may be worthwhile to set up a separate decision making meeting for this if there isn't obvious consensus on the mailing list about the stats. If there's a plan to get more stats from the wild, I would think it's appropriate to just launch to Linux dev channel to get real world stats, and then look at that to help make the final decision.

I'm not arguing against this launch, I just want everybody to be happy that we explicitly decided to trade off X% of speed for Y% of security and that this is in line with our broader project goals.
The performance impact here is not a scalar; it's a vector. This launch is not going to slow down Chrome uniformly by some X%. For most metrics there won't be any visible impact. Some metrics will degrade by ~3%, but we don't currently know how many of them, or whether there are other metrics that will regress more than that (in which case we'll definitely revert). As for the security, it's not measured in percentages at all. CFI closes one of the popular steps in creating an exploit, as widely discussed in the papers Peter mentioned.

My proposal: tomorrow I will submit the CFI launch CL ([4]), we'll wait for the Perf dashboard to report regressions (we know there will be some, but the exact numbers are hard to guess; the best estimate is the doc attached to the first message of this thread), and then continue this discussion in a more constructive way, knowing what has become slower.

Does it sound reasonable to you, Brett?

krasin


Kentaro Hara

Jul 12, 2016, 9:44:56 PM
to Ivan Krasin, Brett Wilson, Project TRIM, Annie Sullivan, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman
Regarding the Blink part, platform-architecture-dev@ would be a good place to discuss this kind of stuff.

Before making a decision, I'd like to have more comprehensive performance results. Would you at least run the following benchmarks more than 20 times locally and compare the performance?

- blink_perf.* (not limited to blink_perf.layout)
- dromaeo.dom*
- speedometer
- Animometer





Ivan Krasin

Jul 12, 2016, 11:04:51 PM
to Kentaro Hara, Brett Wilson, Project TRIM, Annie Sullivan, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman
On Tue, Jul 12, 2016 at 6:44 PM, Kentaro Hara <har...@chromium.org> wrote:
Regarding the Blink part, platform-architecture-dev@ would be a good place to discuss this kind of stuff.

Before making some decision, I'd like to have a more comprehensive performance result. Would you at least run the following benchmarks more than 20 times locally and compare the performance?

- blink_perf.* (not limited to blink_perf.layout)
- dromaeo.dom*
Yes, sure, for both of the above.
 
- speedometer
- Animometer
How do I run speedometer / Animometer locally? I assume they are not regular benchmarks, are they?

Ivan Krasin

Jul 12, 2016, 11:18:38 PM
to Kentaro Hara, Brett Wilson, Project TRIM, Annie Sullivan, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman
To make sure that everyone is on the same page: I am collecting additional evidence, and once I have it, I will present it here.

As for Animometer, mentioned in this thread twice, I would really appreciate a doc to follow. I'd never heard of it, and a quick search didn't return anything really useful.

Kentaro Hara

Jul 12, 2016, 11:38:03 PM
to Ivan Krasin, Brett Wilson, Project TRIM, Annie Sullivan, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman
Speedometer is already in telemetry.




On Wed, Jul 13, 2016 at 12:18 PM, Ivan Krasin <kra...@google.com> wrote:
To make sure that everyone is on the same page: I am collecting additional evidence, and once I have it, I will present it here.

As for Animometer mentioned in this thread twice, I would really appreciate a doc to follow. I never heard of it, and quick search didn't return anything really good.



Ivan Krasin

Jul 12, 2016, 11:46:49 PM
to Kentaro Hara, Brett Wilson, Project TRIM, Annie Sullivan, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman
Thank you, Kentaro!

Do I understand correctly that I should just run it manually by clicking the "Run benchmark" button? Like, once for the current Chrome and once for the cfi-vcall Chrome? No additional Chromium-specific tricks?

Kentaro Hara

Jul 12, 2016, 11:49:04 PM
to Ivan Krasin, Brett Wilson, Project TRIM, Annie Sullivan, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman
Right.


Ivan Krasin

Jul 12, 2016, 11:52:56 PM
to Kentaro Hara, Brett Wilson, Project TRIM, Annie Sullivan, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman
Thx!

Elliott Sprehn

Jul 13, 2016, 12:02:54 AM
to Ivan Krasin, Victor Miura, Kentaro Hara, Brett Wilson, Project TRIM, Annie Sullivan, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman
Loading developer.html will give you more control over the benchmark. There are a lot more benchmarks than just the default set that "Run benchmark" contains; vmiura@ can help you run the rest of the benchmarks.

https://trac.webkit.org/export/HEAD/trunk/PerformanceTests/Animometer/developer.html

Many of them are not in the default set and stress various other parts of the engine.

On Tue, Jul 12, 2016 at 8:52 PM, 'Ivan Krasin' via Project TRIM <projec...@chromium.org> wrote:
Thx!


Elliott Sprehn

Jul 13, 2016, 12:12:27 AM
to Ivan Krasin, Brett Wilson, Project TRIM, Annie Sullivan, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman
On Tue, Jul 12, 2016 at 6:26 PM, 'Ivan Krasin' via Project TRIM <projec...@chromium.org> wrote:
Hi Brett,

On Tue, Jul 12, 2016 at 6:00 PM, Brett Wilson <bre...@chromium.org> wrote:
Is there a doc describing this launch and the impacts (positive and negative)? I feel like this is the kind of thing that probably already has a design doc, and it should collect the perf stats of the thing we want to launch. This also seems like the kind of thing that should be posted using the new design doc posting process we sent out 2 weeks ago.
CFI is a long-running effort. We went through the old process ('intent to implement') back in October 2015 ([1])

That was in Chromium, and probably didn't get discussed by the web platform folks whose performance this directly impacts. Given that the original intent also made an incorrect claim of a 1% performance impact (which was way off from the real 20% regression), I don't think we should count that as approval.
 
There's no explicit design doc on the Chromium side required, as in the end, it's just enabling a compiler flag and the LLVM side design doc is linked in the thread ([1]).

The only updates from that time is somewhat changed estimates for performance / size impact, which I have provided in the first message of the current thread.

It changed from 1% (the incorrect estimate given in all the early commits, bugs, and emails) to the actual 20%. I know you claim it's only 3%, but you also claimed it was only 1% last time. :P So I think we should probably turn it on, let the bots cycle, and see what they say.
 
 

It seems like at least some performance-focused people are unhappy. Given that this trades off some amount of speed for some amount of security, I think our decision to tradeoff two of our top-line project goals should be very explicit. I'm guessing Launch Review will be a bad place for this since the performance and security subtleties here aren't digestable in 5 minutes in a big meeting. Maybe I'm wrong or there may be more background I'm not aware of.
Back to October 2015, the consensus was it's good to go. Based on that estimates, it was approved for a launch in December 2015, and only after finding micro-benchmarks which regressed by 20%, it was decided that the negative impact is too high. So we have reverted and made a lot of work to make Chrome having smaller dynamic number of virtual calls ([2], [3]), and this part is already launched.

So, I kind of assumed that we don't need yet another launch review. Nothing has changed in the stated goals for CFI.

What changed is that you claimed it was 1% and it never actually was. The LTO optimization that fixes the 20% regression also requires a very long compile and link step. I think both of those new pieces of information warrant a reassessment of whether we want to do this at full scale.

Note that Safari does not turn on CFI, and Edge is often much slower than Chrome. If Safari turns on your clang virtual constant propagation change, they'll get 7% faster without ever taking the regression (they're also 20% faster at lots of things already, so this is gravy). So I'm not sure "Edge does it" means we should do it. I totally understand the security improvements it provides, and perhaps we *should* turn it on, but I don't think we should assume it's going to ship because of an approval back in October based on an incorrect assessment that also never got discussed with the web platform leads. :)



It may be worthwhile to set up a separate decision making meeting for this if there isn't obvious consensus on the mailing list about the stats. If there's a plan to get more stats from the wild, I would think it's appropriate to just launch to Linux dev channel to get real world stats, and then look at that to help make the final decision.

I'm not arguing against this launch, I just want everybody to be happy that we explicitly decided to trade off X% of speed for Y% of security and that this is in line with our broader project goals.
The performance impact here is not a scalar, it's a vector. For instance, this launch is not going to slowdown Chrome by X%. For the most metrics there won't be any visible impact. Some of the metrics will degrade by ~3%, but we don't currently know how many of them, or if there're other metrics which will regress more than that (in which case we'll definitely revert). As for the security, it's not measured in percents at all. CFI closes one of the popular steps to create an exploit and it's widely discussed in the papers mentioned by Peter.

Indeed, we need to weigh a full codebase change like this against the wins and discuss it with the various leads.
 

My proposal: tomorrow, I will submit the launch CFI CL ([4]), we'll wait for the Perf dashboard to report regressions (we know there will be some, but the exact numbers are hard to guess; the best estimate is the attached doc in the first message of this thread), and then continue this discussion in a more constructive way, where we know what has become slower.

Does it sound reasonable to you, Brett?

I'm always fine landing crazy things on trunk to let them cycle through the bots. How much slower does this make the cycle time on the perf bots, though? Last time I think the link took like 14 hours? :P


Ivan Krasin

Jul 13, 2016, 12:25:13 AM
to Elliott Sprehn, Brett Wilson, Project TRIM, Annie Sullivan, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman
Hi Elliott,

I believe you have a few facts wrong:

1. We still claim less than a 1% slowdown on average.
2. The devirtualization speedups required turning on link-time optimization, but so does CFI; the speedups added no link-time slowdown on top of that, and the upcoming CFI launch will not affect link time at all.
Please provide the source for the 14-hour link. I am extremely curious to see it.

krasin

Elliott Sprehn

Jul 13, 2016, 12:47:05 AM
to Ivan Krasin, Brett Wilson, Project TRIM, Annie Sullivan, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman
On Tue, Jul 12, 2016 at 9:25 PM, Ivan Krasin <kra...@google.com> wrote:
Hi Elliott,

I believe you have got a few facts wrong:

1. We still claim less than 1% of slow down on average. It's still the claim.

From running what benchmark, on what devices? Have you tested on a low-end Chromebook? Folks keep saying it's 1% without providing details on where those numbers came from and on what devices. Getting numbers out of a Google workstation is possible, but also very hard, since the machine is so crazy fast. For example, I recently fixed a performance bug that was only 30-100ms on my MacPro but 300-800ms on my Reks Chromebook.

See:

Both claim 1% without any details about where that number came from.
 
2. Devirtualization speedups required turning on LinkTimeOptimization, but so is CFI. No additional slowdown of link time for these speedups was done. The upcoming CFI launch will not affect link time at all.

You're saying turning on LTO didn't make the link slower? Also, it apparently now takes 200GB of RAM?

How long does it take to build with LTO or CFI on a local workstation, so I can run the benchmarks myself? Where are the instructions for doing it? :)

Please, provide the source for the 14 hours link. I am extremely curious to see it.


That came from someone working on turning on LTO on the bots; I can't find the source now. Perhaps it was because the link requires 200GB of RAM and the slaves were too small and locking up, though? It's good news that it's only 40 min; what was it before we turned on LTO?

- E 

Brett Wilson

Jul 13, 2016, 12:55:31 AM
to Ivan Krasin, Project TRIM, Annie Sullivan, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman
On Tue, Jul 12, 2016 at 6:26 PM, Ivan Krasin <kra...@google.com> wrote:
Hi Brett,

On Tue, Jul 12, 2016 at 6:00 PM, Brett Wilson <bre...@chromium.org> wrote:
Is there a doc describing this launch and the impacts (positive and negative)? I feel like this is the kind of thing that probably already has a design doc, and it should collect the perf stats of the thing we want to launch. This also seems like the kind of thing that should be posted using the new design doc posting process we sent out 2 weeks ago.
CFI is a long-running effort. We went through the old process ('intent to implement') back in October 2015 ([1])
There's no explicit design doc on the Chromium side required, as in the end, it's just enabling a compiler flag and the LLVM side design doc is linked in the thread ([1]).

The only updates from that time is somewhat changed estimates for performance / size impact, which I have provided in the first message of the current thread.

I think we need to collect all of the numbers we have and make a conscious decision about launching. If everybody is generally happy with the numbers, launch review is the right place to make that decision. If there is disagreement about benchmarks and such, we may want to take a deeper dime.

A launch review approval wouldn't ever carry over. Launch review is how people know about and approve of large changes in the product. And if something significant is changing in 53 (or whatever) people need to know that it's changing regardless of what happened in 47 (or whatever).

Brett

Brett Wilson

Jul 13, 2016, 12:56:56 AM
to Ivan Krasin, Project TRIM, Annie Sullivan, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman
On Tue, Jul 12, 2016 at 9:55 PM, Brett Wilson <bre...@chromium.org> wrote:
On Tue, Jul 12, 2016 at 6:26 PM, Ivan Krasin <kra...@google.com> wrote:
Hi Brett,

On Tue, Jul 12, 2016 at 6:00 PM, Brett Wilson <bre...@chromium.org> wrote:
Is there a doc describing this launch and the impacts (positive and negative)? I feel like this is the kind of thing that probably already has a design doc, and it should collect the perf stats of the thing we want to launch. This also seems like the kind of thing that should be posted using the new design doc posting process we sent out 2 weeks ago.
CFI is a long-running effort. We went through the old process ('intent to implement') back in October 2015 ([1])
There's no explicit design doc on the Chromium side required, as in the end, it's just enabling a compiler flag and the LLVM side design doc is linked in the thread ([1]).

The only updates from that time is somewhat changed estimates for performance / size impact, which I have provided in the first message of the current thread.

I think we need to collect all of the numbers we have and make a conscious decision about launching. If everybody is generally happy with the numbers, launch review is the right place to make that decision. If there is disagreement about benchmarks and such, we may want to take a deeper dime.

deeper dime -> deeper dive (as in a more dedicated meeting).

See go/newChromeFeature

You can ship to dev channel without launch review approval, especially if you're gathering metrics for a final decision. But there should be approval before the code goes out to beta.

Brett

Kentaro Hara

Jul 13, 2016, 12:59:42 AM
to Brett Wilson, Ivan Krasin, Project TRIM, Annie Sullivan, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman
Agreed that the first step would be to collect more numbers (with the benchmarks I mentioned above) and get a more comprehensive understanding of the performance impact.





Ivan Krasin

Jul 13, 2016, 1:05:25 AM
to Kentaro Hara, Brett Wilson, Project TRIM, Annie Sullivan, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman
I agree on that too (collecting more data). Thank you for all your suggestions about where to look, especially Animometer, which is genuinely new to me.

Daniel Bratell

Jul 13, 2016, 11:37:17 AM
to projec...@chromium.org
On Wed, 13 Jul 2016 06:02:12 +0200, Elliott Sprehn <esp...@chromium.org> wrote:

Loading developer.html will give you more control over the benchmark. There's a lot more benchmarks than just the default set that the "Run benchmark" contains, vmiura@ can help you run the rests of the benchmarks.

https://trac.webkit.org/export/HEAD/trunk/PerformanceTests/Animometer/developer.html

There's a lot of them not in the default set which stress other various parts of the engine.

A bit off topic, but are there any requirements to run those tests? I tried just out of curiosity, and I mostly got a blank screen and sandbox errors in the console. MSIE had some output but then nothing. Firefox and Chromium (including Opera) had sandbox errors.

/Daniel

--
/* Opera Software, Linköping, Sweden: CEST (UTC+2) */

Annie Sullivan

Jul 13, 2016, 1:07:03 PM
to Ivan Krasin, Matt Sheets, Elliott Sprehn, Project TRIM, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman, Nat Duca
On Tue, Jul 12, 2016 at 6:49 PM, 'Ivan Krasin' via Project TRIM <projec...@chromium.org> wrote:
Hi Elliott,

On Tue, Jul 12, 2016 at 3:34 PM, Elliott Sprehn <esp...@chromium.org> wrote:
We don't have very many Linux users, and since this isn't on Chrome OS or Android the security benefit to the user population is pretty small right?
Yes, and this is why we use Linux x86-64 as our test bed. While the immediate impact will be limited to this small (still, millions of users) Linux population, it's the necessary first step toward expanding to Android, Chrome OS, Mac OS and Windows.
 

What's the performance impact on html-full-render on a low end Chromebook?
How do we measure that? Does the Perf team have any trybots for that?

ChromeOS has a separate performance lab, and I don't think they have trybots, but +msheets could probably help you figure out how to measure.

Putting a branch at every virtual call still seems like needless perf overhead, especially compared to our competitors running on lower-end machines. It also severely hurts our ability to refactor the code by adding virtual calls, since they're now more expensive, but the cost isn't visible locally.
It's possible to build a CFI-enabled build locally. It's just slower to link. Eventually, ThinLTO should address this.


Why do you think 3% is an acceptable regression in the layout benchmarks? :)
It's not all benchmarks, and we knew there would be some performance impact when we started discussing implementing CFI for Chrome with the team; it was blessed in general terms. As for specific thresholds: last time we regressed by 20% on some benchmarks, and Peter has since improved things considerably, including speeding up some of the layout benchmarks by up to 7% with aggressive automatic devirtualization at the compiler level: https://crbug.com/580389 and https://crbug.com/617283

So we're making it somewhat slower, but only after making it faster. Looks like the right sequence of events.
 
Can you also run the Animometer benchmarks? We're already 20% slower than Safari at a bunch of things, I'm not very comfortable becoming even slower.
Yes, we can. How do we do that? Any specific doc pointers? 
 


Ivan Krasin

Jul 13, 2016, 7:39:18 PM7/13/16
to Annie Sullivan, Matt Sheets, Elliott Sprehn, Project TRIM, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman, Nat Duca
Hi everyone,

we have collected additional data, but need your help to interpret the results.

I made 3 runs each for LTO (the current official Chrome built from master) and CFI-vcall (the same plus is_cfi=true in the GN args).
In each series of runs there's one outlier (LTO run 3 and CFI-vcall run 2). If we ignore them, the total scores for LTO and CFI-vcall are in the same ballpark (263.87 vs. 264.68), a difference within 0.3%.

Please take an educated look and tell us if you see any obvious regressions. I'm also ready to re-run the evaluations if the collected data is missing anything.

krasin

Ivan Krasin

Jul 13, 2016, 11:16:26 PM7/13/16
to Annie Sullivan, Matt Sheets, Elliott Sprehn, Project TRIM, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman, Nat Duca
Here are the results for Speedometer and Dromaeo with 50 runs per line. The current official Chrome is marked as lto, and the cfi-vcall build proposed for launch is marked as ltocficall. The third column is from an ongoing experiment to split vtables when needed, and should be ignored.

In short:

- Speedometer shows Equal for all but two micro-benchmarks, where it has 1.07% and 2.52% regressions, which might (or might not) be noise; we need to verify on the real perf bots.
- Dromaeo results are very flaky; in some cases it reports up to 7% improvements, which is almost certainly not true, as the CL is expected to have a non-positive performance impact. There are similarly sized regressions, which might be noise. Again, we need to test this on the real bots to be certain.

bm-devirt8.zip

Kentaro Hara

Jul 13, 2016, 11:50:31 PM7/13/16
to Ivan Krasin, Annie Sullivan, Matt Sheets, Elliott Sprehn, Project TRIM, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman, Nat Duca
Thanks for the results, Ivan!

The results are noisy, so we should not make a judgment just by looking at the average numbers reported by telemetry. Instead, we should open the result of each benchmark one by one and see how all 50 points in each benchmark are distributed.

I looked at all the results of Speedometer and Dromaeo, but I don't see any substantial regression/improvement there. (Great news :-)





Ivan Krasin

Jul 13, 2016, 11:51:43 PM7/13/16
to Kentaro Hara, Annie Sullivan, Matt Sheets, Elliott Sprehn, Project TRIM, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman, Nat Duca
Thank you for looking into this, Kentaro. Can you please also look at the results from Animometer? (See a couple of messages above; there's a link to a spreadsheet.)

Kentaro Hara

Jul 13, 2016, 11:55:17 PM7/13/16
to Ivan Krasin, Annie Sullivan, Matt Sheets, Elliott Sprehn, Project TRIM, Kostya Serebryany, Peter Collingbourne, Nico Weber, Elliott Friedman, Nat Duca
I'm not that familiar with Animometer, but as far as I can tell from the numbers, I don't see any regression.

If you don't observe any regression in blink_perf.*, I'll be convinced :)



On Thu, Jul 14, 2016 at 12:51 PM, Ivan Krasin <kra...@google.com> wrote:
Thank you for looking into this, Kentaro. Can you please also look at the results from Animometer? (See a couple of messages above; there's a link to a spreadsheet.)



Elliott Sprehn

Jul 13, 2016, 11:58:18 PM7/13/16
to Kentaro Hara, Peter Collingbourne, Kostya Serebryany, Matt Sheets, Ivan Krasin, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM

What machine are you both testing on? Your Z620 is crazy fast and very noisy.

Ivan Krasin

Jul 14, 2016, 12:02:11 AM7/14/16
to Elliott Sprehn, Kentaro Hara, Peter Collingbourne, Kostya Serebryany, Matt Sheets, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM
Yes, it is Z620, and yes, it's super noisy. :(

I can also try to run these benchmarks on my laptop, if you think it will be smoother. Alternative proposals are welcome!

Kentaro Hara

Jul 14, 2016, 12:08:40 AM7/14/16
to Ivan Krasin, Elliott Sprehn, Peter Collingbourne, Kostya Serebryany, Matt Sheets, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM
I guess Z620 is not that bad. Your result doesn't look as noisy as I normally see these days.

If you're using Linux, it would be helpful to use taskset to stabilize your results.

$ taskset -c 12 ./your_command
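As a quick sanity check that the pinning took effect, the kernel exposes the allowed-CPU list of the pinned process. A minimal sketch using only standard Linux interfaces; the core number 0 is an arbitrary choice:

```shell
# Pin a throwaway command to core 0 and print the CPU list it is allowed to run on.
taskset -c 0 grep Cpus_allowed_list /proc/self/status
# prints: Cpus_allowed_list:	0
```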






On Thu, Jul 14, 2016 at 1:02 PM, Ivan Krasin <kra...@google.com> wrote:
Yes, it is Z620, and yes, it's super noisy. :(

I can also try to run these benchmarks on my laptop, if you think it will be smoother. Alternative proposals are welcome!




Ivan Krasin

Jul 14, 2016, 12:13:32 AM7/14/16
to Kentaro Hara, Elliott Sprehn, Peter Collingbourne, Kostya Serebryany, Matt Sheets, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM
Thank you for the suggestion. Affinity is one source of noise, but there are more tricks I know of but have failed to automate: https://bugs.chromium.org/p/chromium/issues/detail?id=570904#c18

In short:

- setting affinity
- turning on Google-specific daemons
- setting the CPU governor to performance
- reducing the max CPU frequency to avoid internal throttling

It worked for me in January and reduced the noise, but the noise was still higher than 2% even after 1000 runs, so not really practical.
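To make these tweaks repeatable, the steps above can be collected into a small script. This is a hypothetical sketch, not tooling that exists in the tree: it only prints the commands it would run (so it is safe without root), and the 2000000 kHz cap, core 2, and ./your_benchmark are placeholder values to tune per machine.

```shell
# Dry-run sketch: emit the tuning commands from the list above without executing them.
plan=""
add() { plan="${plan}${1}
"; }

for cpu in /sys/devices/system/cpu/cpu[0-9]*; do
  # set the CPU governor to performance
  add "sudo sh -c 'echo performance > $cpu/cpufreq/scaling_governor'"
  # cap the max frequency below the rated speed to avoid internal throttling (placeholder value)
  add "sudo sh -c 'echo 2000000 > $cpu/cpufreq/scaling_max_freq'"
done
# run the benchmark pinned to a single core (placeholder core and binary)
add "taskset -c 2 ./your_benchmark"

printf '%s' "$plan"
```

Reviewing the emitted commands before running them with root also documents exactly which knobs a given set of results was collected under.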

Primiano Tucci

Jul 14, 2016, 6:45:42 AM7/14/16
to Ivan Krasin, Kentaro Hara, Elliott Sprehn, Peter Collingbourne, Kostya Serebryany, Matt Sheets, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM
> - setting affinity
Also, running the process with SCHED_FIFO realtime priority (sudo schedtool -F -p 50 -e command) will prevent any preemption until the process voluntarily yields (essentially taking it out of CFS).

> - setting CPU governor to performance
I think you might have better odds with powersave, as it reduces the chance that you hit CPU throttling. These days DVFS is pretty aggressive, and the powersave governor doesn't really disable it, contrary to what most of the guides out there suggest.
There was a very good writeup circulating recently from somebody doing experiments on this topic and showing the stddev achieved by tuning each variable. I searched my inbox for 30 minutes but failed to find it. Maybe somebody with a better memory can dig it up. I think it was somebody from the V8 team, and I remember it mentioned powersave, affinity & co.

Also you want:
sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"



Bruce Dawson

Jul 14, 2016, 1:15:59 PM7/14/16
to Primiano Tucci, Ivan Krasin, Kentaro Hara, Elliott Sprehn, Peter Collingbourne, Kostya Serebryany, Matt Sheets, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM
> I think you might have better odds with powersave

I saw the opposite: on both Windows and Linux, setting the governor to performance gave far more predictable results than the default, although I'll admit I didn't test powersave, just default versus performance.

We should never be hitting CPU throttling unless our cooling systems are broken. We should be able to run all cores at full speed continuously. Turbo Boost will come and go (unless disabled in the BIOS), but the rated clock speed should always be maintained. And fluctuations due to Turbo Boost should be fairly modest.



Primiano Tucci

Jul 14, 2016, 4:03:53 PM7/14/16
to Bruce Dawson, Ivan Krasin, Kentaro Hara, Elliott Sprehn, Peter Collingbourne, Kostya Serebryany, Matt Sheets, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM
Ok you officially opened the nerd sniping season. I still cannot find that doc so I repeated the experiment myself. :P

TL;DR
A combination of taskset (locking the affinity), the powersave governor, locking scaling_min_freq to the minimum frequency, disabling turbo boost, and disabling all P-states except P0 can reduce the stddev by two orders of magnitude. At least on my Z620 with a Xeon E5-2680 and cc_perftests.


>We should never be hitting CPU throttling unless our cooling systems are broken.
Right, thinking twice, I think thermal throttling was a silly argument. I suspect what's really happening is that "performance" allows the CPU to swing between a wider range of performance states, which increases variability.
On top of that, it seems that max_freq is not really obeyed as one would expect under the "performance" governor (at least with my CPU model); see below.

The other interesting thing is that the P-states (idle states) seem to have the biggest influence, at least in the benchmark below. I suppose what's happening is that benchmarks (at least this one) never fully use the CPU and necessarily go idle from time to time (even just by hitting a lock in a syscall). I suppose that, depending on which idle state you fall into, the time it takes to get back to P0 has the biggest impact on the time variance.

I took cc_perftests --gtest_filter=TileManagerPerfTest.EvictionTileQueueConstructAndIterate as a testbed (no particular reason; I just happened to use it in the past and knew it was quite noisy).
My findings on a Z620 running a 40-way Xeon E5-2680 on our Ubuntu Trusty are:
  • affinity makes a big difference
  • SCHED_FIFO a bit, but not that much
  • the powersave governor seems to make things actually worse
  • when you use the performance governor, the scaling_max_freq seems to be ignored. Proof:
    $ for cpu in /sys/devices/system/cpu/cpu*; do sudo sh -c "echo performance > $cpu/cpufreq/scaling_governor"; done
    $ for cpu in /sys/devices/system/cpu/cpu*; do sudo sh -c "echo 1200000 > $cpu/cpufreq/scaling_max_freq"; done
    $ for cpu in /sys/devices/system/cpu/cpu*; do echo -en "$cpu\t"; sudo cat $cpu/cpufreq/scaling_cur_freq; done
      /sys/devices/system/cpu/cpu0    3100015
      /sys/devices/system/cpu/cpu1    3100015
      /sys/devices/system/cpu/cpu10   3099906
      /sys/devices/system/cpu/cpu11   3100015
    ...
  • The scaling_max_freq seems to be respected instead when using "powersave". Proof:
    $ for cpu in /sys/devices/system/cpu/cpu*; do echo -en "$cpu\t"; sudo cat $cpu/cpufreq/scaling_cur_freq; done
      /sys/devices/system/cpu/cpu0    1199953
      /sys/devices/system/cpu/cpu1    1199953
      /sys/devices/system/cpu/cpu10   1199953
      /sys/devices/system/cpu/cpu11   1199953
      /sys/devices/system/cpu/cpu12   1199953
  • Disabling all the P-states (except P0) makes the biggest difference (1 order of magnitude in stddev):
    $ sudo sh -c "echo 1 > /sys/devices/system/cpu/intel_pstate/no_turbo"
    $ for cpu in /sys/devices/system/cpu/cpu*; do for p in $(seq 4); do sudo sh -c "echo 1 > $cpu/cpuidle/state$p/disable"; done; done



Config                               | Sample 1    | Sample 2    | Sample 3    | Sample 4    | Sample 5    | Sample 6    | StdDev
after reboot                         | 160889.5938 | 153476.0156 | 160128.875  | 161425.7188 | 155683.75   | 162588.7031 | 3608.699385
affinity                             | 161494.3594 | 161614.5938 | 159176.9688 | 161284.2813 | 160693.875  | 161932.25   | 998.2738996
affinity+sched                       | 160497.2656 | 161454.8438 | 160677.8281 | 161023.875  | 160526.7031 | 160065.2813 | 479.3385152
performance + maxfreq                | 196765.9688 | 200409.4063 | 198774.2969 | 200694.875  | 202048.3906 | 201817.875  | 2003.439864
performance + min                    | 201370.4688 | 199420      | 200186.8906 | 199033.4063 | 202179.1875 | 196381.4688 | 2034.025737
powersave + min + noturbo            | 70424.1875  | 70724.96094 | 70711.00781 | 70516.125   | 70590.65625 | 70002.27344 | 267.1821713
powersave + min + noturbo + nopstate | 38898.91016 | 38838.93359 | 38855.80469 | 38914.06641 | 38856.36719 | 38868.25    | 28.66301926
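As a cross-check on how the StdDev column was produced: the numbers match a sample standard deviation (n-1 denominator) over the six runs. A small awk one-liner over the "after reboot" samples copied from the table reproduces it:

```shell
# Recompute the sample stddev (n-1 denominator) of the "after reboot" row above.
sd=$(echo "160889.5938 153476.0156 160128.875 161425.7188 155683.75 162588.7031" |
  awk '{
    n = NF
    for (i = 1; i <= n; i++) s += $i
    m = s / n                            # mean of the six samples
    for (i = 1; i <= n; i++) ss += ($i - m) ^ 2
    printf "%.2f", sqrt(ss / (n - 1))    # sample standard deviation
  }')
echo "$sd"
# prints 3608.70, matching the StdDev of 3608.699385 in the table
```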



Ivan Krasin

Jul 14, 2016, 4:10:53 PM7/14/16
to Primiano Tucci, Bruce Dawson, Kentaro Hara, Elliott Sprehn, Peter Collingbourne, Kostya Serebryany, Matt Sheets, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM
Primiano,

this is awesome research. Does it make sense to put these tricks into a script and (probably) include it in catapult / telemetry [1] so that everyone who runs perf benchmarks gets these improvements?

I am actually ready to do it myself (as well as to test on a Z840, where the mileage may vary somewhat); I just need general approval from the perf people that this is the right thing to do.


Bruce Dawson

Jul 14, 2016, 4:12:51 PM7/14/16
to Primiano Tucci, Ivan Krasin, Kentaro Hara, Elliott Sprehn, Peter Collingbourne, Kostya Serebryany, Matt Sheets, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM
Hey, I tricked you into running some tests for me. Cool! It's reverse nerd sniping!

Did you try performance + noturbo? Is that even an option?

Some other possibly relevant thoughts...

I investigated aspects of this last year when looking at IPC benchmarks. These benchmarks just ping-pong between two processes. Both Windows and Linux like to schedule these processes on separate cores (which makes sense), and each process spends 40-50% of its time waiting on the other process, so both cores are 40-50% idle, and both operating systems have great difficulty deciding what frequency to run the CPUs at. They bounce between ~2 GHz and ~800 MHz, making for huge variation at the ms level (I have cool ETW graphs showing this).

That is, it's a completely CPU-bound benchmark, but it's spread out over two cores so that neither core is fully utilized, and the power management algorithms lose their minds.

I found that using the performance governor made the results far more consistent in local tests, but didn't want to push that to the perf-bots. So I used affinity to put both processes on the same core. This made the tests run slightly slower, but extremely predictably, since now we have a single CPU core that is 100% utilized.

In the context of benchmarks this is a fascinating technical challenge: trying to get the most consistent results. However, these issues also occur to some extent in the real world and presumably have some ill-defined effect on performance.

TL;DR IPC confuses power management algorithms.

Primiano Tucci

Jul 15, 2016, 5:54:45 AM7/15/16
to Bruce Dawson, Ivan Krasin, Kentaro Hara, Elliott Sprehn, Peter Collingbourne, Kostya Serebryany, Matt Sheets, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM
> Did you try performance + noturbo? Is that even an option?
So I tried affinity + performance + maxfreq + noturbo + nopstate. It makes things noticeably better (~1 order of magnitude w.r.t. just affinity + performance), but I still see the core frequencies fluctuate. They still go over the nominal range (2.8 GHz in my case) and move into what is supposed to be the turbo-boost range (2.8 to 3.6 GHz in my case).
powersave seems to be the only governor that is actually able to lock the core frequency. Also, noturbo + nopstate makes a huge difference.

In general, the lower I go with the frequency, the less stddev I get. I suppose this might be partly because a lower core frequency tends to hide and smooth out cache misses that have to go through memory (whose frequency should not be affected by the governor or the CPU core frequency). So when you set a core frequency of 1.2 GHz, an L3 miss doesn't add that much variance, because at 1.2 GHz the speed of the L3 cache and the speed of DDR3 (1.8 GHz on the Z620) become comparable.
If this is the case, it makes me wonder whether setting minfreq is the right thing to measure with. The numbers are stable, but maybe at the cost of ignoring cache misses.
I should do a few more experiments on this (the theory can be tested by benchmarking a loop with no memory access vs. a memory-spraying loop); sounds like I'm starting to have some material for a new blog post :)
The other thing that I lost track of is: what is linked to the core frequency and what is not? Are the ALUs always linked to the core frequency? Are all levels of cache? I think these days there are dozens of different clock domains.

> Does it make sense to put these tricks into a script and (probably) include it in catapult / telemetry [1] so that everyone who runs perf benchmarks gets these improvements?
Yes, I was thinking the same. I will start a thread on the telemetry group and CC you. It's going to be a tricky decision: I'd say we should definitely use these tweaks for all micro-benchmarks, but I'm not sure what the best thing to do is for more end-to-end benchmarks. Let's keep that a separate discussion.


Updated spreadsheet

Config                                                 | Sample 1  | Sample 2  | Sample 3  | Sample 4  | Sample 5  | Sample 6  | StdDev
after reboot                                           | 160889.59 | 153476.02 | 160128.88 | 161425.72 | 155683.75 | 162588.70 | 3608.70
affinity                                               | 161494.36 | 161614.59 | 159176.97 | 161284.28 | 160693.88 | 161932.25 | 998.27
affinity+sched                                         | 160497.27 | 161454.84 | 160677.83 | 161023.88 | 160526.70 | 160065.28 | 479.34
affinity + performance + maxfreq                       | 196765.97 | 200409.41 | 198774.30 | 200694.88 | 202048.39 | 201817.88 | 2003.44
affinity + performance + minfreq                       | 201370.47 | 199420.00 | 200186.89 | 199033.41 | 202179.19 | 196381.47 | 2034.03
affinity + powersave + minfreq + noturbo               | 70424.19  | 70724.96  | 70711.01  | 70516.13  | 70590.66  | 70002.27  | 267.18
affinity + powersave + minfreq + noturbo + nopstate    | 38898.91  | 38838.93  | 38855.80  | 38914.07  | 38856.37  | 38868.25  | 28.66
affinity + performance + maxfreq + noturbo + nopstate  | 100784.30 | 100884.14 | 100752.73 | 100252.04 | 101029.09 | 100820.11 | 264.53
affinity + powersave + maxfreq + noturbo + nopstate    | 87690.92  | 87808.91  | 87686.32  | 87291.07  | 87444.13  | 87710.79  | 195.55
affinity + powersave + 2GHz + noturbo + nopstate       | 61692.01  | 61722.90  | 61525.05  | 61750.31  | 61651.24  | 61458.22  | 116.50

Bruce Dawson

Jul 15, 2016, 1:53:31 PM7/15/16
to Primiano Tucci, Ivan Krasin, Kentaro Hara, Elliott Sprehn, Peter Collingbourne, Kostya Serebryany, Matt Sheets, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM
Good analysis, especially on the risk that using powersave effectively reduces the cost (in core cycles) of cache misses.

You could also use perf to monitor cache misses and see whether the stddev of the tests is correlated with higher cache-miss counts, to check whether that is a driver of the stddev.

It's too bad that noturbo doesn't seem to work with the performance governor. As long as the frequency is varying, an increased stddev is inevitable, so maybe cache misses aren't even needed as an explanation.

The ALUs would definitely be driven by the core frequency, and the L1 cache also, because those are all tightly coupled. Below that I'm not sure; the last-level cache could easily be in a separate clock domain that is not tied to the core frequency.

Elliott Sprehn

Jul 15, 2016, 5:12:07 PM7/15/16
to Bruce Dawson, Primiano Tucci, Ivan Krasin, Kentaro Hara, Peter Collingbourne, Kostya Serebryany, Matt Sheets, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM
Back to the primary topic here (not that benchmark system settings aren't fun), I had a meeting with krasin@ to discuss the strategy for CFI:

- They're going to move slowly, turning it on only for Linux and observing the performance. We can just flip it off if it's bad.

- They're working on more optimizations to devirtualize more often.

- They're going to add a mode to clang that reports what call sites were devirtualized so developers know what's going on.

- They're going to add a regression test/warning system so that if someone makes a change which causes the amount of devirtualization to change dramatically we'll get notified. For example we're depending on virtual const propagation now to offset the costs of CFI. If someone defeated the optimization by mistake we'd take a perf hit. This warning system should tell us, and the clang reporting would allow a developer to see where the optimization applies.

- They're working on ThinLTO and other strategies to make the LTO step faster. Building takes about 45m on a workstation today.

- They think it should be a 5% binary size regression, if it's much more (ex. 7%+) we should revert to understand what's up.

- There's a blacklist to tell CFI not to instrument certain classes/directories/functions. We can use this if needed to avoid the perf hit on code that's very performance-sensitive. For example, if we find that CFI is a big regression in style/layout/paint, we can blacklist that code as needed and file bugs against clang to see if new LTO optimizations can compensate.

- Our current set of benchmarks doesn't seem to regress noticeably on Z620s; some Blink micro-benchmarks do, but we also sped them up first. If benchmarks regress much more than the expected 3.5%, we should revert.

Given that this launch is for Linux only, that we have an escape hatch in the form of the blacklist, that it's easy to toggle the flag off if we see a bad regression, and that the plan is to get reporting and tooling for the LTO optimizations before expanding to other platforms, it seems like we should be good to turn this on for the Dev channel and collect more data.

The patch that will turn it on is here:

- E

Brett Wilson

Jul 15, 2016, 5:22:19 PM7/15/16
to Elliott Sprehn, Bruce Dawson, Primiano Tucci, Ivan Krasin, Kentaro Hara, Peter Collingbourne, Kostya Serebryany, Matt Sheets, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM
This sounds great, thanks for the update.

Please also go through launch-review for this since I think this change warrants the visibility. Since it seems like there is general agreement about the path forward, this should not be a big deal.

Brett


Ivan Krasin

Jul 15, 2016, 5:29:23 PM7/15/16
to Brett Wilson, Elliott Sprehn, Bruce Dawson, Primiano Tucci, Kentaro Hara, Peter Collingbourne, Kostya Serebryany, Matt Sheets, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM
Elliott: thank you for diving into this and looking into the details.
Brett: sure. What should I do for starting a launch-review? I have a very limited experience with Chrome processes, and while I want to follow them, I definitely need some guidance.

Also: is it okay to submit https://codereview.chromium.org/2140373002/? It will allow us to understand if Perf dashboard sees any regressions and that will serve as additional input for the launch-review. Rolling back that CL is trivial. It's a 2-liner that just turns on a compiler flag.

Kostya Serebryany

Jul 15, 2016, 5:31:50 PM7/15/16
to Ivan Krasin, Brett Wilson, Elliott Sprehn, Bruce Dawson, Primiano Tucci, Kentaro Hara, Peter Collingbourne, Matt Sheets, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM
On Fri, Jul 15, 2016 at 2:29 PM, Ivan Krasin <kra...@google.com> wrote:
Elliott: thank you for diving into this and looking into the details.
Brett: sure. What should I do for starting a launch-review? I have a very limited experience with Chrome processes, and while I want to follow them, I definitely need some guidance.

Also: is it okay to submit https://codereview.chromium.org/2140373002/? It will allow us to understand if Perf dashboard sees any regressions and that will serve as additional input for the launch-review. Rolling back that CL is trivial. It's a 2-liner that just turns on a compiler flag.

On Fri, Jul 15, 2016 at 2:22 PM, Brett Wilson <bre...@chromium.org> wrote:
This sounds great, thanks for the update.

Please also go through launch-review

Hasn't this been done already?

Brett Wilson

Jul 15, 2016, 5:40:01 PM7/15/16
to Kostya Serebryany, Ivan Krasin, Elliott Sprehn, Bruce Dawson, Primiano Tucci, Kentaro Hara, Peter Collingbourne, Matt Sheets, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM
On Fri, Jul 15, 2016 at 2:31 PM, Kostya Serebryany <k...@google.com> wrote:


On Fri, Jul 15, 2016 at 2:29 PM, Ivan Krasin <kra...@google.com> wrote:
Elliott: thank you for diving into this and looking into the details.
Brett: sure. What should I do for starting a launch-review? I have a very limited experience with Chrome processes, and while I want to follow them, I definitely need some guidance.

Also: is it okay to submit https://codereview.chromium.org/2140373002/? It will allow us to understand if Perf dashboard sees any regressions and that will serve as additional input for the launch-review. Rolling back that CL is trivial. It's a 2-liner that just turns on a compiler flag.

On Fri, Jul 15, 2016 at 2:22 PM, Brett Wilson <bre...@chromium.org> wrote:
This sounds great, thanks for the update.

Please also go through launch-review

Hasn't this been done already?

As I explained above, launch review is to make sure the relevant people know about (and potentially have input on) important things changing in a given release. One from 9 months ago doesn't help with this.

Brett

Brett Wilson

Jul 15, 2016, 5:46:40 PM7/15/16
to Ivan Krasin, Elliott Sprehn, Bruce Dawson, Primiano Tucci, Kentaro Hara, Peter Collingbourne, Kostya Serebryany, Matt Sheets, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM
On Fri, Jul 15, 2016 at 2:29 PM, Ivan Krasin <kra...@google.com> wrote:
Elliott: thank you for diving into this and looking into the details.
Brett: sure. What should I do for starting a launch-review? I have a very limited experience with Chrome processes, and while I want to follow them, I definitely need some guidance.

I will follow-up with you off list.

Brett

Ivan Krasin

Jul 15, 2016, 5:47:30 PM7/15/16
to Brett Wilson, Elliott Sprehn, Bruce Dawson, Primiano Tucci, Kentaro Hara, Peter Collingbourne, Kostya Serebryany, Matt Sheets, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM
Thanks!

Ivan Krasin

Jul 15, 2016, 8:24:53 PM7/15/16
to Brett Wilson, Elliott Sprehn, Bruce Dawson, Primiano Tucci, Kentaro Hara, Peter Collingbourne, Kostya Serebryany, Matt Sheets, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM
To update the status: the CL has landed so we can collect more data over the weekend.
At this point we know that the estimate for the Chrome binary size regression was very accurate. Predicted: 5%; reality: 4.8%, as reported by the sizes step:

Before:

After:

Waiting for the Perf dashboard to report regressions (nothing there at the time of writing):

Ivan Krasin

Jul 15, 2016, 8:29:05 PM7/15/16
to Brett Wilson, Elliott Sprehn, Bruce Dawson, Primiano Tucci, Kentaro Hara, Peter Collingbourne, Kostya Serebryany, Matt Sheets, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM
Sorry, wrong URL (off-by-one error). The correct one: https://chromeperf.appspot.com/group_report?rev=405894
The Chrome binary size regression is already reported there, 4.8%.

Kentaro Hara

Jul 15, 2016, 9:50:08 PM7/15/16
to Ivan Krasin, Brett Wilson, Elliott Sprehn, Bruce Dawson, Primiano Tucci, Peter Collingbourne, Kostya Serebryany, Matt Sheets, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM
Thanks Elliott for the great summary! The plan sounds reasonable to me.

Krasin: Did you get results for blink_perf.*?



On Sat, Jul 16, 2016 at 9:29 AM, Ivan Krasin <kra...@google.com> wrote:
Sorry, wrong url (off-by-one error). The correct one: https://chromeperf.appspot.com/group_report?rev=405894
The Chrome binary size regression is already reported there, 4.8%.



Ivan Krasin

Jul 15, 2016, 10:02:27 PM7/15/16
to Kentaro Hara, Brett Wilson, Elliott Sprehn, Bruce Dawson, Primiano Tucci, Peter Collingbourne, Kostya Serebryany, Matt Sheets, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM
Hi Kentaro,

Sorry, I thought I had sent that reply, but apparently I didn't. The blink_perf.* results are in the zip archive attached to the very first message in this thread. They are noisy, as we didn't apply all the tricks mentioned in the thread, but there are some numbers there.

And we'll hopefully have better data from the bots very soon. If something is bad, we'll see it, and I will revert the CL. In fact, being reverted is the default fate for this CL; it can only escape that if explicitly approved by the launch review (which is in the works).

krasin

Ivan Krasin

Jul 16, 2016, 1:42:22 PM7/16/16
to Kentaro Hara, Brett Wilson, Elliott Sprehn, Bruce Dawson, Primiano Tucci, Peter Collingbourne, Kostya Serebryany, Matt Sheets, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM
The current list of detected perf regressions is short: https://chromeperf.appspot.com/group_report?rev=405894

rasterize_and_record_micro.top_25_smooth: 2.3%

There could also be some regressions that are not shown in the dashboard. One candidate is

ChromiumPerf/linux-release/blink_perf.paint / large-table-background-change-with-invisible-collapsed-borders:

but it's hard to tell without starting a bisect (and I don't have the rights to do so).

Ivan Krasin

Jul 16, 2016, 1:55:19 PM7/16/16
to Kentaro Hara, Brett Wilson, Elliott Sprehn, Bruce Dawson, Primiano Tucci, Peter Collingbourne, Kostya Serebryany, Matt Sheets, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM
Actually, after manually looking at the blink_perf.layout benchmarks, many of them regressed by ~3.5%:

The leader (and outlier) is ChromiumPerf/linux-release/blink_perf.layout / large-table-with-collapsed-borders-and-colspans-wider-than-table, which has a regression of ~6%.

The experiment is successful: we have found the performance impact of cfi-vcall. It's within the predictions, but I don't like that so many micro-benchmarks regressed. We will need to take a closer look before going forward.

I will roll back the CL shortly.

Ivan Krasin

unread,
Jul 16, 2016, 2:02:13 PM7/16/16
to Kentaro Hara, Brett Wilson, Elliott Sprehn, Bruce Dawson, Primiano Tucci, Peter Collingbourne, Kostya Serebryany, Matt Sheets, Annie Sullivan, Nico Weber, Elliott Friedman, Nat Duca, Project TRIM

Thank you everyone for the input and help. I hope to come back to this thread if we find a way to speed this up.

Elliott Sprehn

unread,
Jul 16, 2016, 6:32:24 PM7/16/16
to Ivan Krasin, Matt Sheets, Kostya Serebryany, Peter Collingbourne, Brett Wilson, Annie Sullivan, Nico Weber, Elliott Friedman, Primiano Tucci, Bruce Dawson, Project TRIM, Nat Duca, Kentaro Hara

Thanks so much for staying on top of this! Let me know if you need help.

- E

Ivan Krasin

unread,
Jul 16, 2016, 6:35:29 PM7/16/16
to Elliott Sprehn, Matt Sheets, Kostya Serebryany, Peter Collingbourne, Brett Wilson, Annie Sullivan, Nico Weber, Elliott Friedman, Primiano Tucci, Bruce Dawson, Project TRIM, Nat Duca, Kentaro Hara
Sure!
And thank you for your rigorous review from the perf side: I was impressed to see that you went through our open bugs to check what's there.

krasin

Ivan Krasin

unread,
Aug 17, 2016, 7:52:51 PM8/17/16
to Elliott Sprehn, Matt Sheets, Kostya Serebryany, Peter Collingbourne, Brett Wilson, Annie Sullivan, Nico Weber, Elliott Friedman, Primiano Tucci, Bruce Dawson, Project TRIM, Nat Duca, Kentaro Hara
Hello everyone,

we're back. :)
In the past month, three things happened that bring us closer to a no-perf-regressions state:

1. I have landed a few changes to LLVM that allow us to see which virtual call sites / methods have been devirtualized and why. Please find this wonderful list of 51491 devirtualized call sites pointing to 23149 virtual methods:

Many of the call sites have been devirtualized because the virtual method has a single implementation. Usually, such methods are virtual only because there's a real implementation and a test implementation; the Chrome binary only contains the real one, and link-time optimization can see that and remove the virtual call. Others have been eliminated by virtual constant propagation. All in all, the list looks good, and we know how to devirtualize even more methods; it's just a matter of additional engineering effort.
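As a concrete sketch of that single-implementation pattern (all class and function names below are hypothetical, not taken from the Chrome codebase):

```cpp
// The interface is virtual only because tests provide a mock implementation;
// the shipped binary links exactly one implementation.
class Disk {
 public:
  virtual ~Disk() = default;
  virtual int BlockSize() const = 0;
};

// The only implementation present in the (hypothetical) production binary.
class RealDisk : public Disk {
 public:
  int BlockSize() const override { return 4096; }
};

int ReadBlockSize(const Disk& disk) {
  // With whole-program LTO, the compiler can prove that RealDisk is the sole
  // implementation and turn this indirect call into a direct (or inlined)
  // call, so no CFI vcall check is needed here at runtime.
  return disk.BlockSize();
}
```

In a test binary a MockDisk subclass would also be linked, the proof would fail, and the call would stay virtual.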

2. After profiling the observed slowdown, I have identified ~130 methods which are either too hot or have too many CFI checks, so they affect the overall performance; see https://crbug.com/634139
I plan to add an attribute on top of each such method to disable CFI on it, and that brings us to an overhead that I could not measure. See https://storage.googleapis.com/cfi-stats/2016-08-15/results.html

While 130 seems like a lot, that's only ~0.1% of all the methods in Chrome, so we'll have CFI enabled on 99.9% of them, which is pretty good already.
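For illustration, the per-method opt-out could look like the sketch below. The Clang attribute spelling no_sanitize("cfi") is real, but the macro, class, and function names are hypothetical, and the guard keeps the snippet compilable on non-Clang toolchains:

```cpp
// Disable CFI instrumentation for a single hot function on Clang;
// expand to nothing elsewhere so the code still builds.
#if defined(__clang__)
#define NO_CFI __attribute__((no_sanitize("cfi")))
#else
#define NO_CFI
#endif

class Node {
 public:
  virtual ~Node() = default;
  virtual int Depth() const { return 0; }
};

class Leaf : public Node {
 public:
  int Depth() const override { return 1; }
};

// Hypothetical hot loop: every virtual call inside would otherwise carry a
// CFI vcall check, so CFI is disabled for this one function only.
NO_CFI int SumDepths(const Node* const* nodes, int count) {
  int total = 0;
  for (int i = 0; i < count; ++i)
    total += nodes[i]->Depth();
  return total;
}
```

The cost is that indirect calls inside such a function are no longer checked, which is why keeping the list down to ~0.1% of methods matters.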

For the methods which have too many CFI checks, there are ideas for significantly reducing their number; see, for example, https://crbug.com/638056. That optimization alone should cover 6 out of 8 methods in V8 that we would otherwise need to disable CFI on. It will take time to implement, though, and does not seem like a launch blocker.

3. I have created a launch bug as suggested by Brett: https://crbug.com/638779

krasin

Kentaro Hara

unread,
Aug 17, 2016, 10:36:07 PM8/17/16
to Ivan Krasin, platform-architecture-dev, Elliott Sprehn, Matt Sheets, Kostya Serebryany, Peter Collingbourne, Brett Wilson, Annie Sullivan, Nico Weber, Elliott Friedman, Primiano Tucci, Bruce Dawson, Project TRIM, Nat Duca
+platform-architecture-dev

(Context: Ivan is planning to launch CFI for virtual function calls. See here for the full context.)

Is there any update about the performance regression you'd been observing in blink_perf.layout?

From the Blink perspective, I support this change assuming that the following benchmarks don't regress:

- blink_perf.*
- dromaeo.dom*
- speedometer
- animometer


Ivan Krasin

unread,
Aug 18, 2016, 2:30:34 AM8/18/16
to Kentaro Hara, platform-architecture-dev, Elliott Sprehn, Matt Sheets, Kostya Serebryany, Peter Collingbourne, Brett Wilson, Annie Sullivan, Nico Weber, Elliott Friedman, Primiano Tucci, Bruce Dawson, Project TRIM, Nat Duca
Hi Kentaro,

On Wed, Aug 17, 2016 at 7:35 PM, Kentaro Hara <har...@chromium.org> wrote:
+platform-architecture-dev

(Context: Ivan is planning to launch CFI for virtual function calls. See here for the full context.)

Is there any update about the performance regression you'd been observing in blink_perf.layout?
This is literally point #2 in my update (see above). Sorry if that was not clear. :)
 

From the Blink perspective, I support this change assuming that the following benchmarks don't regress:

- blink_perf.*
 

- dromaeo.dom*
Will collect stats for that tomorrow.
 
- speedometer
- animometer
For these two benchmarks (speedometer and animometer) there was no slowdown even in our previous try.

Kentaro Hara

unread,
Aug 18, 2016, 4:30:10 AM8/18/16
to Ivan Krasin, platform-architecture-dev, Elliott Sprehn, Matt Sheets, Kostya Serebryany, Peter Collingbourne, Brett Wilson, Annie Sullivan, Nico Weber, Elliott Friedman, Primiano Tucci, Bruce Dawson, Project TRIM, Nat Duca

Regarding Dromaeo, I think you already collected the data before. At that point there was no regression.

Sorry; reading your email, I thought that you'd made some substantial changes to LLVM since the last update, which would have changed (hopefully improved) performance. So, to be clear: if you don't observe any regression on the listed benchmarks, I'm fine with shipping it (from the Blink perspective).

Ivan Krasin

unread,
Aug 18, 2016, 4:33:43 AM8/18/16
to Kentaro Hara, platform-architecture-dev, Elliott Sprehn, Matt Sheets, Kostya Serebryany, Peter Collingbourne, Brett Wilson, Annie Sullivan, Nico Weber, Elliott Friedman, Primiano Tucci, Bruce Dawson, Project TRIM, Nat Duca
The substantial changes are related to the visibility of devirtualization. They allow us both to keep track of regressions and to understand what else can be devirtualized. No new optimizations have been implemented just yet, but there's a good list of what might be next.

Yuta Kitamura

unread,
Aug 18, 2016, 5:03:14 AM8/18/16
to Ivan Krasin, Kentaro Hara, platform-architecture-dev, Elliott Sprehn, Matt Sheets, Kostya Serebryany, Peter Collingbourne, Brett Wilson, Annie Sullivan, Nico Weber, Elliott Friedman, Primiano Tucci, Bruce Dawson, Project TRIM, Nat Duca
Just curious: if you apply the optimization without any CFI checks, how much perf gain do you get?

--
You received this message because you are subscribed to the Google Groups "platform-architecture-dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to platform-architecture-dev+unsub...@chromium.org.
To post to this group, send email to platform-architecture-dev@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/platform-architecture-dev/CAOei5RezWnUZQ1oDgMZDdrbGG4J9CQoahXsw6EYxx7G17M7cxw%40mail.gmail.com.

Ivan Krasin

unread,
Aug 18, 2016, 1:55:53 PM8/18/16
to Yuta Kitamura, Kentaro Hara, platform-architecture-dev, Elliott Sprehn, Matt Sheets, Kostya Serebryany, Peter Collingbourne, Brett Wilson, Annie Sullivan, Nico Weber, Elliott Friedman, Primiano Tucci, Bruce Dawson, Project TRIM, Nat Duca
Hi Yuta,

this is exactly what currently happens in the official Chrome on Linux x86-64: devirtualization without CFI. When we turned it on, the overall effect was small (~1%, maybe even less), but some Blink micro-benchmarks got up to 8% faster. See more details at https://crbug.com/580389 and https://crbug.com/617283 (for the latter, the Perf team is running bisects on a few selected microbenchmarks to confirm the speedups).

What we didn't get when turning on devirtualization was an automated report from the Perf dashboard saying that XX benchmarks improved. Even for these few large speedups, we had to manually look into the graphs to request the bisects. And while 8% was easy to spot, even 3% is much harder, as the level of noise is usually 3%-5% on the perf graphs.

The best thing about devirtualization is that we're not done yet. While not every method can be devirtualized, there are still a number of interesting cases that show up as hot in the profile.

krasin



Ivan Krasin

unread,
Aug 19, 2016, 7:35:51 PM8/19/16
to Yuta Kitamura, Kentaro Hara, platform-architecture-dev, Elliott Sprehn, Matt Sheets, Kostya Serebryany, Peter Collingbourne, Brett Wilson, Annie Sullivan, Nico Weber, Elliott Friedman, Primiano Tucci, Bruce Dawson, Project TRIM, Nat Duca
Hi everyone,

this is just an FYI.

I made the second attempt to enable CFI on Linux x86-64: https://codereview.chromium.org/2259293002/
This is still not a launch: unless we get our launch bug approved (https://crbug.com/638779), the CL will be reverted next week. The intent is to reassess the perf impact and make one or two iterations of adding more hot methods to the blacklist if needed (the chances of that are low, though; I was pretty aggressive already).

So far, the only impact is the binary size (4.9% larger), in line with the predictions. I don't expect any more regressions to be reported (unless they are unrelated to the launch and caused by adjacent changes from others), and seeing whether we regress by 2%-3% on some microbenchmarks will require looking at the graphs over a day or so: such small changes are impossible to spot from just a couple of data points.

krasin

Ivan Krasin

unread,
Aug 21, 2016, 3:19:05 PM8/21/16
to Yuta Kitamura, Kentaro Hara, platform-architecture-dev, Elliott Sprehn, Matt Sheets, Kostya Serebryany, Peter Collingbourne, Brett Wilson, Annie Sullivan, Nico Weber, Elliott Friedman, Primiano Tucci, Bruce Dawson, Project TRIM, Nat Duca
No real regressions have happened: https://chromeperf.appspot.com/report?sid=f29b95ffd471515fe48e6a8dd5a13592611997bbf5c4163454a3bcf14a3eb232
Initially (r413252), there was a spike caused by my mistake in defining the blacklist, which I have since fixed (https://codereview.chromium.org/2267543002/), and then I added a few more methods to the list (just to be on the safe side).

On a related note, once the new Clang toolchain was rolled, the performance of large-table-with-collapsed-borders-and-colspans-wider-than-table.html (which proved to be the most challenging last time) improved by ~4%. The improvement is independent of our devirtualization efforts (it got faster on Mac too), but it shows how noisy this microbenchmark is.

Anyway, I am pretty happy with the perf impact. I would appreciate it if someone took a critical look at the graphs and confirmed or refuted my assessment. I would not be surprised if some graphs showed a ~2% degradation, and I am willing to look into those and add more entries to the blacklist; it would be bad, though it feels unlikely, if something were 10% slower.

krasin

Kentaro Hara

unread,
Aug 21, 2016, 9:06:24 PM8/21/16
to Ivan Krasin, Yuta Kitamura, platform-architecture-dev, Elliott Sprehn, Matt Sheets, Kostya Serebryany, Peter Collingbourne, Brett Wilson, Annie Sullivan, Nico Weber, Elliott Friedman, Primiano Tucci, Bruce Dawson, Project TRIM, Nat Duca
I don't see any substantial regression in the graph. Congrats!

What's the effect on the binary size? (Sorry if you have mentioned that earlier -- this thread is getting a bit too messy to understand the latest status :-)




You received this message because you are subscribed to the Google Groups "Project TRIM" group.
To unsubscribe from this group and stop receiving emails from it, send an email to project-trim+unsubscribe@chromium.org.
To post to this group, send email to projec...@chromium.org.
To view this discussion on the web visit https://groups.google.com/a/chromium.org/d/msgid/project-trim/CAOei5RdJiM4RbHPmLi3QkB7zNJJ_7WT09G8ZN%3DA5K88zDyVXow%40mail.gmail.com.

Ivan Krasin

unread,
Aug 21, 2016, 10:03:48 PM8/21/16
to Kentaro Hara, Yuta Kitamura, platform-architecture-dev, Elliott Sprehn, Matt Sheets, Kostya Serebryany, Peter Collingbourne, Brett Wilson, Annie Sullivan, Nico Weber, Elliott Friedman, Primiano Tucci, Bruce Dawson, Project TRIM, Nat Duca
Hi Kentaro,

thank you for looking into this. The binary size increased by 4.9%.

If Chrome adopts the relative ABI for vtables, that could bring up to 9% in binary size savings: https://bugs.chromium.org/p/chromium/issues/detail?id=589384#c2 but it's not yet clear whether it's a win from the perf perspective. We're looking into this, but I am not terribly optimistic.
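A minimal sketch of where those savings come from (illustrative only, not the actual ABI layout): a classic vtable slot holds an 8-byte absolute function pointer on x86-64, which also needs a load-time relocation, while a relative-ABI slot holds a 4-byte offset from the vtable itself.

```cpp
#include <cstdint>

// Classic vtable entry: an absolute function pointer
// (8 bytes on x86-64, and it must be relocated at load time).
struct ClassicVTableSlot {
  void (*fn)();
};

// Relative-ABI entry: a 32-bit offset from the vtable address to the
// function, resolvable at link time with no dynamic relocation.
struct RelativeVTableSlot {
  std::int32_t offset;
};

static_assert(sizeof(RelativeVTableSlot) == 4,
              "relative slots are fixed-size 32-bit offsets");
```

Halving every slot is where the binary size win comes from; the perf question is the extra add needed to materialize the function address at each virtual call.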

krasin