Mojo performance


Bruce Dawson

unread,
Mar 13, 2015, 5:01:34 PM
to mojo...@chromium.org
I've done some more work on mojo optimization and on performance measurements and I thought I'd share. This may repeat some previous data but I'll try to keep it short. The tests I am comparing are:

out\Release\ipc_perftests --gtest_filter=IPCChannelPerfTest.ChannelPingPong
out\Release\ipc_mojo_perftests --gtest_filter=MojoChannelPerfTest.ChannelPingPong (currently broken, crbug/466407)

Summary:
Timing summary results (done with buildtype=Official and the OS high-performance power options) are:

All times in ms     Linux    Windows    Windows slowdown
Original IPC        2250     3000       33%
Mojo                3350     4630       38%
Mojo slowdown       49%      54%

The executive summary is that mojo runs about 50% more slowly than old-style IPC. This appears to be caused mostly by 30% more instructions, 47% more i-cache misses, and 150% more branch mispredicts. Mojo also does an extra message-sized allocation on both the sending and receiving ends, which adds cost and can add unpredictability on Windows. The slowdown mostly doesn't depend on message size (but see the proviso for large messages on Windows).

Both methods of IPC run quite fast. On my Windows test machine the ipc_mojo test can do about 52,000 round-trips per second with a small payload. However power-management peculiarities will sometimes harm this, or any other IPC method.

Unfortunately for those who like metrics, the performance of IPC is inherently noisy and messy and the most you can say definitively is "it depends".

The tests run about 35% slower on Windows compared to Linux. Some of this is due to my Windows CPU running at 2.8/3.6 GHz and my Linux CPU running at 3.5/3.9 GHz.

A change this week (eaa389606, after the tests were run) dropped one message-sized allocation per iteration from both tests but didn't significantly change the results.


And now some details:

Power management:
There are a few factors which complicate the measurement of IPC performance. The first is that both ipc_perftests and ipc_mojo_perftests are sensitive to OS power-management vagaries. This is because the two test processes naturally run on different cores, which are then both about 40% idle. Both Windows and Linux interpret this as "not CPU bound" and do not reliably ramp up the CPU clock speeds, which can lead to 3x slowdowns. Ironically this is more likely to be a problem on powerful machines with many cores. The tests now use thread affinity to force both processes to the same core for more consistent (and usually better) performance. It's not perfect, but it helps reduce the noise. However this 'solution' is not practical in Chrome itself, so be aware of the risks caused by frequent IPC ping-ponging. Test consistency can be improved with "sudo cpupower frequency-set --governor performance" on Linux, or selecting the "High performance" power plan on Windows.
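For illustration, a rough sketch of the affinity pinning the tests use (this is not the actual test-harness code, and core 0 is an arbitrary choice):

#if defined(_WIN32)
#include <windows.h>
void PinToCore0() {
  // Bit 0 set => only CPU 0 is eligible for this thread.
  SetThreadAffinityMask(GetCurrentThread(), 1);
}
#else
#include <sched.h>  // With g++ on Linux, _GNU_SOURCE (needed for CPU_SET) is predefined.
void PinToCore0() {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(0, &set);
  sched_setaffinity(0, sizeof(set), &set);  // pid 0 == the calling thread.
}
#endif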

Heap oddities:
When running the 248,832 byte tests the Windows results are highly variable. If this test is run five times in a row then the fifth run is 3.3 times faster than the first!!! This is because performance is initially dominated by the overhead of the VirtualAlloc calls that back the heap allocations. After a while the Windows heap 'learns' the allocation patterns and hangs on to the memory. This behavior is undocumented and not controllable through documented or recommended means so we can do little except be aware of it. This variable behavior only happens on Windows for allocations between 16 KB and 512 KB which means that IPC usage within Chrome (12 bytes to 11 KB seen) is not currently affected. Above 512 KB the Windows heap always uses VirtualAlloc so large high-frequency allocations should be avoided. The cost of these large allocations is proportional to their size. If mojo is used for large packets then avoiding the extra allocations will be important.
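To get a feel for this, a toy loop like the one below (not the perf test, and not guaranteed to reproduce the exact warm-up curve) times repeated allocations of the same 248,832-byte size:

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

int main() {
  const size_t kSize = 248832;  // Falls in the 16 KB - 512 KB range discussed above.
  for (int batch = 0; batch < 5; ++batch) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 100000; ++i) {
      void* p = malloc(kSize);  // Initially backed by VirtualAlloc on Windows.
      free(p);
    }
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start).count();
    printf("batch %d: %lld ms\n", batch, static_cast<long long>(ms));
  }
  return 0;
}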

CPU slowdowns:
The mojo system executes 30% more instructions, and that explains most of the slowdown, but not all of it. Both i-cache miss rates and branch mispredict rates were significantly higher on mojo (see above) and that probably explains the additional slowdown. This theory was tested on Linux by preceding the test commands with: perf stat -e 'instructions,branches,branch-misses,L1-icache-load-misses'. Note that running the tests under perf causes them to run about 40% slower, so it is important to separate performance measurements from profiling.
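For example, the full Linux invocation looks something like:

perf stat -e 'instructions,branches,branch-misses,L1-icache-load-misses' \
    out/Release/ipc_perftests --gtest_filter=IPCChannelPerfTest.ChannelPingPong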

ETW for profiling:
ETW (go/etw) was used to profile on Windows and helped to understand the CPU frequency changes (by seeing huge swings in context-switches per second) and the heap caching behavior (by monitoring system process activity and VirtualAlloc rates over time). However the ETW sampling profiler was of limited use because it runs at a maximum of 8 KHz which causes significant aliasing when examining code that does 52,000 round-trips per second. The hot function would change from run to run and throughout a run. Accurate ETW profiling also required turning off context-switch callstack collection since otherwise this overhead would distort the results.

-- 
Bruce Dawson

Darin Fisher

unread,
Mar 13, 2015, 5:13:18 PM
to Bruce Dawson, mojo...@chromium.org
Thanks for investigating!!


On Fri, Mar 13, 2015 at 2:01 PM, 'Bruce Dawson' via mojo-dev <mojo...@chromium.org> wrote:
The executive summary is that mojo runs about 50% more slowly than old-style IPC. This appears to be caused mostly by 30% more instructions, 47% more i-cache misses, and 150% more branch mispredicts. Mojo also does an extra message-sized allocation on both the sending and receiving ends, which adds cost and can add unpredictability on Windows.

^^^ This is low-hanging fruit fortunately. We've had a plan to eliminate this redundant allocation by changing Mojo{Read,Write}Message to support having the core system own the buffer allocation. How big of a factor do you think this is?

Viet-Trung Luu

unread,
Mar 13, 2015, 5:41:03 PM
to Darin Fisher, Bruce Dawson, mojo...@chromium.org
I'm still somewhat hesitant about this plan.

It's easy and works well when there's no security boundary between the caller of Mojo...() and the real implementation, but it adds a lot of complexity when there is. The complexity means that the performance gains may well be lost (and indeed may be a performance hit) for small messages, which is probably the common case.

Why?

It means that the implementation not only has to have a trusted allocator (e.g.: in NaCl, the allocator for the trusted code; in a hypothetical kernel implementation, a kernel allocator) and the ability to copy data between trusted and untrusted memory, but also an allocator that provides allocations accessible to both trusted and untrusted code. (This has the side effect of "stealing" address space from untrusted code.)

This allocator has additional requirements, like having all its metadata stored in trusted memory (at least for the case of allocations that are to be writable by untrusted code), and the trusted code's use of that allocator would always have to be careful to not expose any other important data (to writes, certainly, and quite possibly even just reads).

The more likely thing to happen would be that such an API would be implemented purely in untrusted code (using the existing API) -- i.e., allocate in untrusted code and copy when crossing security boundaries. But then this at best breaks even and more likely would be a pure performance hit.

Thus as a private API (for the embedder, or special apps like the NaCl host -- to be used by trusted NaCl code) it's probably OK, but I'm sceptical about it being a good API for general untrusted apps.
 

Darin Fisher

unread,
Mar 13, 2015, 5:43:51 PM
to Viet-Trung Luu, Bruce Dawson, mojo...@chromium.org
How is this fundamentally different than two-phase read/write on a data pipe? Note, I think it may well be worth optimizing for in-process cases as those are going to be fairly common. Again, it would be nice to know the magnitude of the likely impact before going there.

-Darin

Viet-Trung Luu

unread,
Mar 13, 2015, 6:03:43 PM
to Darin Fisher, Bruce Dawson, mojo...@chromium.org
At one level, they aren't (and this already causes pain).

At another, the general case for data pipes is streaming a large quantity of data repeatedly, so you're willing to take some large-ish up-front costs. And you'd expect any buffer allocated for data pipes to be large, so you'd be willing to allocate them with page-level granularity. I don't think you'd want to allocate memory for messages with page-level granularity (i.e., consuming 4 KB for every message). (You could do a two-level scheme, but again that adds complexity and penalizes the small-message case.)
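(For readers who haven't used it, the two-phase data-pipe write being referred to looks roughly like the sketch below: the system hands the producer a pointer into its own buffer so the caller fills it in place. Signatures are approximate recollections of the C system API, and FillPayload is a hypothetical helper.)

#include "mojo/public/c/system/data_pipe.h"  // Assumed header location.

void WriteInPlace(MojoHandle producer) {
  void* buffer = nullptr;
  uint32_t buffer_size = 0;
  if (MojoBeginWriteData(producer, &buffer, &buffer_size,
                         MOJO_WRITE_DATA_FLAG_NONE) == MOJO_RESULT_OK) {
    // Fill the system-owned buffer directly; no caller-side allocation or copy.
    uint32_t written = FillPayload(buffer, buffer_size);  // Hypothetical helper.
    MojoEndWriteData(producer, written);
  }
}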
 
Note, I think it may well be worth optimizing for in-process cases as those are going to be fairly common. Again, it would be nice to know the magnitude of the likely impact before going there.

Agreed.

Hajime Morrita

unread,
Mar 13, 2015, 6:28:23 PM
to Viet-Trung Luu, Darin Fisher, Bruce Dawson, mojo...@chromium.org
In general, I'd rather enable ChannelMojo for more processes, including possibly impactful ones like Pepper/ARC, and see how the overall perf reacts. I used to worry about Pepper/ARC because of its heavy use of sync messages, but I then learned that much of its IPC overhead is due to NaCl sandbox-related tricks, like message scanning, brokering, etc. Now it isn't clear to me how much the pure IPC::Channel perf matters here.

That said, there are a few things we can possibly do before changing the public API surface / memory allocation responsibility:

 * I have a prototype CL to skip some of the read-side overhead. This eliminates one extra copy on the read side and some other overhead.

 * We could possibly pass the parameter of MojoWriteMessage() directly to the send()/write() syscall without copying it into the internal buffer. This isn't always possible, but in Chrome's case most MojoWriteMessage() calls could immediately result in the syscall, so it is theoretically possible to have some kind of fast path. I'm not sure about the internal complexity needed for this approach, though.

 * The cost comes not only from the memory allocation but also from many other small things, like synchronization and the general complexity needed for the fully-featured (multiplexed, handle-passing, thread-safe) API. It might be possible to have some embedder API to bypass some of this complexity, even without exposing the internal buffer.

You can see the difference between the two channels in the flamegraphs:
- traditional IPC: http://jsfiddle.net/pt7mq15w/show/
- ChannelMojo: http://jsfiddle.net/48coeekk/1/show/





Darin Fisher

unread,
Mar 17, 2015, 2:18:42 PM
to Viet-Trung Luu, Bruce Dawson, mojo...@chromium.org
By the way, another option here is to use the stack rather than the heap for preparing messages. We can easily reserve a chunk of stack space inside ReadAndDispatchMessage, scoped to MessageBuilder instances. For small messages we could use this stack space, and fail over to the heap for larger messages. This would be fairly easy to implement.
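Something like this minimal sketch, for illustration only (not the actual MessageBuilder code; the 256-byte threshold is invented):

#include <cstddef>
#include <cstdint>
#include <memory>

class SmallMessageBuffer {
 public:
  explicit SmallMessageBuffer(size_t size) : size_(size) {
    if (size_ > kStackCapacity)
      heap_.reset(new uint8_t[size_]);  // Large message: fall back to the heap.
  }
  uint8_t* data() { return heap_ ? heap_.get() : stack_; }
  size_t size() const { return size_; }

 private:
  static const size_t kStackCapacity = 256;  // Invented small-message threshold.
  uint8_t stack_[kStackCapacity];            // Lives in the enclosing stack frame.
  std::unique_ptr<uint8_t[]> heap_;          // Only allocated for larger messages.
  size_t size_;
};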

-Darin

Hajime Morrita

unread,
Apr 10, 2015, 3:41:56 PM
to Darin Fisher, Viet-Trung Luu, Bruce Dawson, mojo...@chromium.org
After all, ChannelMojo still isn't fast enough :-(
We have to squeeze a bit more. Looking...


--
morrita

Darin Fisher

unread,
Apr 12, 2015, 7:24:37 PM
to Hajime Morrita, Viet-Trung Luu, Bruce Dawson, mojo...@chromium.org
Wow, so close. Any leading theories as to the cause of the slowdown?

-Darin

Hajime Morrita

unread,
Apr 13, 2015, 1:26:52 PM
to Darin Fisher, Viet-Trung Luu, Bruce Dawson, mojo...@chromium.org
I haven't profiled it yet, but I haven't seen any IPC-related bottleneck other than the cookie access.
It is likely that the ChannelMojo slowness (we still have a bit) isn't fully masked by the noise in this case.

--
morrita

Bruce Dawson

unread,
Apr 13, 2015, 2:25:55 PM
to Hajime Morrita, Darin Fisher, Viet-Trung Luu, mojo...@chromium.org
My previous profiling suggested that mojo's slowness was mostly just from executing more instructions (plus more i-cache misses and more branch mispredicts), which suggests it is a death-of-a-thousand-cuts slowdown.
--
Bruce Dawson

Hajime Morrita

unread,
Apr 13, 2015, 2:42:29 PM
to Bruce Dawson, Darin Fisher, Viet-Trung Luu, mojo...@chromium.org
Yes, the cookie access just reveals that slowness, since it does things synchronously and directly affects the benchmark numbers.
--
morrita

Darin Fisher

unread,
Apr 13, 2015, 4:23:40 PM
to Hajime Morrita, Bruce Dawson, Viet-Trung Luu, mojo...@chromium.org
So there are no extra thread hops or additional queues that stand out? :-(

Hajime Morrita

unread,
Apr 13, 2015, 5:44:30 PM
to Darin Fisher, Bruce Dawson, Viet-Trung Luu, mojo...@chromium.org
There is a queue, and I have a POC CL to bypass it. The downside of the change, which made me not pursue it once I learned what happens in these test cases, is that it adds yet more complexity to the Mojo embedder API surface.

Also, I just received another, clearer regression report. Bad news.

All regressions are happening on either XP or Android. Apparently my last attempt was enough for faster boxes but not enough for mobile :-(
One major metric getting worse is "thread_IO_cpu_time_per_frame". This roughly agrees with what Bruce said.

--
morrita

--
morrita

Hajime Morrita

unread,
Apr 14, 2015, 8:35:21 PM
to Darin Fisher, Bruce Dawson, Viet-Trung Luu, mojo...@chromium.org
I wrote up some detail about the slowness, hoping it gives some color to the awful-looking CLs that are coming.

Darin Fisher

unread,
Apr 15, 2015, 1:45:39 AM
to Hajime Morrita, Bruce Dawson, Viet-Trung Luu, mojo...@chromium.org
Very interesting!

Have you considered putting Mojo more directly underneath IPC::ChannelProxy rather than IPC::Channel? I ask because IPC::ChannelProxy introduces another queue for both sending and receiving messages.

I believe you can eliminate the send queue easily given that MojoWriteMessage can be called from any thread and implements its own internal queuing (i.e., MojoWriteMessage never generates a "would block" type error). The receive queue can probably also be eliminated, as you could set up the HandleWatcher to run on the thread where the IPC::ChannelProxy is bound.

Maybe there is some hybrid approach whereby IPC::ChannelMojo could expose an API that would allow IPC::ChannelProxy to implement these optimizations.

-Darin

Hajime Morrita

unread,
Apr 15, 2015, 1:59:01 PM
to Darin Fisher, Bruce Dawson, Viet-Trung Luu, mojo...@chromium.org
Thanks for the advice Darin!

I prototyped the send-side change before but didn't see a noticeable improvement at that time.
I felt it was strange, since it should skip one extra thread hop, but I didn't look into it further.

Probably I should come back to the CL and give it a fresh look.

--
morrita

Darin Fisher

unread,
Apr 16, 2015, 1:12:51 AM
to Hajime Morrita, Bruce Dawson, Viet-Trung Luu, mojo...@chromium.org
Maybe the benefits would be seen more on the receive side?
-Darin

John Abd-El-Malek

unread,
Jul 23, 2015, 1:25:42 PM
to Darin Fisher, Hajime Morrita, Bruce Dawson, Viet-Trung Luu, mojo...@chromium.org
Bringing back this thread since there have been a few discussions about Mojo perf lately. Talking to Trung, I learned that with Mojo IPC there are two copies when sending (to the internal buffer, then to the pipe) and two when receiving (the reverse). Looking through the code, it appears that the code uses locks so that sending/receiving works the same from any thread.

If my understanding is correct, then comparing MojoChannelPerfTest.ChannelPingPong to IPCChannelPerfTest.ChannelPingPong is not a fair comparison. This is because the cost to send the Mojo IPC in the former test is the same independent of which thread is sending/receiving. However the latter test only covers the case when Chrome sends/dispatches IPCs from the IO thread. In the renderer, this is a tiny minority of messages. In the browser, this is more likely but still most messages are on other threads (primarily UI). ChannelProxy handles this through more buffer allocations (one more on each side) and also using PostTask. In that case, comparing ChannelProxy performance is more representative. Here is the average of 3 release runs on my Z840 on Windows:

IPCChannelPerfTest.ChannelProxyPingPong 7742ms
MojoChannelPerfTest.ChannelProxyPingPong 5314ms
i.e. Mojo IPC is 46% faster

If we don't use ChannelProxy, then Mojo IPC is 63% slower:
IPCChannelPerfTest.ChannelPingPong 2350ms
MojoChannelPerfTest.ChannelPingPong 3840ms
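(For reference, the 46% is computed as (7742 - 5314) / 5314 = about 0.46, and the 63% as (3840 - 2350) / 2350 = about 0.63.)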

Darin Fisher

unread,
Jul 23, 2015, 5:41:59 PM
to John Abd-El-Malek, Hajime Morrita, Bruce Dawson, Viet-Trung Luu, mojo...@chromium.org, Ken Rockot
That's good news. Why did we revert IPCChannelMojo? Do we need to do more work on IPC::ChannelProxy to make it leverage Mojo in the right way?

John Abd-El-Malek

unread,
Jul 23, 2015, 5:46:49 PM
to Darin Fisher, Hajime Morrita, Bruce Dawson, Viet-Trung Luu, mojo...@chromium.org
OK, I looked some more at the code and I have somewhat surprising results. First, ChannelProxy::Send doesn't do an extra copy since it gets a Message that it owns, so both the Mojo and Chrome IPC cases should be the same. Also, ChannelMojo::IsSendThreadSafe() was set to false, which meant that even when using the Mojo channel we were still doing a PostTask on send. So I'm not sure of the exact reason that Mojo is faster in this case and I'll avoid speculating for now.

Turning ChannelMojo::IsSendThreadSafe() to true makes MojoChannelPerfTest.ChannelProxyPingPong take 4616ms (pretty stable, average of three runs). That means Mojo is now 68% faster on Windows.


I also ran these numbers on Linux. There the numbers were definitely a lot noisier (in the range of about a second):
IPCChannelPerfTest.ChannelProxyPingPong: 7407ms
MojoChannelPerfTest.ChannelProxyPingPong: 10097ms
MojoChannelPerfTest.ChannelProxyPingPong with ChannelMojo::IsSendThreadSafe() = true: 6987ms

The interesting thing here is that Mojo is slower in trunk, but setting IsSendThreadSafe to true makes it slightly faster. Note that my Linux box is a Z620 while my Windows box is a Z840.

For posterity, here are the single threaded numbers for Linux as well:
IPCChannelPerfTest.ChannelPingPong 2931ms
MojoChannelPerfTest.ChannelPingPong 4427ms


Ken Rockot

unread,
Jul 23, 2015, 5:49:29 PM
to Darin Fisher, John Abd-El-Malek, Hajime Morrita, Bruce Dawson, Viet-Trung Luu, mojo...@chromium.org
On Thu, Jul 23, 2015 at 2:41 PM, Darin Fisher <da...@chromium.org> wrote:
That's good news. Why did we revert IPCChannelMojo? Do we need to do more work on IPC::ChannelProxy to make it leverage Mojo in the right way?

Page cycler perf regression on XP. It may have been a red herring though. Trying to verify before turning it back on.

John Abd-El-Malek

unread,
Jul 23, 2015, 5:49:39 PM
to Darin Fisher, Hajime Morrita, Bruce Dawson, Viet-Trung Luu, mojo...@chromium.org, Ken Rockot
On Thu, Jul 23, 2015 at 2:41 PM, Darin Fisher <da...@chromium.org> wrote:
That's good news. Why did we revert IPCChannelMojo? Do we need to do more work on IPC::ChannelProxy to make it leverage Mojo in the right way?

Ken mentioned today that it was reverted earlier because it was suspected to cause one XP benchmark to slow down. However after the revert it didn't improve, so it must have been something else.

Our messages crossed, and I did notice a change we should make to speed it up even further.

John Abd-El-Malek

unread,
Jul 24, 2015, 3:57:18 PM
to Darin Fisher, Hajime Morrita, Bruce Dawson, Viet-Trung Luu, mojo...@chromium.org, Ken Rockot
Also, here are Android numbers:

IPCChannelPerfTest.ChannelProxyPingPong 37892ms
MojoChannelPerfTest.ChannelProxyPingPong 47298ms
MojoChannelPerfTest.ChannelProxyPingPong with ChannelMojo::IsSendThreadSafe() = true: 38159ms

single threaded:
IPCChannelPerfTest.ChannelPingPong 20634ms
MojoChannelPerfTest.ChannelPingPong 30347ms

Jeff Brown

unread,
Jul 28, 2015, 8:07:01 PM
to mojo-dev, da...@chromium.org, mor...@google.com, bruce...@google.com, viettr...@chromium.org, roc...@chromium.org, j...@chromium.org
Any attempts to benchmark Mojo vs. Android binder?

I'm somewhat concerned about the extra copy in the Mojo receive pipeline (Binder only copies once from the sender into a kernel buffer then maps that buffer on the receiver's end).  Using a two-phase API for message pipes, as for data pipes, would enable greater flexibility in terms of how messages are delivered to reduce allocations and copies given suitable OS-level support.

I'm also concerned about the fact that all Mojo messages for a process are being received on one thread then dispatched to other threads to be handled.  I think it would be better to use separate channels for each handler so as to avoid any unnecessary context switches (or priority inversion) introduced by funneling everything through the same reader thread.

For certain applications, it may also be worth using SOCK_SEQPACKET for message framing when supported by the OS.
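For concreteness, here is a minimal sketch (Linux, illustrative only) of what SOCK_SEQPACKET framing buys you: each send() arrives as one whole record, so no length prefix or reassembly is needed.

#include <sys/types.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>

int main() {
  int fds[2];
  if (socketpair(AF_UNIX, SOCK_SEQPACKET, 0, fds) != 0)
    return 1;
  const char msg[] = "ping";
  send(fds[0], msg, sizeof(msg), 0);               // One call, one framed record.
  char buf[64];
  ssize_t n = recv(fds[1], buf, sizeof(buf), 0);   // Returns exactly one record.
  printf("got %zd bytes: %s\n", n, buf);
  close(fds[0]);
  close(fds[1]);
  return 0;
}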

Jeff.

Benjamin Lerman

unread,
Jul 29, 2015, 4:46:15 AM
to Jeff Brown, mojo-dev, Darin Fisher, Hajime Morrita, bruce...@google.com, Viet-Trung Luu, roc...@chromium.org, John Abd-El-Malek
I'm also concerned about the fact that all Mojo messages for a process are being received on one thread then dispatched to other threads to be handled.  I think it would be better to use separate channels for each handler so as to avoid any unnecessary context switches (or priority inversion) introduced by funneling everything through the same reader thread.

You do not have to dispatch all of your messages on a single thread. You can bind a message pipe to an implementation on any thread. The mapping of services to threads is entirely up to the user.

Jeff Brown

unread,
Jul 29, 2015, 11:57:47 AM
to Benjamin Lerman, Darin Fisher, Hajime Morrita, John Abd-El-Malek, Viet-Trung Luu, bruce...@google.com, mojo-dev, roc...@chromium.org

Hmm.  I was under the impression that there is still just one socket per pair of processes and all pipes between those processes are multiplexed over that socket.

Jeff.

Yuzhu Shen

unread,
Jul 29, 2015, 12:09:46 PM
to Jeff Brown, Benjamin Lerman, Darin Fisher, Hajime Morrita, John Abd-El-Malek, Viet-Trung Luu, Bruce Dawson, mojo-dev, Ken Rockot
On Wed, Jul 29, 2015 at 8:57 AM, 'Jeff Brown' via mojo-dev <mojo...@chromium.org> wrote:

Hmm.  I was under the impression that there is still just one socket per pair of processes and all pipes between those processes are multiplexed over that socket.

I think that is correct. Between two processes, we have a channel (built on top of OS "pipe"). And message pipes are multiplexed over that channel. There is an IO thread for that channel to receive incoming messages.

My understanding is that this is not very different from how we do IPC in Chrome today. For example, between the browser and renderer process, we also multiplex messages for different functionality over a single channel.

IIRC, we didn't use a dedicated OS pipe for each mojo message pipe to avoid hitting limits on some OSes. On the other hand, this is an implementation detail; we can make changes in the future without affecting users. (Trung knows better. Please correct me if I am wrong.)



--
Best regards,
Yuzhu Shen.

James Robinson

unread,
Jul 29, 2015, 12:26:46 PM
to Yuzhu Shen, Jeff Brown, Benjamin Lerman, Darin Fisher, Hajime Morrita, John Abd-El-Malek, Viet-Trung Luu, Bruce Dawson, mojo-dev, Ken Rockot
On Wed, Jul 29, 2015 at 9:09 AM, 'Yuzhu Shen' via mojo-dev <mojo...@chromium.org> wrote:
IIRC, we didn't use a dedicated OS pipe for each mojo message pipe to avoid hitting limits on some OSes. On the other hand, this is an implementation detail; we can make changes in the future without affecting users. (Trung knows better. Please correct me if I am wrong.)

On Windows, MsgWaitForMultipleObjectsEx() can only wait on up to 63 handles, so there's another thread to mux OS pipes onto a single HANDLE (using I/O completion ports). epoll() etc. have no such limitation, but maintaining separate implementations for Windows and non-Windows is trickier.

- James

Jeff Brown

unread,
Jul 29, 2015, 2:05:13 PM
to James Robinson, Yuzhu Shen, Benjamin Lerman, Darin Fisher, Hajime Morrita, John Abd-El-Malek, Viet-Trung Luu, Bruce Dawson, mojo-dev, Ken Rockot

I wouldn't be surprised if we ended up wanting separate implementations of the core message pipeline for different OSs anyhow.  Otherwise we'll be missing out on important optimization opportunities for latency and throughput.

In my opinion, we shouldn't be benchmarking Mojo against Chromium's IPC mechanism but rather against the best underlying native OS primitives since that's our real competition.

Mojo's architecture will likely be somewhat more latency sensitive than other systems to date (due to the amount of delegation inherent in the services) so it will be worthwhile going to some length to avoid unnecessary context switches and related overhead.  :)

Jeff.

Kris Giesing

unread,
Jul 29, 2015, 2:08:19 PM
to Jeff Brown, James Robinson, Yuzhu Shen, Benjamin Lerman, Darin Fisher, Hajime Morrita, John Abd-El-Malek, Viet-Trung Luu, Bruce Dawson, mojo-dev, Ken Rockot
Agree with Jeff on all points.


John Abd-El-Malek

unread,
Jul 29, 2015, 6:57:43 PM
to Kris Giesing, Jeff Brown, James Robinson, Yuzhu Shen, Benjamin Lerman, Darin Fisher, Hajime Morrita, Viet-Trung Luu, Bruce Dawson, mojo-dev, Ken Rockot
Just to give some context, the background for this thread was switching Chrome's IPC to use Mojo at a low level. That's why the comparison is there: to make sure the switch doesn't lead to a regression.

The points you raise are all valid, in particular for your use case, which is different from Chrome's (since this is a public thread I'll stop at that :) ).