I've done some more work on mojo optimization and on performance measurements and I thought I'd share. This may repeat some previous data but I'll try to keep it short. The tests I am comparing are:

out\Release\ipc_perftests --gtest_filter=IPCChannelPerfTest.ChannelPingPong
out\Release\ipc_mojo_perftests --gtest_filter=MojoChannelPerfTest.ChannelPingPong (currently broken, crbug/466407)

Summary:
Timing summary results (done with buildtype=Official and the OS high-performance power options) are:
All times in ms | Linux | Windows | Windows slowdown
Original IPC    | 2250  | 3000    | 33%
Mojo            | 3350  | 4630    | 38%
Mojo slowdown   | 49%   | 54%     |

The executive summary is that mojo runs about 50% more slowly than old-style IPC. This appears to be caused mostly by 30% more instructions, 47% more i-cache misses, and 150% more branch mispredicts. Mojo also does an extra message-sized allocation on both the receiving and sending end, which adds cost and can add unpredictability on Windows.
How big of a factor do you think this is?

The slowdown mostly doesn't depend on message size (but see the proviso for large messages on Windows).

Both methods of IPC run quite fast. On my Windows test machine the ipc_mojo test can do about 52,000 round-trips per second with a small payload. However power-management peculiarities will sometimes harm this, or any other IPC method.

Unfortunately for those who like metrics, the performance of IPC is inherently noisy and messy, and the most you can say definitively is "it depends".

The tests run about 35% slower on Windows compared to Linux. Some of this is due to my Windows CPU running at 2.8/3.6 GHz and my Linux CPU running at 3.5/3.9 GHz.

A change this week (eaa389606, after the tests were run) dropped one message-sized allocation per iteration from both tests but didn't significantly change the results.

And now some details:

Power management:
There are a few factors which complicate the measurement of IPC performance. The first is that both ipc_perftests and ipc_mojo_perftests are sensitive to OS power-management vagaries. This is because the two test processes naturally run on different cores, which are then both about 40% idle. Both Windows and Linux interpret this as "not CPU bound" and do not reliably ramp up the CPU clock speeds, which can lead to 3x slowdowns. Ironically this is more likely to be a problem on powerful machines with many cores. The tests now use thread affinity to force both processes to the same core for more consistent (and usually better) performance (see the affinity sketch below). It's not perfect, but it helps reduce the noise. However this 'solution' is not practical in Chrome itself, so be aware of the risks caused by frequent IPC ping-ponging. Test consistency can be improved with "sudo cpupower frequency-set --governor performance" on Linux, or by selecting the "High performance" power plan on Windows.

Heap oddities:
When running the 248,832 byte tests the Windows results are highly variable. If this test is run five times in a row then the fifth run is 3.3 times faster than the first!!! This is because performance is initially dominated by the overhead of the VirtualAlloc calls that back the heap allocations. After a while the Windows heap 'learns' the allocation patterns and hangs on to the memory (a rough way of observing this is sketched below). This behavior is undocumented and not controllable through documented or recommended means, so we can do little except be aware of it. This variable behavior only happens on Windows for allocations between 16 KB and 512 KB, which means that IPC usage within Chrome (12 bytes to 11 KB seen) is not currently affected. Above 512 KB the Windows heap always uses VirtualAlloc, so large high-frequency allocations should be avoided. The cost of these large allocations is proportional to their size. If mojo is used for large packets then avoiding the extra allocations will be important.

CPU slowdowns:
The mojo system executes 30% more instructions, and that explains most of the slowdown, but not all of it. Both i-cache miss rates and branch mispredict rates were significantly higher on mojo (see above), and that probably explains the additional slowdown. This theory was tested on Linux by preceding the test commands with: perf stat -e 'instructions,branches,branch-misses,L1-icache-load-misses'. Note that running the tests under perf causes them to run about 40% slower, so it is important to separate performance measurements from profiling.
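For reference, here is a minimal sketch of the kind of affinity pinning described under "Power management" above. This is just an illustration (not the code the tests actually use); it pins the calling thread to logical CPU 0 using SetThreadAffinityMask on Windows and sched_setaffinity on Linux:

// Sketch only: pin the calling thread to CPU 0 so that the two test
// processes share a core and look CPU bound to the OS.
#if defined(_WIN32)
#include <windows.h>
bool PinToCpu0() {
  // A mask with only bit 0 set selects logical processor 0.
  return SetThreadAffinityMask(GetCurrentThread(), 1) != 0;
}
#else
#include <sched.h>  // Build with _GNU_SOURCE for CPU_SET/sched_setaffinity.
bool PinToCpu0() {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(0, &set);
  // A pid of 0 means "the calling thread".
  return sched_setaffinity(0, sizeof(set), &set) == 0;
}
#endif

Both processes would call this early in startup; if another busy process already owns CPU 0 this will of course hurt rather than help.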
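And a rough, illustrative way to watch the heap warm-up described under "Heap oddities". This is my own sketch, not the perf test: it just times batches of 248,832-byte allocations within one process. The real tests keep messages alive on both ends of the pipe, so absolute numbers will differ, but a large drop between the first and later batches suggests the allocator has stopped going back to VirtualAlloc:

#include <chrono>
#include <cstdio>
#include <cstdlib>

int main() {
  const size_t kSize = 248832;  // Matches the large-message test payload.
  for (int run = 0; run < 5; ++run) {
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < 100000; ++i) {
      char* p = static_cast<char*>(malloc(kSize));
      p[0] = 1;  // Touch the block so lazily committed pages are paid for.
      free(p);
    }
    auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                  std::chrono::steady_clock::now() - start)
                  .count();
    printf("run %d: %lld ms\n", run, static_cast<long long>(ms));
  }
  return 0;
}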
ETW for profiling:
ETW (go/etw) was used to profile on Windows and helped to understand the CPU frequency changes (by seeing huge swings in context switches per second) and the heap caching behavior (by monitoring system process activity and VirtualAlloc rates over time). However the ETW sampling profiler was of limited use because it runs at a maximum of 8 KHz, which causes significant aliasing when examining code that does 52,000 round-trips per second. The hot function would change from run to run and throughout a run. Accurate ETW profiling also required turning off context-switch callstack collection since otherwise this overhead would distort the results.

--
Bruce Dawson
Note, I think it may well be worth optimizing for in-process cases as those are going to be fairly common. Again, it would be nice to know the magnitude of the likely impact before going there.
Bringing back this thread since there have been a few discussions about Mojo perf lately. Talking to Trung, he explained how with Mojo IPC there are two copies when sending (to an internal buffer, and then to the pipe) and two when receiving (the reverse). Looking through the code, it appears that it uses locks so that sending/receiving works the same from any thread.

If my understanding is correct, then comparing MojoChannelPerfTest.ChannelPingPong to IPCChannelPerfTest.ChannelPingPong is not a fair comparison. This is because the cost to send the Mojo IPC in the former test is the same independent of which thread is sending/receiving. However the latter test only covers the case where Chrome sends/dispatches IPCs from the IO thread. In the renderer, that is a tiny minority of messages. In the browser, it is more likely, but still most messages are on other threads (primarily UI). ChannelProxy handles this through more buffer allocations (one more on each side) and also by using PostTask. In that case, comparing ChannelProxy performance is more representative. Here is the average of 3 release runs on my Z840 on Windows:

IPCChannelPerfTest.ChannelProxyPingPong   7742ms
MojoChannelPerfTest.ChannelProxyPingPong  5314ms

i.e. Mojo IPC is 46% faster.

If we don't use ChannelProxy, then Mojo IPC is 63% slower:

IPCChannelPerfTest.ChannelPingPong   2350ms
MojoChannelPerfTest.ChannelPingPong  3840ms
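(Checking the arithmetic on those numbers, since the two percentages are computed in different directions: 7742 / 5314 ≈ 1.46, so the old-IPC ChannelProxy path takes about 46% longer than the Mojo one; 3840 / 2350 ≈ 1.63, so the raw Mojo channel takes about 63% longer than the old one.)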
On Thu, Jul 23, 2015 at 10:25 AM, John Abd-El-Malek <j...@chromium.org> wrote:
> Here is the average of 3 release runs on my Z840 on Windows:
> IPCChannelPerfTest.ChannelProxyPingPong   7742ms
> MojoChannelPerfTest.ChannelProxyPingPong  5314ms
> i.e. Mojo IPC is 46% faster
> If we don't use ChannelProxy, then Mojo IPC is 63% slower:
> IPCChannelPerfTest.ChannelPingPong   2350ms
> MojoChannelPerfTest.ChannelPingPong  3840ms

That's good news. Why did we revert IPCChannelMojo? Do we need to do more work on IPC::ChannelProxy to make it leverage Mojo in the right way?
I'm also concerned about the fact that all Mojo messages for a process are being received on one thread then dispatched to other threads to be handled. I think it would be better to use separate channels for each handler so as to avoid any unnecessary context switches (or priority inversion) introduced by funneling everything through the same reader thread.
Hmm. I was under the impression that there is still just one socket per pair of processes and all pipes between those processes are multiplexed over that socket.
Jeff.
On Wed, Jul 29, 2015 at 8:57 AM, 'Jeff Brown' via mojo-dev <mojo...@chromium.org> wrote:
> Hmm. I was under the impression that there is still just one socket per pair of processes and all pipes between those processes are multiplexed over that socket.
I think that is correct. Between two processes we have a channel (built on top of an OS "pipe"), and message pipes are multiplexed over that channel. There is an IO thread for that channel to receive incoming messages.

My understanding is that this is not very different from how we do IPC in Chrome today. For example, between the browser and renderer process we also multiplex messages for different functionality over a single channel.

IIRC, we didn't use a dedicated OS pipe for each mojo message pipe to avoid hitting limits on some OSes. On the other hand, this is an implementation detail; we can make changes in the future without affecting users. (Trung knows better. Please correct me if I am wrong.)
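To make the multiplexing idea concrete: conceptually each message that goes over the shared channel carries an identifier saying which message pipe it belongs to, roughly like the framing below. This is entirely hypothetical and invented for illustration; it is not Mojo's actual wire format.

// Hypothetical framing for multiplexing many message pipes over one
// OS-level channel. Names and layout are made up; the real Mojo channel
// format differs.
#include <cstdint>
#include <vector>

struct MuxHeader {
  uint64_t destination_pipe_id;  // Which logical message pipe this is for.
  uint32_t payload_size;         // Bytes of serialized message that follow.
  uint32_t num_handles;          // Platform handles attached to the message.
};

struct MuxMessage {
  MuxHeader header;
  std::vector<uint8_t> payload;  // The serialized message contents.
};

// The channel's IO thread reads MuxMessages off the OS pipe and routes each
// payload to the endpoint registered for destination_pipe_id.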
I wouldn't be surprised if we ended up wanting separate implementations of the core message pipeline for different OSs anyhow. Otherwise we'll be missing out on important optimization opportunities for latency and throughput.
In my opinion, we shouldn't be benchmarking Mojo against Chromium's IPC mechanism but rather against the best underlying native OS primitives since that's our real competition.
Mojo's architecture will likely be somewhat more latency sensitive than other systems to date (due to the amount of delegation inherent in the services) so it will be worthwhile going to some length to avoid unnecessary context switches and related overhead. :)
Jeff.