Regarding Mojo performance

Kevin Chowski

Nov 5, 2020, 12:14:31 PM
to chromi...@chromium.org, kabylake y uprev
Hello chromium-mojo folks!

I am an engineer working on Chrome OS, currently focusing on the Eve model.

I found a claim about Mojo performance [0] in some docs I stumbled upon, but the source data is from 2017 and appears to have been manually gathered into a spreadsheet [1]. The spreadsheet mentions the specific tests that were run, and I can still see them in the repository, but I can't tell whether they run regularly or where I might find the results. In particular, I'm interested in cross-process latency on a Chrome OS device.

Are there more recent, regularly updated results for Mojo latency? Failing that, is there some ready-to-run test for Chrome OS devices?


Thanks in advance for your time,
Kevin

Jesse Barnes

Nov 5, 2020, 12:17:08 PM
to Kevin Chowski, Brian Geffon, chromi...@chromium.org, kabylake y uprev
Adding Brian too; he's been looking at this recently.

Brian Geffon

Nov 5, 2020, 2:06:08 PM
to Jesse Barnes, Kevin Chowski, chromi...@chromium.org, kabylake y uprev
Hi, I can tell you all about Mojo performance, as I've been doing a lot of work to improve it on CrOS. Specifically, I recently added writev(2) support to Mojo on POSIX, which helps in situations where we have a ton of very small messages. I'm currently working on a shared-memory implementation of a Mojo channel; in my testing it reduces round-trip time by about 30%. Finally, we're going to add Mojo support for FUTEX_SWAP, which we recently landed in the CrOS 5.4 kernel and which should reduce the remaining overhead substantially.
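To make the writev(2) idea concrete, here is a minimal standalone C sketch (illustration only; the messages and the stdout fd are made up, and Mojo's real channel code is more involved). Batching several small messages into one iovec array costs a single syscall, and thus a single user/kernel transition, instead of one per message:

#include <stdio.h>
#include <string.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void) {
    /* Three small, separately allocated messages... */
    const char *msgs[] = { "msg-a\n", "msg-b\n", "msg-c\n" };
    struct iovec iov[3];
    for (int i = 0; i < 3; i++) {
        iov[i].iov_base = (void *)msgs[i];
        iov[i].iov_len  = strlen(msgs[i]);
    }
    /* ...sent with a single syscall instead of three write() calls. */
    if (writev(STDOUT_FILENO, iov, 3) < 0)
        perror("writev");
    return 0;
}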

Regarding your question on tests: mojo_perftests is my go-to. Please feel free to ping me on chat with any questions.

Brian

Andrew Moylan

Nov 5, 2020, 8:18:54 PM
to Brian Geffon, Jesse Barnes, Kevin Chowski, chromium-mojo, kabylake y uprev
Thanks for this discussion! Brian, all these improvements are very exciting!
At some point I did a quick test (continuous ping/pong to our ml_service) on a Chrome OS DUT (eve) to get a rule of thumb, and I came up with: "the reply typically comes back in well under a millisecond, with rare spikes to a few milliseconds".
What sort of rule of thumb might I be able to update this to once these improvements land?
(This affects the user-visible contexts in which we feel we can run ML models out-of-process.)
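For context, that kind of measurement can be approximated with a plain socketpair ping/pong between two processes. Below is a sketch under that assumption (a stand-in for the real ml_service round trip, which adds Mojo serialization and dispatch on top, so the absolute numbers will differ):

#include <stdio.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static double now_us(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e6 + ts.tv_nsec / 1e3;
}

int main(void) {
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0) return 1;
    if (fork() == 0) {                  /* child: echo each byte back */
        char c;
        close(sv[0]);
        while (read(sv[1], &c, 1) == 1)
            write(sv[1], &c, 1);
        _exit(0);
    }
    close(sv[1]);
    char c = 'p';
    for (int i = 0; i < 5; i++) {       /* parent: time five round trips */
        double t0 = now_us();
        write(sv[0], &c, 1);
        read(sv[0], &c, 1);
        printf("rtt: %.1f us\n", now_us() - t0);
    }
    close(sv[0]);                       /* EOF lets the child exit */
    wait(NULL);
    return 0;
}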


Brian Geffon

Nov 6, 2020, 10:38:40 AM
to Andrew Moylan, Jesse Barnes, Kevin Chowski, chromium-mojo, kabylake y uprev
Anand recently added some metrics for POSIX that provide more visibility into this data fleet-wide: https://uma.googleplex.com/p/chrome/histograms?sid=2bfd4ae02f8f46d05c1db27f700b5565.

In my opinion, microbenchmarks, while useful, tell far from the whole picture when it comes to Mojo. The problem with any IPC system, IMO, is the cost of the syscalls that make it work: write/sendmsg and read/recvmsg, along with the waiting via epoll_wait. The cost of sending a message is a jump to the IO thread, where a write syscall copies memory into kernel space; that wakes up another user thread, whose read syscall copies the data back into userspace; and then there's a jump back to the appropriate sequence to dispatch the message.

What we're hoping to do from the Chrome OS side is improve this flow where we can. We previously added writev, which helps lower the number of round trips, and, as I mentioned, I'm currently working on a prototype with Ken Rockot's help to use shared memory to eliminate the need to copy data in and out of kernel space; we expect this will help a lot. Unfortunately, that still leaves the wake-up part of the signaling, and that's where we hope FUTEX_SWAP will help: with FUTEX_SWAP we'll effectively be able to trigger the other side to read without going through the traditional wake-up signaling. This is still very early, but I hope it explains what we're hoping to do on the CrOS side; most if not all of this will also be available for Linux.
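To sketch just the signaling half of that flow with raw futexes (an illustration with made-up names, not Mojo's actual code; FUTEX_SWAP itself is not shown since it isn't in mainline headers): the consumer parks in FUTEX_WAIT, the producer publishes its payload through shared memory and issues a FUTEX_WAKE, and the payload itself never passes through the kernel. FUTEX_SWAP would let the producer wake the consumer and deschedule itself in a single syscall:

#define _GNU_SOURCE
#include <linux/futex.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int ready = 0;   /* futex word: 0 = no data, 1 = data */
static int payload;            /* "shared memory" between the two sides */

static long futex(atomic_int *addr, int op, int val) {
    return syscall(SYS_futex, addr, op, val, NULL, NULL, 0);
}

static void *consumer(void *arg) {
    (void)arg;
    while (atomic_load(&ready) == 0)       /* sleep until woken */
        futex(&ready, FUTEX_WAIT, 0);
    printf("consumer got %d\n", payload);  /* data never copied via kernel */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, consumer, NULL);
    payload = 42;                          /* publish via shared memory */
    atomic_store(&ready, 1);
    futex(&ready, FUTEX_WAKE, 1);          /* wake the waiting side */
    pthread_join(t, NULL);
    return 0;
}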

The metrics Anand added are:
Mojo.Channel.WriteMessageHandles
Mojo.Channel.WriteMessageLatency
Mojo.Channel.WriteMessageSize
Mojo.Channel.WriteQueuePendingMessages
Mojo.Channel.WriteQueued
Mojo.Channel.WritevBatchedMessages

Brian

Kevin Chowski

Nov 6, 2020, 2:10:59 PM
to Brian Geffon, Andrew Moylan, Jesse Barnes, chromium-mojo, kabylake y uprev
Thanks a lot for the information, all! The UMA link in the previous message is particularly awesome; I will have to play around with it more, and remember it for later.

I definitely agree that microbenchmarks only go so far in understanding real-world performance. That's actually related to the genesis of my question: while upgrading the kernel, we observed some changes in latency for copy-paste between Android and Chrome (which I believe involves Mojo on some level), and I was wondering whether that would be noticeable in the smaller-scale benchmarks or whether it's some deeper issue (or an issue in the integration of all the disparate parts). If you're interested in my past findings, take a look at the graphs attached to this bug comment: b/157615371#comment32

I took a look at the UMA page and filtered to the specific model/kernels I'm looking at to see whether the change in latency shows up in these micro-measurements. I didn't see anything interesting, but the earlier kernel has 200,000x as many samples as the newer one, so I suspect the long-tail data isn't directly comparable anyway.

I also have a follow-up issue to investigate the observed latency for this operation across the full suite of devices running regularly in our builders, which would be a really interesting view into the end-to-end experience. Feel free to CC yourself on b/168041545 if you're interested in that information.

It is certainly nice to hear that folks are working on optimizing the system even further :)

Thanks again, all!
