Tock System Call Performance Evaluation - memcopy overhead

Leon Schuermann

Mar 14, 2021, 4:47:24 PM3/14/21
to Tock Embedded OS Development Discussion, gmar...@ucsd.edu, ppan...@ucsd.edu, am...@amitlevy.com, p...@cs.stanford.edu

The recent presentation by Gabriel Marcano (CCed) about their report
titled "Benchmarking the Tock operating system for embedded
platforms"[1] has inspired me to take a deeper look into what exactly my
CPU does when a process issues a `Command`-style system call in Tock. :)

For testing this, I have used the LiteX SoC platform with the VexRiscv
RISC-V CPU that Tock has been ported to. LiteX features a simulated
hardware target ("litex_sim.py") that uses Verilator to emulate an SoC
similar to what would run on a real FPGA. Combined with the Ibex tracer
module, which plugs into the RISC-V formal interface (RVFI) of a CPU,
and a custom VexRiscv CPU build that has the RVFI enabled, we can see
every instruction being executed. For more info, see [2].

The results presented here have been obtained using
- rustc 1.51.0-nightly (c2de47a9a 2021-01-06)
- tock-2.0-dev commit f4ee443f1d26c7 ("Merge pull request #2478 ...")
  with `litex_sim` changed from rv32i to rv32imc

The annotated and formatted instruction traces[3], full instruction traces,
tested ELF and disassembly can be found under [4].

While I could not reproduce the exact numbers from the report referred
to (likely because of the vastly different testing methodology), the
number of instructions required for routing a system call from process
to capsule is indeed unexpectedly large: from the userspace ecall (or
the _start_trap kernel symbol, respectively) up to the first instruction
of the <capsules::console::Console as kernel::Driver>::command method, a
total of 530 instructions are executed on a rv32imc target.

Notably, a rather large portion (123 * 2 = 246 instructions, ~46%) of
the traced system call is caused by calls to memcpy, which performs a
byte-wise copy of a struct. It turns out that these two invocations are
copying the Syscall struct, which is 5 words / 20 bytes large. These
calls are caused by the invocation of `process.debug_syscall_called(...)`
in `sched.rs:755`.

Indeed, commenting out the line in question reduces the number of
instructions required for routing the system call to 242. That probably
still leaves room for improvement, but it is a whopping ~2.2x improvement
over the current situation (in terms of instruction count only).

This naturally raises some questions. For instance, why couldn't Rust
(given it knows the size of the struct) resort to a more efficient
method of copying the data? I imagine something along the lines of

lw x14,0(x11)
sw x14,0(x13)
lw x14,4(x11)
sw x14,4(x13)
...

This had me puzzled for quite a bit, but the answer is very simple: the
opt-level "z" used in Tock does not trigger this optimization. Indeed,
with opt-level 3, the Rust compiler manages to perform the entire system
call routing in 233 instructions, _with_ Process::debug_syscall_called
enabled. The generated assembly is even further optimized than the
handwritten ASM above, with the variant discriminant of Syscall being
loaded from an immediate.

A further question is: why do we copy the struct twice in the first
place? Currently we move the struct (which implements the Copy trait)
into Process::debug_syscall_called, which then passes it further into a
closure; that could explain a single copy. Writing it to the target
location in the kernel's process memory explains the second copy.
However, even if we pass it by reference and remove the Copy trait
(cloning it explicitly once, when writing to the target memory in
Process::debug_syscall_called), the compiler will still memcpy it twice,
presumably via the stack. This seems unnecessary.
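
To make the pattern concrete, here is a simplified, hypothetical sketch
(FakeSyscall, FakeProcess and with_debug are illustrative stand-ins, not
the actual kernel types):

use core::cell::Cell;

// Hypothetical stand-in for the 5-word / 20-byte Syscall enum.
#[derive(Clone, Copy)]
struct FakeSyscall {
    words: [usize; 5],
}

struct FakeProcess {
    // Final location where the last system call is recorded for debugging.
    last_syscall: Cell<Option<FakeSyscall>>,
}

impl FakeProcess {
    // Variant 1: take the struct by value. It is moved into this method and
    // then into the closure before being written to its final location.
    fn debug_syscall_called(&self, syscall: FakeSyscall) {
        self.with_debug(|p| p.last_syscall.set(Some(syscall)));
    }

    // Variant 2: take the struct by reference and clone it exactly once,
    // when writing it to the final location. (In the experiment described
    // above, the Copy derive was removed as well.)
    fn debug_syscall_called_by_ref(&self, syscall: &FakeSyscall) {
        self.with_debug(|p| p.last_syscall.set(Some(syscall.clone())));
    }

    fn with_debug<F: FnOnce(&Self)>(&self, f: F) {
        f(self)
    }
}

As described above, at opt-level=z both variants still end up invoking
memcpy twice.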

Ultimately, this seems like an important issue to solve, given its
rather high cost per system call, which affects users of Tock right now.
I don't, however, know how this issue can be resolved while still
allowing us to capture the last system call as part of a panic message.

There is the promising feature of overriding the profile (including
opt-level) for specific crates in a workspace. Unfortunately, generic
code of a crate is only compiled when all types are known (s.t. it can
be monomorphized) and hence an opt-level=3 in the kernel does not apply
to the kernel_loop function. Adding an opt-level=3 to the board crate
does trigger the desired optimizations, but also optimizes a whole bunch
of other generic code (from capsules, etc.) according to this level and
hence causes a more drastic increase in text size. The numbers can be
seen in [5].
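
For reference, the per-crate profile override looks roughly like the
following in the workspace Cargo.toml (a sketch only; the crate names
are illustrative and the exact placement depends on the workspace
layout):

# Keep the default opt-level = "z" for everything, but compile the
# kernel crate with opt-level = 3. As described above, this does not
# affect generic kernel code that is only monomorphized in the board
# crate (e.g. kernel_loop).
[profile.release.package.kernel]
opt-level = 3

# Overriding the board crate instead does trigger the optimization, but
# also compiles all other generic code (capsules, etc.) for speed,
# increasing the text size.
[profile.release.package.litex_sim]
opt-level = 3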

I'd be happy about ideas on what could be causing this, how we can find
ways around this behavior, and thoughts on this issue. If anyone wants
to test this for themselves, I'm happy to help get started with LiteX
and the simulation.

Leon


[1]: https://github.com/tock/TockBenchmarking/blob/main/reports/final.pdf
[2]: https://github.com/tock/tock/tree/master/boards/litex/sim
[3]: https://gist.github.com/lschuermann/31f67f6f76fbca1fdac1b61262b0042e#file-console_putstr_syscall-log
[4]: https://gist.github.com/lschuermann/31f67f6f76fbca1fdac1b61262b0042e
[5]: https://gist.github.com/lschuermann/31f67f6f76fbca1fdac1b61262b0042e#file-binary_sizes_opt_levels-txt

Philip Levis

Mar 14, 2021, 6:43:33 PM3/14/21
to le...@is.currently.online, Tock Embedded OS Development Discussion, gmar...@ucsd.edu, ppan...@ucsd.edu, am...@amitlevy.com
Leon,

This is great. Thank you for the very detailed look at the assembly.

I think that the transition from alpha1 to rc1 is going to involve a bunch of work like this, to improve computational efficiency. I think we will also want to take a very close look at code size. For a long time we’ve been pretty lax on this due to our platforms having lots of flash space, but as we start looking at more complex applications and multiple (A/B) images, our time of free lunch is coming to an end. :)

We’ve put lots of debugging hooks into the kernel, which is great for development but a killer when it comes to efficiency. For example, the hard fault debug dump — I can’t count the number of times the extremely rich data it prints out has saved me hours of debugging. But the formatting string alone is, I believe, ~2kB. We should start to think about devel vs. deploy builds.

Phil
———————
Philip Levis (he/him)
Associate Professor, Computer Science and Electrical Engineering
Faculty Director, lab64 Maker Space
Stanford University
http://csl.stanford.edu/~pal

Hudson Randal Ayers

Mar 15, 2021, 9:14:48 PM3/15/21
to Leon Schuermann, Tock Embedded OS Development Discussion, gmar...@ucsd.edu, ppan...@ucsd.edu, am...@amitlevy.com, p...@cs.stanford.edu
Thanks for the thorough investigation, Leon! Regarding the extra memcpy, it may actually be worth trying on the latest nightly to see if that still occurs. As part of the upgrade to LLVM 12, Rust was able to move to a new algorithm for optimizing out memcpy calls across blocks (see https://github.com/rust-lang/rust/pull/82806) which may resolve this for us -- though I do not know if opt-level=z will still cause problems.

Leon Schuermann

Mar 16, 2021, 3:59:55 AM3/16/21
to Hudson Randal Ayers, gmar...@ucsd.edu, ppan...@ucsd.edu, am...@amitlevy.com, p...@cs.stanford.edu, Tock Embedded OS Development Discussion

Hudson Randal Ayers <hay...@stanford.edu> writes:
> Regarding the extra memcpy, it may actually be worth trying on the
> latest nightly to see if that still occurs. As part of the upgrade to
> LLVM 12, Rust was able to move to a new algorithm for optimizing out
> memcpy calls across blocks (see
> https://github.com/rust-lang/rust/pull/82806) which may resolve this
> for us -- though I do not know if opt-level=z will still cause
> problems.

I just tried with rustc 1.52.0-nightly (d6eaea1c8 2021-03-14), opt-level
= z for the entire workspace. Unfortunately, these optimizations do not
seem to apply in this case. I still see two memcpy invocations, although
the instruction count has changed somewhat, going down from 530 to 518
instructions for the system call routing. The text section increased
slightly as well, going from 69381 to 70569 bytes.

Given that we can trigger optimizations (using opt-level=3) which remove
these memcpy operations entirely (just reconstructing the struct from
information on the stack / immediates) and use word-size operations --
which, as far as I can tell, is as efficient as we can get here -- I
don't _think_ this is an issue of LLVM not being able to optimize this
code. Rather, it's the tradeoff between runtime performance and code
size and the natural impact of opt-level=z, though arguably in this case
we'd rather have the +20 or so bytes of text than the ~240 extra
instructions.

Leon Schuermann

Mar 16, 2021, 12:35:22 PM3/16/21
to Philip Levis, Tock Embedded OS Development Discussion, gmar...@ucsd.edu, ppan...@ucsd.edu, am...@amitlevy.com

Philip Levis <p...@cs.stanford.edu> writes:
> I think that the transition from alpha1 to rc1 is going to involve a
> bunch of work like this, to improve computational efficiency. I think
> we will also want to take a very close look at code size. For a long
> time we’ve been pretty lax on this due to our platforms having lots of
> flash space, but as we start looking at more complex applications and
> multiple (A/B) images, our time of free lunch is coming to an end. :)

I agree, this is very important. On the bright side of things, at least
for Command-style system calls (which I've been analyzing so far; I will
shift the focus to allow and subscribe next), the changes on
tock-2.0-dev don't actually seem to have a very significant runtime
performance impact: we went from 524 instructions on master to 530
instructions on tock-2.0-dev.

For me, this at least in part validates that the changes to the ABI
bring benefits _and_ don't cause a major performance impact (as
predicted). I imagine for allow & subscribe there's a larger performance
impact, but that might be unavoidable due to the additional safety
guarantees and checks we're performing. Tock 2.0 resolves some major
issues with the previous way of doing things after all, a significant
part being reducing the undefined behavior w.r.t. AppSlices etc.

> We’ve put lots of debugging hooks into the kernel which is great for
> development but a killer when it comes to efficiency. For, example,
> the hard fault debug dump — I can’t count the number of times the
> extremely rich data it prints out has saved me hours of debugging. But
> the formatting string alone is I believe ~2kB. We should start to
> think about devel vs. deploy builds.

Yes, this has hit me a number of times already, both in the positive and
the negative sense. As far as I'm concerned, having more fine-grained
control over the tradeoff between the information available for
debugging and better runtime performance is desperately needed,
especially for applications where we're at the limit of the
microcontroller's capabilities. Flash size is also a valid concern, of
course.

Because of the vastly different application areas people might use Tock
in, might it make sense to support (or at least document) optimizations
with an emphasis on either flash size or runtime performance? For
instance, it might make sense to "officially support" opt-level=3 builds
for areas where flash space is no concern whatsoever, but reacting to
time-critical events or achieving high throughput is. The additional
optimizations might benefit users.

Philip Levis

Mar 16, 2021, 8:54:39 PM3/16/21
to le...@is.currently.online, Tock Embedded OS Development Discussion, gmar...@ucsd.edu, ppan...@ucsd.edu, am...@amitlevy.com


On Mar 16, 2021, at 3:20 AM, Leon Schuermann <le...@is.currently.online> wrote:


Because of the vastly different application areas people might use Tock
in, might it make sense to support (or at least document) optimizations
with an emphasis on either flash size or runtime performance? For
instance, it might make sense to "officially support" opt-level=3 builds
for areas where flash space is no concern whatsoever, but reacting to
time-critical events or achieving high throughput is. The additional
optimizations might benefit users.

My experience has been that these two do not need to be too much at odds, because shorter instruction streams are both faster and smaller. The two major tension points are inlining and loop unrolling. Loop unrolling is not that big a deal for us: we don’t have extremely tight computational kernels that dominate execution time. Inlining, though, can make a difference, especially if there are lots of tiny functions that are called from many places. My sense — which could be completely off — is that Tock code doesn’t look like this, except perhaps for stuff like ::cmp or writer calls in debug. However, compiled code (e.g., with invocations of memcpy) could trigger this pretty badly.

I’m wondering if the debugging stabs have enough information to pull this apart, e.g., whether we can see where things are being inlined.

Phil