The recent presentation by Gabriel Marcano (CCed) about their report
titled "Benchmarking the Tock operating system for embedded
platforms"[1] has inspired me to take a deeper look into what exactly my
CPU does when a process issues a `Command`-style system call in Tock. :)
For testing this, I have used the LiteX SoC platform with the VexRiscv
RISC-V CPU that Tock has been ported to. LiteX features a simulated
hardware target ("litex_sim.py") that can emulate a similar SoC to what
would run on a real FPGA using Verilator. Combined with the Ibex tracer
module that plugs into the RISC-V formal interface (RVFI) of a CPU, and
a custom VexRiscv CPU build that has the RVFI enabled, we can see every
instruction being executed. For more info, see [2].
The results presented here have been obtained using
- rustc 1.51.0-nightly (c2de47a9a 2021-01-06)
- tock-2.0-dev commit f4ee443f1d26c7 ("Merge pull request #2478 ...")
with `litex_sim` changed from rv32i to rv32imc
The annotated and formatted instruction traces[3], full instruction traces,
tested ELF and disassembly can be found under [4].
While I could not reproduce the exact numbers from the referenced report
(likely because of the vastly different testing methodology), the number
of instructions required for routing a system call from a process to a
capsule is indeed unexpectedly large: from the userspace ecall (or the
_start_trap kernel symbol, respectively) up to the first instruction of
the <capsules::console::Console as kernel::Driver>::command method, a
total of 530 instructions are executed on an rv32imc target.
Notably, a rather large portion (123 * 2 = 246 instructions, ~46%) of
the traced system call path is caused by two calls to memcpy, which
performs a byte-wise copy of a struct. It turns out that these two
invocations are copying the `Syscall` struct, which is 5 words / 20
bytes large. The calls are caused by the invocation of
`process.debug_syscall_called(` in `sched.rs:755`.
Indeed, commenting out the line in question reduces the instructions
required for routing the system call to 242. That probably still leaves
room for improvement, but it is a whopping ~2.2x improvement over the
current situation (regarding instruction count only).
This naturally raises some questions. For instance, why couldn't Rust
(given it knows the size of the struct) resort to a more efficient
method of copying the data? I imagine something along the lines of:

    lw x14,0(x11)
    sw x14,0(x13)
    lw x14,4(x11)
    sw x14,4(x13)
    ...
This had me puzzled for quite a while, but the answer is very simple:
the opt-level "z" used in Tock does not trigger this optimization.
Indeed, with opt-level 3 the Rust compiler manages to perform the entire
system call routing in 233 instructions, _with_
`Process::debug_syscall_called` enabled. The generated assembly is even
further optimized than the handwritten assembly above, with the variant
discriminant of `Syscall` being loaded from an immediate.
A further question is: why do we copy the struct twice in the first
place? Currently we move the struct (which implements the Copy trait)
into Process::debug_syscall_called, which passes it further into a
closure; that could explain a single copy. Writing it to the target
location in the kernel's process memory explains the second copy.
However, even if we pass it by reference and remove the Copy trait
(cloning it explicitly once, when writing to the target memory in
Process::debug_syscall_called), the compiler will still memcpy it twice,
presumably over the stack. This seems unnecessary.
Ultimately, this seems like an important issue to solve, given its high
cost per system call and the fact that it affects users of Tock right
now. However, I don't know how this issue can be resolved while still
allowing us to capture the last system call as part of a panic message.
There is the promising feature of overriding the profile (including the
opt-level) for specific crates in a workspace. Unfortunately, the
generic code of a crate is only compiled once all types are known (such
that it can be monomorphized), and hence an opt-level of 3 in the kernel
crate does not apply to the kernel_loop function. Adding opt-level = 3
to the board crate does trigger the desired optimizations, but also
optimizes a whole bunch of other generic code (from capsules, etc.) at
this level and hence causes a more drastic increase in text size. The
numbers can be seen in [5].
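For anyone who wants to reproduce this, the per-crate override I experimented with looks roughly like the following (a sketch using Cargo's profile-overrides feature; the package names are assumptions based on the Tock workspace layout, and as described above the kernel-only override does not reach the monomorphized kernel_loop):

```toml
# In the workspace-level Cargo.toml:

# Raising the opt-level for the kernel crate alone -- does NOT apply to
# generic code monomorphized in downstream crates.
[profile.release.package.kernel]
opt-level = 3

# Raising it for a board crate instead triggers the optimization, at
# the cost of also optimizing all other generic code (capsules, etc.)
# at this level, increasing text size.
[profile.release.package.litex_sim]
opt-level = 3
```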
I'd be happy to hear ideas on what could be causing this, how we can
work around this behavior, and any other thoughts on this issue. If
anyone wants to test this for themselves, I'm happy to help you get
started with LiteX and the simulation.
Leon
[1]:
https://github.com/tock/TockBenchmarking/blob/main/reports/final.pdf
[2]:
https://github.com/tock/tock/tree/master/boards/litex/sim
[3]:
https://gist.github.com/lschuermann/31f67f6f76fbca1fdac1b61262b0042e#file-console_putstr_syscall-log
[4]:
https://gist.github.com/lschuermann/31f67f6f76fbca1fdac1b61262b0042e
[5]:
https://gist.github.com/lschuermann/31f67f6f76fbca1fdac1b61262b0042e#file-binary_sizes_opt_levels-txt