mtimecmp memory mapped register


Joel Vandergriendt

unread,
Oct 18, 2016, 6:04:41 PM10/18/16
to isa...@groups.riscv.org

I asked this same question shortly after priv-1.9 came out; perhaps by now more people have read the spec and someone can answer me. I’d like some more explanation of why the mtimecmp register is a memory-mapped register rather than a CSR.

The spec says :
The timer compare register mtimecmp is provided as a memory-mapped register rather than a CSR, as it must live in the same voltage/frequency domain as the real-time clock, and is accessed much less frequently than the timer value itself.

Just before this it says regarding the mtime register: We assume the RTC will be exposed via memory-mapped registers at the platform level, and so one microarchitectural implementation strategy is to convert reads of the mtime CSR into uncached reads of the platform RTC, reusing the existing memory-mapped path to the RTC

I don’t understand why the mtimecmp register can’t also be a CSR, and use the same implementation strategy. Is it not preferable that mtime and mtimecmp are accessed in a consistent manner?

Jacob Bachmeyer

unread,
Oct 19, 2016, 12:07:08 AM10/19/16
to Joel Vandergriendt, isa...@groups.riscv.org
Joel Vandergriendt wrote:
>
> I asked this same question shortly after the priv-1.9 came out,
> Perhaps by now more people have read the spec and someone can answer
> me. I’d like some more explanation on why the mtimecmp register is in
> a memory mapped register rather than a csr.
>
> The spec says :
> /The timer compare register mtimecmp is provided as a memory-mapped
> register rather than a CSR, as it must live in the same
> voltage/frequency domain as the real-time clock, and is accessed much
> less frequently than the timer value itself./
>
> Just before this it says regarding the mtime register: /We assume the
> RTC will be exposed via memory-mapped registers at the platform level,
> and so one microarchitectural implementation strategy is to convert
> reads of the mtime CSR into uncached reads of the platform RTC,
> reusing the existing memory-mapped path to the RTC/
>
> I don’t understand why the mtimecmp register can’t also be a CSR, and
> use the same implementation strategy. Is it not preferable that mtime
> and mtimecmp are accessed in a consistent manner?
>

Another potential problem comes from having per-hart mtimecmp registers
in the RTC. The RTC is a shared resource, and being able to re-use the
same RTC module across a variety of systems (with varying numbers of
sockets/cores/harts) sounds to me like something we should not discard
lightly. Except for SoC implementations, the RTC itself is likely to be
in a separate chip. (PCs have done this for decades, although some of
the latest models seem to be moving towards SoC-like designs.)

The stated rationale for this arrangement in RISC-V is to reduce costs,
since access to the RTC may cross voltage and clock domains, but I think
that that has been insufficiently considered. The draft does not
require that mtime count time since any particular event, only that it
count at a constant and documented rate. As such, mtime can reasonably
count the slowest clock (say, 32.768kHz) in the system, and therefore be
able to keep count regardless of voltage scaling on the main processor.
(Please correct me if this is not so.)

Since mtime counts from an unspecified point in the past, I suggest that
mtime should be a per-core resource (identical across all harts on the
same core) cleared at power-on reset and mtimecmp should be a per-hart
CSR. The mtime CSR would count asynchronously, using a timebase signal
brought in from an external RTC. This would reduce the costs of
level-shifting, since only the unidirectional RTC timebase signal must
be carried across voltage domains. Splitting mtime into (hardware)
count and latch registers with a simple state machine solves the clock
domain crossing problem, assuming that the system clock is much faster
than the RTC timebase: the counter increments on a rising edge from the
RTC, the latch loads the count value (once; latch control is reset when
the RTC timebase is high for two system clock cycles) on an edge of the
system clock if the RTC timebase was low during the last two system
clock cycles, and all reads from the core read the latch value. The
latch value is also driven onto an internal bus for simultaneous
comparison with all mtimecmp CSRs to produce per-hart timer interrupt
signals.
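As a behavioral sketch (not HDL; class and signal names here are illustrative, not from any spec), the count/latch scheme above might look like the following, assuming the system clock samples a much slower RTC timebase:

```python
# Behavioral model of the count/latch clock-domain-crossing scheme:
# the counter increments on an RTC rising edge, the latch loads the
# count once per RTC-low period (re-armed after RTC has been high for
# two system clocks), and all core reads see the latch.

class TimeLatch:
    def __init__(self):
        self.count = 0      # increments on each RTC timebase rising edge
        self.latch = 0      # stable copy read by the core
        self.prev = [0, 0]  # last two sampled RTC timebase levels
        self.armed = True   # latch may load once per RTC-low period

    def tick(self, rtc_level):
        """Advance one system clock cycle, sampling the RTC timebase."""
        if self.prev[-1] == 0 and rtc_level == 1:
            self.count += 1                 # RTC rising edge detected
        if self.prev == [1, 1]:
            self.armed = True               # re-arm while RTC is high
        if self.prev == [0, 0] and self.armed:
            self.latch = self.count         # load once while RTC is low
            self.armed = False
        self.prev = [self.prev[1], rtc_level]

    def read(self):
        return self.latch

tl = TimeLatch()
for level in [0, 0, 0, 0, 1, 1, 1, 1] * 3:   # three slow RTC periods
    tl.tick(level)
print(tl.count, tl.read())   # 3 2
```

The latch lags the counter by design: it only updates while the RTC timebase is stably low, which is what makes the read side safe despite the asynchronous increment.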

The above suggestion puts the timer register in the RISC-V core, which
means that all of the associated switching activity will be on the CPU
power rail. This should reduce energy usage, however, since the CPU
will probably be fabricated on the most advanced process used for the
system and will have the lowest dynamic power dissipation per-toggle.
On the other hand, this approach also means that a multi-core or
multi-socket system will have duplicate mtime CSRs, all synchronized
from the same RTC, but all drawing dynamic power. Multi-core systems
can employ other methods (an obvious solution is to have one common
timer register and simply route the aforementioned bus to every core in
the package, but this may require crossing clock domains when the "time
bus" enters each core), and multi-socket systems are likely to be
sufficiently high-performance that the cost of duplicating mtime is lost
in the noise.

The approach outlined above requires that mtimecmp be a CSR, since it
would be tightly coupled to each hart. An implementation using
memory-mapped mtime/mtimecmp can easily convert the CSR reads to
memory-mapped RTC access (using an M-mode trap where special hardware is absent),
but specifying mtimecmp as memory-mapped all but precludes the approach
outlined above, since hardware to map a CSR into the memory address
space is much more complicated.
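A toy model of that trap-based conversion, written as a sketch: the time CSR number (0xC01) and the CSR instruction field layout are from the ISA, while the CLINT-style mtime address and all the handler names are illustrative assumptions.

```python
# Toy model: a read of the time CSR traps as an illegal instruction in
# M-mode, and the handler services it by reading the memory-mapped RTC.

TIME_CSR = 0xC01
MTIME_ADDR = 0x0200_BFF8       # hypothetical CLINT-style mtime address

regs = [0] * 32                # x0..x31
mmio = {MTIME_ADDR: 123456789} # simulated memory-mapped RTC
mepc = 0x8000_0000

def handle_illegal(instr):
    """Emulate `csrr rd, time` by redirecting to the memory-mapped RTC."""
    global mepc
    opcode = instr & 0x7F           # SYSTEM opcode is 0x73
    csr = (instr >> 20) & 0xFFF     # CSR number in instr[31:20]
    rd = (instr >> 7) & 0x1F
    if opcode == 0x73 and csr == TIME_CSR:
        if rd != 0:
            regs[rd] = mmio[MTIME_ADDR]   # uncached MMIO read of mtime
        mepc += 4                          # skip the emulated instruction
        return True
    return False                           # not ours: a real illegal instr

# `csrr a0, time` encodes as csrrs x10, 0xC01, x0
instr = (TIME_CSR << 20) | (0 << 15) | (0b010 << 12) | (10 << 7) | 0x73
assert handle_illegal(instr)
print(regs[10])   # 123456789
```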


In short, I agree that mtimecmp should be a CSR.


-- Jacob

Monte Dalrymple

unread,
Oct 19, 2016, 12:58:18 AM10/19/16
to RISC-V ISA Dev, jo...@vectorblox.com, jcb6...@gmail.com
This is how I would implement it. The only wrinkle is the case where
the system might want to power-down the hart, but keep mtime
counting, perhaps to be used to wake up. But this is a special case
that would require some unique implementation details, and doesn't
seem to be the primary use case anyway.
  

Samuel Falvo II

unread,
Oct 19, 2016, 1:10:26 AM10/19/16
to Monte Dalrymple, RISC-V ISA Dev, Joel Vandergriendt, Jacob Bachmeyer
On Tue, Oct 18, 2016 at 9:58 PM, Monte Dalrymple <mon...@systemyde.com> wrote:
> This is how I would implement it. The only wrinkle is the case where
> the system might want to power-down the hart, but keep mtime
> counting, perhaps to be used to wake up. But this is a special case
> that would require some unique implementation details, and doesn't
> seem to be the primary use case anyway.

In the current design of the Kestrel's CPU, I don't even implement
mtime (value hardwired to zero). I see no practical reason for its
existence. If we're going to perform memory-mapped access to an
interrupt source, the timer itself can reside in that same
memory-mapped resource, just as it's always been since the days of the
Z80 and 6502. :)

I do implement minstret and the cycle counter (the name escapes me at
the moment), though; those are more easily implemented and are
relevant for actual profiling work.

--
Samuel A. Falvo II

Stefan O'Rear

unread,
Oct 19, 2016, 1:17:04 AM10/19/16
to Samuel Falvo II, Monte Dalrymple, RISC-V ISA Dev, Joel Vandergriendt, Jacob Bachmeyer
On Tue, Oct 18, 2016 at 10:10 PM, Samuel Falvo II <sam....@gmail.com> wrote:
> On Tue, Oct 18, 2016 at 9:58 PM, Monte Dalrymple <mon...@systemyde.com> wrote:
>> This is how I would implement it. The only wrinkle is the case where
>> the system might want to power-down the hart, but keep mtime
>> counting, perhaps to be used to wake up. But this is a special case
>> that would require some unique implementation details, and doesn't
>> seem to be the primary use case anyway.
>
> In the current design of the Kestrel's CPU, I don't even implement
> mtime (value hardwired to zero). I see no practical reason for its

Y'all know mtime has been removed right?
https://github.com/riscv/riscv-opcodes/commit/c86d2ee8

(it appears that all privilege levels are now expected to use the
U-mode "time" CSR for reads. It makes sense IMO to have an abstract
CSR for U-mode because U-mode isn't generally going to have access to
MMIO resources; I'm still eagerly awaiting the U-mode version of misa)

-s

Monte Dalrymple

unread,
Oct 19, 2016, 1:32:48 AM10/19/16
to RISC-V ISA Dev, sam....@gmail.com, mon...@systemyde.com, jo...@vectorblox.com, jcb6...@gmail.com



> Y'all know mtime has been removed right?
> https://github.com/riscv/riscv-opcodes/commit/c86d2ee8
>
> (it appears that all privilege levels are now expected to use the
> U-mode "time" CSR for reads. It makes sense IMO to have an abstract
> CSR for U-mode because U-mode isn't generally going to have access to
> MMIO resources; I'm still eagerly awaiting the U-mode version of misa)
>
> -s

My bad. Where I come from the spec comes first, then the implementation.
I've been designing to the spec...

Monte
 

Samuel Falvo II

unread,
Oct 19, 2016, 3:00:02 AM10/19/16
to Stefan O'Rear, Monte Dalrymple, RISC-V ISA Dev, Joel Vandergriendt, Jacob Bachmeyer
On Tue, Oct 18, 2016 at 10:17 PM, Stefan O'Rear <sor...@gmail.com> wrote:
> Y'all know mtime has been removed right?

No, I did not know. I knew there was some discussion *at one time* on
this list, but I'd never seen anything further on the topic.

Michael Clark

unread,
Oct 19, 2016, 7:36:44 AM10/19/16
to jcb6...@gmail.com, Joel Vandergriendt, isa...@groups.riscv.org



Sent from my iPhone
On 19/10/2016, at 5:07 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

In short, I agree that mtimecmp should be a CSR.

Also interesting from the perspective of a tiny implementation without a full-blown PLIC, one that simply wants timer interrupts for the kernel and maybe software-injected interrupts for virtio (net, block, framebuffer, input). The framebuffer needs outbound interrupts for mode setting; keyboard and mouse are inbound interrupts.

Like fromhost/tohost with MMIO regions, where the CSR values contain a message-signalled interrupt and acknowledge (a device routing number for inbound/outbound interrupt signalling and an acknowledgement bit).

I'll have to read up on the PLIC.

Rusty's documents on virtio are interesting. From what I can remember, some aspects of emulating physical hardware are easy (MMIO regions for buffers) and some are harder (MMIO regions for interrupts), i.e. indirect page faults on a virtual-from-physical page in the host (with write-combine issues), versus a register in the CSR space. The buffer case has different synchronisation than the interrupt case.

My memory of this is pretty vague but I am sure there are some aspects of virtual hardware that degrade when emulating a typical physical hardware model. I think it's interrupts. The MMIO region for control vs data transfer may be slower to emulate as a page fault than as a CSR read/write?

I can try and find a reference this weekend... maybe page faults for PLIC emulation is the easiest. Every write from the guest is going to have to fault in the host, whereas the other MMIO regions are just shared mappings.

Wrapping IPI in SBI is a good idea, as an implementation that benefits from a CSR can hide it. The CSR signal is much more fine-grained than checking whether a set of host addresses are signals, namely due to the 12-bit address space (a table lookup is order 1, as the table size is practical) versus XLEN minus the page shift (a hash lookup is order 1, as a flat table is not practical at that size).
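The size argument can be made concrete with a sketch. Only the 12-bit CSR address space and the `time` CSR number are from the spec; the mtimecmp address and handler names are illustrative:

```python
# Dispatching accesses: CSR numbers are 12 bits wide, so a flat table of
# 4096 slots is practical; physical MMIO addresses span XLEN - PAGE_SHIFT
# bits of page numbers, so a flat table is impossible and a hash map
# (keyed by page number) is the natural structure.

XLEN = 64
PAGE_SHIFT = 12

csr_handlers = [None] * (1 << 12)     # direct-indexed, order 1
mmio_handlers = {}                    # hashed, also order 1, but not flat

def register_csr(addr, handler):
    csr_handlers[addr] = handler

def register_mmio(phys_addr, handler):
    mmio_handlers[phys_addr >> PAGE_SHIFT] = handler

register_csr(0xC01, lambda: "time")             # 0xC01 is the time CSR
register_mmio(0x0200_4000, lambda: "mtimecmp")  # hypothetical address

print(len(csr_handlers))                         # 4096
print(csr_handlers[0xC01]())                     # time
print(mmio_handlers[0x0200_4000 >> PAGE_SHIFT]())  # mtimecmp
```

A flat MMIO table would need 2**(64 - 12) slots, which is why the two spaces end up with different lookup structures despite both being constant-time.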

In the virtual case the MMIO for interrupt signalling needs backing page (4K granularity) for each PLIC whereas CSR requires only the state space of the PLIC signalling and control registers.

I don't know if it is a burden to maintain two models, one optimised for physical and the other optimised for virtual.

There is a PVHVM model which is a hybrid of Type 1 and Type 2 and can use PV constructs where they are faster than hardware emulation.

KVM is full hardware virtualisation and may be slower in some cases than PV and faster in others.

Virtual keyboard, touch, mouse, framebuffer, net and block devices... exist. There are the goldfish devices for Android, IIRC.

Jacob Bachmeyer

unread,
Oct 19, 2016, 7:32:27 PM10/19/16
to Michael Clark, Joel Vandergriendt, isa...@groups.riscv.org
Michael Clark wrote:
> On 19/10/2016, at 5:07 PM, Jacob Bachmeyer <jcb6...@gmail.com
> <mailto:jcb6...@gmail.com>> wrote:
>
>> In short, I agree that mtimecmp should be a CSR.
>
> Also interesting from the perspective of a tiny implementation without
> a full blown PLIC, that simple wants timer interrupts for the kernel
> and maybe software injected interrupts for virtio (net, block,
> framebuffer, input). framebuffer needs outbound interrupts for mode
> setting, keyboard and mouse are inbound interrupts.

The PLIC only handles external interrupts; software and timer interrupts
are internal to a core.

> Like fromhost/tohost with MMIO regions and the CSR values contain
> message signalled interrupt and acknowledge (device routing number for
> inbound outbound interrupt signalling and an acknowledgement bit)
> [...]
> Rusty's documents on virtio are interesting. From what I can remember,
> some aspects of emulating physical hardware hardware are easy (MMIO
> regions for buffers) and some are harder (MMIO regions for
> interrupts). i.e. Indirect page faults on a virtual from physical page
> in the host (with write combine issues), versus a register in the CSR
> space. The buffer case has different synchronisation than the
> interrupt case.
>
> My memory of this is pretty vague but I am sure there are some aspects
> of virtual hardware that degrade when emulating a typical physical
> hardware model. I think it's interrupts. The MMIO region for control
> vs data transfer may be slower to emulate as a page fault than as a
> CSR read/write?
>
> I can try and find a reference this weekend... maybe page faults for
> PLIC emulation is the easiest. Every write from the guest is going to
> have to fault in the host whereas the the other MMIO regions are just
> shared mappings.
>
> Wrapping IPI in SBI is a good idea as an implementation that benefits
> from a CSR can hide it. The CSR register signal is much more fine
> grained than checking a set of host addresses are signals. Namely due
> to the 12-bit address space (table is order 1 as the size is
> practical) versus XLEN - page shift (hash lookup is order 1 in the
> size domain as a table is not practical due to size).

Indeed, it may be possible to put the CSR write in the SBI page and
avoid trap overhead.

> In the virtual case the MMIO for interrupt signalling needs backing
> page (4K granularity) for each PLIC whereas CSR requires only the
> state space of the PLIC signalling and control registers.
>
> I don't know if it is a burden to maintain two models, one optimised
> for physical and the other optimised for virtual.

Yes, especially if the physical model can be implemented behind the
virtual interface using SBI.

> There is a PVHVM model which is a hybrid of Type 1 and Type 2 and can
> use PV constructs where they are faster than hardware emulation.

I think that this is the purpose of SBI in RISC-V, except that physical
hardware *also* supports the PV constructs.

> KVM is full hardware virtualisation and may be slower in some cases
> than PV and faster in others.
>
> virtual keyboard, touch, mouse, framebuffer, net and block... exist.
> There is the goldfish devices for android IIRC.
>
> https://android.googlesource.com/platform/external/qemu/+/master/docs/GOLDFISH-VIRTUAL-HARDWARE.TXT

I proposed an SBI virtio interface earlier (message-id
<57F2FF26...@gmail.com>) and I think I now understand why your
response to that proposal seemed odd to me: this is a different kind of
virtio. Instead of emulating hardware, which requires handling page
faults on RISC-V, the SBI virtio I proposed is a set of environment
calls from S-mode. The sbi_vio_* calls do not emulate hardware, unlike
goldfish devices which use an emulated I/O register interface. This is
why that proposal forbade monitors from presenting any physical hardware
as "pure virtio" devices. A bootloader or simple supervisor would use
sbi_vio_{attach,read_page,write_page,detach} SBI calls just as user code
uses {open,read,write,close} syscalls. The sbi_vio_read_meta_page()
call is a Plan9-ish extension to this model, used instead of a binary
"struct sbi_vio_stat" to avoid forward-compatibility issues. A
multi-tasking supervisor would use the asynchronous virtio interface,
accessed through the sbi_vio_start_io() call. That interface was
sparsely defined in my proposal because I am uncertain of how best to
define it and I was hoping to start a discussion on the list that would
illuminate the matter.

How exactly asynchronous I/O is implemented (or if it is
implemented--the sbi_vio_iovec entries could have an "in-progress" flag
that a stub sbi_vio_start_io() would never set) is left to the
implementation and could range from simple interrupt handlers all the
way to a standard interface for dedicated I/O accelerator hardware.
This last reason was my motivation for including sbi_vio_start_io()
because having a standard interface to whatever I/O accelerators are
made makes writing high-performance supervisors easier. Of necessity,
that interface must be fairly high-level, thus the split between block
I/O, cell-oriented stream I/O, and packet-oriented stream I/O. Avoiding
a proliferation of hardware that is gratuitously incompatible "just to
be different" (and require yet another driver) is a wise goal, in my
view. That proposal provided an abstraction over all I/O accelerators.


-- Jacob

Michael Clark

unread,
Oct 19, 2016, 8:54:24 PM10/19/16
to jcb6...@gmail.com, Joel Vandergriendt, isa...@groups.riscv.org
Hi Jacob,

It makes more sense now. However, the data transfer buffer itself would be mapped using sbi_device_mmap, taking a device tree node?

I think we need to work examples within the context of virtio for network, disk and framebuffer (I can reason easier when seeing concrete use of an API). Now the context is clear it will be easier.

Keyboard, Mouse, Touch, GPS can all be virtual UARTs, as the bandwidth doesn't require MMIO.

Now I'm thinking about OpenCL, and coprocessor DMA (unified and non-unified address space), but if it is MMIO, then there is no need to have SBI calls around writes to the buffer region, as MMIO implies unified address space. Signalling is the problem with virtual hardware? Is this correct? Or do we need buffer copies for some emulation models? (e.g. buffer is locked during transfer).

In any case I understand better now. I'll review your earlier email this weekend...

Cheers,
Michael

Sent from my iPhone

Jacob Bachmeyer

unread,
Oct 20, 2016, 12:39:25 AM10/20/16
to Michael Clark, Joel Vandergriendt, isa...@groups.riscv.org
Michael Clark wrote:
>> On 20/10/2016, at 12:32 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>
>> Michael Clark wrote:
>>
>>> KVM is full hardware virtualisation and may be slower in some cases than PV and faster in others.
>>>
>>> virtual keyboard, touch, mouse, framebuffer, net and block... exist. There is the goldfish devices for android IIRC.
>>>
>>> https://android.googlesource.com/platform/external/qemu/+/master/docs/GOLDFISH-VIRTUAL-HARDWARE.TXT
>>>
>> I proposed an SBI virtio interface earlier (message-id <57F2FF26...@gmail.com>) and I think I now understand why your response to that proposal seemed odd to me: this is a different kind of virtio. Instead of emulating hardware, which requires handling page faults on RISC-V, the SBI virtio I proposed is a set of environment calls from S-mode. The sbi_vio_* calls do not emulate hardware, unlike goldfish devices which use an emulated I/O register interface. This is why that proposal forbade monitors from presenting any physical hardware as "pure virtio" devices. A bootloader or simple supervisor would use sbi_vio_{attach,read_page,write_page,detach} SBI calls just as user code uses {open,read,write,close} syscalls. The sbi_vio_read_meta_page() call is a Plan9-ish extension to this model instead of having a binary "struct sbi_vio_stat" to avoid forwards compatibility issues. A multi-tasking supervisor would use the asynchronous virtio interface, accessed through the sbi_vio_start_io() call. That interface was sparsely defined in my proposal because I am uncertain of how best to define it and I was hoping to start a discussion on the list that would illuminate the matter.
>>
>> How exactly asynchronous I/O is implemented (or if it is implemented--the sbi_vio_iovec entries could have an "in-progress" flag that a stub sbi_vio_start_io() would never set) is left to the implementation and could range from simple interrupt handlers all the way to a standard interface for dedicated I/O accelerator hardware. This last reason was my motivation for including sbi_vio_start_io() because having a standard interface to whatever I/O accelerators are made makes writing high-performance supervisors easier. Of necessity, that interface must be fairly high-level, thus the split between block I/O, cell-oriented stream I/O, and packet-oriented stream I/O. Avoiding a proliferation of hardware that is gratuitously incompatible "just to be different" (and require yet another driver) is a wise goal, in my view. That proposal provided an abstraction over all I/O accelerators.
>> [quotes moved to top]
>
> It makes more sense now. However the data transfer buffer itself would be mapped using sbi_device_mmap ? taking a device tree node?
>

The data transfer buffers are allocated by the supervisor in S-mode
address space and are page-aligned. The SEE is given a pointer in the
synchronous calls (all of which read or write exactly one 4KiB page),
and is given a base address and length in an sbi_vio_iovec entry.

> I think we need to work examples within the context of virtio for network, disk and framebuffer (I can reason easier when seeing concrete use of an API). Now the context is clear it will be easier.
>

That is a bit of a problem--stream devices require the
not-yet-fully-defined asynchronous interface. Disks would normally use
it as well, but a synchronous interface is provided for the benefit of
bootloaders and very simple supervisors that consider blocking I/O
acceptable.

> Keyboard, Mouse, Touch, GPS can all be virtual uart as the bandwidth doesn't require MMIO.
>

Yes and no. All SBI virtio in my proposal is MMIO, just as all I/O in
RISC-V is MMIO. These devices would *not* be virtual UARTs however--a
UART is a stream device that accepts and produces single-byte cells.
(Some implementations might offer two-byte cells for serial interfaces
with nine data bits.) A keyboard would produce cells, but they might be
two bytes if 16-bit scancodes are used. A mouse would likewise produce
cells, likely three bytes or longer, containing movement and button
updates. A touch device would probably produce cells, but multi-touch
screens may want to use variable-length packets. A GPS would either
produce fixed-length cells, if it reports using some binary protocol, or
variable-length packets containing NMEA reports.

All input from these devices would be written into circular buffers
provided by the supervisor. The supervisor may be required to
acknowledge input collected from those buffers by advancing a read
pointer, so the SEE (or hardware, as you mentioned below) can avoid
overwriting data that has not yet been processed.

Event notification for stream devices would consist of events for "new
data received" and "transmit space available", with configurable
thresholds for maximum input delay, how many input items to have before
firing the event, how much outgoing space is required to fire the event,
etc. The events are analogous to the FIFO interrupts on a 16550 UART.
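A minimal sketch of that producer/consumer discipline, assuming a supervisor-provided ring with a producer-advanced write pointer and a supervisor-advanced read pointer (the class name, sizes, and threshold policy are all illustrative):

```python
# The SEE (or hardware) writes into a supervisor-provided circular buffer
# and never overruns the supervisor's read pointer; a "new data received"
# event fires once a threshold of unacknowledged items accumulates.

class InputRing:
    def __init__(self, size, threshold):
        self.buf = [None] * size
        self.size = size
        self.wr = 0          # advanced by the producer (SEE/hardware)
        self.rd = 0          # advanced by the supervisor to acknowledge
        self.threshold = threshold

    def unread(self):
        return (self.wr - self.rd) % self.size

    def produce(self, item):
        """Returns True when the 'new data received' event should fire."""
        if (self.wr + 1) % self.size == self.rd:
            raise BufferError("would overwrite unacknowledged data")
        self.buf[self.wr] = item
        self.wr = (self.wr + 1) % self.size
        return self.unread() >= self.threshold

    def consume(self):
        item = self.buf[self.rd]
        self.rd = (self.rd + 1) % self.size   # acknowledge by advancing
        return item

ring = InputRing(size=8, threshold=3)
events = [ring.produce(scancode) for scancode in (0x1C, 0x32, 0x21)]
print(events)                 # [False, False, True]
print(hex(ring.consume()))    # 0x1c
```

The threshold plays the role of the FIFO trigger level on a 16550: the event is deferred until enough items accumulate, trading latency for fewer notifications.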

> Now I'm thinking about OpenCL, and coprocessor DMA (unified and non-unified address space), but if it is MMIO, then there is no need to have SBI calls around writes to the buffer region, as MMIO implies unified address space. Signalling is the problem with virtual hardware? Is this correct? Or do we need buffer copies for some emulation models? (e.g. buffer is locked during transfer).
>

It only requires that the SEE is able to arrange for the specific pages
allocated by the supervisor to be visible to the coprocessor.
Constraints on physical addresses for I/O buffers could be expressed in
the device's metadata pages, if a coprocessor instead contains its I/O
buffers and offers them at particular physical addresses.

A hypervisor does not need such constraints, since it can simply remap
whatever "physical" pages the supervisor used to point into the
coprocessor hardware, but a monitor must expose such details, because
the SEE is not permitted to rewrite the supervisor's page tables.

This also suggests that iovec stream buffers need to contain their
buffer headers, rather than including that information in the iovec
entry, since the buffer may end up in a different memory area from the
iovec entry that declares it to the SEE, especially in the case of a
coprocessor with its own memory. This discussion has already begun to
illuminate details of the asynchronous I/O model that I had missed
earlier; thanks.


-- Jacob

NoDot

unread,
Oct 22, 2016, 5:32:44 PM10/22/16
to RISC-V ISA Dev, michae...@mac.com, jo...@vectorblox.com, jcb6...@gmail.com
On Wednesday, October 19, 2016 at 7:32:27 PM UTC-4, Jacob Bachmeyer wrote:
> > There is a PVHVM model which is a hybrid of Type 1 and Type 2 and can
> > use PV constructs where they are faster than hardware emulation.
>
> I think that this is the purpose of SBI in RISC-V, except that physical
> hardware *also* supports the PV constructs.

I... don't know. At this point, I think it's worth asking (or discussing) where the SBI is supposed to sit on the line between a thin and simple HAL and a full Paravirtualization Interface (capitals included) of some sort.

I would favor the latter, but I know that would increase the complexity of the Monitor a great deal.

(It might also mean that the page table changes being SBI calls make sense: easier emulation on M/U systems.)

Jacob Bachmeyer

unread,
Oct 24, 2016, 12:46:19 AM10/24/16
to NoDot, RISC-V ISA Dev, michae...@mac.com, jo...@vectorblox.com
I think that there are no plans to support emulating S-mode on systems
that do not have it, even though a classic VMM *could* achieve that in
M-mode while running everything in U-mode. Paging is very hard to
emulate efficiently. I think that you would need to trap for every
memory access.

I argue that the distinction between what should go into the SBI (with
flexible implementation) and what should be rigidly defined (like the
page tables) is best considered by how complex adapting to different
implementations would be for a supervisor. IPI, for example, is a
relatively simple "cause an interrupt on this other hart" and could even
be an S-mode routine on an implementation that does not support virtual
harts. Similarly, "remote fence" is a special case of IPI for which
some implementations may have special hardware; again, this could
require a vertical trap or simply perform an implementation-specific
dance in S-mode. The PLIC interrupt-control calls are similar. All of
these interfaces affect very little and are reasonable subroutines in a
supervisor.

The page table entries, on the other hand, have far-reaching effects
that are difficult to wrap a useful interface around. For example, the
supervisor must allocate storage for page tables: How large is a page
table? How many virtual address bits does a single table map? The
flexibility that could be provided would add enormous complexity to
supervisor memory allocators that in the current model can safely assume
that each page table is one page long and maps 10 bits on RV32, 9 bits
on RV64, and (presumably) 8 bits on RV128.
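The arithmetic behind those figures, assuming one page-sized table and PTE sizes of 4, 8, and 16 bytes (the RV128 figure is, as above, a presumption):

```python
# Bits of virtual address mapped per page-table level, derived from the
# assumption that each table is exactly one page long.

PAGE_SIZE = 4096

def vpn_bits_per_level(pte_bytes):
    entries = PAGE_SIZE // pte_bytes     # PTEs per one-page table
    return entries.bit_length() - 1      # log2(entries)

print(vpn_bits_per_level(4))    # RV32: 1024 entries -> 10 bits
print(vpn_bits_per_level(8))    # RV64:  512 entries ->  9 bits
print(vpn_bits_per_level(16))   # RV128: 256 entries ->  8 bits
```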

The SBI virtio interface I propose falls more in the first category and
follows reasonably closely a model I got from the Linux kernel. I
believe that this model is close enough to the underlying problem that
any reasonable supervisor should be able to adapt to it easily.

The approach I propose to avoid gratuitous exploding monitor complexity
is to make the features that are more paravirtualization-oriented
optional; for example, my SBI virtio proposal defined a "minimal virtio"
(and included most of a sample implementation) as a subset of the "full
virtio" that supports paravirtualization or hardware acceleration.
Further, the virtio model I suggested could also provide standard
interfaces to I/O channel processors, although I am still planning the
details. This would enable an architecture not dissimilar to the
CDC6600 or IBM360, where auxiliary processors handled I/O. Of course,
in a RISC-V system, that "auxiliary" processor may be another RISC-V
core, merely an additional hart on the same RISC-V core, or even a
software implementation.

Another proposal I made (for an sbi_sexec() call) would reuse code (an
ELF loader) that a monitor already needs, since the standard appears to
envision loading a supervisor from an ELF program image. Since the
monitor must already have the ability to load a supervisor, making that
code available in the SBI allows bootloaders to themselves operate as
miniature operating systems (I believe that GRUB already runs this way),
but reuse the monitor's ELF loader instead of needing to handle linking
the SBI themselves. (I also envision SBI calls as linked by the SEE,
since ELF provides for dynamic linking and SBI calls seem to be linked
by name.)

This is another important detail--a well-designed SBI should provide
transparent support for both virtualization and hardware accelerators,
since these can be nearly equivalent from the view of a supervisor using
services they provide.


-- Jacob

NoDot

unread,
Oct 24, 2016, 8:11:20 PM10/24/16
to RISC-V ISA Dev, no_...@msn.com, michae...@mac.com, jo...@vectorblox.com, jcb6...@gmail.com
On Monday, October 24, 2016 at 12:46:19 AM UTC-4, Jacob Bachmeyer wrote:
> The page table entries, on the other hand, have far-reaching effects
> that are difficult to wrap a useful interface around. For example, the
> supervisor must allocate storage for page tables: How large is a page
> table? How many virtual address bits does a single table map?

Some rambling:

The mental model I have assumes the supervisor would set up its memory map and not have it change very often, allowing it to be a simple SBI call with a list of changes... except this puts a damper on that by needing to alter permissions often...

Unless the mapping portion is setup and changed rarely, but permissions are under the supervisor's control...

But even assuming that the user memory is always a subset of the supervisor's (without supervisor RWX permissions), allowing permissions to be directly controlled but not the mappings is probably not worth it. (And probably very bizarre outside of a capability system.)

Allen Baum

unread,
Oct 24, 2016, 8:35:57 PM10/24/16
to NoDot, RISC-V ISA Dev, michae...@mac.com, jo...@vectorblox.com, jcb6...@gmail.com




On Oct 24, 2016, at 5:11 PM, NoDot <no_...@msn.com> wrote:
...
Unless the mapping portion is setup and changed rarely, but permissions are under the supervisor's control...

You should check out millcomputing.com for an example of something like that.

Jacob Bachmeyer

unread,
Oct 25, 2016, 12:56:31 AM10/25/16
to NoDot, RISC-V ISA Dev, michae...@mac.com, jo...@vectorblox.com
NoDot wrote:
> On Monday, October 24, 2016 at 12:46:19 AM UTC-4, Jacob Bachmeyer wrote:
>
> The page table entries, on the other hand, have far-reaching effects
> that are difficult to wrap a useful interface around. For example, the
> supervisor must allocate storage for page tables: How large is a page
> table? How many virtual address bits does a single table map?
>
>
> Some rambling:
>
> The mental model I have assumes the supervisor would set up its memory
> map and not have it changed very often, allowing it to be a simple SBI
> call with a list of changes... except this puts a damper on that by
> needing to alter permissions often...
>
> Unless the /mapping/ portion is set up and changed rarely, but
> /permissions/ are under the supervisor's control...
>
> But even assuming that the user memory is always a subset of the
> supervisor's (without supervisor RWX permissions), allowing
> permissions to be directly controlled but not the mappings is probably
> not worth it. (And probably very bizarre outside of a capability system.)

In practice, mappings change very frequently. Most user page faults are
likely to result in a mapping change. In Linux, page faults are used
for demand-loading executables, demand-loading memory-mapped files,
swapping data back in when needed, and probably more. Also, the entire
page table set is changed on every process switch, but this is done by
reloading the page table base register.


-- Jacob

NoDot

Oct 25, 2016, 3:59:15 PM10/25/16
to RISC-V ISA Dev
On Tuesday, October 25, 2016 at 12:56:31 AM UTC-4, Jacob Bachmeyer wrote:
In practice, mappings change very frequently.  Most user page faults are
likely to result in a mapping change.  In Linux, page faults are used
for demand-loading executables, demand-loading memory-mapped files,
swapping data back in when needed, and probably more.  Also, the entire
page table set is changed on every process switch, but this is done by
reloading the page table base register.

Most of those sound like user page changes*, which I thought would be under supervisor control: both mappings and permissions, unlike the supervisor's page mappings.

And in retrospect, wow that makes little sense.

* Well, unless the kernel needs to page itself in and out of RAM. More problems with my (now discarded) idea!

Samuel Falvo II

Oct 25, 2016, 4:02:25 PM10/25/16
to NoDot, RISC-V ISA Dev
On Tue, Oct 25, 2016 at 12:59 PM, NoDot <no_...@msn.com> wrote:
> Most of those sound like user page changes* which I thought would be under
> supervisor control-both mappings and permissions, unlike the supervisor's
> page mappings.
>
> And in retrospect, wow that makes little sense.

I'm inclined to agree, at least as long as the kernel is coresident
with the user process image. However, wouldn't this become more
relevant when running a hypervisor under (over?) Linux? I'd imagine
that, under this scenario, most "kernels" are actually running in
user-mode, and are subject to lots of page mapping.

Jacob Bachmeyer

Oct 26, 2016, 12:06:20 AM10/26/16
to NoDot, RISC-V ISA Dev
NoDot wrote:
> On Tuesday, October 25, 2016 at 12:56:31 AM UTC-4, Jacob Bachmeyer wrote:
>
> In practice, mappings change very frequently. Most user page faults are
> likely to result in a mapping change. In Linux, page faults are used
> for demand-loading executables, demand-loading memory-mapped files,
> swapping data back in when needed, and probably more. Also, the entire
> page table set is changed on every process switch, but this is done by
> reloading the page table base register.
>
>
> Most of those sound like user page changes*, which I thought would be
> under supervisor control: both mappings and permissions, unlike the
> supervisor's page mappings.

Another problem with "fixed" supervisor mappings is that (under Linux)
the same physical page may be a user page at one time and a kernel page
at another. Linux has several caches (mostly filesystem related) that
are designed to grow to fill available memory and "give back" RAM as
user programs need it.

> And in retrospect, /wow/ that makes little sense.
>
> * Well, unless the kernel needs to page itself in and out of RAM. More
> problems with my (now discarded) idea!

While I do not think that Linux has swappable kernel memory (yet), NT
(all current Windows) can swap at least some parts of its kernel.


-- Jacob
