determining sv39 vs. sv48 in supervisor mode

ron minnich

Nov 2, 2016, 3:49:36 PM

to RISC-V ISA Dev

Is there some bit somewhere I missed that lets me determine, in S mode, whether I'm in sv39 or sv48?

Stefan O'Rear

Nov 2, 2016, 4:20:11 PM

to ron minnich, RISC-V ISA Dev

On Wed, Nov 2, 2016 at 12:49 PM, ron minnich <rmin...@gmail.com> wrote:
> Is there some bit somewhere I missed that lets me determine, in S mode,
> whether I'm in sv39 or sv48?

No. You're expected to compile your kernel for a single VA size, and
then communicate this out of band to the firmware.

(IMO this is one of the things that should be changed before the priv
spec is frozen.)

-s

Samuel Falvo II

Nov 2, 2016, 4:22:19 PM

to Stefan O'Rear, ron minnich, RISC-V ISA Dev

On Wed, Nov 2, 2016 at 1:20 PM, Stefan O'Rear <sor...@gmail.com> wrote:
> (IMO this is one of the things that should be changed before the priv
> spec is frozen.)

Or perhaps make it available as a SBI function of some kind. But,
it'd probably be safer to just expose the VM field in sstatus as a
read-only field.

--
Samuel A. Falvo II

Paolo Bonzini

Nov 2, 2016, 4:24:48 PM

to Samuel Falvo II, Stefan O'Rear, ron minnich, RISC-V ISA Dev

That would be the simplest, but it is liable to mess up the future
introduction of two-level paging for H-mode. It's simpler to let S-mode
define the width.

S-mode is entered with an identity page table of undefined width, and it
can then set up another page table if it wishes.

Paolo

Samuel Falvo II

Nov 2, 2016, 4:31:44 PM

to Paolo Bonzini, Stefan O'Rear, ron minnich, RISC-V ISA Dev

On Wed, Nov 2, 2016 at 1:24 PM, Paolo Bonzini <bon...@gnu.org> wrote:
> That would be the simplest, but it is liable to mess up the future
> introduction of two-level paging for H-mode. It's simpler to let S-mode
> define the width.

I was under the impression that the consensus was to avoid a separate
H-mode, and use a SIE-like mechanism.

In that case, the best route would be to make it available as a
standard SBI call then.

Paolo Bonzini

Nov 2, 2016, 4:40:24 PM

to Samuel Falvo II, Stefan O'Rear, ron minnich, RISC-V ISA Dev

On 02/11/2016 21:31, Samuel Falvo II wrote:
> On Wed, Nov 2, 2016 at 1:24 PM, Paolo Bonzini <bon...@gnu.org> wrote:
>> That would be the simplest, but it is liable to mess up the future
>> introduction of two-level paging for H-mode. It's simpler to let S-mode
>> define the width.
>
> I was under the impression that the consensus was to avoid a separate
> H-mode, and use a SIE-like mechanism.

Not the consensus, just my proposal.

> In that case, the best route would be to make it available as a
> standard SBI call then.

It doesn't seem to fit well into the purpose of the SBI, but it would
certainly work. A while ago I suggested putting it in sptbr.

Paolo

ron minnich

Nov 2, 2016, 4:45:59 PM

to Paolo Bonzini, Samuel Falvo II, Stefan O'Rear, RISC-V ISA Dev

an SBI call I would rather not see.

I was surprised to see that the # page level mappings was not communicated to s-mode in some way as a bit in sstatus. This is the kind of thing a kernel needs to know.

The kernel in this case is Harvey, derived from Plan 9. The code is flexible enough to manage having a 4 or 3 level PT without recompilation, as long as I can tell what it is!

ron

Samuel Falvo II

Nov 2, 2016, 4:47:54 PM

to ron minnich, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

On Wed, Nov 2, 2016 at 1:45 PM, ron minnich <rmin...@gmail.com> wrote:
> I was surprised to see that the # page level mappings was not communicated
> to s-mode in some way as a bit in sstatus. This is the kind of thing a
> kernel needs to know.

I would imagine that it'd need more than just one bit, since there
exists Sv32, Sv39, Sv48, and potentially more in the future,
especially with RV128. It's not optimal, but perhaps this should be
communicated instead via config-string or device-tree?

ron minnich

Nov 2, 2016, 4:50:03 PM

to Samuel Falvo II, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

yes, it's more than one bit. But that said I still don't see why there is not sstatus or other info that just tells me. It's a puzzle. kernels like to know this kind of thing :-)

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.
To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/CAEz%3Dso%3DHh%2BUMn4MGxd9WjyPkYFTZLAGtWobuJ1KTTP-3rJLbRA%40mail.gmail.com.

Stefan O'Rear

Nov 2, 2016, 5:01:07 PM

to ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

On Wed, Nov 2, 2016 at 1:45 PM, ron minnich <rmin...@gmail.com> wrote:
> an SBI call I would rather not see.
>
> I was surprised to see that the # page level mappings was not communicated
> to s-mode in some way as a bit in sstatus. This is the kind of thing a
> kernel needs to know.

My understanding of the situation is that you're expected to just pick
a VM mode, then put it in the ELF header somewhere so that the
bootloader can set it.

See also https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/cV9DEHo1XYU/-PTKEEVICwAJ

-s

ron minnich

Nov 2, 2016, 8:50:08 PM

to Samuel Falvo II, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

OK, here's what I'm going to do: coreboot is going to indicate the 3 of levels in the page table in bits 8-9 of the last PTE in the root.

IOW, in an sv39 core, bits 8 and 9 are 0. In an sv48 they are 1. Harvey can pick this up and will Do The Right Thing such that I don't need to compile different kernels for different RV64 cores. I'm just about done the harvey side and the coreboot side doesn't look too bad.

ron minnich

Nov 2, 2016, 8:51:37 PM

to Samuel Falvo II, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

I no, I have not had a long day :-)

OK, here's what I'm going to do: coreboot is going to indicate the # of levels in the page table in bits 8-9 of the last PTE in the root.

IOW, in an sv39 core, bits 8 and 9 are 0. In an sv48 they are 1. Harvey can pick this up and will Do The Right Thing such that I don't need to compile different kernels for different RV64 cores. I'm just about done the harvey side and the coreboot side doesn't look too bad.

kernel can go to the sptbr, get that, index into the last PTE, look at bits 8:9, and work out what to do from there.
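
The kernel-side read of that tag could look roughly like the following. This is a minimal sketch under stated assumptions, not Harvey's or coreboot's actual code: it assumes a 512-entry root table of 64-bit PTEs that the kernel can already reach through a pointer derived from sptbr, and the names `pt_levels_from_root`, `PT_TAG_SV39` and `PT_TAG_SV48` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTES_PER_TABLE 512u  /* 4 KiB table / 8-byte PTEs on RV64 */
#define PT_TAG_SHIFT   8     /* tag sits in PTE bits 8-9, as described above */
#define PT_TAG_MASK    0x3u
#define PT_TAG_SV39    0u    /* hypothetical names for the two tag values */
#define PT_TAG_SV48    1u

/* Returns the number of page-table levels, or -1 for an unknown tag. */
static int pt_levels_from_root(const uint64_t *root)
{
    assert(root != NULL);

    /* The tag is carried by the last PTE of the root table. */
    unsigned tag = (unsigned)((root[PTES_PER_TABLE - 1] >> PT_TAG_SHIFT) & PT_TAG_MASK);

    switch (tag) {
    case PT_TAG_SV39: return 3;   /* Sv39: three levels */
    case PT_TAG_SV48: return 4;   /* Sv48: four levels */
    default:          return -1;  /* values 2 and 3 are not defined by the scheme */
    }
}
```

Returning -1 for the two unused tag values leaves room for a later paging mode without silently misreading it as Sv39 or Sv48.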

Michael Clark

Nov 2, 2016, 11:15:17 PM

to Stefan O'Rear, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev


It doesn't seem appropriate to put a dynamic property into a static ELF attribute. It means the same kernel cannot run on hardware with different paging extensions.

I agree with Sam; sstatus.VM should be readable by the supervisor; and once PMA protection is in place it could conceivably be writable by a supervisor. At the least it could be passed in a dynamic aux val attribute e.g. AT_RISCV_VM

The ELF auxiliary vector is IMHO a better way to communicate dynamic properties at boot (dynamic from the kernel's perspective). ELF Aux vector, Environment and command line map nicely to Config/FDT, NVRAM and boot command line. A single kernel image should be deployable on many RISC-V hardware variants. If the PTE mode is in the ELF then this is impossible. It's data communicated /from the loader/ /to the kernel/ in the boot protocol, not the other way around.

The only attributes I would put in the ELF are ones indicating ABI variant and privilege level the image is compiled for /from the image/ /to the loader/ e.g an ELF attribute indicating type: Boot loader, Hypervisor, Kernel along with the expected SBI version (0x0107) for priv 1.7 (0x0109) for priv 1.9. Attributes in the ELF are ones that need to be communicated to the loader from the image. An SBI version attribute should be essential and would allow us to evolve the SBI ABI and provide back compat for kernels linked against older SBI versions.

BTW, how is the kernel going to link to the SBI? Known constant offsets in the Kernel VA? or does the kernel have a PLT and dynamically link to riscv-sbi.so? or do we pass in a structure containing function pointers (like paravirt_ops).

User process can be distinguished by the lack of any special attribute. i.e. ELFOSABI_SYSV = 0 which is the current method?

N mode blurs the lines as a kernel could conceivably be run using uptbr, uscratch, uepc, ubadaddr, uip, uie and ustatus. e.g. using a Type 2 Hypervisor. The nice thing about RISC-V CSRs is this could be abstracted with a logical 'or' of the mode in CSR accesses. Is it serendipity that the CSR prefixes match the privilege modes?

Also, the neat thing with a PIC kernel image is we could map the kernel image X and R pages into another PDID, boot it again, and get more separation than traditional OS level sandboxing (kernel namespaces); although this approach prevents binary patching which some kernels like to do. I would binary patch in a translator that loads the kernel.

Also file (1) should be able to distinguish a kernel from a regular process. Linux for example may require a boot loader, kexec, KVM or a full Hypervisor to execute a new kernel, respectively replacing the current kernel, or running a kernel in a new set of user threads or in a new protection domain.

This is part of the RISC-V boot protocol for boot loaders or kernels which as I understand are all currently ELF (e.g a UEFI COFF loader would be an ELF boot loader that can then load some COFF vmskernel.exe)

We could define additional ELF OSABI values for use in the ELF EHdr. This seems the easiest way to communicate the executable type.

ELFOSABI_SYSV = 0, /* existing, used for user processes */
ELFOSABI_RV_SUPERVISOR = 1,
ELFOSABI_RV_HYPERVISOR = 2,
ELFOSABI_RV_MONITOR = 3,

(Serendipity, these values match mode prefix in CSRs).
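
A loader-side check of such a field might be sketched as below. The sketch assumes the image type is carried in the `EI_OSABI` byte of `e_ident` (index 7 in the standard ELF header), and it uses the numeric values proposed in this message, which are proposals rather than assigned OSABI codes. The enum and the function `rv_classify_image` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EI_NIDENT 16  /* size of e_ident in the ELF header */
#define EI_OSABI  7   /* index of the OS/ABI byte within e_ident */

/* Values as proposed in the message above; hypothetical, not assigned codes. */
enum rv_image_kind {
    RV_IMAGE_INVALID    = -1,
    RV_IMAGE_USER       = 0,  /* ELFOSABI_SYSV */
    RV_IMAGE_SUPERVISOR = 1,
    RV_IMAGE_HYPERVISOR = 2,
    RV_IMAGE_MONITOR    = 3
};

/* Classify an image from the first EI_NIDENT bytes of its ELF header. */
static enum rv_image_kind rv_classify_image(const uint8_t ident[EI_NIDENT])
{
    assert(ident != NULL);

    /* Reject anything that does not start with the ELF magic. */
    if (ident[0] != 0x7f || ident[1] != 'E' || ident[2] != 'L' || ident[3] != 'F')
        return RV_IMAGE_INVALID;

    switch (ident[EI_OSABI]) {
    case 0:  return RV_IMAGE_USER;
    case 1:  return RV_IMAGE_SUPERVISOR;
    case 2:  return RV_IMAGE_HYPERVISOR;
    case 3:  return RV_IMAGE_MONITOR;
    default: return RV_IMAGE_INVALID;
    }
}
```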

We also need to put the SBI version the image expects somewhere; to expose it to the loader so it knows what function pointers to provide to the image.

It's easier to read the SBI version from a section header attribute or note than to require the image to dynamic link to riscv-sbi.so and use symbol versioning however both would provide a mechanism to detect the SBI version the image was compiled for. Something that will hit us in the future if we don't provide for versioning now.

riscv-sbi.so.1.9 would be somewhat like the linux-vdso.so.1

Something to think about. An SBI shared object solves SBI evolution and deprecation. A note attribute with an SBI version (image to loader) and an AT_RISCV_SBI auxval (loader to image) containing a pointer to a structure at the top of the initial scratch stack with sp set up prior to calling '_start' would be quicker and easier to implement than a dynamic riscv-sbi.so. Both of these match existing ELF start protocols. Constant offsets to the SBI VA function pointers would be a bit awful. Wonder how paravirt_ops are passed in?

- AT_RISCV_SBI (pointer to struct of SBI function pointers)
- AT_RISCV_FDT (pointer to device tree, if present)
- AT_RISCV_CONFIG (pointer to config string, if present)
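
On the image side, picking these entries out of the auxiliary vector could be sketched as follows. The layout (type/value pairs terminated by `AT_NULL`, which is 0) is the usual ELF auxv convention; the `AT_RISCV_*` numbers below are placeholders invented for illustration, and `rv_auxv_t` and `rv_auxv_lookup` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define AT_NULL          0       /* standard auxv terminator */
#define AT_RISCV_SBI     0x1000  /* placeholder values, not assigned anywhere */
#define AT_RISCV_FDT     0x1001
#define AT_RISCV_CONFIG  0x1002

typedef struct {
    uint64_t a_type;  /* entry type, e.g. AT_RISCV_CONFIG */
    uint64_t a_val;   /* entry value, here an address */
} rv_auxv_t;

/* Scan the vector for `type`; store its value in *out and return 1 if found,
 * return 0 if the terminator is reached first ("if present" semantics). */
static int rv_auxv_lookup(const rv_auxv_t *auxv, uint64_t type, uint64_t *out)
{
    assert(auxv != NULL && out != NULL);

    for (const rv_auxv_t *e = auxv; e->a_type != AT_NULL; e++) {
        if (e->a_type == type) {
            *out = e->a_val;
            return 1;
        }
    }
    return 0;
}
```

A kernel entry stub would call `rv_auxv_lookup` once per entry it cares about and fall back to another mechanism when an optional entry such as the device tree pointer is absent.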

Don't know what the outcome of the config string versus FDT was. The config string is more elegant as FDT has device domain specific encodings versus a standardised property list. e.g.

115200,8,N,1

vs

{
baud: 115200;
width: 8;
parity: false;
stop: 1;
}

The former is fine but it's hard to extend and can't be reflected on easily; marshal, unmarshal and editors need to be written to support the format of every specific device encoding scheme. I think OpenFirmware (IEEE 1275-1994) had a tree of attributes and is much lighter weight than EFI. OF does however have device specific encodings.

UART config is a bad example (,+fc for flow control versus flow_control: 1) as it is typically a cryptic device-specific encoding. However, on reading the device tree spec I found that too many devices have device-specific configuration formats rather than a simple consistent scheme that one could use in something like sysfs.

openbios.org might be worth a look as it's an IEEE 1275-1994 implementation for multiple CPU architectures and it can use U-Boot or coreboot for platform initialisation.

Sony's SELF (Signed ELF) is also worth a look for implementations that want to verify the firmware they load. I think it is based on FreeBSD.

Ha. We should make a boot.md proposal somewhere...

Michael

Andrew Waterman

Nov 2, 2016, 11:22:51 PM

to Michael Clark, Stefan O'Rear, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

On Wed, Nov 2, 2016 at 8:14 PM, Michael Clark <michae...@mac.com> wrote:
>
>
>
> Sent from my iPhone
>> On 3/11/2016, at 10:01 AM, Stefan O'Rear <sor...@gmail.com> wrote:
>>
>>> On Wed, Nov 2, 2016 at 1:45 PM, ron minnich <rmin...@gmail.com> wrote:
>>> an SBI call I would rather not see.
>>>
>>> I was surprised to see that the # page level mappings was not communicated
>>> to s-mode in some way as a bit in sstatus. This is the kind of thing a
>>> kernel needs to know.
>>
>> My understanding of the situation is that you're expected to just pick
>> a VM mode, then put it in the ELF header somewhere so that the
>> bootloader can set it.
>>
>> See also https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/cV9DEHo1XYU/-PTKEEVICwAJ
>
> It doesn't seem appropriate to put a dynamic property into a static ELF attribute. It means the same kernel cannot run on hardware with different paging extensions.

Not necessarily... Sv48 systems will generally support Sv39 (the
hardware cost is immeasurable), so kernels that require less address
space can run on systems that support more.

>
> I agree with Sam; sstatus.VM should be readable by the supervisor; and once PMA protection is in place it could conceivably be writable by a supervisor. At the least it could be passed in a dynamic aux val attribute e.g. AT_RISCV_VM
>
> The ELF auxiliary vector is IMHO a better way to communicate dynamic properties at boot (dynamic from the kernel's perspective). ELF Aux vector, Environment and command line map nicely to Config/FDT, NVRAM and boot command line. A single kernel image should be deployable on many RISC-V hardware variants. If the PTE mode is in the ELF then this is impossible. It's data communicated /from the loader/ /to the kernel/ in the boot protocol, not the other way around.
>
> The only attributes I would put in the ELF are ones indicating ABI variant and privilege level the image is compiled for /from the image/ /to the loader/ e.g an ELF attribute indicating type: Boot loader, Hypervisor, Kernel along with the expected SBI version (0x0107) for priv 1.7 (0x0109) for priv 1.9. Attributes in the ELF are ones that need to be communicated to the loader from the image. An SBI version attribute should be essential and would allow us to evolve the SBI ABI and provide back compat for kernels linked against older SBI versions.

Config string is probably a better way to communicate info to the
kernel than the ELF auxiliary vector, because it is more flexible.


Michael Clark

Nov 3, 2016, 12:32:03 AM

to Andrew Waterman, Stefan O'Rear, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

On 3 Nov 2016, at 4:22 PM, Andrew Waterman <and...@sifive.com> wrote:

Config string is probably a better way to communicate info to the
kernel than the ELF auxiliary vector, because it is more flexible.

I don’t think they are mutually exclusive.

The question is where do you get the pointer to the config string in the boot protocol or is it at a constant offset in the VA. Yuck.

Assuming scratch memory is setup for an initial scratch stack, then SP could use the existing ELF protocol to provide a pointer to the config string.

e.g. AT_RISCV_CONFIG

ELF AT vector (pointer to config string and SBI function pointers), Environment (parallel to OpenFirmware environment), and command line (parallel to boot command line) is an extensible mechanism, much like the config string.

I don’t think they are mutually exclusive. I need to reiterate that.

However if there is a way that a pointer to the config string and the SBI functions is already there, then okay, we need to document it in the “RISC-V Boot Protocol”; where can we find the docs? I guess we have to read the code… no worries. It’s not a complaint. I’d like RISC-V to have an elegant and symmetrical boot protocol.

michael.

Stefan O'Rear

Nov 3, 2016, 12:36:47 AM

to Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

On Wed, Nov 2, 2016 at 9:31 PM, Michael Clark <michae...@mac.com> wrote:
>
> On 3 Nov 2016, at 4:22 PM, Andrew Waterman <and...@sifive.com> wrote:
>
> Config string is probably a better way to communicate info to the
> kernel than the ELF auxiliary vector, because it is more flexible.
>
>
> I don’t think they are mutually exclusive.
>
> The question is where do you get the pointer to the config string in the
> boot protocol or is it at a constant offset in the VA. Yuck.

The config string is located in physical memory outside of the
kernel's initial data area. It is not accessible using ANY pointer
until the kernel gives it one by creating its own page tables.

> Assuming scratch memory is setup for an initial scratch stack, then SP could
> use the existing ELF protocol to provide a pointer to the config string.

There is no pointer to the config string.

-s

Jacob Bachmeyer

Nov 3, 2016, 12:47:57 AM

to ron minnich, Samuel Falvo II, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

ron minnich wrote:
> I no, I have not had a long day :-)
>
> OK, here's what I'm going to do: coreboot is going to indicate the #
> of levels in the page table in bits 8-9 of the last PTE in the root.
>
> IOW, in an sv39 core, bits 8 and 9 are 0. In an sv48 they are 1.
> Harvey can pick this up and will Do The Right Thing such that I don't
> need to compile different kernels for different RV64 cores. I'm just
> about done the harvey side and the coreboot side doesn't look too bad.
>
> kernel can go to the sptbr, get that, index into the last PTE, look at
> bits 8:9, and work out what to do from there.

There is an easier solution that does not rely on firmware assistance.
The kernel can walk any 4KiB mapping and simply count how many levels of
page tables it traverses before reaching a leaf.
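
The walk suggested above can be sketched as follows, with several assumptions stated up front. Physical memory is simulated as a flat array of 4 KiB pages indexed by PPN; a real kernel would turn the PPN into an address through whatever identity mapping it was entered with. The PTE encoding (V in bit 0, R/W/X in bits 1-3, all-zero R/W/X meaning "pointer to next level", PPN from bit 10) follows later versions of the privileged spec and may differ from the draft current at the time of this thread. The sketch also assumes every mapping ends in a 4 KiB leaf, so it would under-count if it landed on a superpage. `count_pt_levels` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTES_PER_TABLE 512u
#define PTE_V          0x1ull              /* valid bit (assumed encoding) */
#define PTE_RWX        0xEull              /* any of R/W/X set => leaf */
#define PTE_PPN_SHIFT  10
#define PTE_PPN_MASK   ((1ull << 44) - 1)  /* 44-bit PPN field */
#define MAX_LEVELS     4                   /* stop after Sv48 depth */

/* Follow the first valid entry at each level and return how many tables were
 * visited before a leaf was found; -1 if no leaf is reached. `mem` holds
 * `npages` simulated pages of PTES_PER_TABLE entries each. */
static int count_pt_levels(const uint64_t *mem, size_t npages, uint64_t root_ppn)
{
    assert(mem != NULL);
    uint64_t ppn = root_ppn;

    for (int depth = 1; depth <= MAX_LEVELS; depth++) {
        if (ppn >= npages)
            return -1;                     /* pointer leaves simulated memory */
        const uint64_t *table = mem + ppn * PTES_PER_TABLE;

        size_t i = 0;
        while (i < PTES_PER_TABLE && !(table[i] & PTE_V))
            i++;
        if (i == PTES_PER_TABLE)
            return -1;                     /* empty table, nothing to follow */

        if (table[i] & PTE_RWX)
            return depth;                  /* leaf: depth equals level count */
        ppn = (table[i] >> PTE_PPN_SHIFT) & PTE_PPN_MASK;
    }
    return -1;                             /* deeper than any known mode */
}
```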

> On Wed, Nov 2, 2016 at 5:49 PM ron minnich <rmin...@gmail.com
> <mailto:rmin...@gmail.com>> wrote:
>
> OK, here's what I'm going to do: coreboot is going to indicate the
> 3 of levels in the page table in bits 8-9 of the last PTE in the
> root.
>
> IOW, in an sv39 core, bits 8 and 9 are 0. In an sv48 they are 1.
> Harvey can pick this up and will Do The Right Thing such that I
> don't need to compile different kernels for different RV64 cores.
> I'm just about done the harvey side and the coreboot side doesn't
> look too bad.
>

-- Jacob

Jacob Bachmeyer

Nov 3, 2016, 12:49:32 AM

to Paolo Bonzini, Samuel Falvo II, Stefan O'Rear, ron minnich, RISC-V ISA Dev

Moving the paging depth into sptbr and putting it under supervisor
control is something that I have also been advocating.

-- Jacob

Jacob Bachmeyer

Nov 3, 2016, 12:57:52 AM

to Michael Clark, Andrew Waterman, Stefan O'Rear, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

Michael Clark wrote:
>
>> On 3 Nov 2016, at 4:22 PM, Andrew Waterman <and...@sifive.com
>> <mailto:and...@sifive.com>> wrote:
>>
>> Config string is probably a better way to communicate info to the
>> kernel than the ELF auxiliary vector, because it is more flexible.
>>
>
> I don’t think they are mutually exclusive.
>
> The question is where do you get the pointer to the config string in
> the boot protocol or is it at a constant offset in the VA. Yuck.

I have proposed making the config string accessible via a virtio
interface; essentially copying it page-by-page on demand to wherever the
supervisor wants it.

> Assuming scratch memory is setup for an initial scratch stack, then SP
> could use the existing ELF protocol to provide a pointer to the config
> string.

Logically, the stack could be an ELF segment in the supervisor image.

> e.g. AT_RISCV_CONFIG
>
> ELF AT vector (pointer to config string and SBI function pointers),
> Environment (parallel to OpenFirmware environment), and command line
> (parallel to boot command line) is an extensible mechanism, much like
> the config string.
>
> I don’t think they are mutually exclusive. I need to reiterate that.
>
> However if there is a way that a pointer to the config string and the
> SBI functions is already there, then okay, we need to document it in
> the “RISC-V Boot Protocol”; where can we find the docs? I guess we
> have to read the code… no worries. It’s not a complaint. I’d like
> RISC-V to have an elegant and symmetrical boot protocol.

I *think* that the SBI functions are supposed to be dynamically linked
by the SEE loader. This is also why I proposed sbi_sexec()
earlier--what is more elegant than "feed new ELF image into SEE loader",
especially when the SEE must already understand ELF? The suggestion
earlier in this thread to use the ELF ABI field to indicate the expected
privilege mode for a program could even remove the need for sbi_hexec()
from my proposal.

-- Jacob

Michael Clark

Nov 3, 2016, 12:58:50 AM

to Stefan O'Rear, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

How it is, how it should be and how it will be are 3 different things. How it is is how it is.

Exploiting an intrinsic symmetry and configuration passing interface in the ELF format that *is* presently used seems like a more elegant way to *exchange* information in the RISC-V boot protocol. i.e. attaching to normative models in adopted components versus inventing new methods (if the former meet the purpose). COFF for example is legacy compared to ELF as COFF doesn’t /officially/ support shared objects and/or symbol versioning through any elegant mechanism and was dropped in SYSV.

1). /loader to image/
- config string
- nvram environment
- boot command line
- state of registers when e_entry is called. i.e. sp pointing to scratch RAM with initial vector

2). /image to loader/
- dependency on SBI version riscv-sbi-1.9.so or ELF Note: RV_SBI_VERSION 0x0109
- type: monitor, hypervisor, supervisor

Michael Clark

Nov 3, 2016, 1:06:48 AM

to Stefan O'Rear, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

BTW ARM boot is an /absolute utter disaster/. Every vendor has their own boot protocol.

ELF AT auxiliary vector (auxv) with AT_RISCV_CONFIG sounds logical. We also have a place for NVRAM (env) and boot command line.

Add an ELF NOTE section with signature for locked boot. Define a standard. Sony’s SELF is a bit silly as we don’t need to modify ELF. The way to sign ELF is to zero the NOTE section that contains the signature, sign, and then place the signature in the section. The signature verifier needs to feed zeros for the one range that contains the signature. Don’t do stupid things like having unsigned portions of the binary range (authenticode or the TLS1.2 record layer during TLS1.2 negotiation) that allow binary segments to be tampered with outside of the signature verification.
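
The "feed zeros for the signature range" step can be illustrated as below. This is only a sketch of the hashing side: FNV-1a is used as a stand-in because it fits in a few lines, and it is not a cryptographic hash, so a real verifier would substitute a proper digest and signature check. The function name and the offset/length parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hash the whole image, substituting zero bytes for the range that holds the
 * signature. Signer and verifier both call this, so the digest does not
 * depend on the signature bytes, while every other byte stays covered. */
static uint64_t digest_excluding_signature(const uint8_t *image, size_t len,
                                           size_t sig_off, size_t sig_len)
{
    assert(image != NULL);
    assert(sig_off <= len && sig_len <= len - sig_off);

    uint64_t h = 14695981039346656037ull;      /* FNV-1a 64-bit offset basis */
    for (size_t i = 0; i < len; i++) {
        int in_sig = (i >= sig_off && i - sig_off < sig_len);
        uint8_t b = in_sig ? 0 : image[i];     /* zeros inside the signature */
        h ^= b;
        h *= 1099511628211ull;                 /* FNV-1a 64-bit prime */
    }
    return h;
}
```

Because only one contiguous range is zeroed and everything else is hashed, there is no unsigned portion of the file for an attacker to modify.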


Stefan O'Rear

Nov 3, 2016, 1:08:13 AM

to Jacob Bachmeyer, Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

On Wed, Nov 2, 2016 at 9:57 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Michael Clark wrote:
>>
>>
>>> On 3 Nov 2016, at 4:22 PM, Andrew Waterman <and...@sifive.com
>>> <mailto:and...@sifive.com>> wrote:
>>>
>>> Config string is probably a better way to communicate info to the
>>> kernel than the ELF auxiliary vector, because it is more flexible.
>>>
>>
>> I don’t think they are mutually exclusive.
>>
>> The question is where do you get the pointer to the config string in the
>> boot protocol or is it at a constant offset in the VA. Yuck.
>
>
> I have proposed making the config string accessible via a virtio interface;
> essentially copying it page-by-page on demand to wherever the supervisor
> wants it.

Um, the goal is to make the config string accessible to early-boot
code that *doesn't know where its page tables are* and *cannot
allocate memory*. Virtio is a clear step in the wrong direction.
Please do not suggest it again.

>> Assuming scratch memory is setup for an initial scratch stack, then SP
>> could use the existing ELF protocol to provide a pointer to the config
>> string.
>
>
> Logically, the stack could be an ELF segment in the supervisor image.
>
>> e.g. AT_RISCV_CONFIG
>>
>> ELF AT vector (pointer to config string and SBI function pointers),
>> Environment (parallel to OpenFirmware environment), and command line
>> (parallel to boot command line) is an extensible mechanism, much like the
>> config string.
>>
>> I don’t think they are mutually exclusive. I need to reiterate that.
>>
>> However if there is a way that a pointer to the config string and the SBI
>> functions is already there, then okay, we need to document it in the “RISC-V
>> Boot Protocol”; where can we find the docs? I guess we have to read the
>> code… no worries. It’s not a complaint. I’d like RISC-V to have an elegant
>> and symmetrical boot protocol.
>
>
> I *think* that the SBI functions are supposed to be dynamically linked by
> the SEE loader. This is also why I proposed sbi_sexec() earlier--what is
> more elegant than "feed new ELF image into SEE loader", especially when the
> SEE must already understand ELF? The suggestion earlier in this thread to
> use the ELF ABI field to indicate the expected privilege mode for a program
> could even remove the need for sbi_hexec() from my proposal.

The current RISC-V privilege spec targets an embedded SoC environment,
where the kernel is compiled for the actual runtime situation and can
"bake in" details like the hardware-supported paging depth.

In the more general case, there is no such thing as an "expected privilege
mode". The Fedora generic kernel should work fine under hypervisors
and it should work fine without them and it should work on Sv48
hardware and it should work on Sv39 hardware etc etc. And I want it,
*when run on Sv48 hardware*, to be able to map more than 256GB of
files; when run without a hypervisor, it *should be able to* start one
of its own. But neither of these capabilities should block booting in
more constrained environments.

You cannot get either of these abilities, or many others, with
"expected privilege mode" fields or "expected VM mode" fields. You
need a kernel with the ability to adapt itself to the actual runtime
environment, which is not a situation that is relevant for embedded
SoCs.

-s

Jacob Bachmeyer

Nov 3, 2016, 1:08:55 AM

to Andrew Waterman, Michael Clark, Stefan O'Rear, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

Andrew Waterman wrote:
> On Wed, Nov 2, 2016 at 8:14 PM, Michael Clark <michae...@mac.com> wrote:
>
>> I agree with Sam; sstatus.VM should be readable by the supervisor; and once PMA protection is in place it could conceivably be writable by a supervisor. At the least it could be passed in a dynamic aux val attribute e.g. AT_RISCV_VM
>>
>> The ELF auxiliary vector is IMHO a better way to communicate dynamic properties at boot (dynamic from the kernel's perspective). ELF Aux vector, Environment and command line map nicely to Config/FDT, NVRAM and boot command line. A single kernel image should be deployable on many RISC-V hardware variants. If the PTE mode is in the ELF then this is impossible. It's data communicated /from the loader/ /to the kernel/ in the boot protocol, not the other way around.
>>
>> The only attributes I would put in the ELF are ones indicating ABI variant and privilege level the image is compiled for /from the image/ /to the loader/ e.g an ELF attribute indicating type: Boot loader, Hypervisor, Kernel along with the expected SBI version (0x0107) for priv 1.7 (0x0109) for priv 1.9. Attributes in the ELF are ones that need to be communicated to the loader from the image. An SBI version attribute should be essential and would allow us to evolve the SBI ABI and provide back compat for kernels linked against older SBI versions.
>>
>
> Config string is probably a better way to communicate info to the
> kernel than the ELF auxiliary vector, because it is more flexible.
>

The impression that I got from the 1.9 draft was that the configuration
string exists to describe hardware, whether physical or virtual. The
configuration string can also change during runtime (especially if my or
another proposal to access it using virtio is adopted) to reflect
hotplug events. The ELF aux vector is given to the supervisor at early
boot and may or may not remain relevant afterwards. I would suggest
details like the virtual address of the initial page table tree should
go into the aux vector.

The aux vector passes information from the environment into a
newly-created process and is usually very low-level and contains binary
values. Pointers that a supervisor cannot assume or trivially acquire
would belong in the auxv, in my view.

The ELF argument vector should be reserved for user-selected inputs on
the scale of a command line.

The ELF environment is something that I still am not quite sure how best
to use. Loading it from NVRAM could be reasonable, but would require
standardizing the contents and layout of NVRAM. I am unsure if that is
wise at this stage, or even if the existence of NVRAM should be
standardized. (Although I guess the environment could be a NULL pointer
if there is no NVRAM.) This also raises the question of how would NVRAM
be updated? (I favor a virtio device for writing to NVRAM, but no
virtio proposal has yet been adopted.)

-- Jacob

Jacob Bachmeyer

Nov 3, 2016, 1:14:17 AM

to Stefan O'Rear, Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

And *that* makes sbi_get_config() as currently defined impossible to
implement. Changing it to return a physical address which the kernel
must map to use is a possibility, but I believe that sbi_get_config() is
currently defined as returning "void *", which is a virtual address in
S-mode.

Please do not introduce (or maintain) a concept of pointers that are not
actually pointers--that way lies madness and exploits run amok. Modern
C has a type system. It is relatively simple and in some ways
primitive, but it helps much more when it is not being subverted.

-- Jacob

Stefan O'Rear

Nov 3, 2016, 1:17:33 AM

to Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

On Wed, Nov 2, 2016 at 10:06 PM, Michael Clark <michae...@mac.com> wrote:
> BTW ARM boot is an /absolute utter disaster/. Every vendor has their own
> boot protocol.

> ELF AT auxiliary vector (auxv) with AT_RISCV_CONFIG sounds logical. We also
> have a place for NVRAM (env) and boot command line.
>
> Add an ELF NOTE section with signature for locked boot. Define a standard.
> Sony’s SELF is a bit silly as we don’t need to modify ELF. The way to sign
> ELF is to zero the NOTE section that contains the signature, sign, and then
> place the signature in the section. The signature verifier needs to feed
> zeros for the one range that contains the signature. Don’t do stupid things
> like having unsigned portions of the binary range (authenticode or the
> TLS1.2 record layer during TLS1.2 negotiation) that allow binary segments to
> be tampered with outside of the signature verification.

Most kernels these days require or encourage multiple files for the
initial bootstrap, so just make one of them a detached signature file.
People will probably be much happier with this than anything that
modifies the binaries themselves, especially since not all of them
will be ELF (e.g. on Linux one of them is likely to be a .cpio.gz)

-s

Stefan O'Rear

Nov 3, 2016, 1:27:49 AM

to Jacob Bachmeyer, Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

On Wed, Nov 2, 2016 at 10:14 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> And *that* makes sbi_get_config() as currently defined impossible to
> implement. Changing it to return a physical address which the kernel must
> map to use is a possibility, but I believe that sbi_get_config() is
> currently defined as returning "void *", which is a virtual address in
> S-mode.
>
> Please do not introduce (or maintain) a concept of pointers that are not
> actually pointers--that way lies madness and exploits run amok. Modern C
> has a type system. It is relatively simple and in some ways primitive, but
> it helps much more when it is not being subverted.

Actually it looks like sbi_get_config, sbi_config_string_base, and
sbi_config_string_len are all gone and S-mode can no longer access the
config string at all, looking at
https://github.com/riscv/riscv-pk/blob/master/machine/sbi.h .

Andrew Waterman: I'm guessing this is a force push gone wrong? I
don't see a revert comment, and I don't think I've gone crazy.

-s

Michael Clark

Nov 3, 2016, 1:35:58 AM

to Stefan O'Rear, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

In an embedded environment, the init ramdisk is often appended to the kernel image and delivered as a single file. The Linux kernel looks for it at the end of its image.

The logical way to do this nicely with a signature NOTE section, though it is not necessary, is to also add an ELF section for the init ramdisk. I could code this in several minutes. It’s because the Linux kernel developers don’t work with binutils as an integrated system like BSD.

It’s not strictly necessary if the constraint is documented that the signature is verified for the entire file excluding the signature range in the NOTE section (the signature itself is omitted or fed into the verifier as zeros).

People don’t want multiple files. Multiple files are a pain. For a trust chain, every separate file then needs to be loaded, i.e. loadable modules for the boot loader need to be signed and verified when loaded. Multiple files can be supported, but in early boot on embedded environments it’s a pain. You can’t, for example, add a boot screen GRUB module on an Android device.

In any case, dlopen will have to verify signatures on any loaded files, and files that are outside of an encapsulation that provides for signatures are a pain. The same technique can be used to sign ZIPs that contain uncompressed files (internal files are compressed). It’s a common technique for Android app asset files. I use code to mmap a ZIP of uncompressed files as a VFS in my own projects. Instead of an ELF NOTE we pop in a META-INF/SIGNATURE.TXT, for which we verify the entire file but feed zeros for the section that contains the signature (as the signature was created with that area formatted to zeros).

—

Stefan O'Rear

Nov 3, 2016, 1:45:03 AM

to Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

On Wed, Nov 2, 2016 at 10:35 PM, Michael Clark <michae...@mac.com> wrote:
> In an embedded environment, the init ramdisk is often appended to the kernel image and delivered as a single file. The Linux kernel looks for it at the end of its image.
>
> The logical way to do this nicely with a signature NOTE section, though it is not necessary, is to also add an ELF section for the init ramdisk. I could code this in several minutes. It’s because the Linux kernel developers don’t work with binutils as an integrated system like BSD.
>
> It’s not strictly necessary if the constraint is documented that the signature is verified for the entire file excluding the signature range in the NOTE section (the signature itself is omitted or fed into the verifier as zeros).
>
> People don’t want multiple files. Multiple files are a pain. For a trust chain, every separate file then needs to be loaded, i.e. loadable modules for the boot loader need to be signed and verified when loaded. Multiple files can be supported, but in early boot on embedded environments it’s a pain. You can’t, for example, add a boot screen GRUB module on an Android device.
>
> In any case, dlopen will have to verify signatures on any loaded files, and files that are outside of an encapsulation that provides for signatures are a pain. The same technique can be used to sign ZIPs that contain uncompressed files (internal files are compressed). It’s a common technique for Android app asset files. I use code to mmap a ZIP of uncompressed files as a VFS in my own projects. Instead of an ELF NOTE we pop in a META-INF/SIGNATURE.TXT, for which we verify the entire file but feed zeros for the section that contains the signature (as the signature was created with that area formatted to zeros).

It is not clear to me whether you want an ELF (vmlinux) image and
auxiliary files, or a single concatenated image which contains code to
parse itself. I'd rather we didn't do both.

-s

Michael Clark

Nov 3, 2016, 1:54:13 AM

to Stefan O'Rear, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

I would have a single ELF file with a concatenated initramdisk.

- optional NOTE section containing a signature
- nice to have: the ELF contains a section named .ramdisk that is PT_LOAD +R and covers the extent of the concatenated ramdisk
  - Linux doesn’t require this, but it is tidy
- signature verification checks the whole file except for the signature extent (zeros or omitted; omitted is easier to code without changing memory)
  - doesn’t require the .ramdisk section, as we always verify the whole file
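
A minimal sketch of the "feed zeros for the signature extent" step, under stated assumptions: 64-bit FNV-1a stands in for a real cryptographic hash purely to keep the example self-contained (it is NOT suitable for signing), and the function names and the offset/length parameters are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in digest step: 64-bit FNV-1a. Not cryptographic; a real loader
 * would use a proper hash here. */
static uint64_t fnv1a_update(uint64_t h, uint8_t byte)
{
    return (h ^ byte) * 0x100000001b3ULL;
}

/* Digest the whole image, feeding zeros for every byte inside the
 * signature extent [sig_off, sig_off + sig_len). The image itself is
 * not modified. */
static uint64_t digest_excluding_signature(const uint8_t *img, size_t len,
                                           size_t sig_off, size_t sig_len)
{
    assert(sig_off <= len && sig_len <= len - sig_off);
    uint64_t h = 0xcbf29ce484222325ULL;  /* FNV-1a offset basis */
    for (size_t i = 0; i < len; i++) {
        int in_sig = (i >= sig_off && i - sig_off < sig_len);
        h = fnv1a_update(h, in_sig ? 0 : img[i]);
    }
    return h;
}
```

Because every byte outside the signature extent goes into the digest, the concatenated ramdisk is covered whether or not a .ramdisk section describes it.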

Anything variable is in NVRAM and is passed in ENV in the standard ELF protocol, so a loader with privileges to access this NVRAM can alter boot, i.e. when the device is in debug mode.

Then we just need to feed one file to the device when debugging, and the rest is on the flash.

Works for servers too. Hypervisor can always kexec a signed but non-verifying initial domain which can have unsigned grub modules galore. The chain of trust from the manufacturer obviously needs to be much tighter in this case. i.e. you don’t walk into a retail store to purchase a rack of servers, so it’s likely known whether the firmware has been rooted by the vendor.

Angelo Bulfone

Nov 3, 2016, 1:55:25 AM

to Stefan O'Rear, Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

One major issue with requiring the kernel to allocate pages for the config string is the fact that a (proper) page frame allocator depends on the memory map of available RAM, which, as currently specified, lies within the config string. In other words, the kernel needs the contents of the config string just to retrieve the config string. A simple solution would be to pre-map or identity map the pages containing the config string so that, assuming the kernel hasn't remapped anything, it can just access it normally.

Alex Elsayed

Nov 3, 2016, 2:02:41 AM

to isa...@groups.riscv.org

On Wednesday, 2 November 2016 22:08:12 PDT Stefan O'Rear wrote:

<snip>

> The current RISC-V privilege spec targets an embedded SoC environment,
> where the kernel is compiled for the actual runtime situation and can
> "bake in" details like the hardware-supported paging depth.

ARM thought this was a workable plan too. Hundreds of "board" definition files
in the kernel later, it was found that no, it really _really_ isn't. Baking in
details is inimical to _exactly_ the kind of "generic kernel" work that ARM
has had to undergo (at great expense of time and effort) relatively recently.
Let's try and learn from their mistakes.

> In the more general, there is no such thing as an "expected privilege
> mode". The Fedora generic kernel should work fine under hypervisors
> and it should work fine without them and it should work on Sv48
> hardware and it should work on Sv39 hardware etc etc. And I want it,
> *when run on Sv48 hardware*, to be able to map more than 256GB of
> files; when run without a hypervisor, it *should be able to* start one
> of its own. But neither of these capabilities should block booting in
> more constrained environments.

Sure. But requiring a differently-compiled kernel in each of those cases to
boot _at all_ is a similar non-starter - I would be unsurprised if upstreaming
Linux support met roadblocks if that was the approach taken; once bitten,
twice shy.

> You cannot get either of these abilities, or many others, with
> "expected privilege mode" fields or "expected VM mode" fields. You
> need a kernel with the ability to adapt itself to the actual runtime
> environment, which is not a situation that is relevant for embedded
> SoCs.

It is. Being able to have a _single kernel_ that can boot on disparate
embedded SoCs is _exactly_ what the Linux ARM support has converged on, after
much wailing and gnashing of teeth caused by exactly the approach you
describe.

Jacob Bachmeyer

Nov 3, 2016, 2:17:22 AM

to Stefan O'Rear, Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

Stefan O'Rear wrote:
> On Wed, Nov 2, 2016 at 9:57 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>> Michael Clark wrote:
>>
>>>> On 3 Nov 2016, at 4:22 PM, Andrew Waterman <and...@sifive.com
>>>> <mailto:and...@sifive.com>> wrote:
>>>>
>>>> Config string is probably a better way to communicate info to the
>>>> kernel than the ELF auxiliary vector, because it is more flexible.
>>>>
>>>>
>>> I don’t think they are mutually exclusive.
>>>
>>> The question is where do you get the pointer to the config string in the
>>> boot protocol or is it at a constant offset in the VA. Yuck.
>>>
>> I have proposed making the config string accessible via a virtio interface;
>> essentially copying it page-by-page on demand to wherever the supervisor
>> wants it.
>>
>
> Um, the goal is to make the config string accessible to early-boot
> code that *doesn't know where its page tables are* and *cannot
> allocate memory*. Virtio is a clear step in the wrong direction.
> Please do not suggest it again.
>

Incorrect, but possibly a symptom of confusion--the "virtio" I propose
is *not* an existing virtio interface, but a new(-ish) concept using
SBI, which other platforms lack.

At the entry point--the very first instruction executed from the
supervisor--the page tables have been established by the SEE, following
instructions in the ELF image (".text" segment, ".data" segment, ".bss"
segment, stack, etc.). The SEE does the initial memory allocations.
Further, the SEE can describe the location(s) of the page tables using
the ELF aux vector. The early boot code *can* know where its page
tables are on RISC-V.

There is no need for early boot code to allocate memory--it could read
config string pages into a buffer allocated as part of the initial .bss
segment or even a special initbss segment, which may be thrown away
after the main allocator is initialized or kept for whatever use the
system has for it. Early boot does not need to read the config string
into some sort of tree, only parse just enough to get details like
physical memory layout. Again, the minimal space for that "first arena"
can be part of the .bss segment and allocated by the SEE.

Put simply, early boot on RISC-V is far more capable than on most
systems and the early boot environment is not an impediment to using
virtio. The virtio that I proposed scales from early boot
(synchronously reading the configuration string page-by-page into a
static buffer in the .bss segment) all the way to advanced concurrent
I/O using accelerators that have their own internal buffers (requiring
the full kernel allocator to even start to set such a system up). I
believe that virtio for the config string is a step in the *right*
direction. I will continue to advocate for it.

>>> Assuming scratch memory is setup for an initial scratch stack, then SP
>>> could use the existing ELF protocol to provide a pointer to the config
>>> string.
>>>
>> Logically, the stack could be an ELF segment in the supervisor image.
>>
>>
>>> e.g. AT_RISCV_CONFIG
>>>
>>> ELF AT vector (pointer to config string and SBI function pointers),
>>> Environment (parallel to OpenFirmware environment), and command line
>>> (parallel to boot command line) is an extensible mechanism, much like the
>>> config string.
>>>
>>> I don’t think they are mutually exclusive. I need to reiterate that.
>>>
>>> However if there is a way that a pointer to the config string and the SBI
>>> functions is already there, then okay, we need to document it in the “RISC-V
>>> Boot Protocol”; where can we find the docs? I guess we have to read the
>>> code… no worries. It’s not a complaint. I’d like RISC-V to have an elegant
>>> and symmetrical boot protocol.
>>>
>> I *think* that the SBI functions are supposed to be dynamically linked by
>> the SEE loader. This is also why I proposed sbi_sexec() earlier--what is
>> more elegant than "feed new ELF image into SEE loader", especially when the
>> SEE must already understand ELF? The suggestion earlier in this thread to
>> use the ELF ABI field to indicate the expected privilege mode for a program
>> could even remove the need for sbi_hexec() from my proposal.
>>
>
> The current RISC-V privilege spec targets an embedded SoC environment,
> where the kernel is compiled for the actual runtime situation and can
> "bake in" details like the hardware-supported paging depth.
>

This is my motivation for advocating changes that are more suited to
larger environments, while still being reasonably within reach for any
embedded SoC that supports S-mode at all. This is why the virtio that I
propose has a minimal option that only gives access to the config
string. I expect that embedded RISC-V will often be programmed
"bare-metal" or "bring your own monitor", rather than having established
firmware as is common in larger systems. Such a system may lack even
minimal virtio or any SBI at all and simply build the config string and
HAL bits into the kernel image, but that should be a non-standard
variant, not the standard boot process.

> In the more general, there is no such thing as an "expected privilege
> mode". The Fedora generic kernel should work fine under hypervisors
> and it should work fine without them and it should work on Sv48
> hardware and it should work on Sv39 hardware etc etc. And I want it,
> *when run on Sv48 hardware*, to be able to map more than 256GB of
> files; when run without a hypervisor, it *should be able to* start one
> of its own. But neither of these capabilities should block booting in
> more constrained environments.
>
> You cannot get either of these abilities, or many others, with
> "expected privilege mode" fields or "expected VM mode" fields. You
> need a kernel with the ability to adapt itself to the actual runtime
> environment, which is not a situation that is relevant for embedded
> SoCs.
>

An "expected privilege mode" field still allows you to determine if a
given ELF image is intended to be a supervisor, hypervisor, or user
application. The generic kernel should work fine with or without a
hypervisor, yes, and enabling seamless operation in both Sv39 and Sv48
was the motivation for my first proposal (message-id
<57B4E7D8...@gmail.com>, the very first message I sent to this
list) to move paging height into sptbr.

The last issue of starting a hypervisor from S-mode opens a can of
worms--S-mode is not supposed to know whether or not a hypervisor is
present. I worked around this in my sbi_{s,h}exec() proposal by
proposing that an attempt to start a hypervisor unconditionally
terminate the supervisor that calls sbi_hexec(), even if H-mode is not
supported on the hardware at all.

On RISC-V, the environment seems to be expected to "meet the kernel
halfway", and in some ways this is justified--you cannot boot a kernel
that expects to use paging on hardware that does not support paging--and
paging is optional in RISC-V.

-- Jacob

Stefan O'Rear

Nov 3, 2016, 2:18:39 AM

to Angelo Bulfone, Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

On Wed, Nov 2, 2016 at 10:55 PM, Angelo Bulfone <mbul...@gmail.com> wrote:
> One major issue with requiring the kernel to allocate pages for the config
> string is the fact that a (proper) page frame allocator depends on the
> memory map of available RAM, which as currently specified, lies within the
> config string. In other words, the kernel needs the contents of the config

sbi_query_memory exists to break this cycle, but has its own
limitations; notably, there's no way to tell the kernel about memory
that it can use *after* early boot, like the memory that the initramfs
is in (you don't want to reuse that for early page tables or anything
else until it's unpacked!)

> string just to retrieve the config string. A simple solution would be to
> pre-map or identity map the pages containing the config string so that,
> assuming the kernel hasn't remapped anything, it can just access it normally.

Yes, I think I've raised this exact issue before. We've reached the
scaling limits of mailing lists for recording issues, so I just
created https://github.com/sorear/riscv-specs/issues/1 . (Please add
your favorite issues!)

Foundation people: I am open to alternative suggestions for where this
should live. Not Google Groups, though.

-s

Stefan O'Rear

Nov 3, 2016, 2:20:58 AM

to Jacob Bachmeyer, Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

On Wed, Nov 2, 2016 at 11:17 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> There is no need for early boot code to allocate memory--it could read
> config string pages into a buffer allocated as part of the initial .bss
> segment or even a special initbss segment, which may be thrown away after

What maximum size of config string are you, personally, willing to commit to?

-s

Jacob Bachmeyer

Nov 3, 2016, 2:28:49 AM

to Stefan O'Rear, Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

Why not? (In different environments, of course.) The sbi_sexec()
proposal I made envisions a single buffer that can be produced
differently for different supervisors. Multiple files would be
assembled into a single image by a bootloader in systems that use such
things (think "RISC-V PC"). Embedded systems could simply use a variant
(simplified) boot process. The "standard monitor architecture" that I
am turning over in my head would allow for a very small system to simply
embed its supervisor in the monitor's bootloader slot, while a larger
system (with actual disks) would put a bootloader there.

-- Jacob

Jacob Bachmeyer

Nov 3, 2016, 2:35:15 AM

to Angelo Bulfone, Stefan O'Rear, Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

Angelo Bulfone wrote:
> One major issue with requiring the kernel to allocate pages for the
> config string is the fact that a (proper) page frame allocator depends
> on the memory map of available RAM, which as currently specified, lies
> within the config string. In other words, the kernel needs the contents
> of the config string just to retrieve the config string. A simple
> solution would be to pre-map or identity map the pages containing the
> config string so that, assuming the kernel hasn't remapped anything, it
> can just access it normally.

Perhaps a special ELF ".confstr" segment? But the size of the
configuration string is not known in the general case, only for embedded
systems.

The SBI virtio approach that I have been advocating solves this a
different way--the kernel can simply put a page (or two) aside in its
.bss segment and use that to receive the configuration string (a page at
a time) for early parsing to find the physical RAM layout. Essentially,
the "minimal virtio" enables the supervisor to demand the config string
be placed at a location of the kernel's choosing in convenient bite-size
pieces.

-- Jacob

Stefan O'Rear

Nov 3, 2016, 2:38:24 AM

to Jacob Bachmeyer, Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

On Wed, Nov 2, 2016 at 11:17 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> The last issue of starting a hypervisor from S-mode opens a can of
> worms--S-mode is not supposed to know whether or not a hypervisor is
> present. I worked around this in my sbi_{s,h}exec() proposal by proposing
> that an attempt to start a hypervisor unconditionally terminate the
> supervisor that calls sbi_hexec(), even if H-mode is not supported on the
> hardware at all.

I'm not crazy about this proposal.

Q: "I ran guestfish to copy some files out of a disk image I
downloaded and my VPS rebooted, help"

A: "This is by design; even attempting to start a nested hypervisor
will kill your kernel on RISC-V. If you want your VPS to not reboot,
add a modprobe blacklist for kvm; guestfish will automatically fall
back to software emulation."

-s

Jacob Bachmeyer

Nov 3, 2016, 2:41:28 AM

to Stefan O'Rear, Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

That limit would be (2^XLEN) * 4KiB in my proposals, minimum (2^64) *
4KiB on RV32. The SBI virtio interface I propose allows the
configuration string to be read one page at a time, with a finite (and
very small--one page) buffer, parsing as you read it. Think of reading
from a pipe. You do not need to care how long the input is, as long as
you can find the details you need (physical RAM layout, in this case)
using only a finite amount of space.
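
A sketch of that pipe-style read, under loud assumptions: read_page_fn stands in for the proposed (not adopted) SBI call that copies one page of the config string into a caller-supplied buffer, mem_read_page is only a test double playing the role of the firmware, and CFG_PAGE_SIZE, find_key, and the "ram {" key used below are illustrative names, not part of any specification.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CFG_PAGE_SIZE 4096
#define MAX_KEY 32

/* Hypothetical stand-in for the proposed SBI call: copy page `index` of the
 * config string into `buf`; return the number of valid bytes, 0 past the end. */
typedef size_t (*read_page_fn)(size_t index, char *buf, void *ctx);

/* Stream the config string through one static page and return the byte
 * offset of the first occurrence of `key`, or -1 if it never appears.
 * KMP matching carries its state across pages, so a key that straddles a
 * page boundary is still found. */
static long find_key(read_page_fn read_page, void *ctx, const char *key)
{
    static char page[CFG_PAGE_SIZE];  /* lives in .bss; no allocator needed */
    size_t klen = strlen(key);
    size_t fail[MAX_KEY];
    size_t matched = 0;
    long offset = 0;

    assert(klen > 0 && klen <= MAX_KEY);

    /* Build the KMP failure table for the key. */
    fail[0] = 0;
    for (size_t i = 1, k = 0; i < klen; i++) {
        while (k > 0 && key[i] != key[k]) k = fail[k - 1];
        if (key[i] == key[k]) k++;
        fail[i] = k;
    }

    for (size_t index = 0; ; index++) {
        size_t n = read_page(index, page, ctx);
        if (n == 0) return -1;  /* end of config string */
        for (size_t i = 0; i < n; i++, offset++) {
            while (matched > 0 && page[i] != key[matched])
                matched = fail[matched - 1];
            if (page[i] == key[matched]) matched++;
            if (matched == klen) return offset - (long)klen + 1;
        }
    }
}

/* Test double: serves pages from an in-memory string, standing in for
 * whatever the firmware would do. */
struct mem_source { const char *data; size_t len; };

static size_t mem_read_page(size_t index, char *buf, void *ctx)
{
    const struct mem_source *src = ctx;
    size_t start = index * CFG_PAGE_SIZE;
    if (start >= src->len) return 0;
    size_t n = src->len - start;
    if (n > CFG_PAGE_SIZE) n = CFG_PAGE_SIZE;
    memcpy(buf, src->data + start, n);
    return n;
}
```

The working set is one page plus a few words of matcher state, regardless of how long the config string is.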

-- Jacob

Stefan O'Rear

Nov 3, 2016, 2:47:08 AM

to Jacob Bachmeyer, Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

OK, I think I understand. Would still rather we not call it virtio,
because IMO the name mostly refers to the specification (
http://docs.oasis-open.org/virtio/virtio/v1.0/cs04/virtio-v1.0-cs04.html
), not any more general concept.

-s

Jacob Bachmeyer

Nov 3, 2016, 2:50:07 AM

to Stefan O'Rear, Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

The RISC-V KVM implementation would *not* rely on H-mode, which most
processors are not expected to have, and which is currently only barely
defined as an outline. On the other hand, if I understand correctly,
RISC-V is fully (and efficiently!) Popek-Goldberg virtualizable. I
expect that guestfish would end up running the guest VM supervisor in
U-mode under a classic VMM model on RISC-V. Put simply, that is how
RISC-V KVM would work, and from what I have seen, I expect that RISC-V
will have better performance using a classic "trap-and-emulate" VMM than
most current systems have with their hypervisor interfaces. The SBI
virtio that I propose (or a similar proposal) will only further help
performance in these cases. (A guestfish kernel could expect--and
get--full virtio for basically the cost of ECALLs rather than hardware
traps.)

-- Jacob

Jacob Bachmeyer

Nov 3, 2016, 2:58:22 AM

to Stefan O'Rear, Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

Ooooooohhhhh.... Someone actually made a spec and called it that...

I had thought of virtio as a generic shortened "virtual I/O", meaning
I/O devices that may have no feasible implementation as physical
hardware at all, but are convenient in a paravirtualized environment (or
as paravirtualization-like extensions in a virtualized environment that
also provides emulated hardware). In other words, emulated hardware is
emulated hardware; virtio is something that real hardware would not
implement but is convenient (faster) for a VMM to provide.

Got a better name for what I proposed (message-id
<57F2FF26...@gmail.com>) as "SBI virtio"?

-- Jacob

Stefan O'Rear

Nov 3, 2016, 2:58:49 AM

to Jacob Bachmeyer, Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

On Wed, Nov 2, 2016 at 11:50 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> is fully (and efficiently!) Popek-Goldberg virtualizable. I expect that

I expect it to be pretty inefficient compared to a proper SIE design.
Every copy_from_user hits two privileged accesses to sstatus. A
system call hits sscratch (4 times), sepc (at least twice to do the
+4), scause (hopefully only once), sie (if you want system calls to be
interruptible, interrupts need to be manually re-enabled in the trap
handler). Every one of those requires a trap to the VMM; you'd
probably want to detect frequently trapping addresses and deploy
focused binary translation.
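
To put a rough number on the system call path, here is the arithmetic as a sketch. The per-CSR counts are the lower bounds quoted above (one access to sie is assumed), and cycles_per_trap is a hypothetical parameter, since no trap cost has been measured in this thread.

```c
#include <assert.h>
#include <stddef.h>

/* Privileged CSR accesses on one system call, using the counts above. */
struct csr_access { const char *csr; unsigned count; };

static const struct csr_access syscall_path[] = {
    { "sscratch", 4 },
    { "sepc",     2 },  /* "at least twice" */
    { "scause",   1 },  /* "hopefully only once" */
    { "sie",      1 },  /* assumed: one write to re-enable interrupts */
};

/* Each access traps to the VMM under trap-and-emulate. */
static unsigned traps_per_syscall(void)
{
    unsigned total = 0;
    for (size_t i = 0; i < sizeof syscall_path / sizeof syscall_path[0]; i++)
        total += syscall_path[i].count;
    return total;
}

/* Extra cycles spent in the VMM for a given number of system calls;
 * cycles_per_trap is a hypothetical cost supplied by the caller. */
static unsigned long emulation_overhead(unsigned long syscalls,
                                        unsigned long cycles_per_trap)
{
    assert(cycles_per_trap > 0);
    return syscalls * traps_per_syscall() * cycles_per_trap;
}
```

With these counts, every guest system call costs at least eight round trips to the VMM before any real work is done.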

-s

Stefan O'Rear

Nov 3, 2016, 3:01:48 AM

to Jacob Bachmeyer, Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

On Wed, Nov 2, 2016 at 11:58 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Ooooooohhhhh.... Someone actually made a spec and called it that...

The kicker: virtio is not discoverable, and practical implementations
(QEMU, Google cloud services) are layered over virtual PCI to provide
said discoverability. So I thought you were talking about a kernel
having to run PCI discovery before it even had a memory map.

> I had thought of virtio as a generic shortened "virtual I/O", meaning I/O
> devices that may have no feasible implementation as physical hardware at
> all, but are convenient in a paravirtualized environment (or as
> paravirtualization-like extensions in a virtualized environment that also
> provides emulated hardware). In other words, emulated hardware is emulated
> hardware; virtio is something that real hardware would not implement but is
> convenient (faster) for a VMM to provide.
>
> Got a better name for what I proposed (message-id
> <57F2FF26...@gmail.com>) as "SBI virtio"?

I'll have to reread the proposal later.

-s

Jacob Bachmeyer

Nov 3, 2016, 3:11:07 AM

to Stefan O'Rear, Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

Stefan O'Rear wrote:
> On Wed, Nov 2, 2016 at 11:58 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>> Ooooooohhhhh.... Someone actually made a spec and called it that...
>>
>
> The kicker: virtio is not discoverable, and practical implementations
> (QEMU, Google cloud services) are layered over virtual PCI to provide
> said discoverability. So I thought you were talking about a kernel
> having to run PCI discovery before it even had a memory map.
>

Oh no-no-no-no-no-no-no-no: the minimal implementation (which is also
supported in a full implementation) is basically an SBI call for "copy
one page from the config string to a one-page-long buffer at this address".

Named "SBI virtio" devices are discovered by parsing the configuration
string.

>> I had thought of virtio as a generic shortened "virtual I/O", meaning I/O
>> devices that may have no feasible implementation as physical hardware at
>> all, but are convenient in a paravirtualized environment (or as
>> paravirtualization-like extensions in a virtualized environment that also
>> provides emulated hardware). In other words, emulated hardware is emulated
>> hardware; virtio is something that real hardware would not implement but is
>> convenient (faster) for a VMM to provide.
>>
>> Got a better name for what I proposed (message-id
>> <57F2FF26...@gmail.com>) as "SBI virtio"?
>>
>
> I'll have to reread the proposal later.
>

Please do; comments will be helpful, since the asynchronous I/O facility
is still a pretty vague outline and I am planning another revision of
the proposal to flesh it out. In particular, I am planning for stream
device buffers to have separate control blocks (which may physically be
registers in an I/O coprocessor somewhere), while block I/O buffers
would be exactly page-aligned, with their control blocks included in the
iovec.

-- Jacob

Jacob Bachmeyer

Nov 3, 2016, 3:21:24 AM

to Stefan O'Rear, Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

But a VMM could play tricks like identifying copy_{from,to}_user in the
guest supervisor and emulating the entire function. (Possibly even
replace its entry point with an interprocess-copy ECALL. Amusingly,
this would give Free systems (where this is safe) a performance boost
not available to closed systems, where the legality of such
modifications could be dubious.) I am still trying to devise a better
way to handle traps, because the current approach has a critical section
where a nested trap causes a crash. Interrupts are disabled but a
supervisor could still take a page fault.

-- Jacob

Stefan O'Rear

Nov 3, 2016, 3:30:54 AM

to Jacob Bachmeyer, Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

On Thu, Nov 3, 2016 at 12:21 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> But a VMM could play tricks like identifying copy_{from,to}_user in the
> guest supervisor and emulating the entire function. (Possibly even replace

What you have just described is called "previrtualization". Binary
translation gives the same benefits in a kernel-independent way.

> its entry point with an interprocess-copy ECALL. Amusingly, this would give
> Free systems (where this is safe) a performance boost not available to
> closed systems, where the legality of such modifications could be dubious.)

IANAL but I think VMware and the like have been doing this for many years.

> I am still trying to devise a better way to handle traps, because the
> current approach has a critical section where a nested trap causes a crash.
> Interrupts are disabled but a supervisor could still take a page fault.

If the supervisor accesses only valid addresses, it will never take a
page fault. This isn't hard to ensure, even if you're doing kernel
stack overflow checks.

-s

Jacob Bachmeyer

Nov 3, 2016, 3:37:52 AM11/3/16

to Stefan O'Rear, Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

I think that the problem can be avoided, if the user context save area
for the current task is guaranteed to always be present, but I was
trying to find a general solution.

-- Jacob

Stefan O'Rear

Nov 3, 2016, 3:47:57 AM11/3/16

to Jacob Bachmeyer, Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

On Thu, Nov 3, 2016 at 12:37 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> I think that the problem can be avoided, if the user context save area for
> the current task is guaranteed to always be present, but I was trying to
> find a general solution.

If the problem you are trying to solve is "I swapped out my percpu
struct, how do I swap it back in", I think you made a mistake several
steps ago.

-s

Michael Clark

Nov 3, 2016, 4:33:22 AM11/3/16

to jcb6...@gmail.com, Angelo Bulfone, Stefan O'Rear, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

I don’t know how this idea of embedding platform information in the guest ELF came about. Perhaps it is valid at the monitor level.

The ROM or hardware stage 0 loader loader needs non evaluable hardware physical addresses for I2C or SPI MMIO or whatever it needs to bitbang the stage 1 loader from NAND e.g. BBL. But these can be absolute addresses in the ROM or there could be a dedicated circuit that can bitbang and verify this loader. i.e. it doesn’t necessarily need to be software or shadow ROM. This level is “implementation defined”. I think Stage 1 or 2 (see stage model below) needs to be defined but Stage 0 can be implementation defined. The monitor should be defined.

The monitor is at the level that partitions and potentially starts an integrity manager running beside the hypervisor. I guess it needs to have the root config (for static or non dynamically evaluable hardware) but at this level it could either be in the ELF or “implementation defined”. The question is whether stage 0 has to provide config to stage 1, or whether stage 1 (e.g. BBL) contains the root config embedded in the ELF.

At the various levels (see stage model below), config is either augmented (dynamic discovery) or subdivided (partitioning) from the root config and config may be partially dynamically evaluable (e.g. USB devices and PCI devices). A kernel obviously wants the config passed in, not embedded in its ELF.

The vendor has a stage0 loader loader that is the minimal program to bit bang and verify, say, BBL from FLASH. The loader loader, I guess, is in ROM in production, but a fused out debug interface may be able to short circuit stage 0 and load stage 1 from remote. The loader loader needs a minimal root hardware config and it might want the config embedded in ROM as absolute addresses for the I2C or SPI or whatever MMIO interface (non discoverable hardware required to load and verify from NAND).

It’s going to load the first ELF and stage0 may not be updatable, so it does not need to be encapsulated in ELF. The non dynamically evaluable hardware like SPI or I2C addresses for the loader loader obviously needs to be in ROM, and if you don’t want root certs in firmware, then it needs to perform multiple rounds of a signature verification in a hardware circuit and potentially have fuses, so the loader loader (stage 0) can verify the integrity of the loader (stage 1) that it fetches from NAND. Authentication, not encryption.

There are two models

- Developer friendly - allow unlocking of the loader (at some level), but as soon as the chain of trust is broken you void your warranty e.g. Motorola [1].

- Developer unfriendly - don’t allow unlocking, and you may or may not get sued by DMCA lawyers for circumventing loader verification e.g. name withheld.

As far as I can tell, RISC-V is open to either model. It’s up to the vendor what they put on their SOC. RISC-V is an ISA Standard and Platform (which obviously covers Boot and ABI).

These are four boot models.

Kernel boot

0. HW, ROM and/or NAND Stage 0

1. MONITOR / BBL (controls PMAs)

2. KERNEL

Multi boot

0. HW, ROM and/or NAND Stage 0

1. MONITOR / BBL (controls PMAs)

2. LOADER (U-Boot, Coreboot, etc)

3. KERNEL

Hypervisor boot

0. HW, ROM and/or NAND Stage 0

1. MONITOR / BBL (controls PMAs)

2. HYPERVISOR + INTEGRITY ENCLAVE

3. KERNEL

Hypervisor Multi boot

0. HW, ROM and/or NAND Stage 0

1. MONITOR / BBL (controls PMAs)

2. HYPERVISOR

3. LOADER (U-Boot, Coreboot, etc)

4. KERNEL

It is this way:

1). /loader to image/ or loader passing metadata to image

- root config string
- nvram environment
- boot command line
- state of registers when e_entry is called. i.e. sp pointing to scratch RAM with initial vector and processor state defined

2). /image to loader/ or loader reading metadata from image

- dependency on SBI version riscv-sbi-1.9.so or ELF Note: RV_SBI_VERSION 0x0109
- ELF OSABI type: monitor, hypervisor, supervisor

Stages

s0 - ROM and/or NAND Stage 0, has enough config to enable caches and load BBL

s1 - BBL - may have some config passed in from stage 0, and extend with discovery - perform memory evaluation, etc. BBL should be able to boot a kernel, hypervisor or multi-boot loader

s2 - LOADER/HYPERVISOR/SUPERVISOR - subdivided config passed in via a standard entry vector specified in the RISC-V Boot Protocol

[1] https://motorola-global-portal.custhelp.com/app/standalone/bootloader/unlock-your-device-a

Michael

ron minnich

Nov 3, 2016, 11:20:00 AM11/3/16

to jcb6...@gmail.com, Samuel Falvo II, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

On Wed, Nov 2, 2016 at 9:47 PM Jacob Bachmeyer <jcb6...@gmail.com> wrote:

There is an easier solution that does not rely on firmware assistance.
The kernel can walk any 4KiB mapping and simply count how many levels of
page tables it traverses before reaching a leaf.

I don't believe this works if the firmware doesn't use any 4 KiB PTEs. coreboot currently only uses 2 MiB PTEs and will in future use a mix of PTEs, using the biggest possible. If you have, e.g., 1 GiB + 2 MiB of RAM that is naturally aligned, it doesn't make much sense for firmware to use more than 3 PTEs for that area. It certainly makes no sense at all to use 2^18 or so 4 KiB PTEs.
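
To make the objection concrete, here is a minimal C sketch of the level-counting walk. It is not taken from coreboot or any kernel. It uses a toy in-memory table whose entries carry a host pointer instead of a real PPN, and it assumes the leaf test of the ratified privileged spec (V set plus any of R/W/X), which may differ from the draft under discussion. All names (toy_pte, tables_to_leaf, the example_* helpers) are made up for illustration. An Sv39 tree ending in a 2 MiB leaf and an Sv48 tree ending in a 1 GiB leaf both report two tables visited, so the depth alone cannot separate them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PTE_V    0x1u  /* valid bit */
#define PTE_RWX  0xEu  /* any of R/W/X set marks a leaf (assumed layout) */
#define ENTRIES  512

/* Toy PTE: flag bits plus a host pointer standing in for the PPN. */
struct toy_pte {
    uint64_t bits;
    const struct toy_pte *next;  /* next-level table, NULL for leaves */
};

/* Follow the first valid entry of each table; return how many tables
 * were visited before a leaf was found, or 0 if none was found. */
int tables_to_leaf(const struct toy_pte *table)
{
    for (int depth = 1; table != NULL && depth <= 6; depth++) {
        const struct toy_pte *e = NULL;
        for (int i = 0; i < ENTRIES && e == NULL; i++)
            if (table[i].bits & PTE_V)
                e = &table[i];
        if (e == NULL)
            return 0;               /* empty table */
        if (e->bits & PTE_RWX)
            return depth;           /* leaf reached */
        table = e->next;
    }
    return 0;
}

static struct toy_pte sv39_root[ENTRIES], sv39_l1[ENTRIES];
static struct toy_pte sv48_root[ENTRIES], sv48_l2[ENTRIES];

/* Sv39: root -> level-1 table -> 2 MiB leaf. */
int example_sv39_megapage(void)
{
    sv39_l1[0]   = (struct toy_pte){ PTE_V | PTE_RWX, NULL };
    sv39_root[0] = (struct toy_pte){ PTE_V, sv39_l1 };
    return tables_to_leaf(sv39_root);
}

/* Sv48: root -> level-2 table -> 1 GiB leaf. */
int example_sv48_gigapage(void)
{
    sv48_l2[0]   = (struct toy_pte){ PTE_V | PTE_RWX, NULL };
    sv48_root[0] = (struct toy_pte){ PTE_V, sv48_l2 };
    return tables_to_leaf(sv48_root);
}

/* Both trees look identical to a walk that only counts tables. */
void self_check(void)
{
    assert(example_sv39_megapage() == example_sv48_gigapage());
}
```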

So far, in all this discussion, I've seen lots of proposed mechanisms that require adding code or using things not yet implemented or changing the RISCV spec. What I've proposed is trivial to implement and trivial to use and, further, is working right now. So, until I see something better, it's what I'll use. It's true that it only allows firmware to communicate the presence of 3, 4, 5, and 6 level page tables but that seems enough for now.

Compiling a kernel for only one or the other mode (sv39 vs. sv48) may be necessary for Linux but it's not at all necessary for Plan 9 -- so schemes that require such specialized compilation are very unattractive to me. The ELF idea is particularly problematic as it assumes the ELF headers are available at boot time, which is certainly not the case for, e.g., a kernel embedded in FLASH as a coreboot payload (Power[89] embed Linux in flash today for their boot loader). I like the config string in principle but the current design is not baked, and I'd rather have something immediately available to me rather than have to find the config string given that there is no standard for how to do that yet (EBDA anyone :-)?

ron

Samuel Falvo II

Nov 3, 2016, 11:30:41 AM11/3/16

to ron minnich, Jacob Bachmeyer, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

On Thu, Nov 3, 2016 at 8:19 AM, ron minnich <rmin...@gmail.com> wrote:
> I like the config string in principle but the
> current design is not baked, and I'd rather have something immediately
> available to me rather than have to find the config string given that there
> is no standard for how to do that yet (EBDA anyone :-)?

If only we were targeting the Atari ST, I'd recommend looking in the
TOS "cookie jar" structure. ;) But that brings us back to how to
find it in the first place. ;)

--
Samuel A. Falvo II

NoDot

Nov 3, 2016, 12:34:08 PM11/3/16

to RISC-V ISA Dev, sor...@gmail.com, michae...@mac.com, and...@sifive.com, rmin...@gmail.com, bon...@gnu.org, sam....@gmail.com, jcb6...@gmail.com

On Thursday, November 3, 2016 at 3:11:07 AM UTC-4, Jacob Bachmeyer wrote:

Please do; comments will be helpful, since the asynchronous I/O facility
is still a pretty vague outline and I am planning another revision of
the proposal to flesh it out.

How much of the VirtIO proposal linked can be mined for ideas? I tried to read it, but it's above my level...

Jacob Bachmeyer

Nov 4, 2016, 12:08:44 AM11/4/16

to Stefan O'Rear, Michael Clark, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

I was thinking more of per-task struct than per-cpu struct, but yes,
swapping out the context save area for a running task is pretty stupid
and catering to it is extremely difficult.

Jacob Bachmeyer

Nov 4, 2016, 12:13:08 AM11/4/16

to ron minnich, Samuel Falvo II, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

ron minnich wrote:
> On Wed, Nov 2, 2016 at 9:47 PM Jacob Bachmeyer <jcb6...@gmail.com
> <mailto:jcb6...@gmail.com>> wrote:
>
>
> There is an easier solution that does not rely on firmware assistance.
> The kernel can walk any 4KiB mapping and simply count how many
> levels of
> page tables it traverses before reaching a leaf.
>
>
>
> I don't believe this works if the firmware doesn't use any 4 KiB PTEs.
> coreboot currently only uses 2 MiB PTEs and will in future use a mix
> of PTEs, using the biggest possible. If you have, e.g., 1 GiB + 2 MiB
> of RAM that is naturally aligned it doesn't make much sense for
> firmware to use more than 3 PTEs for that area. It certainly makes no
> sense at all to use 2^18 or so 4KiB PTES.

You are correct; I had assumed that the supervisor would have at least
one ELF segment that must be mapped using 4KiB pages.

> So far, in all this discussion, I've seen lots of proposed mechanisms
> that require adding code or using things not yet implemented or
> changing the RISCV spec. What I've proposed is trivial to implement
> and trivial to use and, further, is working right now. So, until I see
> something better, it's what I'll use. It's true that it only allows
> firmware to communicate the presence of 3, 4, 5, and 6 level page
> tables but that seems enough for now.
>
> Compiling a kernel for only one or the other mode (sv39 vs. sv48) may
> be necessary for Linux but it's not at all necessary for Plan 9 -- so
> schemes that require such specialized compilation are very
> unattractive to me. The ELF idea is particularly problematic as it
> assumes the ELF headers are available at boot time, which is certainly
> not the case for, e.g., a kernel embedded in FLASH as a coreboot
> payload (Power[89] embed Linux in flash today for their boot loader).
> I like the config string in principle but the current design is not
> baked, and I'd rather have something immediately available to me
> rather than have to find the config string given that there is no
> standard for how to do that yet (EBDA anyone :-)?

Currently, the specs only seem to cover the embedded case and the
"larger system" case is still being hashed out. And why is there any
problem with expecting RISC-V Linux to build as an ELF binary instead of
a vmlinuz compressed blob?

The model that I expect would be that the kernel in flash would be an
ELF image stored in flash.

-- Jacob

Andrew Waterman

Nov 4, 2016, 1:09:42 AM11/4/16

to Jacob Bachmeyer, ron minnich, Samuel Falvo II, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

On Thu, Nov 3, 2016 at 9:13 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> ron minnich wrote:
>>
>> On Wed, Nov 2, 2016 at 9:47 PM Jacob Bachmeyer <jcb6...@gmail.com
>> <mailto:jcb6...@gmail.com>> wrote:
>>
>>
>> There is an easier solution that does not rely on firmware assistance.
>> The kernel can walk any 4KiB mapping and simply count how many
>> levels of
>> page tables it traverses before reaching a leaf.
>>
>>
>>
>> I don't believe this works if the firmware doesn't use any 4 KiB PTEs.
>> coreboot currently only uses 2 MiB PTEs and will in future use a mix of
>> PTEs, using the biggest possible. If you have, e.g., 1 GiB + 2 MiB of RAM
>> that is naturally aligned it doesn't make much sense for firmware to use
>> more than 3 PTEs for that area. It certainly makes no sense at all to use
>> 2^18 or so 4KiB PTES.
>
>
> You are correct; I had assumed that the supervisor would have at least one
> ELF segment that must be mapped using 4KiB pages.

The SBI should guarantee that the kernel text is mapped contiguously
in physical memory. So it actually is possible to figure this out
dynamically. Look at two adjacent leaf PTEs and examine the deltas
between their PPNs; that gives you the size of a leaf page. Then from
the depth of the tree you can infer the virtual address space size.

Note, I'm not arguing against an alternative mechanism to discover
this, just pointing out that it is possible at present.
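
A rough C sketch of the inference described above, under several assumptions: the PTE layout is the one in the ratified privileged spec (V in bit 0, R/W/X in bits 1-3, PPN starting at bit 10), page tables live in a tiny toy "physical memory" indexed by PPN, and only the first pointer PTE of each table is followed. The names (toy_mem, infer_va_bits, the example_* helpers) are hypothetical. Each table level resolves 9 bits above the 12-bit page offset, so the result is 12 + 9 * (tables visited + level of the leaf).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define PTE_V       0x1ull
#define PTE_RWX     0xEull
#define PPN_SHIFT   10
#define PPN_MASK    ((1ull << 44) - 1)
#define ENTRIES     512
#define TOY_TABLES  4   /* toy physical memory: tables at PPN 0..3 */

static uint64_t toy_mem[TOY_TABLES][ENTRIES];

static uint64_t pte_ppn(uint64_t p)     { return (p >> PPN_SHIFT) & PPN_MASK; }
static int      pte_is_leaf(uint64_t p) { return (p & PTE_V) && (p & PTE_RWX); }
static int      pte_is_ptr(uint64_t p)  { return (p & PTE_V) && !(p & PTE_RWX); }
static uint64_t make_leaf(uint64_t ppn) { return (ppn << PPN_SHIFT) | PTE_V | PTE_RWX; }
static uint64_t make_ptr(uint64_t ppn)  { return (ppn << PPN_SHIFT) | PTE_V; }

/* Returns the inferred virtual address width in bits, or 0 when no pair
 * of adjacent, physically contiguous leaf PTEs could be found. */
int infer_va_bits(uint64_t root_ppn)
{
    uint64_t ppn = root_ppn;
    for (int depth = 1; depth <= 6 && ppn < TOY_TABLES; depth++) {
        const uint64_t *t = toy_mem[ppn];
        uint64_t next = TOY_TABLES;          /* sentinel: no pointer PTE seen */
        for (int i = 0; i < ENTRIES; i++) {
            if (i + 1 < ENTRIES && pte_is_leaf(t[i]) && pte_is_leaf(t[i + 1])) {
                /* PPN delta is 1 for 4 KiB, 512 for 2 MiB, 512^2 for 1 GiB... */
                uint64_t delta = pte_ppn(t[i + 1]) - pte_ppn(t[i]);
                int leaf_level = 0;
                while (delta > 1 && delta % ENTRIES == 0) {
                    delta /= ENTRIES;
                    leaf_level++;
                }
                if (delta == 1)
                    return 12 + 9 * (depth + leaf_level);
            }
            if (next == TOY_TABLES && pte_is_ptr(t[i]))
                next = pte_ppn(t[i]);
        }
        ppn = next;                          /* descend, or stop on sentinel */
    }
    return 0;
}

/* Three tables deep overall: root -> table with two adjacent 2 MiB leaves. */
int example_sv39(void)
{
    memset(toy_mem, 0, sizeof toy_mem);
    toy_mem[0][2] = make_ptr(1);
    toy_mem[1][0] = make_leaf(0x80000);
    toy_mem[1][1] = make_leaf(0x80200);
    return infer_va_bits(0);
}

/* Four tables deep overall: root -> table -> table with 2 MiB leaves. */
int example_sv48(void)
{
    memset(toy_mem, 0, sizeof toy_mem);
    toy_mem[0][0] = make_ptr(1);
    toy_mem[1][2] = make_ptr(2);
    toy_mem[2][0] = make_leaf(0x80000);
    toy_mem[2][1] = make_leaf(0x80200);
    return infer_va_bits(0);
}

/* A lone leaf in the root has no neighbour to compare against. */
int example_single_leaf(void)
{
    memset(toy_mem, 0, sizeof toy_mem);
    toy_mem[0][2] = make_leaf(0x80000);
    return infer_va_bits(0);
}

void self_check(void)
{
    assert(example_sv39() == 39);
    assert(example_sv48() == 48);
    assert(example_single_leaf() == 0);
}
```

The sketch depends on the firmware having populated at least one table with two neighbouring leaves that map contiguous physical memory; when that does not hold it reports 0 rather than guessing.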

>
>> So far, in all this discussion, I've seen lots of proposed mechanisms that
>> require adding code or using things not yet implemented or changing the
>> RISCV spec. What I've proposed is trivial to implement and trivial to use
>> and, further, is working right now. So, until I see something better, it's
>> what I'll use. It's true that it only allows firmware to communicate the
>> presence of 3, 4, 5, and 6 level page tables but that seems enough for now.
>>
>> Compiling a kernel for only one or the other mode (sv39 vs. sv48) may be
>> necessary for Linux but it's not at all necessary for Plan 9 -- so schemes
>> that require such specialized compilation are very unattractive to me. The
>> ELF idea is particularly problematic as it assumes the ELF headers are
>> available at boot time, which is certainly not the case for, e.g., a kernel
>> embedded in FLASH as a coreboot payload (Power[89] embed Linux in flash
>> today for their boot loader). I like the config string in principle but the
>> current design is not baked, and I'd rather have something immediately
>> available to me rather than have to find the config string given that there
>> is no standard for how to do that yet (EBDA anyone :-)?
>
>
> Currently, the specs only seem to cover the embedded case and the "larger
> system" case is still being hashed out. And why is there any problem with
> expecting RISC-V Linux to build as an ELF binary instead of a vmlinuz
> compressed blob?
>
> The model that I expect would be that the kernel in flash would be an ELF
> image stored in flash.
>
>
> -- Jacob
>

Jacob Bachmeyer

Nov 4, 2016, 2:03:55 AM11/4/16

to Michael Clark, Angelo Bulfone, Stefan O'Rear, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

Michael Clark wrote:
>> On 3 Nov 2016, at 7:35 PM, Jacob Bachmeyer <jcb6...@gmail.com

>> <mailto:jcb6...@gmail.com>> wrote:
>>
>> Angelo Bulfone wrote:
>>> One major issue with requiring the kernel to allocate pages for the
>>> config string is the fact that a (proper) page frame allocator
>>> depends on the memory map of available RAM, which as currently
>>> specified, lies within the config string. In other words, the kernel
>>> needs the contents of the config string just to retrieve the config
>>> string. A simple solution would be to pre-map or identity map the
>>> pages containing the config string so that, assuming the kernel
>>> hasn't remapped anything, it can just access it normally.
>>
>> Perhaps a special ELF ".confstr" segment? But the size of the
>> configuration string is not known in the general case, only for
>> embedded systems.
>>
>> The SBI virtio approach that I have been advocating solves this a
>> different way--the kernel can simply put a page (or two) aside in its
>> .bss segment and use that to receive the configuration string (a page
>> at a time) for early parsing to find the physical RAM layout.
>> Essentially, the "minimal virtio" enables the supervisor to demand
>> the config string be placed at a location of the kernel's choosing in
>> convenient bite-size pieces.
>
> I don’t know how this idea of embedding platform information in the
> guest ELF came about. Perhaps it is valid at the monitor level.

Some embedded hardware may not ship with monitors--the developer must
program the entire system from the monitor up. In this case, a
non-standard monitor and SEE may be implemented, requiring information
normally passed from the SEE to be embedded in the supervisor image. I
am not advocating this, but stating an expected reality.

The ".confstr" segment I mentioned off-hand would be similar to
".bss"--a region of address space reserved but not defined in the ELF
image. The SEE would copy the configuration string into the ".confstr"
segment. I still advocate the "SBI virtio" approach, where the
configuration string is copied a page at a time to a supervisor-assigned
buffer for early parsing.
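
As an illustration only, the page-at-a-time idea might look like the C sketch below. The proposal does not define a concrete call, so toy_read_page is a stand-in invented here, backed by an in-memory string rather than an SEE; CFG_PAGE_SIZE, early_buf and scan_config are likewise hypothetical names. The supervisor owns a single page-sized buffer (which would sit in .bss) and streams the configuration string through it.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CFG_PAGE_SIZE 4096

/* Stand-in for the configuration string held by the SEE. */
static char   toy_config[3 * CFG_PAGE_SIZE];
static size_t toy_config_len;

/* HYPOTHETICAL call: copy page `index` of the config string into `buf`
 * and return the number of bytes copied (0 once past the end). */
static size_t toy_read_page(size_t index, char *buf)
{
    size_t off = index * CFG_PAGE_SIZE;
    if (off >= toy_config_len)
        return 0;
    size_t n = toy_config_len - off;
    if (n > CFG_PAGE_SIZE)
        n = CFG_PAGE_SIZE;
    memcpy(buf, toy_config + off, n);
    return n;
}

/* One page set aside by the supervisor; in a kernel this lives in .bss. */
static char early_buf[CFG_PAGE_SIZE];

struct scan_result {
    size_t bytes;  /* total config string bytes seen */
    size_t pages;  /* number of page reads that returned data */
};

/* Stream the config string through early_buf one page at a time. */
struct scan_result scan_config(void)
{
    struct scan_result r = { 0, 0 };
    size_t n;
    while ((n = toy_read_page(r.pages, early_buf)) > 0) {
        /* An early parser would consume early_buf[0..n) here. */
        r.bytes += n;
        r.pages++;
        if (n < CFG_PAGE_SIZE)
            break;                 /* short page: end of string */
    }
    return r;
}

/* Two full pages plus 100 bytes should take three reads. */
struct scan_result example_scan(void)
{
    toy_config_len = 2 * CFG_PAGE_SIZE + 100;
    memset(toy_config, 'x', toy_config_len);
    return scan_config();
}

void self_check(void)
{
    struct scan_result r = example_scan();
    assert(r.bytes == 2 * CFG_PAGE_SIZE + 100);
    assert(r.pages == 3);
}
```

A real early parser would also have to cope with tokens that straddle a page boundary, which this sketch ignores.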

> The ROM or hardware stage 0 loader loader needs non evaluable hardware
> physical addresses for I2C or SPI MMIO or whatever it needs to bitbang
> the stage 1 loader from NAND e.g. BBL. But these can be absolute
> addresses in the ROM or there could be a dedicated circuit that can
> bitbang and verify this loader. i.e. it doesn’t necessarily need to be
> software or shadow ROM. This level is “implementation defined”. I
> think Stage 1 or 2 (see stage model below) needs to be defined but
> Stage 0 can be implementation defined. The monitor should be defined.

I am piecing together a proposal for a standard monitor architecture for
general-purpose systems (i.e. RISC-V PCs). I envision boot0 as very
rigidly defined (to qualify for the "general-purpose" logo (also yet to
be defined) you must use exactly a boot0 image published by the
Foundation and you must put it in mask ROM) but very limited in
capabilities (diagnostic port, load boot1 from flash, verify/dump/log
digest of boot1, pass control to boot1). The diagnostic port provides
integrity verification and firmware recovery capabilities. The same
interface also allows a multi-socket system to use only a single flash
chip--the socket that has the firmware chip would use the "unbrick"
protocol to feed boot1 into the other processors over the diagnostic
network.

Yes, I said "network"--the diagnostic port would use a low-speed variant
of IEEE 1355 DS-SE (four wires, full-duplex, up to 200Mbps, far more
than RVDIAG would need) and an SMP-capable processor would have at least
two--one for RVDIAG and one or more for other processors. The RVDIAG
port would support both the full 1355 mode and a simpler unidirectional
asynchronous serial mode that simply dumps digests of each stage as the
system boots. I am still thinking about the network topology discovery
problem. (How does a diagnostic pad determine how many processors it is
connected to? For that matter, how does boot1 determine where to send
additional boot1 images?)

The boot1 component, also running on the bootstrap service processor,
would be responsible for basic hardware configuration, such as the main
system's DRAM controllers. Lastly, boot1 defines a monitor text
segment, loads the monitor into DRAM, and configures the PMP unit to
appropriately protect the monitor text and data segments. Reset is then
deasserted on the main processor(s). The boot1 component is, of course,
entirely implementation-specific. Note that boot1 is also a raw binary,
the only ELF loader is in the monitor that actually runs on the main
processor.

> The monitor is at the level that partitions and potentially starts an
> integrity manager running beside the hypervisor. I guess it needs to
> have the root config (for static or non dynamically evaluable
> hardware) but at this level it could either be in the ELF or
> “implementation defined”. The question is whether stage 0 has to
> provide config to stage 1 or stage 1 e.g. BBL contains the root config
> embedded in the ELF.

ELF should be used for applications, supervisors, and hypervisors. The
monitor itself should be a plain binary.

> At the various levels (see stage model below), config is either
> augmented (dynamic discovery) or subdivided (partitioning) from the
> root config and config may be partially dynamically evaluable (e.g.
> USB devices and PCI devices). A kernel obviously wants the config
> passed in, not embedded in its ELF.
>
> The vendor has a stage0 loader loader that is the minimal program to
> bit bang and verify say BBL from FLASH. The loader loader, I guess, is
> in ROM in production but a fused out debug interface may be able to
> short circuit stage 0 and load stage 1 from remote. The loader loader
> needs a minimal root hardware config and it might want the config
> embedded in ROM as absolute addresses for the I2C or SPI or whatever
> MMIO interface (non discoverable hardware required to load and verify
> from NAND).
>
> It’s going to load the first ELF and stage0 may not be updatable, so
> it does not need to be encapsulated in ELF. The non dynamically
> evaluable hardware like SPI or I2C addresses for the loader loader
> obviously needs to be in ROM, and if you don’t want root certs in
> firmware, then it needs to perform multiple rounds of a signature
> verification in a hardware circuit and potentially have fuses; so the
> loader loader (stage 0) can verify the integrity of the loader (stage
> 1) that it fetches from NAND. Authentication, not encryption.

This is wrong for a general-purpose system--you do not care about
authentication, you care about what *exactly* is running. Integrity
verification and authentication are orthogonal. Integrity verification
is also much easier--there are no secrets that can be cracked or stolen,
only a digest. Does it match what it should be? You should care what
the monitor *is*, not who made it (or stole your vendor's signing key).

> There are two models
>
> - Developer friendly - allow unlocking of the loader (at some level),
> but as soon as the chain of trust is broken you void your warranty

> e.g. Motorola.

This is what the boot1 log is intended for--a bad boot1 could
potentially misconfigure the system badly enough to fry hardware
(protection against this could be an advertisable feature: "our chips
cannot be damaged by bad firmware; the warranty is unaffected by use of
a custom boot1") so recording a digest of every boot1 seen (okay, so the
last "N" distinct digests seen) can provide a "warranty void if seal
broken" layer while also providing an assurance to the user that the
system's integrity has not been compromised in the past. All is well
and has been well (at this level, at least) if and only if you can
account for every digest in the log. (A system that has never had boot1
updated should only have one digest in the log.)
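
The "account for every digest" rule can be sketched in C as follows. The digest width (32 bytes), the digest_t type and the function names are assumptions made for this example; the outline above does not fix a hash function or a log format. The check passes only when every logged digest appears in the list of boot1 images the owner knowingly installed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DIGEST_LEN 32  /* assumption: a 256-bit digest */

typedef struct { uint8_t bytes[DIGEST_LEN]; } digest_t;

static bool digest_known(const digest_t *d,
                         const digest_t *known, size_t n_known)
{
    for (size_t i = 0; i < n_known; i++)
        if (memcmp(d->bytes, known[i].bytes, DIGEST_LEN) == 0)
            return true;
    return false;
}

/* True if and only if every digest in the log is one we can account for. */
bool log_fully_accounted(const digest_t *log_entries, size_t n_log,
                         const digest_t *known, size_t n_known)
{
    for (size_t i = 0; i < n_log; i++)
        if (!digest_known(&log_entries[i], known, n_known))
            return false;
    return true;
}

/* Toy digest: DIGEST_LEN copies of one byte, for illustration only. */
static digest_t toy_digest(uint8_t fill)
{
    digest_t d;
    memset(d.bytes, fill, DIGEST_LEN);
    return d;
}

/* Log holds two digests, both of which the owner installed. */
bool example_clean_log(void)
{
    digest_t known[]    = { toy_digest(0xA1), toy_digest(0xB2), toy_digest(0xC3) };
    digest_t boot_log[] = { toy_digest(0xA1), toy_digest(0xB2) };
    return log_fully_accounted(boot_log, 2, known, 3);
}

/* Log holds a digest nobody can explain. */
bool example_unexplained_entry(void)
{
    digest_t known[]    = { toy_digest(0xA1), toy_digest(0xB2) };
    digest_t boot_log[] = { toy_digest(0xA1), toy_digest(0xEE) };
    return log_fully_accounted(boot_log, 2, known, 2);
}

void self_check(void)
{
    assert(example_clean_log());
    assert(!example_unexplained_entry());
}
```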

> - Developer unfriendly - don’t allow unlocking, and you may or may not
> get sued by DMCA lawyers for circumventing loader verification e.g.
> name withheld.

This model is evil. While we cannot prohibit it outright, we should not
encourage it in any way.

> As far as I can tell, RISC-V is open to either model. It’s up to the
> vendor what they put on their SOC. RISC-V is an ISA Standard and
> Platform (which obviously covers Boot and ABI).
>
> These are four boot models.
>
> Kernel boot
>
> 0. HW, ROM and/or NAND Stage 0
> 1. MONITOR / BBL (controls PMAs)
> 2. KERNEL
>

> [snip more complex models]

You are still confusing Physical Memory Attributes with Physical Memory
Protection. These are distinct in RISC-V and PMAs are hardwired, not
configurable. PMP is something that a monitor can control.

> It is this way:
>
> 1). /loader to image/ or loader passing metadata to image
>
> - root config string
> - nvram environment
> - boot command line
> - state of registers when e_entry is called. i.e. sp pointing to
> scratch RAM with initial vector and processor state defined
>
> 2). /image to loader/ or loader reading metadata from image
>
> - dependency on SBI version riscv-sbi-1.9.so or ELF Note:
> RV_SBI_VERSION 0x0109
> - ELF OSABI type: monitor, hypervisor, supervisor

This part is good thinking. Although some of the information in the
first category may also need to be available (and can change) during
runtime.

> Stages
>
> s0 - ROM and/or NAND Stage 0, has enough config to enable caches and
> load BBL
> s1 - BBL - may have some config passed in from stage 0, and extend
> with discovery - perform memory evaluation, etc. BBL should be able to
> boot a kernel, hypervisor or multi-boot loader
> s2 - LOADER/HYPERVISOR/SUPERVISOR - subdivided config passed in via a
> standard entry vector specified in the RISC-V Boot Protocol

In the outline I have, boot0 and boot1 run on a dedicated RV32 bootstrap
processor (boot0 can run on RV32E) with its own local SRAM. Optionally,
the bootstrap processor's SRAM could (physically) be part of one of the
main processor's caches, if the bootstrap processor shuts down after
loading the monitor.

-- Jacob

Jacob Bachmeyer

Nov 4, 2016, 2:12:10 AM11/4/16

to NoDot, RISC-V ISA Dev, sor...@gmail.com, michae...@mac.com, and...@sifive.com, rmin...@gmail.com, bon...@gnu.org, sam....@gmail.com

Probably very little, since that spec is for "virtual hardware" that
still tries to behave like hardware, while my proposal is more like the
POSIX I/O API. The operations are analogous: attach is similar to
open, detach is similar to close, read_page and write_page are pread and
pwrite in pages instead of octets, read_info is a text-based analogue to
stat. The asynchronous model is a blend of readv/writev, POSIX AIO, and
some effort to "keep the layer thin" above underlying hardware when the
device is not entirely virtual.

That both that spec and my proposal use the term "virtio" is an
unfortunate accident.

-- Jacob

Michael Clark

Nov 4, 2016, 4:03:54 AM11/4/16

to jcb6...@gmail.com, Angelo Bulfone, Stefan O'Rear, Andrew Waterman, ron minnich, Paolo Bonzini, Samuel Falvo II, RISC-V ISA Dev

> On 4/11/2016, at 7:03 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> You are still confusing Physical Memory Attributes with Physical Memory Protection. These are distinct in RISC-V and PMAs are hardwired, not configurable. PMP is something that a monitor can control.

My read is that PMP 'attributes' are protection specific PMAs and this is consistent with MTRR where a range can be changed to write combine to write through. i.e. Configurable. Some PMAs may indeed be static but protection is an attribute. That is my understanding of the spec.

I will re-read, but either way, PMP attributes are protection bits on a range along with other configurable or non-configurable attribute bits. Semantics.

ron minnich

Nov 4, 2016, 11:45:39 AM11/4/16

to Andrew Waterman, Jacob Bachmeyer, Samuel Falvo II, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

On Thu, Nov 3, 2016 at 10:09 PM Andrew Waterman <and...@sifive.com> wrote:

The SBI should guarantee that the kernel text is mapped contiguously
in physical memory. So it actually is possible to figure this out
dynamically. Look at two adjacent leaf PTEs and examine the deltas
between their PPNs; that gives you the size of a leaf page. Then from
the depth of the tree you can infer the virtual address space size.

This is a great trick, and I love it. But I can always cons up some wacko scenario where it could fail, possibly unrealistic, but possible. Let's take the easy one, already seen in practice: a machine with precisely 1 GiB of memory, aligned at 2 GiB, with maybe one I/O device. coreboot sets up a single 1 GiB PTE for RAM and a single 4 KiB PTE for the device -- how do I tell that from a 512 GiB PTE for RAM and a 2 MiB PTE for the device (yeah, I know, I can't buy 512 GiB for riscv yet but ...). Did I miss something?

One unstated assumption with this technique is that there will be some page table that has two PTEs populated and that they are contiguous. One thing's for sure: you should plan on firmware doing things you don't expect, and you really need to clearly lay out your minimal expectations, such as this one. I may have missed it in the docs.

The page table design is great but we need to communicate more info to the supervisor than I see today, and I have yet to see anything that rises to the level of needing an SBI call. Could we just define the prototype of main() as follows:

void main(void *config_string, void *fdt)

and let firmware/hypervisor set those things up? I don't see this stated anywhere but it would be useful.

ron

Michael Clark

Nov 4, 2016, 2:53:19 PM11/4/16

to ron minnich, Andrew Waterman, Jacob Bachmeyer, Samuel Falvo II, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

On 5 Nov 2016, at 4:45 AM, ron minnich <rmin...@gmail.com> wrote:

The page table design is great but we need to communicate more info to the supervisor than I see today, and I have yet to see anything that rises to the level of needing an SBI call. Could we just define the prototype of main() as follows:
void main(void *config_string, void *fdt)

I agree with this; however, with a very minor tweak we could use the standard ELF start protocol if the firmware is able to set up a temporary stack frame, and access other things we need like the boot command line.

e.g. int main (int argc, char **argv);

/*

STACK TOP

env data

arg data

padding, align 16

auxv table, AT_NULL terminated

envp array, null terminated

argv pointer array, null terminated

argc <- SP

*/

The stack frame for an ELF _start is documented here:

https://github.com/michaeljclark/riscv-meta/blob/master/src/elf/riscv-elf.h

We don’t need to implement (3) getenv and (3) getauxval. These can be inlined pretty easily by walking memory from argv. The implementation of these two functions is trivial even for the smallest embedded system. The config parser is more complicated.
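
A minimal C sketch of such a walk, assuming the stack image shown above (argc at SP, then argv[], NULL, envp[], NULL, then type/value pairs ending in AT_NULL). The values 0 for AT_NULL and 3 for AT_PHDR are the standard ELF ones; BOOT_AT_RISCV_CONFIG is a placeholder number invented here, since no RISC-V specific AT values have been assigned. The function names and the fake stack in the example are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BOOT_AT_NULL          0       /* standard ELF AT_NULL */
#define BOOT_AT_PHDR          3       /* standard ELF AT_PHDR */
#define BOOT_AT_RISCV_CONFIG  0x7001  /* HYPOTHETICAL placeholder value */

/* Given the initial stack pointer, return the start of the auxv table. */
const uintptr_t *find_auxv(const uintptr_t *sp)
{
    uintptr_t argc = sp[0];
    const uintptr_t *p = sp + 1 + argc + 1;  /* skip argc, argv[], its NULL */
    while (*p != 0)                          /* skip envp[] */
        p++;
    return p + 1;                            /* step over envp's NULL */
}

/* Inline replacement for getauxval(): 0 when the type is absent. */
uintptr_t boot_getauxval(const uintptr_t *sp, uintptr_t type)
{
    for (const uintptr_t *a = find_auxv(sp); a[0] != BOOT_AT_NULL; a += 2)
        if (a[0] == type)
            return a[1];
    return 0;
}

/* Build a fake start stack in an array and look one entry up. */
uintptr_t example_lookup(uintptr_t type)
{
    const char *arg0 = "kernel";
    const char *arg1 = "console=sbi";
    const char *env0 = "BOOT=flash";
    uintptr_t stack[] = {
        2,                                   /* argc  <- SP */
        (uintptr_t)arg0, (uintptr_t)arg1, 0, /* argv[], NULL */
        (uintptr_t)env0, 0,                  /* envp[], NULL */
        BOOT_AT_PHDR,         0x1040,        /* auxv pairs */
        BOOT_AT_RISCV_CONFIG, 0x2000,
        BOOT_AT_NULL,         0,
    };
    return boot_getauxval(stack, type);
}

void self_check(void)
{
    assert(example_lookup(BOOT_AT_PHDR) == 0x1040);
    assert(example_lookup(BOOT_AT_RISCV_CONFIG) == 0x2000);
    assert(example_lookup(99) == 0);
}
```

A getenv equivalent would stop at the envp array instead of skipping it, comparing each "NAME=value" string against the requested name.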

We can add any RISC-V specific AT variables.

AT_PHDR is already in the auxv table so the kernel can get access to its ELF. Some operating systems could check for NOTES or a .ramdisk section.

Of course we can add AT_RISCV_VM but ideally the kernel can access the sstatus.VM field.

AT_RISCV_CONFIG would be logical. We have multiple parser implementations.

AT_RISCV_MODE would also help, i.e. whether to access sstatus.VM and friends or ustatus.VM and friends (for KVM)

sstatus.VM and/or ustatus.VM should ideally be readable by the supervisor, but AT_RISCV_MODE would help KVM mod one bit in the kernel CSRs. AT_RISCV_MODE and the top of stack should ideally be wiped, but the kernel needs to know whether it is running as a Supervisor or as a User (KVM).

I think this mostly fills the need for kernel to access its ELF, config string (Auxiliary Vector), environment (NVRAM), and boot command line (argc, argv).

We don’t need a full libc, but usually an embedded system needs some sort of printf which would use the SBI console functions.

We have to remember that when the kernel was written, ELF wasn’t around and it was a.out. ELF start protocol seems custom made.

I believe that “spike” does this already, but it reminds me: I have to properly set up the stack frame in the RISC-V emulator I am working on. I’m likely going to copy spike’s behaviour, so ultimately this means it’s probably up to Andrew and Palmer et al… we’ll see what they do…

Michael.

Michael Clark

Nov 4, 2016, 3:00:41 PM11/4/16

to ron minnich, Andrew Waterman, Jacob Bachmeyer, Samuel Falvo II, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

The principle being: use a standard protocol that exists within the domain “if it is fit for purpose” rather than inventing a new one.

RISC-V could potentially be quite tidy. I like tidiness.

The Linux kernel was written when a.out and COFF were still a thing so they didn’t have a nice clean way of starting i.e. they were starting in 16-bit “real mode” with asm in a 512-byte boot sector (less space for a partition table).

--
You received this message because you are subscribed to the Google Groups "RISC-V ISA Dev" group.
To unsubscribe from this group and stop receiving emails from it, send an email to isa-dev+u...@groups.riscv.org.
To post to this group, send email to isa...@groups.riscv.org.
Visit this group at https://groups.google.com/a/groups.riscv.org/group/isa-dev/.

To view this discussion on the web visit https://groups.google.com/a/groups.riscv.org/d/msgid/isa-dev/F607547C-9D34-4B3A-91CE-28626DD674A5%40mac.com.

Andrew Waterman

Nov 4, 2016, 3:46:22 PM11/4/16

to ron minnich, Jacob Bachmeyer, Samuel Falvo II, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

On Fri, Nov 4, 2016 at 8:45 AM, ron minnich <rmin...@gmail.com> wrote:
>
>
> On Thu, Nov 3, 2016 at 10:09 PM Andrew Waterman <and...@sifive.com> wrote:
>>
>>
>>
>> The SBI should guarantee that the kernel text is mapped contiguously
>> in physical memory. So it actually is possible to figure this out
>> dynamically. Look at two adjacent leaf PTEs and examine the deltas
>> between their PPNs; that gives you the size of a leaf page. Then from
>> the depth of the tree you can infer the virtual address space size.
>>
>>
>
>
> This is a great trick, and I love it. But I can always cons up some wacko
> scenario where it could fail, possibly unrealistic, but possible. Let's take
> the easy one, already seen in practice: a machine with 1 GiB memory
> precisely, aligned at 2 GiB, with maybe one io device. coreboot sets up a
> single GiB PTE for ram, single 4K PTE for the device -- how do I tell that
> from a 512 GiB for ram and 2M pte for the device (yeah, I know, I can't buy
> 512 GiB for riscv yet but ...). Did I miss something?
>
> One unstated assumption with this technique is that there will be some page
> table that has two PTEs populated and that they are contiguous. One thing's
> for sure: you should plan on firmware to do things you don't expect, and you
> really need to clearly lay out your minimal expectations, such as this one.
> I may have missed it in the docs.

Good point. Obviously I have never implemented the proposed hack...
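For reference, a rough sketch of the arithmetic behind the trick, assuming 4 KiB base pages, 512-entry tables, the PPN starting at bit 10 of an RV64 PTE, and a PPN of at most 44 bits. The function name is invented for illustration. As noted above, it only works if some table really does contain two adjacent leaf PTEs that map contiguous physical memory:

```c
#include <assert.h>
#include <stdint.h>

#define PTE_PPN_SHIFT 10                  /* PPN starts at bit 10 (assumed) */
#define PTE_PPN_MASK  ((1ULL << 44) - 1)  /* widest PPN assumed here */
#define PAGE_SHIFT    12                  /* 4 KiB base pages */
#define LEVEL_BITS    9                   /* 512 entries per table */

/*
 * pte_lo and pte_hi are two adjacent leaf PTEs found `depth` tables down
 * from the root (root table = depth 1). Returns 39 or 48, or 0 if the PPN
 * delta is not a valid leaf size or the result is not Sv39/Sv48.
 */
unsigned infer_va_bits(uint64_t pte_lo, uint64_t pte_hi, unsigned depth)
{
    assert(depth >= 1);
    uint64_t delta = ((pte_hi >> PTE_PPN_SHIFT) & PTE_PPN_MASK) -
                     ((pte_lo >> PTE_PPN_SHIFT) & PTE_PPN_MASK);
    unsigned below = 0; /* table levels skipped by a leaf of this size */

    /* delta must be 1, 512, 512^2 or 512^3 base pages */
    for (uint64_t span = 1; span != delta; span <<= LEVEL_BITS) {
        if (span > delta || below == 3)
            return 0;
        below++;
    }

    unsigned levels = depth + below;
    if (levels != 3 && levels != 4)
        return 0;
    return PAGE_SHIFT + LEVEL_BITS * levels;
}
```

For example, two 1 GiB leaves in the root table give depth 1 plus 2 skipped levels, so 3 levels and 39 bits. A single populated PTE per table, as in the coreboot case described above, gives the function nothing to work with.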

>
> The page table design is great but we need to communicate more info to the
> supervisor than I see today, and I have yet to see anything that rises to
> the level of needing an SBI call. Could we just define the prototype of
> main() as follows:
> void main(void *config_string, void *fdt)
>
> and let firmware/hypervisor set those things up? I don't see this stated
> anywhere but it would be useful.

I've been thinking of something along those lines. It's fundamentally
not too different than Michael Clark's proposal of using the ELF AUX
vector. But if we can get away with only passing the config string,
this would be the simplest ABI.

>
> ron


ron minnich

Nov 4, 2016, 3:46:35 PM11/4/16

to Michael Clark, Andrew Waterman, Jacob Bachmeyer, Samuel Falvo II, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

I'll get to your ELF suggestion after a short diversion, I promise.

As I've been learning, pointers are problematic between machine and supervisor modes. I just today pushed a CL for coreboot, as in the previous few years of RISCV work we had just naively created a 1:1 virtual address map for RAM. Oops. As of now, to start, we're mapping the first 2G of RAM (which is the 2G to 4G range) to the high 2G of virtual memory, i.e. the top of the negative address space.

This is all old news to all of you, but to this recovering x86 hacker, starting a kernel with the right virtual address is nice.
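To make the arithmetic concrete, here is a small sketch of that mapping, assuming RAM occupies physical 0x80000000 to 0xFFFFFFFF and lands in the top 2 GiB of a 64-bit virtual address space. The constant and function names are made up for illustration and are not taken from coreboot:

```c
#include <assert.h>
#include <stdint.h>

#define RAM_PHYS_BASE 0x0000000080000000ULL /* RAM starts at 2 GiB */
#define RAM_MAP_SIZE  0x0000000080000000ULL /* 2 GiB are mapped */
#define RAM_VIRT_BASE 0xFFFFFFFF80000000ULL /* top 2 GiB of the VA space */

/* Translate a physical RAM address into the high ("negative") mapping. */
uint64_t phys_to_virt(uint64_t pa)
{
    assert(pa >= RAM_PHYS_BASE && pa - RAM_PHYS_BASE < RAM_MAP_SIZE);
    return pa - RAM_PHYS_BASE + RAM_VIRT_BASE;
}

/* Inverse translation for addresses inside the high mapping. */
uint64_t virt_to_phys(uint64_t va)
{
    assert(va >= RAM_VIRT_BASE);
    return va - RAM_VIRT_BASE + RAM_PHYS_BASE;
}
```

With these particular constants the virtual address is simply the 32-bit physical address sign-extended to 64 bits.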

It does mean, however, that configuration information (as from firmware to kernel) should follow a few well-known rules. One of those rules is that you don't want to pass a lot of pointer-filled structs, because either firmware will have to translate it for H and S mode (and screw it up) or S and H mode will have to translate M mode addresses (and screw it up).

Simple self-contained pointer-free formats like the config string, or u-boot FDT, or coreboot tables, are great. You hand *one* or at most *two* pointers to pass in a0 and a1, and that's it. Those are easy to turn into pointers. I realize the current config string definition has issues, but I like the idea of a string.

Someone will now mention ACPI. ACPI has lots of defects as a way to communicate information from firmware to kernel, and I'd just as soon we avoid it if possible. We can maybe discuss that on another thread, since this thread began as "how do I communicate 1 bit of information from firmware to kernel". And I do mean one bit, literally :-)

But as attractive as your ELF proposal is, when I look at your repo, I see a struct chock-full of pointers because it's emulating a process stack. Firmware is not a kernel, and a kernel is not a process, and I think the analogy is necessarily imperfect and does not work, based on 7 years of experience with it.

For the first few years of LinuxBIOS existence, starting in 1999, we settled on ELF as the standard format for payloads and booting (part of what made kexec the way it was, as it happens; ebeiderman hoped we could settle on ELF as the standard format for starting everything). We dropped ELF in 2006 as our standard firmware payload format for a simple reason: people keep messing up ELF parsers and compilers keep messing up creating ELF files (search for openbsd elf exploit -- if those guys can't get this perfect, what hope have I). There's a lot of subtle issues in parsing ELF and it's easy to write seemingly perfect ELF parsers or generators that have bad problems. You don't want to learn that after you just flashed thousands of machines with a new firmware image. You want all that checking done first, and then translated into a simple format, then burned into flash. We decided we did not want to parse or generate ELF in firmware and dropped that format in 2006. We replaced it with a far simpler format called SELF (Simple ELF).

So, as interesting as all these ideas revolving around ELF annotations are, ultimately I don't think they're the right way to go for passing info from firmware to hypervisor/kernel.

thanks

ron

Alex Elsayed

Nov 4, 2016, 3:55:26 PM11/4/16

to isa...@groups.riscv.org

On Friday, 4 November 2016 19:46:22 PDT ron minnich wrote:

<snip>

> For the first few years of LinuxBIOS existence, starting in 1999, we settled
> on ELF as the standard format for payloads and booting (part of what made
> kexec the way it was, as it happens; ebeiderman hoped we could settle on
> ELF as the standard format for starting everything). We dropped ELF in 2006
> as our standard firmware payload format for a simple reason: people keep
> messing up ELF parsers and compilers keep messing up creating ELF files
> (search for openbsd elf exploit -- if those guys can't get this perfect,
> what hope have I). There's a lot of subtle issues in parsing ELF and it's
> easy to write seemingly perfect ELF parsers or generators that have bad
> problems. You don't want to learn that after you just flashed thousands of
> machines with a new firmware image. You want all that checking done first,
> and then translated into a simple format, then burned into flash. We
> decided we did not want to parse or generate ELF in firmware and dropped
> that format in 2006. We replaced it with a far simpler format called SELF
> (Simple ELF).
>
>
> So, as interesting as all these ideas revolving around ELF annotations are,
> ultimately I don't think they're the right way to go for passing info from
> firmware to hypervisor/kernel.

So, this is fascinating info I'd not seen before! Would you mind giving some
additional details on SELF? In particular, is it a subset of ELF (such that a
valid SELF is a valid ELF), or is it disjoint?

ron minnich

Nov 4, 2016, 5:03:29 PM11/4/16

to Alex Elsayed, isa...@groups.riscv.org

Here's a bit more info. I reworked the coreboot ELF parsing a few years ago, and the parsing bits with a long note are here.

https://github.com/coreboot/coreboot/blob/fec0328c5f653233859d4aec7dae0b94acb67e97/util/cbfstool/elfheaders.c

We'd made mistakes in the ELF parser, resulting in endianness and word size issues which we hit hard on the ARM V8 port. Note that the 'mistakes' ELF parser was in use for over 5 years and the problems that cropped up only showed up when we added ARM V8. It's really easy to write broken ELF parsers, or to break good ones. I've lost track of how many broken ELF parsers I've had to deal with, and I'm hardly an expert. That's why I would rather not see them in firmware :-)
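As an illustration of where word size and endianness come in, here is a minimal sketch that reads only the ELF identification bytes and e_machine, decoding the latter byte by byte in the file's own byte order instead of casting the buffer to a host struct. The offsets and class/data values are the standard ELF ones; the struct and function names are invented for this example and do not come from cbfstool:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct elf_ident {
    int is_64bit;      /* EI_CLASS: 1 = ELFCLASS32, 2 = ELFCLASS64 */
    int is_big_endian; /* EI_DATA:  1 = ELFDATA2LSB, 2 = ELFDATA2MSB */
    uint16_t machine;  /* e_machine, decoded in file byte order */
};

/* Returns 0 on success, -1 if the buffer is too short or malformed. */
int elf_read_ident(const uint8_t *buf, size_t len, struct elf_ident *out)
{
    assert(buf != NULL && out != NULL);
    if (len < 20) /* need e_ident[16], e_type, e_machine */
        return -1;
    if (buf[0] != 0x7f || buf[1] != 'E' || buf[2] != 'L' || buf[3] != 'F')
        return -1;
    if (buf[4] != 1 && buf[4] != 2) /* EI_CLASS is at offset 4 */
        return -1;
    if (buf[5] != 1 && buf[5] != 2) /* EI_DATA is at offset 5 */
        return -1;

    out->is_64bit = (buf[4] == 2);
    out->is_big_endian = (buf[5] == 2);

    const uint8_t *m = buf + 18; /* e_machine is at offset 18 in both classes */
    out->machine = out->is_big_endian ? (uint16_t)((m[0] << 8) | m[1])
                                      : (uint16_t)((m[1] << 8) | m[0]);
    return 0;
}
```

Everything after e_machine changes size with EI_CLASS, so a parser that hardcodes one layout or the host byte order can look fine until a port to a different word size or endianness exposes it.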

The various structs are here: https://github.com/coreboot/coreboot/blob/master/src/commonlib/include/commonlib/cbfs_serialized.h

Fortunately Vladimir ("Mr. Grub") reviewed my code as I wrote it and with luck there are not too many remaining errors. But, key point, even if there are mistakes in the ELF parsing code, *those errors are not in shipping firmware*.

The various bits of coreboot are assembled into a rom image by the cbfstool in util/cbfstool. The romstage is actually a small file system like image that contains 'files' loaded at different times. I believe cbfs is now recursive: cbfs images can contain cbfs images.

cbfs is placed so it contains the POR vector, and a small bootblock is placed so that at POR the processor can jump to it. This means, for example, the x86 bootblock must be in top 64k of flash because x86 starts out in 16-bit mode at 0xffff0. Control is transferred from bootblock to the 'romstage', which trains or sets up RAM; a ramstage is copied to RAM, then loads the payload (linux kernel, chromeos bootloader, harvey kernel, grub, whatever) and starts it. The stages are described in ram with this struct:

struct cbfs_stage {
    uint32_t compression;  /** Compression type */
    uint64_t entry;        /** entry point */
    uint64_t load;         /** Where to load in memory */
    uint32_t len;          /** length of data to load */
    uint32_t memlen;       /** total length of object in memory */
} __attribute__((packed));

which is pretty easy to parse. The alignments and such are optimized for the CPU in use.
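A sketch of what parsing such a header could look like, assuming the five fields are stored back to back in little-endian order (an assumption for this example; check the coreboot headers for the actual on-flash byte order). The struct and function names below are hypothetical and are not coreboot's:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define STAGE_HDR_LEN 28 /* 4 + 8 + 8 + 4 + 4 bytes, packed */

/* Host-side copy of the header fields (naturally aligned, not packed). */
struct stage_info {
    uint32_t compression;
    uint64_t entry;
    uint64_t load;
    uint32_t len;
    uint32_t memlen;
};

/* Read an nbytes-wide little-endian integer, one byte at a time. */
static uint64_t get_le(const uint8_t *p, unsigned nbytes)
{
    uint64_t v = 0;
    while (nbytes--)
        v = (v << 8) | p[nbytes];
    return v;
}

/* Returns 0 on success, -1 if the header or its payload does not fit. */
int stage_header_parse(const uint8_t *buf, size_t buflen, struct stage_info *out)
{
    assert(buf != NULL && out != NULL);
    if (buflen < STAGE_HDR_LEN)
        return -1;

    out->compression = (uint32_t)get_le(buf + 0, 4);
    out->entry       = get_le(buf + 4, 8);
    out->load        = get_le(buf + 12, 8);
    out->len         = (uint32_t)get_le(buf + 20, 4);
    out->memlen      = (uint32_t)get_le(buf + 24, 4);

    /* The stored payload must fit in what follows the header. */
    if (out->len > buflen - STAGE_HDR_LEN)
        return -1;
    return 0;
}
```

Decoding field by field avoids depending on host alignment or byte order, which is the class of problem described above for the ELF parser.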

Ramstage passes coreboot tables to the payload. In my heart of hearts I wish we'd just use a text protobuf, or a JSON string, or a string for init; but if not that, coreboot tables, and if not that FDT, and if not that, ABA -- anything but ACPI. ACPI is just too defective a design to stick on a neat architecture like RISCV.

I'll look at the doc that was mentioned. coreboot is open source (GPL V2), in use on over 20M systems at this point, and runs on 5 architectures on real products -- riscv would be the 6th, power9 the 7th. We've got a lot of knowledge and experience we'd like to bring to the riscv community.

Hope this helps.

ron

Jacob Bachmeyer

Nov 4, 2016, 6:38:27 PM11/4/16

to Andrew Waterman, ron minnich, Samuel Falvo II, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

Andrew Waterman wrote:
> On Fri, Nov 4, 2016 at 8:45 AM, ron minnich <rmin...@gmail.com> wrote:
>
>> The page table design is great but we need to communicate more info to the
>> supervisor than I see today, and I have yet to see anything that rises to
>> the level of needing an SBI call. Could we just define the prototype of
>> main() as follows:
>> void main(void *config_string, void *fdt)
>>
>> and let firmware/hypervisor set those things up? I don't see this stated
>> anywhere but it would be useful.
>>
>
> I've been thinking of something along those lines. It's fundamentally
> not too different than Michael Clark's proposal of using the ELF AUX
> vector. But if we can get away with only passing the config string,
> this would be the simplest ABI.
>

The problem with passing the config string to the supervisor's entry
point, whether as an argument or in the ELF auxv, is that device hotplug
could *change* the config string while the system is running. Further,
some supervisors expect other information, like the kernel command line
in Linux. I suggest keeping the standard "main(int argc, char ** argv,
char ** envp)" prototype and making the configuration string accessible
through SBI calls (the minimal form of the "SBI virtio" I proposed
earlier is one way to do this--I have since found that "virtio" was a
poor choice of name for that proposal). I still do not know what use
the environment would be to a supervisor, but it seems reasonable to
have. Perhaps the initial configuration string could be the first entry
in the environment?

-- Jacob

Michael Clark

Nov 4, 2016, 6:51:40 PM11/4/16

to jcb6...@gmail.com, Andrew Waterman, ron minnich, Samuel Falvo II, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

We understand there is a hotplug use case, and the config string can describe MMIO regions of devices present at boot time.

- hotplug use case
- config string describes MMIO regions of devices present at boot time
- this will include an interrupt controller that is Virtual or Physically optimised
- outbound signalling for interrupt acknowledge is easier using a CSR than an MMIO due to the 4K page dirty handling

I’m trying to see what requirement makes us implement some dangerous proc filesystem type interface in the SBI (versus in the Hypervisor).

- device plug could be an interrupt on one of the devices set up at boot
- this would be similar to dynamically discoverable hardware i.e. PCI host bridge, or USB host bridge
- a Virtual host bridge could receive interrupts for virtual device hotplug
- this supports virtual PCI and or USB plugging

The reason why I am pushing this way is to reduce the API attack surface on the SBI interface, especially anything more complex than interrupt and circular ring buffers (mimicking hardware). I don’t think we want to add a proc_fs type interface to the SBI. I think that would be bad.
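To show how small the "interrupt plus circular ring buffer" surface is, here is a minimal single-producer, single-consumer ring sketch. The message layout and all names are hypothetical, and a real shared-memory ring between a hypervisor and a guest would also need memory fences around the index updates, which are left out here:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RING_SLOTS 8u /* must be a power of two */

/* Hypothetical hotplug message: what happened, and to which device. */
struct hotplug_msg {
    uint32_t event;
    uint32_t device_id;
};

struct hotplug_ring {
    uint32_t head; /* next slot to write; only the producer changes it */
    uint32_t tail; /* next slot to read; only the consumer changes it */
    struct hotplug_msg slot[RING_SLOTS];
};

/* Producer side (e.g. the host bridge). Returns -1 when the ring is full. */
int ring_push(struct hotplug_ring *r, struct hotplug_msg m)
{
    assert(r != NULL);
    if (r->head - r->tail == RING_SLOTS)
        return -1;
    r->slot[r->head % RING_SLOTS] = m;
    r->head++; /* publish after the slot is written */
    return 0;
}

/* Consumer side (e.g. the guest's interrupt handler). Returns -1 if empty. */
int ring_pop(struct hotplug_ring *r, struct hotplug_msg *out)
{
    assert(r != NULL && out != NULL);
    if (r->head == r->tail)
        return -1;
    *out = r->slot[r->tail % RING_SLOTS];
    r->tail++;
    return 0;
}
```

The indices are free-running 32-bit counters, so full and empty are distinguished without wasting a slot, and unsigned wraparound is harmless because the slot count divides 2^32.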

Michael.

Michael Clark

Nov 4, 2016, 6:55:35 PM11/4/16

to jcb6...@gmail.com, Andrew Waterman, ron minnich, Samuel Falvo II, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

On 5 Nov 2016, at 11:51 AM, Michael Clark <michae...@mac.com> wrote:

On 5 Nov 2016, at 11:38 AM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

Andrew Waterman wrote:
On Fri, Nov 4, 2016 at 8:45 AM, ron minnich <rmin...@gmail.com> wrote:

The page table design is great but we need to communicate more info to the
supervisor than I see today, and I have yet to see anything that rises to
the level of needing an SBI call. Could we just define the prototype of
main() as follows:
void main(void *config_string, void *fdt)

and let firmware/hypervisor set those things up? I don't see this stated
anywhere but it would be useful.

I've been thinking of something along those lines. It's fundamentally
not too different than Michael Clark's proposal of using the ELF AUX
vector. But if we can get away with only passing the config string,
this would be the simplest ABI.

The problem with passing the config string to the supervisor's entry point, whether as an argument or in the ELF auxv, is that device hotplug could *change* the config string while the system is running. Further, some supervisors expect other information, like the kernel command line in Linux. I suggest keeping the standard "main(int argc, char ** argv, char ** envp)" prototype and making the configuration string accessible through SBI calls (the minimal form of the "SBI virtio" I proposed earlier is one way to do this--I have since found that "virtio" was a poor choice of name for that proposal). I still do not know what use the environment would be to a supervisor, but it seems reasonable to have. Perhaps the initial configuration string could be the first entry in the environment?

We understand there is a hotplug use case, and the config string can describe MMIO regions of devices present at boot time.

- hotplug use case
- config string describes MMIO regions of devices present at boot time
- this will include an interrupt controller that is Virtual or Physically optimised
- outbound signalling for interrupt acknowledge is easier using a CSR than an MMIO due to the 4K page dirty handling

being very brief here. i.e. with respect to Virtual and Physical PLIC. e.g. CSR versus MMIO register (for outbound message signalled interrupt acknowledge).

I think I had an earlier proposal about mfromhost/mtohost becoming mvhisend/mvhirevc including retaining the CSR number for 1.7 compat. i.e. PV + PVHVM + HVM. A full implementation would implement the PLIC mmio region but use mvhisend/mvhirevc for faster signalling (ecall entry cost vs full page fault, and detecting PLIC address in a page fault).

I’m trying to see what requirement makes us implement some dangerous proc filesystem type interface in the SBI (versus in the Hypervisor).

- device plug could be an interrupt on one of the devices set up at boot
- this would be similar to dynamically discoverable hardware i.e. PCI host bridge, or USB host bridge
- a Virtual host bridge could receive interrupts for virtual device hotplug
- this supports virtual PCI and or USB plugging

The reason why I am pushing this way is to reduce the API attack surface on the SBI interface, especially anything more complex than interrupt and circular ring buffers (mimicking hardware). I don’t think we want to add a proc_fs type interface to the SBI. I think that would be bad.

Michael.


ron minnich

Nov 4, 2016, 6:57:13 PM11/4/16

to jcb6...@gmail.com, Andrew Waterman, Samuel Falvo II, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

On Fri, Nov 4, 2016 at 3:38 PM Jacob Bachmeyer <jcb6...@gmail.com> wrote:

The problem with passing the config string to the supervisor's entry
point, whether as an argument or in the ELF auxv, is that device hotplug
could *change* the config string while the system is running.

I have not seen anything in the docs I have that hotplug would change the config string. I just looked again. I would expect any kernel to make a copy of the config string for many reasons:

o firmware may modify the string, but it may not. Generally you want to copy config tables and strings out of where they are, as it could be slow memory or a SPI controller doing slow SPI cycles under the hood.

o The kernel wants to interpret the string, and is likely going to create a modified or interpreted version at the very least.

I don't see this hotplug problem occurring in practice. Any well designed kernel is going to copy that data to a safe place.
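A minimal sketch of the "copy it to a safe place" step, assuming the firmware hands over a NUL-terminated string and the kernel keeps a fixed-size private buffer. The buffer size and function names are invented for illustration:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define CONFIG_COPY_MAX 4096 /* assumed upper bound, including the NUL */

static char config_copy[CONFIG_COPY_MAX];

/*
 * Copy the firmware's config string into kernel-owned memory.
 * Returns the string length, or -1 if it does not fit (in which case
 * the previously saved copy is left untouched).
 */
long config_string_save(const char *fw_string)
{
    assert(fw_string != NULL);
    size_t n = 0;

    /* Bounded length scan: never read past CONFIG_COPY_MAX bytes. */
    while (n < CONFIG_COPY_MAX && fw_string[n] != '\0')
        n++;
    if (n == CONFIG_COPY_MAX)
        return -1;

    memcpy(config_copy, fw_string, n + 1); /* include the terminator */
    return (long)n;
}

/* The kernel parses this copy, never the firmware's original. */
const char *config_string_get(void)
{
    return config_copy;
}
```

After the copy, later changes to the firmware's buffer (or slow accesses behind it) no longer affect the kernel's view.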

Further,
some supervisors expect other information, like the kernel command line
in Linux.

That's what multiboot is all about. Boot loaders generally provide that info, not firmware. And bootloaders should not be mucking with the firmware info. They're different things.

I suggest keeping the standard "main(int argc, char ** argv,
char ** envp)"

I just can't agree with this. [argc, argv] implies parsing has been done. It generally has not and probably should not have been. envp doesn't even make sense to me. And the aux vector is built around magic numbers designed to be meaningful to programs.

It just seems to me that [argv, argc, envp] is a use case that does not apply to the firmware to kernel interface.

But I have my biases too. I don't like firmware callbacks and would like to see them minimized. It's just another source of problems, always has been.

ron

Michael Clark

Nov 4, 2016, 8:04:51 PM11/4/16

to ron minnich, jcb6...@gmail.com, Andrew Waterman, Samuel Falvo II, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

Likewise. I agree. The linux kernel VDSO for kernel entry was vulnerable when it was at a fixed address. It is a known address for a syscall ROP. One just needs a gadget to get a7 with the correct syscall number.

We could almost avoid the SBI if we implemented the console as a virtual UART passed in the config string, using the PLIC (and potentially a Virtual PLIC variant that has the message signal interrupt acknowledge using a CSR vs MMIO). I remember writing an email about this a while back.

It’s possible to do this with circular ring buffers defined in the config string and a Virtual PLIC (fine tuned version of the Hardware PLIC with some registers moved to CSRs that are slower to emulate using page faults).

The VDSO is still present on my Debian image but I’m not sure if it includes syscall entry anymore? Perhaps it just contains the time? And it is at a randomised address.

mclark@minty:~/src/c+fmt$ ldd /bin/ls
	linux-vdso.so.1 (0x00007ffc34d24000)
	libselinux.so.1 => /lib/x86_64-linux-gnu/libselinux.so.1 (0x00007fa157c82000)
	libc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007fa1578e1000)
	libpcre.so.3 => /lib/x86_64-linux-gnu/libpcre.so.3 (0x00007fa15766e000)
	libdl.so.2 => /lib/x86_64-linux-gnu/libdl.so.2 (0x00007fa15746a000)
	/lib64/ld-linux-x86-64.so.2 (0x0000560fd57ad000)
	libpthread.so.0 => /lib/x86_64-linux-gnu/libpthread.so.0 (0x00007fa15724d000)

Jacob Bachmeyer

Nov 4, 2016, 8:05:45 PM11/4/16

to Michael Clark, Andrew Waterman, ron minnich, Samuel Falvo II, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

The interface that I proposed has very little attack surface. The
minimal case that provides access to the (hardware) configuration string
is nothing more than "copy a page from the configuration string to this
page-aligned buffer". The full case also offers very little for an
attacker: sbi_vio_attach() reads a (printable ASCII or UTF-8 text)
string from supervisor memory and can return NULL if anything is fishy.
The other synchronous calls all take sbi_vio handles returned from
sbi_vio_attach() and transfer data to or from buffers in supervisor
memory. Screwing this up would be very similar to an OS kernel that
screws up read(2) or write(2)--highly unlikely in my view.
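As a purely illustrative sketch of the minimal case, here is how a supervisor might fetch (or re-fetch after a hotplug event) the configuration string one page at a time. The reader callback stands in for the SBI call; its name, signature, and the "short page marks the end" convention are all assumptions made for this example, and mock_read_page is only a stand-in for the SEE side:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u

/* Hypothetical SEE call: copy page `index` of the config string into dst
 * and return how many bytes of that page are valid (0 past the end). */
typedef size_t (*config_page_reader)(unsigned index, uint8_t *dst);

/* Fetch the whole string into buf. Returns its length, or 0 if buf is
 * too small (an empty string also yields 0 in this sketch). */
size_t fetch_config_string(config_page_reader read_page, uint8_t *buf, size_t cap)
{
    assert(read_page != NULL && buf != NULL);
    size_t total = 0;

    for (unsigned i = 0; total + PAGE_SIZE <= cap; i++) {
        size_t got = read_page(i, buf + total);
        total += got;
        if (got < PAGE_SIZE)
            return total; /* short page: end of the string */
    }
    return 0;
}

/* Stand-in for the SEE side, so the sketch can run without firmware. */
const char *mock_config = "";

size_t mock_read_page(unsigned index, uint8_t *dst)
{
    size_t len = strlen(mock_config);
    size_t off = (size_t)index * PAGE_SIZE;
    if (off >= len)
        return 0;
    size_t n = (len - off < PAGE_SIZE) ? len - off : PAGE_SIZE;
    memcpy(dst, mock_config + off, n);
    return n;
}
```

The supervisor would call fetch_config_string once at boot and again whenever it is told the configuration changed. The only data crossing the interface is a page index and a destination buffer.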

The greatest attack surface is the asynchronous I/O interface, which
reads an iovec from supervisor memory describing multiple I/O
operations, each of which describes its own buffers (all in supervisor
memory). Block devices use linear per-request buffers--a block read is
"read page(s) X:X+N from device B into PPN Y:Y+N" (although the PPN is
obtained from the page tables--the iovec entry contains a page-aligned
supervisor virtual address). Stream devices use circular buffers for
async I/O, but the iovec entry, buffer descriptor, and actual buffer are
all distinct objects to allow the latter two to be semi-transparently
mapped as MMIO. This enables a device (such as a NIC) with its own
dual-port memory to directly offer its buffers to the central processor
for zero-copy packet processing, with the buffer descriptor similarly
aliased to device registers that track the buffer state, all
semi-transparently to a supervisor. (It is only semi-transparent
because the supervisor must map buffer descriptors and buffers as
indicated for direct access to actually be used, otherwise the SEE must
copy the data into the supervisor's buffer. There is a "hot buffer
descriptor" status flag in the iovec entry--if set after
sbi_vio_start_io() returns, updates to the buffer descriptor have
immediate effect, otherwise sbi_vio_start_io() must be called again to
ensure that they are effective.)

Asynchronous I/O can cause asynchronous page faults, if the supervisor
does not ensure that all active I/O buffers remain mapped while I/O
transactions are in progress. This is bad, but can also be considered a
malfunctioning supervisor, since async I/O is similar to DMA hardware,
and may actually be implemented by the SEE configuring DMA-capable I/O
hardware.

Asynchronous completion notification does not need an "interrupt
controller" at all, even if it is delivered as an "interrupt", which is
good, because emulating an interrupt controller is inefficient.

Your model with complex emulated hardware and device discovery
procedures is far more complex with far more attack surface on both
sides of the interface. It looks simpler at first glance, but emulating
hardware is much more complex than handling environment calls. Remember
the "VENOM" exploit on the QEMU FDC emulation? When was the last
similar exploit against open(2)/read(2)/write(2)/close(2) on any serious
kernel? (Or even on Windows, for that matter?)

Also, you seem to be slightly confused--if a hypervisor is present, the
hypervisor is the SEE and the hypervisor *provides* the SBI. SBI calls
*are* hypercalls.

-- Jacob

Michael Clark

Nov 4, 2016, 8:26:18 PM11/4/16

to ron minnich, Jacob Bachmeyer, Andrew Waterman, Samuel Falvo II, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

With a virtio host bridge defined in the config, hotplug of virtual devices would simply be an interrupt with a message in a circular buffer. That way we don’t need to expose any code entry points with special function pages like old 16-bit Amigas, which, now that I think twice about it, is bad. How did SBI get in there? This was something ARM did but have since removed? i.e. special utility function pages. These functions should be part of a userspace ABI, not placed into bare privileged processes. The only interface to the higher privilege level should be trap and interrupt, MMIO pages and CSRs.

The VDSO is dangerous:

http://v0ids3curity.blogspot.co.nz/2014/12/return-to-vdso-using-elf-auxiliary.html

There shouldn’t be any pages in a guest with code in them and ecall should be the only way to call the supervisor/hypervisor/monitor. There should only be MMIO pages or CSRs and interrupts for signalling outside of the guest.

This ELF start protocol is simple, and the kernel can wipe the scratch space after parsing the config or flattened device tree (there is string parsing in the kernel, but having the supervisor place code into a bare thread is IMHO a bad idea, given a chance to think twice about it). This of course didn’t exist when COM files were loaded into a common address space (the COM compat prefix still exists in modern EXE files, which is horrible).

We have to start the kernel, and it’s also not a 512-byte MBR with some asm in it. It should be something sane e.g.

    #include <elf.h>
    #include <stdio.h>

    /* AT_RISCV_CONFIG is the auxv tag proposed in this thread; <elf.h> does not define it. */

    int main(int argc, char **argv, char **envp)
    {
        Elf64_auxv_t *auxv;

        /* from stack diagram above: *envp == NULL marks the end of envp */
        while (*envp++ != NULL)
            ;

        /* auxv->a_type == AT_NULL marks the end of auxv */
        for (auxv = (Elf64_auxv_t *)envp; auxv->a_type != AT_NULL; auxv++) {
            if (auxv->a_type == AT_RISCV_CONFIG)
                printf("AT_RISCV_CONFIG is: 0x%lx\n", (unsigned long)auxv->a_un.a_val);
        }

        return 0;
    }

Jacob Bachmeyer

Nov 4, 2016, 8:27:34 PM11/4/16

to ron minnich, Andrew Waterman, Samuel Falvo II, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

ron minnich wrote:
> On Fri, Nov 4, 2016 at 3:38 PM Jacob Bachmeyer <jcb6...@gmail.com
> <mailto:jcb6...@gmail.com>> wrote:
>
>
> The problem with passing the config string to the supervisor's entry
> point, whether as an argument or in the ELF auxv, is that device
> hotplug
> could *change* the config string while the system is running.
>
>
> I have not seen anything in the docs I have that hotplug would change
> the config string. I just looked again. I would expect any kernel to
> make a copy of the config string for many reasons:
> o firmware may modify the string, but it may not. Generally you want
> to copy config tables and strings out of where they are, as it could
> be slow memory or a SPI controller doing slow SPI cycles under the hood.
> o The kernel wants to interpret the string, and is likely going to
> create a modified or interpreted version at the very least.
>
> I don't see this hotplug problem occurring in practice. Any well
> designed kernel is going to copy that data to a safe place.

Which is the exact problem: *after* a hotplug event, how does the
kernel get the new configuration string?

>
> Further,
> some supervisors expect other information, like the kernel command
> line
> in Linux.
>
>
> That's what multiboot is all about. Boot loaders generally provide
> that info, not firmware. And bootloaders should not be mucking with
> the firmware info. They're different things.

The model I advocate is that the bootloader is just another supervisor
and uses an SBI call to replace itself with the actual kernel.

>
> I suggest keeping the standard "main(int argc, char ** argv,
> char ** envp)"
>
>
> I just can't agree with this. [argc, argv] implies parsing has been
> done. It generally has not and probably should not have been. envp
> doesn't even make sense to me. And the aux vector is built around
> magic numbers designed to be meaningful to programs.

Only a bootloader would parse arguments--the first bootloader (read from
flash by the initial firmware) is given argc=0, argv=NULL. If you flash
a kernel in that slot, it does not get a command line.

The environment is something that I admit lacks an envisioned use. A
NULL pointer can be passed as envp if it is not implemented. I only
suggest it be included because (1) it is a very small increase in
complexity on top of allowing any arguments at all to be passed from
bootloader to kernel, (2) the RISC-V SEE concept in general seems to be
more "developed" than most current boot firmware, and (3) someone
smarter than me might have a good use for it.

What I advocate should be in the aux vector are bits of information
meaningful to programs and difficult to otherwise acquire--virtual
addresses for the initial page tables, (possibly, although I still
advocate the page-at-a-time "SBI virtio" approach) a virtual address for
a copy of the configuration string, (possibly, although I still advocate
moving it into sptbr) paging depth, and similar.

> It just seems to me that [argv, argc, envp] is a use case that does
> not apply to the firmware to kernel interface.

How about bootloader to kernel, which I propose use the same interface?

> But I have my biases too. I don't like firmware callbacks and would
> like to see them minimized. It's just another source of problems,
> always has been.

I agree with a distaste for firmware callbacks, but expect (perhaps
naively) RISC-V to "get it right this time", noting that no previous
firmware interface (except possibly coreboot, but it is extremely
minimal) has been openly developed like this.

-- Jacob

Samuel Falvo II

Nov 4, 2016, 8:30:50 PM11/4/16

to Michael Clark, ron minnich, Jacob Bachmeyer, Andrew Waterman, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

On Fri, Nov 4, 2016 at 5:26 PM, Michael Clark <michae...@mac.com> wrote:
> That way we don’t need to expose any code entry points with special function
> pages like old 16-bit Amigas, which now thinking twice about it is bad.

Former Amiga employee here. AmigaOS never supported paging, so I'm
not sure what you're referring to with respect to "special function
pages," or why precisely they're so bad.

Jacob Bachmeyer

Nov 4, 2016, 8:39:59 PM

to Michael Clark, ron minnich, Andrew Waterman, Samuel Falvo II, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

Michael Clark wrote:
> With a virtio host bridge defined in the config, then hotplug of
> virtual devices would simply be an interrupt with a message in a
> circular buffer. That way we don’t need to expose any code entry points
> with special function pages like old 16-bit Amigas, which now thinking
> twice about it is bad. How did SBI get in there? This was something
> ARM did but have since removed? i.e. special utility function pages.
> These functions should be part of a userspace ABI, not placed into
> bare privileged processes. The only interface to the higher privilege
> level should be trap and interrupt, MMIO pages and CSRs.
>
> The VDSO is dangerous:
>
> http://v0ids3curity.blogspot.co.nz/2014/12/return-to-vdso-using-elf-auxiliary.html
>
> There shouldn’t be any pages in a guest with code in them and ecall
> should be the only way to call the supervisor/hypervisor/monitor.
> There should only be MMIO pages or CSRs and interrupts for signalling
> outside of the guest.

In some ways, SBI is worse: SBI is always located at the top of the
address space, to allow it to be reached with JALR from x0. On the other
hand, there are SBI functions that are likely to be on hot paths and
allowing those to run entirely in S-mode if the hardware can support it
is a good idea. Put simply, a sane SEE must be careful with what it
puts in the SBI to avoid making useful ROP gadgets.

> This ELF start protocol is simple, and the kernel can wipe the scratch
> space after parsing the config or flattened device tree (there is
> string parsing in the kernel, but having the supervisor place code
> into a bare thread is IMHO a bad idea, given a chance to think twice
> about it). This of course didn’t exist when COM files were loaded
> into a common address space (the COM compat prefix still exists in
> modern EXE files, which is horrible).

I proposed that the SEE wipe all memory not part of the new supervisor
in sbi_sexec().

-- Jacob

Michael Clark

Nov 4, 2016, 8:58:48 PM

to Samuel Falvo II, ron minnich, Jacob Bachmeyer, Andrew Waterman, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

Well, they were around in a time when executables were trusted, i.e. when we had POP sending plaintext passwords over unencrypted sockets. Times have changed. We have to change with the times. Nothing against Amiga, but protected (versus paged) memory is important these days. Paging is not strictly necessary for all applications. Segmentation can be more useful.

Michael Clark

Nov 4, 2016, 9:04:08 PM

to RISC-V ISA Dev, ron minnich, Jacob Bachmeyer, Andrew Waterman, Paolo Bonzini, Stefan O'Rear, Samuel Falvo II

With this approach or “something different but along the same lines”, there is nothing constant: we don’t have to define what SP is nor any of the MMIO ranges, only the symbolic meaning of the parameters for devices, i.e. any IO region can “channel shift”.

It’s fine if it’s not the ELF aux vector; however, there are two pointers (they could be two registers in fact, which would be simpler, but not symmetrical):

AT_RISCV_CONFIG

AT_PHDR

We can place boot entropy in the remaining registers, and get command line, NVRAM, config, etc, but we also want to get a pointer to the start of the ELF. We may use .ramdisk. or .signature NOTES so we want to get the PHdr.

We can forget about AUX vectors and, instead of using SP, use A0 and A1; however, this drops the number of bits of entropy, i.e. we only need to use one register. Sure we could invent something new…

Samuel Falvo II

Nov 4, 2016, 9:16:57 PM

to Jacob Bachmeyer, Michael Clark, ron minnich, Andrew Waterman, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

On Fri, Nov 4, 2016 at 5:39 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> In some ways, SBI is worse: SBI is always located at the top of the address
> space, to allow it to be reached with JALR from x0. On the other hand, there
> are SBI functions that are likely to be on hot paths and allowing those to
> run entirely in S-mode if the hardware can support it is a good idea. Put
> simply, a sane SEE must be careful with what it puts in the SBI to avoid
> making useful ROP gadgets.

AAHH, I see where Michael was going now. He's referring to AmigaOS
"libraries," which are statically-linked binary blobs which expose a
jump table to the rest of the system for interoperability.

While custom Amiga software could provide its own jump tables if
desired (I almost always did this), please note that AmigaOS also
supported a mode where the exec.library built the jump tables for you.
This was required to support placing libraries into ROM, because
nobody at Commodore could predict where in the ROM image a library
would end up. This is, essentially, a form of boot-time linking.

The same method (trusted software building jump table into trusted
code) can be used here in the SEE. If you want to permit an
extensible SBI, you have virtually no choice *but* to support
jump-tables to arbitrary executables. AmigaOS libraries are
"insecure" because of two reasons: SetFunction() call to hot-patch
jump tables post-boot, and the fact that applications can tweak the
exec.library's Library list. The former can be prevented by simply
NOT offering such a function in the SEE, and the second isn't an issue
since the SEE (implemented via M- or H-mode software) is isolated from
the S-mode software, and by extension, U-mode as well.

To summarize:

1. The RISC-V jump tables don't have to be actual jump instructions
to S-mode code. You can very easily make each "library vector offset"
(LVO) 8 bytes long, supporting an ECALL instruction in the first slot,
and a 32-bit function ID in the second. This allows the SEE to be the
sole provider of SBI jump tables if you want. It also frees up a
register that would normally be used to specify the function ID.

2. The SEE can be solely responsible for building these jump tables,
just as AmigaOS was for supporting ROM-resident libraries. This is
not entirely unlike asking a COM object if it supports an interface.
"Do you support feature foo?" "Yes, here's the jump table I've
constructed that lets you use it." Can S-mode software hack this
table? Sure; it's in S-mode accessible/executable memory. But, since
each LVO is an ECALL, it *cannot* break out of its S-mode sandbox.
Unless, of course, the SEE is built to do otherwise, but then as you
say, that's not secure, and the SEE itself is to blame.

3. The "openness" of the set of SBIs offered can be controlled entirely
by the SEE; if you ONLY expose libraries supported by the SEE, then
that's all that you get.
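
The slot layout from point 1 can be sketched in C. This is only an illustration under stated assumptions: each LVO is 8 bytes, the first word is the ECALL encoding (0x00000073), and the second word is a function ID that the SEE reads as data. The names lvo_slot, lvo_build and lvo_function_id are hypothetical and not part of any SBI specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define RV_ECALL 0x00000073u  /* encoding of the ECALL instruction */

/* One 8-byte "library vector offset": an ECALL followed by a data word. */
struct lvo_slot {
    uint32_t insn;     /* always RV_ECALL */
    uint32_t func_id;  /* never executed; read by the SEE on trap */
};

/* SEE side: fill a table with one slot per exported function ID. */
static void lvo_build(struct lvo_slot *table, const uint32_t *ids, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        table[i].insn = RV_ECALL;
        table[i].func_id = ids[i];
    }
}

/* SEE trap-handler side: epc points at the ECALL, the ID follows it. */
static uint32_t lvo_function_id(const void *epc)
{
    const struct lvo_slot *slot = epc;
    assert(slot->insn == RV_ECALL);  /* trap must come from a slot */
    return slot->func_id;
}
```

Because the word after the ECALL is data rather than an instruction, a SEE using this layout would have to return to the caller's link register instead of resuming at the trap PC plus 4.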

If you're concerned with someone patching an LVO to call back into
S-mode in a kind of MitM attack, be aware that you could do that
anyway with any other kind of SBI call, as long as the interface sat
in writeable RAM. I think the original intent was for SBI's jump table
to be in ROM, but honestly, any method of making the SBI pages
non-writeable even to S-mode would work to make this secure. This can
be through special PMP-protected regions of RAM, for example (arguably
the simplest solution; let the MMU translate the page to a physical
address, then use memory-type CSRs accessible ONLY to M-mode software
to write-protect regions of physical address space).

Michael Clark

Nov 4, 2016, 9:17:46 PM

to jcb6...@gmail.com, ron minnich, Andrew Waterman, Samuel Falvo II, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

> On 5 Nov 2016, at 1:39 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
> Michael Clark wrote:
>> With a virtio host bridge defined in the config, then hotplug of virtual devices would simply be an interrupt with a message in a circular buffer. That way we don’t need to expose any code entry points with special function pages like old 16-bit Amigas, which now thinking twice about it is bad. How did SBI get in there? This was something ARM did but have since removed? i.e. special utility function pages. These functions should be part of a userspace ABI, not placed into bare privileged processes. The only interface to the higher privilege level should be trap and interrupt, MMIO pages and CSRs.
>>
>> The VDSO is dangerous:
>>
>> http://v0ids3curity.blogspot.co.nz/2014/12/return-to-vdso-using-elf-auxiliary.html
>>
>> There shouldn’t be any pages in a guest with code in them and ecall should be the only way to call the supervisor/hypervisor/monitor. There should only be MMIO pages or CSRs and interrupts for signalling outside of the guest.
>
> In some ways, SBI is worse: SBI is always located at the top of the address space, to allow it to be reached with JALR from x0. On the other hand, there are SBI functions that are likely to be on hot paths and allowing those to run entirely in S-mode if the hardware can support it is a good idea. Put simply, a sane SEE must be careful with what it puts in the SBI to avoid making useful ROP gadgets.

There is an understandable need to encapsulate IPI and console; however, one approach is to have a lib that can access struct riscv_config* to find the UART, MMIO address or CSR, i.e. the hardware-specific method of implementing IPI and console writes.

The other approach is to link the SBI to the kernel. By its nature, it is the part of the code that is not running in the monitor.

It’s an interesting problem. I’m now thinking of SBI as riscv-sbi.so much like linux-vdso.so.

The escalation is from user to root, root to kernel, and then once in kernel we have SBI access X page at a fixed location. If we had to hypercall, then the ecall instances would be at unknown offsets.

The problem is the arity: ecall type and id. Could be a trap, could be implemented in hardware. There could even be hardware ecall variants that use register state; however, they would force the processor to serialise.

Samuel Falvo II

Nov 4, 2016, 9:18:36 PM

to Michael Clark, ron minnich, Jacob Bachmeyer, Andrew Waterman, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

On Fri, Nov 4, 2016 at 5:58 PM, Michael Clark <michae...@mac.com> wrote:
> Well, they were around in a time when executables were trusted, i.e. when we had POP sending plaintext passwords over unencrypted sockets. Times have changed. We have to change with the times. Nothing against Amiga, but protected (versus paged) memory is important these days. Paging is not strictly necessary for all applications. Segmentation can be more useful.

Yes. Are you aware that AmigaOS 4.x is memory protected? ;) Granted
it needs a PowerPC to run, but still; Linux is not the sole purveyor
of protected memory runtimes. ;)

Michael Clark

Nov 4, 2016, 9:18:36 PM

to Samuel Falvo II, Jacob Bachmeyer, ron minnich, Andrew Waterman, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

Yes. Exactly! The SBI is a library.

Michael Clark

Nov 4, 2016, 9:47:59 PM

to Samuel Falvo II, ron minnich, Jacob Bachmeyer, Andrew Waterman, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

and FreeBSD, and macOS, and Windows, and even Solaris to some degree… don’t know why the NSA went after Solaris…

https://techcrunch.com/2016/08/16/everything-you-need-to-know-about-the-nsa-hack-but-were-afraid-to-google/

but now we’ve changed the SBI to a library, it all makes much better sense. i.e. it can be linked into the Operating System and accessed via a relative address. It’s not such a big change. It just needs a LICENSE that allows it to remain compatible with BSD i.e. like the Dual GPL/BSD licensed code in the ACPI stack.

We are no longer putting X pages at known addresses into our “bare” process.

I’d prefer that the lights were out, we had X or R only ELF mapped at some random location (with linked SBI) in a 128-bit address space and we can load a 112-bit hypercall key from the SBI via a relative call to some X-only page, that was emitted randomised by the privilege level above that linked our process into the address space. A 4K page that is mapped in the PT_LOAD hole between DATA and TEXT in our ELF, which is at a completely random location. We just have SP (which gets us the command line, ENV, AT_PHDR, AT_RISCV_CONFIG) and boot register entropy, and we feel our way from there…

The physical address space of a supervisor of course will have some offsets that will be discoverable if we are started with the correct value of SP. It’s a good question as to whether we are responsible for parsing the config and setting up PTEs for these mappings. If we are using virtual addresses, Uart IO may need MMIO pages mapped for the SBI to actually work.

I imagine it’s up to the kernel to parse the config and then set up page tables to map physical IO to whatever virtual address it prefers.

A kernel doesn’t need to know anything fixed other than how much scratch space is in the initial stack frame. 4K might be too small. 2M would mean one megapage for sv39? We could communicate the value in the auxiliary vector.

--

TOP OF STACK

AUX_DATA

AT_PHDR /* pointer to loaded ELF PHdr (guest) */

AT_RV_CONFIG /* pointer to config (host) */

AT_RV_SCRATCH /* scratch stack frame size */

ENV /* NVRAM */

ARGV /* boot command line */

ARGC

SP
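
For illustration, here is a rough C sketch of how a kernel entry stub might walk that layout. It assumes the usual System V-style start convention (argc at SP, then the NULL-terminated argv and envp arrays, then type/value pairs ending in AT_NULL). AT_NULL and AT_PHDR use the standard ELF values; the numeric values for AT_RV_CONFIG and AT_RV_SCRATCH, and the names boot_info and parse_initial_stack, are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define AT_NULL        0        /* end of vector (standard ELF value) */
#define AT_PHDR        3        /* program headers (standard ELF value) */
#define AT_RV_CONFIG   0x5201   /* hypothetical tag value */
#define AT_RV_SCRATCH  0x5202   /* hypothetical tag value */

struct boot_info {
    uintptr_t argc;
    char **argv;
    char **envp;
    uintptr_t phdr, config, scratch;  /* 0 if the tag was absent */
};

/* Walk the words at the initial SP: argc, argv[], NULL, envp[], NULL, auxv. */
static void parse_initial_stack(uintptr_t *sp, struct boot_info *out)
{
    assert(sp != NULL && out != NULL);
    uintptr_t *p = sp;
    out->argc = *p++;
    out->argv = (char **)p;
    p += out->argc + 1;          /* skip argv entries and their NULL */
    out->envp = (char **)p;
    while (*p++ != 0)            /* skip envp entries and their NULL */
        ;
    out->phdr = out->config = out->scratch = 0;
    for (; p[0] != AT_NULL; p += 2) {   /* (type, value) pairs */
        switch (p[0]) {
        case AT_PHDR:       out->phdr = p[1];    break;
        case AT_RV_CONFIG:  out->config = p[1];  break;
        case AT_RV_SCRATCH: out->scratch = p[1]; break;
        default:            break;      /* ignore unknown tags */
        }
    }
}
```

Nothing here depends on a fixed address: everything is found relative to the SP value the kernel was started with.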

Alex Elsayed

Nov 4, 2016, 9:49:18 PM

to isa...@groups.riscv.org

IMO, the config string changing at runtime would be an *insane* thing to do:

- There's no mechanism to alert S-mode that this has happened
- S-mode would not be able to tell what had changed unless it had saved the
prior version to diff against
- In Device Tree, hotplug is handled in two ways:
- On discoverable buses, where the bus itself is in device tree, and
normal discovery proceeds from there. No change needed to DT on hotplug.
- On non-discoverable buses, via Device Tree Overlays, which are separate
files. No change needed to DT on hotplug.

It would be error-prone, racy, and worst of all _unnecessary_.

Alex Elsayed

Nov 4, 2016, 10:02:12 PM

to isa...@groups.riscv.org

This would make sbi_sexec() useless for various use cases that the nearest
construct (kexec) is currently put to:

- Crash kernel (needs old kernel kept to inspect it)
- Linux-as-bootloader (currently done on shipping Power machines!)
- (proposed) Checkpoint with CRIU, store in tmpfs, kexec _passing the tmpfs
across_, restore for seamless kernel upgrades

In addition, exactly how well would that work on (say) 256GB-of-RAM big iron?

IMO, that idea is completely nonviable.

Jacob Bachmeyer

Nov 4, 2016, 10:08:22 PM

to Michael Clark, Samuel Falvo II, ron minnich, Andrew Waterman, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

Michael Clark wrote:
>
>> On 5 Nov 2016, at 2:18 PM, Samuel Falvo II <sam....@gmail.com> wrote:
>>
>> On Fri, Nov 4, 2016 at 5:58 PM, Michael Clark <michae...@mac.com> wrote:
>>> Well, they were around in a time when executables were trusted.
>>> i.e. when we had POP sending plaintext passwords over unencrypted
>>> sockets. Times have changed. We have to change with the times.
>>> Nothing against Amiga, but protected (versus paged) memory is
>>> important these days. Paging is not strictly necessary for all
>>> applications. Segmentation can be more useful.
>>
>> Yes. Are you aware that AmigaOS 4.x is memory protected? ;) Granted
>> it needs a PowerPC to run, but still; Linux is not the sole purveyor
>> of protected memory runtimes. ;)
>
> and FreeBSD, and macOS, and Windows, and even Solaris to some degree…
> don’t know why the NSA went after Solaris…
>
> https://techcrunch.com/2016/08/16/everything-you-need-to-know-about-the-nsa-hack-but-were-afraid-to-google/

The only mention of Solaris I found in that article seemed to be for
some kind of packet-relay program, not an actual exploit. Obviously, it
was ported to Solaris because that is what one of their targets uses.
The NSA cannot exactly install an OS of their choice in someone else's
network... well, not if they want their intrusion to remain unnoticed. :)

> but now we’ve changed the SBI to a library, it all makes much better
> sense. i.e. it can be linked into the Operating System and accessed
> via a relative address. It’s not such a big change. It just needs a
> LICENSE that allows it to remain compatible with BSD i.e. like the
> Dual GPL/BSD licensed code in the ACPI stack.
>
> We are no longer putting X pages at known addresses into our “bare”
> process.

The v1.9.1 SBI still has its entry points accessible using JALR from
x0. Further details have not been specified; I have assumed that the
SEE is responsible for linking (possibly via a relocation-like
mechanism) SBI calls by name. I apologize if this assumption was the
source of the "SBI is now a library" idea that seems to be circulating,
and want to clarify that I see no problem with the current "JALR x0"
approach.

Ick! Too much complexity! Too many places for bugs to hide! Too many
places to hide backdoors!

On a more serious note, there seems to be a tension here between "make
it transparent" (and easy to verify) and "make it obscure" (and harder
to exploit). I support the former stance, reasoning that it is better
to be able to reliably detect an intrusion than to "make intrusion
impossible" and give an intruder more places to hide than you can check.

-- Jacob

Stefan O'Rear

Nov 4, 2016, 10:18:39 PM

to Jacob Bachmeyer, Rick O'Connor, Michael Clark, Samuel Falvo II, ron minnich, Andrew Waterman, Paolo Bonzini, RISC-V ISA Dev

Jacob, Michael: This discussion is miles off-course from anything that
could rationally be required for a MVP of priv-2.0 and I want it to
stop. Now.

Rick: Would it be possible to set up an riscv-overflow@ list for
people to discuss SBI licenses, the NSA, speculative applications of
128-bit addressing, inventing new protocols for handling hotplug on
non-discoverable busses, etc? The SNR of isa-dev is unacceptably low
lately.

-s

Jacob Bachmeyer

Nov 4, 2016, 10:30:20 PM

to Samuel Falvo II, Michael Clark, ron minnich, Andrew Waterman, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

Samuel Falvo II wrote:
> On Fri, Nov 4, 2016 at 5:39 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>
>> In some ways, SBI is worse: SBI is always located at the top of the address
>> space, to allow it to be reached with JALR from x0. On the other hand, there
>> are SBI functions that are likely to be on hot paths and allowing those to
>> run entirely in S-mode if the hardware can support it is a good idea. Put
>> simply, a sane SEE must be careful with what it puts in the SBI to avoid
>> making useful ROP gadgets.
>>
>
> AAHH, I see where Michael was going now. He's referring to AmigaOS
> "libraries," which are statically-linked binary blobs which expose a
> jump table to the rest of the system for interoperability.
>
> While custom Amiga software could provide its own jump tables if
> desired (I almost always did this), please note that AmigaOS also
> supported a mode where the exec.library built the jump tables for you.
> This was required to support placing libraries into ROM, because
> nobody at Commodore could predict where in the ROM image a library
> would end up. This is, essentially, a form of boot-time linking.
>

Thanks for the history lesson. I did not know that AmigaOS libraries
were implemented this way.

> The same method (trusted software building jump table into trusted
> code) can be used here in the SEE. If you want to permit an
> extensible SBI, you have virtually no choice *but* to support
> jump-tables to arbitrary executables. AmigaOS libraries are
> "insecure" because of two reasons: SetFunction() call to hot-patch
> jump tables post-boot, and the fact that applications can tweak the
> exec.library's Library list. The former can be prevented by simply
> NOT offering such a function in the SEE, and the second isn't an issue
> since the SEE (implemented via M- or H-mode software) is isolated from
> the S-mode software, and by extension, U-mode as well.
>

Fortunately, I do not think that the SBI needs to support that kind of
extensibility.

> To summarize:
>
> 1. The RISC-V jump tables don't have to be actual jump instructions
> to S-mode code. You can very easily make each "library vector offset"
> (LVO) 8 bytes long, supporting an ECALL instruction in the first slot,
> and a 32-bit function ID in the second. This allows the SEE to be the
> sole provider of SBI jump tables if you want. It also frees up a
> register that would normally be used to specify the function ID.
>

It is easier than that--the SEE can examine the {m,h}epc CSR to
determine which jump slot was used. No function ID needed. Then
overwrite the {m,h}epc CSR from the link register and execute xRET to
return as if from a function call.
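
A minimal sketch of that decode step, written as plain C over a saved trap frame rather than real CSR accesses. It assumes each slot is a single 4-byte ECALL and that the table base and slot count are known to the SEE; trap_frame, sbi_slot_index and sbi_return_to_caller are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define SBI_SLOT_BYTES 4u   /* assumed: one ECALL instruction per slot */

struct trap_frame {
    uint64_t epc;   /* value of {m,h}epc at trap entry */
    uint64_t ra;    /* caller's link register (x1) */
};

/* Return the slot index for an ECALL at f->epc, or -1 if outside the table. */
static int64_t sbi_slot_index(const struct trap_frame *f,
                              uint64_t base, uint64_t nslots)
{
    uint64_t off = f->epc - base;   /* wraps to a huge value if epc < base */
    if (off % SBI_SLOT_BYTES != 0 || off / SBI_SLOT_BYTES >= nslots)
        return -1;
    return (int64_t)(off / SBI_SLOT_BYTES);
}

/* Make xRET resume at the caller, as if the slot had been an ordinary call. */
static void sbi_return_to_caller(struct trap_frame *f)
{
    f->epc = f->ra;
}
```

The bounds check matters: an ECALL executed anywhere else in S-mode would otherwise be misread as an SBI call.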

> 2. The SEE can be solely responsible for building these jump tables,
> just as AmigaOS was for supporting ROM-resident libraries. This is
> not entirely unlike asking a COM object if it supports an interface.
> "Do you support feature foo?" "Yes, here's the jump table I've
> constructed that lets you use it." Can S-mode software hack this
> table? Sure; it's in S-mode accessible/executable memory. But, since
> each LVO is an ECALL, it *cannot* break out of its S-mode sandbox.
> Unless, of course, the SEE is built to do otherwise, but then as you
> say, that's not secure, and the SEE itself is to blame.
>

The trick is that the SBI region can also contain actual subroutines
that use non-standard means to safely implement, say, remote SFENCE.VM.
If an implementation has hardware support for this, such as some
non-standard CSRs accessible to S-mode that can be written to perform
remote TLB shootdown, then the SBI call can simply do those writes, let
the hardware handle the work, and return, avoiding the overhead of a
trap and IPI. Similarly, hardware may be able to perform "fast IPI" if
the target hart is currently available and trap to the SEE if it is not.

> 3. The "openness" of the set of SBIs offered can be controlled entirely
> by the SEE; if you ONLY expose libraries supported by the SEE, then
> that's all that you get.
>
> If you're concerned with someone patching an LVO to call back into
> S-mode in a kind of MitM attack, be aware that you could do that
> anyway with any other kind of SBI call, as long as the interface sat
> in writeable RAM. I think the original intent was for SBI's jump table
> to be in ROM, but honestly, any method of making the SBI pages
> non-writeable even to S-mode would work to make this secure. This can
> be through special PMP-protected regions of RAM, for example (arguably
> the simplest solution; let the MMU translate the page to a physical
> address, then use memory-type CSRs accessible ONLY to M-mode software
> to write-protect regions of physical address space).
>

I think that the concern Michael has is related to exploits against the
supervisor from U-mode. The problem with putting the SBI at a known and
fixed address is that an exploit payload may be able to use
return-oriented programming to abuse bits of code ("gadgets") in the SBI
region to attack the supervisor. An SBI jump table partially mitigates
this by allowing the actual S-mode SBI routines to be placed somewhere else.

Now that I think about it, if the SBI region is fully
supervisor-accessible, a supervisor could itself copy the SBI
implementations (at least the ones that are more than just ECALL) to
some random location and replace the SBI page with a jump table. If the
only instructions in the known page are JAL and ECALL, a ROP chain would
be very difficult to build.
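
As a rough illustration of what a supervisor would need in order to populate such a table, here is a C helper that encodes a JAL instruction for a given byte offset. The name rv_encode_jal is hypothetical. Note that JAL only reaches about plus or minus 1 MiB, so routines copied farther away than that would need a longer sequence than a single JAL per slot.

```c
#include <stdbool.h>
#include <stdint.h>

/* Encode "jal rd, offset" (J-type). Returns false if the offset is odd
 * or outside the signed 21-bit range that JAL can express. */
static bool rv_encode_jal(unsigned rd, int32_t offset, uint32_t *insn)
{
    if (rd > 31 || (offset & 1) || offset < -(1 << 20) || offset >= (1 << 20))
        return false;
    uint32_t imm = (uint32_t)offset;
    *insn = ((imm >> 20) & 0x1u)   << 31   /* imm[20]    */
          | ((imm >> 1)  & 0x3ffu) << 21   /* imm[10:1]  */
          | ((imm >> 11) & 0x1u)   << 20   /* imm[11]    */
          | ((imm >> 12) & 0xffu)  << 12   /* imm[19:12] */
          | (uint32_t)rd << 7
          | 0x6fu;                         /* JAL opcode */
    return true;
}
```

A table slot that should not clobber the caller's return address would use rd = 0 (x0), so the relocated routine can still return through the original link register.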

Michael, what do you think of this? Replacing the known SBI page with a
jump table to the actual implementations that have been copied elsewhere?

-- Jacob

Michael Clark

Nov 4, 2016, 10:34:20 PM

to jcb6...@gmail.com, Samuel Falvo II, ron minnich, Andrew Waterman, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

On 5 Nov 2016, at 3:08 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

>> --
>> TOP OF STACK
>> AUX_DATA
>> AT_PHDR         /* pointer to loaded ELF PHdr (guest) */
>> AT_RV_CONFIG    /* pointer to config (host) */
>> AT_RV_SCRATCH   /* scratch stack frame size */
>> ENV             /* NVRAM */
>> ARGV            /* boot command line */
>> ARGC
>> SP
>
> Ick! Too much complexity! Too many places for bugs to hide! Too many places to hide backdoors!

I don’t think setting SP and using this ELF start protocol is particularly complex. It is simple and elegant. Also there is nothing constant.

With respect to X-only pages and hypercall keys, I don’t see this as being a feature in stock kernels. It could be added in a binary translation layer around ecall to keep it transparent and simplify the implementation. There is a relatively simple proof that if any payload-based exploit needs to call an ABI endpoint, and there are no known gadget addresses and the entry vectors (on multiple dimensions) have all been randomised with ’n’ bits of entropy, then no payload can execute, i.e. it is many orders of magnitude harder to exploit than ASLR. If we get above 2^112 then it becomes a quantum problem. Crypto-binary translation is a relatively new field so you can’t exactly judge implementations as there are not many on the market to speak of.

Jacob Bachmeyer

Nov 4, 2016, 10:37:04 PM

to Michael Clark, ron minnich, Andrew Waterman, Samuel Falvo II, Paolo Bonzini, Stefan O'Rear, RISC-V ISA Dev

Michael Clark wrote:
>> On 5 Nov 2016, at 1:39 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
>>
>> Michael Clark wrote:
>>
>>> With a virtio host bridge defined in the config, then hotplug of virtual devices would simply be an interrupt with a message in a circular buffer. That way we don’t need to expose any code entry points with special function pages like old 16-bit Amigas, which now thinking twice about it is bad. How did SBI get in there? This was something ARM did but have since removed? i.e. special utility function pages. These functions should be part of a userspace ABI, not placed into bare privileged processes. The only interface to the higher privilege level should be trap and interrupt, MMIO pages and CSRs.
>>>
>>> The VDSO is dangerous:
>>>
>>> http://v0ids3curity.blogspot.co.nz/2014/12/return-to-vdso-using-elf-auxiliary.html
>>>
>>> There shouldn’t be any pages in a guest with code in them and ecall should be the only way to call the supervisor/hypervisor/monitor. There should only be MMIO pages or CSRs and interrupts for signalling outside of the guest.
>>>
>> In some ways, SBI is worse: SBI is always located at the top of the address space, to allow it to be reached with JALR from x0. On the other hand, there are SBI functions that are likely to be on hot paths and allowing those to run entirely in S-mode if the hardware can support it is a good idea. Put simply, a sane SEE must be careful with what it puts in the SBI to avoid making useful ROP gadgets.
>>
>
> There is an understandable need to encapsulate IPI and console, however one approach is to have a lib that can access struct riscv_config* to find the uart, MMIO address or CSR. i.e. the hardware specific method of implementing IPI and console writes.
>

I see the SBI console interface as only for low-level debugging--think
serial console, not framebuffer console.

> The other approach is to link the SBI to the kernel. By its nature, it is the part of the code that is not running in the monitor.
>
> It’s an interesting problem. I’m now thinking of SBI as riscv-sbi.so much like linux-vdso.so.
>

I think of it as something similar, except that it uses a special
calling sequence ("JALR x0") that requires at least part of it (a jump
table!) to be mapped at the top of the address space.

> The escalation is from user to root, root to kernel, and then once in kernel we have SBI access X page at a fixed location. If we had to hypercall, then the ecall instances would be at unknown offsets.
>

I think that I understand your concern--an attacker need only know what
SEE is in use to look up the SBI region contents and search for ROP
gadgets. I suggested limiting the contents of the fixed page to JAL and
ECALL in another branch of this thread. What do you think of that?

> The problem is the arity: ecall type and id. Could be a trap, could be implemented in hardware. There could even be hardware ecall variants that use register state; however, they would force the processor to serialise.
>

The reason to use JALR to an SBI region instead of directly using ECALL
is that some SBI calls may not need to trap at all on some hardware.

-- Jacob

Jacob Bachmeyer

Nov 4, 2016, 10:41:53 PM

to Alex Elsayed, isa...@groups.riscv.org

Such a mechanism would have to be added.

> - S-mode would not be able to tell what had changed unless it had saved the
> prior version to diff against
>

Except that the parsed config string is sitting around somewhere, so the
supervisor *does* have something to diff against.

> - In Device Tree, hotplug is handled in two ways:
> - On discoverable buses, where the bus itself is in device tree, and
> normal discovery proceeds from there. No change needed to DT on hotplug.
> - On non-discoverable buses, via Device Tree Overlays, which are separate
> files. No change needed to DT on hotplug.
>

This is interesting, but how does Device Tree handle hotpluggable RAM?

On a side note, the "SBI virtio" proposal I made envisioned a similar
concept of "if this bus is accessed using virtio, its devices appear in
this secondary configuration string".

> It would be error-prone, racy, and worst of all _unnecessary_.
>

Yes; yes; not yet certain.

-- Jacob

Jacob Bachmeyer

Nov 4, 2016, 11:10:00 PM

to Alex Elsayed, isa...@groups.riscv.org

Alex Elsayed wrote:
> On Friday, 4 November 2016 19:39:56 PDT Jacob Bachmeyer wrote:
>> I proposed that the SEE wipe all memory not part of the new supervisor
>> in sbi_sexec().
>>
>
> This would make sbi_sexec() useless for various use cases that the nearest
> construct (kexec) is currently put to:
>
> - Crash kernel (needs old kernel kept to inspect it)
>

This limitation was acknowledged (in message-id
<57E5DD24...@gmail.com>) in the discussions when sbi_sexec() was
proposed (in message-id <57E326FB...@gmail.com>). The crash kernel
mechanism would not be able to use sbi_sexec() except for rebooting
afterwards. This imposes compatibility requirements between host kernel
and crash kernel (same VM mode, same width, etc.) but should not be a
problem in general--and a known clean slate after a crash is probably a
good idea.

> - Linux-as-bootloader (currently done on shipping Power machines!)
>

Huh? The whole purpose of sbi_sexec() is for a bootloader. Prepare the
new supervisor (and optional initramfs) in a buffer, pass it to
sbi_sexec(), and the new system is started, with no other trace of the
old environment.

> - (proposed) Checkpoint with CRIU, store in tmpfs, kexec _passing the tmpfs
> across_, restore for seamless kernel upgrades
>

Still possible, just include the tmpfs in the buffer and unpack it on
the receiving end.

> In addition, exactly how well would that work on (say) 256GB-of-RAM big iron?
>

Presumably 256GiB RAM big iron also has enough parallelism available to
clear most of that RAM quickly.

> IMO, that idea is completely nonviable.
>

I remain unconvinced.

-- Jacob

Jacob Bachmeyer

Nov 4, 2016, 11:24:48 PM

to Stefan O'Rear, Rick O'Connor, Michael Clark, Samuel Falvo II, ron minnich, Andrew Waterman, Paolo Bonzini, RISC-V ISA Dev

Stefan O'Rear wrote:
> Jacob, Michael: This discussion is miles off-course from anything that
> could rationally be required for a MVP of priv-2.0 and I want it to
> stop. Now.
>

Request to drop thread branch tentatively honored--the NSA bit was
pretty far into the weeds and the branch does seem to have been going
round-and-round with no real progress.

On the other hand, this list is for discussions of the RISC-V ISA, and
the ISA is more than just priv-2.0. Further, since the SBI is part of
the privileged ISA, I see discussions of boot processes as in-scope even
for priv-2.0, since the SBI and boot process are closely related.

-- Jacob

Andrew Lutomirski

Dec 1, 2016, 7:54:49 PM

to RISC-V ISA Dev, michae...@mac.com, sor...@gmail.com, rmin...@gmail.com, bon...@gnu.org, sam....@gmail.com

On Wednesday, November 2, 2016 at 8:22:51 PM UTC-7, andrew wrote:

> On Wed, Nov 2, 2016 at 8:14 PM, Michael Clark <michae...@mac.com> wrote:
>>
>>
>>
>> Sent from my iPhone
>>> On 3/11/2016, at 10:01 AM, Stefan O'Rear <sor...@gmail.com> wrote:
>>>
>>>> On Wed, Nov 2, 2016 at 1:45 PM, ron minnich <rmin...@gmail.com> wrote:
>>>> an SBI call I would rather not see.
>>>>
>>>> I was surprised to see that the # page level mappings was not communicated
>>>> to s-mode in some way as a bit in sstatus. This is the kind of thing a
>>>> kernel needs to know.
>>>
>>> My understanding of the situation is that you're expected to just pick
>>> a VM mode, then put it in the ELF header somewhere so that the
>>> bootloader can set it.
>>>
>>> See also https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/cV9DEHo1XYU/-PTKEEVICwAJ
>>
>> It doesn't seem appropriate to put a dynamic property into a static ELF attribute. It means the same kernel cannot run on hardware with different paging extensions.
>
> Not necessarily... Sv48 systems will generally support Sv39 (the
> hardware cost is immeasurable), so kernels that require less address
> space can run on systems that support more.

Hmm, time for some x86 history.

On x86 32-bit, there are two rather different paging modes: the normal (older) mode and PAE (fancy) mode. They work quite differently. On both Linux and Windows (or at least older Windows -- I don't know what newer Windows versions do), you are expected to compile a kernel that matches the paging mode, but at least a non-PAE kernel will *run* on PAE hardware, albeit with less capability. This is generally considered to be a big mess. Everyone would prefer a single kernel image that uses the full platform capabilities of whatever platform it's on.

You're advocating the same thing for RISC-V. This will annoy distributions, IMO. Please just let the supervisor efficiently figure out and maybe even *control* the paging mode. Kernels can support both modes and pick their favorite from those supported on the hardware they're running on.

--Andy

Michael Clark

Dec 2, 2016, 6:38:27 AM

to Andrew Lutomirski, RISC-V ISA Dev, sor...@gmail.com, rmin...@gmail.com, bon...@gnu.org, sam....@gmail.com

The supervisor should be able to change the VM mode. Somehow... maybe SBI... it requires a trampoline procedure and a cache line aligned code sequence or an SBI call.

It would still be possible to read the boot command line, environment (NVRAM), and Auxiliary vector variables AT_BASE (pointer to start of ELF), AT_RISCV_CONFIG (pointer to config) as an entry protocol with scratch page tables set up and sp pointing to argc just below top of this scratch stack.

S-mode boot1 can set up new PTEs and csrrw stvec, csrrw sptbr, csrrw sstatus.VM, sfence.vm, j stvec and then trap into the new mode as subsequent instruction fetches will cause a fetch fault and jump to stvec in the new VM mode or reach the jump instruction which will have a TLB miss and activate the new paging mode.

The tiny trampoline code sequence probably has to fit in a cache line unless VM only takes effect after sfence.VM. This may not be the case with current implementations for which the VM mode may take effect immediately i.e. the next TLB miss.

In fact if it traps on either csrrw sstatus.VM, sfence.vm or the jump succeeds then we would have made a successful mode change.

Sent from my iPhone

Andrew Lutomirski

Dec 2, 2016, 11:23:09 AM

to Michael Clark, Paolo Bonzini, rmin...@gmail.com, RISC-V ISA Dev, sam....@gmail.com, sor...@gmail.com

On Dec 2, 2016 3:38 AM, "Michael Clark" <michae...@mac.com> wrote:
>
> The supervisor should be able to change the VM mode. Somehow... maybe SBI... it requires a trampoline procedure and a cache line aligned code sequence or an SBI call.
>
> It would still be possible to read the boot command line, environment (NVRAM), and Auxiliary vector variables AT_BASE (pointer to start of ELF), AT_RISCV_CONFIG (pointer to config) as an entry protocol with scratch page tables set up and sp pointing to argc just below top of this scratch stack.
>
> S-mode boot1 can set up new PTEs and csrrw stvec, csrrw sptbr, csrrw sstatus.VM, sfence.vm, j stvec and then trap into the new mode as subsequent instruction fetches will cause a fetch fault and jump to stvec in the new VM mode or reach the jump instruction which will have a TLB miss and activate the new paging mode.
>
> The tiny trampoline code sequence probably has to fit in a cache line unless VM only takes effect after sfence.VM. This may not be the case with current implementations for which the VM mode may take effect immediately i.e. the next TLB miss.

This is probably the first case where I've thought that x86's approach is straight-up better. On x86:

1. Identity-map a trampoline.
2. Turn off paging.
3. Load new page tables that still identity-map the trampoline.
4. Turn paging back on.

If RISC-V lets S-mode control the paging format and one of the choices is no paging at all, then there's no problem here.
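
The four steps above can be sketched as a toy model in C. This is only an illustration under stated assumptions: `toy_mmu`, `translate`, `switch_tables`, and the two sample tables are hypothetical names invented here, pages are plain indices, and nothing in it touches real x86 or RISC-V hardware. It shows why the trampoline page keeps resolving to the same physical page in every intermediate state.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_PAGES 8
#define UNMAPPED UINT32_MAX

/* Hypothetical toy MMU: a single-level table indexed by virtual page. */
struct toy_mmu {
    bool paging;            /* false: virtual page == physical page */
    const uint32_t *table;  /* TOY_PAGES entries, used when paging is on */
};

static uint32_t translate(const struct toy_mmu *m, uint32_t vpage) {
    if (!m->paging) return vpage;          /* paging off: identity */
    if (vpage >= TOY_PAGES) return UNMAPPED;
    return m->table[vpage];
}

/* Switch to new_table while executing from page `tramp`.
 * Returns false, leaving the MMU untouched, unless both the old and the
 * new table identity-map the trampoline page. */
static bool switch_tables(struct toy_mmu *m, const uint32_t *new_table,
                          uint32_t tramp) {
    if (tramp >= TOY_PAGES) return false;
    if (translate(m, tramp) != tramp) return false;  /* 1. old map is identity */
    if (new_table[tramp] != tramp) return false;     /* 3. new map is identity */
    m->paging = false;                               /* 2. turn paging off */
    m->table = new_table;                            /* 3. load new tables */
    m->paging = true;                                /* 4. turn paging on */
    return translate(m, tramp) == tramp;
}

/* Sample tables for illustration: only page 2 is identity-mapped in both. */
static const uint32_t OLD_TABLE[TOY_PAGES] = {5, 6, 2, 7, 0, 1, 3, 4};
static const uint32_t NEW_TABLE[TOY_PAGES] = {4, 3, 2, 1, 0, 7, 6, 5};
```

With these tables, a switch made from page 2 succeeds, while a switch attempted from page 1 is refused because neither table maps page 1 to itself.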

Samuel Falvo II

Dec 2, 2016, 11:51:28 AM

to Michael Clark, Andrew Lutomirski, RISC-V ISA Dev, Stefan O'Rear, ron minnich, Paolo Bonzini

On Fri, Dec 2, 2016 at 3:38 AM, Michael Clark <michae...@mac.com> wrote:
> The tiny trampoline code sequence probably has to fit in a cache line unless
> VM only takes effect after sfence.VM.

This might be true for virtually addressed caches; however, for
physically addressed caches (which I believe is the norm on most CPUs
I've seen of late), I don't see why this restriction needs to be in
place. This is why identity mapping should be sufficient: it maps a
common 4KB (or more) chunk of code and/or data that serves as a
"wormhole" from one VM setting to the next.

Stefan O'Rear

Dec 2, 2016, 11:53:29 AM

to Michael Clark, Andrew Lutomirski, RISC-V ISA Dev, ron minnich, Paolo Bonzini, Samuel Falvo II

On Fri, Dec 2, 2016 at 3:38 AM, Michael Clark <michae...@mac.com> wrote:

> The tiny trampoline code sequence probably has to fit in a cache line unless
> VM only takes effect after sfence.VM. This may not be the case with current
> implementations for which the VM mode may take effect immediately i.e. the
> next TLB miss.

What I've received from the team is that VM changes are intended to
take effect **immediately** - a write to SPTBR behaves as a jump if
the mapping of the page containing PC changes, and presumably the same
is true for the page depth / page format fields if we make them S-mode
writable.

-s

ron minnich

Dec 2, 2016, 12:07:50 PM

to RISC-V ISA Dev

Harvey is going to take a very different approach.

o I'm planning to communicate the sv39 vs. sv48 info from coreboot in the config string, which is nice and simple.

o After talking it over with Andrew, I'd like to experiment with getting rid of the amd64-inspired negative-address-space hack where we use the top 2G of virtual address space to map the first 2G of DRAM. Instead, Harvey will use the coreboot-provided page tables, which map physical address space starting at, e.g., ffffff80_00000000 in sv39. Right now, the gcc toolchain for riscv can't build a kernel with that virtual address; it gets all kinds of relocation errors, but I'm hoping to see that fixed (and really hoping to create a repro ....)

o As for sv39 vs. sv48, I'm just going to have Harvey dynamically work out which mode is available and use it. The Plan 9 MMU code Harvey uses virtualizes the MMU in a very different way from Linux, and the PTE management code is hence isolated to one .c file. It's pretty clean and it's why I was able to write new MMU code for the Blue Gene flavor of PPC in an evening. 1 MiB PTE support took about an hour.
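
As a rough sketch of what picking the mode at run time could look like, the C fragment below treats the paging mode as data: Sv39 walks three levels and Sv48 four, each indexed by a 9-bit VPN field above a 12-bit page offset. The `vm_mode` struct, the `HAVE_*` flags, and `pick_mode` are hypothetical names invented for this illustration. They are not Harvey's code, and how the supported-mode flags would be decoded from the coreboot config string is left out.

```c
#include <stddef.h>
#include <stdint.h>

enum { PAGE_SHIFT = 12, VPN_BITS = 9 };

/* Hypothetical flags, e.g. decoded from the firmware's config string. */
enum { HAVE_SV39 = 1, HAVE_SV48 = 2 };

/* Hypothetical descriptor: the only per-mode datum here is the depth. */
struct vm_mode {
    const char *name;
    int levels;  /* 3 for Sv39, 4 for Sv48 */
};

static const struct vm_mode SV39 = {"sv39", 3};
static const struct vm_mode SV48 = {"sv48", 4};

/* Prefer the deepest mode the platform reports; NULL if none. */
static const struct vm_mode *pick_mode(unsigned supported) {
    if (supported & HAVE_SV48) return &SV48;
    if (supported & HAVE_SV39) return &SV39;
    return NULL;
}

/* Virtual address width implied by the mode: 39 or 48 bits. */
static int va_bits(const struct vm_mode *m) {
    return PAGE_SHIFT + VPN_BITS * m->levels;
}

/* Index into the page table at `level` (0 is the leaf level). */
static unsigned vpn_index(uint64_t va, int level) {
    return (unsigned)((va >> (PAGE_SHIFT + VPN_BITS * level)) & 0x1ff);
}
```

A table walk written against `levels` and `vpn_index` would not need separate Sv39 and Sv48 code paths, which is the property that keeps the PTE code confined to one file.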

I'm only writing this as I see two contradictory themes in these discussions:

- design for 2080

- with software structures and ideas designed in 1980

The discussions immediately devolve from an idea to "here's how we put that in the ELF AUX vector". I don't see that as sustainable. At some point we need to break with the past. It really pays to put more weight into working out what we're trying to do before we immediately jump to how we implement it in today's Linux boot flow. I realize we all want this today, and we want riscv to succeed RIGHT NOW, but still ... I feel like we ought to slow down just a bit.

ron

Michael Clark

Dec 2, 2016, 12:11:56 PM

to Andrew Lutomirski, Paolo Bonzini, rmin...@gmail.com, RISC-V ISA Dev, sam....@gmail.com, sor...@gmail.com

Yes that is also good. It takes 3 states and 2 transitions to remap the original physical pages somewhere else.

Faulting on fetch would also work but likely has implementation defined behaviour as to which instruction will fault (dependent on VM coherency).

Identity mapping is safer but one may need to reflect on the current mapping versus fall into a calculated trap.

On masking the VM field to allow access to it in S-mode sstatus: I don't know if relaxing constraints makes us non-compliant with the standard (to fail a more permissive implementation, there would need to be an S-mode test that checks for no VM field access). Presently, setting VM would require an SBI call, e.g. sbi_set_vm(int), which would provide backwards compatibility for existing silicon (new silicon can just set the field in S-mode). In fact, an SBI implementation could try to set the field, read it back, and if that fails, perform an upcall: multi-model firmware with backwards compatibility.
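
The try, read back, then upcall idea can be sketched in C as below. Everything here is an assumption made for illustration: the `vm_ops` callbacks stand in for the CSR accesses and for the proposed sbi_set_vm(int) call, and the two simulated backends only model "silicon that ignores the S-mode write" versus "silicon that accepts it". No real SBI or CSR interface is being described.

```c
/* Hypothetical hooks; on real hardware these would be CSR accesses
 * and an SBI call, neither of which is modeled here. */
struct vm_ops {
    void (*write_vm)(unsigned mode);   /* attempt a direct S-mode write */
    unsigned (*read_vm)(void);         /* read the VM field back */
    int (*sbi_set_vm)(unsigned mode);  /* firmware upcall, 0 on success */
};

/* Try the direct write first; fall back to the upcall if it did not stick. */
static int set_vm_mode(const struct vm_ops *ops, unsigned mode) {
    ops->write_vm(mode);
    if (ops->read_vm() == mode)
        return 0;
    return ops->sbi_set_vm(mode);
}

/* Simulated state, for illustration only. */
static unsigned sim_vm = 0;
static int sim_upcalls = 0;

static void old_write(unsigned mode) { (void)mode; }  /* write is ignored */
static void new_write(unsigned mode) { sim_vm = mode; }
static unsigned sim_read(void) { return sim_vm; }
static int sim_sbi(unsigned mode) { sim_vm = mode; sim_upcalls++; return 0; }

static const struct vm_ops OLD_SILICON = {old_write, sim_read, sim_sbi};
static const struct vm_ops NEW_SILICON = {new_write, sim_read, sim_sbi};
```

On the simulated new silicon the mode is set without any upcall. On the simulated old silicon the read-back mismatches and exactly one upcall is made. One caveat of this sketch: if the field already holds the requested value, the read-back matches even when the write was ignored, which is harmless here but would matter if the write had other side effects.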

We're going to need to apply that kind of thinking for changes after the first frozen version of the privileged spec but at present it may be possible to make this kind of change (breaking existing silicon). It may or may not make sense to be an SBILIB call?

Sent from my iPhone

Jacob Bachmeyer

Dec 2, 2016, 5:16:10 PM

to Stefan O'Rear, Michael Clark, Andrew Lutomirski, RISC-V ISA Dev, ron minnich, Paolo Bonzini, Samuel Falvo II

Changing paging depth in a supervisor (presumably during a user process
switch) does not introduce the same problems, since the supervisor is
almost certainly mapped at the same addresses in all processes. The
convention of putting the supervisor in the "negative" address space is
helpful here.

Changing VM mode in a supervisor opens a can of worms. I favor the
current model where VM is set by the SEE and not available to the
supervisor. What really helps here is that the SEE can freely set the
supervisor's VM mode _without_ _being_ _affected_ _by_ _it_. This
simplifies the code and avoids the "magic dances" that have been
mentioned on this thread.

-- Jacob