Proposal: Add hardware support for tagged pointers in RV64/RV128


Jacob Bachmeyer

Sep 23, 2016, 9:31:29 PM
to RISC-V ISA Dev
Since the compressed instruction set extensions envision the use of
tagged pointers, particularly on RV128, I propose optionally relaxing
the constraint that unused address bits must be equal to the highest bit
actually translated in RV64 and RV128. This will improve the efficiency
of programs that use tagged pointers by eliminating the need to keep an
address mask in a user-visible register.
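To make the saving concrete, here is a rough C sketch (the 48-bit split, the mask value, and the assumption that user addresses live in the lower half of the address space are all illustrative, not part of the proposal):

```C
#include <stdint.h>

#define TAG_SHIFT 48                        /* illustrative Sv48-style split */
#define ADDR_MASK ((1UL << TAG_SHIFT) - 1)  /* low bits actually translated  */

/* Today: software must strip the tag before every dereference. */
static inline uint64_t load_masked(uintptr_t tagged)
{
    return *(volatile uint64_t *)(tagged & ADDR_MASK);
}

/* With the proposed relaxation in effect: the tagged pointer can be
 * dereferenced directly, because untranslated upper bits are ignored
 * instead of faulting. */
static inline uint64_t load_tagged(uintptr_t tagged)
{
    return *(volatile uint64_t *)tagged;
}
```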

This proposal makes no change to RV32 (where all virtual address bits
are currently used in paging translation) and (like my proposal for
multi-width execution) takes advantage of the additional bits in mstatus
in wider ISAs. I propose adding two one-bit fields to mstatus and
reserving an additional two bits for analogous fields.

The new field UAM ("user address mask") appears at bit 40 in mstatus,
hstatus, and sstatus and is read-only at bit 40 in ustatus. The new
field SAM ("supervisor address mask") appears at bit 41 in mstatus,
hstatus, and sstatus. Bits 42 and 43 are reserved for analogous HAM
("hypervisor address mask"), and MAM ("monitor address mask") fields.

When UAM is clear, U-mode address translation proceeds as currently
proposed. When UAM is set, address bits not used in translation are
ignored and no fault is raised. UAM is writable only by S-mode and
above because its correct use requires cooperation from the supervisor.
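As a minimal sketch of the supervisor side (the sstatus bit position is the one proposed above and is of course not ratified; the csrs/csrc pseudo-instructions are assumed to be available in the toolchain):

```C
#include <stdint.h>

#define SSTATUS_UAM (1UL << 40)   /* proposed "user address mask" flag */

/* Enable or disable UAM for the U-mode context about to be resumed. */
static inline void set_uam(int enable)
{
    if (enable)
        __asm__ volatile ("csrs sstatus, %0" : : "r"(SSTATUS_UAM));
    else
        __asm__ volatile ("csrc sstatus, %0" : : "r"(SSTATUS_UAM));
}
```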

When SAM is clear, S-mode address translation proceeds as currently
proposed. When SAM is set, address bits not used in translation are
ignored and no fault is raised. A supervisor using tagged pointers may
not use more tag bits than are available in the largest address space
used by any user process, since the supervisor shares address space with
all user processes.

This proposal envisions an analogous implementation for HAM, but MAM, if
implemented, would require an additional CSR to store the M-mode address
mask, since an implicit mask derived from paging is not available for
M-mode. A monitor using tagged pointers would have to specify its own
address mask.

Each of these fields only affects accesses made with its own privilege
mode. Setting UAM has no effect in S-mode and SAM has no effect in
U-mode. These fields are effective, however, in M-mode with MPRV set,
where they have their normal effects if MPP is S or U. Likewise, MAM
and HAM, if and when implemented, do not affect S-mode or U-mode in any way.

Because the expected hardware cost is very small, consisting of storage
for the two bits and a few gates to mask the bad-address-form access
faults, I propose that support for UAM and SAM be mandatory if paging is
implemented.

-- Jacob

Michael Clark

Sep 24, 2016, 12:11:39 AM
to jcb6...@gmail.com, RISC-V ISA Dev

This proposal in some form is a good idea assuming the editors like the name. I may get attached to it.

Sign-extended full pointers (versus signed offsets that are a fraction of XLEN) will be very wasteful on RV128. The gap between a 48-bit address space and 128-bit integers is very large. The ability to use those 16 bits with Sv48 on RV64 will also be very useful.

The semantics need to be clearly defined. Misaligned accesses on mask boundaries come to mind. Implementations using the “non-canonical address mode” will also have to provide their own mechanisms for pointer comparison. The proposal needs a clear statement that the address mask flags only apply when translation is active.

I imagine an implementation may combine this mode with a custom extension in which a pair of CSRs contains a protection key and mask, whereby an exception is raised if the address does not match the protection key; otherwise there are many random pointers that alias the same call target. A protection key and mask would allow some tag bits to be used for GC or other memory-tracing features. A protection-key mechanism is distinct from an ASID in that it could be exposed to U-mode and, used in combination with ‘offsets’ (which limit leakage to the lower 64 bits on RV128), could provide a reasonable amount of entropy for an ASLR mechanism on a 128-bit ABI with 64-bit offsets.

A binary translator targeting an architecture that mandates “canonical addresses” will be able to retranslate, inserting masking before loads and stores, when this feature is active.

Although this raises a separate issue, it is one I had been thinking about (for userland sandboxes): an option to make FENCE.I require S-mode privileges, with an option for it to fault in U-mode. I don’t know whether FENCE.I is virtualizable; it seems to be a candidate for a trapping instruction in U-mode. Alternatively, a strict loader could raise a verifier exception upon seeing FENCE.I in a well-formed subset of RISC-V in ELF. It did occur to me that FENCE.I is something one may wish to administratively disable on a per-process basis, i.e. with exceptions made for JITs. Self-modifying code is pretty obscure these days, and is usually an obfuscation or attack vector.

Michael.

Alex Bradbury

Sep 24, 2016, 2:36:10 AM
to jcb6...@gmail.com, RISC-V ISA Dev
On 24 September 2016 at 02:31, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> Since the compressed instruction set extensions envision the use of tagged
> pointers, particularly on RV128, I propose optionally relaxing the
> constraint that unused address bits must be equal to the highest bit
> actually translated in RV64 and RV128. This will improve the efficiency of
> programs that use tagged pointers by eliminating the need to keep an address
> mask in a user-visible register.
> This proposal makes no change to RV32 (where all virtual address bits are
> currently used in paging translation) and (like my proposal for multi-width
> execution) takes advantage of the additional bits in mstatus in wider ISAs.
> I propose adding two one-bit fields to mstatus and reserving an additional
> two bits for analogous fields.

Support for ignoring bits unused in address translation is something
I've been advocating internally at lowRISC, though have been waiting
for the chance to write up a full proposal and trial an implementation
to spread more widely. Due to the obvious potential confusion with the
tagged memory locations we support in lowRISC, I'm tempted to refer to
proposals like this as offering "embedded metadata", but "tags" does
seem to be the accepted term now that AArch64 supports it. I'm very
glad to see this is something others are interested in.

Let's first take a step back though - there seem to me to be 3
different ways of supporting this "tagged pointer" use case:

1. Do like AArch64 and define a number of bits that can safely be used
for this (top 8). The full 64-bit address space can not be used in the
future without causing issues for software that exploited this
ability. Krste pushed against it for this reason ("History will repeat
itself - we have a longer-term view where 56 bits won't be enough.
Not to mention the complexities and non-portability along the way."
The thread was "RISC-V Draft Compressed ISA Version 1.7 Released" on
isa-dev but unfortunately I can't find a public archive with messages
from that time).

2. Do like your proposal does (as I understand it) and have the number
of available bits for use as a tag dependent on the supported virtual
address space. A program written assuming an Sv39 system and hence 24
bits of tag is going to have a bad time upon a move to Sv48 or higher

3. Support setting a mask for address translation that is independent
of the virtual addressing mode. More complex than the above two
options as you have to consider the case where bits are masked that
otherwise would be used in virtual address translation. The advantage
is processes might opt-in to masking on a case-by-case basis and make
the choice of trading off virtual address space for tags. e.g. even
under Sv48 I could make use of 24-bit tags. It does of course add more
per-process state and would need co-operation from mmap and friends

For all options (even 1. and 2. that have minimal implementation
complexity), I feel it would be useful to justify the potential hassle
in terms of benchmarks.

Alex

Michael Clark

Sep 24, 2016, 4:31:16 AM
to Alex Bradbury, jcb6...@gmail.com, RISC-V ISA Dev
I don’t completely agree with restricting bits at any particular level other than the “bits above the PTE width”, i.e. the width of the active translation: e.g. 25 bits with Sv39 on RV64, and 16 bits with Sv48 on RV64. I think we agree, from reading below.

As an OS is “wired” for Sv48, it is never going to use the upper 16 bits, and if that ever changes it will be a very deliberate change made by the OS engineers; “regular” programs will simply see the extra address space through malloc/mmap.

If this feature applies “per process” and the flag is enabled only for “special” processes that require it, then a GC or whatever else uses the extended bits is going to need explicit changes anyway, and those would come at the same frequency as major expansions of the address space. At present these GCs are hobbled: they can perform tagging by mmap'ing a /dev/shmfs file at multiple offsets “below” the canonical limit, in effect reducing their address space. These “special” processes would be using this feature with the knowledge that they would have to change their mask when the operating system changes some years in the future.

Wrap or mask was actually used as a security mechanism and many people have complained about its removal.

Windows effectively tags/zones the virtual address space below the canonical bits to get around the hardware enforcement of sign extension.


I for one don’t completely believe in the rationale for “canonical addresses”.

Of course we should not tag the physical address space, and any sane OS would switch this “translation mask bit” during a context switch for processes that need it. A GC in something like the JVM or Go is very much like a user-land microkernel and would likely need to be changed to support a different address-space layout on a regular OS anyway, and this would be somewhat paced by the introduction of new page-table formats, e.g. Sv57 and Sv64.

> 2. Do like your proposal does (as I understand it) and have the number
> of available bits for use as a tag dependent on the supported virtual
> address space. A program written assuming an Sv39 system and hence 24
> bits of tag is going to have a bad time upon a move to Sv48 or higher

So do I. I totally agree.

> 3. Support setting a mask for address translation that is independent
> of the virtual addressing mode. More complex than the above two
> options as you have to consider the case where bits are masked that
> otherwise would be used in virtual address translation. The advantage
> is processes might opt-in to masking on a case-by-case basis and make
> the choice of trading off virtual address space for tags. e.g. even
> under Sv48 I could make use of 24-bit tags. It does of course add more
> per-process state and would need co-operation from mmap and friends

Setting an explicit mask would be nice too.

In addition to the ignore mask there is the protection key, which causes traps on static bits; this requires its own mask to distinguish static match bits (key bits) from tag bits (ignored bits).

My thinking for an RV128 ABI was around loading a 64-bit protection key in X-only (execute-only) code using immediate instructions, and using 64-bit “offsets” in code. Code that leaked an “offset” wouldn’t leak the protection key, e.g. sizeof(uintptr_t) != sizeof(xlentype_t). Compilers would have to treat pointers slightly differently, i.e. as stored offsets relative to a base pointer: gp.
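A rough sketch of that stored-offset idea (names and types here are purely illustrative): pointers held in memory are 64-bit offsets from a per-process base such as gp, so the full-width value never has to be spilled.

```C
#include <stdint.h>

typedef uint64_t ptr_off_t;   /* what the compiler stores in data structures */

static inline void *off_to_ptr(void *base, ptr_off_t off)
{
    return (char *)base + off;            /* base would live in gp at runtime */
}

static inline ptr_off_t ptr_to_off(void *base, void *p)
{
    return (ptr_off_t)((char *)p - (char *)base);
}
```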

A 128-bit ABI in 2020 is likely going to have less than 64 address space bits but we will be doing 256-bit bigint adds/shifts/ands/multiplies and what not for typical ECC implementations like ECDHE and Curve25519 which is 255 bits.

> For all options (even 1. and 2. that have minimal implementation
> complexity), I feel it would be useful to justify the potential hassle
> in terms of benchmarks.

Will having this feature disabled have any effect on performance?

What would the cost of comparing and/or masking VA upper bits be in hardware?

I read the paper on the high cost of distance and memory versus the low cost of ALUs. e.g. using adders

Would register-relative loads and stores be a RISC-style addition? Or would we instead fuse an add, taking up slightly more icache bandwidth, to get register-relative LOADs and STOREs for an RV128 ABI using 64-bit offsets? This is RVC, I guess? RVC can use the reserved 32-bit space? E.g.:

LD rd, rs2(rs1)
SD rs2, rs3(rs1)

Note: it’s not strictly a complex addressing mode (CISC) as it only applies to LOAD and STORE.

Cheers,
Michael.

Alex Bradbury

Sep 24, 2016, 4:44:03 AM
to Michael Clark, jcb6...@gmail.com, RISC-V ISA Dev
To clarify, I think the hardware cost should be negligible. It's more
a question of whether the performance benefit of not having to mask
away bits used for embedded pointer metadata and potential additional
use-cases it allows justifies the complexity in exposing this to
software. The baselines to compare against would be using embedded
pointer metadata and inserting extra instructions to do the mask, and
the same thing but with microarchitectural support for fusing the mask
with the load.

Alex

Michael Clark

Sep 24, 2016, 5:28:00 AM
to Alex Bradbury, jcb6...@gmail.com, RISC-V ISA Dev
Then it is definitely a win to do this in hardware if it doesn't add too much complexity.

There are already a few implementations that insert masks in software for ARM and x86-64 (Google's NaCl sandbox), and I am interested in the ability to do this on RISC-V. If the hardware supported this, it might in fact spawn quite a lot of compiler security innovation. I can see that QEMU, for example, would need to insert masking micro-ops into loads and stores to emulate this feature.

I would like to find the LLVM developer presentation with their plea for hardware assist for ASAN. The pointer-tag feature may not help with their use case, as they need to store poison metadata at 4 bits per 8 bytes (the minimum malloc size). They have a shadow memory region, addressed by >>4 (a one-sixteenth memory overhead), that stores the count of tail bytes poisoned in an 8-byte region, and they insert reads of the poison metadata on loads and stores. Their use case needs genuine tagged memory, which is a bit different from tagged pointers.

I can see the need for tagged memory, but it is significantly more complex than address masking, and the trapping logic for poisoned memory is also quite special-purpose.

I suspect the GC folk will benefit from address masking as I imagine they would store the generation.

I wonder if some day there will be regular ASICs with a small amount of asynchronous FPGA or microcode in the load/store unit to implement these types of policies at runtime. Upon searching, I couldn't find the HW mechanism the ASAN developers had proposed. Cache-line poisoning and fine-grained TLB invalidation to cause a miss on poisoned memory without a TLB flush?

Michael

Cesar Eduardo Barros

Sep 24, 2016, 9:20:54 AM
to Alex Bradbury, jcb6...@gmail.com, RISC-V ISA Dev
Option 2 is a non-starter. The page table format is a kernel
implementation detail which shouldn't be exposed to user space.

Option 1 hard-codes a limit in the process address space, in a way which
could hinder future experimentation with larger address spaces. 56 bits
might seem to be a lot, until you start thinking about sparse mappings
(either of a huge sparse file, or several large mappings spread around
the address space to give them room to grow or for ASLR). Even for
RV128, permanently forbidding several bits from being used could prevent
interesting ideas.

Option 3 could be done in a simple way with a bit of cooperation between
the kernel and the hardware. The hardware would have a per-process flag
to ignore the unused bits, like in option 2. The kernel would then
manipulate the page tables for that process to present the illusion of
the used bits also being ignored, up to the mask requested by the
process. For instance, for a process which requests 20 bits available
for tags, running on Sv48 which only has 16 bits available, the kernel
would configure the uppermost page table for the process so that entries
0-15, 16-31, 32-47, and so on, point to the same 16 next level page
tables, so the end result is as if these 4 bits of the address were also
being ignored.
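A minimal sketch of that kernel-side aliasing, assuming a 512-entry Sv48 root table whose low entries are already populated and `extra_bits` additional tag bits to hide (loosely following the grouping described above):

```C
#include <stdint.h>
#include <stddef.h>

#define PT_ENTRIES 512        /* 9 index bits per level in Sv39/Sv48 */

typedef uint64_t pte_t;

/* Replicate the root-level entries so that the top `extra_bits` bits of
 * the root index no longer matter: every address aliases the same leaf
 * tables regardless of the value of those bits. */
static void alias_tag_bits(pte_t *root, unsigned extra_bits)
{
    size_t period = (size_t)PT_ENTRIES >> extra_bits;  /* entries that stay significant */
    for (size_t i = period; i < PT_ENTRIES; i++)
        root[i] = root[i % period];
}
```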

Since this would need the kernel to set up the page tables in a special
way, it should not be the default. That is, unless the kernel says so,
canonical (sign-extended) virtual addresses should still be required.

--
Cesar Eduardo Barros
ces...@cesarb.eti.br

Samuel Falvo II

Sep 24, 2016, 11:30:45 AM
to Cesar Eduardo Barros, Alex Bradbury, jcb6...@gmail.com, RISC-V ISA Dev
This all seems very complicated, requiring complex software
interaction. Why not just add 32 S-mode CSRs, one per X-register,
each containing an AND-mask to be applied to the corresponding
X-register during Lx or Sx instructions? Benefits:

1) Hardware investment is minimal while maximizing software flexibility.
2) The bitmask applies only during loads and stores; it is ignored for
ALU operations. This eases the burden of masking in software when
calculating effective addresses with tagged numbers.
3) It is reconfigurable dynamically at runtime. U-code cannot change
the bitmasks; but, if it has appropriate permissions, it can ask S-code
to change them on its behalf without incurring additional heavy-weight
overheads.
4) Each X-register has its own XMask register, allowing for either
uniform or segmented windows into address spaces with different
(virtual) memory semantics.
5) With minimally clever coding, it *could* be used to implement
limited sub-page protection adequate for implementing circular data
structures like byte-queues.

Cons:

1) It'd be a bear to implement on an FPGA due to doubling the number
of flip-flops required for the register set.
2) It adds one gate delay to the hot critical path of instruction
execution. Can easily be pipelined, but then, that adds one pipeline
stage for branch prediction to have to cover up.
3) Consumes a lot of CSR space without multiplexing, or requires
indirect access to the masks if you do multiplex.
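For concreteness, a minimal behavioral model of the per-register mask idea above (the xmask array stands in for the hypothetical CSRs; here the mask is applied to the base register only, as described, before the immediate offset is added):

```C
#include <stdint.h>

/* Hypothetical simulator fragment: xmask[r] is the AND-mask paired with
 * integer register x[r]; it takes effect only when x[r] is used as the
 * base of a load or store, never for ALU operations. */
extern uint64_t x[32], xmask[32];
extern uint8_t *mem;   /* flat model of the address space */

static uint64_t effective_address(unsigned rs1, int64_t imm)
{
    return (x[rs1] & xmask[rs1]) + (uint64_t)imm;
}

static uint64_t do_ld(unsigned rs1, int64_t imm)            /* LD rd, imm(rs1)  */
{
    return *(uint64_t *)(mem + effective_address(rs1, imm));
}

static void do_sd(unsigned rs2, unsigned rs1, int64_t imm)  /* SD rs2, imm(rs1) */
{
    *(uint64_t *)(mem + effective_address(rs1, imm)) = x[rs2];
}
```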



--
Samuel A. Falvo II

Alex Bradbury

Sep 24, 2016, 12:21:28 PM
to Cesar Eduardo Barros, jcb6...@gmail.com, RISC-V ISA Dev
On 24 September 2016 at 14:20, Cesar Eduardo Barros
<ces...@cesarb.eti.br> wrote:
> On 24-09-2016 03:36, Alex Bradbury wrote:
>
> Option 2 is a non-starter. The page table format is a kernel implementation
> detail which shouldn't be exposed to user space.

As a counter-argument to this, userland software already does (ab)use
of knowledge of the virtual address space on other platforms. See e.g.
SpiderMonkey (https://bugzilla.mozilla.org/show_bug.cgi?id=1143022) or
the *San suite of tools https://reviews.llvm.org/D23811.

> Option 3 could be done in a simple way with a bit of cooperation between the
> kernel and the hardware. The hardware would have a per-process flag to
> ignore the unused bits, like in option 2. The kernel would then manipulate
> the page tables for that process to present the illusion of the used bits
> also being ignored, up to the mask requested by the process. For instance,
> for a process which requests 20 bits available for tags, running on Sv48
> which only has 16 bits available, the kernel would configure the uppermost
> page table for the process so that entries 0-15, 16-31, 32-47, and so on,
> point to the same 16 next level page tables, so the end result is as if
> these 4 bits of the address were also being ignored.
>
> Since this would need the kernel to set up the page tables in a special way,
> it should not be the default. That is, unless the kernel says so, canonical
> (sign-extended) virtual addresses should still be required.

In case it wasn't clear from my message, option 3 is definitely along
the lines I've been thinking, I hadn't thought of your idea of
simplifying the case where the mask is larger than 64-VA_size by
having the kernel co-operate with page mappings - I like that a lot.

Alex

Alex Bradbury

Sep 24, 2016, 12:29:53 PM
to Samuel Falvo II, Cesar Eduardo Barros, jcb6...@gmail.com, RISC-V ISA Dev
On 24 September 2016 at 16:30, Samuel Falvo II <sam....@gmail.com> wrote:
> This all seems very complicated, requiring complex software
> interaction. Why not just add 32 S-mode CSRs, one per X-register,
> which contains an AND-mask to be applied to the corresponding
> X-register during Lx or Sx instructions? Benefits:

If I understand correctly, this is basically a variant of my 'option
3' suggestion but with a different mask per register. Although this
increases flexibility, the downsides of all that extra per-process
state seem rather high. I'm struggling to think of a killer use-case
for having a different mask for each register?

Alex

Samuel Falvo II

Sep 24, 2016, 1:23:23 PM
to Alex Bradbury, Cesar Eduardo Barros, jcb6...@gmail.com, RISC-V ISA Dev
On Sat, Sep 24, 2016 at 9:29 AM, Alex Bradbury <a...@asbradbury.org> wrote:
> If I understand correctly, this is basically a variant of my 'option
> 3' suggestion but with a different mask per register. Although this
> increases flexibility, the downsides of all that extra per-process
> state seem rather high. I'm struggling to think of a killer use-case
> for having a different mask for each register?

Two responses to this question. The first response addresses why
multiple-masks are useful, without regard to any particular mapping or
quantity of mask registers.

In 1984, Commodore-Amiga asked Microsoft for a version of BASIC to use
on the Amiga 1000. They came up with AmigaBASIC 1.0, which shipped
with AmigaOS 1.1 IIRC, definitely with AmigaOS 1.2. (AmigaOS 1.0 had
ABasiC, which they used as MS was late in its delivery.) This product
worked pretty great on the 68000 and 68010 CPUs. It crashed hard on
68020 and later CPUs though. The reason is that it used the upper
8-bits of some types of pointers as flags.

If the 68000 had mask registers for its address registers, AND it
could have allocated A0-A3, for example, to address memory permanently
in the lower 16MB of address space, AND A4-A7 to talk to the full
address space, AND it allocated variable space in lower 16MB of
memory, then with careful design and address register usage,
AmigaBASIC could still run on contemporary, 68060-based machines
equipped with 4GB memory.

Second response is to address why mask-per-register instead of a
global mask register.

Maximum flexibility at near-zero cost in hardware. Unlike software,
you can't just respin hardware when you need a new feature (unless
you're working with FPGAs; however, even then, the cost is at least
100x that of software in terms of time required to implement and
verify it). In the case of AmigaOS, you need two kinds of pointers
(constrained and unconstrained). What happens if you need a third?
Do you want to call into the OS to change the mask register all the
time? Does that mask register affect the PC? etc. Keeping one mask
per register puts the burden on the ABI and the compiler to decide how
many memory regions you support, not on the CPU designer.

If I may explain by way of analogy, one thing that the Commodore and
Amiga chipsets taught me is that Unix philosophy applies to hardware
as much as it applies to software. While 80% of your use-cases can be
covered with a special purpose implementation of a circuit (e.g.,
VIC-II chip in C64), the other 20% is often what defines the unique
selling points of a platform.

The Commodore-Amiga exemplifies this. Back when the Amiga was still
known as the Lorraine and prior to Commodore purchasing the company,
the Denise circuit (display enabler; crudely, the equivalent of the
RAMDAC in VGA terms) was limited to displaying a total of 64
colors on the screen at once: 32 colors from the palette registers,
plus their half-bright equivalents, themselves taken out of a palette
of 4096 colors (this is, of course, extra-half-bright mode). This
alone would have been sufficient to wow an audience back in 1983.

After seeing some professional flight simulators, Jay Miner had the
idea of adding a different data path in the Denise circuit. A rather
simple circuit, it treats pixel data as instructions on how to modify
the RAMDAC's current setting. It allowed a kind of on-the-fly video
decompression, allowing it to render *all* 4096 colors on the screen
at once, in the same amount of space as a typical 64-color display
would take (a HUGE memory savings back then). It took only a few
registers, some small amount of decode logic, and some reset logic
tied into HSYNC signal to reset the RAMDAC to 000 (black). When HAM
mode was enabled, it swapped out the normal palette LUT circuitry for
the HAM circuitry (what would synthesize as a simple MUX in an FPGA
today).

When Commodore asked to switch the video circuit to produce RGB
instead of HSV video, Miner wanted to remove HAM mode, because it was
designed to take advantage of the color space of HSV. He knew it'd
look worse with raw RGB values. But, doing so would have left a hole
in the middle of DENISE chip, thus requiring a completely new layout
to cut costs, and would have added months to the fabrication of the
chip. They left it in to save time and money.
https://en.wikipedia.org/wiki/Hold-And-Modify

That fortunate accident gave the Commodore-Amiga a 12 year *dominance*
in desktop video. It led directly to the development of the Video
Toaster, Lightwave 3D (the ray tracer for the Video Toaster that is
now a multi-thousand-dollar per seat investment in many Hollywood
studios), and many popular science fiction shows, video games, and
other multimedia artifacts.

It was that 20% "edge case that nobody would use" that gave the Amiga
the reputation it earned as a serious desktop multimedia power-house.
But it was the "this is a swappable component that we can put in or
out at will" that lead Miner to add HAM mode to begin with.

Moral: In our never-ending drive to support Unix and Unix-like memory
models, we need to remember that it's valuable, even in a processor
context, that the Unix philosophy applies to hardware just as much as
it does to software. I wrote in a message some time ago that the
RISC-V specs should not be in the business of limiting opportunities
for R&D. Just because neither you nor I can arrive at a good use-case
for having per-register masks doesn't mean that some enterprising
language designer, game developer, or other software developer cannot.

Alex Elsayed

Sep 24, 2016, 3:14:36 PM
to isa...@groups.riscv.org

I suspect the idea here is that this would be something that could differ per process, and thus could be specified in the binary format (something something ELF?).

 

Then it'd be less about "a move to Sv48 or higher" and more about "realizing that this particular program actually does need that much memory sometimes."

 

It's still a footgun for future-proof code, but less of one than it otherwise might be - programs would encounter it individually and organically, as they grew to use larger amounts of memory / operate on larger data sets, without it impeding ports to a new architecture.

 

Personally, I'd have the metadata specify the _minimum number of tag bits_, as that's independent of CPU architecture. The user code, then, would compute `(sizeof(void *) * CHAR_BIT) - DESIRED_TAG_BITS` to find how far it should left-shift its tags.

 

Heck, you could do this:

 

```C
/* Note: these macros assume 0 < TAG_BITS < the pointer width when used. */
#include <limits.h>   /* CHAR_BIT */
#include <stdint.h>   /* uintptr_t */

#ifndef TAG_BITS
#define TAG_BITS 0
#endif

#define TAG_MASK ((((uintptr_t)1) << TAG_BITS) - 1)

#define POINTER_BITS \
    ((sizeof(void *) * CHAR_BIT) - TAG_BITS)

#define POINTER_MASK ((((uintptr_t)1) << POINTER_BITS) - 1)

#define SET_TAG(ptr, tag) \
    (((ptr) & POINTER_MASK) | (((tag) & TAG_MASK) << POINTER_BITS))
```

 

That, at least, minimizes portability issues caused by _improved hardware_ - moving from RV64 to RV128 would essentially never break such code, so long as the supervisor supports per-process page table depth.

 

Similarly, by using a field in the binary format, an attempt to run such a piece of code on an unsupported configuration can produce clear, early errors for the user.

Alex Bradbury

Sep 24, 2016, 7:09:01 PM
to Alex Elsayed, RISC-V ISA Dev
On 24 September 2016 at 20:14, Alex Elsayed <etern...@gmail.com> wrote:
> On Saturday, 24 September 2016 07:36:06 PDT Alex Bradbury wrote:
>> 2. Do like your proposal does (as I understand it) and have the number
>> of available bits for use as a tag dependent on the supported virtual
>> address space. A program written assuming an Sv39 system and hence 24
>> bits of tag is going to have a bad time upon a move to Sv48 or higher
>
> I suspect that the idea with this is that this would be something that could
> differ per-process, and thus could be specified in the binary format
> (something something ELF?).
>
>
>
> Then it'd be less about "a move to Sv48 or higher" and more about "realizing
> that this particular program actually does need that much memory sometimes."

That's a good point; in my description there's the implicit assumption
that there is no support for different processes using different
virtual address sizes. This isn't supported right now (see
https://groups.google.com/a/groups.riscv.org/d/msg/isa-dev/42IM6tstYOg/ePRyYIenDAAJ),
but on reflection I think adding support for it should be considered
along with the other approaches I mentioned (let's call it option
2.b!).

Alex

Michael Clark

Sep 24, 2016, 7:40:26 PM
to Alex Bradbury, Samuel Falvo II, Cesar Eduardo Barros, jcb6...@gmail.com, RISC-V ISA Dev
I’m definitely leaning towards an arbitrary mask, as this allows the OS designer to use either a fixed mask e.g. upper 8 bits, or choose an application specific mask.

I also like the idea of two masks: a trapping mask and an ignore mask. There is also the distinction between match bits and ignore bits, i.e. one part of the mask may have to match a special “random” number.


The per register mask use case is this (static setup by dynamic linker in a POSIX OS with a typical ABI):

gp: trapping mask covering process mapped TEXT(.rodata,.bss,.text,.data,etc) plus HEAP
ra: trapping mask covering process mapped TEXT (OS widens mask to cover heap for a JIT process)
sp: trapping mask covering thread stack segment
tp: trapping mask covering thread heap local data segment
x5-x31: trapping mask over heap, ignore mask selectively configured for tags (GC asks OS)


The per register mask use case is this (static setup by dynamic linker in a POSIX OS with an atypical ABI):

gp: trapping mask over process mapped TEXT(.rodata,.bss,.text,.data,etc) plus HEAP
ra: trapping mask over process mapped TEXT (OS widens mask to cover heap for a JIT process)
sp: trapping mask over thread stack segment
tp: trapping mask over thread heap local data segment
x5-x31: compiler loads memory allocation metadata into fat pointers carrying bounds [offset:size:ptr]
(likely needs adder in address decode, user mode faults to allocator, size could be stored negated or some such)


I like the idea of having an adder versus a simple masking function; it does have carry-propagation overhead, but it allows protection of non-square bounds. Geometric allocators are a special case. If there were a pair of ALUs per register, we could switch their mode between {atomic, fused} and {ADD, XOR, AND} and configure them for trap or ignore; we would need two of them. Yes, “killer bounds protection”. This is very close to having a tiny asynchronous FPGA in the address decode, distributed in the register file (versus centralised and blocking). It would be possible to build a C hardware VM with a cooperative allocator: LOAD-POINTER and STORE-POINTER compressing pointer bounds into a fat pointer.


Having the non-trapping mask per process would help the GC tag case, i.e. non-canonical pointers, but I think a per-process mask should be arbitrary. Hardcoding magic numbers is a bad idea!



Earth. Mice. Hitchhiker’s Guide to the Galaxy. This next part is complete fiction:

- 42 appears to be the magic number for an atypical compressed pointer on RV128I. 128 - 42 * 3 = 2: Leaves two tag bits for a GC (4 generations).

- 4TiB is enough per process address space for a hardware sandboxed process assuming this is only “virtual address space” on processes that opts-in to “killer bounds protection”.

- Processes that opt-out can trap on canonical pointers and be restricted to 48 or 57 bits of address space. e.g. 256TiB or 128PiB just like canonical computers.

- If the address decoder is arranged nicely, it could set the masks on an RV128 reg file such that the 42-bit address space applies only to registers holding these atypical pointers, and have this function ignored for scalar registers, i.e. a hidden ALU group bit. LOAD-POINTER vs LOAD-INTEGRAL.

- Magic numbers, 42 bit constant offset shifters. 2 left over tag bits for GC. optional functions and per process PTE formats.


I think 4TiB is plenty of per process address space for a process on a hand phone running anarch128. Imagine the canonical pointer on RV128I. Likely going to have in the order of ~88 bits all set to 1.

A 0b1111111111111111111111111111111111111111111111111111111111111111111111111111111111111111 prefix on every pointer in memory.

Now tell me who is crazy?


Michael

Jacob Bachmeyer

Sep 24, 2016, 8:09:54 PM
to Michael Clark, RISC-V ISA Dev
Misaligned accesses crossing into the tag bits... that is something I
had not considered, but the proposal does cover it as written. Such an
access is exactly equivalent to a misaligned access that crosses the top
of the address space at 0xffffffffffffffff in RV64. The same fault is
thrown. (Now the question is what fault is thrown in that case? What
if the hardware normally supports misaligned accesses?)

The canonical use case is a Lisp runtime, where pointers with different
type tags are not really comparable and two pointers with different
types but the same address are an error. I expect that most uses for
tagged pointers have similar constraints.

Since the flags are defined in terms of their effect on address
translation, it should be obvious that they have no effect if address
translation is not active. (Except the hazily envisioned MAM, which
does affect M-mode addresses even though M-mode never uses translation
when MPRV is clear, but MAM is not actually defined in this proposal,
merely reserved.) I have no objection to explicit clarification along
these lines.

> I imagine an implementation may combine this mode with a custom
> extension where a pair of CSRs contains a protection key and mask,
> whereby an exception is raised if the address does not match the
> protection key. Otherwise there are many random pointers that alias
> the same call target. A protection key and mask would allow some tag
> bits to be used for GC or other memory tracing features. A protection
> key mechanism is distinct from ASID in that it could be exposed to U
> mode, and used in combination with ‘offsets’ (that limit leakage to
> lower 64-bits on RV128) could provide one a reasonable amount of
> entropy for an ASLR mechanism on 128-bit ABI with 64-bit offsets.

Where you say "protection key", I would say "type tag for function
pointer", but otherwise I agree that that could be a further extension.

Or add "These flags only affect data accesses and have no effect on
instruction fetch--the program counter must hold a canonical address.
If it does not, an instruction access fault is thrown." to this
proposal. A separate proposal could add "code type tag" CSRs which, if
set to a non-zero value, require that instruction addresses contain
those tag bits instead of being in canonical form. This would be
another example of configuration that can only be done at a higher
privilege level and would not be available at all in machine mode. But
that is no problem, since machine mode instruction fetch always uses
Mbare anyway. In fact, making those CSRs inaccessible to the privilege
level they affect would be advisable, since they could otherwise provide
an information leak. (AUIPC will still tend to leak the current code tag
bits, but perhaps in a manner less useful to an attacker.)

> A binary translator on an architecture that mandates “canonical
> addresses” will be able to retranslate inserting masking loads and
> stores when this feature is active.
>
> Although this raises a separate issue that I had been thinking about
> (for userland sandboxes), and that is an option to make FENCE.I
> require S mode privileges and have an option for it fault in U mode. I
> don’t know if FENCI.I virtualizable? It seems to be a candidate for a
> trapping instruction in U mode. Alternatively a strict loader could
> raise a verifier exception for a well formed subset of RISC-V in ELF
> upon seeing FENCE.I. It did occur to me that FENCE.I is something that
> may wish to be administratively disabled on a per process basis. i.e.
> exceptions to be made for JITs. Self-modifying code is pretty obscure
> these days, and is usually an obfuscation or attack vector.

The problem is that FENCE.I only guarantees that instruction fetch will
see modifications. Do enough other accesses to flush caches and you
achieve the same effect, just with less efficiency, which I doubt will
cause an exploit writer to fret overmuch. Lack of FENCE.I does not
guarantee that modifications to program text will not be seen. An
implementation could do an implicit "FENCE; FENCE.I; SFENCE.VM" after
every STORE. Simple embedded processors, with only internal SRAM,
probably will. (But those probably will not support S-mode at all, so no
SFENCE.VM.)

-- Jacob

Jacob Bachmeyer

Sep 24, 2016, 8:39:49 PM
to Alex Bradbury, RISC-V ISA Dev
I agree that this first option is a bad idea. Tag length should not be
a permanent part of the ISA.

> 2. Do like your proposal does (as I understand it) and have the number
> of available bits for use as a tag dependent on the supported virtual
> address space. A program written assuming an Sv39 system and hence 24
> bits of tag is going to have a bad time upon a move to Sv48 or higher
>

There is a tacit assumption that page table depth will be moved to
supervisor control. Otherwise, an Sv48 supervisor would have to refuse
to load a program that uses 24-bit tags. UAM can only be set by the
supervisor, so a process using this must be in some way marked as "uses
tagged pointers with N-bit tags". The supervisor in turn will not
"promote" that process to a translation mode incompatible with that
constraint, instead reporting "out of address space" or "out of memory"
on a call to mmap() or similar that would violate the constraint.

> 3. Support setting a mask for address translation that is independent
> of the virtual addressing mode. More complex than the above two
> options as you have to consider the case where bits are masked that
> otherwise would be used in virtual address translation. The advantage
> is processes might opt-in to masking on a case-by-case basis and make
> the choice of trading off virtual address space for tags. e.g. even
> under Sv48 I could make use of 24-bit tags. It does of course add more
> per-process state and would need co-operation from mmap and friends
>

This 3rd option is essentially a means to squeeze every last bit of
address space compatible with a given tag length. It requires
considerably more hardware and allocating CSRs, which my proposal
intentionally avoids in order to justify requiring support for UAM/SAM.

> For all options (even 1. and 2. that have minimal implementation
> complexity), I feel it would be useful to justify the potential hassle
> in terms of benchmarks.
>

Some incorrect programs would run longer before crashing, if the
supervisor sets UAM. Most programs use only canonical addresses and
would be completely unaffected because they would run with UAM clear.
Some types of garbage collectors and language runtimes (the proposal was
inspired by a common implementation technique for Lisp) would benefit.
Supervisors must already consider sstatus as part of the user context to
support user-mode interrupts, so this proposal does not increase
context-switch overhead. The flags are defined as masking non-canonical
address faults, so their presence and use cannot reduce performance in
any reasonable implementation.

Benchmarks for this would have to be theoretical or based on instruction
count, since I believe that RISC-V would be the first to implement this
in this way. In some applications, 8-bit tags may have far less
performance benefit than 16-bit or 24-bit tags.

-- Jacob

Michael Clark

Sep 24, 2016, 9:04:45 PM
to jcb6...@gmail.com, Alex Bradbury, RISC-V ISA Dev

On 25 Sep 2016, at 1:39 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

>> 3. Support setting a mask for address translation that is independent
>> of the virtual addressing mode. More complex than the above two
>> options as you have to consider the case where bits are masked that
>> otherwise would be used in virtual address translation. The advantage
>> is processes might opt-in to masking on a case-by-case basis and make
>> the choice of trading off virtual address space for tags. e.g. even
>> under Sv48 I could make use of 24-bit tags. It does of course add more
>> per-process state and would need co-operation from mmap and friends
>
> This 3rd option is essentially a means to squeeze every last bit of address space compatible with a given tag length.  It requires considerably more hardware and allocating CSRs, which my proposal intentionally avoids in order to justify requiring support for UAM/SAM.

Something simple to start with, which is not mutually exclusive with a design that adds CSRs or register decode logic (that would likely be a vendor/model-specific extension).

The simple option to disable “canonical address checking” in hardware is a logical first step.

Michael.

Michael Clark

Sep 24, 2016, 9:08:45 PM
to jcb6...@gmail.com, RISC-V ISA Dev

On 25 Sep 2016, at 1:09 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:

>> A binary translator on an architecture that mandates “canonical addresses” will be able to retranslate inserting masking loads and stores when this feature is active.
>>
>> Although this raises a separate issue that I had been thinking about (for userland sandboxes), and that is an option to make FENCE.I require S mode privileges and have an option for it fault in U mode. I don’t know if FENCE.I is virtualizable? It seems to be a candidate for a trapping instruction in U mode. Alternatively a strict loader could raise a verifier exception for a well formed subset of RISC-V in ELF upon seeing FENCE.I. It did occur to me that FENCE.I is something that may wish to be administratively disabled on a per process basis. i.e. exceptions to be made for JITs. Self-modifying code is pretty obscure these days, and is usually an obfuscation or attack vector.
>
> The problem is that FENCE.I only guarantees that instruction fetch will see modifications.  Do enough other accesses to flush caches and you achieve the same effect, just with less efficiency, which I doubt will cause an exploit writer to fret overmuch.  Lack of FENCE.I does not guarantee that modifications to program text will not be seen.  An implementation could do an implicit "FENCE; FENCE.I; SFENCE.VM" after every STORE.  Simple embedded processors, with only internal SRAM, probably will. (But those probably will not support S-mode at all, so no SFENCE.VM.)

Yes. You are right. This can pretty much be achieved by the supervisor not allowing mmap with PROT_EXEC, e.g. via process capabilities.

I guess FENCE.I is in this case a nop. Someone has probably thought this through, although I did consider whether a trapping FENCE.I would help a binary translator (imagine RISC-V-to-RISC-V binary translation on RISC-V hardware to keep things simple, e.g. RVC on RV64G, which can’t be handled by trap-and-emulate).

Jacob Bachmeyer

Sep 24, 2016, 10:16:27 PM
to Alex Bradbury, Michael Clark, RISC-V ISA Dev
My proposal has one big advantage over the macro-op fusion approach:
fusing the mask and load operations still requires that the mask be in
an architectural register, while my proposal uses an implicit mask.
Secondarily, omitting the mask step reduces code size, since masking the
address expands LOAD/STORE into 2-op sequences, even if the decoder will
combine them.

However, my proposal and macro-op fusion are not mutually
exclusive--ignoring unused bits instead of requiring addresses be in
canonical form does not preclude fusing mask-and-load sequences into
MASKED-POINTER-LOAD operations. Fusing mask-and-store into
MASKED-POINTER-STORE, however requires a third read port on the register
file: MASKED-POINTER-STORE requires the pointer, the mask, and the
value to store.

Similar sequences can also be used for tasks such as aligning a pointer
to some boundary. For example, a pointer can be aligned to the next
8-byte boundary by adding seven and masking off the three least
significant bits. This "ADDI; ANDI" sequence could easily be combined
into an ALIGN-POINTER operation. For the particular case of align-to-8,
a "C.ADDI; C.ANDI" sequence fits in 32 bits. A particularly powerful
out-of-order microarchitecture could even fuse ALIGN-POINTER and a
following access into ALIGNED-POINTER-LOAD or ALIGNED-POINTER-STORE and
perform the pointer alignment while queuing the load or store. Since
ALIGN-POINTER takes its parameters from immediates rather than
registers, no additional register file ports are required.
ALIGNED-POINTER-LOAD must read the pointer, use offset and alignment
parameters from immediates, and clobber the pointer with the loaded
value in order to complete with only one register write.
ALIGNED-POINTER-STORE must read the pointer and value, use offset and
alignment parameters from immediates, and write the aligned pointer to
the register file while queuing the memory write. ALIGNED-POINTER-STORE
need not stall execution.
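For reference, the align-to-8 idiom in C; the two operations map directly onto the "C.ADDI; C.ANDI" pair discussed above:

```C
#include <stdint.h>

/* Round p up to the next 8-byte boundary: add 7, then clear the low
 * three bits. */
static inline uintptr_t align8(uintptr_t p)
{
    return (p + 7) & ~(uintptr_t)7;
}
```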

I note that extensive use of macro-op fusion can easily turn a RISC-V
implementation into a very CISC-like model. :)


-- Jacob

Jacob Bachmeyer

Sep 24, 2016, 10:32:58 PM
to Cesar Eduardo Barros, Alex Bradbury, RISC-V ISA Dev
What? You are confusing the page table format with the virtual address
width. User space is inevitably exposed to the virtual address width.
If you only have 39-bit virtual addresses, you cannot mmap() something
at 0x000A987654321000, for example.

> Option 1 hard-codes a limit in the process address space, in a way
> which could hinder future experimentation with larger address spaces.
> 56 bits might seem to be a lot, until you start thinking about sparse
> mappings (either of a huge sparse file, or several large mappings
> spread around the address space to give them room to grow or for
> ASLR). Even for RV128, permanently forbidding several bits from being
> used could prevent interesting ideas.

This would be why no one is seriously proposing option 1. The core
RISC-V team has already considered and rejected that approach, for good
reason.

> Option 3 could be done in a simple way with a bit of cooperation
> between the kernel and the hardware. The hardware would have a
> per-process flag to ignore the unused bits, like in option 2. The
> kernel would then manipulate the page tables for that process to
> present the illusion of the used bits also being ignored, up to the
> mask requested by the process. For instance, for a process which
> requests 20 bits available for tags, running on Sv48 which only has 16
> bits available, the kernel would configure the uppermost page table
> for the process so that entries 0-15, 16-31, 32-47, and so on, point
> to the same 16 next level page tables, so the end result is as if
> these 4 bits of the address were also being ignored.

This approach of page table aliasing would allow an Sv48 supervisor to
support a program using 24-bit tags, even if paging depth is not moved
to supervisor control, at the cost of reduced TLB efficiency. Thanks
for suggesting it.

> Since this would need the kernel to set up the page tables in a
> special way, it should not be the default. That is, unless the kernel
> says so, canonical (sign-extended) virtual addresses should still be
> required.

The proposal as written places UAM under supervisor control for exactly
this reason--the kernel must not select a paging depth that would cause
virtual addresses to overlap with the tag bits used by a process.


-- Jacob

Jacob Bachmeyer

Sep 24, 2016, 10:39:45 PM
to Michael Clark, Alex Bradbury, RISC-V ISA Dev
Indeed, explicit mask CSRs could be added, provided that their initial
state is no-effect, but those CSRs would become part of the user context
that the supervisor must swap. UAM is part of sstatus, which
supervisors must already save/restore to support U-mode interrupts.


-- Jacob

Jacob Bachmeyer

Sep 24, 2016, 11:30:12 PM
to Samuel Falvo II, Cesar Eduardo Barros, Alex Bradbury, RISC-V ISA Dev
Samuel Falvo II wrote:
> This all seems very complicated, requiring complex software
> interaction.

What is complicated about a user program having a "uses 16 tag bits"
note in its ELF header and the kernel ensuring that that program is
never run with virtual addresses wider than 48 bits? If the program
does not request tag bits, then it would run with UAM clear and
non-canonical addresses would raise faults.
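One hypothetical way to carry such a note, using the standard ELF note container; the note name and type value below are invented for illustration only:

```C
#include <elf.h>
#include <stdint.h>

#define NT_TAGBITS 0x52560001          /* invented note type */

/* "This binary assumes at least tag_bits ignored upper address bits."
 * A loader that cannot guarantee that many would refuse to set UAM
 * (or refuse to run the binary at all). */
struct tagbits_note {
    Elf64_Nhdr hdr;        /* namesz = 6, descsz = 4, type = NT_TAGBITS */
    char       name[8];    /* "RISCV\0", padded to a 4-byte multiple    */
    uint32_t   tag_bits;   /* minimum number of tag bits required       */
};
```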

> Why not just add 32 S-mode CSRs, one per X-register,
> which contains an AND-mask to be applied to the corresponding
> X-register during Lx or Sx instructions? Benefits:
>
>

An alternate proposal to consider! I will make a comparison one step at
a time:

> 1) Hardware investment is minimal while maximizing software flexibility.
>

My proposal requires even less hardware, only two additional register
cells in mstatus, some wiring, and a few gates to inhibit non-canonical
address faults.

Compare that to an additional 32 (or possibly 128, if this functionality
is also available to higher privilege modes) XLEN-bit CSRs (so 32*XLEN
(up to 128*XLEN) additional register cells and the associated CSR access
logic) and XLEN additional AND gates in the address calculation path. I
assume that the appropriate mask register value is implicitly driven on
an internal bus to the address calculation logic on every memory access.

> 2) Bitmask applies only during loads and stores; they're ignored for
> ALU operations. This eases the burden of masking in software when
> calculating effective addresses with tagged numbers.
>

Almost the same as my proposal, except that I propose that the mask only
apply as a virtual address is translated. Computing effective addresses
with tagged numbers will corrupt the tags, unless software is careful to
preserve them, another reason that my proposal ignores the tag bits
instead of requiring them to have particular (software-defined) values.
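A two-line illustration of the hazard: if an index computation overflows the address field, the carry silently lands in the tag (the field layout is illustrative).

```C
#include <stdint.h>

#define TAG ((uintptr_t)0x2A << 48)        /* illustrative 16-bit tag field */

uintptr_t p = TAG | 0x0000FFFFFFFFFFF0UL;  /* tagged pointer near the top of a 48-bit space */
uintptr_t q = p + 0x20;                    /* carry reaches bit 48: the tag becomes 0x2B */
```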

> 3) It is reconfigurable dynamically at runtime. U-code cannot change
> the bitmasks; but, if it has appropriate permissions, can ask S-code
> to change them on its behalf without incurring additional heavy-weight
> overheads.
>

Almost exactly the same--U-mode cannot change UAM in my proposal either,
since correctness with UAM set requires that the supervisor ensure that
virtual address space is limited in the way that U-mode expects.

> 4) Each X-register has its own XMask register, allowing for either
> uniform or segmented windows into address spaces with different memory
> (virtual) memory semantics.
>

This is the big difference. I propose that paging implicitly define a
mask that applies to all address translations, while this alternate
approach permits each architectural register to have an independent
mask. This alternate approach is far more general, but is no better for
the use case of tagged pointers, where all the masks would have the same
value.

> 5) With minimally clever coding, it *could* be used to implement
> limited sub-page protection adequate for implementing circular data
> structures like byte-queues.
>

Support for circular data structures is common in DSP architectures.
This alternate proposal might be useful for a RISC-V DSP extension.

> Cons:
>
> 1) It'd be a bear to implement on an FPGA due to doubling the number
> of flip-flops required for the register set.
>

This doubled resource cost on an FPGA directly translates to doubling
the silicon area required in an ASIC implementation. My proposal
requires two additional flip-flops inside the existing mstatus CSR.

> 2) It adds one gate delay to the hot critical path of instruction
> execution. Can easily be pipelined, but then, that adds one pipeline
> stage for branch prediction to have to cover up.
>

My proposal may or may not add a gate delay, depending on how exactly an
implementation detects and traps on non-canonical addresses.

> 3) Consumes a lot of CSR space without multiplexing, or requires
> indirect access to the masks if you do multiplex.
>

My proposal defines two bits in the existing mstatus CSR.

This alternate proposal also has a fourth drawback: the new XMask CSRs
are part of the user context and must be saved/restored on every context
switch. My proposal adds bits to an existing CSR (sstatus aliasing part
of mstatus) that supervisor context switch must already save and restore
to support user-mode interrupts.


In conclusion, I suggest that this alternate proposal be considered as
part of a future RISC-V DSP extension and that a similar model be
considered for a RISC-V vector extension, since the ability for a vector
unit to access memory in "odd" patterns using address masks may be
useful enough to outweigh the cost of the mask registers. Putting the
masks in the vector unit could also avoid consuming so much CSR space,
depending on how exactly vector unit mask registers are accessed.


-- Jacob

Andrew Waterman

Sep 26, 2016, 1:52:47 AM
to Samuel Falvo II, Alex Bradbury, Cesar Eduardo Barros, jcb6...@gmail.com, RISC-V ISA Dev
Hey y'all,


On Saturday, September 24, 2016, Samuel Falvo II <sam....@gmail.com> wrote:
> On Sat, Sep 24, 2016 at 9:29 AM, Alex Bradbury <a...@asbradbury.org> wrote:
>> If I understand correctly, this is basically a variant of my 'option
>> 3' suggestion but with a different mask per register. Although this
>> increases flexibility, the downsides of all that extra per-process
>> state seem rather high. I'm struggling to think of a killer use-case
>> for having a different mask for each register?
>
> Two responses to this question.  The first response addresses why
> multiple-masks are useful, without regard to any particular mapping or
> quantity of mask registers.
>
> In 1984, Commodore-Amiga asked Microsoft for a version of BASIC to use
> on the Amiga 1000.  They came up with AmigaBASIC 1.0, which shipped
> with AmigaOS 1.1 IIRC, definitely with AmigaOS 1.2.  (AmigaOS 1.0 had
> ABasiC, which they used as MS was late in its delivery.)  This product
> worked pretty great on the 68000 and 68010 CPUs.  It crashed hard on
> 68020 and later CPUs though.  The reason is that it used the upper
> 8-bits of some types of pointers as flags.

I just wanted to echo Sam's concern here.

Not-trapping on unused address bits is a well-established anti-pattern. I was shocked that ARM chose to include that feature in v8.


Cesar Eduardo Barros

unread,
Sep 26, 2016, 6:33:31 AM9/26/16
to Alex Bradbury, jcb6...@gmail.com, RISC-V ISA Dev
Em 24-09-2016 10:20, Cesar Eduardo Barros escreveu:
> Em 24-09-2016 03:36, Alex Bradbury escreveu:
>> 3. Support setting a mask for address translation that is independent
>> of the virtual addressing mode. More complex than the above two
>> options as you have to consider the case where bits are masked that
>> otherwise would be used in virtual address translation. The advantage
>> is processes might opt-in to masking on a case-by-case basis and make
>> the choice of trading off virtual address space for tags. e.g. even
>> under Sv48 I could make use of 24-bit tags. It does of course add more
>> per-process state and would need co-operation from mmap and friends
>
> Option 3 could be done in a simple way with a bit of cooperation between
> the kernel and the hardware. The hardware would have a per-process flag
> to ignore the unused bits, like in option 2. The kernel would then
> manipulate the page tables for that process to present the illusion of
> the used bits also being ignored, up to the mask requested by the
> process. For instance, for a process which requests 20 bits available
> for tags, running on Sv48 which only has 16 bits available, the kernel
> would configure the uppermost page table for the process so that entries
> 0-15, 16-31, 32-47, and so on, point to the same 16 next level page
> tables, so the end result is as if these 4 bits of the address were also
> being ignored.
>
> Since this would need the kernel to set up the page tables in a special
> way, it should not be the default. That is, unless the kernel says so,
> canonical (sign-extended) virtual addresses should still be required.
>

Thinking more about it, there's a problem with option 3 I hadn't
considered: it's a per-process global variable. Consider what happens if
a process uses three different libraries: one which doesn't use tagged
pointers, one which wants 16-bit tags, and one which wants 24-bit tags.
To which value should the mask be set?

The more I think about it, the more sense it makes to keep the current
"trap on non-canonical addresses" behavior, and do the masking manually
in each library. The only thing needed then would be a way to ask the
kernel in each mmap (or similar) call to allocate memory below a certain
threshold; that would be generically useful for all architectures.
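
For concreteness, a minimal sketch of that manual masking (assuming a
16-bit tag and allocations kept in the lower half of the address space,
e.g. via the mmap threshold described above; the helper names are
illustrative, not any existing API):

    #include <stdint.h>

    #define TAG_BITS  16
    #define TAG_SHIFT (64 - TAG_BITS)
    #define ADDR_MASK ((UINT64_C(1) << TAG_SHIFT) - 1)

    /* Pack a tag into the high bits of a pointer the library owns.
     * Assumes the allocation sits in the lower half of the address
     * space, so the masked address is already canonical. */
    static inline void *tag_ptr(void *p, uint16_t tag)
    {
        uint64_t a = (uint64_t)(uintptr_t)p & ADDR_MASK;
        return (void *)(uintptr_t)(a | ((uint64_t)tag << TAG_SHIFT));
    }

    /* Strip the tag before every dereference -- this is the per-access
     * masking that hardware support would make unnecessary. */
    static inline void *untag_ptr(void *p)
    {
        return (void *)(uintptr_t)((uint64_t)(uintptr_t)p & ADDR_MASK);
    }

    static inline uint16_t ptr_tag(void *p)
    {
        return (uint16_t)((uint64_t)(uintptr_t)p >> TAG_SHIFT);
    }

Every load or store through a tagged pointer goes through untag_ptr(),
which is exactly the per-access overhead being debated in this thread.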

Alex Bradbury

unread,
Sep 26, 2016, 6:44:23 AM9/26/16
to Cesar Eduardo Barros, jcb6...@gmail.com, RISC-V ISA Dev
On 26 September 2016 at 11:33, Cesar Eduardo Barros
<ces...@cesarb.eti.br> wrote:
> Thinking more about it, there's a problem with option 3 I hadn't considered:
> it's a per-process global variable. Consider what happens if a process uses
> three different libraries: one which doesn't use tagged pointers, one which
> wants 16-bit tags, and one which wants 24-bit tags. To which value should
> the mask be set?

Yes, I've thought about this too. In this case, the person compiling
the binary would be expected to generate an ELF requesting 24-bit
tags. It is a potential hassle, but I'm not sure the concern outweighs
the potential advantages of embedding metadata in pointers. As you
suggest in your next paragraph, having software mask where necessary
is a viable solution that requires no hardware changes. Because of
this, the question that needs to be answered is what cost this entails.

Alex

Alex Bradbury

unread,
Sep 26, 2016, 6:48:59 AM9/26/16
to Andrew Waterman, Samuel Falvo II, Cesar Eduardo Barros, jcb6...@gmail.com, RISC-V ISA Dev
On 26 September 2016 at 06:52, Andrew Waterman <and...@sifive.com> wrote:
> Hey y'all,
>
> On Saturday, September 24, 2016, Samuel Falvo II <sam....@gmail.com>
> wrote:
>>
>> On Sat, Sep 24, 2016 at 9:29 AM, Alex Bradbury <a...@asbradbury.org> wrote:
>> > If I understand correctly, this is basically a variant of my 'option
>> > 3' suggestion but with a different mask per register. Although this
>> > increases flexibility, the downsides of all that extra per-process
>> > state seem rather high. I'm struggling to think of a killer use-case
>> > for having a different mask for each register?
>>
>> Two responses to this question. The first response addresses why
>> multiple-masks are useful, without regard to any particular mapping or
>> quantity of mask registers.
>>
>> In 1984, Commodore-Amiga asked Microsoft for a version of BASIC to use
>> on the Amiga 1000. They came up with AmigaBASIC 1.0, which shipped
>> with AmigaOS 1.1 IIRC, definitely with AmigaOS 1.2. (AmigaOS 1.0 had
>> ABasiC, which they used as MS was late in its delivery.) This product
>> worked pretty great on the 68000 and 68010 CPUs. It crashed hard on
>> 68020 and later CPUs though. The reason is that it used the upper
>> 8-bits of some types of pointers as flags.
>
>
> I just wanted to echo Sam's concern here.
>
> Not-trapping on unused address bits is a well-established anti-pattern. I
> was shocked that ARM chose to include that feature in v8.

I would highlight that most of the solutions being discussed in this
thread would maintain binary portability. We're in complete agreement
that having a piece of software written assuming it can trample over
24bits of every 64-bit software suddenly fail when running on an Sv48
machine would be a non-starter. [Of course even if the hardware
supports masking you'd need software support to limit the range of
addresses that mmap returns, but software on other archs will need
this anyway, e.g. SpiderMonkey may want to continue using 16-bits of
embedded metadata when archs move to a larger virtual address space].

Alex

Alex Bradbury

unread,
Sep 26, 2016, 6:52:39 AM9/26/16
to Andrew Waterman, Samuel Falvo II, Cesar Eduardo Barros, jcb6...@gmail.com, RISC-V ISA Dev
Misfiring neuron, should read "We're in complete agreement that having
a piece of software written assuming it can trample over 24bits of
every 64-bit _pointer_ suddenly fail when running on an Sv48 machine
would be a non-starter"

Alex

Cesar Eduardo Barros

unread,
Sep 26, 2016, 7:11:06 PM9/26/16
to Alex Bradbury, jcb6...@gmail.com, RISC-V ISA Dev
In a world with dynamically loaded libraries, I can ask: which binary?
Suppose I write a program which uses a third-party Javascript library
which uses tagged pointers. Suppose that the program also uses another
library written by someone else, which depends on yet another library
which also wants to use tagged pointers (with a different tag size).

Now suppose I am running a program which loads a plugin that is linked
to a shared library using a bigger tag size than the main program
currently uses; to make it worse, the main program has already
allocated data at addresses high enough to conflict with the new tag
size.

This can be a problem even with static linking: which compilation unit
defines the tag size to be used?

The only sane way is to either have the tag size fixed for everyone
(option 1, which is bad for other reasons), or to have the tag size be a
property of the allocation, not of the process.

That is, in the same process one could have pointers with a 16-bit tag,
pointers with an 8-bit tag, canonical pointers (no tag), and so on, and
the tag size (actually the "reserved high bits") is chosen when
allocating, something like a MAP_48BIT/MAP_56BIT/etc (note, however,
that the x86-specific MAP_32BIT is a misnomer, since it limits the
allocation to 31 bits). Whoever allocated the memory, and thus is
responsible for it, knows how many tag bits it's using and can mask them
by hand.
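
As a rough illustration of the per-allocation idea: no MAP_48BIT or
MAP_56BIT flag exists today (only the x86-specific MAP_32BIT), so the
sketch below fakes the request with a hint address and a post-check;
mmap_below() is a made-up helper name, and a real implementation would
want a proper flag so the check could not fail spuriously.

    #define _DEFAULT_SOURCE              /* for MAP_ANONYMOUS on glibc */
    #include <errno.h>
    #include <stddef.h>
    #include <stdint.h>
    #include <sys/mman.h>

    /* Hypothetical "allocate below 2^addr_bits" helper.  We pass a hint
     * under the limit and verify that the kernel honoured it, unmapping
     * and failing if it did not. */
    static void *mmap_below(size_t len, unsigned addr_bits)
    {
        uint64_t limit = UINT64_C(1) << addr_bits;
        void *hint = (void *)(uintptr_t)(limit / 2);   /* any address under the limit */
        void *p = mmap(hint, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED)
            return NULL;
        if ((uint64_t)(uintptr_t)p + len > limit) {    /* hint was not honoured */
            munmap(p, len);
            errno = ENOMEM;
            return NULL;
        }
        return p;
    }

A library wanting an 8-bit tag would then call, say, mmap_below(size, 56)
and own bits 63..56 of the resulting pointers.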

Jacob Bachmeyer

unread,
Sep 26, 2016, 8:49:19 PM9/26/16
to Andrew Waterman, Samuel Falvo II, Alex Bradbury, Cesar Eduardo Barros, RISC-V ISA Dev
Andrew Waterman wrote:
> Hey y'all,
>
> On Saturday, September 24, 2016, Samuel Falvo II <sam....@gmail.com
> <mailto:sam....@gmail.com>> wrote:
>
> On Sat, Sep 24, 2016 at 9:29 AM, Alex Bradbury <a...@asbradbury.org
> <javascript:;>> wrote:
> > If I understand correctly, this is basically a variant of my 'option
> > 3' suggestion but with a different mask per register. Although this
> > increases flexibility, the downsides of all that extra per-process
> > state seem rather high. I'm struggling to think of a killer use-case
> > for having a different mask for each register?
>
> Two responses to this question. The first response addresses why
> multiple-masks are useful, without regard to any particular mapping or
> quantity of mask registers.
>
> In 1984, Commodore-Amiga asked Microsoft for a version of BASIC to use
> on the Amiga 1000. They came up with AmigaBASIC 1.0, which shipped
> with AmigaOS 1.1 IIRC, definitely with AmigaOS 1.2. (AmigaOS 1.0 had
> ABasiC, which they used as MS was late in its delivery.) This product
> worked pretty great on the 68000 and 68010 CPUs. It crashed hard on
> 68020 and later CPUs though. The reason is that it used the upper
> 8-bits of some types of pointers as flags.
>
>
> I just wanted to echo Sam's concern here.
>
> Not-trapping on unused address bits is a well-established
> anti-pattern. I was shocked that ARM chose to include that feature in v8.


That is why my proposal makes UAM a supervisor-controlled option. With
full 64-bit virtual addresses, the supervisor can fill the top level(s)
of page tables with alias entries to produce the same effect as this
proposal. In Sv39 as it currently stands, this type of aliasing is not
an option and a program using tagged pointers loses an architectural
register (to store the mask) and needs an extra instruction on every
heap access to mask out the pointer tag. Alternately, every heap access
requires a separate read to load the mask from memory. Tagged pointers
are a bit of a specialist feature, I admit, and the example use case for
my proposal is a Lisp runtime. Others have cited garbage collectors as
potential users of this feature.

In this proposal, hardware is required (because the hardware cost is
trivial compared to a 64-bit datapath) to support the implicit address
mask flag, but it has no effect unless set by the supervisor. No sane
supervisor would set UAM unless a program requests it by declaring the
use of tagged pointers. If UAM is clear, unused address bits must form
the sign-extension of the virtual address or a fault is taken. If UAM
is set, unused address bits are ignored. UAM defaults clear. UAM is
part of sstatus and is therefore a per-process flag, saved and restored
during supervisor context switch. (Rationale: sstatus contains
ustatus, which holds the user-mode interrupt enable flags. A supervisor
must swap at least ustatus during context switch to be compatible with
user-level interrupts. It may as well swap sstatus, which will also
allow threads to be preempted in the supervisor, even during access to
user memory, by saving/restoring PUM. I believe that Linux currently
preempts threads, even in the kernel, so swapping sstatus is no
additional burden.)

(-->_I want to emphasize this: Unlike ARMv8, where no-trap unused
address bits are part of the ISA, this proposal does not preclude future
growth of the virtual address space._<--) Programs that do not use tag
bits can use the full 64-bit virtual address space if the hardware
supports it, while programs that do use tag bits can trade address space
for tag bits. Because the supervisor can put aliases in the page
tables, a process can use any number of tag bits regardless of current
paging depth. Because the supervisor knows how many tag bits U-mode
expects, it can ensure that no mappings conflict with the tag bits. If
all implementations had 64-bit virtual addresses, this proposal would be
meaningless because the supervisor could build aliases into the page
tables to provide tag bits to U-mode. The proposed option to ignore
unused address bits allows the same flexibility when the hardware
implements a smaller virtual address space.
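
To make the aliasing concrete, here is a minimal sketch of the
page-table manipulation a supervisor could perform. pte_t, the
512-entry Sv48 top level, and alias_top_level() are placeholders for
whatever structures a real kernel already has, and the aliases would
also have to be kept in sync whenever the low entries change.

    #include <stdint.h>

    #define PT_ENTRIES 512                 /* Sv48: 9 index bits per level */
    typedef uint64_t pte_t;                /* placeholder PTE type */

    /* Replicate top-level PTEs so that extra_tag_bits of the translated
     * high address bits are ignored for this process (assumes
     * extra_tag_bits <= 9).  Example: Sv48 with a request for 20 tag
     * bits gives extra_tag_bits = 20 - 16 = 4, so every entry i becomes
     * a copy of entry (i mod 32) and VA bits 47..44 no longer affect
     * translation. */
    static void alias_top_level(pte_t *root, unsigned extra_tag_bits)
    {
        unsigned distinct = PT_ENTRIES >> extra_tag_bits;
        for (unsigned i = distinct; i < PT_ENTRIES; i++)
            root[i] = root[i % distinct];
    }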


-- Jacob

Jacob Bachmeyer

unread,
Sep 26, 2016, 11:03:03 PM9/26/16
to Cesar Eduardo Barros, Alex Bradbury, RISC-V ISA Dev
Cesar Eduardo Barros wrote:
> Em 26-09-2016 07:44, Alex Bradbury escreveu:
>> On 26 September 2016 at 11:33, Cesar Eduardo Barros
>> <ces...@cesarb.eti.br> wrote:
>>> Thinking more about it, there's a problem with option 3 I hadn't
>>> considered:
>>> it's a per-process global variable. Consider what happens if a
>>> process uses
>>> three different libraries: one which doesn't use tagged pointers,
>>> one which
>>> wants 16-bit tags, and one which wants 24-bit tags. To which value
>>> should
>>> the mask be set?
>>
>> Yes, I've thought about this too. In this case, the person compiling
>> the binary would be expected to generate an ELF requesting 24-bit
>> tags. It is a potential hassle, but I'm not sure the concern outweighs
>> the potential advantages of embedding metadata in pointers. As you
>> suggest in your next paragraph, having software mask where necessary
>> is a viable solution that requires no hardware changes. Because of
>> this, the question that needs to be answered is the cost this entails.
>
> In a world with dynamically loaded libraries, I can ask: which binary?

Generally, the main program binary that was passed to execve(2) would
determine the number of tag bits for the process.

> Suppose I write a program which uses a third-party Javascript library
> which uses tagged pointers. Suppose that the program also uses another
> library written by someone else, which depends on yet another library
> which also wants to use tagged pointers (with a different tag size).

Then both libraries must declare that they use tagged pointers and the
linker must propagate that to the program executable. Alternately, the
dynamic loader could note the largest tag size used by any of the
initially-loaded libraries and ask the kernel for tags of that size.
This proposal defines mechanism, not policy. How exactly a supervisor
determines which processes use tagged pointers and the tag size is
beyond the scope of the RISC-V ISA.
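
Purely as an illustration of "largest tag size wins": the per-object
annotation, the loader hook, and the riscv_set_tag_bits() call named in
the comment are all hypothetical; no such interface exists today.

    #include <stddef.h>

    struct loaded_object {
        const char *name;
        unsigned tag_bits;      /* declared tag width, 0 if untagged */
    };

    /* The dynamic loader scans every initially-loaded object, takes the
     * maximum declared tag width, and would then make a single request
     * to the supervisor (e.g. a prctl()-style riscv_set_tag_bits(),
     * invented here for illustration). */
    static unsigned max_tag_bits(const struct loaded_object *objs, size_t n)
    {
        unsigned widest = 0;
        for (size_t i = 0; i < n; i++)
            if (objs[i].tag_bits > widest)
                widest = objs[i].tag_bits;
        return widest;
    }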

> Now suppose I am running a program, which loads a plugin which is
> linked to a shared library which uses a bigger tag size than is
> currently being used by the main program, and to make it worse the
> main program already has allocated data on addresses high enough to
> conflict with the new tag size.

For plugins, the tag size would be part of the host program's ABI.
Loading a plugin with a larger tag size than the host program is not
allowed. This is not entirely new--libpthread cannot be dynamically
loaded at runtime on GNU systems. Either the initial program links
libpthread, or POSIX threads may not be used in the process.

> This can be a problem even with static linking: which compilation unit
> defines the tag size to be used?

The largest tag size needed by any compilation unit will be placed in
the final binary.



-- Jacob

Stefan O'Rear

unread,
Sep 26, 2016, 11:41:59 PM9/26/16
to jcb6...@gmail.com, Andrew Waterman, Samuel Falvo II, Alex Bradbury, Cesar Eduardo Barros, RISC-V ISA Dev
On Mon, Sep 26, 2016 at 5:49 PM, Jacob Bachmeyer <jcb6...@gmail.com> wrote:
> load the mask from memory. Tagged pointers are a bit of a specialist
> feature, I admit, and the example use case for my proposal is a Lisp
> runtime. Others have cited garbage collectors as potential users of this
> feature.

Can you give a _specific_, _recent_ example of a program that benefits
from this feature?

-s

Alex Bradbury

unread,
Sep 27, 2016, 1:41:29 AM9/27/16
to Cesar Eduardo Barros, jcb6...@gmail.com, RISC-V ISA Dev
On 27 September 2016 at 00:11, Cesar Eduardo Barros
<ces...@cesarb.eti.br> wrote:
> Em 26-09-2016 07:44, Alex Bradbury escreveu:
>>
>> On 26 September 2016 at 11:33, Cesar Eduardo Barros
>> <ces...@cesarb.eti.br> wrote:
>>>
>>> Thinking more about it, there's a problem with option 3 I hadn't
>>> considered:
>>> it's a per-process global variable. Consider what happens if a process
>>> uses
>>> three different libraries: one which doesn't use tagged pointers, one
>>> which
>>> wants 16-bit tags, and one which wants 24-bit tags. To which value should
>>> the mask be set?
>>
>>
>> Yes, I've thought about this too. In this case, the person compiling
>> the binary would be expected to generate an ELF requesting 24-bit
>> tags. It is a potential hassle, but I'm not sure the concern outweighs
>> the potential advantages of embedding metadata in pointers. As you
>> suggest in your next paragraph, having software mask where necessary
>> is a viable solution that requires no hardware changes. Because of
>> this, the question that needs to be answered is the cost this entails.
>
>
> In a world with dynamically loaded libraries, I can ask: which binary?
> Suppose I write a program which uses a third-party Javascript library which
> uses tagged pointers. Suppose that the program also uses another library
> written by someone else, which depends on yet another library which also
> wants to use tagged pointers (with a different tag size).

Using embedded pointer metadata is something any library should think
long and hard about, as it can break encapsulation and force changes
in the main program. This is really no different to library authors
making the right decision about creating threads internally, forking,
or linking with something like the Boehm GC. Nor is it a problem that
goes away if the library performs masking manually: unless there is a
complete separation between memory allocated outside and inside the
given library (often desirable, but rare and not easy to verify), you
run the risk that one of the bits thought to be unused isn't, e.g.
mmap returns a page where the 49th bit is used in virtual address
translation. The solution to this problem, even on machines that
unconditionally trap on non-canonical addresses, is clearly to set the
usable virtual address range for the whole binary, and to treat
linking a library that requires 24-bit pointer metadata into a host
process that allows only 16 bits as ABI-incompatible. Once you have
that mechanism, I'm not seeing a strong argument as to why unused bits
in the virtual address shouldn't be ignored.

Alex

Michael Clark

unread,
Sep 27, 2016, 3:04:13 PM9/27/16