
Tonight's tradeoff


Robert Finch

Nov 12, 2023, 10:47:14 PM
Branch miss logic versus clock frequency.

The branch miss logic for the current OoO version of Thor is quite
involved. It needs to back out the register source indexes to the last
valid source before the branch instruction. To do this in a single
cycle, the logic is about 25+ logic levels deep. I find this somewhat
unacceptable.

I can remove a lot of logic, improving the clock frequency substantially,
by removing the branch miss logic that resets the register source IDs to
the last valid source. Instead of stomping on the instructions on a miss
and flushing them in a single cycle, I think the predicate for each
instruction can be cleared, which effectively turns it into a NOP. The
value of the target register is still propagated in the reorder buffer,
so the register source IDs need not be reset. The reorder buffer is only
eight entries, so on average four entries would be turned into NOPs. The
NOPs still propagate through the reorder buffer, so it may take several
clock cycles for them to be flushed from the buffer, meaning the branch
latency for mispredicted branches would be quite high. However, if the
clock frequency can be improved by 20% for all instructions, much of the
performance lost on branches may be made up.
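
A minimal software sketch of the idea (not the actual Thor RTL; the entry
fields and names here are invented), walking the 8-entry circular ROB and
clearing the predicate of everything younger than the branch:

/* Clear predicates of ROB entries younger than a mispredicted branch so
 * they retire as NOPs instead of being flushed in one cycle. */
#define ROB_ENTRIES 8

struct rob_entry {
    int valid;      /* entry holds an in-flight instruction          */
    int predicate;  /* 0 => instruction retires as a NOP             */
    /* ... target register, result value, etc. ...                   */
};

static void branch_miss_nopify(struct rob_entry rob[ROB_ENTRIES],
                               unsigned branch_idx, unsigned tail_idx)
{
    /* entries younger than the branch are squashed in place */
    for (unsigned i = (branch_idx + 1) % ROB_ENTRIES;
         i != tail_idx;
         i = (i + 1) % ROB_ENTRIES) {
        if (rob[i].valid)
            rob[i].predicate = 0;  /* becomes a NOP, still flows to retire */
    }
}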

EricP

Nov 13, 2023, 11:10:51 AM
Basically it sounds like you want to eliminate the checkpoint and rollback,
and instead let resources be recovered at Retire. That could work.

However you are not restoring the Renamer's future Register Alias Table (RAT)
to its state at the point of the mispredicted branch instruction, which is
what the rollback would have done, so its state will be whatever it was at
the end of the mispredicted sequence. That needs to be re-sync'ed with the
program state as of the branch.

That can be accomplished by stalling the front end, waiting until the
mispredicted branch reaches Retire, then copying the committed RAT,
maintained by Retire, to the future RAT at Rename, and restarting the front end.
The list of free physical registers is then all those that are not
marked as architectural registers.
This is partly how I handle exceptions.
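
As a rough sketch of that recovery sequence (the names and sizes are mine,
not from any particular design), assuming the front end is already stalled
and the mispredicted branch has just retired:

#define NUM_ARCH_REGS 32
#define NUM_PHYS_REGS 64

struct rat { unsigned map[NUM_ARCH_REGS]; };   /* AR -> PR */

static void recover_at_retire(struct rat *future, const struct rat *committed,
                              unsigned char free_pr[NUM_PHYS_REGS])
{
    /* 1. resync the future RAT with committed program state */
    *future = *committed;

    /* 2. free list = every PR not currently holding architectural state */
    for (unsigned p = 0; p < NUM_PHYS_REGS; p++)
        free_pr[p] = 1;
    for (unsigned a = 0; a < NUM_ARCH_REGS; a++)
        free_pr[committed->map[a]] = 0;
}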

Also you still need a mechanism to cancel start of execution of the
subset of pending uOps for the purged set. You don't want to launch
a LD or DIV from the mispredicted set if it has not already started.
If you are using a reservation station design then you need some way
to distribute the cancel request to the various FU's and RS's,
and wait for them to clean themselves up.

Note that some things might not be able to cancel immediately,
like an in-flight MUL in a pipeline or an outstanding LD to the cache.
So some of this will be asynchronous (send cancel request, wait for ACK).

There are some other things that might need cleanup.
A Return Stack Predictor might be manipulated by the mispredicted path.
Not sure how to handle that without a checkpoint.
Maybe have two copies like RAT, a future one maintained by Decode and
a committed one maintained by Retire, and copy the committed to future.


MitchAlsup

Nov 13, 2023, 2:53:55 PM
Robert Finch wrote:

> Branch miss logic versus clock frequency.

> The branch miss logic for the current OoO version of Thor is quite
> involved. It needs to back out the register source indexes to the last
> valid source before the branch instruction. To do this in a single
> cycle, the logic is about 25+ logic levels deep. I find this somewhat
> unacceptable.
<
When you launch a predicted branch into execution (the prelude to signaling
that recovery is required), while the branch is determining whether to back up
(or not), have the branch recovery logic set up the register indexes such
that::
a) if the branch succeeds, keep the current map;
b) if the branch fails, you are 1 multiplexer delay from having the state
you want.
<
That is, move the repair setup to the previous clock.

MitchAlsup

Nov 13, 2023, 3:04:43 PM
I, personally, don't use a RAT--I use a CAM based architectural decoder
for operand read and a standard physical equality decoder for writes.
<
Every cycle the CAM.valid bits are block loaded into a history table
and if you need to return the CAMs to the checkpointed mappings, you
take the valid bits from the history table and write the CAM.valid
bits back into the physical register file. Presto, the map is how it
used to be.
<
Can even be made to be performed in 0-cycles. {yes: 0 not 1 cycles}
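
In software-model terms (names invented; the real thing is of course a
block copy in hardware, not code), the checkpoint save and restore is just:

#include <stdint.h>

#define NUM_CHECKPOINTS 16

static uint64_t cam_valid;                 /* one valid bit per PRF entry */
static uint64_t history[NUM_CHECKPOINTS];  /* saved valid-bit rows        */

static void save_checkpoint(unsigned id)    { history[id] = cam_valid; }
static void restore_checkpoint(unsigned id) { cam_valid = history[id]; }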
<
> That can be accomplished by stalling the front end, waiting until the
> mispredicted branch reaches Retire and then copying the committed RAT,
> maintained by Retire, to the future RAT at Rename, and restart front end.
> The list of free physical registers is then all those that are not
> marked as architectural registers.
<
Sounds slow.
<
> This is partly how I handle exceptions.

> Also you still need a mechanism to cancel start of execution of the
> subset of pending uOps for the purged set. You don't want to launch
> a LD or DIV from the mispredicted set if it has not already started.
> If you are using a reservation station design then you need some way
> to distribute the cancel request to the various FU's and RS's,
> and wait for them to clean themselves up.
<
I use the concept of an execution window to do this at both the reservation
stations and the function units. There is an insert pointer and a consistent
pointer; an RS is only allowed to launch when the instruction is between them.
FUs are only allowed to calculate so long as the instruction remains
between these 2 pointers. The 2 pointers (4-bits each) are broadcast
around the machine every cycle. Each station and unit decides for itself.
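
The "between the two pointers" test is cheap to evaluate locally. A sketch
with invented names, treating the window as half-open modulo 16:

#include <stdbool.h>

static bool in_window(unsigned tag, unsigned consistent, unsigned insert)
{
    /* distance from the consistent pointer, computed modulo 16 */
    unsigned depth = (insert - consistent) & 0xF;
    unsigned off   = (tag    - consistent) & 0xF;
    return off < depth;
}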

> Note that some things might not be able to cancel immediately,
> like an in-flight MUL in a pipeline or an outstanding LD to the cache.
> So some of this will be asynchronous (send cancel request, wait for ACK).
<
If an instruction that should not have its result delivered is delivered,
it is delivered to the physical register it was assigned at its issue time.
But since the value had not been delivered, that register is not in the
pool of assignable registers, so no dependency has been created.
<
> There are some other things that might need cleanup.
> A Return Stack Predictor might be manipulated by the mispredicted path.
<
Do these with a linked list and you can back up a mispredicted return
to a mispredicted call.
<
> Not sure how to handle that without a checkpoint.
<
Every (non-exceptional) flow-altering instruction needs a checkpoint.
Predicated strings of instructions use a lightweight checkpoint;
predicted branches use a heavyweight version.

Robert Finch

Nov 15, 2023, 1:21:22 AM
Decided to shelve Thor2024 and begin work on Thor2025. While Thor2024 is
very good, there are a few issues with it. The ROB is used to store
register values, which effectively makes it a CAM. That is not very
resource efficient in an FPGA. I have been researching an x86 OoO
implementation (https://www.stuffedcow.net/files/henry-thesis-phd.pdf)
done in an FPGA, and it turns out to be considerably smaller than Thor.
There are more efficient implementations for some components than what is
currently in use.

Thor2025 will use a PRF approach although using a PRF seems large to me.
To reduce the size and complexity of the register file, separate
register files will be used for float and integer operations, along with
separate register files for vector mask registers and subroutine link
registers. This set of register files limits the GPR file to only 3
write ports and 18 read ports to support all the functional units.
Currently the register file is 10r2w.

The trade-off is block RAM usage instead of LUTs.

While having separate registers files seems like a step backwards, it
should ultimately make the hardware more resource efficient. It does
impact the ISA spec.

MitchAlsup

Nov 15, 2023, 2:17:52 PM
Robert Finch wrote:

> Decided to shelve Thor2024 and begin work on Thor2025. While Thor2024 is
> very good there are a few issues with it. The ROB is used to store
> register values and that is effectively a CAM. It is not very resource
> efficient in an FPGA. I have been researching an x86 OoO implementation
> (https://www.stuffedcow.net/files/henry-thesis-phd.pdf ) done in an FPGA
> and it turns out to be considerably smaller than Thor. There are more
> efficient implementations for components than what is currently in use.

> Thor2025 will use a PRF approach although using a PRF seems large to me.
<
I have a PRF design I could show you--way too big for comp.arch and
with the requisite figures.

Robert Finch

Nov 17, 2023, 10:39:51 PM
Still digesting the PRF diagram.

Decided to go with a unified register file, 27r3w so far. Having
separate register files would not reduce the number of read ports
required and would add complexity to the processor.

Loads, FPU operations and flow control (FCU) operations all share the
third write port of the register file. The other two write ports are
dedicated to the ALU results. I think this will be okay given <1% of
instructions would be FCU updates. Loads are about 25%, and FPU depends
on the application.

The ALUs/FPU/Loads have five input operands including the 3 source
operands, a target operand, and a mask register. Stores do not need a
target operand. FCU ops are non-masked so do not need a mask register or
target operand input.

Not planning to implement the vector register file as it would be immense.

Robert Finch

Nov 18, 2023, 5:58:47 AM
Changed the moniker of my current processor project from Thor to Qupls
(Q-Plus). I wanted a five-letter name beginning with ‘Q’. For a moment
I thought of calling it Quake but thought that would be too confusing.
One must understand the magic behind name choices.

The current design uses instruction postfixes of 32, 48, 80, and 144
bits, which provide constants of 23, 39, 64, and 128 bits. Two bits in the
instruction indicate the postfix size. The 64 and 128-bit constants have
seven extra unused bits available, the fields being 71 and 135 bits wide.

Somewhat ugly, but it is desired to keep instructions a multiple of
16-bits in size. The shortest instruction is a NOP which is 16-bits so
that it may be used for alignment.

I almost switched to 96-bit floats, which seem appealing, but once again
remembered that the progression of 32, 64, and 128-bit floats works very
well for the float approximations.

Branches are 48-bit, being a combination of a compare and a branch with
a 24-bit target address field. Other flow control ops like JSR and JMP
are also 48-bit to keep all flow controls at 48-bit for simplified decoding.

Most instructions are 32-bits in size.

Sticking with a 64-register unified register file.

Removed the vector operations. There is enough play in the ISA to add
them at a later date if desired.

Loads and stores support two address modes, d(Rn) and d(Rn+Rm*Sc). The
scaled index address mode will likely be a 48-bit op.

MitchAlsup

Nov 18, 2023, 12:34:14 PM
The diagram is for a 6R6W PRF with a history table, ARN->PRN translation,
Free pool pickers, and register ports. The X with a ½ box is a latch
or flip-flop depending on the clocking that is put around the figure.
It also includes the renamer {history table and free pool pickers}.

> Decided to go with a unified register file, 27r3w so far. Having
> separate register files would not reduce the number of read ports
> required and would add complexity to the processor.

9 Reads per 1 write ?!?!?

Robert Finch

Nov 18, 2023, 2:41:19 PM
Freelist:

I just used the find-first/last-one trick on a bit-list to pick a PR
for an AR. It can provide PRs for two ARs per cycle. I have all the PRs
from the ROB feeding into the list manager so that on a branch miss the
PRs can be freed up. (Just the portion of the PRs associated with the
miss is freed.) Three discarded PRs from commit also feed into the list
manager so they can be freed. It seems like a lot of logic translating
the PR to a bit. It seems a bit impractical to me to feed all the PRs
from the ROB to the list manager. It can be done with the smallish
16-entry ROB, but for a larger ROB the freeing may have to be split up or
another means found.
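
For reference, a software model of the find-first/last-one picker on the
free bit-list (assuming 64 PRs and GCC-style builtins; this is not the RTL):

#include <stdint.h>

static uint64_t free_list;   /* bit n set => physical register n is free */

static int alloc_from_lsb(void)
{
    if (!free_list) return -1;
    int pr = __builtin_ctzll(free_list);      /* find first set bit */
    free_list &= ~(1ULL << pr);
    return pr;
}

static int alloc_from_msb(void)
{
    if (!free_list) return -1;
    int pr = 63 - __builtin_clzll(free_list); /* find last set bit  */
    free_list &= ~(1ULL << pr);
    return pr;
}

static void free_pr(int pr) { free_list |= 1ULL << pr; }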

RAT:

A register alias table is being used to track the mappings of ARs to
PRs. It uses two maps: speculative and committed. On instruction enqueue,
the speculative mappings are updated. On commit, the committed mappings
are updated, and on a pipeline flush the committed map is copied to the
speculative map.

Register file:

I’ve reduced the number of read ports by not supporting the vector
stuff. There are only 18 read ports: six groups of three.

ROB:
The ROB acts like a CAM to store both the aRN and pRN for the target
register. The aRN is needed to know which previous pRN to free on
commit. For source operands only the pRN is stored.

Robert Finch

Nov 24, 2023, 7:32:16 PM
On 2023-11-18 2:41 p.m., Robert Finch wrote:
Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A single
64kB page can handle 512MB of mappings. Tonight’s trade-off is how many
root pointers to support. With a 12-bit ASID, 4096 root pointers are
required to link to the mapping tables, one root pointer for each
address space. A 512 MB space is probably sufficient for a large number
of apps. That means access for a TLB update is a single root pointer
lookup followed by looking up the translation in a single memory page.
Not much for the table walker to do. The 4096 root pointers use two
block RAMs and require an 8192-byte address space for update, assuming a
32-bit physical address space (a 16-bit root page number).
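
A back-of-the-envelope software model of that walk (the names and the
simplified PTE format are mine, not the actual Q+ encoding): 64kB pages
give a 16-bit offset, and a 64kB page of 8192 8-byte PTEs gives a 13-bit
index, so one root-pointer lookup plus one PTE fetch covers 512MB.

#include <stdint.h>

#define NUM_ASIDS 4096                  /* 12-bit ASID */

/* physical address of each space's single PTE page (the post stores this
 * as a 16-bit root page number; widened here for simplicity) */
static uint64_t root_table[NUM_ASIDS];

/* pretend physical memory is directly addressable in this model */
static uint64_t *pte_page_at(uint64_t pa) { return (uint64_t *)(uintptr_t)pa; }

static uint64_t translate(uint16_t asid, uint32_t va)
{
    uint64_t *ptes = pte_page_at(root_table[asid]);
    uint64_t  pte  = ptes[(va >> 16) & 0x1FFF];  /* 13-bit index, 8-byte PTEs */
    uint64_t  ppn  = pte >> 16;                  /* simplified: PPN in high bits */
    return (ppn << 16) | (va & 0xFFFF);          /* splice in 16-bit offset */
}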

An IO-mapped area of 64kB is available for root pointer memory. 16 block
RAMs could be set up in this area, which would allow 8 root pointers for
each address space. Three bits of the virtual address space could then
be mapped using root pointers. If the root pointer just points to a
single level of page tables, then a 4GB (32-bit) space could be mapped.
I am mulling over whether it is worth it to support the additional root
pointers. It is a chunk of block RAM that might be better spent
elsewhere.

If I use an 11-bit ASID, all the root pointers could be present in a
single block RAM. So, design choices are 11 or 12-bits ASID, 1 or 8 root
pointers per address space.

My thought is to have only a single root pointer per space, and organize
the root pointer table as if there were 32-bits for the pointer. This
would allow a 48-bit physical address space to place the mapping tables
in. The RAM could be mapped so that the high order bits of the pointer
are assumed to be zero. The system could get by using a single block RAM
if the mapping tables location were restricted to a 16MB address range.
Eight-bit pointers could be used then.

Given that it is a small system, with only 512MB of DRAM, I think it
best to keep the page-table-walker simple, and use the minimum amount of
BRAM (1).

MitchAlsup

Nov 24, 2023, 8:02:18 PM
Robert Finch wrote:

> On 2023-11-18 2:41 p.m., Robert Finch wrote:
> Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A single
> 64kB page can handle 512MB of mappings. Tonight’s trade-off is how many
> root pointers to support. With a 12-bit ASID, 4096 root pointers are
> required to link to the mapping tables with one root pointer for each
> address space.

So, you associate a single ROOT pointer VALUE with an ASID, and manage
in SW who gets that ROOT pointer VALUE; using ASID as an index into
Virtual Address Spaces.

How is this usefully different than only using the ASID to qualify TLB
results ?? <Was this TLB entry installed from the same ASID as is accessing
right now>. And using ASID as an index into any array might lead to some
conundrum down the road apiece.

Secondarily, SUN started out with 12-bit ASID and went to 16-bits just about
as fast as they could--even before main memories went bigger than 4GB.

Robert Finch

Nov 24, 2023, 9:28:31 PM
I view the address space as an entity in its own right to be managed by
the MMU. ASIDs and address spaces should be mapped 1:1. The ASID that
identifies the address space has a life outside of just the TLB. I may
be increasing the typical scope of an ASID.

It is the same idea as using the ASID to qualify TLB entries, except
that it qualifies the root pointer as well. So, the root pointer does
not need to be switched by software. Once the root pointer is set for
the AS it simply sits there statically until the AS is reused.

I am using the ASID like a process ID. So, the root pointer register
does not need to be reset on a task switch. Address spaces may not be
mapped 1:1 with processes. An address space may outlive a task if it is
shared with another task. So, I do not want to use the PID to
distinguish tables. This assumes the address space will not be freed up
and reused by another task, if there are tasks using the ASID.

4096 address spaces is a lot. But if using a 16-bit ASID it would no
longer be practical to store a root pointer per ASID in a table.
Instead, the root pointer would have to be managed by software as is
normally done.

I am wondering why the 16-bit ASID? 256 address spaces in each of 256
processes? I suspect it is just because 16 bits are easier to pass around
and calculate with in a HLL than some other width like 14 bits. Are 65536
address spaces really needed?

BGB

Nov 24, 2023, 10:16:51 PM
If one assumes one address space per PID, then one is going to hit a
limit of 4K a lot faster than 64K, and when one hits the limit, there is
no good way to "reclaim" previously used address spaces short of
flushing the TLB to be sure that no entries from that space remain in
the TLB (ASID thrashing is likely to be relatively expensive to deal
with as a result).



Well, along with other things, like if/how to allow "Global" pages:
True global pages are likely a foot gun, as there is no way to exclude
them from a given process (where there may be a need to do so);
Disallowing global pages entirely means higher TLB miss rates because no
processes can share TLB entries.

One option seems to be, say, that a few of the high-order bits of the
ASID could be used as a "page group", with global pages only applying
within a single page-group (possibly with one of the page groups being
designated as "No global pages allowed").

...

Robert Finch

Nov 24, 2023, 10:48:41 PM
I see after reading several webpages that the root pointer is used to
point to only a single table for a process. This is not how I was doing
things. I have MMU tables for each address space, as opposed to having
a table for the process. The process may have only a single address
space, or it may use several address spaces.

I am wondering why there is only a single table per process.
>
>
> Well, along with other things, like if/how to allow "Global" pages:
> True global pages are likely a foot gun, as there is no way to exclude
> them from a given process (where there may be a need to do so);
> Disallowing global pages entirely means higher TLB miss rates because no
> processes can share TLB entries.
>
Global space can be assigned by designating an address space as a global
space and giving it an ASID. All processes wanting access to the global
space then need only use the MMU table for that ASID. E.g., use ASID 0 for
the global address space.

Scott Lurndal

Nov 25, 2023, 12:11:13 PM
Yeah, ARMv8's ASID was originally 8-bit, and 16-bit support was added even before the spec was dry.

I don't see a benefit to tying the ASID (or VMID for that matter) to
the root of the page table. Especially with the common split
address spaces (ARMv8 has a root pointer for each half of the VA space,
for example, where the upper half is shared by all schedulable entities).

Scott Lurndal

Nov 25, 2023, 12:17:00 PM
256 is far too small.

$ ps -ef | wc -l
709

Every time the ASID overflows, the system must basically flush
all the caches system-wide. On an 80 processor system, that's a lot of
overhead.

Scott Lurndal

Nov 25, 2023, 12:20:39 PM
Robert Finch <robf...@gmail.com> writes:
>On 2023-11-24 10:16 p.m., BGB wrote:
>> On 11/24/2023 8:28 PM, Robert Finch wrote:
>>> On 2023-11-24 8:00 p.m., MitchAlsup wrote:

>>
>> If one assumes one address space per PID, then one is going to hit a
>> limit of 4K a lot faster than 64K, and when one hits the limit, there is
>> no good way to "reclaim" previously used address spaces short of
>> flushing the TLB to be sure that no entries from that space remain in
>> the TLB (ASID thrashing is likely to be relatively expensive to deal
>> with as a result).
>>
>I see after reading several webpages that the root pointer is used to
>point to only a single table for a process. This is not how I was doing
>things. I have a MMU tables for each address space as opposed to having
>a table for the process. The process may have only a single address
>space, or it may use several address spaces.
>
>I am wondering why there is only a single table per process.

There are actually two in most operating systems - the lower half
of the VA space is owned by the user-mode code in the process and
the upper half is shared by all processes and used by the
operating system on behalf of the process. For Intel/AMD, the
kernel manages both halves; for ARMv8, each half has a completely
distinct and separate root pointer (at each exception level).

BGB

Nov 25, 2023, 12:59:49 PM
I went the opposite route of one big address space, with the idea of
allowing memory protection within this address space via the VUGID/ACL
mechanism. There is a KRR, or Keyring Register, which holds up to 4 keys
that may be used for ACL checking, granting an access if it is allowed
by at least one of the keys; triggering an ISR on miss similar to the
TLB. In this case, the conceptual model is more similar to that
typically used in filesystems.

But, I also have a 16-bit ASID...

As-is, there is at most one set of page tables per address space, or
per-process if processes are given different address spaces.


>>
>>
>> Well, along with other things, like if/how to allow "Global" pages:
>> True global pages are likely a foot gun, as there is no way to exclude
>> them from a given process (where there may be a need to do so);
>> Disallowing global pages entirely means higher TLB miss rates because
>> no processes can share TLB entries.
>>
> Global space can be assigned by designating an address space as a global
> space and giving it an ASID. All process wanting access to the global
> space need only then use the MMU table for that ASID. Eg. use ASID 0 for
> the global address space.
>

Had considered this, but there is a problem:
What if you have a process that you *don't* want to be able to see into
this global space?...

Though, this is where the idea of page-grouping can come in, say, the
ASID becomes:
gggg-pppp-pppp-pppp

Where:
0000 is visible to all of 0zzz
1000 is visible to all of 1zzz
...
Except:
Fzzz, this group does not have any global pages (all one-off ASIDs).

Or, possibly, a 2.14-bit split.
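
A sketch of how the page-group match could work in the TLB (the field
names are mine, assuming a 16-bit ASID with the top 4 bits as the group
and 0xF as the no-globals group):

#include <stdbool.h>
#include <stdint.h>

static bool tlb_asid_match(uint16_t entry_asid, bool entry_global,
                           uint16_t current_asid)
{
    if (entry_asid == current_asid)
        return true;                          /* exact ASID hit           */
    if (!entry_global)
        return false;                         /* private page, no sharing */
    unsigned group = current_asid >> 12;      /* gggg-pppppppppppp        */
    if (group == 0xF)
        return false;                         /* group with no globals    */
    return (entry_asid >> 12) == group;       /* global within same group */
}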


>> One option seems to be, say, that a few of the high-order bits of the
>> ASID could be used as a "page group", with global pages only applying
>> within a single page-group (possibly with one of the page groups being
>> designated as "No global pages allowed").
>>
>> ...
>>
>


Meanwhile:
I went and bought 128GB of RAM, only to realize my PC doesn't work if
one tries to install the full 128GB (the BIOS boot-loops a bunch of
times, and then apparently concludes that there is only 3.5GB ...).

Does work at least if I install 3x 32GB sticks and 1x 16GB stick, giving
112GB. This breaks the pairing rules, but seems to be working.

...

Had I known this, could have spent half as much, and only upgraded to 96GB.



Seemingly MOBO/BIOS/... designers didn't anticipate someone sticking a
full 128GB in this thing?... (BIOS is dated from 2018).

Well, either this, or a hardware compatibility issue with one of the
cards?...

MitchAlsup

Nov 25, 2023, 2:31:35 PM
Robert Finch wrote:

> On 2023-11-24 8:00 p.m., MitchAlsup wrote:
>> Robert Finch wrote:
>>
>>> On 2023-11-18 2:41 p.m., Robert Finch wrote:
>>> Q+ uses 64kB memory pages containing 8192 PTEs to map memory. A single
>>> 64kB page can handle 512MB of mappings. Tonight’s trade-off is how
>>> many root pointers to support. With a 12-bit ASID, 4096 root pointers
>>> are required to link to the mapping tables with one root pointer for
>>> each address space.
>>
>> So, you associate a single ROOT pointer VALUE with an ASID, and manage
>> in SW who gets that ROOT pointer VALUE; using ASID as an index into
>> Virtual Address Spaces.
>>
>> How is this usefully different that only using the ASID to qualify TLB
>> results ?? <Was this TLB entry installed from the same ASID as is accessing
>> right now>. And using ASID as an index into any array might lead to some
>> conundrum down the road a apiece.
>>
>> Secondarily, SUN started out with 12-bit ASID and went to 16-bits just
>> about
>> as fast as they could--even before main memories went bigger than 4GB.

> I view the address space as an entity in it own right to be managed by
> the MMU. ASIDs and address spaces should be mapped 1:1. The ASID that
> identifies the address space has a life outside of just the TLB. I may
> be increasing the typical scope of an ASID.

Consider the case where two different processes MMAP the same area
of memory.
Should they both end up using the same ASID ??
Should they both take extra TLB walks because they use different ASIDs ??
Should they use their own ASIDs for their own memory but a different ASID
for the shared memory ?? And how do you expect this to happen ??

MitchAlsup

Nov 25, 2023, 2:47:10 PM
My 66000 Architecture has 4 Root Pointers available at all instants
of time. The above was designed before the rise of HyperVisors and is
now showing its age problems. All 4 Root Pointers are used based on
privilege level::

                  HOB=0                  HOB=1
Application ::    Application 2-level    No Access
Guest OS    ::    Application 2-level    Guest OS 2-level
Guest HV    ::    Guest HV 1-level       Guest OS 2-level
Real HV     ::    Guest HV 1-level       Real HV 1-level

The overhead of Application to Application is no higher than that
of Guest OS to a different Guest OS--whereas on machines with
VMENTER and VMEXIT a Guest OS switch takes 10,000 cycles while
Application to Application is closer to 1,000 cycles. I want this
down in the 10-100 cycle range.

The exception <stack> system is designed to allow Guest HV to
recover a Guest OS that takes page faults while servicing ISRs
(and the like).

The interrupt <stack> system is designed to allow the ISR to
RPC or softIRQ without having to look at the pending stack on
the way out. RTI looks at the pending stack and services the
highest pending RPC/softIRQ affinitized to the CPU with control.

The Interrupt dispatch system allows the CPU to continue running
instructions until the contending CPUs decide which interrupt
is claimed by which CPU (1::1) and then context switches to the
interrupt dispatcher.

Scott Lurndal

Nov 25, 2023, 3:02:20 PM
In which case, the area of memory would be mapped to different
virtual address ranges in each process, and thus naturally
consume two TLB entries.

FWIW, MAP_FIXED is specified as an optional feature by POSIX
and may not be supported by the OS at all.

Given various forms of ASLR being used, it's unlikely even in
two instances of the same executable that a call to mmap
with MAP_SHARED without MAP_FIXED would map the region at
the same virtual address in both processes.

>Should they both end up using the same ASID ??

They couldn't share an ASID assuming the TLB looks up by VA.

>Should they both take extra TLB walks because they use different ASIDs ??

Given the above, yes. It's likely they'll each be scheduled
on different cores anyway in any modern system.

MitchAlsup

Nov 25, 2023, 3:42:25 PM
MMAP() first, fork() second. Now we have 2 processes with the
memory mapped shared memory at the same address.

Scott Lurndal

Nov 25, 2023, 4:55:09 PM
Yes, in that case, they'll be mapped at the same VA. All
the below points still apply so long as TLB's are per core.

Robert Finch

Nov 25, 2023, 7:48:11 PM
Are top-level page directory pages shared between tasks? Suppose a task
needs a 32-bit address space. With one level of page maps, 27 bits are
accommodated, which leaves 5 bits of address translation to be done by
the page directory. Using a whole page, which can handle 11 address bits,
would be wasteful. But if root pointers could point into the same page
directory page then the space would not be wasted. For instance, the root
pointer for task #1 could point to the first 32 entries, the root pointer
for task #2 could point into the next 32 entries, and so on.


MitchAlsup

Nov 25, 2023, 8:36:09 PM
Robert Finch wrote:

> Are top-level page directory pages shared between tasks?

The HyperVisor tables supporting a single Guest OS certainly are.
The Guest OS tables supporting Guest OS certainly are.
I should note that My 66000 Root Pointers determine the address space they
map, anything from 8MB through 8EB, with PTEs supporting 8KB through 8EB page
sizes--with the kicker that large page entries can restrict themselves::
for example you can use an 8MB PTE and enable only 1..1024 pages under that
Virtual sub Address Space; furthermore, levels in the hierarchy can be
skipped--all of this to minimize table walk time.

Scott Lurndal

Nov 26, 2023, 10:55:09 AM
Robert Finch <robf...@gmail.com> writes:
>Are top-level page directory pages shared between tasks?

The top half of the VA space could support this, for
the most part (since the top half is generally shared
by all tasks). The bottom half that's much less likely.
If the VA space is small enough, on ARMv8, the tables can be configured
with fewer than the normal four levels by specifying a smaller VA
size in the TCR_ELx register, so the walk may be only two or three levels
deep instead of four (or five when the VA gets larger than 52 bits).

Using intermediate level blocks (soi-disant 'huge pages') reduces the
walk overhead as well, but has its issues with allocation (since
the huge pages need to be not just physically contiguous, but also aligned
on huge-page-sized boundaries).

Anton Ertl

Nov 26, 2023, 11:08:45 AM
sc...@slp53.sl.home (Scott Lurndal) writes:
>mitch...@aol.com (MitchAlsup) writes:
>>Consider the case where two different processes MMAP the same area
>>of memory.
>
>In which case, the area of memory would be mapped to different
>virtual address ranges in each process,

Says who? Unless the user process asks for MAP_FIXED or the address
range is already occupied in the user process, nothing prevents the OS
from putting the shared area in the same process. If the permissions
are also the same, the OS can then use one ASID for the shared area.

This would be especially useful for the read-only sections (e.g., code)
of common libraries like libc. However, in today's security landscape,
you don't want one process to know where library code is mapped in
other processes (i.e., you want ASLR), so we can no longer make use of
that benefit. And it's doubtful whether other uses are worth the
complications (and even if they are, there might be security issues,
too).

>FWIW, MAP_FIXED is specified as an optional feature by POSIX
>and may not be supported by the OS at all.

As usual, what is specified by a common-subset standard is not
relevant for what an OS implementor has to do if they want to supply
more than a practically unusable checkbox feature like the POSIX
subsystem for Windows. There is a reason why WSL2 includes a full
Linux kernel.

>>Should they both end up using the same ASID ??
>
>They couldn't share an ASID assuming the TLB looks up by VA.

Of course the TLB looks up by VA, what else. But if the VA is the
same and the PA is the same, the same ASID can be used.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7...@googlegroups.com>

EricP

Nov 26, 2023, 12:32:36 PM
Anton Ertl wrote:
> sc...@slp53.sl.home (Scott Lurndal) writes:
>> mitch...@aol.com (MitchAlsup) writes:
>>> Consider the case where two different processes MMAP the same area
>>> of memory.
>> In which case, the area of memory would be mapped to different
>> virtual address ranges in each process,
>
> Says who? Unless the user process asks for MAP_FIXED or the address
> range is already occupied in the user process, nothing prevents the OS
> from putting the shared area in the same process. If the permissions
> are also the same, the OS can then use one ASID for the shared area.

If the mapping range is being selected dynamically, the chance that a
range will already be in use goes up with the number of sharers.
At some point when a new member tries to join the sharing group
the map request will be denied.

Software that does not want to have a mapping request fail should assume
that a shared area will be mapped at a different address in each process.
That implies one should not assume that virtual addresses can be passed
between processes, but should instead use, say, section-relative offsets
to build a linked list.
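
For illustration, the section-relative-offset idiom looks roughly like this
(hypothetical node layout; the point is that only offsets from the mapping
base are ever stored, never raw pointers):

#include <stddef.h>
#include <stdint.h>

struct node {
    uint64_t next_off;      /* offset of next node, 0 == end of list */
    /* ... payload ... */
};

static struct node *off_to_ptr(void *base, uint64_t off)
{
    return off ? (struct node *)((char *)base + off) : NULL;
}

static uint64_t ptr_to_off(void *base, struct node *n)
{
    return n ? (uint64_t)((char *)n - (char *)base) : 0;
}

/* traversal works identically in every process that maps the section,
 * regardless of where the mapping landed */
static void walk(void *base, uint64_t head_off)
{
    for (struct node *n = off_to_ptr(base, head_off);
         n != NULL;
         n = off_to_ptr(base, n->next_off)) {
        /* ... inspect payload ... */
    }
}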

MitchAlsup

Nov 26, 2023, 3:55:59 PM
Here you are using shared memory like PL/1 uses AREA and OFFSET types.

Anton Ertl

Nov 26, 2023, 4:36:33 PM
EricP <ThatWould...@thevillage.com> writes:
>Anton Ertl wrote:
>> sc...@slp53.sl.home (Scott Lurndal) writes:
>>> In which case, the area of memory would be mapped to different
>>> virtual address ranges in each process,
>>
>> Says who? Unless the user process asks for MAP_FIXED or the address
>> range is already occupied in the user process, nothing prevents the OS
>> from putting the shared area in the same process.

s/process/address range/ for the last word.

>> If the permissions
>> are also the same, the OS can then use one ASID for the shared area.
>
>If the mapping range is being selected dynamically, the chance that a
>range will already be in use goes up with the number of sharers.
>At some point when a new member tries to join the sharing group
>the map request will be denied.

It will map, but with a different address range, and therefore a
different ASID. Then, for further mapping requests, the chance that
one of the two address ranges is free is increased. So even with a
large number of processes mapping the same library, you will need only
a few ASIDs for this physical memory, so there will be lots of
sharing. Of course, with ASLR this is all no longer relevant.

>Software that does not want to have a mapping request fail should assume
>that a shared area will be mapped at a different address in each process.
>That implies one should not assume that virtual address can be passed
>but instead use, say, section relative offsets to build a linked list.

Yes. The other option is to use MAP_FIXED early in the process, and
to have some way of dealing with potential failures. But sharing of
VAs in user code between processes is not what the sharing of ASIDs we
have discussed here would be primarily about.

BGB

Nov 26, 2023, 4:45:16 PM
On 11/26/2023 9:45 AM, Anton Ertl wrote:
> sc...@slp53.sl.home (Scott Lurndal) writes:
>> mitch...@aol.com (MitchAlsup) writes:
>>> Consider the case where two different processes MMAP the same area
>>> of memory.
>>
>> In which case, the area of memory would be mapped to different
>> virtual address ranges in each process,
>
> Says who? Unless the user process asks for MAP_FIXED or the address
> range is already occupied in the user process, nothing prevents the OS
> from putting the shared area in the same process. If the permissions
> are also the same, the OS can then use one ASID for the shared area.
>
> This would be especially useful for the read-only sections (e.g, code)
> of common libraries like libc. However, in todays security landscape,
> you don't want one process to know where library code is mapped in
> other processes (i.e., you want ASLR), so we can no longer make use of
> that benefit. And it's doubtful whether other uses are worth the
> complications (and even if they are, there might be security issues,
> too).
>

It seems to me that, as long as it is a different place on each system, it
is probably good enough. Demanding a different location in each process
would create a lot of additional memory overhead due to things like
base relocations or similar.


>> FWIW, MAP_FIXED is specified as an optional feature by POSIX
>> and may not be supported by the OS at all.
>
> As usual, what is specified by a common-subset standard is not
> relevant for what an OS implementor has to do if they want to supply
> more than a practically unusable checkbox feature like the POSIX
> subsystem for Windows. There is a reason why WSL2 includes a full
> Linux kernel.
>

Still using WSL1 here as for whatever reason hardware virtualization has
thus far refused to work on my PC, and is apparently required for WSL2.

I can add this to my list of annoyances, like I can install "just short
of 128GB", but putting in the full 128GB causes my PC to be like "Oh
Crap, I guess there is 3.5GB ..." (but, apparently "112GB with unmatched
RAM sticks is fine I guess...").



But, yeah, the original POSIX is an easier goal to achieve, vs, say, the
ability to port over the GNU userland.


A lot of it is doable, but things like fork+exec are a problem if one
wants to support NOMMU operation or otherwise run all of the logical
processes in a shared address space.

A practical alternative is something more like a CreateProcess style
call, but this is "not exactly POSIX". In theory though, one could treat
"fork()" more like "vfork()" and then turn the exec* call into a
CreateProcess call and then terminate the current thread. Wouldn't
really work "in general" though, for programs that expect to be able to
"fork()" and then continue running the current program as a sub-process.


>>> Should they both end up using the same ASID ??
>>
>> They couldn't share an ASID assuming the TLB looks up by VA.
>
> Of course the TLB looks up by VA, what else. But if the VA is the
> same and the PA is the same, the same ASID can be used.
>

?...

Typically the ASID applies to the whole virtual address space, not to
individual memory objects.


Or, at least, my page-table scheme doesn't have a way to express
per-page ASIDs (merely if a page is Private/Shared, with the results of
this partly depending on the current ASID given for the page-table).

Where, say, I am mostly using 64-bit entries in the page-table, as going
to a 128-bit page-table format would be a bit steep.

Say, PTE layout (16K pages):
(63:48): ACLID
(47:14): Physical Address.
(13:12): Address or OS flag.
(11:10): For use by OS
( 9: 0): Base page-access and similar.
(9): S1 / U1 (Page-Size or OS Flag)
(8): S0 / U0 (Page-Size or OS Flag)
(7): Nu User (Supervisor Only)
(6): No Execute
(5): No Write
(4): No Read
(3): No Cache
(2): Dirty (OS, ignored by TLB)
(1): Private/Shared (MBZ if not Valid)
(0): Present/Valid

Where, ACLID serves as an index into the ACL table, or to lookup the
VUGID parameters for the page (well, along with an alternate PTE variant
that encodes VUGID directly, but reduces the physical address to 36
bits). It is possible that the original VUGID scheme may be phased out
in favor of using exclusively ACL checking.

Note that the ACL checks don't add new permissions to a page, they add
further restrictions (with the base-access being the most permissive).

Some combinations of flags are special, and encode a few edge-case
modes; such as pages which are Read/Write in Supervisor mode but
Read-Only in user mode (separate from the possible use of ACL's to mark
pages as read-only for certain tasks).
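
As a decoding sketch of the layout above (macro names are mine; the bit
positions follow the list in this post, with 16K pages):

#include <stdint.h>

#define PTE_ACLID(p)     (((p) >> 48) & 0xFFFFull)      /* 63:48 ACL index  */
#define PTE_PPN(p)       (((p) >> 14) & 0x3FFFFFFFFull) /* 47:14 phys page  */
#define PTE_NOUSER(p)    (((p) >> 7) & 1)
#define PTE_NOEXEC(p)    (((p) >> 6) & 1)
#define PTE_NOWRITE(p)   (((p) >> 5) & 1)
#define PTE_NOREAD(p)    (((p) >> 4) & 1)
#define PTE_NOCACHE(p)   (((p) >> 3) & 1)
#define PTE_DIRTY(p)     (((p) >> 2) & 1)
#define PTE_SHARED(p)    (((p) >> 1) & 1)
#define PTE_VALID(p)     ((p) & 1)

/* physical address of a byte within the 16K page */
static inline uint64_t pte_phys(uint64_t pte, uint64_t va)
{
    return (PTE_PPN(pte) << 14) | (va & 0x3FFF);
}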



But, FWIW, I ended up adding an extended MAP_GLOBAL flag for "mmap'ed
space should be visible to all of the processes"; which in turn was used
as part of the backing memory for the "GlobalAlloc" style calls (it is
not a global heap, in that each process still manages the memory
locally, but other intersecting processes can see the address within
their own address spaces).

Well, along with a MAP_PHYSICAL flag, for if one needs memory where
VA==PA (this may fail, with the mmap returning NULL, effectively only
allowed for "superusermode"; mostly intended for hardware interfaces).



The usual behavior of MAP_SHARED didn't really make sense outside of the
context of mapping a file, and didn't really serve the needed purpose
(say, one wants to hand off a pointer to a bitmap buffer to the GUI
subsystem to have it drawn into a window).

It is also being used for things like shared scratch buffers, say, for
passing BITMAPINFOHEADER and MIDI commands and similar across the
interprocess calls (the C API style wrapper wraps a lot of this; whereas
the internal COM-style interfaces will require any pointer-style
arguments to point to shared memory).

This is not required for normal syscall handlers, where the usual
assumption is that normal syscalls will have some means of directly
accessing the address space of the caller process. I didn't really want
to require that TKGDI have this same capability.

It is debatable whether calls like BlitImage and similar should require
global memory, or merely recommend it (potentially having the call fall
back to a scratch buffer and internal memcpy if the passed bitmap image
is not already in global memory).



I had originally considered a more complex mechanism for object sharing,
but then ended up going with this for now partly because it was easier
and lower overhead (well, and also because I wanted something that would
still work if/when I started to add proper memory protection). May make
sense to impose a limit on per-process global alloc's though (since it
is intended specifically for shared buffers and not for general heap
allocation; where for heap allocation ANONYMOUS+PRIVATE would be used
instead).

Though, looking at stuff, MAP_GLOBAL semantics may have also been
partially covered by "MAP_ANONYMOUS|MAP_SHARED"?... Though, the
semantics aren't the same.

I guess, another alternative would have been to use shm_open+mmap or
similar.


Where, say, memory map will look something like:
00yy_xxxxxxxx: Physical mapped and direct-mapped memory.
00yy_xxxxxxxx: Start of global virtual memory (*1);
3FFF_xxxxxxxx: End of global virtual memory;
4000_xxxxxxxx: Start of private/local virtual memory (possible, *2);
7FFF_xxxxxxxx: End of private/local virtual memory (possible);
8000_xxxxxxxx: Start of kernel virtual memory;
BFFF_xxxxxxxx: End of kernel virtual memory;
Cxxx_xxxxxxxx: Physical Address Range (Cached);
Dxxx_xxxxxxxx: Physical Address Range (Volatile/NoCache);
Exxx_xxxxxxxx: Reserved;
Fxxx_xxxxxxxx: MMIO and similar.

*1: The 'yy' division point may move, will depend on things like how
much RAM exists (currently, 00/01; no current "sane cost" FPGA boards
having more than 256 or 512 MB of RAM).

*2: If I go to a scheme of giving processes their own address spaces,
then private memory will be used. It is likely that executable code may
remain shared, but the data sections and heap would be put into private
address ranges.


Scott Lurndal

Nov 26, 2023, 5:27:42 PM
If an implementation claims support for the XSI option of
POSIX, then it must support MAP_FIXED. There were a couple
of vendors who claimed not to be able to support MAP_FIXED
back in the days when it was being discussed in the standards
committee working groups.

In addition, the standard notes:

"Use of MAP_FIXED may result in unspecified behavior in
further use of malloc() and shmat(). The use of MAP_FIXED is
discouraged, as it may prevent an implementation from making
the most effective use of resources.

Because the semantics of MAP_FIXED are to unmap any
prior mapping in the range, if the implementation had happened to
allocate the heap or shared System V region at that address, the heap
would have become corrupt with dangling references hanging
around which, if stored into, would subsequently corrupt the mapped region.


>
>>>Should they both end up using the same ASID ??
>>
>>They couldn't share an ASID assuming the TLB looks up by VA.
>
>Of course the TLB looks up by VA, what else. But if the VA is the
>same and the PA is the same, the same ASID can be used.

That sounds like a nightmare scenario. Normally the ASID is
closely associated with a single process and the scope of
necessary TLB maintenance operations (e.g. invalidates
after translation table updates) is usually the process.

It's certainly not possible to do that on ARMv8 systems. The
ASID tag in the TLB entry comes from the translation table base
register and applies to all accesses made to the entire range covered
by the translation table by all the threads of the process.

Likewise the VMID tag in the TLB entry comes from the nested
translation table base address system register at the time
of entry creation.

For a subsequent process (child or detached) sharing memory with
that process, there just isn't any way to tag its TLB entry with
the ASID of the first process to map the shared region.

Scott Lurndal

Nov 26, 2023, 5:35:23 PM
The modern preference is to make the memory map flexible.

Linux, for example, requires that PCI Base Address Registers
be programmable by the operating system, and the OS can
choose any range (subject to host bridge configuration, of
course) for the device.

It is notable that even on non-intel systems, one may need
to map a 32-bit PCI BAR (AHCI is the classic example) which
requires the address programmed in the bar to be less than
4GB (0x1_0000_0000). Granted systems can have custom PCI controllers
that remap that into the larger physical address space with
a bit of extra hardware, however the kernel people don't
like that at all since there is no universal standard for
such remapping and they don't want to support
dozens of independent implementations, constantly
changing from generation to generation.

Many modern SoCs (and ARM SBSA requires this) make their
on-board devices and coprocessors look like PCI express
devices to software, and SBSA requires the PCIe ECAM
region for device discovery. Here again, each of
these on board devices will have from one to six
memory region base address registers (or one to
three for 64-bit bars).

Encoding memory attributes into the address is common
in microcontrollers, but in a general purpose processor
constrains the system to an extent sufficient to make it
unattractive for general purpose workloads.

Robert Finch

Nov 26, 2023, 6:21:04 PM
Q+ has a similar setup, but the ACLID is in a separate table.

For Q+ Two similar MMUs have been designed, one to be used in a large
system and a second for a small system. The difference between the two
is in the size of page numbers. The large system uses 64-bit page
numbers, and the small system uses 32-bit page numbers. The PTE for the
large system is 96-bits, 32-bits larger than the PTE for the small
system due to the extra bits for the page number. Pages are 64kB. The
small system supports a 48-bit address range.

The PTE has the following fields:
  Field   Bits   Description
  PPN     64/32  Physical page number
  URWX    3      User read-write-execute override
  SRWX    3      Supervisor read-write-execute override
  HRWX    3      Hypervisor read-write-execute override
  MRWX    3      Machine read-write-execute override
  CACHE   4      Cache-ability bits
  SW      2      OS software usage
  A       1      1=accessed/used
  M       1      1=modified
  V       1      1 if entry is valid, otherwise 0
  S       1      1=shared page
  G       1      1=global, ignore ASID
  T       1      0=page pointer, 1=table pointer
  RGN     3      Region table index
  LVL/BC  5      The page table level of the entry pointed to

The RWX and CACHE bits are overrides. These values normally come from
the region table, but may be overridden by values in the PTE.
The LVL/BC field is five bits to account for a five-bit bounce counter
for inverted page tables. Only a 3-bit level is in use.

There is a separate table with per-page information that contains a
reference to an ACL (16-bits), share counts (16-bits), privilege level
(8-bits), an access key (24-bits), and a couple of other fields for
compression / encryption.

I have made the PTBR a full 64-bit address now rather than a page number
with control bits. So, it may now point into the middle of a page
directory which is shared between tasks.

The table walker and region table look like PCI devices to the system.

BGB

Nov 26, 2023, 7:40:23 PM
On 11/26/2023 4:35 PM, Scott Lurndal wrote:
> BGB <cr8...@gmail.com> writes:
>> On 11/26/2023 9:45 AM, Anton Ertl wrote:
>
>>
>>
>> Where, say, memory map will look something like:
>> 00yy_xxxxxxxx: Physical mapped and direct-mapped memory.
>> 00yy_xxxxxxxx: Start of global virtual memory (*1);
>> 3FFF_xxxxxxxx: End of global virtual memory;
>> 4000_xxxxxxxx: Start of private/local virtual memory (possible, *2);
>> 7FFF_xxxxxxxx: End of private/local virtual memory (possible);
>> 8000_xxxxxxxx: Start of kernel virtual memory;
>> BFFF_xxxxxxxx: End of kernel virtual memory;
>> Cxxx_xxxxxxxx: Physical Address Range (Cached);
>> Dxxx_xxxxxxxx: Physical Address Range (Volatile/NoCache);
>> Exxx_xxxxxxxx: Reserved;
>> Fxxx_xxxxxxxx: MMIO and similar.
>
>
> The modern preference is to make the memory map flexible.
>
> Linux, for example, requires that PCI Base Address Registers
> be programmable by the operating system, and the OS can
> choose any range (subject to host bridge configuration, of
> course) for the device.
>


As for the memory map, actual hardware-relevant part of the map is:
0000_xxxxxxxx..7FFF_xxxxxxxx: User Mode, virtual
8000_xxxxxxxx..BFFF_xxxxxxxx: Supervisor Mode, virtual
Cxxx_xxxxxxxx: Physical Address Range (Cached);
Dxxx_xxxxxxxx: Physical Address Range (Volatile/NoCache);
Exxx_xxxxxxxx: Reserved;
Fxxx_xxxxxxxx: MMIO and similar.

There is no good way to make this entirely flexible; some of this stuff
requires special handling from the L1 cache, and by the time it reaches
the TLB, it is too late (unless there were additional logic to be like
"Oh, crap, this was actually meant for MMIO!").

Though, with the 96-bit VA mode, if GBH(47:0)!=0, then the entire 48-bit
space is User Mode Virtual (and it is not possible to access MMIO or
similar at all, short of reloading 0 into GBH, or using XMOV.x
instructions with a 128-bit pointer, say:
0000_0000_00000000-tttt_Fxxx_xxxxxxxx).


Note here that the high 16-bits are ignored for normal pointers
(typically used for type-tagging or bounds-checking by the runtime).

For branches and captured Link-Register values:
If LSB is 0: High 16 bits are ignored;
The branch will always be within the same CPU Mode.
If LSB is 1: High 16 bits encode CPU Mode control flags.
LSB is always set for created LR values.
CPU will trap if the LSB is Clear in LR during an RTS/RTSU.

Setting the LSB and putting the mode in the high 16 bits is also often
used on function pointers so that theoretically Baseline and XG2 code
can play along together (though, at present, BGBCC does not generate any
mixed binaries, so this part would mostly apply to DLLs).




For the time being, there is no PCI or PCIe in my case.
Nor have I gone up the learning curve for what would be required to
interface with any PCIe devices.


Had tried to get USB working, but didn't have much success as it seemed
I was still missing something (seemed to be sending/receiving bytes, but
the devices would not respond as expected to any requests or commands).

Mostly ended up using a PS2 keyboard, and had realized that (IIRC) if
one pulled the D+ and D- lines high, the mouse would instead
implement the PS2 protocol (though this didn't work on the USB
keyboards I had tried).


Most devices are mapped to fixed address ranges in the MMIO space:
F000Cxxx: Rasterizer / Edge-Walker Control Registers
F000Exxx: Various basic devices
SDcard, PS2 Keyboard/Mouse, RS232 UART (*), etc
F008xxxx: FM Synth / Sample Mixer Control / ...
F009xxxx: PCM Audio Loop/Registers
F00Axxxx: MMIO VRAM
F00Bxxxx: MMIO VRAM and Video Control
At present, VRAM is also RAM-backed.
VRAM framebuffer base address in RAM is now movable.

All this existing within:
FFFF_Fxxxxxxx

*: RS232 generally connected to a UART interface that feeds back to a
connected computer via an on-board FTDI chip or similar.


As for physical memory map, it is sorta like:
00000000..00007FFF: Boot ROM
0000C000..0000DFFF: Boot SRAM
00010000..0001FFFF: ZERO's
00020000..0002FFFF: BJX2 NOP's
00030000..0003FFFF: BJX2 BREAK's
...
01000000..1FFFFFFF: Reserved for RAM
20000000..3FFFFFFF: Reserved for More RAM (And/or repeating)
40000000..5FFFFFFF: RAM repeats (and/or Reserved)
60000000..7FFFFFFF: RAM repeats more (and/or Reserved)
80000000..EFFFFFFF: Reserved
F0000000..FFFFFFFF: MMIO in 32-bit Mode (*1)

*1: There used to be an MMIO range at 0000_F0000000, but this has been
eliminated in favor of only recognizing this range as MMIO in 32-bit
mode (where only the low 32-bits of the address are used). Enabling
48-bit addressing will now require using the proper MMIO address.

Currently, nothing past the low 4GB is used in the physical memory map.


> It is notable that even on non-intel systems, one may need
> to map a 32-bit PCI BAR (AHCI is the classic example) which
> requires the address programmed in the bar to be less than
> 0x10000000. Granted systems can have custom PCI controllers
> that remap that into the larger physical address space with
> a bit of extra hardware, however the kernel people don't
> like that at all since there is no universal standard for
> such remapping and they don't want to support
> dozens of independent implementations, constantly
> changing from generation to generation.
>
> Many modern SoCs (and ARM SBSA requires this) make their
> on-board devices and coprocessors look like PCI express
> devices to software, and SBSA requires the PCIe ECAM
> region for device discovery. Here again, each of
> these on board devices will have from one to six
> memory region base address registers (or one to
> three for 64-bit bars).
>
> Encoding memory attributes into the address is common
> in microcontrollers, but in a general purpose processor
> constrains the system to an extent sufficient to make it
> unattractive for general purpose workloads.


Possibly, but making things more flexible here would be a non-trivial
level of complexity to deal with at the moment (and, it seemed relevant
at first to design something I could "actually implement").


At the time I started out on this, even maintaining similar hardware
interfaces to a minimalist version of the Sega Dreamcast (what the
BJX1's hardware-interface design was partly based on) was asking a bit
too much (even after leaving out things like the CD-ROM drive and similar).


So, I simplified things somewhat, initially taking some design
inspiration in these areas from the Commodore 64 and MSP430 and similar...

Say:
VRAM was reinterpreted as being an 80x25 grid of 8x8 pixel color cells;
Audio was a simple MMIO-backed PCM loop (with a few registers to adjust
the sample rate and similar).

In terms of output signals, the display module drives a VGA output, and
the audio is generally pulled off by turning an IO pin on and off really
fast.

Or, one drives 2 lines for audio, say:
10: +, 01: -, 11: 0

Using an H-Bridge driver as an amplifier (turns out one needs to drive
like 50-100mA to get any decent level of loudness out of headphones;
which is well beyond the power normal IO pins can deliver). Generally
PCM needs to get turned into PWM/PDM.

Driving stereo via a dual H-Bridge driver would get a little wonky
though, since headphones use Left/Right and a Common, effectively one
needs to drive the center as a neutral, with L/R channels (and/or, just
get lazy and drive mono across both the L/R channels using a single
H-Bridge and ignore the center point, which ironically can get more
loudness at less current because now one is dealing with 70 ohm rather
than 35 ohm).

...


Generally, with all of the hardware addresses at fixed locations.
Doing any kind of dynamic configuration or allowing hardware addresses
to be movable would have likely made the MMIO devices significantly more
expensive (vs hard-coding the address of each device).


Did generally go with MMIO rather than x86 style IO ports though.
Partly because IO ports suck, and I wasn't quite *that* limited (say,
could afford to use a 28-bit space, rather than a 16-bit space).


...

MitchAlsup

unread,
Nov 26, 2023, 9:11:03 PM11/26/23
to
Scott Lurndal wrote:

> BGB <cr8...@gmail.com> writes:
>>On 11/26/2023 9:45 AM, Anton Ertl wrote:

>>
>>
>>Where, say, memory map will look something like:
>> 00yy_xxxxxxxx: Physical mapped and direct-mapped memory.
>> 00yy_xxxxxxxx: Start of global virtual memory (*1);
>> 3FFF_xxxxxxxx: End of global virtual memory;
>> 4000_xxxxxxxx: Start of private/local virtual memory (possible, *2);
>> 7FFF_xxxxxxxx: End of private/local virtual memory (possible);
>> 8000_xxxxxxxx: Start of kernel virtual memory;
>> BFFF_xxxxxxxx: End of kernel virtual memory;
>> Cxxx_xxxxxxxx: Physical Address Range (Cached);
>> Dxxx_xxxxxxxx: Physical Address Range (Volatile/NoCache);
>> Exxx_xxxxxxxx: Reserved;
>> Fxxx_xxxxxxxx: MMIO and similar.


> The modern preference is to make the memory map flexible.

// cacheable, used, modified bits
CUM kind of access
--- ------------------------------
000 uncacheable DRAM
001 MMI/O
010 config
011 ROM
1xx cacheable DRAM

> Linux, for example, requires that PCI Base Address Registers
> be programmable by the operating system, and the OS can
> choose any range (subject to host bridge configuration, of
> course) for the device.

Easily done, just create an uncacheable PTE and set UM to 10
for config space or 01 for MMI/O space.
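
A tiny decode sketch of that table (only the 3-bit encoding comes from
the table above; where the field sits within a My 66000 PTE, and the
names, are assumptions here):

/* Decode sketch for the C/U/M access-kind encoding above. */
enum access_kind { UNCACHEABLE_DRAM, MMIO, CONFIG, ROM, CACHEABLE_DRAM };

static enum access_kind decode_cum(unsigned cum /* 3-bit C,U,M field */)
{
    if (cum & 0x4)                /* C set: 1xx -> cacheable DRAM */
        return CACHEABLE_DRAM;
    switch (cum & 0x3) {          /* C clear: U,M pick the uncached kind */
    case 0:  return UNCACHEABLE_DRAM;   /* 000 */
    case 1:  return MMIO;               /* 001 */
    case 2:  return CONFIG;             /* 010 */
    default: return ROM;                /* 011 */
    }
}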

> It is notable that even on non-intel systems, one may need
> to map a 32-bit PCI BAR (AHCI is the classic example) which
> requires the address programmed in the bar to be less than
> 0x10000000.

I/O MMU translates these devices from a 32-bit VAS into the
64-bit PAS.

> Granted systems can have custom PCI controllers
> that remap that into the larger physical address space with
> a bit of extra hardware, however the kernel people don't
> like that at all since there is no universal standard for
> such remapping and they don't want to support
> dozens of independent implementations, constantly
> changing from generation to generation.

You would figure that, since they are already supporting 4 incompatible
mapping systems {Intel, AMD, ARM, RISC-V}, they would have gotten good
at these implementations by now :-)

> Many modern SoCs (and ARM SBSA requires this) make their
> on-board devices and coprocessors look like PCI express
> devices to software,

I made the CPU/cores in My 66000 have a configuration port, set up
during boot, that smells just like a PCIe port.

> and SBSA requires the PCIe ECAM
> region for device discovery. Here again, each of
> these on board devices will have from one to six
> memory region base address registers (or one to
> three for 64-bit bars).

> Encoding memory attributes into the address is common
> in microcontrollers, but in a general purpose processor
> constrains the system to an extent sufficient to make it
> unattractive for general purpose workloads.

Agreed.

BGB

unread,
Nov 27, 2023, 1:04:20 AM11/27/23
to
On 11/26/2023 8:09 PM, MitchAlsup wrote:
> Scott Lurndal wrote:
>
>> BGB <cr8...@gmail.com> writes:
>>> On 11/26/2023 9:45 AM, Anton Ertl wrote:
>
>>>
>>>
>>> Where, say, memory map will look something like:
>>>   00yy_xxxxxxxx: Physical mapped and direct-mapped memory.
>>>   00yy_xxxxxxxx: Start of global virtual memory (*1);
>>>   3FFF_xxxxxxxx: End of global virtual memory;
>>>   4000_xxxxxxxx: Start of private/local virtual memory (possible, *2);
>>>   7FFF_xxxxxxxx: End of private/local virtual memory (possible);
>>>   8000_xxxxxxxx: Start of kernel virtual memory;
>>>   BFFF_xxxxxxxx: End of kernel virtual memory;
>>>   Cxxx_xxxxxxxx: Physical Address Range (Cached);
>>>   Dxxx_xxxxxxxx: Physical Address Range (Volatile/NoCache);
>>>   Exxx_xxxxxxxx: Reserved;
>>>   Fxxx_xxxxxxxx: MMIO and similar.
> >
>> The modern preference is to make the memory map flexible.
>

As noted, some amount of the above would be part of the OS memory map,
rather than a hardware imposed memory map.


Like, say, Windows on x86 typically had:
00000000..000FFFFF: DOS-like map (9x)
00100000..7FFFFFFF: Userland stuff
80000000..BFFFFFFF: Shared stuff
C0000000..FFFFFFFF: Kernel Stuff

Did the hardware enforce this? No.
Did Windows follow such a structure? Yes, generally.

Linux sorta followed a similar structure, except that some versions
gave the full 4GB to userland addresses (which was an annoyance when
trying to use TagRefs, since the OS might actually put memory in the
part of the address space one would have otherwise used to hold
fixnums and similar).

Ironically though, this sort of thing (along with the limits of 32-bit
tagrefs) gave me incentive to go over to 64-bit tagrefs even on
32-bit machines, and a generally similar tagref scheme got carried into
my later projects.


Say:
0ttt_xxxx_xxxxxxxx: Pointers
1ttt_xxxx_xxxxxxxx: Small Value Spaces
2ttt_xxxx_xxxxxxxx: ...
3yyy_xxxx_xxxxxxxx: Bounds Checked Pointers
4iii_iiii_iiiiiiii: Fixnum
..
7iii_iiii_iiiiiiii: Fixnum
8iii_iiii_iiiiiiii: Flonum
..
Biii_iiii_iiiiiiii: Flonum
...

But, this scheme is more used by the runtime, not so much by the hardware.

For the most part, C doesn't use pointer tagging.
However BGBScript/JavaScript and my BASIC variant do make use of
type-tagging.
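
For illustration, type checks against that layout boil down to a few
shifts (the bit positions follow the listing above, but the helper names
are made up, and this is purely runtime-level code; the hardware does
not interpret these bits):

/* Minimal sketch of type checks for the 64-bit tagref layout above. */
#include <stdint.h>
#include <stdbool.h>

typedef uint64_t tagref;

static bool tag_is_pointer(tagref v) { return (v >> 60) == 0x0; }  /* 0ttt_... */
static bool tag_is_fixnum(tagref v)  { return (v >> 62) == 0x1; }  /* top nibble 4..7 */
static bool tag_is_flonum(tagref v)  { return (v >> 62) == 0x2; }  /* top nibble 8..B */

static int64_t tag_fixnum_value(tagref v)
{
    return ((int64_t)(v << 2)) >> 2;      /* sign-extend the 62-bit payload */
}

static void *tag_pointer_value(tagref v)
{
    return (void *)(uintptr_t)(v & 0x0000FFFFFFFFFFFFull);  /* low 48 bits as address */
}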


>                 // cacheable, used, modified bits
>     CUM            kind of access
>     ---            ------------------------------
>     000            uncacheable DRAM
>     001            MMI/O
>     010            config
>     011            ROM
>     1xx            cacheable DRAM
>

Hmm...
Unfortunate acronyms are inescapable it seems...


>> Linux, for example, requires that PCI Base Address Registers
>> be programmable by the operating system, and the OS can
>> choose any range (subject to host bridge configuration, of
>> course) for the device.
>
> Easily done, just create an uncacheable PTE and set UM to 10
> for config space or 01 for MMI/O space.
>

I guess, if PCIe were supported, some scheme could be developed to map
the PCIe space either into part of the MMIO space, into RAM space, or
maybe some other space.

There is a functional difference between MMIO space and RAM space in
terms of how they are accessed:
RAM space: Cache does its thing and works with cache-lines;
MMIO space: A request is sent over the bus, and then it waits for a
response.

If the MMIO bridge sees an MMIO request, it puts it onto the MMIO Bus,
and sees if any device responds (if so, sending the response back to the
origin). Otherwise, if no device responds after a certain number of
clock cycles, an all-zeroes response is sent instead.
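
Behaviorally, the bridge side of that is roughly the following (a
software model only; the device callbacks, timeout value, and names are
made up for illustration, the real thing is bus logic):

#include <stdint.h>
#include <stdbool.h>

#define MMIO_TIMEOUT_CYCLES 64   /* "certain number of clock cycles" (made up here) */

/* Each device sees the request every cycle and may claim it by filling *data. */
typedef bool (*mmio_dev_fn)(uint32_t addr, int cycle, uint32_t *data);

extern mmio_dev_fn mmio_devices[];
extern int         mmio_device_count;

uint32_t mmio_bridge_read(uint32_t addr)
{
    uint32_t data = 0;
    for (int cycle = 0; cycle < MMIO_TIMEOUT_CYCLES; cycle++)
        for (int d = 0; d < mmio_device_count; d++)
            if (mmio_devices[d](addr, cycle, &data))
                return data;     /* some device responded; forward to the origin */
    return 0;                    /* no response within the window: all-zeroes */
}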


Currently, no sort of general purpose bus is routed outside of the FPGA,
and if it did exist, it is not yet clear what form it would take.

Would need to limit pin counts though, so probably some sort of serial
bus in any case.

PCIe might be sort of tempting in the sense that, apparently, 1 PCIe lane
can be subdivided among multiple devices, and bridge cards exist that can
route PCIe over a repurposed USB cable and then connect multiple devices,
PCI, or ISA cards. Albeit apparently with mixed results.


At least for the userland address ranges, there is less of this going on
than in SH4, which had basically spent the top 3 bits of the 32-bit
address as mode.

Say, IIRC:
(29): No TLB
(30): No Cache
(31): Supervisor

So, in effect, there was only 512MB of usable address space.
The SH-4A had then expanded the lower part to 31 bits, so one could have
2GB of usermode address space.


But, say, if one can have 47 bits of freely usable virtual address space
for userland, probably good enough.


Anton Ertl

unread,
Nov 27, 2023, 2:52:19 AM11/27/23
to
BGB <cr8...@gmail.com> writes:
>On 11/26/2023 9:45 AM, Anton Ertl wrote:
>> This would be especially useful for the read-only sections (e.g, code)
>> of common libraries like libc. However, in todays security landscape,
>> you don't want one process to know where library code is mapped in
>> other processes (i.e., you want ASLR), so we can no longer make use of
>> that benefit. And it's doubtful whether other uses are worth the
>> complications (and even if they are, there might be security issues,
>> too).
>>
>
>It seems to me, as long as it is a different place on each system,
>probably good enough. Demanding a different location in each process
>would create a lot of additional memory overhead due to from things like
>base-relocations or similar.

If the binary is position-independent (the default on Linux on AMD64),
there is no such overhead.

I just started the same binary twice and looked at the address of the
same piece of code:

Gforth 0.7.3, Copyright (C) 1995-2008 Free Software Foundation, Inc.
Gforth comes with ABSOLUTELY NO WARRANTY; for details type `license'
Type `bye' to exit
see open-file
Code open-file
0x000055c2b76d5833 <gforth_engine+6595>: mov %r15,0x1c126(%rip)
...

For the other process the same instruction is:

Code open-file
0x000055dd606e4833 <gforth_engine+6595>: mov %r15,0x1c126(%rip)

Following the calls until I get to glibc, I get, for the two processes:

0x00007f705c0c3b90 <__libc_open64+0>: push %r12
0x00007f190aa34b90 <__libc_open64+0>: push %r12

So not just the binary, but also glibc resides at different virtual
addresses in the two processes.

So obviously the Linux and glibc maintainers think that per-system
ASLR is not good enough. They obviously want ASLR to work as well as
possible against local attackers.

>> Of course the TLB looks up by VA, what else. But if the VA is the
>> same and the PA is the same, the same ASID can be used.
>>
>
>?...
>
>Typically the ASID applies to the whole virtual address space, not to
>individual memory objects.

Yes, one would need more complicated ASID management than setting
"the" ASID on switching to a process if different VMAs in the process
have different ASIDs. Another reason not to go there.

Power (and IIRC HPPA) do something in this direction with their
"segments", where the VA space was split into 16 equal parts, and
IIRC the 16 parts each extended the address by 16 bits (minus the 4
bits of the segment number), so essentially they have 16 16-bit ASIDs.
The address spaces are somewhat inflexible, but with 64-bit VAs
(i.e. 60-bit address spaces) that may be good enough for quite a
while. The cost is that you now have to manage 16 ASID registers.
And if we ever get to actually making use of more than 60 bits of VA in
other ways, this ASID scheme would have to be combined with that other
use of the VAs somehow.
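
As a sketch of that scheme as described (not the actual Power or HPPA
register layout; names here are invented): the top 4 bits of an address
pick one of 16 ASID registers, and the TLB conceptually matches on the
(ASID, 60-bit offset) pair:

#include <stdint.h>

uint16_t seg_asid[16];    /* the 16 per-process ASID registers */

typedef struct { uint16_t asid; uint64_t offset; } seg_va;

static seg_va split_address(uint64_t va)
{
    seg_va r;
    r.asid   = seg_asid[va >> 60];            /* segment number selects an ASID */
    r.offset = va & 0x0FFFFFFFFFFFFFFFull;    /* remaining 60 bits of the address */
    return r;
}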

Anton Ertl

unread,
Nov 27, 2023, 4:17:37 AM11/27/23
to
sc...@slp53.sl.home (Scott Lurndal) writes:
>an...@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>>sc...@slp53.sl.home (Scott Lurndal) writes:
>>>FWIW, MAP_FIXED is specified as an optional feature by POSIX
>>>and may not be supported by the OS at all.
>>
>>As usual, what is specified by a common-subset standard is not
>>relevant for what an OS implementor has to do if they want to supply
>>more than a practically unusable checkbox feature like the POSIX
>>subsystem for Windows. There is a reason why WSL2 includes a full
>>Linux kernel.
...
>Because the semantics of MAP_FIXED are to unmap any
>prior mapping in the range, if the implementation had happened to
>allocate the heap or shared System V region at that address, the heap
>would have become corrupt with dangling references hanging
>around which, if stored into, would subsequently corrupt the mapped region.

Of course you can provide an address without specifying MAP_FIXED, and
a high-quality OS will satisfy the request if possible (and return a
different address if not), while a work-to-rule OS like the POSIX
subsystem for Windows may then treat that address as if the user had
passed NULL.

Interestingly, Linux (since 4.17) also provides MAP_FIXED_NOREPLACE,
which works like MAP_FIXED except that it returns an error if
MAP_FIXED would replace part of an existing mapping. Makes me wonder
if in the no-conflict case, and given a page-aligned addr there is any
difference between MAP_FIXED, MAP_FIXED_NOREPLACE and just providing
an address without any of these flags in Linux. In the conflict case,
the difference between the latter two variants is how you detect that
it did not work as desired.
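
For concreteness, the variants look like this (a minimal sketch; it
assumes a glibc new enough to expose the Linux-specific
MAP_FIXED_NOREPLACE, and the hint address is made up):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>

int main(void)
{
    void  *hint = (void *)0x500000000000ull;  /* page-aligned hint, purely illustrative */
    size_t len  = 1 << 20;

    /* Hint only: the kernel tries 'hint', but silently picks another
       address on conflict. */
    void *a = mmap(hint, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    /* MAP_FIXED_NOREPLACE: same request, but fails with EEXIST instead of
       moving (or clobbering an existing mapping, as MAP_FIXED would). */
    void *b = mmap(hint, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE, -1, 0);
    int err = errno;

    printf("hint=%p a=%p b=%p (%s)\n", hint, a, b,
           b == MAP_FAILED ? strerror(err) : "ok");
    return 0;
}

In the no-conflict case all of these should land at 'hint'; the
difference only shows up when something is already mapped there.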

BGB

unread,
Nov 27, 2023, 4:34:41 AM11/27/23
to
On 11/27/2023 1:22 AM, Anton Ertl wrote:
> BGB <cr8...@gmail.com> writes:
>> On 11/26/2023 9:45 AM, Anton Ertl wrote:
>>> This would be especially useful for the read-only sections (e.g, code)
>>> of common libraries like libc. However, in todays security landscape,
>>> you don't want one process to know where library code is mapped in
>>> other processes (i.e., you want ASLR), so we can no longer make use of
>>> that benefit. And it's doubtful whether other uses are worth the
>>> complications (and even if they are, there might be security issues,
>>> too).
>>>
>>
>> It seems to me, as long as it is a different place on each system,
>> probably good enough. Demanding a different location in each process
>> would create a lot of additional memory overhead due to from things like
>> base-relocations or similar.
>
> If the binary is position-independent (the default on Linux on AMD64),
> there is no such overhead.
>

OK.

I was thinking mostly of things like PE/COFF, where often a mix of
relative and absolute addressing is used, and loading typically involves
applying base relocations (so, once loaded, the assumption is that the
binary will not move further).

Granted, traditional PE/COFF and ELF manage things like global variables
differently (direct vs GOT).

Though, on x86-64, PC-relative addressing is a thing, so less need for
absolute addressing. PIC with PE/COFF might not be too much of a stretch.
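
For reference, applying base relocations is a short walk over the .reloc
data (a hedged sketch following the usual PE layout of page-RVA blocks
of 16-bit entries; only the common DIR64 type is handled, and the names
here are made up rather than the windows.h ones):

#include <stdint.h>
#include <stddef.h>

typedef struct {
    uint32_t VirtualAddress;   /* page RVA this block applies to */
    uint32_t SizeOfBlock;      /* header + entries, in bytes */
} RelocBlock;

void apply_base_relocs(uint8_t *image, const uint8_t *reloc, size_t reloc_size,
                       uint64_t actual_base, uint64_t preferred_base)
{
    int64_t delta = (int64_t)(actual_base - preferred_base);
    size_t off = 0;
    while (off < reloc_size) {
        const RelocBlock *blk = (const RelocBlock *)(reloc + off);
        const uint16_t *ent = (const uint16_t *)(blk + 1);
        size_t nent = (blk->SizeOfBlock - sizeof(*blk)) / 2;
        for (size_t i = 0; i < nent; i++) {
            unsigned type = ent[i] >> 12, pofs = ent[i] & 0xFFF;
            if (type == 10)    /* IMAGE_REL_BASED_DIR64 */
                *(uint64_t *)(image + blk->VirtualAddress + pofs) += delta;
            /* type 0 (ABSOLUTE) is padding; other types omitted in this sketch */
        }
        off += blk->SizeOfBlock;
    }
}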


> I just started the same binary twice and looked at the address of the
> same piece of code:
>
> Gforth 0.7.3, Copyright (C) 1995-2008 Free Software Foundation, Inc.
> Gforth comes with ABSOLUTELY NO WARRANTY; for details type `license'
> Type `bye' to exit
> see open-file
> Code open-file
> 0x000055c2b76d5833 <gforth_engine+6595>: mov %r15,0x1c126(%rip)
> ...
>
> For the other process the same instruction is:
>
> Code open-file
> 0x000055dd606e4833 <gforth_engine+6595>: mov %r15,0x1c126(%rip)
>
> Following the calls until I get to glibc, I get, for the two processes:
>
> 0x00007f705c0c3b90 <__libc_open64+0>: push %r12
> 0x00007f190aa34b90 <__libc_open64+0>: push %r12
>
> So not just the binary, but also glibc resides at different virtual
> addresses in the two processes.
>
> So obviously the Linux and glibc maintainers think that per-system
> ASLR is not good enough. They obviously want ASLR to work as well as
> possible against local attackers.
>

OK.


>>> Of course the TLB looks up by VA, what else. But if the VA is the
>>> same and the PA is the same, the same ASID can be used.
>>>
>>
>> ?...
>>
>> Typically the ASID applies to the whole virtual address space, not to
>> individual memory objects.
>
> Yes, one would need more complicated ASID management than setting
> "the" ASID on switching to a process if different VMAs in the process
> have different ASIDs. Another reason not to go there.
>
> Power (and IIRC HPPA) do something in this direction with their
> "segments", where the VA space was split into 16 equally parts, and
> IIRC the 16 parts each extended the address by 16 bits (minus the 4
> bits of the segment number), so essentially they have 16 16-bit ASIDs.
> The address spaces are somewhat unflexible, but with 64-bit VAs
> (i.e. 60-bit address spaces) that may be good enough for quite a
> while. The cost is that you now have to manage 16 ASID registers.
> And if we ever get to actually making use of more the 60 bits of VA in
> other ways, combining this ASID scheme with the other use of the VAs.
>

OK.

That seems a bit odd...


Scott Lurndal

unread,
Nov 27, 2023, 9:59:41 AM11/27/23
to
I've never seen a case where using MAP_FIXED was useful, and I've
been using mmap since the early 90's. I'm sure there must be one,
probably where someone uses full VAs instead of offsets in data
structures. Using the full VAs in the region will likely cause
issues in the long term as the application is moved to updated or
different POSIX systems, particularly if the data file associated
with the region is expected to work in all subsequent
implementations. MAP_FIXED should be avoided, IMO.

Anton Ertl

unread,
Nov 27, 2023, 11:18:19 AM11/27/23