
Hardware support for pointer selectors (for protection, expansion, and bookkeeping)

Rick C. Hodgin

Mar 9, 2016, 10:46:54 AM
I had an idea today, and I'm sure it already exists, but I wanted to find
out if anyone knows of an architecture which supports it? It goes like
this:

On the x86:

movzx eax,_16_bit_dat_segment_number
mov ds,eax
; Somewhere in prior setup, the selector used here for DS: was
; created, populated, established, with settings, a base and
; limit, etc., providing protection

; We can then use it in code:
mov esi,myOffset
mov dword ptr [esi],0

In this, we have a data segment selector, and a pointer, and we're able to
use the hardware protection mechanisms of the CPU to prevent the offset in
esi from writing into a memory area it shouldn't.

What if there were a way to do this:

mov esi,myPointer
    ; Somewhere in prior setup, myPointer was set up to have its own,
    ; second level of protection, with its own settings, base, and
    ; limit. Data accesses through this pointer are always made
    ; relative to a 0-based offset into its own allocated area, so
    ; the kernel can, underneath and transparently to the application,
    ; manage that memory and move the data around in physical memory
    ; without the allocation having to occupy sequential blocks.

; We can then use it in code like normal
mov dword ptr [esi],0
    ; But rather than updating data at the address held in esi, esi itself
    ; serves as a selector that points to the physical location in
    ; memory established by the pointer selector.

-----
By doing this, protection to data within the pointer can be ensured,
and the memory at that location can be expanded or shrunk without any
physical locations in memory needing to change should realloc() need
to physically move the data; in that case the OS would move the data
underneath and update the hidden fields in the pointer selector. There
can also be all kinds of read/write counts, including
interrupt-on-read/write settings, etc., to allow for complex
bookkeeping.

Are there architectures which provide this facility?

Best regards,
Rick C. Hodgin

Rick C. Hodgin

Mar 9, 2016, 10:50:10 AM
In this way, the standard malloc(), realloc(), free() functions can be
used for memory that is in the traditional form. But for other
applications where managed memory is desirable, the new facilities could
be exposed and used in code, making it far easier to use and maintain in
software, with real hardware support making it completely transparent
to libraries, since each of them simply works by physical address,
allowing for transparent integration, passing of parameters, etc., as is
done today.

Robert Wessel

Mar 9, 2016, 12:02:18 PM
This can be done on any system supporting traditional paging. Just
make each allocation start on a page boundary, and have guard pages on
either side.

The problem is that it's inefficient for small allocations (the guard
pages just waste address space, and the actual allocations are all
rounded up to a page-multiple in size), and it doesn't prevent things
like "add esi,100000" from adjusting the pointer into a whole
different allocation.

Obviously there are several ways you can partition the segment and
offset in the pointer (you can use a fixed scheme which gives you
something more like the traditional x86 segment+offset, or just
allocate a fixed number of guard pages on either side).

In Windows, for example, you could replace malloc() so that it did a
VirtualAlloc/reserve (allocate the address space to the required n+2
pages) and VirtualAlloc/commit (map actual storage to the middle n
pages) to create that area. Linux has similar facilities. Obviously
a corresponding modification of free() is required as well.
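
As a rough C sketch of that replacement (assuming Windows; guarded_malloc
and guarded_free are illustrative names, not any real library's API):

/* Illustrative guarded allocator: reserve n+2 pages, commit only the
   middle n, leaving an inaccessible guard page on each side. */
#include <windows.h>

static SIZE_T page_size(void)
{
    SYSTEM_INFO si;
    GetSystemInfo(&si);
    return si.dwPageSize;
}

void *guarded_malloc(SIZE_T n)
{
    SIZE_T pg    = page_size();
    SIZE_T pages = (n + pg - 1) / pg;        /* round up to whole pages */
    char  *base  = VirtualAlloc(NULL, (pages + 2) * pg,
                                MEM_RESERVE, PAGE_NOACCESS);
    if (base == NULL)
        return NULL;
    if (VirtualAlloc(base + pg, pages * pg,
                     MEM_COMMIT, PAGE_READWRITE) == NULL) {
        VirtualFree(base, 0, MEM_RELEASE);
        return NULL;
    }
    return base + pg;                        /* first usable byte */
}

void guarded_free(void *p)
{
    if (p != NULL)                           /* release the whole reservation */
        VirtualFree((char *)p - page_size(), 0, MEM_RELEASE);
}

A Linux version would do the same thing with mmap/mprotect.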

Alternatively you can allocate smaller guard areas on either side of
an allocation, and detect if anything has been written into them
(stack guards like that are fairly common on x86). That allows you
to allocate sub-page areas, but increases detection overhead, and
delays detection as well.

Several systems use a few high bits in the address to select a segment
or region; this can have similar effects.

Rick C. Hodgin

Mar 9, 2016, 12:16:14 PM
It would also be a way to let 32-bit applications immediately address
more than 4GB of memory per task: each pointer selector could point to
its own 36-bit or larger base, so a single application could allocate
4GB per pointer rather than per task/process, giving it access to a
nearly unlimited amount of memory even in 32-bit code.

Stephen Fuld

Mar 9, 2016, 12:52:11 PM
On 3/9/2016 7:46 AM, Rick C. Hodgin wrote:
> I had an idea today, and I'm sure it already exists, but I wanted to find
> out if anyone knows of an architecture which supports it?

snip description

This seems pretty similar to the concept of "banks" as used in the
Unisys 2200 series architecture. All code and data reside in banks.
These are variable sized regions of virtual memory described by a bank
descriptor, which contains its size, base address, attributes like
protection, etc.

Up to 16 banks are specified with bits in the instruction, and there are
instructions to change which of the up to many thousands of potential
banks are currently "based" i.e. one of the active 16.

All accesses are checked by the hardware that they conform to what is
allowed in the specified bank.

Note that this is totally on top of the virtual memory system which uses
traditional fixed size (though larger than 4Kbytes) pages.



--
- Stephen Fuld
(e-mail address disguised to prevent spam)

Rick C. Hodgin

Mar 9, 2016, 1:31:32 PM
Today we have LGDT, LIDT, and LLDT, with this one added as LPDT. It
allows values in its range (typically low, near-0 values) to be used as
the pointer selectors.

In these cases, all standard instruction encodings which use a pointer
register with a base ([esi+...] and [edi+...]) no longer use the actual
value of esi or edi for their address, but the location established by
their selector.

; kernel code
lpdt my_pointer_selector_table

; User code
lptr esi,myPointerSelector1
lptr edi,myPointerSelector2

; One flag in EFLAGS indicates ESI=use LPDT, another EDI=use LPDT.
; Any subsequent write to esi or edi will lower the flag, except
; one made using the lptr instruction.

mov dword ptr [esi],0
mov eax,[edi]

They all work like normal, but read through the base address and limit
of the pointer selectors, allowing for windows into all of memory
through the esi or edi registers, and, if extended to 64 bits, rsi and
rdi.

Joe Chisolm

Mar 9, 2016, 1:52:59 PM
Search bitsavers. They may have Honeywell DPS8 manuals; also look
for anything about CP-6. The DPS8 (and L66 with the New Systems
Architecture board set) had segment descriptors and pointer
registers. A privileged process could create a segment descriptor,
and anyone could "shrink" a descriptor to a smaller size. I'm
trying to remember from 30 years ago, but the climb instruction was
used to call the OS or what we might think of as shared libraries.
The climb instruction would take the arguments to the procedure
call and create descriptors for them. The descriptor for, say, a buffer
would have the base as the start of the buffer and the size would
be sizeof(buffer). The library and even the OS would mem fault on a
write outside the bounds of the descriptor (in a pointer register).

Shrinking descriptors and using pointer registers was trivial but
the climb instruction was expensive. If memory serves, PL-6 (the
CP-6 PL/1 derivative) would only use the climb instruction for
OS calls and shared library calls.

Something like:
char x[1];
read(fd,x,5000);

If the available data count > 1, the read would mem fault and never
write outside the bounds of "x". (I have no idea today what that
code would look like in PL-6, nor do I have any desire to find out.)

The equivalent of a malloc() in CP-6 would return a descriptor so
even something like:
char *x=malloc(1);
read(fd,x,5000);

would still fail.

I also think there was a base/bound for the stack pointer register. You
did not need guard pages to protect from stack overflow.

--
Chisolm
Republic of Texas

David Brown

Mar 9, 2016, 1:58:00 PM
I believe this is roughly what happens when you use something like the
address sanitizer in clang or gcc, or debug tools such as valgrind. I
have not used these myself (they are not suitable for the type of
systems I target), but they have helped many people find bugs in their
code by trapping invalid accesses.

Rick C. Hodgin

Mar 9, 2016, 2:09:49 PM
No. Sanitizers use software algorithms for nearly all checking, which
is why some of them require full access to the standard library sources,
and all of them require some kind of memory footprint in a higher memory
area (shadow memory that mirrors the program's data as it's encountered).

They instrument the source code, inserting function calls on each access
to validate it before it occurs.

2013: https://www.youtube.com/watch?v=Q2C2lP8_tNE
2014: http://www.youtube.com/watch?v=V2_80g0eOMc
2015: http://www.youtube.com/watch?v=qTkYDA0En6U
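
As a conceptual illustration of that instrumentation (this is a
simplification with made-up names, not ASan's actual generated code or
shadow encoding):

/* Conceptual only: one shadow byte describes each 8-byte granule of the
   buffer (0 = fully addressable).  Real sanitizers use a global shadow
   region and finer-grained encodings. */
#include <stdio.h>
#include <stdlib.h>

static unsigned char buffer[32];
static unsigned char shadow[sizeof(buffer) / 8];   /* all zero: all valid */

static void check_access(size_t index, size_t size)
{
    if (index + size > sizeof(buffer) || shadow[index / 8] != 0) {
        fprintf(stderr, "invalid access at offset %zu\n", index);
        abort();                                    /* report and stop */
    }
}

/* The instrumented form of   buffer[i] = 0;   becomes, conceptually: */
static void instrumented_store(size_t i)
{
    check_access(i, 1);                             /* validate before the write */
    buffer[i] = 0;
}

int main(void)
{
    instrumented_store(5);     /* fine */
    instrumented_store(40);    /* out of bounds: caught before the write */
    return 0;
}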

The facilities I'm talking about would be new features added to hardware,
to allow this type of windowing ability into memory through the pointer
selector, rather than the pointer value.

David Brown

Mar 9, 2016, 3:59:58 PM
Existing sanitizers have to work with existing hardware. You are right
about "address sanitizer" - it uses additional memory blocks to track
which addresses are valid, and has some overhead in the software to
check this (they estimate a 2x slow-down).

The one I was thinking of was "Electric Fence" - or its more
full-featured fork, DUMA (<http://duma.sourceforge.net/>). Basically,
whenever you allocate some memory with malloc(), this allocates a full
virtual page and places the desired memory area at the end of that page.
It also allocates the next page, but makes it inaccessible (no read or
write access). So code can directly access the allocated memory without
additional instructions or checks in software, but if you run off the
end of the block, you hit an illegal access fault. That is probably
about the best you can get with normal cpu hardware.
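
A rough POSIX sketch of that scheme (assuming mmap/mprotect; efence_malloc
is an illustrative name, not DUMA's actual API, and alignment of the
returned pointer is ignored for brevity):

/* Illustrative Electric-Fence-style allocator: map the rounded-up pages
   plus one trailing guard page, make the guard inaccessible, and push
   the allocation up against it so an overrun faults immediately. */
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

void *efence_malloc(size_t n)
{
    size_t pg    = (size_t)sysconf(_SC_PAGESIZE);
    size_t pages = (n + pg - 1) / pg;

    char *base = mmap(NULL, (pages + 1) * pg, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (base == MAP_FAILED)
        return NULL;

    /* Trailing guard page: any access past the allocation faults. */
    if (mprotect(base + pages * pg, pg, PROT_NONE) != 0) {
        munmap(base, (pages + 1) * pg);
        return NULL;
    }
    return base + pages * pg - n;   /* end of allocation abuts the guard page */
}

A matching free would simply munmap the whole (pages + 1) * pg region.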

But if you are designing new hardware, I don't see that it would be too
difficult to add a range-limit field to the virtual address information
in the MMU for an address or segment. The problem is that code would
have to use "long pointers" including a different segment for each
memory block. It may make more sense to use really long pointers in
your hardware - perhaps have 64-bit pointers, of which the top 24 bits
are used for the segment (leaving 40 bits for the offset, as I know you
like that number!).
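
For illustration, such a 24/40 "long pointer" could be packed and unpacked
like this in C (the helper names here are made up):

/* Illustrative 64-bit "long pointer": top 24 bits select a segment,
   low 40 bits are the offset within it. */
#include <stdint.h>

#define OFFSET_BITS  40
#define OFFSET_MASK  ((UINT64_C(1) << OFFSET_BITS) - 1)

static inline uint64_t make_longptr(uint32_t segment, uint64_t offset)
{
    return ((uint64_t)segment << OFFSET_BITS) | (offset & OFFSET_MASK);
}

static inline uint32_t longptr_segment(uint64_t p) { return (uint32_t)(p >> OFFSET_BITS); }
static inline uint64_t longptr_offset (uint64_t p) { return p & OFFSET_MASK; }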


David W Schroth

Mar 9, 2016, 10:34:51 PM
On Wed, 9 Mar 2016 09:52:04 -0800, Stephen Fuld
<SF...@alumni.cmu.edu.invalid> wrote:

>On 3/9/2016 7:46 AM, Rick C. Hodgin wrote:
>> I had an idea today, and I'm sure it already exists, but I wanted to find
>> out if anyone knows of an architecture which supports it?
>
>snip description
>
>This seems pretty similar to the concept of "banks" as used in the
>Unisys 2200 series architecture. All code and data reside in banks.
>These are variable sized regions of virtual memory described by a bank
>descriptor, which contains its size, base address, attributes like
>protection, etc.
>
>Up to 16 banks are specified with bits in the instruction, and there are
>instructions to change which of the up to many thousands of potential
>banks are currently "based" i.e. one of the active 16.
>
I'll pick a nit, here.
That 16 is the number of Base Registers a program can access; there
are considerably more than 16 banks available to a program. The cost
of loading a different bank into a base register is around 7 cycles.

>All accesses are checked by the hardware that they conform to what is
>allowed in the specified bank.
>
>Note that this is totally on top of the virtual memory system which uses
>traditional fixed size (though larger than 4Kbytes) pages.

I have, in the past, read a research paper that argued that the
optimal size of a page (at that time) was 16 KiB. As someone who has
spent a lot of time maintaining the virtual memory system, I think
it's ludicrous to stick with a 4K page size when the size of physical
memory is greater than 64G.

Regards,

David W. Schroth

Quadibloc

Mar 9, 2016, 11:35:59 PM
On Wednesday, March 9, 2016 at 8:34:51 PM UTC-7, David W Schroth wrote:

> I have, in the past, read a research paper that argued that the
> optimal size of a page (at that time) was 16 KiB. As someone who has
> spent a lot of time maintaining the virtual memory system, I think
> it's ludicrous to stick with a 4K page size when the size of physical
> memory is greater than 64G.

I remember reading that one of the DEC Alpha chips had a variable page size; 8K, 64K, or 512K IIRC.

John Savard

Stephen Fuld

Mar 10, 2016, 2:15:39 AM
On 3/9/2016 7:36 PM, David W Schroth wrote:
> On Wed, 9 Mar 2016 09:52:04 -0800, Stephen Fuld
> <SF...@alumni.cmu.edu.invalid> wrote:
>
>> On 3/9/2016 7:46 AM, Rick C. Hodgin wrote:
>>> I had an idea today, and I'm sure it already exists, but I wanted to find
>>> out if anyone knows of an architecture which supports it?
>>
>> snip description
>>
>> This seems pretty similar to the concept of "banks" as used in the
>> Unisys 2200 series architecture. All code and data reside in banks.
>> These are variable sized regions of virtual memory described by a bank
>> descriptor, which contains its size, base address, attributes like
>> protection, etc.
>>
>> Up to 16 banks are specified with bits in the instruction, and there are
>> instructions to change which of the up to many thousands of potential
>> banks are currently "based" i.e. one of the active 16.
>>
> I'll pick a nit, here.
> That 16 is the number of Base Registers a program can access; there
> are considerably more than 16 banks available to a program. The cost
> of loading a different bank into a base register is around 7 cycles.

Yes, that's what I tried to say above: any of thousands of banks can be
based into the 16 available base registers with an instruction. Sorry if
that wasn't clear.


>> All accesses are checked by the hardware that they conform to what is
>> allowed in the specified bank.
>>
>> Note that this is totally on top of the virtual memory system which uses
>> traditional fixed size (though larger than 4Kbytes) pages.
>
> I have, in the past, read a research paper that argued that the
> optimal size of a page (at that time) was 16 KiB. As someone who has
> spent a lot of time maintaining the virtual memory system, I think
> it's ludicrous to stick with a 4K page size when the size of physical
> memory is greater than 64G.

Of course. Part of the problem is that the virtual memory system has
been "overloaded" with additional functions like protection, sharing
capability and cacheability. These are logically unrelated to pages but
using the paging system for them makes it harder to change the page
size. Of course, with memory so cheap nowadays, many systems
essentially never page, so the page size doesn't really matter much.

Quadibloc

Mar 10, 2016, 7:43:55 AM
On Thursday, March 10, 2016 at 12:15:39 AM UTC-7, Stephen Fuld wrote:

> Part of the problem is that the virtual memory system has
> been "overloaded" with additional functions like protection, sharing
> capability and cacheability. These are logically unrelated to pages but
> using the paging system for them makes it harder to change the page
> size.

This is true.

Of course, it's also a natural development. Memory protection, and ways to
indicate shared memory, and mark areas of address space as not to be cached,
are urgently needed by computers in many applications. Adding more tools to the
architecture would lead to additional complexity, while simply adding
attributes to the pages in an existing paging mechanism minimizes that.

Some re-thinking, though, is certainly in order.

One of the most common reasons to mark an area as not cacheable is that it is used for a memory-mapped input/output device. So this is an attribute that tends to be associated with areas of _physical_ memory.

On the other hand, if an area of memory is being shared by two or more
programs, so that they can pass information between one another, obviously it's
an area of _virtual_ memory that one would want to mark this way.

Security, in the sense of binding memory to a program, and denying other
programs access to it, would normally apply to virtual memory as well.

The paging mechanism's function is to control the binding between virtual and
physical memory. So it has immediate access to all of virtual memory - and can
naturally access the _attributes_ of a page of virtual memory that is not
currently in physical RAM without pulling that page into RAM. So even if
security and sharing aren't properly a part of the paging mechanism _itself_,
it would seem there are advantages to... plugging them in, as it were, to the
paging mechanism.

But _that_ wouldn't negate your point. For purposes of security and sharing,
there could be a "logical page size" which is independent of the actual page
size used for swapping.

John Savard

Rick C. Hodgin

Mar 10, 2016, 8:20:18 AM
On Wednesday, March 9, 2016 at 3:59:58 PM UTC-5, David Brown wrote:
> The one I was thinking of was "Electric Fence" - or its more
> full-featured fork, DUMA (<http://duma.sourceforge.net/>). Basically,
> whenever you allocate some memory with malloc(), this allocates a full
> virtual page and places the desired memory area at the end of that page.
> It also allocates the next page, but makes it inaccessible (no read or
> write access). So code can directly access the allocated memory without
> additional instructions or checks in software, but if you run off the
> end of the block, you hit an illegal access fault. That is probably
> about the best you can get with normal cpu hardware.

I was not aware of that sanitizer. However, it sounds more like the
work-around implementation that Robert Wessel posted above that can
be employed using today's mechanisms.

> But if you are designing new hardware, I don't see that it would be too
> difficult to add a range-limit field to the virtual address information
> in the MMU for an address or segment. The problem is that code would
> have to use "long pointers" including a different segment for each
> memory block. It may make more sense to use really long pointers in
> your hardware - perhaps have 64-bit pointers, of which the top 24 bits
> are used for the segment (leaving 40 bits for the offset, as I know you
> like that number!).

My thinking was that pointer selectors would bypass segment selectors,
containing within them their own information, which extends beyond the
4GB limit for machine addressing, while still being limited to the 4GB
limit for physical addressing per pointer.

lptr esi,myPointerSelector1
; Right now, [esi+x] has a range from [esi+0] to [esi+(2^32)-1]

I can simultaneously do this:

lptr edi,myPointerSelector2
; Right now, [edi+x] has a range from [edi+0] to [edi+(2^32)-1]

The two pointer selectors could map into overlapping, or completely
separate and isolated memory, allowing two windows into any area of
memory. To move to the next 4GB chunk, load myPointerSelector3, and
so on.

With a 40-bit machine it becomes academic. Much larger address space
such that a single pointer selector would be sufficient for all but
the most extreme needs.

I'm still on target for my 40-bit CPU. I have goals I'm focusing on
beforehand, though:

(1) Complete my RDC compiler framework.
(2) Complete my Visual FreePro, Jr.

I figure those will take me up to the middle of next year.

(3) Begin work on my kernel, of which I have revamped my goals
slightly to now include some of what OS/2 had back in the
90's (a fully object-oriented Workplace Shell, for example),
and I've renamed the products:

(4) ExoduS/2 32-bit
(5) ArmoduS/2 32-bit

https://github.com/RickCHodgin/libsf/tree/master/exodus/LibSF_ES2
https://github.com/RickCHodgin/libsf/tree/master/exodus/LibSF_AS2

In the time between (3) and (5) I will periodically work on my Logician
software so that when I get (5) completed I will be able to begin work
on my hardware design. I am hopeful that by mid-2022, I will be at the
point where I have my hardware designed, and am well into validation
through Logician so that I can begin to have real parts fabricated. I
will then turn to motherboard support with storage controllers, graphics,
sound, etc.

We'll see though. These are my plans, but all of them are subject to
the guidance our Lord gave, that His plans may not be our plans, and that
if a thing goes forward, it is because He allows it (James 4:15).

It would be nice if I could get really focused and have hardware heading
off to the fab in July, 2022, because it would then be a 10-year effort
from when I started on writing a free replacement for Visual FreePro to
when I actually have the full hardware and software stack designed and
working in simulation, ready for hardware trials.

Not a bad way to spend a decade of one's life. :-)

Rick C. Hodgin

Mar 10, 2016, 8:33:56 AM
On Thursday, March 10, 2016 at 8:20:18 AM UTC-5, Rick C. Hodgin wrote:
> (3) Begin work on my kernel, of which I have revamped my goals
> slightly to now include some of what OS/2 had back in the
> 90's (a fully object-oriented Workplace Shell, for example),
> and I've renamed the products:
>
> (4) ExoduS/2 32-bit
> (5) ArmoduS/2 32-bit
>
> https://github.com/RickCHodgin/libsf/tree/master/exodus/LibSF_ES2
> https://github.com/RickCHodgin/libsf/tree/master/exodus/LibSF_AS2

Here are some of the features I'm talking about (this video is from OS/2
2.1, circa 1993, and the bulk of the features from OS/2 I'll include are
there, plus I have several of my own ideas on how to glue the system
together using a simple Blender nodes-like GUI):

https://www.youtube.com/watch?v=-DAojx2Hgec&t=40m55s

Rick C. Hodgin

Mar 10, 2016, 9:32:21 AM
On Thursday, March 10, 2016 at 8:20:18 AM UTC-5, Rick C. Hodgin wrote:
> My thinking was that pointer selectors would bypass segment selectors,
> containing within them their own information, which extends beyond the
> 4GB limit for machine addressing, while still being limited to the 4GB
> limit for physical addressing per pointer.
>
> lptr esi,myPointerSelector1
> ; Right now, [esi+x] has a range from [esi+0] to [esi+(2^32)-1]
>
> I can simultaneously do this:
>
> lptr edi,myPointerSelector2
> ; Right now, [edi+x] has a range from [edi+0] to [edi+(2^32)-1]
>
> The two pointer selectors could map into overlapping, or completely
> separate and isolated memory, allowing two windows into any area of
> memory. To move to the next 4GB chunk, load myPointerSelector3, and
> so on.

To explain even further, consider that today's selectors have a particular
format. For my LibSF 386-x40 CPU (which I now call Arxoda), I have revised
that format to something a little cleaner:

https://github.com/RickCHodgin/libsf/blob/master/li386/li386-documentation/images/descriptor.png

Here we see a 40-bit base, with the limit defined by only 28 bits expressed
in byte-block sizes, allowing anywhere from 0 up to the maximum range of
physically addressable memory.

To create a pointer selector, it would be similar, but it might employ these
fields:

[ExtraMemAddressBits][Limit][Base][+ the standard info]

Those "extra memory address bits" are tagged on to the physical address at
the highest bits, taking the 32-bit computed pointer offset and concatenating
it with the extra bits, allowing the [esi] pointer selector to access a much
greater range of memory than the 32-bits of esi will do by itself.
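
As arithmetic, the address formation being described is just a
concatenation; a hypothetical C sketch (the names are mine, not part of
any defined format):

/* Hypothetical: the 40-bit address a pointer selector would generate,
   formed from its extra high bits and the 32-bit computed offset. */
#include <stdint.h>

static inline uint64_t pointer_selector_address(uint8_t  extra_bits,  /* bits 39..32 */
                                                uint32_t offset32)    /* bits 31..0  */
{
    return ((uint64_t)extra_bits << 32) | offset32;
}

/* e.g. extra_bits = 0x03, offset32 = 0x00001000  ->  physical 0x0300001000 */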

In other cases, all 32-bit computations work as usual. And a pointer
selector can be unloaded by loading any other value into ESI or EDI without
using the lptr instruction:

lptr esi,myPointerSelector
; Now, esi is in pointer-selector mode

And when finished, back to normal mode:

xor esi,esi
; Now, esi is back in standard addressing mode

-----
I think this 32-bit solution is much better than going to a full 64-bit
machine, though I think the 40-bit approach is a better solution all the
way around, because it meets the range of feasible workloads for the
foreseeable future. Each core is capable of addressing up to 1 Terabyte
of RAM through its physical pins, with selectors created to allow access
into either local or global memory, and with additional bits added to
communicate with a shared pool of memory across multiple cores. Each
core then uses its own memory controller and its own dedicated address
space for applications, with crossover only where it's necessary, but
still within a full Terabyte of crossover address space, etc.

These kinds of pointer selectors would allow for a 32-bit computer, or 40-
bit computer, to compute workloads which are only possible today on 64-bit
computers, and to do so with more typical coding requirements (as most
things don't need 64-bits, they just need more than 32-bits).

I honestly believe the 40-bit CPU is a great stopover. I don't know why
it wasn't targeted by AMD with their AMD64 design (as the AMD40). I think
also the 40-bit single float, and the 80-bit double float (both with
implicit-1 mantissas), would've made nice extensions to IEEE-754, with a
few more mantissa bits keeping the exponent ranges the same as their 32-bit
and 64-bit counterparts, yielding ~2 extra base-10 significant digits in
40-bit form, and ~4 extra base-10 significant digits in 80-bit form, even
exceeding the count yielded by 80-bit IEEE today (due to it having a
larger exponent).
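
(Working that out, and assuming the exponent fields stay at 8 and 11 bits:
a 40-bit float would have 40 - 1 - 8 = 31 stored mantissa bits versus 23
in an IEEE-754 single, and 8 x log10(2) is about 2.4 extra decimal digits;
an 80-bit float would have 80 - 1 - 11 = 68 stored bits versus a double's
52, for about 4.8 extra digits, and its 69-bit effective significand would
indeed exceed the 64-bit significand of today's x87 80-bit format.)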

We'll see though. It is my present goal, however. And I think having the
four ISAs will be desirable as well:

(1) 80386 compatibility mode (flags to extend out to 40-bits, w/WEX)
(2) ARM compatibility mode (flags to extend out to 40-bits, w/WEX)
(3) Arxoda's new ISA (a many-predicate 80386/ARM hybrid, w/WEX)
(4) A programmable ISA which maps any of (1), (2), and (3) to
custom opcodes per task, including aggregated macros which are
many opcodes running in succession with a single command.

WEX: https://github.com/RickCHodgin/libsf/blob/master/li386/li386-documentation/images/wex_register_mapping.png

-----
The planned progression for Arxoda grows in developmental stages to Oppie-8,
which is targeted as the final form involving four Oppie-6 cores:

https://github.com/RickCHodgin/libsf/blob/master/li386/oppie/oppie-6.png
https://github.com/RickCHodgin/libsf/blob/master/li386/oppie/oppie-8.png

With Oppie-8 allowing for 4 "Love Threading" cores (Love Threading is a
concept where a core can request help from its companion core, and the
companion core will leave its workload temporarily and go and help the
other core, and then come back to its own workload afterward). It is
called love threading because this is what love does: It sacrifices of
itself to help another. Clever naming, yes? :-)

I have lots of plans. And lots of things railing against me in moving
forward. Nonetheless, I proceed forward undaunted. And, if God allows
it to be completed, it will be completed. And, of course, it would go
faster if any of you would like to help me. Your help would be much
appreciated, and would be a great service to God, and to mankind, to
use your skills to offer unto Him full fruits of your labors for these
purposes (to give people an alternative to money-based advances in
technology, and instead to look to Him and say, "How should this be
designed? What should it include were money not a factor?" and then
to look to Him (look to truth) for the answer).

Your help would be welcomed.

Ivan Godard

Mar 10, 2016, 12:25:27 PM

Ivan Godard

Mar 10, 2016, 12:31:30 PM
On 3/10/2016 4:43 AM, Quadibloc wrote:

> But _that_ wouldn't negate your point. For purposes of security and
> sharing, there could be a "logical page size" which is independent of
> the actual page size used for swapping.


Second try :-(

That's pretty much the reasoning we (Mill) followed, and we use a
logical page size of one byte for the meta-info so programs don't have
to contort themselves into rigid boxes.

The tricky part was getting rid of the overhead of protection swap,
because a single logical page (a byte) might have different protection
or other meta info for different processes.

EricP

Mar 10, 2016, 3:48:47 PM
That is "various page sizes" not variable size.
There have been many studies on large page, aka huge page, superpage.
There have been studies on truly variable page sizes.
There were also studies on varying the structure of page tables,
called "guarded page table". These could all be combined together.

But so far they seem to be more trouble than they are worth,
at least for dynamic allocation requests. Once a large page gets
broken into smaller pieces, it requires a lot of defragmentation
work to reassemble a large page again.

Also Alpha and MIPS used software-managed TLBs. While that allows
them flexibility, the cost of a software TLB was between 5% and 25%
overhead compared to hardware page table walkers.
But hardware walkers want a fixed, well defined structure
(e.g. exactly how many levels are in the page table).

Where large pages are used, they usually are assigned at boot time
to map long lived structures like the non-paged portion of an OS,
or video memory, or memory mapped IO space.

One thing that might help is to replace the logical relocation
that is now done with arithmetic relocation.
Currently the MMU simply replaces some number of virtual address bits
with their physical counterparts. The consequence is that for large page
mappings, say 2MB, the physical page must have the same alignment.
This places restrictions on how free space must be managed.
In arithmetic relocation you add an offset to the virtual address
to get the physical address, which allows a large page to start
on any physical page boundary. This would allow more flexibility in
how large pages are managed, though I'm not sure if it helps enough.
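
A tiny sketch of the difference, for a hypothetical 2MB mapping (the
constants and names here are illustrative only):

/* Conventional relocation replaces the high virtual bits wholesale, so
   the physical frame must itself be 2MB-aligned; arithmetic relocation
   just adds a signed offset, so it need not be. */
#include <stdint.h>

#define PAGE_2M        (UINT64_C(1) << 21)
#define OFFSET_MASK_2M (PAGE_2M - 1)

static inline uint64_t translate_replace(uint64_t vaddr, uint64_t frame_2m_aligned)
{
    return frame_2m_aligned | (vaddr & OFFSET_MASK_2M);
}

static inline uint64_t translate_arith(uint64_t vaddr, int64_t phys_minus_virt)
{
    /* phys_minus_virt = physical base - virtual base for this region */
    return (uint64_t)((int64_t)vaddr + phys_minus_virt);
}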

Eric

mac

Mar 11, 2016, 10:40:20 PM
Sounds like the intel 432.

Rick C. Hodgin

Mar 12, 2016, 12:26:47 AM
On Friday, March 11, 2016 at 10:40:20 PM UTC-5, mac wrote:
> Sounds like the intel 432.

It is very similar.

https://en.wikipedia.org/wiki/Intel_iAPX_432

"Programs are not able to reference data or instructions by address;
instead they must specify a segment and an offset within the segment.
Segments are referenced by Access Descriptors (ADs), which provide an
index into the system object table and a set of rights (capabilities)
governing accesses to that segment. Segments may be "access segments",
which can only contain Access Descriptors, or "data segments" which
cannot contain ADs. The hardware and microcode rigidly enforce the
distinction between data and access segments, and will not allow
software to treat data as access descriptors, or vice versa."

It also used bit-aligned instructions, rather than byte-aligned, something
I've had in my CPU design since a product I designed in 1993 called the
Craetis CPU. I still have the original 14 7/8" greenbar paper where I
printed out the specs. It was an 80-bit CPU that did not have conventional
registers, but each instruction encoded a bit offset into a common scratch
pad which retrieved varying length quantities from 8-bits to 80-bits, and
computed, and stored to another offset (which could also be the same).

It was my first ever attempt to create a CPU design, and I wrote it over
two days during Thanksgiving in 1993.

Rick C. Hodgin

Mar 12, 2016, 10:35:21 PM
mac wrote:
> Sounds like the intel 432.

An asynchronous iAPX 432 (with some notable tweaking) is very much
what I'm targeting with my LibSF 386-x40 / Arxoda CPU. I am looking
forward to completing the design in simulation in Logician in the years
to come (Lord willing).

John Dallman

Mar 15, 2016, 2:16:50 PM
In article <52a9918e-7101-46db...@googlegroups.com>,
rick.c...@gmail.com (Rick C. Hodgin) wrote:

> By doing this, protection to data within the pointer can be ensured,

This looks a bit like Intel's new MPX feature: see how that compares?

https://en.wikipedia.org/wiki/Intel_MPX

John

Rick C. Hodgin

Mar 15, 2016, 3:01:13 PM
Similar. Mine allows a window through the pointer's selector to its
own 4GB segment (beginning with a base offset of 0), allowing for all
references through ESI or EDI when loaded as a pointer selector. When
used normally, it acts as a standard register.

It's a way for a single 32-bit app to gain access to far more than 4GB
of memory by accessing data through its selectors. Each of them is set up
by the running app, which calls an OS function as needed to establish
and enable the hardware feature.

I haven't seen anything like it on modern architectures, but I'm sure
it exists. There are some switching features I've seen where areas of
memory are shadowed out to another bank of memory, but this is different.
The pointer selector simply creates an extended 36-bit or 40-bit address
and accesses the 4GB inner bits relative to that extra-bit base when it
is enabled. Otherwise, it's just a regular index register.

It could also work with 64-bits, but I don't see as much need there.

Rick C. Hodgin

Mar 16, 2016, 11:08:16 AM
I've been thinking that, since I considered the possibility of pointer
selectors, the need for a full 40-bit machine no longer exists.
A 32-bit offset within a 40-bit address would be acceptable for all
but the most extreme tasks, and in the case of those extreme tasks it
would not be that much overhead to manage the larger memory block in
4GB segments in single processes, or allow inter-process communication
with multiple separate processes each working on their own 4GB chunk.

As such, the LibSF 386-x40 may take on a new meaning, with 40-bit
physical addressing, and 32-bit computation in integer registers,
keeping up to 80-bit in floating point and 64-bit in vector processing.

Rick C. Hodgin

Mar 16, 2016, 1:44:50 PM
I will post a link to this that doesn't wrap when I get the chance. But
I did some thinking on this during my lunch, and it resulted in this
consideration:

If you look at Intel's 32-Bit Task-State Segment (TSS), you'll see
a structure like this:

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
[..............I/O.Map.Base.Address............][..................Reserved.................][T] 100
[................Reserved......................][..............LDT.Segment.Selector............] 96
[................Reserved......................][.....................GS.......................] 92
[................Reserved......................][.....................FS.......................] 88
[................Reserved......................][.....................DS.......................] 84
[................Reserved......................][.....................SS.......................] 80
[................Reserved......................][.....................CS.......................] 76
[................Reserved......................][.....................ES.......................] 72
[.............................................EDI..............................................] 68
[.............................................ESI..............................................] 64
[.............................................EBP..............................................] 60
[.............................................ESP..............................................] 56
[.............................................EBX..............................................] 52
[.............................................EDX..............................................] 48
[.............................................ECX..............................................] 44
[.............................................EAX..............................................] 40
[............................................EFLAGS............................................] 36
[.............................................EIP..............................................] 32
[.............................................CR3..............................................] 28
[................Reserved......................][.....................SS2......................] 24
[.............................................ESP2.............................................] 20
[................Reserved......................][.....................SS1......................] 16
[.............................................ESP1.............................................] 12
[................Reserved......................][.....................SS0......................] 8
[.............................................ESP0.............................................] 4
[................Reserved......................][...............Previous.Task.Link.............] 0

It occurred to me that there are unused bits in the TSS which could be used to hold the pointer
selector information. By using FS:ESI and GS:EDI, the full range of 40-bit memory could be
addressable with a 5-byte pointer.

This would require that the FS and GS slots be revamped:

31 30 29 28 27 26 25 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 09 08 07 06 05 04 03 02 01 00
[b][d][s][d][c][b][a][ 8 bits ][P][.....................GS.......................] 92
[b][d][s][d][c][b][a][ 8 bits ][P][.....................FS.......................] 88

The bits in positions 31..25 would hold flags indicating which registers are associated with
which segment selector when used as a pointer selector. This would allow any of the general
purpose registers to be used in this way, and not just esi and edi. They are, from left to
right, b=ebp, d=edi, s=esi, d=edx, c=ecx, b=ebx, a=eax.

The 8-bit block indicates the extra address bits to add to the pointer selector, allowing FS and GS to
be used as part of the address in a new instruction:

lptr fs:esi,my_5_byte_pointer ; Set the flag and load the new 8-bit offset
lptr fs:esi,my_4_byte_pointer ; Just set the flags to indicate this register is
; associated with the FS register, but leave whatever's
; already in FS as it is

And the [P] bit indicates whether that selector is currently being used in a pointer operation.
It remains set as long as any of the 31..25 bits are set, or until a non-lptr instruction
is used which changes the value of FS or GS.
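
To make those bit positions concrete, here is a hypothetical C rendering of one
revamped FS/GS slot (bit-field ordering is implementation-defined in C, so this
documents the layout above rather than giving a portable definition):

/* Hypothetical rendering of one revamped FS/GS slot from the diagram
   above (bits 31..25 = register flags, 24..17 = extra address bits,
   16 = P, 15..0 = the selector).  Not a real x86 structure. */
#include <stdint.h>

typedef struct {
    uint32_t selector : 16;  /* bits 15..0  : FS or GS segment selector        */
    uint32_t p        : 1;   /* bit  16     : selector in use for a pointer op */
    uint32_t extra    : 8;   /* bits 24..17 : extra high address bits to apply */
    uint32_t a_eax    : 1;   /* bit  25     : eax bound to this selector       */
    uint32_t b_ebx    : 1;   /* bit  26     : ebx                              */
    uint32_t c_ecx    : 1;   /* bit  27     : ecx                              */
    uint32_t d_edx    : 1;   /* bit  28     : edx                              */
    uint32_t s_esi    : 1;   /* bit  29     : esi                              */
    uint32_t d_edi    : 1;   /* bit  30     : edi                              */
    uint32_t b_ebp    : 1;   /* bit  31     : ebp                              */
} tss_pointer_selector_slot;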

In this way, a 32-bit computer can gain access to the entire 40-bit range of memory in
32-bit chunks using the type of segment:offset solution used in the 16-bit form. It's
clunky, but it allows far more memory without having a full 64-bit computer, which is
really not necessary in any but the most extreme applications, and is actually wasteful
at times in terms of raw system use (though there are gains to be had through its REX
prefix and the additional registers that are available, which LibSF 386-x40 bypasses using
the WEX prefix shown here, though they would now once again be only 32-bit registers):

https://github.com/RickCHodgin/libsf/blob/master/li386/li386-documentation/images/wex_register_mapping.png

-----
In the alternative, creating an EFLAGS2 register would allow it to be used
with appropriate PUSH/POP instructions for saving the state as needed, while
occupying the same 32-bits as the above proposal, just in a new EFLAGS2
register rather than within the TSS itself.

I'll give this setup some thought.

Morten Reistad

Apr 12, 2016, 8:22:40 AM
In article <DOqdnfolQsOU7H3L...@earthlink.com>,
Look at the Prime 50-series too.

This was a machine built for a bowdlerised Multics-clone.

The memory was 28 bits of 16-bit words, organised as 16 bits of
intra-segment offset and 12 bits of segment number, with the segments
organised in 4 banks. Bank 0 was for the kernel, bank 1 for system-wide
shared stuff, bank 2 for code etc and bank 3 for stack/heap.

Pointers were of several types. The most used ones had a type
associated with them, which could take one of 16(?) values,
like float, double, int (32-bit), char(*), word (16-bit). Each
segment also had a size register; although this was rarely
used, you could limit the size to any 16-bit length, and accesses
outside this would fault.

In reality not much different from guard pages when used with
code that would handle "long" pointer arithmetic (i.e. doing
inter-segment adjustments). PLP/SPL (the two mini-pl1 compilers)
could limit such arithmetic in declarations; I never used this
in CC, CI, FTN or F77 (the two sets of C and Fortran compilers).

Look for Primos monitor calls manual and the 50-series architecture
guide for documentation in detail.

The types were validated on function/procedure/call gates,
and faults generated for incompatible stuff.

Leading to calls like

external fortran tnou (char [], short int);
void putstring(char * x) {
    tnou ((char []) x, (short int) strlen(x));
}

where you needed to explicitly tag the arguments to the kernel
call gate with the correct types when calling from a half-alien
language like c.

It also trapped on unaligned pointers.

Designed in 1975, it was quite successful until ca 1990.

*the char pointers had another bit for selecting the high/low
byte
>
>Shrinking descriptors and using pointer registers was trivial but
>the climb instruction was expensive. If memory serves PL-6 (the
>CP-6 PL/1 derivitive) would only use the climb instruction for
>OS calls and shared library calls.
>
>Something like:
>char x[1];
>read(fd,x,5000);
>
>If the available data count > 1 the read would mem fault and never
>write out side the bounds of "x". (I have no idea today what that
>could would look like in PL-6 nor do I have any desire to find out).
>
>The equivalent of a malloc() in CP-6 would return a descriptor so
>even something like:
>char *x=malloc(1);
>read(fd,x,5000);
>
>would still fail.
>
>I also think there was a base/bound for the stack pointer register. You
>did not need guard pages to protect from stack overflow.

You could get there in Primos too, but it would use a lot of segments
and would only work in SPL/PLP.

-- mrr


Rick C. Hodgin

Nov 25, 2016, 4:39:18 PM
I have published some updated information on GitHub about the LibSF
386-x40, now called Arxoda:


https://github.com/RickCHodgin/libsf/blob/master/li386/li386-documentation/images/wex_register_mapping.png

General visualization of some of its design aspects:


WEX: https://github.com/RickCHodgin/libsf/blob/master/li386/li386-documentation/images

The general planned execution engine in its intended form. The Oppie-1
through Oppie-5 designs are precursors. I consider Oppie-6 to be the
true base of the architecture:

Love Threading:
https://github.com/RickCHodgin/libsf/blob/master/li386/oppie/oppie-6-love-threading.png

Oppie-6: Single-core, dual-thread
https://github.com/RickCHodgin/libsf/blob/master/li386/oppie/oppie-6.png

Oppie-7: Dual-core, quad-thread
https://github.com/RickCHodgin/libsf/blob/master/li386/oppie/oppie-7.png

Oppie-8: Quad-core, oct-thread
https://github.com/RickCHodgin/libsf/blob/master/li386/oppie/oppie-8.png

I plan to soon take the primary repository off GitHub and move it to my
own server. The content will remain public, but it will no longer be at
these URLs. If at some point you cannot access this information but
would like to see it, please contact me by email for an updated link.