On 10/23/2013 11:25 PM, Ivan Godard wrote:
> On 10/23/2013 10:30 PM, Andy (Super) Glew wrote:
>> On 10/21/2013 3:05 PM, Ivan Godard wrote:
>
>
> <snip>
>
>>> There is no requirement that the checking be used at all, and it won't
>>> be unless code is added to the allocator to set a granule.
>>
>> Blah, blah, blah.
Sorry, I was grumpy last night. Annoyed at you explaining obvious things
to me, and not answering my question. Although perhaps you did:
I was thinking that you were more willing than I am to say "If you want
to get good checking, round your data structures up to power-of-2 sizes".
I think that instead you are saying "power-of-2 checking only: if you
don't round your data structures up, (1) some errors may not be detected
at all, and (2) some errors may be detected very late".
Although I suspect that you are going to tell me once again that I have
totally misunderstood you, and that I am not allowed to comment on your
proposal.
>>
>> Do you think that overflowing a non-power-of-2 object within a
>> power-of-2 granule, overflowing into another object within the same
>> granule, without detecting an overflow is acceptable?
>
> Acceptable? When the alternative is no checking at all?
I suppose the example of guard pages shows that people are willing to
live with checking that catches errors potentially a very long time
after the actual overflow.
> The proposal is unfortunately minimal, I agree. But something is better
> than nothing IMO, and I personally welcome all the checking I can get,
> complete or not. However, I am not the target market. Fr the market, I'm
> much less worried about what we don't catch than I am about user
> reactions when we *do* catch something in their "working" programs.
>
> The checking alternative that I know of I consider unacceptable to the
> market:
You forgot the MPX scheme, which is pretty well proven not to break
programs or OSes.
Although I agree with you that Intel's chosen implementation requires a
bit too much compiler involvement for my taste.
Actually, feeling bad about my grumpy "blah, blah, blah" leads me to
suspect that there may be a possibility of tighter checking.
One of the design principles behind my work that led to MPX... (I'm
getting tired of that circumlocution - I called it "Cp/PL". I think that
there is no harm in letting that be known. "Cp" stood for "checked
pointers"; "PL" was short for a longer codename in the Intel geographic
pattern. I won't say that.)
One of the design principles behind my work that led to MPX or Cp/PL was
"no struct descriptors".
Capability ISAs are not so bad, whether using power-of-2 bounded
pointers or wide pointers with [lwb,upb) (especially if the wide bits
are stored non-adjacent).
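As a sketch (my own names, no particular ISA), the wide-pointer check
is just two comparisons, with the bounds carried separately from the
pointer value itself:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Sketch of a wide ("fat") pointer: the pointer plus separate bounds.
// The bounds words could well be stored non-adjacent to the pointer.
struct Fat_Pointer {
    std::uintptr_t ptr;  // current pointer value, may move around
    std::uintptr_t lwb;  // lower bound, inclusive
    std::uintptr_t upb;  // upper bound, exclusive
};

// Check an access of `size` bytes at the current pointer value.
bool in_bounds(const Fat_Pointer& p, std::size_t size) {
    return p.ptr >= p.lwb && p.ptr + size <= p.upb;
}
```

Because the pointer is separate from [lwb,upb), it can legally sit
anywhere in (or even outside) the bounded region between checks.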
---+ Struct and array descriptors considered bad
But many capability ISAs eventually have array descriptors - not simple
[lwb,upb) descriptors, but
[base_address,elem_type,ndim,dim1,dim2...].
Worse still, many capability ISAs have struct descriptors:
[struct_type,nfields,field1_type, field2_type,...]
Variable length.
Not something I want to manage in hardware.
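For concreteness, a sketch (my reconstruction, hypothetical names) of
the shape such an array descriptor takes - the point being that its
size varies with ndim, which is exactly what makes it unpleasant as a
hardware-managed object:

```cpp
#include <cassert>
#include <cstddef>

// [base_address, elem_type, ndim, dim1, dim2, ...]
// The fixed header is followed in memory by ndim extent words.
struct Array_Descriptor_Header {
    void*    base_address;
    unsigned elem_type;   // hypothetical type code
    unsigned ndim;
    // ...followed by ndim extents
};

// Total descriptor size is a function of ndim - i.e. variable length.
std::size_t descriptor_bytes(unsigned ndim) {
    return sizeof(Array_Descriptor_Header) + ndim * sizeof(unsigned);
}
```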
---+ Living without direct struct descriptors
Part of the thought process that led to Cp/PL / MPX involved:
+ I don't want struct descriptors
+ okay, then how do I protect structs - how do I prevent overflow of
an element within a struct?
Reluctantly applying the adage "any problem in computer science can be
solved by a level of indirection":
Instead of struct descriptors, use pointers to struct elements.
Possibly in a separate struct or array:
struct Target { int a; char b; double c; /* ... */ };
struct Pointers_to_Target_Fields {
    int*    a_ptr;
    char*   b_ptr;
    double* c_ptr;
    // ...
    Pointers_to_Target_Fields(Target& t)
        : a_ptr(&t.a), b_ptr(&t.b), c_ptr(&t.c) {}
};
Basically, this amounts to saying that a compiler can create its own
struct_descriptors, if it really wants to, albeit using this level of
indirection.
Now, considering where Cp/PL ended up, this is largely irrelevant. But
it was nevertheless important to my thought process: I would not have
bothered to continue thinking about Cp/PL if I had not seen how it could
ultimately be extended to security just as good as that of any other
capability system.
---+ What does this have to do with power-of-2 checked pointers?
Most primitive objects understood by the hardware have power-of-2 sizes:
8, 16, 32, 64, 128 bits.
You could probably get away with power-of-2 sized pointers in the thing
that looks like a struct descriptor:
struct Pointers_to_Target_Fields {
    Power_of_2_Checked_Pointer<int*>    a_ptr;
    Power_of_2_Checked_Pointer<char*>   b_ptr;
    Power_of_2_Checked_Pointer<double*> c_ptr;
    // ...
    Pointers_to_Target_Fields(Target& t)
        : a_ptr(&t.a), b_ptr(&t.b), c_ptr(&t.c) {}
};
noting that sizeof(Power_of_2_Checked_Pointer<T*>) == sizeof(T*).
Tight checking.
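As a rough software illustration (my own encoding, templated on the
element type; for clarity it spends a separate word on the log2(size)
field, whereas the real scheme would pack those bits into the pointer
itself so that sizeof stays equal to a raw pointer):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: a pointer plus a log2(granule size) field.
// An access is allowed only if it stays inside the 2^log2_granule-byte,
// 2^log2_granule-aligned granule containing the base address.
template <typename T>
struct Power_of_2_Checked_Pointer {
    T* base;
    unsigned log2_granule;  // granule is 2^log2_granule bytes, base assumed aligned

    Power_of_2_Checked_Pointer(T* p, unsigned n) : base(p), log2_granule(n) {}

    T& at(std::ptrdiff_t i) const {
        auto b = reinterpret_cast<std::uintptr_t>(base);
        auto a = reinterpret_cast<std::uintptr_t>(base + i);
        // In-granule check: bits above the low log2_granule bits must match.
        assert((a >> log2_granule) == (b >> log2_granule));
        return base[i];
    }
};
```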
Unfortunately, this doesn't handle pointing to non-power-of-2 arrays as
elements in structs. Nor does it handle the fact that the
Pointers_to_Target_Fields thing that looks like a struct descriptor is
non-power-of-2.
--
Oh, by the way, this reminds me: one of the biggest reasons not to use
1-extra-bit power-of-two pointers was that such pointers encode only
[base_address, base_address+size), i.e. the pointer is always to the
bottom of the memory region - instead of ptr plus [lwb,upb), where the
pointer is separate from the bounds. C allows pointers to go outside
the bounds in a very limited way, but never to be dereferenced there.
Many C programs use negative offsets. Disallowing negative offsets
breaks far too many programs.
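For instance (my example, not from the thread), the kind of legal
negative-offset code that bottom-anchored bounds would flag:

```cpp
#include <cassert>

// `mid` points into the middle of an array, so mid[-1] is still inside
// the same object - perfectly legal C/C++. A bottom-anchored
// [mid, mid+size) encoding cannot express this; it would flag the
// access even though the program is correct.
int element_before(int* mid) {
    return mid[-1];  // legal as long as mid-1 stays inside the same array
}
```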
If I understand correctly, Ivan's proposal is not my 1-extra-bit trick.
Instead it provides an N-bit field (at one time Ivan described it as
floating point, at another time as a 6-bit field) that indicates how
many low bits to ignore in the 2^N bounds comparison, while still
allowing a pointer into the middle of the region. So it does not have
the problem that my 1-extra-bit power-of-two pointers had - although it
costs 6 extra address bits.
So, see: the 1-extra-bit approach is prior art for power-of-2-aligned
checking. (If it is publicly visible anywhere.) But Ivan's N-bit field
approach is different.
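Sketched, my reading of that comparison (hypothetical names, nothing
from Ivan's actual encoding):

```cpp
#include <cassert>
#include <cstdint>

// The N-bit field in the pointer says how many low address bits to
// ignore, so an access is in bounds when the remaining high bits match
// those of the pointer. The pointer itself may sit anywhere inside the
// 2^ignore_low_bits granule.
bool same_granule(std::uintptr_t a, std::uintptr_t b,
                  unsigned ignore_low_bits) {
    return (a >> ignore_low_bits) == (b >> ignore_low_bits);
}
```

With a 6-bit field, same_granule(0x1000, 0x103F, 6) passes while
same_granule(0x1000, 0x1040, 6) fails.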
(BTW^2: when Cp/PL / MPX started, it was still considered essential to
make it available on a 32-bit machine, so consuming 5 address bits
would not have been acceptable. Even consuming 1 was borderline. On
64 bits, taking away 6 address bits may be acceptable. But then it
interferes with the various other software systems that place tags in
the upper or lower bits of the address - like ARMv8 allowing the upper
8 bits of the address to be ignored by the hardware virtual address
translation facility, reserved for tagged pointers. Therefore for
Cp/PL / MPX we just left the user's pointer bits alone and placed our
metadata somewhere else.)