
Intel announces Memory Protection Extensions


Paul A. Clayton

Jul 28, 2013, 7:54:37 PM
Some links to information on the Intel website are
available here:
http://software.intel.com/en-us/intel-isa-extensions#pid-16080-1495

I have not looked at this yet, but the description I looked over
(http://www.realworldtech.com/forum/?threadid=135174&curpostid=135274 )
seemed to indicate that the base/bound values can either be
directly loaded or be loaded from a general table (allowing
one to avoid the need for fat pointers).

From my extremely vague impression, this might be similar
to a technique referenced by Andy Glew called HardBound
(Devietti et al., 2008, "HardBound: Architectural Support
for Spatial Safety of the C Programming Language"--I have not
read this paper either!).

Sorry not to provide any real content on this, but since
no one else has posted this news here I thought that it
would be better to submit a low-content post quickly.

EricP

Jul 31, 2013, 11:59:16 AM
The instruction manual PDF is here

Intel® Architecture Instruction Set Extensions Programming Reference
http://software.intel.com/sites/default/files/319433-015.pdf

It also covers AVX-512, SHA (Secure Hash), random number instructions,
paging-mode access enhancements, and various other new instructions.

Eric

EricP

Jul 31, 2013, 1:08:32 PM
It looks unnecessarily complicated and less functional
than a straightforward CompareAndTrap instruction.

It stores bounds in a hardware defined table structure,
with each entry as a 4-tuple consisting of a
lower bound, upper bound, pointer, and reserved fields.
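
For concreteness, roughly what one such entry looks like as a C struct
(field names and widths are my own reading of the description, not
Intel's definitions):

#include <stdint.h>

/* One hardware-defined bounds-table entry: the 4-tuple described above. */
struct bt_entry {
    uint64_t lower_bound;    /* lowest valid address for the object  */
    uint64_t upper_bound;    /* highest valid address for the object */
    uint64_t pointer_value;  /* the pointer associated with these bounds */
    uint64_t reserved;       /* reserved field                       */
};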

But any language that has bounds checks already defines its own
software-defined buffer descriptors and doesn't put them all in one table
(usually they are created on the fly on the stack).
So this is pretty much useless for existing checking languages.

Having a single global bounds table looks problematic.

For any buffers that are allocated dynamically, it requires
management software for dynamically allocating and freeing
buffer descriptors in its global bounds table.
That adds unnecessary overhead.

Switching tables on the fly looks possible (and expensive) but
then you'd lose access to the descriptors in the prior table.

Eric


Terje Mathisen

Jul 31, 2013, 3:52:15 PM
EricP wrote:
> The instruction manual PDF is here
>
> Intel® Architecture Instruction Set Extensions Programming Reference
> http://software.intel.com/sites/default/files/319433-015.pdf

Thanks!
>
> It also covers AVX-512, SHA (Secure Hash), random number instructions,
> paging-mode access enhancements, and various other new instructions.

AVX-512 seems a _lot_ like the Larrabee/MIC 512-bit vector facility, the
only obvious exception I noticed was that it seems like they've skipped
the graphics-specific unpack/distribute functions.

I.e. AVX-512 is the fp parts of MIC: This is what's needed to enable a
generic IA core to run sw originally targeted for a MIC coprocessor?

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

Michael S

Jul 31, 2013, 4:32:23 PM
On Wednesday, July 31, 2013 10:52:15 PM UTC+3, Terje Mathisen wrote:
> EricP wrote:
>> The instruction manual PDF is here
>>
>> Intel® Architecture Instruction Set Extensions Programming Reference
>> http://software.intel.com/sites/default/files/319433-015.pdf
>
> Thanks!
>
>> It also covers AVX-512, SHA (Secure Hash), random number instructions,
>> paging-mode access enhancements, and various other new instructions.
>
> AVX-512 seems a _lot_ like the Larrabee/MIC 512-bit vector facility, the
> only obvious exception I noticed was that it seems like they've skipped
> the graphics-specific unpack/distribute functions.
>
> I.e. AVX-512 is the fp parts of MIC: This is what's needed to enable a
> generic IA core to run sw originally targeted for a MIC coprocessor?

Just recompile :-) Instruction encoding is totally different.

BTW, LRBNI is officially dead. Intel announced that the next MIC (Knights Landing) will run AVX512.

http://software.intel.com/en-us/blogs/2013/07/10/avx-512-instructions

Ivan Godard

Aug 1, 2013, 12:32:30 AM
I'm somewhat surprised that Intel found this to be worth doing, but the
customers must want it. It is only usable with compiler co-operation, so
it's not a cap but a debugging aid. Because the table is keyed by
pointer location rather than by pointer value it requires the compiler
to track where a given pointer came from, which seems buggy to me. And
it's a big hunk of hardware.

Several years ago we explored a sort of caps-lite approach that would
provide similar debugging assistance, but set it aside because we felt
that it didn't provide real (caps like) security and so wasn't worth the
trouble. Intel does listen to the customers, so I guess there's demand,
so I suppose we should dust ours off.

Comparing the two, both use natural-sized pointers, which is a must to
avoid breaking every data structure. Both permit pointer construction,
comparison and delta (address difference) operations, and conversion
to/from integer. Intel's is optional; ours is not although you can
program around it by treating the pointer as an integer, for example for
xor-linked lists.

Intel's requires extra code at point of use; ours does not. Intel's
potentially catches all invalid memory references; ours catches only
most of them. Intel's permits arbitrary (wild) address arithmetic; ours
treats that as an error whether you dereference the result or not (i.e.
we follow the C/C++ standard), although we do handle the C++
one-past-the-end correctly. Intel's has a range cache and memory tables
and associated hardware; ours does not. Intel's has a set of additional
registers to be managed; ours does not.

We will file ours, although it is not clear whether it will be included
in production Mills.

Terje Mathisen

Aug 1, 2013, 5:34:53 AM
Michael S wrote:
> On Wednesday, July 31, 2013 10:52:15 PM UTC+3, Terje Mathisen wrote:
>> AVX-512 seems a _lot_ like the Larrabee/MIC 512-bit vector facility, the
>> only obvious exception I noticed was that it seems like they've skipped
>> the graphics-specific unpack/distribute functions.
>>
>> I.e. AVX-512 is the fp parts of MIC: This is what's needed to enable a
>> generic IA core to run sw originally targeted for a MIC coprocessor?
>>
>
> Just recompile :-) Instruction encoding is totally different.

That seemed pretty obvious since the encodings originally selected for
LRB overlap a lot with stuff that has been used for other purposes in
modern IA versions.

Having to recompile to an architecture with the (nearly) exact same hw
features isn't a problem at this point, since MIC sw has been
custom-written up to now. I.e. everyone using it has their own source code.
>
> BTW, LRBNI is officially dead. Intel announced that the next MIC (Knights Landing) will run AVX512.
>
> http://software.intel.com/en-us/blogs/2013/07/10/avx-512-instructions

As long as AVX-512 is a 3-operand, 32-register, 512-bit, fully
predicated short vector machine, it looks a _lot_ more like the "second
coming of LRBNI" than a simple progression of MMX/SSE/AVX!
:-)

I.e. I believe I said something a few years ago to the effect that I
expected to see regular IA cpus capable of running LRB code; this is
Intel biting the incompatibility bullet caused by overlapping encodings.

I also expected heterogeneous cores but that might not happen just yet.
With a common instruction decoder for all cores it should be easier to
migrate tasks between various core types.

Michael S

Aug 1, 2013, 6:43:02 AM
On Thursday, August 1, 2013 12:34:53 PM UTC+3, Terje Mathisen wrote:
> Michael S wrote:
>> On Wednesday, July 31, 2013 10:52:15 PM UTC+3, Terje Mathisen wrote:
>>> AVX-512 seems a _lot_ like the Larrabee/MIC 512-bit vector facility, the
>>> only obvious exception I noticed was that it seems like they've skipped
>>> the graphics-specific unpack/distribute functions.
>>>
>>> I.e. AVX-512 is the fp parts of MIC: This is what's needed to enable a
>>> generic IA core to run sw originally targeted for a MIC coprocessor?
>>
>> Just recompile :-) Instruction encoding is totally different.
>
> That seemed pretty obvious since the encodings originally selected for
> LRB overlap a lot with stuff that has been used for other purposes in
> modern IA versions.
>
> Having to recompile to an architecture with the (nearly) exact same hw
> features isn't a problem at this point, since MIC sw has been
> custom-written up to now. I.e. everyone using it has their own source code.

I wonder what percentage of Knights Corner software is written specifically for its ISA (asm, intrinsics, etc.), what percentage relies on OpenMP, and what percentage targets higher-level frameworks like OpenCL.

>> BTW, LRBNI is officially dead. Intel announced that the next MIC (Knights Landing) will run AVX512.
>>
>> http://software.intel.com/en-us/blogs/2013/07/10/avx-512-instructions
>
> As long as AVX-512 is a 3-operand, 32-register, 512-bit, fully
> predicated short vector machine, it looks a _lot_ more like the "second
> coming of LRBNI" than a simple progression of MMX/SSE/AVX!
> :-)
>
> I.e. I believe I said something a few years ago to the effect that I
> expected to see regular IA cpus capable of running LRB code; this is
> Intel biting the incompatibility bullet caused by overlapping encodings.

Note that the blog post above only indicates that Knights Landing is going to run AVX512. Everything he said with regard to regular IA cpus is much less concrete. "Some future Xeon processors scheduled to be introduced after Knights Landing" does not sound especially reassuring. It could easily mean a successor to Knights Landing rather than a mainline IA core.


> I also expected heterogeneous cores but that might not happen just yet.
> With a common instruction decoder for all cores it should be easier to
> migrate tasks between various core types.

I think that in the context of general-purpose computing, heterogeneous cores in a single coherence domain are a horrible idea, and full software compatibility between cores (like Cortex-A7 with Cortex-A15) does not make it any less horrible. Now, big supercomputers on the one end and smartphones on the other end are only borderline general-purpose, so for them it can possibly work, despite all the ugliness. But from a 10'' tablet, through laptop, desktop, and DP server, up to the likes of the IBM p795, IMHO there is no place for [cache-coherent] heterogeneous cores.

MitchAlsup

Aug 1, 2013, 12:26:38 PM
On Wednesday, July 31, 2013 11:32:30 PM UTC-5, Ivan Godard wrote:
> Intel does listen to the customers, so I guess there's demand,
> so I suppose we should dust ours off.

Yes, there is a fairly large demand that wild pointer dereferencing
be managed somehow.

No, this does not seem to be the way--primarily because it requires
old codes to be recompiled. Just how does one recompile 15 year old
binaries?

Ivan Godard

Aug 1, 2013, 3:06:17 PM
On 8/1/2013 3:43 AM, Michael S wrote:


<snip>


>>
>> I also expected heterogeneous cores but that might not happen just yet.
>> With a common instruction decoder for all cores it should be easier to
>> migrate tasks between various core types.
>>
>
> I think that in the context of general-purpose computing, heterogeneous
> cores in a single coherence domain are a horrible idea, and full software
> compatibility between cores (like Cortex-A7 with Cortex-A15) does not
> make it any less horrible. Now, big supercomputers on the one end and
> smartphones on the other end are only borderline general-purpose, so
> for them it can possibly work, despite all the ugliness. But from a
> 10'' tablet, through laptop, desktop, and DP server, up to the likes of
> the IBM p795, IMHO there is no place for [cache-coherent] heterogeneous
> cores.

Why do you say so?

We reached the opposite conclusion, but are just speculating and have no
real customer feedback as evidence. We thought that a reasonable
configuration could be a couple of whopping big FP-oriented cores and a
much smaller integer-oriented core for the OS, to make a reasonable chip
for something like an NMR machine. We expected that having coherence
would make the programmer (app and OS both) job much easier.

You differ?

Of course, an NMR is only marginally general purpose. But how about a
laptop with a couple of big GP cores and one small one, where you shut
off the big ones while idling?

The alternative to hetero is of course homo with severe clock
throttling. On an x86 there is so much overhead in doing anything at
all that even hard throttling won't get the power down to what a 386 (in
a modern process) would use. On a Mill the power cost is much closer to
proportional to the work to be done, so throttling is a more reasonable
approach, but even so I don't expect to be able to throttle a Gold down
to a Tin power consumption.

Ivan

Robert Wessel

Aug 1, 2013, 10:26:08 PM
I'm not sure that being able to make 15-year-old binaries safe is a
meaningful goal. At best it would be tremendously difficult,
involving some sort of binary recompilation to attempt to extract
enough semantic context from the binary to instrument the recompiled
code.

OTOH, if providing some additional hardware facilities allows
compilers to start implementing good bounds checking at very low cost,
then that's certainly a win going forward, and could produce "safe"
versions of a substantial portion of the executed code in a reasonably
short period of time.

As regards MPX in particular, I see three big impediments. First,
availability: even though the new instructions appear to be defined
to turn into NOPs when run on a processor not supporting MPX, there's
not going to be much interest from developers until a significant user
base exists (although I suppose having the developers' workstations
support MPX would be a decent win all on its own). Second, it's not
really clear to me that the MPX extensions as defined really make
broadly useful bounds checking sufficiently inexpensive, and there are
existing (relatively) low overhead techniques that compilers can
implement that can provide some benefit - MPX has to be meaningfully
better than that. Third, arguably a better option in many cases is
to switch to a safer language than C or C++.

MitchAlsup

Aug 2, 2013, 11:42:10 AM
On Thursday, August 1, 2013 2:06:17 PM UTC-5, Ivan Godard wrote:
> We thought that a reasonable
> configuration could be a couple of whopping big FP-oriented cores and a
> much smaller integer-oriented core for the OS, to make a reasonable chip

What is wrong with 4 big whopping CPUs attached to 1024 FP units of a GPGPU?
More perf, less heat.

Mitch

Ivan Godard

Aug 2, 2013, 2:31:53 PM
Nothing, for jobs with enough regular structure to map to a gridder like
that. Grid architectures will own the grid application space, and we
won't try to compete, any more than we will compete in the spaces best
filled by custom ASICs. Or at least, we won't try until we grid Mills.

The configuration I described is competitive in the space which has a
few threads that are irregular in structure but FP-heavy. In that space
it would be better than any alternative, including GPGPU. Whether the
space is big enough to justify one of our configs is a market decision I
don't have to make :-)

Ivan

Andy (Super) Glew

Aug 4, 2013, 5:13:11 PM
to Ivan Godard
On 7/31/2013 9:32 PM, Ivan Godard wrote:
> On 7/28/2013 4:54 PM, Paul A. Clayton wrote:
>> Some links to information on the Intel website are
>> available here:
>> http://software.intel.com/en-us/intel-isa-extensions#pid-16080-1495
>>
>> I have not looked at this yet, but the description I looked over
>> (http://www.realworldtech.com/forum/?threadid=135174&curpostid=135274 )
>> seemed to indicate that the base/bound values can either be
>> directly loaded or be loaded from a general table (allowing
>> one to avoid the need for fat pointers).
>>
>> From my extremely vague impression, this might be similar
>> to a technique referenced by Andy Glew called HardBound
>> (Devietti et al., 2008, "HardBound: Architectural Support
>> for Spatial Safety of the C Programming Language"--I have not
>> read this paper either!).
>>
>> Sorry not to provide any real content on this, but since
>> no one else has posted this news here I thought that it
>> would be better to submit a low-content post quickly.
>>
>

By the way, the quick reference to the detailed description is
http://download-software.intel.com/sites/default/files/319433-015.pdf



I suspect - and am happy, but also embarrassed and concerned, to say -
that MPX is probably the emasculated remnant of my last major project at
Intel.

By the way, I must emphasize that I am not trying to claim credit for
MPX. I left what may have been a related project circa 2009, and left
Intel completely a few months later for the second and final time. The
guys who brought MPX to the point where Intel can announce it certainly
deserve credit for years of hard work. Thomas Edison said invention is
1% inspiration and 99% perspiration: I worked on the 1% inspiration
across many years, dating back to when I was at Gould in the late 1980s;
although I think that the 3-4 full time years I spent on it at Intel
counts as perspiration, my time was undoubtedly only a small fraction of
the work spent on MPX. Plus, my work may not even have fed into MPX at all.

Suspect: the people blogging about MPX are people I worked with, who
were co-inventors, e.g. on US patent application 20110078389 MANAGING
AND IMPLEMENTING METADATA IN CENTRAL PROCESSING UNIT USING REGISTER
EXTENSIONS.
(Example of an Intel blog about MPX:
http://software.intel.com/en-us/blogs/2013/07/23/intel-memory-protection-extensions-intel-mpx-design-considerations.)

Happy: It's nice to see some part of that work be made publicly visible.
It's also nice to see Intel providing some support for security, even
if it is quite limited.


Embarrassed and concerned: MPX, as announced, runs the risk of being
like the old Intel BOUND instruction - not very useful. It can be
emulated with other instructions. It is only a performance tweak, and I
would be surprised if it is a sizable performance tweak.

Embarrassed: having separate instructions to check lower bounds and
upper bounds looks very much like the famous example of the VAX INDEX
instruction
http://people.cs.umass.edu/~emery/classes/cmpsci691st/readings/Arch/RISC-patterson.pdf.
Instead of doing A <= B <= C, do
unsigned B-A <= C-A. Since C-A can be pre-calculated, I can only
imagine that Intel did not want to create another instruction that used
a 3-input adder, as would be necessary to calculate
precalc(C-A) + -A + B
and do the proper checks.
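
A quick C illustration of that trick (my example, not from the paper):

#include <stdint.h>

/* Check a <= b <= c with one unsigned compare, assuming a <= c.
   If b < a, then b - a wraps to a huge unsigned value and the test fails. */
static inline int in_range(uint64_t a, uint64_t b, uint64_t c)
{
    return (b - a) <= (c - a);    /* c - a can be precomputed once */
}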

Emasculated: (a) MPX is not an ISA that could be used to create
capability security, even if that is not the goal right now. Basically, it
doesn't do the checks as part of the memory access. (b) Excessive
overhead, e.g. separate lower and upper bound checks.

Concerned: my main concern is that MPX might give bounds checking such
a bad name that this avenue of ISA evolution may be cut off.


I hope that MPX will be good enough to demonstrate customer demand that
will justify further improvements. I am worried that it will not be good
enough.


> I'm somewhat surprised that Intel found this to be worth doing, but the
> customers must want it.

I am also surprised that Intel found the present form of MPX worth
doing. I would have expected the instruction overhead to be too high.

However, it seems that Intel may have made a software-only
implementation of this available:
http://software.intel.com/sites/products/parallelmag/singlearticles/issue11/7080_2_IN_ParallelMag_Issue11_Pointer_Checker.pdf

I can only conjecture that this relatively minor instruction set
extension in MPX provides enough incremental performance improvement to
be justified. And this is why I have hope: perhaps a succession of such
small evolutionary steps will lead to an MPX 2.0 that would not be
embarrassing.

Another possibility: as you know, Intel recently changed CEO, replacing
Paul Otellini. Andy Grove's favorite, Renee James, is #2 in a box at the
top, president, with Brian Krzanich.

James may be in search of a success to justify, prove, establish herself
as worthy of being #1. In particular, most of James' experience is in
managing Intel's SW group, including the acquisition of McAfee for 7.68
billion dollars. She probably needs to stamp her mark on a chip,
hardware rather than software, to establish credibility. Getting an
instruction set extension like MPX into a chip would be one such way -
even if it is really not

However, this worries me: we may have seen something similar happen and
not turn out well. Pat Gelsinger IMHO pushed Larrabee prematurely from a
prototype into what was supposed to be a product. Larrabee and Pat both
crashed and burned. I worry that James may be pushing MPX into a
product - perhaps not prematurely (since I happen to know that stuff
like MPX had simulated several billion times more instructions than
Larrabee), but with insufficient hardware support. I.e. MPX may have
been forced to make excessive compromises in order to fit into whatever
hardware budget it has been given for first ship.


> It is only usable with compiler co-operation, so
> it's not a cap but a debugging aid.

Any capability system requires memory allocator cooperation, e.g. to
create the bounds that are to be checked as part of the capability.
Compiler cooperation is needed if you are to create new, tighter bounds
every time you take the address, e.g. of a field in a structure.

But, yes, MPX is a debugging aid only because the compiler may choose to
generate bounds checking code or not. I.e. bounds checking is at the
discretion of the compiler, code generator, or programmer choosing
compiler options. The fact that the bounds check instruction is
decoupled, independent of, the memory reference makes it hard to do
mandatory security.

This is the thing that I find most embarrassing. MPX, as announced,
cannot be used as the basis for a real capability security architecture.

On the other hand, Ivan, we members of the cult of capability security
are a very small minority. In many, most, subcultures of computer
architecture, not being a capability architecture would be a good thing.

The really embarrassing and concerning thing is that there is a lot of
instruction overhead to use MPX. Possibly less overhead than if similar
bounds checking were generated with existing instructions - but I would
be interested to see how much.


> Because the table is keyed by pointer location rather than by pointer
> value it requires the compiler
> to track where a given pointer came from, which seems buggy to me.

Here I must correct you. (Actually, this whole post is mainly to correct
this single wrong statement.)

Think about it:

Any fat pointer capability system, like that of the IBM AS/400, really
has the bounds associated with the address of the memory location
containing the pointer not the address that the pointer points to.

MPX is just like that - except instead of using a fat pointer the
bounds, which in the AS400 would be adjacent to the actual pointer - are
placed in an outlying table. You might call it a non-adjacent fat pointer.

Actually, keying by pointer value is what would constitute breaking the
capability security model - most definitely if the table were global. A
global table indexed by pointer value is ambient security, whereas a
table indexed by pointer location is capability security, but only if
the path to any such pointer location itself is protected by
capabilities. (It is this latter that MPX does not make mandatory.)

You might argue that a table keyed by pointer value might be made into a
capability system by manipulating it constantly, e.g. by removing
pointers a particular function or method is not allowed to access.
I.e. by creating a local descriptor table, not a global descriptor
table. However, as far as I know the only really commercially
successful almost-capability systems, like the IBM AS-400, have been fat
pointer based systems. (I know, Norm Hardy and KeyKOS - but I am afraid
that I don't consider that commercially successful. And I don't know
the details of its table manipulations.)


> And it's a big hunk of hardware.

Actually, I doubt that it is a big chunk of hardware. 4 128 bit
registers - which might even be obtained by reusing the existing x86
segment registers, which *may* not be heavily used in 64 bit code.
(Which originally were not supposed to be used in 64-bit code at all,
although more and more their use has been revived.)

Something like a page table walker for lookup. Possibly microcode, and
hence cheap to build, but slow to use.

The bounds check instructions themselves are quite simple. IMHO
excessively simple, as mentioned above.

>
> Several years ago we explored a sort of caps-lite approach that would
> provide similar debugging assistance, but set it aside because we felt
> that it didn't provide real (caps like) security and so wasn't worth the
> trouble. Intel does listen to the customers, so I guess there's demand,
> so I suppose we should dust ours off.
>
> Comparing the two, both use natural-sized pointers, which is a must to
> avoid breaking every data structure. Both permit pointer construction,
> comparison and delta (address difference) operations, and conversion
> to/from integer. Intel's is optional; ours is not although you can
> program around it by treating the pointer as an integer, for example for
> xor-linked lists.

I agree that making such checks mandatory is desirable.

But I think that requiring such mandatory security in Day 1 would be the
kiss of death for such a product. Too much legacy code would not work.

The theory with which I started the project that may have led to MPX is
that mandatory checking must be optional, at least at first. It must be
possible for a compiler or assembly language programmer to bypass all
bounds checks and capability checks, at least at first. I would like
to be able to create systems where security is mandatory, OS level -
e.g. by marking particular pages so that full checking is required for
any reference to them, e.g. potentially for all memory. But not at first.

The other part of my theory is that there must be secure pointers and
insecure pointers. But that the distinction should be transparent, as
much as possible, to compiler or assembly language programmer. Secure
pointers will always have some overhead, in power if not performance;
but if the same instructions manipulate both secure and insecure
pointers, then compilers will always generate code sequences and library
functions to which secure pointers, possibly requiring mandatory
checking, can be passed.

I.e. my theory is that the most important aspect of such a security ISA
extension is making (possibly) secure code run as fast as possible on
insecure, unchecked, pointers. The speed with which secure code runs
secure, checked, pointers, is secondary.

Ideally, secure code running insecure pointers runs as fast as insecure
code, that does not have the possibility of checking.

MPX does not meet this goal.

However, I will admit: the customer demand for security was high enough
that perhaps this secure code/insecure data theory is not needed.

By the way: much of my thinking on this issue started out in comp.arch,
e.g. in a thread entitled "Segments, capabilities, buffer overrun
attacks", 5/14/03.

In that posting thread I also described how and why I thought that it
was important to be able to do an instruction set extension without OS
intervention, and how to do it. (Whether security related, or
otherwise.) (A security extension without OS intervention is probably
always going to be advisory security, not mandatory.)

Probably most important, in that posting thread I figured out how to do
2X fat pointers - pointers that contained the address of the object, and
the address of a descriptor, slightly less overhead than a full fat
pointer, which typically requires 4X the word size (address of object,
lower bound, upper bound, stuff). And I figured out how to make 2X fat
pointers secure and non-forgeable, by allocating the full 4X descriptor
in a protected memory page, with a pointer back to the pointer in user
memory. I stopped posting as soon as I realized that I was onto
something. But my employer at the time, AMD, was not interested. (Let
me emphasize: I only started that thread because I thought it was
something that was totally bluesky. I did not think it was going to lead
to something useful. I stopped as soon as I realized that I might be
onto something, but I don't think I was ever able to do any AMD
proprietary work on it.)

A few years later, at Intel, I picked up the work from where it had been
left off, in public, on comp.arch. At Intel I finally realized that I
could eliminate the fat pointer in user data space completely, so that
there would be no incentive for a compiler to ever generate a
datastructure that could not hold a secure pointer - since
sizeof(secure-pointer) was the same as sizeof(insecure-pointer), 4 bytes
on a 32 bit machine, 8 bytes on a 64 bit machine, albeit with 32 or more
bytes (256+ bits) allocated elsewhere for the descriptor.

The fat pointer in user data space can be eliminated completely by
mapping the address of the pointer (stored in memory) to a descriptor.

Milo Martin and his students (and others) did this by a linear address
translation: Address_of_Descriptor = Address_of_ptr * Scale + Offset.
Typically this requires 4X the address space to be used for descriptors
as is used for the actual user data. Much of the descriptor space would
be empty. Because such a linear transformation is 1:1, it inherently
supports non-forgeability (if the descriptor memory is protected).
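
In code, that translation is just a scaled index into a shadow region; a
minimal sketch, where the constants and names are illustrative choices of
mine, not values from Martin et al's papers:

#include <stdint.h>

/* Descriptor stored in the (protected) shadow region. */
struct shadow_descriptor {
    uint64_t lower;
    uint64_t upper;
    uint64_t spare1;   /* no back pointer needed for the 1:1 linear scheme */
    uint64_t spare2;
};

/* Illustrative values only: one 32-byte descriptor per 8-byte pointer slot,
   i.e. Scale = 4, with the shadow region placed at some fixed Offset. */
#define SHADOW_OFFSET 0x200000000000ULL
#define SHADOW_SCALE  4ULL

static inline struct shadow_descriptor *descriptor_for(void **ptr_location)
{
    uintptr_t ap = (uintptr_t)ptr_location;  /* Ap, the address of the pointer */
    return (struct shadow_descriptor *)(ap * SHADOW_SCALE + SHADOW_OFFSET);
}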

A more flexible table, like a tree, like the tree used for page tables /
virtual address translation, might require less memory overhead to store
descriptors. But it requires the back pointer to make the secure pointer
non-forgeable.

Intel MPX's BNDLDX/BNDSTX instructions, with the simple BD (Bounds
Directory) and BT (Bounds Table) data structure is intermediate. It
requires less address space than Martin et al's linear transformation,
but it still has a glass jaw if sparse - e.g. if a single pointer is
allocated in a quarter page, an entire 4K page of memory is required,
with only a single pointer descriptor (Bound Table Entry) stored in it.
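
Roughly, the BNDLDX-style walk looks like a two-level page-table walk over
the address of the pointer; a sketch, where the bit splits, masks, and entry
formats are placeholders of mine rather than the ones in the 319433 manual:

#include <stdint.h>

struct bt_entry { uint64_t lower, upper, pointer_value, reserved; };

/* Hypothetical BNDLDX-style walk: the address of the pointer (Ap) is split
   into a Bounds Directory index and a Bounds Table index. */
static struct bt_entry *lookup_bounds(uint64_t bd_base, void **ptr_location)
{
    uintptr_t ap = (uintptr_t)ptr_location;
    uint64_t *bd_entry =
        (uint64_t *)(bd_base + ((ap >> 20) & 0xFFFFF) * sizeof(uint64_t));
    if (*bd_entry == 0)
        return 0;   /* no bounds table allocated for this region yet */
    uint64_t bt_base = *bd_entry & ~0xFFFULL;
    return (struct bt_entry *)
        (bt_base + ((ap >> 3) & 0x1FFFF) * sizeof(struct bt_entry));
}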

INTERESTING TECHNICAL NOTE (at least interesting to me): from
http://download-software.intel.com/sites/default/files/319433-015.pdf

I notice that MPX's BTE (Bound Table Entry) contains
* valid bit (of course)
* lower bound (of course)
* upper bound (of course)
* pointer value - ??
* reserved

The bounds are obvious.

The "pointer value" sounds like the backpointer, the address of the
pointer, that I first figured out the need for in the old comp.arch Usenet
thread "Segments, capabilities, buffer overrun attacks", 5/14/03.
But you only need the address of the pointer / backpointer if you have a
fairly arbitrary datastructure mapping pointers stored in memory to
their bounds. I needed it for the 2003 post, with semi-fat pointers.
You need it for fairly arbitrary datastructures, like classic binary
trees. But you don't need it if you are doing address scaling/linear
transformations, like Martin et al. And you don't need it for the
BD/BT/BTE bounds table that the MPX docs seem to describe - which can be
described as semi-linear, or linear in the BD and BT. You might need it
for the MPX tables if two BDs can point to the same BT, e.g. as an
optimization to permit reducing memory by sharing BTs, using the address
of the pointer to disambiguate. But usually I am the only guy krazy
enough to think of doing something like that to reduce memory
consumption. Barring that, the MPX tables do not need to have the
"address of the pointer".

Notice also that they call it the "pointer value", whereas I have talked
about "address of pointer" or "backpointer". "Pointer value" sounds
like the address of the object that the pointer is pointing to, rather
than the address of the pointer itself.

Later, on page 9-10 of the reference, you can see examples of the
compact notation I use to discuss this: "Bounds in memory are associated
with the memory address where the pointer is stored, i.e., Ap. Linear
address LAp ..."

I.e. the pointer lives in user memory at address Ap, and contains the
address of the object, Ao. The descriptor lives somewhere else,
non-adjacent - I call it address Ad, for obvious reasons - and contains
Ao_lwb, and Ao_upb. My "backpointer" in the descriptor contains the
original Ap. It is not clear to me if the MPX descriptor / bound table
entry's "pointer value" is Ap or Ao.

The MPX BTE doesn't need the Ap for lookup. Not even for full security.

But for a large array, many possible Ao's could be used.

Q: what is this? What is this used for?

Prefetching?

Here's a thought that might be hopeful for those of us who have drunk
the Koolaid of capabilities: if there is circuitry to check that the
"pointer value in the BTE matches the pointer value, Ao, used by the
checking instructions BNDCL / BNDCU, and also the pointer used in an
actual memory reference, then this COULD be used for full, mandatory,
capability security.

But that seems extremely complicated. I would never propose that for an
efficient implementation. I am sure that Intel is not currently doing
it. But it is the sort of thing that I might keep in an unpublished
"future directions for the architecture" spec, if I felt that full
mandatory security was desirable even for the present, limited, MPX ISA.

So, it's an open question: exactly what is this "pointer value" in the BTE
used for?

> Intel's requires extra code at point of use; ours does not.

I consider it unfortunate that Intel's requires extra code to (a) load
(and store) the bounds, and (b) check them.

The latter is especially important, since it prevents a proper
capability security architecture from ever being achieved with the
instructions in the present MPX.

But, like I said, most people do not think that full capability security
should be a goal.


> Intel's
> potentially catches all invalid memory references; ours catches only
> most of them. Intel's permits arbitrary (wild) address arithmetic; ours
> treats that as an error whether you dereference the result or not (i.e.
> we follow the C/C++ standard), although we do handle the C++
> one-past-the-end correctly.


> Intel's has a range cache and memory tables
> and associated hardware; ours does not.

Does Intel's have a range cache? I didn't see that.


> Intel's has a set of additional
> registers to be managed; ours does not.

My initial direction was to not have any additional SW visible
registers: hence US patent application 20110078389 MANAGING AND
IMPLEMENTING METADATA IN CENTRAL PROCESSING UNIT USING REGISTER
EXTENSIONS, where each general purpose register used for addressing had
an extension.

I would still prefer that. Meets my theoretical goal of having secure
code run insecure pointers with no unnecessary instructions.

However, as it became obvious that the ISA extension would only be given
a minimal hardware budget (it was competing with Larrabee), I eventually
acquiesced to having separate bounds/capability registers. I.e. I was
willing to compromise my design principles. I was only willing to do
this after wrestling with myself, eventually making myself comfortable
that it was still capability security, even though it might not reach my
minimal instruction overhead goals.
I.e. I was willing to compromise performance and compatibility, but not
security. Interestingly, some of the other members of the team were
vociferously opposed to this compromise, using words like "They would be
embarrassed to have their names associated with it." This was
essentially the last straw, that resulted in me leaving - being pushed
off - the instruction set extension project that was based on my ideas
and inventions. And yet MPX is now announced, both with the compromise
that I was willing to make (separate bounds) and the compromise that I
was not willing to make (no possibility of being a capability ISA).



>
> We will file ours, although it is not clear whether it will be included
> in production Mills.


--
The content of this message is my personal opinion only. Although I am
an employee (currently Imagination Technologies, which acquired MIPS; in
the past of companies such as Intellectual Ventures/QIPS, Intel, AMD,
Motorola, and Gould), I reveal this only so that the reader may account
for any possible bias I may have towards my employers' products. The
statements I make here in no way represent my employers' positions on
the issue, nor am I authorized to speak on behalf of my employers, past
or present.

Andy (Super) Glew

Aug 4, 2013, 5:20:03 PM
When I was working on what may have been a predecessor version of MPX,
we used on-the-fly binary translation / instrumentation, essentially
Pin, to run legacy codes.

Now, Pin is mainly an instrumentation tool suitable for studies. But a
binary translator is certainly a possibility.

The biggest problem is that somebody has to set the bounds. If you can
replace just malloc, you get a lot of coverage. Many bounds can be
inferred from symbol tables, and a surprisingly large number of legacy
executables have symbol tables. Stack frames can be inferred. But if
the compiler does

struct foo { int a; int b; } bar;

int* iptr = &bar.a;

then the compiler must insert the instruction that sets the bounds.



I.e. you can get a lot of coverage for legacy binaries, but certainly
not all.
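
For the malloc part, the interposition is simple; a sketch, where
record_bounds() is a hypothetical hook standing in for whatever actually
populates the bounds (BNDMK/BNDSTX, or a software table):

#include <stdlib.h>

void record_bounds(void *obj, size_t size);   /* hypothetical hook */

/* With GNU ld's --wrap=malloc, calls to malloc() land here and
   __real_malloc() is the original allocator. */
void *__real_malloc(size_t size);

void *__wrap_malloc(size_t size)
{
    void *p = __real_malloc(size);
    if (p)
        record_bounds(p, size);   /* bounds cover the whole allocation */
    return p;
}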



However, it is not necessary to cover all legacy binaries.

Dusty decks - C or C++ code that can be recompiled, but which cannot be
recompiled with the bounds checking option of a modern compiler, because
it is not ANSI compatible - is quite common.



Q: Mitch, do you have any way of securing legacy binaries?

The Raksha people (DIFT / taint propagation) have an interesting approach.

Ivan Godard

Aug 4, 2013, 7:03:24 PM
Wow.

I agree that caps must have the constraints associated with the pointer;
aliased permissions are essential for caps. Tying the constraints to
the address of the pointer is just a way to do fat pointers in thin
spaces, and so works for caps. However, if the goal is just a debugging
aid rather than actual protection then you can tie the constraints to
the object instead.

Different purposes yield different methods and different costs. Tie to
the pointer and you get differential caps for the same object; tie to
the object and you have one bounds per object not one bounds per
pointer. Whether that makes a difference depends on the average arc fan-in.

You (Andy) and I fundamentally disagree over whether incremental
capsulation is viable. You think that it's the only way to get caps in
the marketplace; I think that if that's the only way then there is no way.

That judgement is somewhat weighted by our respective opinions about OOO.
The cost of loading disjoint bounds for each pointer can be fairly well
hidden in an OOO because in practice the machine is stalled for the
pointer fetch anyway, and if you organize memory right the bounds will
be in a different controller's province. If you miss on the bounds while
hitting on the pointer the OOO can speculate through the check, and if
you have enough in-flight capacity you won't block in the retire stage;
actual bounds failure will be rare enough to ignore.

I lack your faith in OOO, so I really want those checks to be done at
effective-address time. I can speculate a little, but nowhere near
enough to survive a bounds miss. True fat pointers are OK, because the
bounds will be fetched with the pointer, but disjoint bounds are not
only a big waste of memory (sparse) but will also independently miss.
Disjoint works for tag bits, because they are dense; people would be
willing to pay 1/64th of memory. Pay half of memory for disjoint bounds,
where most of the half is empty? I doubt it.


More interlinear.
Exactly; that's why I was talking about debugging aids, not security.

> Concerned: my main concern is that MPX might give bounds checking such
> a bad name that this avenue of ISA evolution may be cut off.
>
>
> I hope that MPX will be good enough to demonstrate customer demand that
> will justify further improvements. I am worried that it will not be good
> enough.

Nah - my guess is people will mostly ignore MPX in particular, not
security in general.
Yes, a slice op (LEA with bounds) is essential, and must be dirt cheap.
The great majority of pointers created are function arguments that get
pushed onto the stack, and the traffic to write the bounds out will be
painful.

Speaking of which, I didn't see any way to *invalidate* a bounds, so a
later call of an unchecked function can use dead bounds if it happens to
use the same stack address for the pointer. I suppose that doesn't
matter; after all, it is unchecked. But what happens when you
dereference a stale pointer to a free'd region? The check would still
pass, unless the bounds are clobbered by free, but free doesn't know
where the stale pointer is and so doesn't have the key.
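
For example (my illustration):

#include <stdlib.h>

void stale(void)
{
    int *p = malloc(100 * sizeof *p);   /* bounds recorded as [p, p + 100) */
    free(p);                            /* free() does not know where p's bounds live */
    p[5] = 42;                          /* stale pointer, stale bounds: check still passes */
}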

> But, yes, MPX is a debugging aid only because the compiler may choose to
> generate bounds checking code or not. I.e. bounds checking is at the
> discretion of the compiler, code generator, or programmer choosing
> compiler options. The fact that the bounds check instruction is
> decoupled, independent of, the memory reference makes it hard to do
> mandatory security.
>
> This is the thing that I find most embarrassing. MPX, as announced,
> cannot be used as the basis for a real capability security architecture.
>
> On the other hand, Ivan, we members of the cult of capability security
> are a very small minority. In many, most, subcultures of computer
> architecture, not being a capability architecture would be a good thing.

'Tis true 'tis, 'tis pity. 'Tis pity 'tis, 'tis true.

> The really embarrassing and concerning thing is that there is a lot of
> instruction overhead to use MPX. Possibly less overhead than if similar
> bounds checking were generated with existing instructions - but I would
> be interested to see how much.
>
>
>> Because the table is keyed by pointer location rather than by pointer
>> value it requires the compiler
>> to track where a given pointer came from, which seems buggy to me.
>
> Here I must correct you. (Actually, this whole post is mainly to correct
> this single wrong statement.)
>
> Think about it:
>
> Any fat pointer capability system, like that of the IBM AS/400, really
> has the bounds associated with the address of the memory location
> containing the pointer not the address that the pointer points to.
>
> MPX is just like that - except instead of using a fat pointer the
> bounds, which in the AS400 would be adjacent to the actual pointer - are
> placed in an outlying table. You might call it a non-adjacent fat
> pointer.

Exactly. So why not extend the ISA with a "load pointer" operation that
does it all?

> Actually, keying by pointer value is what would constitute breaking the
> capability security model - most definitely if the table were global. A
> global table indexed by pointer value is ambient security, whereas a
> table indexed by pointer location is capability security, but only if
> the path to any such pointer location itself is protected by
> capabilities. (It is this latter that MPX does not make mandatory.)

Yes, I agree. But MPX can't be used for caps, so you may as well make it
a good debugging aid, and for most bugs all you need is simple
array-overrun checks, which can be done at the object and don't need
differential views, and much cheaper than MPX.

> You might argue that a table keyed by pointer value might be made into a
> capability system by manipulating it constantly, e.g. by removing
> pointers a particular function or method is not allowed to access. I.e.
> by creating a local descriptor table, not a global descriptor table.
> However, as far as I know the only really commercially successful
> almost-capability systems, like the IBM AS-400, have been fat pointer
> based systems. (I know, Norm Hardy and KeyKOS - but I am afraid that I
> don't consider that commercially successful. And I don't know the
> details of its table manipulations.)

No such argument; key by pointer is not cap. But it cheaply catches
bugs, if the number of objects is smaller than the number of pointers
anyway.

>> And it's a big hunk of hardware.
>
> Actually, I doubt that it is a big chunk of hardware. 4 128 bit
> registers - which might even be obtained by reusing the existing x86
> segment registers, which *may* not be heavily used in 64 bit code.
> (Which originally were not supposed to be used in 64-bit code at all,
> although more and more their use has been revived.)
>
> Something like a page table walker for lookup. Possibly microcode, and
> hence cheap to build, but slow to use.
>
> The bounds check instructions themselves are quite simple. IMHO
> excessively simple, as mentioned above.

You're going to need a BLB.

>>
>> Several years ago we explored a sort of caps-lite approach that would
>> provide similar debugging assistance, but set it aside because we felt
>> that it didn't provide real (caps like) security and so wasn't worth the
>> trouble. Intel does listen to the customers, so I guess there's demand,
>> so I suppose we should dust ours off.
>>
>> Comparing the two, both use natural-sized pointers, which is a must to
>> avoid breaking every data structure. Both permit pointer construction,
>> comparison and delta (address difference) operations, and conversion
>> to/from integer. Intel's is optional; ours is not although you can
>> program around it by treating the pointer as an integer, for example for
>> xor-linked lists.
>
> I agree that making such checks mandatory is desirable.
>
> But I think that requiring such mandatory security in Day 1 would be the
> kiss of death for such a product. Too much legacy code would not work.

Yet (IMO) it's the only way it will ever be adopted; half measures are
too attractive to real programmers. And managers.

> The theory with which I started the project that may have led to MPX is
> that mandatory checking must be optional, at least at first. It must be
> possible for a compiler or assembly language programmer to bypass all
> bounds checks and capability checks, at least at first. I would like
> to be able to create systems where security is mandatory, OS level -
> e.g. by marking particular pages so that full checking is required for
> any reference to them, e.g. potentially for all memory. But not at first.

I don't buy it. But you knew that :-)

<snip>


> So, it's an open question: exactly what is this "pointer value" in the BTE
> used for?
>

Beats me.



Andy (Super) Glew

Aug 4, 2013, 7:51:18 PM
On 8/1/2013 7:26 PM, Robert Wessel wrote:
>
> As regards MPX in particular, I see three big impediments.

> First,
> availability: even though the new instructions appear to be defined
> to turn into NOPs when run on a processor not supporting MPX, there's
> not going to be much interest from developers until a significant user
> base exists

Yep.
But, as I say over and over: the threshold depends on the market segment.

Some ISVs won't start using a new feature until it is in 90% of the
target platforms (PCs, servers, etc.) that they are trying to sell their
software to.

Others will ship much earlier.

The "NOPability" is just another way of reducing software development
costs. There is a significant cost in having to ship and maintain two
binaries, using two different instruction sets. Because of this cost,
you may not be willing to support a new ISAX (ISA eXtension) until it
reaches, say, 40% of the market. Heck: many ISVs only maintain a single
x86 version, so they won't use new instructions until they are in 90% or
more of x86 target PCs.

But if the new instructions are NOPs on old machines, and if the NOPs
cost very little performance, then a SW vendor may be willing to ship
binaries using the new instructions when the new instructions are
implemented at a much lower threshold in the market place. E.g. you
might be willing to use MPX if 20% of the marketplace has the new
instructions, so long as the code runs as NOPs on all of remaining 80%
of the PCs in the target market, with a slowdown of 1-2%.




> (although I suppose having the developers' workstations
> support MPX would be a decent win all on its own).

Yes, that's part of the pitch: even if code compiled --check-pointers is
too slow to use in production, just running it regularly during
development helps a lot. Tools like purify are often too slow for
regular use - slowdowns of 11x are reported, e.g. by
http://www.microquill.com/heapagent/ha_comp.htm#6. ($$$ costs, too.)

Whereas if --compile-check runs only, say, 20-50% slower than best
optimized non-checked code, using it in development tests becomes much
more reasonable.

And, when you get fast enough, you leave the checks enabled for actual
production code shipped to customers. The acceptable threshold - 1%,
5%, whatever - depends on the application - some customers are
especially performance sensitive, and require less than 1% slowdown,
while other customers are I/O bound, and may accept 20% slowdowns even
in shipping code.


> Second, it's not
> really clear to me that the MPX extensions as defined really make
> broadly useful bounds checking sufficiently inexpensive, and there are
> existing (relatively) low overhead techniques that compilers can
> implement that can provide some benefit - MPX has to be meaningfully
> better than that.

That's my biggest concern about MPX.

The bounds checking instructions, BNDCL and BNDCU, replace code
sequences that look like

LEA areg := basereg + indexreg*scale + offset
CMP areg, upb_reg    ; compare the effective address with the upper bound
JLE 1f               ; in bounds: skip the trap
INT #BR              ; out of bounds: raise the bound-range exception
1:

OK, that's a somewhat significant reduction in number of instructions,
and possibly in BTB entries, so it *might* be good enough to get market
acceptance.

But it is still less than if BNDCL and BNDCU were a single instruction,
or if the checking could be folded into the address computation and
checks of a memory reference instruction.

(Good compilers will remove many checks, in any case where there are
separate checking instructions. (Checks cannot be removed and still
have full security.) But that applies both to MPX and non-MPX code. I
can only hope that Intel's experience with the --check-pointer compiler
option indicates that there is a big performance win here.)

The bounds table access instructions, BNDLDX and BNDSTX, unlike
BNDCL/BNDCU, are compatible with "full mandatory security".
Much depends on how fast these instructions are implemented.
Worst case they require 2 memory accesses each. They may be microcoded,
or they may use a table walking state machine. There may be a special
cache for the bounds descriptors. I'll guess that the initial MPX
implementations will be very unoptimized, so these instructions will
probably be slow, and important to eliminate.

By comparison, the linear address translation that Milo Martin et al use
typically uses a single scaling and a single memory access - 1-3
instructions.

One issue is compatibility with multiprocessor parallel programs. The
x86 architecture defines aligned loads and stores to be atomic, at least
of 32 bits (and probably 64 bits, de-facto - pointer atomicity).
But if loading a protected pointer requires two separate instructions,
the ordinary MOV to load the actual pointer, and BNDLDX, it is probably
not atomic. Of course, Intel has HLE and TSX, transactional memory.






> Third, arguably a better option in many cases is
> to switch to a safer language than C or C++.

Arguably.

Recoding to a different programming language is a lot more work than
recompiling with --check-pointer.

Andy (Super) Glew

Aug 4, 2013, 9:17:28 PM
On 7/31/2013 9:32 PM, Ivan Godard wrote:
> I'm somewhat surprised that Intel found this to be worth doing, but the
> customers must want it. It is only usable with compiler co-operation, so
> it's not a cap but a debugging aid.

Although note that MPX code can be shipped as production code, to
customers. If the performance is good enough, it is not just
restricted for use in-house by developers.

Yes, it does not catch all security bugs.

Yes, it does not allow a secure piece of code to call an insecure
library routine. It is trivial for code, once injected into the
system, to bypass MPX security.

But MPX is good enough to catch many situations where the bad guy would
try to exercise a buffer overflow.



> Because the table is keyed by
> pointer location rather than by pointer value it requires the compiler
> to track where a given pointer came from, which seems buggy to me.

I continue to be surprised that you can say this.

I suspect it is just a slip.

Keying the table on pointer value, Ao, the address of the object, is
just plain incorrect.

Reason: many objects overlap. Many objects have the same starting address.

E.g.

struct S { int a; int b; } sobj;
S* sptr = &sobj;
int* aptr = &(sobj.a);

Now, obviously the bounds for sptr should be [&sobj,...+sizeof(S))
and the bounds for aptr = [&sobj.a,...+sizeof(int))
which is [&sobj,...+sizeof(int))

They overlap. If you do a load with a pointer containing &sobj, how
do you know which bounds to use? Whereas, if you have the bounds loads
use the address of the pointer, and just dataflow the bounds around, it
just works.

If there is a bounds check register extension for every GPR that can
contain an address, this is easy to see. It's just like the AS400, in
registers, and the bound table is just a way of getting the descriptor
that is equivalent to a fat pointer, but not adjacent.

Since MPX has fewer bounds registers than GPRs, the compiler must track
the association. And it won't work quite so transparently across
procedure calls - either you will lose bounds across calls, or the
compiler, the ABI, will have to be extended to pass bounds around. But
it is still doable. Value in either case.

E.g.

int foo(void*& vptref) {
S*& svptref = (S*&)(vptref);
int bv = svptref->b;
return bv;
}
foo(sptr); // okay
foo(aptr); // error

The calls of foo(sptr) and foo(aptr) must both be allowed. Especially if
foo is separately compiled, but inlined.

But foo(sptr) should be okay, whereas foo(aptr) should generate a
bounds-checking error - at least a dynamic error.

But the "int bv = svptref->b" inside foo
will probably translate to an ordinary load,
possibly with bounds checking instructions.
Swagging what the MPX instructions would look like, if svptref is in RAX

bnd0 := BNDLDX( rax )
rbx := MOV_load64bits Memory[ rax ]
BNDCL bnd0, memory_address:rbx+8
BNDCU bnd0, memory_address:rbx+15
rcx := MOV_load64bits Memory[ rbx+8 ]

OK, I cheated a bit in that I passed a void*& vptref rather than a void*
vptr. I did that so I would not have to explore passing the bounds via
an ABI extension. But I think it is okay to leave that as an exercise
for the reader.

Ivan Godard

Aug 4, 2013, 10:05:25 PM
On 8/4/2013 6:17 PM, Andy (Super) Glew wrote:
> On 7/31/2013 9:32 PM, Ivan Godard wrote:
>> I'm somewhat surprised that Intel found this to be worth doing, but the
>> customers must want it. It is only usable with compiler co-operation, so
>> it's not a cap but a debugging aid.
>
> Although note that MPX code can be shipped as production code, to
> customers. If the performance is good enough, it is not just
> restricted for use in-house by developers.
>
> Yes, it does not catch all security bugs.
>
> Yes, it does not allow a secure piece of code to call an insecure
> library routine. It is trivial for code, once injected into the
> system, to bypass MPX security.
>
> But MPX is good enough to catch many situations where the bad guy would
> try to exercise a buffer overflow.
>
>
>
>> Because the table is keyed by
>> pointer location rather than by pointer value it requires the compiler
>> to track where a given pointer came from, which seems buggy to me.
>
> I continue to be surprised that you can say this.
>
> I suspectr it is just a slip.
>
> Keying the table on pointer value, Ao, the address of the object, is
> just plain incorrect.
>
> Reason: many objects overlap. Many objects have the same starting address.

Well yes, of course. But if you simply set the bounds to the whole
malloc (plus 1 for C++) and check within that then you will catch the
large majority of actual bugs. True, if you have an array inside a
struct then you won't catch an overrun of the array that is not also an
overrun of the struct, but you're already not catching other things, so
this is one more, and I'd happily swap a few escapees for cheap, simple,
enforced, and universal.

YMMV. Actually, you're trying to make it look like caps. It ain't caps,
but it might be useful, and worth the hardware if you restrict it to
what it can do rather than what it can't.

Ivan


Terje Mathisen

unread,
Aug 5, 2013, 4:10:23 AM8/5/13
to
Andy (Super) Glew wrote:
> Another possibility: as you know, Intel recently changed CEO, replacing
> Paul Otellini. Andy Grove's favorite, Renee James, in #2 in a box at the
> top, president, with Brian Krzanich.
>
> James may be in search of a success to justify, prove, establish herself
> as worthy of being #1. In particular, most of James' experience is in
> managing Intel's SW group, including the acquisition of McAfee for 7.68
> billion dollars. She probably needs to stamp her mark on a chip,
> hardware rather than software, to establish credibility. Getting an
> instruction set extension like MPX into a chip would be one such way -
> even if it is really not

Ouch.
>
> However, this worries me: we may have seen something similar happen and
> not turn out well. Pat Gelsinger IMHO pushed Larrabee prematurely from a
> prototype into what was supposed to be a product. Larrabee and Pat both
> crashed and burned. I worry that James may be pushing MPX into a
> project - perhaps not prematurely (since I happen to know that stuff
> like MPX had had simulated several billion times more instructions than

This tripped my bogosity meter:

"Several billion times more" means 1e9 to 1e10, so as long as LRB ran at
least a single week's worth of SW emulation (which I _know_ that it did
quite early in the evaluation process), Intel must have dedicated at
least 1e7 cores for MPX type simulation for the last 5-10 years.

Alternatively, with hw simulation giving one or two orders of magnitude of
performance improvement you'd only need 1e5 such simulators.

Somehow I really doubt Intel has _this_ much spare resources...

> Larrabee), but with insufficient hardware support. I.e. MPX may have
> been forced to make excessive compromises in order to fit into whatever
> hardware budget it has been given for first ship.

This seems like a recurring issue: Every time someone has a sufficiently
good synergistic idea, it gets picked apart and partially implemented,
to the point where it is very hard to see that it makes any difference. :-(

Andy (Super) Glew

unread,
Aug 5, 2013, 10:17:22 AM8/5/13
to
On 8/4/2013 4:03 PM, Ivan Godard wrote:
> I lack your faith in OOO, so I really want those checks to be done at
> effective-address time. I can speculate a little, but nowhere near
> enough to survive a bounds miss. True fat pointers are OK, because the
> bounds will be fetched with the pointer, but disjoint bounds are not
> only a big waste of memory (sparse) but will also independently miss.
> Disjoint works for tag bits, because they are dense; people would be
> willing to pay 1/64th of memory. Pay half of memory for disjoint bounds,
> where most of the half is empty? I doubt it.

I tend to agree about this. It is one of the big problems that I have
with address scaling (linear transforms) for "large metadata" that may
be sparse: 4X the memory for the metadata? If there is a single
pointer per page, wasting 4X the address space, and 511/512ths of the
physical memory?

MPX's bounds table seems to me to not waste so much address space, but
it will waste 511/512ths of the metadata area in the worst case
of 1 pointer every 512 bytes.




>> I hope that MPX will be good enough to demonstrate customer demand that
>> will justify further improvements. I am worried that it will not be good
>> enough.
>
> Nah - my guess is people will mostly ignore MPX in particular, not
> security in general.

That would be embarrassing. The worry is that MPX may lead to dissing
of this sort of approach.

>
> Speaking of which, I didn't see any way to *invalidate* a bounds,

BNDSTX

i.e. if you want to invalidate a bounds, you store a new bounds, or an
invalid bounds, for the pointer.

Just like in fat pointers: when you write to a pointer you write the
bounds. In fat pointers that may be a single wide write. Except in MPX,
it may be two stores: one to the pointer at Ap, and the other a BNDSTX
to the bounds for Ap.

> But what happens when you
> dereference a stale pointer to a free'd region? The check would still
> pass, unless the bounds are klobbered by free, but free doesn't know
> where the stale pointer is and so doesn't have the key.

Although I will point out that if free zeroes the memory, and if the
zeroing also zeroes the bounds...

Perhaps this is what the pointer value is for: a cross check, that will
catch some, but not all, instances of reallocated pointers.
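
Spelled out in plain C (deliberately buggy, no MPX intrinsics; the
bounds in the comments are only conceptual), the reuse case being
discussed looks like:

#include <stdlib.h>

void stale_pointer_demo(void)
{
    char *p = malloc(64);      /* conceptually: bounds [p, p+64)         */
    char *saved = p;           /* the copy carries the same stale bounds */
    free(p);
    char *q = malloc(16);      /* the allocator may reuse the same block */
    if (q != NULL && q == saved)
        saved[32] = 0;         /* passes the old 64-byte bounds, but
                                  overruns the new 16-byte object        */
    free(q);
}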


>> Any fat pointer capability system, like that if the IBM AS400, really
>> has the bounds associated with the address of the memory location
>> containing the pointer not the address that the pointer points to.
>>
>> MPX is just like that - except instead of using a fat pointer the
>> bounds, which in the AS400 would be adjacent to the actual pointer - are
>> placed in an outlying table. You might call it a non-adjacent fat
>> pointer.
>
> Exactly. So why not extend the ISA with a "load pointer" operation that
> does it all?

Well, I will admit that that has always been my preference.

I can only conjecture that Intel did not want to create such an instruction.

With the typical 4111 decoder, an instruction that, in microcode, used
two uops to load from two separate locations would suffer a big
performance loss relative to two separate instructions.

Plus, with separate instructions, a compiler can apply common subexpression elimination.
E.g. if the compiler has two pointers that it thinks point into the same
array, it need only load the bounds once, and can use the same bounds
for both pointers.
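
A trivial C illustration of that opportunity (the comment describes
what the compiler could emit; nothing here is an intrinsic):

/* assumes n >= 1; both derived pointers refer to the same array, so a
   single bounds load could serve both checked dereferences */
float sum_ends(const float *arr, int n)
{
    const float *first = arr;
    const float *last  = arr + (n - 1);
    return *first + *last;
}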

It's not necessarily capabilities.

Andy (Super) Glew

unread,
Aug 5, 2013, 10:19:57 AM8/5/13
to
On 8/4/2013 7:05 PM, Ivan Godard wrote:
> On 8/4/2013 6:17 PM, Andy (Super) Glew wrote:

>> Keying the table on pointer value, Ao, the address of the object, is
>> just plain incorrect.
>>
>> Reason: many objects overlap. Many objects have the same starting
>> address.
>
> Well yes, of course. But if you simply set the bounds to the whole
> malloc (plus 1 for C++) and check within that then you will catch the
> large majority of actual bugs. True, if you have an array inside a
> struct then you won't catch an overrun of the array that is not also an
> overrun of the struct, but you're already not catching other things, so
> this is one more, and I'd happily swap a few escapees for cheap, simple,
> enforced, and universal.

A very large fraction of bugs are buffer overflows between adjacent
variables on the stack.

Michael S

unread,
Aug 5, 2013, 10:36:59 AM8/5/13
to
On Monday, August 5, 2013 11:10:23 AM UTC+3, Terje Mathisen wrote:
>
>
> This seems like a recurring issue: Every time someone has a sufficiently
> good synergistic idea, it gets picked apart and partially implemented,
> to the point where it is very hard to see that it makes any difference. :-(
>

In this particular case, which ideas are synergistic?

Michael S

unread,
Aug 5, 2013, 10:49:49 AM8/5/13
to
On Monday, August 5, 2013 5:19:57 PM UTC+3, Andy (Super) Glew wrote:
> On 8/4/2013 7:05 PM, Ivan Godard wrote:
>
> > On 8/4/2013 6:17 PM, Andy (Super) Glew wrote:
>
>
>
> >> Keying the table on pointer value, Ao, the address of the object, is
>
> >> just plain incorrect.
>
> >>
>
> >> Reason: many objects overlap. Many objects have the same starting
>
> >> address.
>
> >
>
> > Well yes, of course. But if you simply set the bounds to the whole
>
> > malloc (plus 1 for C++) and check within that then you will catch the
>
> > large majority of actual bugs. True, if you have an array inside a
>
> > struct then you won't catch an overrun of the array that is not also an
>
> > overrun of the struct, but you're already not catching other things, so
>
> > this is one more, and I'd happily swap a few escapees for cheap, simple,
>
> > enforced, and universal.
>
>
>
> A very large fraction of bugs are buffer overflows between adjacent
> variables on the stack.
>

Separate variables on the stack are separate objects rather than parts of the same object.
Bounds, keyed by the address of the object, will catch buffer overflows like those just fine.
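
For example (deliberately broken C; the actual frame layout is up to
the compiler, so treat this purely as a sketch):

#include <string.h>

void stack_overflow_demo(void)
{
    char a[16];
    char b[16];

    /* Stays inside the stack frame, so a whole-frame bound would pass;
       a per-object bound on a[], i.e. [a, a+16), flags it. */
    memset(a, 0, 32);       /* 16 bytes past the end of a[]               */
    (void)b;                /* b[] (or other frame data) is what gets hit */
}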

Terje Mathisen

unread,
Aug 5, 2013, 10:57:13 AM8/5/13
to
They are not, that is the problem.

I.e. if MPX started out as something like what you've been asking for,
i.e. a (nearly) transparent way to get fat pointers and (nearly) free
bounds checks, then what they've delivered has been emasculated to the
bare minimum that just might show a small performance win for some types
of checked code.

EricP

unread,
Aug 5, 2013, 1:21:30 PM8/5/13
to
If you look at the pseudo-code for the BNDLDX inst it says in part:

Temp_lb[31:0] = LoadFrom(A_BTE );
Temp_ub[31:0] = LoadFrom(A_BTE + 4);
Temp_ptr[31:0] = LoadFrom(A_BTE + 8);
IF Temp_ptr equal ptr_value Then
BND.LB = Temp_lb;
BND.UB = Temp_ub;
ELSE
BND.LB = 0;
BND.UB = 0;
FI;

The pointer field is a tag to indicate validity.
If it does not match, the bounds are set to min and max
effectively disabling the check.
BNDSTX sets the ptr field.
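
Restated as a small C model of the same logic (illustrative only:
bounds are kept in plain form here, not in whatever encoding the
hardware actually uses, and the reserved field is omitted):

#include <stdint.h>

typedef struct { uint64_t lb, ub, ptr; } bte_t;   /* one bounds table entry */
typedef struct { uint64_t lb, ub; } bnd_t;        /* a loaded bounds pair   */

/* Abstract effect of BNDLDX: the stored pointer value acts as a tag;
   on a mismatch the result is an "always passes" bound. */
bnd_t bndldx_model(const bte_t *e, uint64_t ptr_value)
{
    bnd_t b;
    if (e->ptr == ptr_value) {
        b.lb = e->lb;
        b.ub = e->ub;
    } else {
        b.lb = 0;
        b.ub = ~0ULL;
    }
    return b;
}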

> Notice also that they call it the "pointer value", whereas I have talked
> about "address of pointer" or "backpointer". "Pointer value" sounds
> like the address of the object that the pointer is pointing to, rather
> than the address of the pointer itself.

That looks like what they intended.

> Later, on page 9-10 of the reference, you can see examples of the
> compact notation I use to discuss this: "Bounds in memory are associated
> with the memory address where the pointer is stored, i.e., Ap. Linear
> address LAp ..."
>
> I.e. the pointer lives in user memory at address Ap, and contains the
> address of the object, Ao. The descriptor lives somewhere else,
> non-adjacent - I call it address Ad, for obvious reasons - and contains
> Ao_lwb, and Ao_upb. My "backpointer" in the descriptor contains the
> original Ap. It is not clear to me if the MPX descriptor / bound table
> entry's "pointer value" is Ap or Ao.
>
> The MPX BTE doesn't need Ap for lookup. Not even for full security.

I noticed a gotcha in MPX.

The descriptor BTE is located by looking up the address of the object.
However if you took the address a field within that object,
and if the field was far enough away from the object origin
(more than 16 bytes on x86, 32 bytes on x64) and tried
to check that field address, it would index to a different
descriptor BTE, one most likely invalid (BTE.ptr != field_addr).
An invalid descriptor is interpreted as disabling the check.

Eric

Ivan Godard

unread,
Aug 5, 2013, 1:59:18 PM8/5/13
to
On 8/5/2013 7:19 AM, Andy (Super) Glew wrote:
> On 8/4/2013 7:05 PM, Ivan Godard wrote:
>> On 8/4/2013 6:17 PM, Andy (Super) Glew wrote:
>
>>> Keying the table on pointer value, Ao, the address of the object, is
>>> just plain incorrect.
>>>
>>> Reason: many objects overlap. Many objects have the same starting
>>> address.
>>
>> Well yes, of course. But if you simply set the bounds to the whole
>> malloc (plus 1 for C++) and check within that then you will catch the
>> large majority of actual bugs. True, if you have an array inside a
>> struct then you won't catch an overrun of the array that is not also an
>> overrun of the struct, but you're already not catching other things, so
>> this is one more, and I'd happily swap a few escapees for cheap, simple,
>> enforced, and universal.
>
> A very large fraction of bugs are buffer overflows between adjacent
> variables on the stack.
>
>
>
>

Exactly. So if each stack object (within the frame) has its own
descriptor then you catch those overruns. The stack frame isn't one
object :-) So again, keying by address (range) works (almost) as well as
a (dense or distributed) fat pointer, and is *much* cheaper.


Not that I think it's necessarily worth the trouble even then; ours is
about as cheap as is possible (I think), and I still doubt we will put
it in.

Andy (Super) Glew

unread,
Aug 5, 2013, 2:13:21 PM8/5/13
to
On 8/5/2013 10:21 AM, EricP wrote:
> Andy (Super) Glew wrote:
>>
>> INTERESTING TECHNICAL NOTE (at least interesting to me): from
>> http://download-software.intel.com/sites/default/files/319433-015.pdf
>>
>> I notice that MPX's BTE (Bound Table Entry) contains
>
>> * pointer value - ??
>
> If you look at the pseudo-code for the BNDLDX inst it says in part:
>
> Temp_lb[31:0] = LoadFrom(A_BTE );
> Temp_ub[31:0] = LoadFrom(A_BTE + 4);
> Temp_ptr[31:0] = LoadFrom(A_BTE + 8);
> IF Temp_ptr equal ptr_value Then
> BND.LB = Temp_lb;
> BND.UB = Temp_ub;
> ELSE
> BND.LB = 0;
> BND.UB = 0;
> FI;
>
> The pointer field is a tag to indicate validity.
> If it does not match, the bounds are set to min and max
> effectively disabling the check.
> BNDSTX sets the ptr field.

Right. It is the pointer value, the address of the object, Ao.

It is a small consistency check. It will catch some errors - like when
a library routine that is not using MPX stores a completely different
pointer to a memory location that was originally written to by MPX code.

In some ways, the MPX bounds are not just the extension parts of what in
a fat pointer would be next to the pointer value. They are the full fat
pointer.




> I noticed a gotcha in MPX.
>
> The descriptor BTE is located by looking up the address of the object.

Nope.

Page 9-10 of the reference "Bounds in memory are associated with the
memory address where the pointer is stored, i.e. Ap. Linear address LAp ..."

Florian Weimer

unread,
Aug 5, 2013, 3:43:27 PM8/5/13
to
* Ivan Godard:

> I'm somewhat surprised that Intel found this to be worth doing, but
> the customers must want it. It is only usable with compiler
> co-operation, so it's not a cap but a debugging aid. Because the table
> is keyed by pointer location rather than by pointer value it requires
> the compiler to track where a given pointer came from, which seems
> buggy to me.

It's required for detecting intra-object overflows (overflows which do
not cross a heap or stack allocation boundary).

There's some debate whether s.a[4] is valid, considering this
definition:

struct {
char a[4];
char b;
} s;

I'm not sure if this is intended solely as a debugging aid. That
probably depends on the performance hit and memory overhead. The
full compatibility they're aiming for is a strong hint for more
general use.

> Intel's requires extra code at point of use; ours does not. Intel's
> potentially catches all invalid memory references; ours catches only
> most of them. Intel's permits arbitrary (wild) address arithmetic;

I think the hardware does not enforce either approach because the
bounds checks are separate.

Mike Stump

unread,
Aug 5, 2013, 4:05:12 PM8/5/13
to
In article <87k3jzh...@mid.deneb.enyo.de>,
Florian Weimer <f...@deneb.enyo.de> wrote:
>There's some debate whether s.a[4] is valid, considering this
>definition:
>
> struct {
> char a[4];
> char b;
> } s;

:-) I don't think there is any debate about that. int i = s.a[4] is
not valid. char *cp = &s.a[4]; is valid. int i = ((char *)&s)[4]; is valid.

More curious is:

char *cp = &s.a[2]; cp = cp - 2; cp = cp + 4; int i = *cp;

Here, reasonable people might disagree. I'm in the, "yes, this is
called upcasting, and it is fine," camp.

John Savard

unread,
Aug 6, 2013, 5:01:43 PM8/6/13
to
On Mon, 5 Aug 2013 20:05:12 GMT, m...@kithrup.com (Mike Stump) wrote, in
part:
Following what is written:

The pointer to char cp is declared;
The location of s.a[2] is assigned to it.
(s.a is of type char)
That pointer has two subtracted from it, and then four added to it. So
it has been increased by two.
Then the character found at s.a[4] is stored in the integer variable i.

So, yes, it is valid.

I thought I saw
cp = cp + 2; cp = cp - 4;
and this is why I worked it out step by step, because in that case, it
would clearly be invalid. But in this case, I fail to see where any
problem would come from. The expression is evaluated first, before the
compiler concerns itself with what the result will be stored in. That's
true for C, FORTRAN, and most compiled languages.

John Savard
http://www.quadibloc.com/index.html

Ivan Godard

unread,
Aug 6, 2013, 5:44:58 PM8/6/13
to
True in C; explicitly not true for iterators in C++.

Gunnar Blomberg

unread,
Aug 7, 2013, 7:41:42 AM8/7/13
to
On Wed, 31 Jul 2013 21:32:30 -0700, Ivan Godard <iv...@ootbcomp.com>
wrote:

>I'm somewhat surprised that Intel found this to be worth doing, but the
>customers must want it. It is only usable with compiler co-operation, so
>it's not a cap but a debugging aid. Because the table is keyed by
>pointer location rather than by pointer value it requires the compiler
>to track where a given pointer came from, which seems buggy to me.

I believe it isn't possible to key by pointer value, because of
pointer arithmetic. The pointer value may not be to the start of the
object. How would you figure out the bounds for a pointer into a
buffer, if the pointer no longer points into the buffer?
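
For example (plain C, nothing exotic):

void interior_pointer_demo(void)
{
    int buf[100];
    int *p = &buf[40];   /* p's value is not the start of buf          */
    p += 20;             /* ...and keeps moving by ordinary arithmetic */
    *p = 0;              /* a table keyed on p's *value* has no entry
                            describing this access; keying on where p
                            is stored, and propagating the bounds with
                            the value, sidesteps the question          */
}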

EricP

unread,
Aug 7, 2013, 5:04:24 PM8/7/13
to
Andy (Super) Glew wrote:
> On 8/5/2013 10:21 AM, EricP wrote:
>> I noticed a gotcha in MPX.
>>
>> The descriptor BTE is located by looking up the address of the object.
>
> Nope.
>
> Page 9-10 of the reference "Bounds in memory are associated with the
> memory address where the pointer is stored, i.e. Ap. Linear address LAp
> ..."

Oh right. For x86 there can be one BTE for each 4-byte pointer.
If the pointer's location isn't naturally aligned it still works.

To copy a pointer, such as to pass-by-value to a routine,
the pointer copier also sets up a new descriptor for the copy.

Since the descriptor table has a 2 level tree structure,
the descriptor creater, such as the above copier,
is a complex sequence that may have to walk the table
and allocate memory chunks to the tree
(zero-init descriptors default to invalid).

Eric



Michael S

unread,
Aug 9, 2013, 8:02:21 AM8/9/13
to
99% of pointer arithmetic is not a problem, only exotic cases like below seem problematic.

type_t arr1[10];
type_t arr2[10];
type_t *ptr10 = &arr1[0];
type_t *ptr21 = &arr2[1];
type_t *ptr15 = &arr1[5];
// more complex code that prevents compiler from figuring out statically that values of pointers did not change
ptrdiff_t diff = ptr21 - ptr10;
// more complex code that prevents compiler from figuring out statically that values of diff and of ptr15 and did not change
type_t *ptr2x = ptr15 + diff;
// here programmer expects that ptr2x points to arr2[6]

According to my understanding, C standard does not guarantee that ptr2x will point to arr2[6], but "idiomatic" C/C++ compilers always worked that way and programmers are so accustomed to this particular behavior that it does not matter what the standard is saying.

So, for code sequences like those, keying by value of the pointer would be problematic, indeed. However one could hope, that sequences like the one above are very uncommon. For more common cases keying by value of pointer is technically possible. It is just complex and computationally expensive.

BGB

unread,
Aug 9, 2013, 11:27:15 AM8/9/13
to
On 7/31/2013 12:08 PM, EricP wrote:
> EricP wrote:
>> Paul A. Clayton wrote:
>>> Some links to information on the Intel website are available here:
>>> http://software.intel.com/en-us/intel-isa-extensions#pid-16080-1495
>>>
>>> I have not looked at this yet, but the description I looked over
>>> (http://www.realworldtech.com/forum/?threadid=135174&curpostid=135274 )
>>> seemed to indicate that the base/bound values can either be directly
>>> loaded or be loaded from a general table (allowing one to avoid the
>>> need for fat pointers).
>>>
>>> From my extremely vague impression, this might be similar to a
>>> technique referenced by Andy Glew called HardBound
>>> (Devietti et al., 2008, "HardBound: Architectural Support for Spatial
>>> Safety of the C Programming Language"--I have not read this paper
>>> either!).
>>>
>>> Sorry not to provide any real content on this, but since no one else
>>> has posted this news here I thought that it would be better to submit
>>> a low-content post quickly.
>>
>> The instruction manual PDF is here
>>
>> Intel® Architecture Instruction Set Extensions Programming Reference
>> http://software.intel.com/sites/default/files/319433-015.pdf
>>
>> It also covers AVX-512, SHA (Secure Hash), Random number instructions,
>> paging mode accesses enhancements, various other new instructions.
>>
>> Eric
>>
>
> It looks unnecessarily complicated and less functional
> than a straight forward CompareAndTrap instruction.
>
> It stores bounds in a hardware defined table structure,
> with each entry as a 4-tuple consisting of a
> lower bound, upper bound, pointer, and reserved fields.
>
> But any language that has bounds checks already define their own
> software defined buffer descriptors and don't put them all in one table
> (usually they are created on the fly on the stack).
> So this is pretty much useless for existing checking languages.
>
> Having a single global bounds table looks problematic.
>
> For any buffers that are allocated dynamically, it requires
> management software for dynamically allocating and freeing
> buffer descriptors in its global bounds table.
> That adds unnecessary overhead.
>
> Switching tables on the fly looks possible (and expensive) but
> then you'd lose access to the descriptors in the prior table.
>

yeah, instructions for compare-and-trap and maybe also
compare-and-clamp, would be pretty useful.

say, assuming we could do something like:
cmptrap rcx, r11, r13 //trap if RCX is not between R11 and R13
cmpclamp rcx, r11, r13 //clamp RCX to a value between R11 and R13

maybe also:
tstclamp rcx, r11, r13 //test if RCX is a value between R11 and R13
this would then set up EFLAGS such that we could use Jcc instructions
with it.

...

also of-course immediate forms:
cmptrap ecx, 0xb0f0b0f0, 0xcafecafe //*1
cmpclamp ecx, 0, 255 //clamp ECX to the range of 0..255
tstclamp ecx, 0, 255 //test if ECX is in the range of 0..255


*1: dunno about a form with 64-bit immeds.
I guess an option could be silently encoding a memory access, which uses
a pair of bounds.

so, internally it is something like:
cmptrap rcx, [_constbounds]
...
_constbounds dq 0xbada55f001000000, 0xbada55f001dacafe


this would work, and could potentially handle all the immed forms, at
the cost of probably preventing using these forms on memory-memory operands.


these could then replace some common sequences of instructions commonly
involved both in HLL compiled output, as well as some signal-processing
stuff, ...


> Eric
>
>

BGB

unread,
Aug 9, 2013, 12:08:17 PM8/9/13
to
granted, the "cmptrap reg, rm" form would basically just be a revived
"bound" instruction.

could maybe reuse #BR as well.

different forms could make more sense though for sake of allowing test
and clamp operations, which could very well be more useful (since,
typically we want to trap and throw a language exception, or simply
clamp something, rather than necessarily get the OS involved...).


>
> these could then replace some common sequences of instructions commonly
> involved both in HLL compiled output, as well as some signal-processing
> stuff, ...
>

this mostly matters to the extent that it is actually faster though...
if it isn't faster, granted, there isn't as much of a point.


>
>> Eric
>>
>>
>

Bakul Shah

unread,
Aug 10, 2013, 3:22:31 PM8/10/13
to
On 8/9/13 5:02 AM, Michael S wrote:
> On Wednesday, August 7, 2013 2:41:42 PM UTC+3, Gunnar Blomberg wrote:
>> I believe it isn't possible to key by pointer value, because of
>> pointer arithmetic. The pointer value may not be to the start of the
>> object. How would you figure out the bounds for a pointer into a
>> buffer, if the pointer no longer points into the buffer?
>
> 99% of pointer arithmetic is not a problem, only exotic cases like below seem problematic.
>
> type_t arr1[10];
> type_t arr2[10];
> type_t *ptr10 = &arr1[0];
> type_t *ptr21 = &arr2[1];
> type_t *ptr15 = &arr1[5];
> // more complex code that prevents compiler from figuring out statically that values of pointers did not change
> ptrdiff_t diff = ptr21 - ptr10;

Most C compilers don't check this but this is expressly *not*
allowed as per the C standard. See section 6.5.6#9 (C99 std):

When two pointers are subtracted, both shall point to
elements of the same array object, or one past the last
element of the array object; the result is the difference of
the subscripts of the two array elements.

> // more complex code that prevents compiler from figuring out statically that values of diff and of ptr15 and did not change

The compiler may or may not check such things but if you misuse the language
you're on your own and can't complain if the compiler makes the assumption
you're following 6.5.6#9 and generates incorrect code.

> type_t *ptr2x = ptr15 + diff;
> // here programmer expects that ptr2x points to arr2[6]
>
> According to my understanding, C standard does not guarantee that ptr2x will point to arr2[6], but "idiomatic" C/C++ compilers always worked that way and programmers are so accustomed to this particular behavior that it does not matter what the standard is saying.

I would call this "idiotic" C, not idiomatic. There is rarely ever
any need for subtracting ptrs to different objects (and when done
it is generally done in very machine dependent code).

Joe keane

unread,
Aug 10, 2013, 4:02:45 PM8/10/13
to
In article <2d0db08b-3294-4599...@googlegroups.com>,
Michael S <already...@yahoo.com> wrote:
>type_t arr1[10];
>type_t arr2[10];
>type_t *ptr10 = &arr1[0];
>type_t *ptr21 = &arr2[1];
>type_t *ptr15 = &arr1[5];
>// more complex code that prevents compiler from figuring out statically
>that values of pointers did not change
>ptrdiff_t diff = ptr21 - ptr10;
>// more complex code that prevents compiler from figuring out statically
>that values of diff and of ptr15 and did not change
>type_t *ptr2x = ptr15 + diff;
>// here programmer expects that ptr2x points to arr2[6]

Besides it's illegal, it's also just freakin bizarre.

I've done some hacky code but...

Terje Mathisen

unread,
Aug 10, 2013, 4:51:50 PM8/10/13
to
Bakul Shah wrote:
> On 8/9/13 5:02 AM, Michael S wrote:
>> According to my understanding, C standard does not guarantee that
>> ptr2x will point to arr2[6], but "idiomatic" C/C++ compilers always
>> worked that way and programmers are so accustomed to this particular
>> behavior that it does not matter what the standard is saying.
>
> I would call this "idiotic" C, not idiomatic. There is rarely ever
> any need for subtracting ptrs to different objects (and when done
> it is generally done in very machine dependent code).

The _only_ place I have ever done this is in asm code loading data from
two arrays, combining them to generate a third array:

At this point it becomes quite natural to use a single pointer with
offsets to the others:

I.e. ESI -> first array, EBX = offset between first and second array,
ECX offset between first and target:

mov eax,[esi]
xor eax,[esi+ebx]
mov [esi+ecx],eax

but for C(++) I would never (I hope!) write code the same way.

Ivan Godard

unread,
Aug 10, 2013, 5:24:13 PM8/10/13
to
Terje, you have a dirty mind :-)

This is a use I'd never considered. An optimizing compiler might find
this when doing strength reduction of iteration variables, but checking
hardware would throw up on the resulting code.

I suppose the canonical example would be memcpy, where:
*p++ = *q++;
becomes:
*p[0] = *p[<delta>]; ++p;
using the trick. However, the same compiler should be able to find:
*p[i] = *q[i]; ++i;
which has the same operation count, although it uses two more registers
and the indexed effective address may take a cycle longer than unindexed
on some machines.

The problem with your trick is that it presumes that <delta> can never
exceed the size of an instruction offset literal. That assumption was
safe in many 32 bit machines but not all, and in few 64 bit machines.
Testing for the applicability would burn all your gain and then some I
expect, so the trick is really only applicable in app- and
implementation-specific code and cannot be part of general libraries.

Nevertheless, I am a great admirer of good dirty minds :-)

Andy (Super) Glew

unread,
Aug 10, 2013, 7:44:54 PM8/10/13
to
On 8/10/2013 2:24 PM, Ivan Godard wrote:
> On 8/10/2013 1:51 PM, Terje Mathisen wrote:
>> Bakul Shah wrote:
>>> On 8/9/13 5:02 AM, Michael S wrote:
>>>> According to my understanding, C standard does not guarantee that
>>>> ptr2x will point to arr2[6], but "idiomatic" C/C++ compilers always
>>>> worked that way and programmers are so accustomed to this particular
>>>> behavior that it does not matter what the standard is saying.
>>>
>>> I would call this "idiotic" C, not idiomatic. There is rarely ever
>>> any need for subtracting ptrs to different objects (and when done
>>> it is generally done in very machine dependent code).
>>
>> The _only_ place I have ever done this is in asm code loading data from
>> two arrays, combining them to generate a third array:
>>
>> At this point it becomes quite natural to use a single pointer with
>> offsets to the others:
>>
>> I.e. ESI -> first array, EBX = offset between first and second array,
>> ECX offset between first and target:
>>
>> mov eax,[esi]
>> xor eax,[esi+ebx]
>> mov [esi+ecx],eax
>>
>> but for C(++) I would never (I hope!) write code the same way.

Intel's C compiler generates code this same way via aggressive
optimizations.


>>
>
> Terje, you have a dirty mind :-)
>
> This is a use I'd never considered. An optimizing compiler might find
> this when doing strength reduction of iteration variables, but checking
> hardware would throw up on the resulting code.

Yep.

Been there, fixed that.

Andy (Super) Glew

unread,
Aug 10, 2013, 7:48:33 PM8/10/13
to
Although note that you also need a check that the particular bounds you
are comparing against is valid. Which is a bit more code.

Library routines may be passed both pointers with bounds and pointers
with no bounds, and must work in both cases.

Apart from that, my only objection to

> cmptrap rcx, r11, r13 //trap if RCX is not between R11 and R13

is that it is a separate instruction, optional and not even potentially
mandatory security.

Bakul Shah

unread,
Aug 10, 2013, 11:19:47 PM8/10/13
to
On 8/10/13 1:51 PM, Terje Mathisen wrote:
> Bakul Shah wrote:
>> ... There is rarely ever
>> any need for subtracting ptrs to different objects (and when done
>> it is generally done in very machine dependent code).
>
> The _only_ place I have ever done this is in asm code loading data from two arrays, combining them
> to generate a third array:
>
> At this point it becomes quite natural to use a single pointer with offsets to the others:
>
> I.e. ESI -> first array, EBX = offset between first and second array, ECX offset between first and
> target:
>
> mov eax,[esi]
> xor eax,[esi+ebx]
> mov [esi+ecx],eax
>
> but for C(++) I would never (I hope!) write code the same way.

Very clever! But machine dependent. A long time ago I worked on a
system with 6 byte pointers (2 byte segment + 4 byte seg offset).
Here the three arrays would be in different segments and a ptrdiff
between them would've been a 6 byte quantity. Your trick would've
been a pessimization!

BGB

unread,
Aug 10, 2013, 11:40:57 PM8/10/13
to
a lot depends on the source-language and/or compiler/VM in-use.

presumably, it would be up to the compiler or code-generator to
determine (statically) whether or not a bounds-check will be used (along
with the whole matter of whether we are using a narrow or fat-pointer, ...).

though, potentially such an instruction could interpret a certain case,
such as ((lower==upper)&&(upper==0)) as being no-op, but this seems a
little pointless.


if the ABI only allows raw narrow pointers though, then granted, this
mechanism probably won't be able to do a whole lot.


> Apart from that, my only objection to
>
> > cmptrap rcx, r11, r13 //trap if RCX is not between R11 and R13
>
> is that it is a separate instruction, optional and not even potentially
> mandatory security.

possibly, but I was personally assuming a case of "slightly cheaper
bounds checks for JIT compiler output".

granted, something like this would be woefully insufficient if security
were the concern.


in this case, it would more replace use of constructions like, for example:
rax -> boxed-pointer
[rax+0]=pointer value,
[rax+8]=lower bound,
[rax+16]=upper bound,
...

mov rcx, [rax+0]
cmp rcx, [rax+8]
jl .trap
cmp rcx, [rax+16]
jge .trap
movsd xmm0, [rcx]
...
.trap:
... logic for throwing VM exception ...

with something like:
mov rcx, [rax+0]
tstclamp rcx, [rax+8]
jnz .trap
movsd xmm0, [rcx]
...

granted, a lot would depend on if these instructions would actually be
faster than using the existing constructions.


for something like C, the compiler probably may or may not use
fat-pointers, probably depending on what it thinks are the chances of
the code generating an out-of-bounds access.


otherwise, as noted, instructions for things like clamping values to a
range, and things like testing for bit-tags, ... could be potentially
useful sometimes.

taken further, maybe something to allow for a faster "switch" could also
be cool (as-is, "switch()" can be fairly expensive in
performance-critical code, though I don't actually know if dedicated
instructions could help much of anything with this).


actually, a clamp-instruction could probably itself help with
faster-switch if it were itself faster, since it would no longer be
needed to actually check the upper and lower bounds (and explicitly
branch to default), and instead we would simply leave the low and high
bounds of the jump table as pointing to 'default'.

similarly, something like "tstclamp" could also be used to jump to
'default'.
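
a rough C sketch of that lowering (hypothetical case labels; the two
hand-written clamps stand in for the proposed cmpclamp/tstclamp):

#include <stdio.h>

static void case_a(void)       { puts("a"); }
static void case_b(void)       { puts("b"); }
static void case_c(void)       { puts("c"); }
static void case_default(void) { puts("default"); }

/* switch over cases 10..12, lowered as a clamped index into a table
   whose first and last slots both point at 'default' */
void dispatch(int v)
{
    static void (*const table[5])(void) = {
        case_default, case_a, case_b, case_c, case_default
    };
    int idx = v - 10 + 1;    /* map cases 10..12 onto slots 1..3 */
    if (idx < 1) idx = 0;    /* below range -> default slot      */
    if (idx > 3) idx = 4;    /* above range -> default slot      */
    table[idx]();
}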

though, granted, AFAICT, usually the main 'jmp' instruction is the main
expensive part.

...

Florian Weimer

unread,
Aug 11, 2013, 3:13:35 PM8/11/13
to
* Mike Stump:

> In article <87k3jzh...@mid.deneb.enyo.de>,
> Florian Weimer <f...@deneb.enyo.de> wrote:
>>There's some debate whether s.a[4] is valid, considering this
>>definition:
>>
>> struct {
>> char a[4];
>> char b;
>> } s;
>
> :-) I don't think there is any debate about that. int i = s.a[4] is
> not valid. char *cp = &s.a[4]; is valid. int i = ((char *)&s)[4]; is valid.

Throw in comparisons with memcmp on the pointers or casts to
uintptr_t, and things get more strange.

> char *cp = &s.a[2]; cp = cp - 2; cp = cp + 4; int i = *cp;
>
> Here, reasonable people might disagree. I'm in the, "yes, this is
> called upcasting, and it is fine," camp.

s.a[4] is defined as *(s.a + 4), so I don't see how this is different.

Andy (Super) Glew

unread,
Aug 13, 2013, 1:56:45 AM8/13/13
to
On 8/10/2013 4:48 PM, Andy (Super) Glew wrote:
> On 8/9/2013 8:27 AM, BGB wrote:
>> yeah, instructions for compare-and-trap and maybe also
>> compare-and-clamp, would be pretty useful.
>>
>> say, assuming we could do something like:
>> cmptrap rcx, r11, r13 //trap if RCX is not between R11 and R13
>> cmpclamp rcx, r11, r13 //clamp RCX to a value between R11 and R13
>
> Although note that you also need a check that the particular bounds you
> are comparing against is valid. Which is a bit more code.
>
...
>
> Apart from that, my only objection to
>
> > cmptrap rcx, r11, r13 //trap if RCX is not between R11 and R13
>
> is that it is a separate instruction, optional and not even potentially
> mandatory security.

Oh, more details:

Usually you don't want to do just
cmptrap rcx, r11, r13 //trap if RCX is not between R11 and R13

Usually you have an associated access size:

cmptrap [rcx,+4bytes), [r11, r13]
//trap if RCX to RCX + 4 bytes is not between R11 and R13

Most access sizes are simple - 1, 2, 4, 8, 16, ... bytes.

Sometimes you need more arbitrary access sizes.


The test becomes something like

check_bounds(access,bounds) {
assert( access.lo <= access.hi );
assert( bounds.lwb.valid && bounds.upb.valid
? bounds.lwb.value <= bounds.upb.value
: true );
if( bounds.lwb.valid == 0 ) {
lwb_okay = true;
} else {
lwb_okay = (bounds.lwb.value <= access.lo);
}
if( bounds.upb.valid == 0 ) {
upb_okay = true;
} else {
upb_okay = (bounds.upb.value >= access.hi);
}
return lwb_okay && upb_okay;
}

You can still do this using a three input adder, and some bitmask logic.

BGB

unread,
Aug 13, 2013, 5:32:57 AM8/13/13
to
On 8/13/2013 12:56 AM, Andy (Super) Glew wrote:
> On 8/10/2013 4:48 PM, Andy (Super) Glew wrote:
>> On 8/9/2013 8:27 AM, BGB wrote:
>>> yeah, instructions for compare-and-trap and maybe also
>>> compare-and-clamp, would be pretty useful.
>>>
>>> say, assuming we could do something like:
>>> cmptrap rcx, r11, r13 //trap if RCX is not between R11 and R13
>>> cmpclamp rcx, r11, r13 //clamp RCX to a value between R11 and R13
>>
>> Although note that you also need a check that the particular bounds you
>> are comparing against is valid. Which is a bit more code.
>>
> ...
>>
>> Apart from that, my only objection to
>>
>> > cmptrap rcx, r11, r13 //trap if RCX is not between R11 and R13
>>
>> is that it is a separate instruction, optional and not even potentially
>> mandatory security.
>
> Oh, more details:
>
> Usually you don't want to do just
> cmptrap rcx, r11, r13 //trap if RCX is not between R11 and R13
>
> Usually you have an associated access size:
>
> cmptrap [rcx,+4bytes), [r11, r13]
> //trap if RCX to RCX + 4 bytes is not between R11 and R13
>
> Most access sizes are simple - 1, 2, 4, 8, 16, ... bytes.
>
> Sometimes you need more arbitrary access sizes.
>

this is possible, but is generally something that can be enforced by a
compiler (and/or unlikely assuming the programmer isn't doing something
clever with pointers).

say, if RCX<R13, and both values are aligned on 4-byte boundary, then
(RCX+3)<R13 by extension. then it is mostly about enforcing alignment.


the main problem with more conventional bounds-checks is mostly that
straightforward arithmetic/... can easily produce out-of-bounds values,
mandating explicit checks except in cases where it is possible to prove
that the value will not be out of bounds.


luckily, if accessing by array index (with 0 based arrays), it is
possible to do something like:
cmp ecx, [rax+8] ;say, [rax+8]=array length
jae .trap
mov rdx, [rax+0] ;[rax+0]=array data
movss xmm1, [rdx+4*rcx]


there is a slight special case in the case of offset-arrays (in my VM),
where it may be something more like, say:
add ecx, [rax+12] ;say, [rax+12]=base offset
cmp ecx, [rax+8] ;say, [rax+8]=array length
jae .trap
mov rdx, [rax+0] ;[rax+0]=array data
movss xmm1, [rdx+4*rcx]


where in my language/VM, an offset array is basically the result of
something like:
var arr:float[1024];
var arr2:float[]=arr+256;
var arr3:float[]=arr-1;

where arr2 basically has a valid index range between -256 and 767, and
arr3 between 1 and 1024, with each referring to the same array data.

(this VM uses split headers and data areas for both arrays and objects).

sadly, there may still be costs for things like detecting and trapping
on 'null' (and logic to optimize away the null-checks, ...), ...


granted, more strict checks are likely needed if accessing array data
via raw pointers (such as in a C compiler).


side note: actually... in general, my VM architecture and current JIT
are pretty much crap, but it is hard trying to justify the time/effort
that would be needed to really make it better (like, trying to be more
performance-competitive with C compiler output, ...).

but, it works ok despite being almost comically bad and embarrassing in
many areas: stack-machine model + occasional threaded code + a
trampoline loop (and chained basic-blocks) + ... vs, say, 3AC+SSA + full
code generation + a proper ABI (and unified function/method bodies) + ...

decided to omit a large description of the use of trampoline loops and
CPS (Continuation Passing Style), and how this relates to
inter-operation with the C ABI (there are good and bad points).

a "cleaned up" VM would probably use 3AC+CPS, probably a modified ABI,
and probably still use some internal runtime calls.


>
> The test becomes something like
>
> check_bounds(access,bounds) {
> assert( access.lo <= access.hi );
> assert( bounds.lwb.valid && bounds.upb.valid
> ? bounds.lwb.value <= bounds.upb.value
> : true );
> if( bounds.lwb.valid == 0 ) {
> lwb_okay = true;
> } else {
> lwb_okay = (bounds.lwb.value <= access.lo);
> }
> if( bounds.upb.valid == 0 ) {
> upb_okay = true;
> } else {
> upb_okay = (bounds.upb.value >= access.hi);
> }
> return lwb_okay && upb_okay;
> }
>
> You can still do this using a three input adder, and some bitmask logic.
>

yep.


Piotr Wyderski

unread,
Aug 13, 2013, 7:06:10 AM8/13/13
to
Michael S wrote:

> I think that in context of general-purpose computing heterogeneous cores in a single coherence domain
> is a horrible idea and full software compatibility between cores
> (like Cortex-A7 with Cortex-A15) does not make it any less horrible.

It depends on how you define the word "heterogeneous". If taken
literally, e.g. a SPARC + an ARM + x64 on a single die, it would
be a crazy idea. Same architecture, e.g. an i386 + Core i7 makes
more sense, but if "heterogeneous" simply means "the very same
ISAs, but optimized differently, one for performance and one for
low-power", then why would it be horrible? I have not seen the
VHDL sources of any Intel CPU, but I *believe* the design is
decoupled to a large extent, so it should not matter whether
one chooses an Atom or i7 from a library and connects it to the
memory controller block, i.e. a coarse-grained FPGA-like
way of designing stuff. If I recall correctly, you are a kind
of insider there, so am I wrong?

IMHO the only important thing is the ability to freeze a process
and migrate it to a differently optimized core seamlessly. The
rest are unimportant details. If you don't have it, don't start
to sell such a chip. We don't need another Cell.


> But starting from 10'' tablet, through laptop, desktop, DP server and up to likes of IBM p795,
> IMHO, there is no place for [cache-coherent] heterogeneous.

Having to develop business software on p690, I dare to say
that there is no place for pSeries at all. Ceterum censeo...

Best regards, Piotr

Andy (Super) Glew

unread,
Aug 13, 2013, 7:53:57 PM8/13/13
to
On 8/13/2013 2:32 AM, BGB wrote:
> On 8/13/2013 12:56 AM, Andy (Super) Glew wrote:

>> Usually you don't want to do just
>> cmptrap rcx, r11, r13 //trap if RCX is not between R11 and R13
>>
>> Usually you have an associated access size:
>>
>> cmptrap [rcx,+4bytes), [r11, r13]
>> //trap if RCX to RCX + 4 bytes is not between R11 and R13
>>
>> Most access sizes are simple - 1, 2, 4, 8, 16, ... bytes.
>>
>> Sometimes you need more arbitrary access sizes.
>>
>
> this is possible, but is generally something that can be enforced by a
> compiler (and/or unlikely assuming the programmer isn't doing something
> clever with pointers).
>
> say, if RCX<R13, and both values are aligned on 4-byte boundary, then
> (RCX+3)<R13 by extension. then it is mostly about enforcing alignment.

I knew somebody would say this.

x86, of course, does not require alignment.

(For that matter, nor does ARMv8.)

Terje Mathisen

unread,
Aug 14, 2013, 3:01:19 AM8/14/13
to
Andy (Super) Glew wrote:
> On 8/13/2013 2:32 AM, BGB wrote:
>> say, if RCX<R13, and both values are aligned on 4-byte boundary, then
>> (RCX+3)<R13 by extension. then it is mostly about enforcing alignment.
>
> I knew somebody would say this.
>
> x86, of course, does not require alignment.

x86 _mostly_ doesn't require alignment, with short vector code as the
important exception.

As we've discussed several times previously, any SIMD style code should
make do with element-size alignment and not require (except possibly for
maximum performance) vector-size alignment.

Terje

Michael S

unread,
Aug 14, 2013, 4:26:00 AM8/14/13
to
On Wednesday, August 14, 2013 10:01:19 AM UTC+3, Terje Mathisen wrote:
> Andy (Super) Glew wrote:
>
> > On 8/13/2013 2:32 AM, BGB wrote:
>
> >> say, if RCX<R13, and both values are aligned on 4-byte boundary, then
>
> >> (RCX+3)<R13 by extension. then it is mostly about enforcing alignment.
>
> >
>
> > I knew somebody would say this.
>
> >
>
> > x86, of course, does not require alignment.
>
>
>
> x86 _mostly_ doesn't require alignment, with short vector code as the
> important exception.
>

That's "old x86". "New x86", i.e. AVX, including VEX-encoded 128-bit instructions, does no require alignment. Period.
On AMD, starting from Barcelona, you can request the same behavior for "old x86" as well by setting bit 17 in MXCSR.

BGB

unread,
Aug 14, 2013, 3:05:47 PM8/14/13
to
On 8/13/2013 6:53 PM, Andy (Super) Glew wrote:
> On 8/13/2013 2:32 AM, BGB wrote:
>> On 8/13/2013 12:56 AM, Andy (Super) Glew wrote:
>
>>> Usually you don't want to do just
>>> cmptrap rcx, r11, r13 //trap if RCX is not between R11 and R13
>>>
>>> Usually you have an associated access size:
>>>
>>> cmptrap [rcx,+4bytes), [r11, r13]
>>> //trap if RCX to RCX + 4 bytes is not between R11 and R13
>>>
>>> Most access sizes are simple - 1, 2, 4, 8, 16, ... bytes.
>>>
>>> Sometimes you need more arbitrary access sizes.
>>>
>>
>> this is possible, but is generally something that can be enforced by a
>> compiler (and/or unlikely assuming the programmer isn't doing something
>> clever with pointers).
>>
>> say, if RCX<R13, and both values are aligned on 4-byte boundary, then
>> (RCX+3)<R13 by extension. then it is mostly about enforcing alignment.
>
> I knew somebody would say this.
>
> x86, of course, does not require alignment.
>
> (For that matter, nor does ARMv8.)
>

it doesn't matter as much what the CPU does here, but more what code the
compiler may generate.

if the compiler doesn't allow unaligned access, then it can be ignored,
and a CPU may do a simple bounds check, and if there is an issue due to
alignment, then the compiler can also take up the role of enforcing it.

this means, you don't check if the (address+3)>=limit, but rather if
address>=(limit-3), with the compiler adjusting for this.

say, for a pointer into an int[1024] array, we can check against 4093,
rather than 4096, exploiting that the element at 1023 will start at 4092.
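
as a C sketch of that adjustment (the struct and names are purely
illustrative, not any real descriptor format):

#include <stdint.h>

typedef struct {
    uintptr_t base;
    uintptr_t last_start;   /* base + 4092 for an int[1024] array */
} bounds_4byte;

/* the upper limit is pre-adjusted by (access size - 1) once, so the
   per-access test is just two unsigned compares */
static inline int ok_4byte_access(bounds_4byte b, uintptr_t addr)
{
    return addr >= b.base && addr <= b.last_start;
}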


the problem in terms of a CPU instruction is that (at least with
traditional instruction forms), there isn't really a good way to encode
an explicit operand size.

so, if the compiler assumes proper alignment, but somehow stuff is
misaligned, then there is a bigger problem than whether the bounds-check
works correctly: something is in error, either in the compiler's
generated code, or in the runtime, or something...

otherwise, if needed, a manual adjustment to the limits can be used.


granted, this is a different matter from a secure ISA, where a secure
ISA would necessarily have to protect against code produced by unknown
compilers, vs leaving primary security enforcement to the compiler or to
a VM layer, thus requiring more stringent checks.

while not necessarily a bad thing, this would be asking a bit more.


the bigger issue I suspect with the whole bounds-check thing is:
it is inherently a run-time problem (static validation can only say
where they may occur, but not if or when they will occur);
so, it is mostly a matter of making them sufficiently cheap that it
doesn't seem like a waste of clock-cycles for a compiler to use them.

simpler range-checking instructions are likely cheaper to implement, and
if faster than more manual checks, could still be an improvement.

granted, better yet is if it comes for free...


EricP

unread,
Aug 14, 2013, 3:37:41 PM8/14/13
to
Andy (Super) Glew wrote:
> On 8/10/2013 4:48 PM, Andy (Super) Glew wrote:
>> On 8/9/2013 8:27 AM, BGB wrote:
>>> yeah, instructions for compare-and-trap and maybe also
>>> compare-and-clamp, would be pretty useful.
>>>
>>> say, assuming we could do something like:
>>> cmptrap rcx, r11, r13 //trap if RCX is not between R11 and R13
>>> cmpclamp rcx, r11, r13 //clamp RCX to a value between R11 and R13
>>
>> Although note that you also need a check that the particular bounds you
>> are comparing against is valid. Which is a bit more code.
>>
> ....
>>
>> Apart from that, my only objection to
>>
>> > cmptrap rcx, r11, r13 //trap if RCX is not between R11 and R13
>>
>> is that it is a separate instruction, optional and not even potentially
>> mandatory security.
>
> Oh, more details:
>
> Usually you don't want to do just
> cmptrap rcx, r11, r13 //trap if RCX is not between R11 and R13

I had originally thought of this as having a condition code and 2 operands
cmptrapCC r, r/m/imm
as being an optimization of

cmp r, r/m/imm
jmpCC label
trap
label:

so that the trap PC points where the problem occurred.
Having a general condition code could allow it to be used
for general assertions, not just index or range checks.

So it takes a pair of cmptrapCC instructions to do an
array index check, but many arrays have small static sizes
so it is comparing the index to a small immediate value.

This also works for both signed and unsigned data types.
Unsigned data types may not require a lower value check.

Also having a bounds check style instruction requires a pair of values,
and this could be a pair of 8-byte immediates in the instruction buffer,
which looks problematic to me.
Alternatively it implies a hardware defined bounds data structure
which I would also prefer to avoid.

The pair of check instr. might be fused internally
into a single bounds microop.

>
> Usually you have an associated access size:
>
> cmptrap [rcx,+4bytes), [r11, r13]
> //trap if RCX to RCX + 4 bytes is not between R11 and R13
>
> Most access sizes are simple - 1, 2, 4, 8, 16, ... bytes.
>
> Sometimes you need more arbitrary access sizes.
>
>
> The test becomes something like
>
> check_bounds(access,bounds) {
> assert( access.lo <= access.hi );
> assert( bounds.lwb.valid && bounds.upb.valid
> ? bounds.lwb.value <= bounds.upb.value
> : true );
> if( bounds.lwb.valid == 0 ) {
> lwb_okay = true;
> } else {
> lwb_okay = (bounds.lwb.value <= access.lo);
> }
> if( bounds.upb.valid == 0 ) {
> upb_okay = true;
> } else {
> upb_okay = (bounds.upb.value >= access.hi);
> }
> return lwb_okay && upb_okay;
> }
>
> You can still do this using a three input adder, and some bitmask logic.

Yes, though the problem is that this logic can be language dependent.
For example, Ada allows null arrays but other languages do not.
In Ada a null array is legal where the upper bound < lower bound.
In VAX Fortran 77, a string slice where upper bound < lower bound
str[1:0] was an error trap. Interestingly, when compiled without
bounds checks it did 'the right thing' and gave a null string.

So an Ada index check is

if (lowb <= uprb) then
cmptrapGE index, lowb
cmptrapLE index, uprb
...
end if

whereas an F77 check is

cmptrapLE lowb, uprb
cmptrapGE index, lowb
cmptrapLE index, uprb

Having a simple cmptrapCC instruction that does a single check
seemed most flexible for different languages.

This was in addition to instructions like
addto (add trap overflow), addtc (add trap carry),
multo (signed multiply trap overflow), mulutc (unsigned mul trap carry)
etc.

in order to make checking languages cheap and simple.
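
A rough software analogue of the "addto" idea, using the GCC/Clang
overflow builtins that exist today (a real instruction would fold the
check and the trap into the add itself; trap_overflow() is just a
stand-in for whatever the language runtime would do):

#include <stdint.h>
#include <stdlib.h>

static void trap_overflow(void) { abort(); }

int32_t checked_add(int32_t a, int32_t b)
{
    int32_t r;
    if (__builtin_add_overflow(a, b, &r))
        trap_overflow();
    return r;
}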

Eric

Andy (Super) Glew

unread,
Aug 14, 2013, 4:24:48 PM8/14/13
to
But the big problem is usually NOT locally allocated buffers, that the
compiler knows the size of.

The problem is buffers that the compiler does not know the size of.
Buffers that are passed in via parameters. At the minimum you would
need to spend an instruction to subtract the size - which is a win if you
do many accesses to the same buffer, but a loss if you only do one access.

Including buffers that the compiler used to compile module Foo may
expect to be aligned, but which the compiler used to compile module Bar
which calls Foo did not align.

Bakul Shah pointed this out: this is a heterogeneous programming
environment. Not every compiler makes the same assumptions.

> granted, this is a different matter from a secure ISA, where a secure
> ISA would necessarily have to protect against code produced by unknown
> compilers, vs leaving primary security enforcement to the compiler or to
> a VM layer, thus requiring more stringent checks.

even for advisory security, you have to worry about heterogeneous compilers.

E.g. many of the security bugs in Java occur in the C libraries it calls.

BGB

unread,
Aug 14, 2013, 6:40:03 PM8/14/13
to
if the language uses a high-level array type, then granted, the size
will probably be included in the array.

if the language uses fat-pointers, and allows misaligned access, then
they will probably be set up "as appropriate", meaning the limit will
likely include the adjustment already.


and, if dealing with narrow pointers, then one has a different set of
issues (this then becomes more a matter of policy).

like, a compiler could disallow them, or access them using run-time
lookups and checks, or ...


> Including buffers that the compiler used to compile module Foo may
> expect to be aligned, but which the compiler used to compile module Bar
> which calls Foo did not align.
>

then you generally have bigger problems...

mismatches like this tend to cause crashes due to "seemingly trivial"
ABI differences.


the general trend in compilers (C compilers included) tends to be to
assume aligned members unless declared otherwise.

ABIs also tend to mandate specific sizes and alignment requirements for
specific types, may mandate specific alignments for the stack, ...


like, the CPU may not mandate things to be aligned, but that doesn't
mean, for example, that a compiler is free to call something expecting
the Win64 or the SysV/AMD64 ABI, without having the stack aligned on a
16-byte boundary, obeying the relevant rules for argument passing, ...

like, a compiler may do something different internally, but its external
interface at least needs to match what is expected by the relevant ABI.


> Bakul Shah pointed this out: this is a heterogeneous programming
> environment. Not every compiler makes the same assumptions.
>

it depends on what you are expecting.

I was not saying this would be for a secure ISA, nor even necessarily
for nailing down an entire application.


more, the concern here, I think, is mostly of having reasonably cheap /
high-performance bounds-checks for arrays.

if anything, this helps trap a particularly annoying set of sometimes
hard to track-down bugs.



>> granted, this is a different matter from a secure ISA, where a secure
>> ISA would necessarily have to protect against code produced by unknown
>> compilers, vs leaving primary security enforcement to the compiler or to
>> a VM layer, thus requiring more stringent checks.
>
> even for advisory security, you have to worry about heterogeneous
> compilers.
>
> E.g. many of the security bugs in Java occur in the C libraries it calls.
>

yes, but this isn't really the JVM's problem, but rather a problem with
the C libraries called.

at least potentially though, in security-sensitive areas, someone could
develop/mandate the use of C compilers with explicit security checks
and/or include a machine-code validation layer (similar to Google Native
Client).

if the code produced doesn't follow the rules, then the validation logic
will reject it.


otherwise, the compiler only really needs to be concerned with the
integrity of the code it itself produces (that it enforces its own
semantic requirements, and that it doesn't accept invalid input or
produce invalid output, ...).


I don't really think security is the primary reason for bounds-checking
though, more just that it is a lot more useful for hunting down bugs.


otherwise, one is left with corrupt memory, and the sometimes
non-trivial task of trying to hunt down what code happens to be running
outside the array bounds...


BGB

unread,
Aug 14, 2013, 9:05:31 PM8/14/13
to
expanding on this:
for a locally-defined variable, the size can be known statically.

if a local array is passed as an "array object type", then there will
generally be an explicit array size inside the object.

in my case, C-land narrow pointers are generally handled via runtime
lookups.

similarly, passing something to C land will generally involve coercing
it, like say coercing "float[]" to "float*" may involve calculating the
address of the specific array index and passing this. if it later falls
back under control of the VM, it will try to look it up and see if it is
a recognized object (and will calculate the offset from the pointer).
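
a rough sketch of what that lookup amounts to (the helper and field names
here are hypothetical, just for illustration, not my actual VM code):

    /* array object with a header the MM knows about */
    typedef struct VmArray VmArray;
    struct VmArray {
        int  elem_size;   /* size of one element, in bytes      */
        int  count;       /* number of elements                 */
        char data[];      /* element storage follows the header */
    };

    /* hypothetical MM call: returns the header owning 'p', or NULL if
       'p' does not point into VM-managed memory */
    extern VmArray *vm_find_object(void *p);

    /* rebuild <object, index> from a raw interior pointer */
    static int vm_recover_index(void *p, VmArray **out_arr)
    {
        VmArray *arr = vm_find_object(p);
        if (!arr)
            return -1;     /* foreign pointer: caller decides what to do */
        *out_arr = arr;
        return (int)(((char *)p - arr->data) / arr->elem_size);
    }

so the narrow "float*" carries enough information, given the MM, to get
back to the bounds when it re-enters the VM.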

some code may also use pointers, where by default the VM will use "boxed
pointers" internally, but may use unbounded boxed-pointers in cases
where the underlying object is outside VM managed memory, and where the
code has sufficient rights to have access to these (only "root" code is
allowed to use raw pointers in my VM).

note that, unlike C, this VM currently uses heap-allocated local arrays.

taking the address of a local will generally give a value-reference
object, which is considered distinct from a pointer, though internally
this is currently just a box-pointer referencing a single value.


as noted, securing C land is currently a non-goal in my case, where the
current policy is to restrict direct access to C-land from untrusted
script code (currently requires API wrapping, which is its own set of
pain, though C-side security annotations are another possible strategy).

I have on occasion considered running C in my VM, but the main limiting
factor is mostly time/effort (and a few awkward specifics, likely the C
code would have to run inside of a partly sandboxed C runtime, *1).

*1: basically, it would need to make a few tweaks to fudge over some
target-specific issues, and probably use wrapped-over versions of the C
API, ... (otherwise there are a few rather ugly issues which are not
easily glossed over).

it would still be a goal to have reasonably transparent access to native
APIs though (via the FFI), and to have normal C apps able to work
with minimal modification (at least assuming it doesn't "turn ugly").


...


Andy (Super) Glew

unread,
Aug 14, 2013, 9:47:00 PM8/14/13
to
And what is the correct adjustment for an array of bytes? Or an array
of 32-bit words?

sizeof(array[0])? sizeof the base type?

But what happens if the assembly code is using SIMD packed vectors - the
adjustment may then need to be 16 bytes. Or 32 bytes. Or 64 bytes.

Or perhaps the user is calling a library that does byte manipulations,
even though the data is declared as being a word.

Sorry, that doesn't work.


>> Including buffers that the compiler used to compile module Foo may
>> expect to be aligned, but which the compiler used to compile module Bar
>> which calls Foo did not align.
>>
>
> then you generally have bigger problems...
>
> mismatches like this tend to cause crashes due to "seemingly trivial"
> ABI differences.

But sometimes they don't cause crashes.

Sometimes they cause security problems.

And these are exactly the sort of thing that we are trying to catch.

(Bad guys go looking for crashes, since very often a crash is next to a
security hole.)

The whole point of something like this is to convert silent flaws into
noisy boundscheck violations - that will probably cause the program in
question to die.

Ivan Godard

unread,
Aug 14, 2013, 10:59:47 PM8/14/13
to
Why not? Your array descriptor says it's an array of words. The library
signature says it takes an array of bytes, so you cast
row<word>->row<byte>, which is a descriptor-to-descriptor hardware
operation in the registers that changes the element size and count, and
pass that. Ditto SIMD, etc.
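
In plain C terms the recast is roughly this (a software model only, not a
real ISA; names are mine):

    #include <stddef.h>

    /* software model of an array descriptor */
    typedef struct {
        void   *base;        /* first element       */
        size_t  elem_size;   /* bytes per element   */
        size_t  count;       /* number of elements  */
    } row_desc;

    /* re-view the same storage with a different element size; the byte
       extent (elem_size * count) stays fixed, only the granularity
       changes (truncating if the sizes don't divide evenly) */
    static row_desc row_cast(row_desc r, size_t new_elem_size)
    {
        size_t bytes = r.elem_size * r.count;
        row_desc out = { r.base, new_elem_size, bytes / new_elem_size };
        return out;
    }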

The flaw in all these proposals is that you are trying to do the
checking in compiler-emitted code, instead of implementing the
higher-level concept (the array descriptor) in hardware directly.


MitchAlsup

unread,
Aug 15, 2013, 12:00:27 AM8/15/13
to
On Wednesday, August 14, 2013 9:59:47 PM UTC-5, Ivan Godard wrote:

> The flaw in all these proposals is that you are trying to do the
> checking in compiler-emitted code, instead of implementing the
> higher-level concept (the array descriptor) in hardware directly.

Right, and now let's automagically convert <C> pointers into descriptors
and use descriptor accesses to memory instead of untyped loads and stores.

Ivan Godard

unread,
Aug 15, 2013, 12:46:41 AM8/15/13
to
Hardly automagically. Magically perhaps, but not auto :-) Real work,
including dumping existing unsafe ISAs in favor of a secure
architecture. There is nothing in *standard* C that can't be safely
supported on a descriptor machine. Of course, many C codes that "work"
will break because they are non-standard and buggy.

Andy is trying to lipstick the x86 pig. I think the result is still a
pig, and no one will pay for the lipstick.

YMMV.

Andy (Super) Glew

unread,
Aug 15, 2013, 1:46:25 AM8/15/13
to
Actually, I was trying to lipstick the C / C++ pig. Perhaps also
assembly language - but I see no reason why the MPX approach should be
limited to x86. Some of the earliest of these ideas were worked out
when I was at Gould, thinking about NP3 - something like an IBM 360 ISA,
stripped down. Then at Motorola, thinking about the M88K RISC.

BGB

unread,
Aug 15, 2013, 2:07:44 AM8/15/13
to
yeah...

and, in the case of the compiler:
it knows the types, so it knows the adjustments;
it is no more a matter for the compiler to know how to adjust for
things, than it is to know which 'mov' variant is needed, or which type
of register to use, or ...

hence, why my idea here was so minimal.


OTOH, if the hardware knew about the higher-level type-system, there
wouldn't be as much of an issue either. in this case, the hardware would
take on part of the role currently handled by a VM (be it JVM or .NET or
whatever else).

the big potential drawbacks:
the need to introduce a lot more complexity into the hardware, or a need
to get firmware involved in the basic operation of the CPU;
problems in the choice of abstractions could be considerably more
painful than in a software-side VM.

problems will be more far-reaching and harder to fix, leading to a
higher risk of being stuck with an awkward/broken/under-performing
design, and having less ability to fix it due to the existing deployed
base. whereas, software can generally be upgraded more easily.


pretty much any other way, the hardware and compiler need to "cooperate"
in some sense, like back to the "Native Client" example:
NaCl uses x86, but sets some fairly narrow rules for what is allowed,
basically keeping code inside of a fairly arbitrarily defined sandbox;
raw x86 could easily escape the sandbox, but code which would escape
would not get through the validation logic.

taken another way:
one could define a "secure" ISA subset, with various mandatory
conditions, ... and any code which does not follow the defined rules, is
rejected (even if technically otherwise valid for the CPU's ISA).

in this case, the hardware would have a sense of what a secure program
looks like, even if it doesn't fully understand its semantics, and OTOH,
the compiler would need to encode its intentions in a form the hardware
understands, and for which it is possible to validate.


I had considered a few ideas along vaguely NaCl like lines in the past,
using a restricted x86 subset, and some amount of metadata and
pseudo-instructions, but didn't get very far with this (I later
concluded that it, ultimately, has a poor set of cost/benefit tradeoffs
vs a bytecode-based strategy, *1).

*1: big drawback: still not very portable, and would require multiple
binaries, and/or introduce an intermediate translation step. while
workable, the use of a translation step would essentially kill most
advantages over just using a bytecode (and, x86 isn't really a great
design for a bytecode).

granted, I have since come to the opinion that .NET and the JVM aren't
really the ideal solution to the problem either, and if-anything, Dalvik
is probably a little closer (though it still has its own issues).
(what I am imagining would sort of be partway between Dalvik and LLVM in
some regards).


but, if we assume that a VM layer is used, what should it look like?

I suspect that the VM needs to be reasonably low-level, but still
high-level enough to be able to produce reasonably efficient code over a
range of targets, and with sufficient abstractions to be able to verify
that the code behaves within reasonable limits (IOW: static validation
is possible, with a usable definition of "safe").

otherwise, it should be sufficiently low-level as to be able to
adequately express most useful constructs. namely, the VM doesn't have
to "express everything", or have every possible feature built in, rather
it has to do something the hardware has usually done pretty well at: not
get in the way too much (and allow some amount of freedom in terms of
HLL structure and semantics).

however, for effective use, with a lower-level VM, this creates a bit
more need for a well-defined "meta" layer. while a high-level VM has
little need for much of a meta-layer (since there are few options for
how to do things), it is necessary (as abstraction gets lower) to
introduce more conventions for how semantics are represented, and how
certain types of things should ideally be done.

IOW: the VM ends up with another layer of ABI.

this would define, to some extent, how language features should be
represented and constructed, and how metadata (such as function/method
signatures, ...) should be represented (some of this would then be
partly abstracted from the underlying VM supplied mechanisms, with the
VM partly taking a more "hands off" approach to a lot of this, *2).


*2: for example, whether the VM provides built-in class/instance
objects, interfaces, virtual method dispatch, ... or instead something
more like C structs and function pointers, and an assumption that people
will use these to build their own interfaces and virtual method
dispatcher...

granted, if handled poorly, all this won't really solve *anything*...


not sure how much detail I should go into for what I am imagining, but
it probably shouldn't be too hard to make a guess...


all this doesn't mean, in any sense, that I expect native code will be
going away any time soon.


BGB

unread,
Aug 15, 2013, 2:47:29 AM8/15/13
to
yeah, and I was thinking more about largely ignoring the hardware and
building something on top, mostly hoping for features to be able to make
typical run-time checks a little cheaper (shaving off a few instructions
here and there, ...).


but, yeah, in my assumed model, the compiled C code probably wouldn't
use raw pointers for array pointers, but probably fat-pointers or
explicit array-type objects where needed, and probably narrow pointers
in cases where this can be proven safe.

it may then raise the question:
"but, hey wait, you also propose close interaction with natively
compiled code, which only uses narrow pointers?!".

this is its own issue, but (in my experience with my script language)
this is a little easier to gloss over than it may at first seem (generally
because, the GC still knows about the memory types/layout/..., so the
extended information may generally be recovered from the narrow pointer).

most of the specifics could likely be pretty well hidden.
and there will probably be at least some code which would break.


but, the way, for example, that C traditionally works, is not
necessarily the only way it *can* work.


better solutions?... dunno...

partly it is about tradeoffs...


> YMMV.
>

Ivan Godard

unread,
Aug 15, 2013, 3:18:43 AM8/15/13
to
This sounds much like a (Java for example) virtual machine. Azul did a
good job of that, but couldn't make a go of a Java-only environment.

I agree that putting a VM in hardware is overkill, but security does not
require a VM, or even a target language. Neither does security require a
type system. It requires a cheap, easy to use way to let you at the bits
you are supposed to be able to use, while keeping you out of the bits
that you are not supposed to be able to see much less touch. What you do
with the bits that you can see and touch is your (and the language you
are using's) business, and the hardware should neither care nor get in
your way.

Ivan

Andy (Super) Glew

unread,
Aug 15, 2013, 4:36:42 AM8/15/13
to
On 8/14/2013 11:07 PM, BGB wrote:
> On 8/14/2013 9:59 PM, Ivan Godard wrote:
>> On 8/14/2013 6:47 PM, Andy (Super) Glew wrote:

>>>> if the language uses fat-pointers, and allows misaligned access, then
>>>> they will probably be set up "as appropriate", meaning the limit will
>>>> likely include the adjustment already.
>>>
>>> And what is the correct adjustment for an array of bytes? Or an array
>>> of 32-bit words?
>>>
>>> ... what happens if the assembly code is using SIMD packed vectors - the
>>> adjustment may then need to be 16 bytes. Or 32 bytes. Or 64 bytes.
>>
>> Why not? Your array descriptor says it's an array of words. The library
>> signature says it takes an array of bytes, so you cast
>> row<word>->row<byte>, which is a descriptor-to-descriptor hardware
>> operation in the registers that changes the element size and count, and
>> pass that. Ditto SIMD, etc.
>>
>> The flaw in all these proposals is that you are trying to do the
>> checking in compiler-emitted code, instead of implementing the
>> higher-level concept (the array descriptor) in hardware directly.

You have to draw a line somewhere.

I drew the following lines (even before I worked on this stuff at Intel,
dating back to Gould):

No descriptors that do not fit in a relatively small amount of memory -
2-4-8X the pointer size.

No variable size descriptors. (I can imagine relaxing this, to have 2
sizes of descriptors. But fixed width makes many things easier.)

No struct descriptors.

No multidimensional array descriptors.

Byte granular bounds of arbitrary alignment (motivated by C char[], if
nothing else)

The minimal descriptor is
uint64_t lwb
uint64_t upb
plus a few control bits: R, R/W, X, ...
Plus the backpointer, if you want to avoid wasting space by address scaling.
uint64_t backpointer_address
It looks like Intel MPX chose to waste memory, not have a backpointer,
but instead have
uint64_t pointer_value

Without the backpointer, it is tempting to try to squeeze the control
bits into unused upper bits of the bounds. But if you exceed 128 bits,
you might as well round up to 256 bits - 4X a 64-bit pointer.

If you have bounds and pointer_value and backpointer, ugh - you have to
squeeze the control bits into the 5 unused lower bits of the descriptor.
Assuming that 256-bit/32-byte descriptors are aligned.
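
Roughly, in C, the two 32-byte flavors look like this (field names and
control-bit placement are mine, not any real ABI's):

    #include <stdint.h>

    typedef struct {        /* backpointer flavor */
        uint64_t lwb;       /* lowest addressable byte            */
        uint64_t upb;       /* highest addressable byte           */
        uint64_t backptr;   /* metadata back-link                 */
        uint64_t ctrl;      /* R/W/X etc.; could instead hide in
                               the 5 spare low bits if 32-byte
                               alignment is guaranteed            */
    } desc_backptr;

    typedef struct {        /* MPX-style flavor, as described above:
                               no backpointer, the checked pointer
                               value rides along instead           */
        uint64_t lwb;
        uint64_t upb;
        uint64_t pointer_value;
        uint64_t reserved;
    } desc_mpx_like;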

Orthogonal R/W/X uses 3 bits already.

Non-orthogonal - R, RW, RWX, RX, X - helps, but not much.

This doesn't leave many bits for types.

Even stealing 8 bits from the top of all pointers doesn't free up much.

Not enough for a decent type system.

Besides, I still want to leave types to software





=> No knowledge of types.

If the user wants to create a packed array of 13 bit integers, let him. I
won't give him bit granular bounds, but byte granular.

If the user wants to treat floating point numbers as addresses, let him.
The problem with type checking is that the number of user defined scalar
types is potentially infinite - if you want to prevent floating point
being accidentally cast as an address, why not prevent typedef uint16_t
color being misinterpreted as typedef uint16_t weight. Dimensional
analysis.

Type checking could be the next phase. The non-adjacent metadata could be
made into a facility for SW to use. But SW can imbue the metadata tag
bits with type knowledge.


If you are forced to jump to 8x64b descriptors, maybe.

But I still want to leave types to software.


My priorities are:

(1) bounds checking

(2) dead pointers (hard to do without another pointer; some proposals
use, say, 16 bits of version number)

(3) taint propagation (e.g. for SQL injection)

Type checking is lower priority. Based on studies of security
breaches. Of course, stats may have changed since I last looked at them.


Other, orthogonal, lines of evolution for security related instruction sets:

(A) integer overflow (big cause of buffer overflows)

(B) ...

heck, you can google.

Andy (Super) Glew

unread,
Aug 15, 2013, 4:42:43 AM8/15/13
to
On 8/15/2013 12:18 AM, Ivan Godard wrote:

> I agree that putting a VM in hardware is overkill, but security does not
> require a VM, or even a target language. Neither does security require a
> type system. It requires a cheap, easy to use way to let you at the bits
> you are supposed to be able to use, while keeping you out of the bits
> that you are not suppose to be able to see much less touch. What you do
> with the bits that you can see and touch is your (and the language you
> are using's) business, and the hardware should neither care nor get in
> your way.
>
> Ivan

I couldn't agree more.

But I thought you were telling me that array descriptors with base type
and dimensions were a good idea, a few posts ago.

The minimalist "bits you are allowed to touch" is a contiguous region
[lwb,upb]. Plus maybe R/W/X...

If you want discontiguous regions, that's what pointers are for.

Ivan Godard

unread,
Aug 15, 2013, 5:52:43 AM8/15/13
to
On 8/15/2013 1:42 AM, Andy (Super) Glew wrote:
> On 8/15/2013 12:18 AM, Ivan Godard wrote:
>
>> I agree that putting a VM in hardware is overkill, but security does not
>> require a VM, or even a target language. Neither does security require a
>> type system. It requires a cheap, easy to use way to let you at the bits
>> you are supposed to be able to use, while keeping you out of the bits
>> that you are not suppose to be able to see much less touch. What you do
>> with the bits that you can see and touch is your (and the language you
>> are using's) business, and the hardware should neither care nor get in
>> your way.
>>
>> Ivan
>
> I couldn't agree more.
>
> But I thought you were telling me that array descriptors with base type
> and dimensions were a good idea, a few posts ago.
>
> The minimalist "bits you are allowed to touch" is a contiguous region
> [lwb,upb]. Plus maybe R/W/X...
>
> If you want discontiguous regions, that's what pointers are for.
>
>
>
>
>
>
They are a good idea. So is protection. *Different* ideas.

MitchAlsup

unread,
Aug 15, 2013, 12:09:50 PM8/15/13
to
On Thursday, August 15, 2013 3:36:42 AM UTC-5, Andy (Super) Glew wrote:
> The minimal descriptor is
> uint64_t lwb
> uint64_t upb
> plus a few control bits: R, R/W, X, ...
<snip>
> This doesn't leave many bits for types.

What do you lose in protection if you only check the cache line of
the upper and lower bounds (as accessible)? This gives you 12 more
bits to type around with.

Mitch

BGB

unread,
Aug 15, 2013, 1:19:40 PM8/15/13
to
loosely, yes.


the big problem with the JVM is basically that it works pretty well for
one language (Java), but does pretty awful for most everything else.

they have made extensions to make it work better for dynamic languages
though, so now it is at least half-usable for things like Jython or JRuby.

some people have managed to get C to compile on it, but it isn't really
all that usable.


.NET is a little better here, in that it works pretty well for languages
which have similar semantics to C#, and has a little flexibility WRT
building language features.

kind of falls a bit flat when it comes to C++/CLI though.

design downsides:
.NET CIL is not well suited to efficient interpretation;
it leaves a fair bit of work to be done in the backend
(requiring a more complex JIT).


there is Dalvik, which does some interesting stuff, but has a drawback
in that it has basically similar features/semantics to the JVM.

Dalvik uses a 3-address register machine with explicit types on opcodes
(in contrast to a stack machine model, like that in the JVM or .NET).


OTOH, there is LLVM, which has the advantage that it does pretty well
for C and C++, but a few drawbacks:
it doesn't really handle implementing high-level language constructs
very well;
LLVM bitcode isn't really ideal as a program distribution format (it is
closer to being a packed tree format than a proper bytecode, and is in
many regards fairly specific to LLVM), and isn't really ideal WRT
semantics (IMHO, 3AC SSA-form isn't ideal for distribution), ...

the above is worth mentioning because some projects (such as PNaCl) are
using LLVM bitcode as their program distribution format.


the idea would basically be to do something sort of like LLVM, just
using a more Dalvik-like instruction format, and including built-in
support for a few "weak areas" for LLVM, namely garbage-collection and
run-time type-tagging. similarly, including support for things like
bytecode validation, and also by default using bounds-checked arrays and
pointers.

also it makes sense to include a few things which are vaguely useful but
are problematic to try to bolt-on, namely tail-call optimization and
(exit only) continuations.


so, the goal would be something which can effectively handle a range of
languages (such as C, Java, JavaScript, and Scheme), ...

not that it would involve trying to design in every feature from every
language, but more in that it would try to provide basic facilities to
allow reasonably-effectively implementing these features.


a potential drawback would be that it would be relatively unknown, and
an uphill battle to be competitive with existing VMs.


> I agree that putting a VM in hardware is overkill, but security does not
> require a VM, or even a target language. Neither does security require a
> type system. It requires a cheap, easy to use way to let you at the bits
> you are supposed to be able to use, while keeping you out of the bits
> that you are not suppose to be able to see much less touch. What you do
> with the bits that you can see and touch is your (and the language you
> are using's) business, and the hardware should neither care nor get in
> your way.
>


there are two concerns here:
security;
type-safety (mostly to help catch bugs).

VMs also offer a certain level of hardware abstraction:
the same binary can be used on different pieces of hardware, and usually
also abstract over the OS, reducing the need to recompile;
typically, they also help facilitate loading and compiling code at
runtime, ...


but, raw native code also has its advantages (high-performance,
generality, not needing to depend on a specific VM implementation or use
a specific language, ...).

and, putting a VM in hardware sort of turns into the worst of both worlds.


> Ivan

Robert Swindells

unread,
Aug 15, 2013, 1:48:38 PM8/15/13
to
On Thu, 15 Aug 2013 12:19:40 -0500, BGB wrote:

> so, the goal would be something which can effectively handle a range of
> languages (such as C, Java, JavaScript, and Scheme), ...

Don't the cool kids just compile everything to JavaScript these days ?

BGB

unread,
Aug 15, 2013, 3:12:43 PM8/15/13
to
how many bits are needed for types is variable IME.

if encoding types directly, 12 bits can't do a whole lot.
if those 12 bits index a type-definition somewhere else, then it can do
a fair bit more (but is still a little skimpy depending on the types of
types it may index).

another option is allowing a lot of the information to actually be
located in object-headers, rather than in the pointer.


as can be noted from before, with 64 bit references:
4 bits tag, 60 bits data (simple/direct pointers/values);
4+12 bits tag, 48 bits data (pointers/values with more complex types).

granted, this scheme does limit the maximum theoretical address space.

it also has a drawback of requiring all bounds-checked arrays and
pointers to be copied around as memory-objects.



another option could be just trying to shove everything into a 128-bit
register (when it fits), and if it doesn't fit, overflowing to a header.

this means that the compact case consists of a pointer+metadata,
possibly with limits to the maximum array sizes, ...

say: 48 bits=address, 24 bits=array size, 24 bits=array offset, 16
bits=element type, 12 bits=element size, 4 bits=tag.

if it no longer fits, then it overflows to a header stored off in
memory, essentially with the 128-bit pointer holding a pointer pair (A
points to the address, B points to a header describing the array).

the tag could encode the reference type, with other types used for
various other types of values.
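
a rough sketch of the compact case (the exact bit placement here is
arbitrary, just one way of making the fields above add up to 128 bits):

    #include <stdint.h>

    typedef struct {
        uint64_t lo;   /* bits [47:0] address, [63:48] element type     */
        uint64_t hi;   /* [23:0] array size, [47:24] array offset,
                          [59:48] element size, [63:60] tag             */
    } vm_ref128;

    static void *ref_addr(vm_ref128 r)         { return (void *)(uintptr_t)(r.lo & ((1ull << 48) - 1)); }
    static unsigned ref_elem_type(vm_ref128 r) { return (unsigned)(r.lo >> 48); }
    static unsigned ref_size(vm_ref128 r)      { return (unsigned)(r.hi & 0xFFFFFF); }
    static unsigned ref_offset(vm_ref128 r)    { return (unsigned)((r.hi >> 24) & 0xFFFFFF); }
    static unsigned ref_elem_size(vm_ref128 r) { return (unsigned)((r.hi >> 48) & 0xFFF); }
    static unsigned ref_tag(vm_ref128 r)       { return (unsigned)(r.hi >> 60); }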


a bit of a drawback is that making all pointers be 128 bits could be a
bit expensive, when only a certain small percentage of pointers actually
need explicit bounds-checking metadata.

maybe 60% of the time, for arrays, it is possible to optimize away the
need for this metadata. meanwhile, things like object-references and
struct pointers almost never need bounds checking.


256 bits per pointer seems a bit severe...


> Mitch
>

BGB

unread,
Aug 15, 2013, 3:30:35 PM8/15/13
to
if targeting a web-browser...

otherwise, compiling to JavaScript is sort of a "pile of crap" solution...

the only real reason to consider it is that JS is one of the few things
that browser people can manage to generally agree on.

otherwise, if not targeting a browser, there is probably little reason
to really take it seriously.


this is not to say though that such a VM couldn't include a backend for
compiling to JS for the sake of running in a browser, but on that
front one could just as easily also include a backend for compiling to
JVM ByteCode and shoving the results into ClassLoader...


otherwise, it makes a lot more sense to JIT compile directly to native code.


Bakul Shah

unread,
Aug 16, 2013, 12:27:53 AM8/16/13
to
On 8/14/13 9:46 PM, Ivan Godard wrote:
> On 8/14/2013 9:00 PM, MitchAlsup wrote:
>> On Wednesday, August 14, 2013 9:59:47 PM UTC-5, Ivan Godard wrote:
>>
>>> The flaw in all these proposals is that you are trying to do the
>>> checking in compiler-emitted code, instead of implementing the
>>> higher-level concept (the array descriptor) in hardware directly.
>>
>> Right and now lets automagically convert <C> pointers into descriptors
>> and use descriptor accesses to memory instead of untyped loads and stores.
>>
>
> Hardly automagically. Magically perhaps, but not auto :-) Real work, including dumping existing
> unsafe ISAs in favor of a secure architecture. There is nothing in *standard* C that can't be safely
> supported on a descriptor machine. Of course, many C codes that "work" will break because they are
> non-standard and buggy.

Maybe you can represent a C ptr as a <descriptor, offset>.
sizeof(your intptr_t) will be 128 bits! Only code that did
(long)ptr or (int)ptr will break, but that is already
considered broken by the C std, and anyway Mill doesn't have
to worry about dusty decks! Now you can check bounds in
hardware.
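
Something like this (very rough, names and trap policy are mine):

    #include <stddef.h>
    #include <stdlib.h>

    typedef struct {
        char  *base;     /* start of the object            */
        size_t limit;    /* size of the object in bytes    */
        /* access rights could live here too */
    } obj_desc;

    typedef struct {
        obj_desc *d;     /* which object                   */
        size_t    off;   /* byte offset within the object  */
    } c_ptr;             /* sizeof(c_ptr) == 16 on LP64    */

    static char *c_ptr_deref(c_ptr p, size_t nbytes)
    {
        if (p.off + nbytes > p.d->limit)
            abort();     /* stand-in for a hardware bounds trap */
        return p.d->base + p.off;
    }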

Ivan Godard

unread,
Aug 16, 2013, 12:58:42 AM8/16/13
to
Actually we very much need to worry about dusty decks, at least those
that are standard-conforming, and preferably those that are not but are
at least reasonable. Unfortunate fact of commercial life :-(

You guys have about convinced me that partial bounds-checking is at
least worth supporting. It won't be security; that needs a different
mechanism. But it will catch a lot of the egregious bugs, and should
have no space or time cost compared to the usual rubbish.

I put it in the memory portmanteau filing, which is (very slowly)
getting finished, and it will be in the corresponding presentation if
there's time left after more important matters. I'm still not convinced
that it should be in the hardware; you can tell me what you think when
you see it.

Ivan



BGB

unread,
Aug 16, 2013, 2:02:25 AM8/16/13
to
pretty much should be workable...

I have actually partly faked C pointers in a similar way when kludging C
code over to C# or Java, so little is to say that an actual C compiler
couldn't do something similar.


the bigger question at this point is mostly whether it is better to
always use 128-bit pointers, or attempt trickery to remain with narrower
64-bit pointers (when possible).

always 128-bit pointers:
pro: the pointer will always be the same size;
con: may waste memory in cases where it is unnecessary.

always 64-bit pointers, with static box-pointers (*):
pro: the pointer will always be the same size;
pro: saves memory when sufficient;
con: requires in-memory value-objects when not sufficient;
con: risks worse performance (additional memory operations and copying);
con: depending on implementation, may risk exposing implementation
details (such that the pointer is essentially a struct pointer
rather than pointing directly at the value).

*: this is essentially what my current VM uses.
explicit pointer-types pretty much always use box-pointers, but,
luckily, explicit pointers are also fairly uncommon in my
script-language (vs C).


speculative use of 64/128 bit pointers:
pro: matches width requirements per-type;
con: pointers are not always the same size;
con: uncertain cases will likely require just using 128-bit pointers
anyways (out of safety);
con: may end up looking semantically like the always 128 case (otherwise
wackiness will ensue).


alternate sub-case, explicit use of pointer-size indicators:
pro: avoids the funkiness of speculative behavior;
pro: allows the programmer to indicate intended use;
con: essentially not too much different from near/far pointers, making
an additional burden for the programmer.


sub-case: always 64-bit pointers, but with box-pointer as a dynamic
escape-case:
pro: the pointer will always be the same size;
pro: saves memory when sufficient;
con: may introduce higher performance overhead due to run-time checks
(do we have a raw pointer, or a boxed-pointer? see the sketch below);
con: depending on specifics, logical pointer-values may be nonsensical,
or require dynamic checking and coercion in terms of pointer arithmetic.


...
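
a minimal sketch of that escape-case check (the low-bit tagging scheme is
made up here, and assumes objects are at least 2-byte aligned):

    #include <stdint.h>
    #include <stddef.h>

    typedef struct {
        void  *addr;     /* real target address       */
        size_t size;     /* bounds / extra metadata   */
    } ptr_box;

    #define BOXED_BIT ((uintptr_t)1)

    static void *ptr_target(uintptr_t p)
    {
        if (p & BOXED_BIT) {                      /* boxed: one extra load */
            ptr_box *box = (ptr_box *)(p & ~BOXED_BIT);
            return box->addr;
        }
        return (void *)p;                         /* raw: no overhead      */
    }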

Bakul Shah

unread,
Aug 17, 2013, 2:48:54 PM8/17/13
to
By dusty decks I meant old *compiled* programs. These don't exist
for the Mill. C source programs must be ported. C is "portable"
but you do have to "port" and not just recompile! One porting
trick I learned is to *remove* all casts and carefully examine
resulting warnings! And to distrust union variables and bitfields
(reverse endianness and a bitfield may even be split in two!).

Basically I was saying you can implement standard conforming C
while using bounds / access rights checking descriptors. You
just can't do it as efficiently as on present architectures!

>
> You guys have about convinced me that partial bounds-checking is at least worth supporting. It won't
> be security; that needs a different mechanism. But it will catch a lot of the egregious bugs, and
> should have no space or time cost compared to the usual rubbish.

Consider:

extern int f(char* t);
g(char* s) { ... f(s+off); ... }

If you always do bounds checking, you have to pass where s starts
and ends. In effect:

extern int f(char* t, char* t_base, char* t_limit);
g(char* s, char* s_base, char* s_limit) {
    ...
    f(s+off, s_base, s_limit);
    ...
}

Alternately each ptr becomes <addr,offset> and you store base,
limit and any access rights at addr. Either way, it is not cost
free but if you want a safer C what other choice do you have?

Ivan Godard

unread,
Aug 17, 2013, 3:56:52 PM8/17/13
to
Wait and see :-)

Ivan

p.s. We'd do the filing and publishing faster if we could, but I have to
admit it's a lot of fun tantalizing you all. Kind of like teasing
children before Christmas :-)


Andy (Super) Glew

unread,
Aug 20, 2013, 9:00:21 PM8/20/13
to
Stack frames are not necessarily cache line aligned.

More than half of the buffer overflows in the standard suite would not
be handled by cache line granular bounds.

Unless you changed your compiler's memory allocation policies.

(Mallocs are often (usually) cache line granular. But not other things.)

BGB

unread,
Aug 21, 2013, 11:43:53 AM8/21/13
to
On 8/20/2013 8:00 PM, Andy (Super) Glew wrote:
> On 8/15/2013 9:09 AM, MitchAlsup wrote:
>> On Thursday, August 15, 2013 3:36:42 AM UTC-5, Andy (Super) Glew wrote:
>>> The minimal descriptor is
>>> uint64_t lwb
>>> uint64_t upb
>>> plus a few control bits: R, R/W, X, ...
>> <snip>
>>> This doesn't leave many bits for types.
>>
>> What do you lose in protection if you only check the cache line of
>> the upper and lower bounds (as accessible)? This gives you 12 more
>> bits to type around with.
>>
>> Mitch
>>
>
>
> Stack frames are not necessarily cache line aligned.
>
> More than half of the buffer overflows in the standard suite would not
> be handled by cache line granular bounds.
>

this could be mandated at the ABI level.

if it works on 16-byte alignment, this will work with Win64, SysV/AMD64,
and GCC versions of x86 cdecl, provided the compiler has some means of
indicating array bounds.

32 or 64 bytes could also be doable, but a little more painful, as IME
objects in the 16-32 byte range are pretty common on the heap (small
structs, ...), and bounds-checking would necessarily force them up to
the minimum size.


does make me idly wonder if the allocation strategy I use for my memory
manager could apply to hardware:
off, somewhere, there is a table representing memory in terms of a big
bitmap, say, with 2 or 4 bits for every N bytes (16 or 256 in my MM);
when hardware wants to validate an access, it can validate that the
start and end lie on the same object:
leatrap rdx, [rax+8*rcx]

trap if RDX falls in a different object from RAX.


assuming we can do something SIMD-like (and the pointers are within a
certain range):
calculate byte-base for cell;
shift vector right to align with the intended cell;
mask off for pointer distance;
check value against expected bit pattern;
trap if pattern doesn't match.

if we have a 512-bit vector, 2-bits per cell, and 16 byte cells, this
could handle around 4kB (4032 bytes, to account for byte-alignment and
shifting).

the expected pattern would be all cells as data.

example:
0=free (unused memory);
1=head (object header / new-object);
2=data (object data);
3=reserved (locked/invalid memory).

then a "good" run will look something like:
0xAAAAAAAA_AAAAAAAA_...
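
a scalar version of the check might look something like this (bitmap
layout and names are made up for illustration; the SIMD version would
just test a whole vector of cells at once):

    #include <stdint.h>
    #include <stddef.h>

    /* 2 bits per 16-byte cell: 0=free, 1=head, 2=data, 3=reserved */
    enum { CELL_FREE = 0, CELL_HEAD = 1, CELL_DATA = 2, CELL_RSVD = 3 };

    #define CELL_SHIFT 4    /* 16-byte cells */

    static unsigned cell_type(const uint8_t *bitmap, uintptr_t heap_base,
                              uintptr_t addr)
    {
        uintptr_t cell = (addr - heap_base) >> CELL_SHIFT;
        return (bitmap[cell >> 2] >> ((cell & 3) * 2)) & 3;
    }

    /* an access of 'len' bytes at 'addr' passes only if it starts inside
       an object and every following cell it touches is still data */
    static int access_ok(const uint8_t *bitmap, uintptr_t heap_base,
                         uintptr_t addr, size_t len)
    {
        uintptr_t first = (addr - heap_base) >> CELL_SHIFT;
        uintptr_t last  = (addr + len - 1 - heap_base) >> CELL_SHIFT;

        unsigned t = cell_type(bitmap, heap_base, addr);
        if (t != CELL_HEAD && t != CELL_DATA)
            return 0;                      /* not inside any object */

        for (uintptr_t c = first + 1; c <= last; c++) {
            unsigned tc = (bitmap[c >> 2] >> ((c & 3) * 2)) & 3;
            if (tc != CELL_DATA)
                return 0;                  /* crossed into another object */
        }
        return 1;
    }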

longer range-checks would be harder, but could be handled like:
do check for the first span, align end with byte;
load/check whole vector for each span (say, 4096 bytes);
do check for final span.

first to trap results in a trap.

alternately, it could be acceptable to impose a hard limit of 4kB or so,
and mandate 256-byte cells for larger runs.

a 512 bit vector would then cover 64512 bytes.


possibly the bitmaps could be managed with a vaguely page-table like
structure, with one page of bitmap managing 256kB or 4MB of memory.


alternatively: at 128 byte cells, 4kB will require 64 bits for a bitmap,
allowing the structure to function much more like a page-table (if,
albeit, 128 bytes is a bit coarse).

256 byte cells would give 4kB in 32 bits.

32 and 64 bits would manage 256 and 512 bytes with 16 byte cells.


to use the feature, the OS and/or the MM would need to manage these bitmaps.

for stack-frames, it is likely that there would need to be some
ABI-level bit-twiddling, with each function containing a checked-array
indicating the layout of its stack-frame to the MM, which would be
written into the bitmap (hidden runtime function call?...).


> Unless you changed your compiler's memory allocation policies.
>
> (Mallocs are often (usually) cache line granular. But not other things.)
>

yeah.


side-note: my MM/GC actually uses 8 bits per bitmap entry, but oh well
(2 and 4 bits were used in the past, but I needed more state).

example:
2 bits: cell-type;
2 bits: mark state (0=white, 1=black, 2=grey, 3="extra-black");
4 bits: cell meta type (special per-cell state, related to GC semantics,
like "locked cells", precise vs conservative marking, ...).

each object has a header which gives additional information (type of
object, ...).

