
Hardware Architecture and Self-Modifying Code


Rick C. Hodgin

Aug 26, 2016, 9:10:45 AM

As a software developer, I've always considered self-modifying code to be
almost essential for high-speed throughput and performance. In fact, when
I sat down to begin designing my CAlive compiler, I had implicitly resolved
in the back of my mind that I would introduce self-modifying code ... and
not just in the concept of LiveCode (edit-and-continue), but rather within
each generated function so that certain aspects of it can be re-configured
after the first pass, so that subsequent passes do not incur the "first
pass check" burdens, and so on.

However, as I've begun to get into the deep details of designing my LibSF
386-x40 hardware, I'm beginning to realize a fundamental flaw in SMC, even
with hardware support: the multi-threaded nature of code pretty much
precludes it (even with traditional hardware support).

I had previously planned to introduce a type of memory layering which, per
task, resolves fundamentally to the original binary code block, but then
also holds explicit pointers to small memory blocks which overlay portions
of the original code, allowing instruction fetch to retrieve the altered
code rather than the original code, based on the data pointed to by those
pointers. I call these SMC registers and SMCBs (SMC blocks).
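
(For concreteness, here is a rough C sketch of the bookkeeping one SMCB
entry implies, as described above. This is purely hypothetical - no
encoding for these registers has been published - and the field names
are invented.)

#include <stdint.h>

/* Hypothetical sketch only: one SMCB entry as the text above describes it.
 * An SMC register would point at a table of these, and instruction fetch
 * would consult them before address translation. This is not a real ISA
 * definition. */
typedef struct {
    uint64_t code_offset;   /* offset into the original binary code block */
    uint32_t length;        /* number of instruction bytes overlaid       */
    uint64_t overlay_addr;  /* where the replacement bytes live           */
} smcb_entry;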

This solution works great, and appears as though it will be fairly easily
implemented ... but then I got into considering the software-side of things,
and real run-time environments, and those cases where there is a single
task operating on multiple threads simultaneously, each of which might be
in the same SMC function simultaneously, and each of which might need to
hold different SMC registers pointing to different SMCBs. I had no solution
as the SMC registers are maintained with the task state, and are not given
to separate threads.

As such, I will introduce a hardware threading model which allows for each
thread to possess its own register state, and for the OS and user apps to
be able to perform thread-context switches which happen with a full machine
state which may be different than the overall task state.

-----
I was wondering if anyone else has a better solution - one which would
allow for SMC and multiple threads without the need for these new hardware
facilities?

TYIA!

Best regards,
Rick C. Hodgin

Joe Pfeiffer

Aug 26, 2016, 11:44:58 AM

"Rick C. Hodgin" <rick.c...@gmail.com> writes:

> As a software developer, I've always considered self-modifying code to be
> almost essential for high-speed throughput and performance.

I believe this to be an *extremely* idiosyncratic view.

Rick C. Hodgin

Aug 26, 2016, 11:49:28 AM

FWIW, my considerations in this area are not limited by what exists at our
disposal, but are governed instead by what is needed. In all of my hardware
design areas, I am attempting to look at fundamentals and address their
needs. I am trying very hard to not be limited by what already exists,
outside of the recognition of its existence and a personal desire to not
cause all of that investment in labor and thought to be wholly abandoned
in migration. It's why LibSF 386-x40 will support most 32-bit 80386 code
in binary form, so that most apps which can be compiled to a 32-bit
version on x86 will run natively, but to also extend it out to 40-bits,
and to introduce a new ISA which has these new features. It allows a
migration over time.

At least those are my goals. I may fail in achieving all of them. The
early polls suggest it's at least a possibility that I will succeed.
Time will tell (James 4:15).

Stefan Monnier

Aug 26, 2016, 12:02:02 PM

> As a software developer, I've always considered self-modifying code to be
> almost essential for high-speed throughput and performance.

That seems to fly in the face of everything I know about
computer architecture. Care to give some argument for why you think
this way?


Stefan

Ivan Godard

Aug 26, 2016, 12:03:02 PM

I think he means SMC in the sense of a JIT, not in the sense of (for
example) patching the return branch of a call with the offset of the
return (as was done in some early Fortran).

JIT solves the thread problem by giving each thread its own version
instead of patching-in-place. However, I don't really see the gain from
his registers over simply keeping a few (mutable) function pointers around.
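
(A minimal C sketch of the mutable-function-pointer alternative Ivan
mentions; all names are invented for the example. The first call takes the
slow path, which retargets the pointer, so later calls skip the first-pass
work without modifying any code. In a multi-threaded program the pointer
store would need to be atomic, which is exactly the atomicity problem
raised elsewhere in this thread.)

#include <stdio.h>

static int compute_slow(int x);
static int compute_fast(int x) { return x * 2; }

/* Mutable dispatch pointer: starts out at the first-pass version. */
static int (*compute)(int) = compute_slow;

static int compute_slow(int x)
{
    /* ... one-time "first pass check" work would go here ... */
    compute = compute_fast;       /* retarget all future calls */
    return compute_fast(x);
}

int main(void)
{
    printf("%d\n", compute(21)); /* slow path; retargets the pointer */
    printf("%d\n", compute(21)); /* fast path from here on */
    return 0;
}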

Rick C. Hodgin

Aug 26, 2016, 12:26:14 PM

I don't know if you'll answer me or not, Ivan. I hope you will. :-)

My solution may not be a good one. It's the one I've considered and come
up with so far. My hope in asking for help is to find a better solution,
or to get help fleshing this one out.

The purpose of the registers is to load information into the hardware so
the hardware itself can handle the code swapping at the thread level. The
original function never changes in memory, but through the SMC registers
it points to an offset and length to replace with code from another offset.
In this way, it's like a level of address translation before everything
else, so that by the time the IP address reaching into logical memory is
passed for translation into physical memory, it's already been adjusted
into the correct range for the thread.

By merely loading registers, you're potentially changing large blocks of
code with a single population done one time at some point, so that the
block of code which may have done the translation originally is never
run again for the duration of its immediate lifetime (until the ret is
executed).

As I say, it may not be the best solution. It's the best one I've been
able to come up with. I'm hoping there are better ones that don't
require the new hardware support.

Rick C. Hodgin

Aug 26, 2016, 12:32:48 PM

Not really. I think SMC itself reveals and explains its place sufficiently
if you study what it potentially brings to the table for a given algorithm.

We don't have good SMC hardware features today, and the research I've done
into SMC shows it requires undesirable things like manually invalidating
the cache through a hard-coded instruction, etc.

It's taken me quite a while to work out this mechanism. I think it's
close, and at least possibly the correct choice. But, I don't have any
experience or formal training in hardware design, so I'm completely
winging it by the seat of my own internal abilities and gifts. I could
very well be wrong.

Quadibloc

Aug 26, 2016, 12:50:40 PM

On Friday, August 26, 2016 at 7:10:45 AM UTC-6, Rick C. Hodgin wrote:

> As a software developer, I've always considered self-modifying code to be
> almost essential for high-speed throughput and performance.

As you've seen, this was considered an unusual view.

Originally, self-modifying code served two major purposes in early computer
architectures:

- The addresses of instructions were modified, so that loops could access
arrays.

- Subroutine calls often worked by storing the return address in memory
immediately before their branch target.

The first was eliminated in modern computers by adding index registers;
indirect addressing was another method used to deal with it, so even the PDP-8,
without index registers, was able to avoid the need for this technique.

The second was dealt with when computers started to have general registers
instead of a memory-accumulator architecture; then return addresses could be
placed in registers, which required specifying calling conventions such as the
R-type and S-type calling conventions of the System/360 architecture.

Self-modifying code, though, remained possible when the assembler programmer
had a felt need for it. It couldn't be used in re-entrant code, which some
general register architectures made possible, but that was a relatively rare
special case.

On some computers, there was an Execute instruction that included the
capability of modifying the instruction to be executed, reducing the need to
modify instructions in memory.

In early high-performance computers, where fetch, decode, and execute were
overlapped, in the absence of interlocks, one would have to allow one or two
instructions between the code that modified an instruction and its execution.

The death-knell for self-modifying code (which had already been under a cloud
of disapproval on the basis that it complicated debugging) was when
architectures started to be deeply pipelined. The System/360 Model 91 comes to
mind as a watershed moment; here was an important computer from a major
manufacturer on which self-modifying code simply would not work as expected.

And today its innovations have spread to a huge chunk of the microprocessors
with which we interact in daily use.

However...

Your point of view here is not _all_ wrong, even just looking at it through
what is conventionally well-known, without knowing what specifically you may
have in mind.

The inherent flexibility of treating code as data is, of course, what made the
higher-level language *compiler* possible.

Of course, compilers write their output to disk before executing it.

Just-in-time compilation, though, was noted as one restricted form of
self-modifying code that's still used, where code is prepared in memory and
then used there. In order to permit this important technique to continue to be
used, computer security mechanisms which aim to prevent viruses from tampering
with programs have had to include escape mechanisms.

The IBM 704 compiler, which at a very early date provided advanced code
optimizations, profiled programs by running part of the code it generated in a
modified fashion.

While languages like LISP and FORTH handle machine code in a somewhat
unconventional way, they don't require self-modifying code, but it's not too
much of a stretch to wonder if someone working in that direction might not come
up with a language which has added flexibility and power due to a technique
involving self-modifying code used internally.

John Savard

Quadibloc

Aug 26, 2016, 12:54:04 PM

On Friday, August 26, 2016 at 7:10:45 AM UTC-6, Rick C. Hodgin wrote:
> In fact, when
> I sat down to begin designing my CAlive compiler, I implicitly had resolved
> in the back of my mind that I would introduce self-modifying code ... and
> not just in the concept of LiveCode (edit-and-continue),

So part of your concept does involve trying to reach higher flexibility.

> but rather within
> each generated function so that certain aspects of it can be re-configured
> after the first pass, so that subsequent passes do not incur the "first
> pass check" burdens, and so on.

Now, _that_ can be done just by unrolling the first iteration of the loop. Yes,
that means the program is a trifle bigger, but given the performance gains from
pipelining mechanisms that preclude self-modifying code, it seems a small price
to pay.
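
(A minimal C sketch of the peeling Savard describes; the helper functions
are invented stand-ins for the first-pass and steady-state work.)

#include <stddef.h>

/* Hypothetical stand-ins for the first-pass and steady-state work. */
static double first_pass_setup(double x) { return x + 1.0; }
static double steady_state(double x)     { return x * 2.0; }

/* The first iteration is peeled ("unrolled") out of the loop, so the
 * steady-state body carries no "is this the first pass?" test. */
static void process(double *a, size_t n)
{
    if (n == 0)
        return;
    a[0] = first_pass_setup(a[0]);   /* iteration 0, with its extra work */
    for (size_t i = 1; i < n; i++)   /* iterations 1..n-1, check-free */
        a[i] = steady_state(a[i]);
}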

John Savard

Quadibloc

Aug 26, 2016, 1:08:38 PM

On Friday, August 26, 2016 at 10:32:48 AM UTC-6, Rick C. Hodgin wrote:

> We don't have good SMC hardware features today, and the research I've done
> into SMC requires undesirable things like invalidating the cache manually
> through a hard-coded instruction, etc.

Even before getting into the obstacles to self-modifying code, on the older
computers where it was no problem, it was still an awkward technique: the
instruction formats made it difficult to employ as a natural way to handle
dynamic algorithms. I can't think of examples offhand, but if there was
research done on this, it likely involved using pseudocode rather than
actual machine instructions.

Well, there _is_ "genetic programming", but _that_ is hugely inefficient, and
is resorted to for solving problems for which no good computer algorithm is
known; it's an attempt to substitute massive computer brute force for human
genius. Usually, 'genetic programming' involves something more like a
descriptive data string than a program in any case; there are no branches
allowed, I suspect, in nearly all cases.

John Savard

Quadibloc

Aug 26, 2016, 1:09:18 PM

On Friday, August 26, 2016 at 10:50:40 AM UTC-6, Quadibloc wrote:

> The IBM 704 compiler,

The FORTRAN compiler for the IBM 704, I should have said.

John Savard

wolfgang kern

Aug 26, 2016, 1:31:20 PM

Rick C. Hodgin wrote about SMC...

Yes, I have used SMC for four decades in my OS. But meanwhile I have
figured out that even though it saves a lot of work, it may reduce
performance if used too often or in the wrong (hot) locations.

I see nothing wrong with using SMC, but it should only be used when a
temporary performance penalty gains much more later on.

eg: I set all 'variables' as part of immediate data within my code
after any vmode change (which takes some time anyway).

ie:
cmp eax,imm32 ;only the four bytes of imm32 are modified to fit
;any new chosen vmode limits.
__
wolfgang
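
(A C rendering of wolfgang's technique, assuming x86-64 and a POSIX system
that still permits a writable-and-executable mapping; the emitted bytes and
demo values are invented. As noted downthread, x86 keeps self-modified code
coherent transparently, so no explicit flush appears here.)

#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    /* x86-64: mov eax, imm32 ; ret -- the imm32 occupies bytes 1..4 */
    static const uint8_t tmpl[] = { 0xB8, 0, 0, 0, 0, 0xC3 };

    uint8_t *code = mmap(NULL, 4096, PROT_READ | PROT_WRITE | PROT_EXEC,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (code == MAP_FAILED)
        return 1;
    memcpy(code, tmpl, sizeof tmpl);

    int (*get)(void) = (int (*)(void))code;  /* POSIX-style cast */

    int32_t v = 1234;
    memcpy(code + 1, &v, 4);                 /* patch the immediate */
    printf("%d\n", get());                   /* prints 1234 */

    v = 5678;                                /* e.g. after a vmode change */
    memcpy(code + 1, &v, 4);                 /* re-patch in place */
    printf("%d\n", get());                   /* prints 5678 */
    return 0;
}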

Stefan Monnier

Aug 26, 2016, 1:35:58 PM

> Not really. I think SMC itself reveals and explains its place sufficiently
> if you study what it potentially brings to the table for a given algorithm.

That's what I'm asking: what can it "potentially bring to the table for
a given algorithm"?

Currently, the only case where self-modifying code is used is JIT
compilation, for which you don't need anything fancy, and current
computer architectures provide good enough solutions.


Stefan

John Levine

Aug 26, 2016, 1:36:19 PM

>The death-knell for self-modifying code (which had already been under a cloud
>of disapproval on the basis that it complicated debugging) was when
>architectures started to be deeply pipelined. The System/360 Model 91 comes to
>mind as a watershed moment; here was an important computer from a major
>manufacturer on which self-modifying code simply would not work as expected.

Uh, what? I programmed a 360/91. Self modifying code worked the same
as it did on any other 360. You could modify the next instruction and
it would work correctly, albeit slowly. See page 12 of the Model 91
Functional Characteristics.

If you're thinking of imprecise interrupts, they had nothing to do
with self-modifying code. They were because there wasn't enough
status info in the pipeline to back up correctly if an instruction
caused a data-related exception such as zero divide.

The current z/Series POO says (using about two pages of text) that you
can modify the next instruction and it'll work so long as you use the
same address, i.e., not an aliased page or address space.

This doesn't mean that self-modifying code is a good idea, but there
must be a fair amount of it in the mainframe wild if IBM is still
willing to promise that it will work.

Rick C. Hodgin

Aug 26, 2016, 4:03:55 PM

On Friday, August 26, 2016 at 1:35:58 PM UTC-4, Stefan Monnier wrote:
> > Not really. I think SMC itself reveals and explains its place sufficiently
> > if you study what it potentially brings to the table for a given algorithm.
>
> That's what I'm asking: what can it "potentially bring to the table for
> a given algorithm"?

Function templates, common epilogue and prologue code in a shared cache
with only the internals of the algorithm being needed in other caches.
The ability to completely swap out the algorithm for first-pass, multi-
pass, and last-pass iterative loops, with even dynamic changes based on
processed data types and their needs. The new ability of being able to
directly swap out portions of an algorithm based on observed runtime
resources. Single entry-point APIs which don't have to switch, but are
immediately in machine code.

All of this can be done today with or without SMC, but SMCBs make it
more efficient and easier to use, and they provide mechanisms which expose
new abilities to the compiler which do not exist today. These can be
exploited as needed.

> Currently, the only cases where self-modifying code is used is jit
> compilation, for which you don't need anything fancy and current
> computer architectures provide good enough solutions.

As my mother would've said to someone who replied to her with a comment
like yours, "Don't use it then." :-) This solution isn't for everybody.
And, I'm not sure it's a good solution yet. I think it is, but as I
have said ... I could be wrong.

-----
To reiterate, I'm not looking at creating another variation of existing
architectures, but am looking to create a truly fundamental architecture
to which there is no general purpose equal. I want to create the
benchmark, the one which incorporates abilities sufficient to handle all
types of code and coding styles, and does so efficiently, and with a very
close relationship between compiler and hardware.

We'll see though ... there are a lot of naysayers, so I'm basically doing
all of it alone rather than with the help I'm asking for and would dearly
desire to have. I believe in most cases, the cross on the
door keeps everyone away because they do not want to confront the
possibility that they might actually be a sinner in need of salvation,
and that Jesus Christ is the only way. And, FWIW, it hurts me that I
encounter such a response in an ongoing capacity. We are all stronger
working together, rather than separately.

Chris M. Thomasson

Aug 26, 2016, 4:39:20 PM

On 8/26/2016 1:03 PM, Rick C. Hodgin wrote:
> On Friday, August 26, 2016 at 1:35:58 PM UTC-4, Stefan Monnier wrote:
>>> Not really. I think SMC itself reveals and explains its place sufficiently
>>> if you study what it potentially brings to the table for a given algorithm.
>>
>> That's what I'm asking: what can it "potentially bring to the table for
>> a given algorithm"?
>
> Function templates, common epilogue and prologue code in a shared cache
> with only the internals of the algorithm being needed in other caches.
> The ability to completely swap out the algorithm for first-pass, multi-
> pass, and last-pass iterative loops, with even dynamic changes based on
> processed data types and their needs. The new ability of being able to
> directly swap out portions of an algorithm based on observed runtime
> resources. Single entry-point APIs which don't have to switch, but are
> immediately in machine code.
>
> All of this can be done today with or without SMC, but SMCBs make it
> more efficient, easier to use, and it provides mechanisms which expose
> new abilities to the compiler which do not exist today. These can be
> exploited as needed.

Isn't creating SMC fairly expensive? Shouldn't it use a serializing
instruction, like CPUID on x86?

Rick C. Hodgin

Aug 26, 2016, 4:40:54 PM

It's occurred to me this afternoon that what I'm really on about here is
memory aliasing. A way to add an explicit runtime ability to alias one
thing to another through hardware resources. I had always considered
doing it only on instruction data, but there's no reason why it couldn't be
used on all types of data, and even on read-only data.

I think this solution may prove to be much larger than my initial thinking
for merely self-modifying code. I think this may turn out to be a powerful
way to hide latency in memory copies (provided the original memory is not
being altered and can be properly aliased for the duration of the non-
temporal copy).

I'll have to consider this on the grander scope. I may really be on to
something here.

Quadibloc

Aug 26, 2016, 6:11:26 PM

On Friday, August 26, 2016 at 2:03:55 PM UTC-6, Rick C. Hodgin wrote:

> Function templates, common epilogue and prologue code in a shared cache
> with only the internals of the algorithm being needed in other caches.

I'm not sure if self-modifying code is the best way to deal with this.

> The ability to completely swap out the algorithm for first-pass, multi-
> pass, and last-pass iterative loops, with even dynamic changes based on
> processed data types and their needs.

Surely _those_ things are achievable now with branch instructions. The assigned
and computed GO TO statements in FORTRAN don't require self-modifying code for
their implementation.

However, _some_ current architectures do make anything but a conditional jump
awkward; usually there's some kind of an indirect jump because you need it to
return from a subroutine.

John Savard

Quadibloc

Aug 26, 2016, 6:13:54 PM

On Friday, August 26, 2016 at 11:36:19 AM UTC-6, John Levine wrote:

> Uh, what? I programmed a 360/91. Self modifying code worked the same
> as it did on any other 360. You could modify the next instruction and
> it would work correctly, albeit slowly.

Ah. My memory was playing tricks. I did think this was one limitation of that
computer _in addition_ to imprecise interrupts. My own direct experience was with
an Amdahl 470 V/6, less sophisticated than the /91, but still influenced by it,
and it _may_ have had such a limitation.

Of course, the "albeit slowly" is enough to be a good reason not to use the
technique if at all possible.

John Savard

John Levine

Aug 26, 2016, 6:26:41 PM

>> Uh, what? I programmed a 360/91. Self modifying code worked the same
>> as it did on any other 360. You could modify the next instruction and
>> it would work correctly, albeit slowly.

>Of course, the "albeit slowly" is enough to be a good reason not to use the
>technique if at all possible.

Depends what the alternative is. It cost a memory buffer flush and
reload, which would likely be faster than a couple of conditional
branches, particularly if any of the branches were mispredicted.

I have to say, I was surprised when I found that z/Series still
promises that self modifying code works, without even needing an
instruction to say flush prefetches. Given that the most likely use
of it is obviated by the EX second byte hack, I hate to think what
sort of code they need to keep working.

Maybe it's some tricky stuff in first level interrupt and trap
handlers which have to store the registers somewhere before they can
do any useful work. Since there is nothing like an interrupt stack,
if I wanted to handle potentially recursive interrupts, I could see
how patching the address in a STM would be helpful.

The best known self-modifying code on 360 mainframes was (is) in the
sort utilities, which compile the comparison criteria, which can be
quite complex, into a routine it calls with pointers to the two
records. But that's not in-line patching, that's more like a very
early form of JIT.

nedbrek

Aug 26, 2016, 7:07:07 PM

The common way to alias a region of memory is through the virtual memory mapping (install multiple virtual addresses pointing to the same physical address).
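
(A sketch of that on a POSIX system: two mmap() views of one shared-memory
object give two virtual addresses for the same physical page. The object
name is invented; link with -lrt on older glibc.)

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    /* One shared-memory object, mapped twice: two virtual aliases. */
    int fd = shm_open("/alias_demo", O_RDWR | O_CREAT, 0600);
    if (fd < 0 || ftruncate(fd, 4096) != 0)
        return 1;

    char *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *b = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (a == MAP_FAILED || b == MAP_FAILED)
        return 1;

    strcpy(a, "written through alias a");
    printf("read through alias b: %s\n", b);  /* same physical page */

    shm_unlink("/alias_demo");
    return 0;
}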

nedbrek

Aug 26, 2016, 7:20:26 PM

Early x86's required a JMP between the store and any execution that depended on it (in order to flush the instruction prefetch buffer). P6 needed heavier-weight checks, and did away with that requirement.

SMC is expensive, but transparent to the user.

It requires the instruction cache (and any lines representing instructions in flight which might be trying to get replaced) to be kept coherent with the store stream. Usually this means stores must snoop the I cache.

If the target instruction address is in the machine when a store occurs, the machine must flush and restart at the target address.

Other architectures usually require some sort of serialization instruction (memory fence, instruction fence).

This is the same for SMC and CMC (cross modifying code - like a JIT).


Ned

Chris M. Thomasson

Aug 26, 2016, 9:06:47 PM

Thank you for the excellent information, Ned. Now I am wondering if a:

MFENCE
CLFLUSH

is enough on x86? CPUID seems too hardcore. Hmm...

Rick C. Hodgin

Aug 26, 2016, 9:38:12 PM

> The common way to alias a region of memory is through the virtual memory mapping (install multiple virtual addresses pointing to the same physical address).

You're constrained in granularity to large blocks at that point, and it
requires a call to the OS for setup. What I propose with SMCBs would be at
the ISA level, and directly on hardware.

nedbrek

Aug 26, 2016, 9:55:53 PM

X86 has transparent support. You don't need to do anything.

Rick C. Hodgin

Aug 26, 2016, 10:02:18 PM

On Friday, August 26, 2016 at 9:55:53 PM UTC-4, nedbrek wrote:
> X86 has transparent support. You don't need to do anything.

Paging is transparent in use, but not in setup. It requires a dispatch into
kernel code, and the kernel must set up and install the page table entries.
The aliased memory is also limited to 4KB and larger pages. The granularity
I'm talking about with SMCBs ranges from single bytes to multiple megabytes
with a population into a single register.

nedbrek

Aug 26, 2016, 11:03:19 PM

Sorry, Google Groups spoiled the thread. My reply ("x86 has transparent support") was for Chris :)

Virtual aliasing does require ring 0.

Quadibloc

Aug 27, 2016, 12:12:09 AM

On Friday, August 26, 2016 at 4:26:41 PM UTC-6, John Levine wrote:

> I have to say, I was surprised when I found that z/Series still
> promises that self modifying code works, without even needing an
> instruction to say flush prefetches.

But note the caveat: it doesn't work if the same address is... spelled
differently. So the interlock comes before address translation.

Maybe it is simply in order that they can say this is almost 100% compatible
with the original System/360. Maybe the interlock is a side effect of something
they need for more essential purposes.

I would strongly suspect that self-modifying code is not the sort of thing one
would have any occasion to use very much on a System/360. On top of that, most
programs on just about any system these days are written in higher-level
languages.

So I wouldn't draw any firm conclusions. It certainly is _possible_ that they
needed to do this so that some really old operating system would not be broken,
but could run on the new hardware for some reason. Or some unusual device
driver. But I doubt that there is a code gremlin lurking somewhere that made
this feature a requirement, either in IBM code or customer code.

John Savard

Quadibloc

Aug 27, 2016, 12:37:15 AM

On Friday, August 26, 2016 at 4:26:41 PM UTC-6, John Levine wrote:

> Maybe it's some tricky stuff in first level interrupt and trap
> handlers which have to store the registers somewhere before they can
> do any useful work. Since there is nothing like an interrupt stack,
> if I wanted to handle potentially recursive interrupts, I could see
> how patching the address in a STM would be helpful.

Yes; the System/360 architecture doesn't have...

32-bit immediates in instructions for absolute addressing (and, of course, now
you'ld need 64 bits, and instructions can't be more than 48 bits long),

any stacks,

memory indirect addressing.

So if one has an interrupt that can be interrupted by *itself*, one needs to do
something fast. However, as with most other architectures, an interrupt
automatically disables all interrupts of _equal_ or lower priority.

You might need to turn interrupts of equal priority *back on again*, but if so
you can execute a few instructions before doing that.

And there is a standard way, involving the register-to-register form of the
jump to subroutine instruction (which doesn't actually jump anywhere), to store
the current program counter in a register - that's part of the standard calling
convention. I presume there's a 64-bit version of it now. So a program can
manage to load its first base register, and grab constants adjacent to code.

Variables can be accessed with the pointer too - while self-modifying code
wasn't used much on the 360, one didn't have to have separate code and data
pages, one's variables could come immediately after one's code, and the same
base register could point to them both.

With a subroutine calling convention, you get to assume you can trash general
register 15 for that first base register. Interrupts have no such luxury.

However, while an interrupt apparently just saves the program status word,
replacing it with a new program status word... since that new PSW will put the
computer in supervisor mode, the general registers can be saved with a STM
instruction that has a base register field of zero, saving them in an absolute
address in the first 4,096 locations.

Ah: and some interrupts are *nonmaskable*. But those don't get used
recursively; one example is the pseudo-interrupt that results from the
Supervisor Call instruction; one can simply not do that if it hurts when one
does that. Others handle things like someone in effect tripping over the power
cord, so information loss is a given in such dire circumstances.

For the maskable ones, one can just take what one has saved in the conventional
location, and copy it somewhere else before re-enabling the interrupt on the
same level.

Takes longer than changing the STM instruction itself, though (and using indexing is of course not an option when you have no control over any register contents), so I guess there _could_ be a case where that time was not available.

John Savard

Anton Ertl

Aug 27, 2016, 5:30:44 AM

Quadibloc <jsa...@ecn.ab.ca> writes:
>The death-knell for self-modifying code

As others have posted, at least S/360 and IA-32/AMD64 support changing
the next instruction, without any ado required from the programmer.

>Of course, compilers write their output to disk before executing it.

Some do, but load-and-go compilers don't.

>Just-in-time compilation, though, was noted as one restricted form of
>self-modifying code that's still used, where code is prepared in memory and
>then used there.

There are also other possible uses:

* In quickening in the JVM, a VM instruction that performs, e.g., a
long-winded lookup in the constant pool is rewritten into a (quick)
VM instruction that has the constant inline (see the sketch after
this list). This is typically handled in current JVM implementations
by interpretive execution of the VM code for the first n executions,
producing native code only afterwards, but an alternative is to
compile the code right away and modify it later.

* In an executable file, references to dynamically linked functions do
not contain the actual address of the function (a similar issue as
the constant-table access above), but to some dispatch table. Once
you know the actual address, you can rewrite the call to directly
call there instead of the dispatch table (not sure what is done for
this stuff these days).

* In the Linux kernel, there are (or used to be; it's a while since I
read that) some instructions that are not supported by all target
hardware (or something like this); on the hardware where it does not
work, it is rewritten to something that works on the hardware upon
first execution.

* Debuggers like to modify the code in order to implement break-points
and possibly single-stepping.
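
(The quickening sketch promised above, as a toy C bytecode interpreter
rather than actual JVM code: the slow opcode rewrites itself in the
bytecode stream. Since the rewritten "code" is interpreter data, no
I-cache issues arise.)

#include <stdio.h>

enum { OP_LOOKUP, OP_CONST, OP_HALT };

/* Invented stand-in for a long-winded constant-pool lookup. */
static int resolve_constant(int pool_index) { return pool_index * 100; }

static int run(int *bc)
{
    int acc = 0;
    for (int pc = 0; ; pc += 2) {
        switch (bc[pc]) {
        case OP_LOOKUP:                  /* slow path, taken once */
            bc[pc + 1] = resolve_constant(bc[pc + 1]);
            bc[pc] = OP_CONST;           /* quicken: rewrite in place */
            /* fall through */
        case OP_CONST:
            acc += bc[pc + 1];
            break;
        case OP_HALT:
            return acc;
        }
    }
}

int main(void)
{
    int bc[] = { OP_LOOKUP, 7, OP_HALT, 0 };
    printf("%d\n", run(bc));  /* first run quickens the instruction */
    printf("%d\n", run(bc));  /* second run takes only the quick path */
    return 0;
}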

The problem with these uses is not the self-modification itself; all
architectures support it, either transparently, or they require some
management code from the software, some architectural (SPARC is nice
here), some more or less implementation-specific (PowerPC), or
completely implementation-dependent and only accessible through system
calls (MIPS, last I looked); the bigger problem is the atomicity of
the change in multi-threaded environments, as the OP wrote.

>While languages like LISP and FORTH handle machine code in a somewhat
>unconventional way, they don't require self-modifying code

If there was such a requirement, you could work around it by having
the modifiable part become data. E.g., in Forth you can do

defer x
' dup is x

and then X performs DUP; and you can change X to do something else
later. One implementation that has been discussed is to let X
actually be a JMP instruction followed by the address of the word that
it performs; IS would then change the address after the JMP, i.e.,
modify the JMP instruction. However, looking at VFX Forth and
SwiftForth (two native-code Forth compilers), they both keep the
address in something that the hardware sees as data; my guess is that
they do this because of the atomicity problem. And, with indirect
branch prediction, an inlined indirect call might even be cheaper.

However, Forth also shows that transparent support for self-modifying
code has a cost. Forth compiler writers like to interleave data and
code; that can be pretty expensive on an Intel or AMD CPU after the
486; if code and data happen to be in the same cache line (or with
hardware prefetching even in adjacent cache lines), and are accessesed
alternatingly, there is lots of slow ping-ponging going on in the data
cache. A program that still exhibits this behaviour on VFX is:

create x 0 ,
: foo 10000000 0 do 1 x +! loop ;
foo

VFX puts the data at a 32-byte boundary, and the code at the next
32-byte boundary, with the loop starting 48 bytes after the data, but
apparently that is still too close, and the loop takes 468 cycles per
iteration instead of 5.6 when they are 336 bytes apart.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
an...@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

James Van Buskirk

Aug 27, 2016, 11:36:07 AM

"Stefan Monnier" wrote in message
news:jwvlgzjo2j1.fsf-...@gnu.org...

> Currently, the only cases where self-modifying code is used is jit
> compilation, for which you don't need anything fancy and current
> computer architectures provide good enough solutions.

Fortran more or less requires compilers to generate self-modifying code.
RECURSIVE procedures may have multiple instances and their internal
procedures have access to the instance data of the instance active when
they were passed as actual arguments or their addresses were taken.
Given C interoperability the information about the internal procedure's
calling sequence pretty much has to boil down to an address to which
control will be transferred, and since there is no hard upper bound to
the number of instances of a RECURSIVE procedure, there is no hard
upper bound to the number of addresses which correspond to the
different instances of the internal procedure. Thus normally each
address will point to a trampoline that loads the instance data pointer
and the code for that trampoline had to be generated at run-time.

Ivan Godard

Aug 27, 2016, 11:49:34 AM

??? This sounds like an ordinary thunk or lambda, in which a common
implementation of the argument binding is by a reference pointer in a
data packet. Yes, the indirection implied by the data reference can be
removed if you rewrite the thunk to use direct reference rather than
indirect, but that's an optimization and not required by the language -
and given the rarity of formal function arguments to recursive
procedures is unlikely to give measurable gain anyway. What am I not
understanding here?

James Van Buskirk

Aug 27, 2016, 12:15:05 PM

"Ivan Godard" wrote in message news:npscqc$rb6$1...@dont-email.me...
It is news to me that common C ABIs implement function address passing
in such an indirect fashion. I thought that the basic C implementation was
passing a bare address to which control was to be transferred on function
invocation.

Ivan Godard

Aug 27, 2016, 12:24:12 PM

On 8/27/2016 9:14 AM, James Van Buskirk wrote:
> "Ivan Godard" wrote in message news:npscqc$rb6$1...@dont-email.me...

>> ??? This sounds like an ordinary thunk or lambda, in which a common
>> implementation of the argument binding is by a reference pointer in a
>> data packet. Yes, the indirection implied by the data reference can be
>> removed if you rewrite the thunk to use direct reference rather than
>> indirect, but that's an optimization and not required by the language
>> - and given the rarity of formal function arguments to recursive
>> procedures is unlikely to give measurable gain anyway. What am I not
>> understanding here?
>
> It is news to me that common C ABIs implement function address passing
> in such an indirect fashion. I thought that the basic C implementation was
> passing a bare address to which control was to be transferred on function
> invocation.
>

C doesn't have statically nested functions, so it needs only a code
pointer for function binding. Numerous other languages do have
statically nested functions, and so need to have both a code pointer and
an environment pointer so references to variables in the containing
environment can be accessed. (There are other approaches, such as a
display, but down-stack static links are the common implementation). C++
now supports lambdas, which also have a statically scoped argument
binding and so also need a similar mechanism. To obtain a similar effect
in C, the programmer must explicitly pass argument pointers to all the
values referenced in what would have been the static environment in
other languages.

Quadibloc

Aug 27, 2016, 12:27:21 PM

On Saturday, August 27, 2016 at 9:36:07 AM UTC-6, James Van Buskirk wrote:

> Fortran more or less requires compilers to generate self-modifying code.
> RECURSIVE procedures

If a procedure is called recursively, it absolutely _must not_ contain
self-modifying code, since it has to be *re-entrant*.

John Savard

Ivan Godard

Aug 27, 2016, 1:01:01 PM

On 8/27/2016 9:14 AM, James Van Buskirk wrote:

> It is news to me that common C ABIs implement function address passing
> in such an indirect fashion. I thought that the basic C implementation was
> passing a bare address to which control was to be transferred on function
> invocation.
>

This may help:
https://en.wikipedia.org/wiki/Nested_function

James Van Buskirk

Aug 27, 2016, 1:37:06 PM

"Ivan Godard" wrote in message news:npserb$2nb$1...@dont-email.me...

> C doesn't have statically nested functions, so it needs only a code
> pointer for function binding. Numerous other languages do have statically
> nested functions, and so need to have both a code pointer and an
> environment pointer so references to variables in the containing
> environment can be accessed. (there are other approaches, such as a
> display, but down-stack static links are the common implementation). C++
> now supports lambdas, which also have a statically scoped argument binding
> and so also need a similar mechanism. To obtain a similar effect in C, the
> programmer must explicitly pass argument pointers to all the values
> referenced in what would have been the static environment in other
> languages.

But C does have statically nested functions through Fortran
interoperability!
Here is an example that might make my concepts clearer. The Fortran
subroutines outer and callback combine to create C function pointers to
internal procedures to two separate instances of subroutine outer. Then
the C function Csub is invoked with all of the C function pointers. Notice
how the C function is correctly sorting out which function pointers go with
which instance of subroutine outer. Perhaps it's just my limited
imagination, but I don't see how this is going to work without
self-modifying code or an unusual C ABI.

D:\gfortran\clf\nested>type Csub.c
#include <stdio.h>

void Csub(void(*a)(int x),int(*b)(),void(*c)(int x),int(*d)())
{
    a(1);
    c(2);
    printf("b() = %d\n",b());
    printf("d() = %d\n",d());
}

D:\gfortran\clf\nested>gcc -c Csub.c

D:\gfortran\clf\nested>type nested.f90
module funcs
   use ISO_C_BINDING
   implicit none
   interface
      subroutine Csub(a,b,c,d) bind(C,name='Csub')
         import
         implicit none
         type(C_FUNPTR), value :: a,b,c,d
      end subroutine Csub
   end interface
contains
   RECURSIVE subroutine outer(a,b,c,d)
      type(C_FUNPTR), optional :: a,b,c,d
      integer(C_INT) x
      if(.NOT.present(c)) then
         a = C_FUNLOC(fset)
         b = C_FUNLOC(fget)
      else
         c = C_FUNLOC(fset)
         d = C_FUNLOC(fget)
      end if
      call callback(a,b,c,d)
   contains
      subroutine fset(new) bind(C)
         integer(C_INT), value :: new
         x = new
      end subroutine fset
      function fget() bind(C)
         integer(C_INT) fget
         fget = x
      end function fget
   end subroutine outer
   RECURSIVE subroutine callback(a,b,c,d)
      type(C_FUNPTR), optional :: a,b,c,d
      type(C_FUNPTR), save :: a1,b1,c1,d1
      if(.NOT.present(a)) then
         call outer(a1,b1,c,d)
      else if(.NOT.present(c)) then
         call outer(a1,b1,c1,d1)
      else
         call Csub(a,b,c,d)
      end if
   end subroutine callback
end module funcs

program test
   use funcs
   implicit none
   call callback()
end program test

D:\gfortran\clf\nested>gfortran nested.f90 Csub.o -onested

D:\gfortran\clf\nested>nested
b() = 1
d() = 2

paul wallich

Aug 27, 2016, 4:11:20 PM

It depends on your terminology and what you're thinking of as code
versus what you're thinking of as data. Usually you have a stack of data
that's operated on by the same code (with the only difference being the
stack[-frame] pointer), but that's isomorphic with a stack of code fragments
that have the data implicit. And I'm not even sure that would
necessarily be less space-efficient.

paul

Chris M. Thomasson

Aug 27, 2016, 6:04:13 PM

On 8/26/2016 6:55 PM, nedbrek wrote:
> X86 has transparent support. You don't need to do anything.
>

Thanks. Although, I do remember something about being advised to use a
serializing instruction, for proper form.

Chris M. Thomasson

Aug 27, 2016, 6:09:23 PM

Well, as long as the self-modifying code does not create possibilities
for infinite recursion, or blowing the stack... Then, they should be
fine wrt recursion.

William Edwards

Aug 27, 2016, 6:31:19 PM

A principal tenet of security thinking is that writeable pages should not be executable ("W^X"). All mainstream operating systems strive to ensure this, and all mainstream modern CPUs facilitate it using variations of the "NX bit".

https://en.wikipedia.org/wiki/Executable_space_protection gives a nod to the Burroughs 5000 before explaining the extent of this in the modern world.

Exploit mitigation is very important. Do the architectures you are working on have some other mechanism to make SMC safe in the modern online world?

Rick C. Hodgin

Aug 27, 2016, 9:04:57 PM

On Saturday, August 27, 2016 at 6:31:19 PM UTC-4, William Edwards wrote:
> A principal tenet of security thinking is that writeable pages should not be executable ("W^X"). All mainstream operating systems strive to ensure this, and all mainstream modern CPUs facilitate it using variations of the "NX bit".
>
> https://en.wikipedia.org/wiki/Executable_space_protection gives a nod to the Burroughs 5000 before explaining the extent of this in the modern world.
>
> Exploit mitigation is very important. Do the architectures you are working on have some other mechanism to make SMC safe in the modern online world?

To be clear, this is being designed for my own CPU, which is LibSF 386-x40,
something I call "Arxoda," which is a 40-bit extension of the 80386 ISA, along
with support for a 32-bit version of ARM, and my own ISA creation, which is
not yet cast in stone. I have an architectural design which implements a new
threading model, which also includes a few things I have not released publicly
yet because I'm still going through the iterations of feasibility in my mind.

You can see the base core design here:

https://github.com/RickCHodgin/libsf/blob/master/li386/oppie/oppie-6.png

So, to begin with, SMCB would not be a general purpose extension for existing
hardware products, though they'd all be free to add them.

Second, SMCB would implement compile-time variations of algorithms in general.
The compiler would've analyzed how the code block could be altered on various
passes, making it possible to construct more than one variation of various
portions of the code base, allowing them to be switched out dynamically to
no longer have to test a condition once it's been fulfilled, for first- and
last-pass considerations, etc.

Physically, all SMCB code would already be in an execute-only block having
been loaded along with other binary code in the app, but the dynamics of the
runtime environment would allow various portions of that load-image code to
be swapped out without actually writing any real changes to memory, but only
providing the SMCB engine with information to obtain the correct instruction
data through the SMC entries that have been applied.

In addition, a runtime "fixup" protocol could be created which allows an
application to coordinate a request with the OS to grant permission to obtain
a read/write copy of an arbitrary indicated binary block, fix it up, and then
request permission from the OS to have it supplant a prior area with the
overwrite.

I have this particular protocol in mind for my CAlive compiler (because it
uses something I call LiveCode (which is edit-and-continue) allowing for
development in and on a running system).

> On Friday, August 26, 2016 at 3:10:45 PM UTC+2, Rick C. Hodgin wrote:
> > As a software developer, I've always considered self-modifying code to be
> > almost essential for high-speed throughput and performance.

As I say, it may not be the best solution. I've been pursuing it for a while,
and I believe it's about ready for permanent inclusion in my LibSF 386-x40
design. But, it's still not cast in stone.

Robert Wessel

Aug 28, 2016, 3:15:22 AM

On Fri, 26 Aug 2016 17:36:16 +0000 (UTC), John Levine <jo...@iecc.com>
wrote:

>>The death-knell for self-modifying code (which had already been under a cloud
>>of disapproval on the basis that it complicated debugging) was when
>>architectures started to be deeply pipelined. The System/360 Model 91 comes to
>>mind as a watershed moment; here was an important computer from a major
>>manufacturer on which self-modifying code simply would not work as expected.
>
>Uh, what? I programmed a 360/91. Self modifying code worked the same
>as it did on any other 360. You could modify the next instruction and
>it would work correctly, albeit slowly. See page 12 of the Model 91
>Functional Characteristics.
>
>If you're thinking of imprecise interrupts, they had nothing to do
>with self-modifying code. They were because there wasn't enough
>status info in the pipeline to back up correctly if a an instruction
>caused a data related exception such as zero divide.
>
>The current z/Series POO says (using about two pages of text) that you
>can modify the next instruction and it'll work so long as you use the
>same address, i.e., not an aliased page or address space.
>
>This doesn't mean that self-modifying code is a good idea, but there
>must be a fair amount of it in the mainframe wild if IBM is still
>willing to promise that it will work.


That support bit a bunch of people, performance wise, somewhere around
the z10, where IBM introduced split I and D caches. Potential code
modifications were tracked to a 256-byte block, and if you modified
anything in that block, you got an expensive invalidation to the
I-cache for any instructions cached in that block. Unfortunately some
old programming styles on those machines made that juxtaposition
rather more common than you'd expect these days (local data stored
very close to the code, that sort of thing - think about being able to
access local data without the need for an additional base register).
It also bit some JIT-type scenarios, where insufficient separation was
made between generated code and data.

Nick Maclaren

Aug 28, 2016, 6:11:18 AM

In article <npt32i$7u6$1...@dont-email.me>,
Yes. Some older FORTRAN compilers relied on self-modifying code,
and still allowed recursion; full (asynchronous) reentrancy is not
needed, though it (arguably) is for OpenMP support. But, as Ivan
says, nested functions do NOT need self-modifying code, nor does
dynamic code generation, as was well-known in the 1960s.

While I have written self-modifying code, and don't swallow the
claim that excluding writable and executable pages is critical for
security, the simple fact is that virtually all experience is that
it causes more problems than it is worth. None of them are show
stoppers, but there are a LOT of other features that are complicated
by allowing self-modifying code.

For the benefit of dynamic code generation and JIT compilation, it's
clearly beneficial to have a relatively cheap, unprivileged mechanism
for switching pages between writing and executable and back again.
But that's really all that's needed.
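
(On POSIX systems that mechanism is mprotect(); a minimal sketch, assuming
x86-64 for the emitted bytes. On architectures without x86's transparent
coherence, an instruction-cache flush such as GCC's __builtin___clear_cache
would also be needed before the call.)

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    /* x86-64: mov eax, 42 ; ret */
    static const unsigned char insns[] = { 0xB8, 0x2A, 0, 0, 0, 0xC3 };

    unsigned char *page = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED)
        return 1;

    memcpy(page, insns, sizeof insns);             /* writable: emit code */
    if (mprotect(page, 4096, PROT_READ | PROT_EXEC) != 0)
        return 1;                                  /* flip W to X */

    int (*fn)(void) = (int (*)(void))page;         /* POSIX-style cast */
    printf("%d\n", fn());                          /* prints 42 */
    return 0;
}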


Regards,
Nick Maclaren.

Anton Ertl

Aug 28, 2016, 7:45:07 AM

William Edwards <willv...@gmail.com> writes:
>A principal tenet of security thinking is that writeable pages should not be executable ("W^X"). All mainstream operating systems strive to ensure this, and all mainstream modern CPUs facilitate it using variations of the "NX bit".

As return-oriented programming (which works even with W^X)
demonstrates, this is just security theatre.

As an example (not really return-oriented programming, but similar in
spirit), in Gforth, you can disable dynamic native-code generation,
and you no longer need to write to executable pages, and this costs
about a factor of 2 in performance.

Have you gained any security? No, because you have a Turing-complete
virtual-machine interpreter for which you can write code without
needing the hardware executable permission.

In this case it is particularly simple, because the thing is designed
to allow programming anything, but as has been demonstrated,
return-oriented programming allows you to do the same with code
fragments occurring in popular libraries (e.g., glibc).

>Exploit mitigation is very important. Do the architectures you are working on have some other mechanism to make SMC safe in the modern online world?

The same that you have otherwise and the only one that really works
(but is apparently too hard to practice): Don't leave any holes open.

The questions are rather: Does it really help security to prevent
writing code (which eliminates many techniques that are not
self-modifying code in the classical sense)? And on the other hand,
how much benefit does writing code give?

Quadibloc

Aug 28, 2016, 8:02:09 AM

On Sunday, August 28, 2016 at 4:11:18 AM UTC-6, Nick Maclaren wrote:

> Yes. Some older FORTRAN compilers relied on self-modifying code,
> and still allowed recursion; full (asynchronous) reentrancy is not
> needed,

This is interesting.

> While I have written self-modifying code, and don't swallow the
> claim that excluding writable and executable pages is critical for
> security,

You're quite correct that it isn't critical for security - the IBM 360 doesn't
do that, and it's secure. (Maybe it isn't so simple, and the z/Architecture has
switched to a different coding style, of course.)

The x86 architecture, on the other hand, in the beginning, had separate code
and data segment pointers. So making executable pages non-writable on it breaks
less legitimate software.

Given the parlous state of the security of x86-based systems connected to the
Internet - not *all* of which would be corrected by switching to an operating
system other than Microsoft Windows - any stopgap measure which would make life
more difficult for virus writers is to be appreciated.

So something can be of critical _importance_ without being inherently essential
for security.

But the fact that write exclusion is really helpful on x86 systems indeed
doesn't speak to whether a novel architecture, built around using
self-modifying code to achieve more power and flexibility, is somehow going to
be inherently insecure.

If worst comes to worst, there would be, for example, the option of walling off
the SMC-facile instruction set from the Internet and having the programs that
talk to the Internet be written in a different ISA with a heavy security focus.

John Savard

Quadibloc

Aug 28, 2016, 8:08:22 AM

On Sunday, August 28, 2016 at 1:15:22 AM UTC-6, robert...@yahoo.com wrote:

> That support bit a bunch of people, performance wise, somewhere around
> the z10, where IBM introduced split I and D caches. Potential code
> modifications were tracked to a 256 block, and if you modified
> anything in that block, you got an expensive invalidation to the
> I-cache for any instructions cached in that block.

Having separate I-caches and D-caches shouldn't have to make self-modifying
code expensive, at least in principle.

But long cache lines, instead of short ones, do improve performance.

A bus linking the I and D caches, plus the ability to mark individual 16-bit
halfwords in a cache line of the I cache as invalid, come to mind as the
measures relevant to the System/360 architecture. So, once IBM found the
problems from experience in the z10 or whatever, they could ameliorate, if not
eliminate, them in the next iteration.

John Savard

Quadibloc

Aug 28, 2016, 8:11:19 AM

On Sunday, August 28, 2016 at 5:45:07 AM UTC-6, Anton Ertl wrote:
> William Edwards <willv...@gmail.com> writes:

> >A principal tenet of security thinking is that writeable pages should not be
> >executable ("W^X"). All mainstream operating systems strive to ensure this,
> >and all mainstream modern CPUs facilitate it using variations of the "NX bit".

> As return-oriented programming (which works even with W^X)
> demonstrates, this is just security theatre.

No, closing the obvious avenues of attack, and making it harder for a hacker to
break in, and limiting the choices of what can be done after breaking in, is
beneficial to security even if it doesn't totally make all security weaknesses
go away.

There is a space between "security theatre" and "panacea".

John Savard

Quadibloc

Aug 28, 2016, 8:16:10 AM

On Saturday, August 27, 2016 at 7:04:57 PM UTC-6, Rick C. Hodgin wrote:

> To be clear, this is being designed for my own CPU, which is LibSF 386-x40,
> something I call "Arxoda," which is a 40-bit extension of the 80386 ISA, along
> with support for a 32-bit version of ARM, and my own ISA creation, which is
> not yet cast in stone.

In that case, you have no problem.

As long as only code written for your own ISA creation uses self-modifying
code, then you can simply include mechanisms...

by which switching between ISAs can be made a privileged operation

so that an operating system can be used in which internet-facing programs only
run in either ARM code or 386-x40 code, which *does* have no-execute protection
and the rest

while local computation can also make use of the added power of self-modifying
code when written in _your_ ISA.

Of course hackers have busted out of sandboxes and virtualization, and so no
solution is perfect.

John Savard

Anton Ertl

Aug 29, 2016, 10:55:59 AM

Sounds like a standard defense of security theatre (of any kind). If
you have any concrete arguments why W^X increases security despite
being circumventable with ROP, please state them.

already...@yahoo.com

Aug 29, 2016, 11:12:24 AM

One of the Big Names of the 70s, I don't remember who exactly, once argued
that there ain't no such thing as reliable (or robust? I don't remember the
exact epithet) software; software can only be correct or incorrect.
Even if his observation is correct, it does not sound particularly useful.

William Edwards

Aug 29, 2016, 12:55:50 PM

On Monday, August 29, 2016 at 4:55:59 PM UTC+2, Anton Ertl wrote:
> Quadibloc writes:
> >On Sunday, August 28, 2016 at 5:45:07 AM UTC-6, Anton Ertl wrote:
> >> William Edwards writes:
> >
> >> >A principal tenet of security thinking is that writeable pages should not be
> >> >executable ("W^X"). All mainstream operating systems strive to ensure this,
> >> >and all mainstream modern CPUs facilitate it using variations of the "NX bit".
> >
> >> As return-oriented programming (which works even with W^X)
> >> demonstrates, this is just security theatre.
> >
> >No, closing the obvious avenues of attack, and making it harder for a hacker to
> >break in, and limiting the choices of what can be done after breaking in, is
> >beneficial to security even if it doesn't totally make all security weaknesses
> >go away.
> >
> >There is a space between "security theatre" and "panacea".
>
> Sounds like a standard defense of security theatre (of any kind). If
> you have any concrete arguments why W^X increases security despite
> being circumventable with ROP, please state them.

We can examine whether you really believe that 'hardening' against a vulnerability is pointless if the protection is not absolute... Let's say you have a friend who does online banking. Would you prefer they use a browser with exploit mitigations up to the hilt, or one that omits W^X?

Luckily for your friend, all the browser makers and mainstream OS makers have embraced exploit mitigations to the hilt.

Here is an excellent (if 3 years old now) presentation on exploit mitigation in general: https://www.openbsd.org/papers/ru13-deraadt/mgp00001.html

Now there are mitigations for ROP too, e.g. having separate call and data stack as implemented in hardware on the Mill or in software by e.g. the Clang SafeStack pass. Modern attacks are targeting other pointers instead, e.g. vtables. Mitigations are being developed to reduce that attack surface too, e.g. the Clang Code Pointer Integrity (CPI) and Code Flow Integrity (CFI) projects.
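To make the separate-stack idea concrete, here is a minimal hand-instrumented
C sketch (names purely illustrative; a pass like Clang's SafeStack achieves
the same separation automatically, by moving the unsafe buffers to a second
stack rather than the return addresses):

#include <stdio.h>
#include <string.h>

#define SHADOW_DEPTH 1024

typedef void (*cont_t)(void);

/* Control data lives here, disjoint from the ordinary data stack, so an
   overflow of stack-allocated buffers cannot redirect control flow. */
static cont_t shadow_stack[SHADOW_DEPTH];
static int shadow_top;

static void shadow_push(cont_t k) { shadow_stack[shadow_top++] = k; }
static cont_t shadow_pop(void)    { return shadow_stack[--shadow_top]; }

static void step_done(void) { puts("continued via the shadow stack"); }

static void parse(const char *input) {
    char buf[16];                        /* overflowable data stays here */
    shadow_push(step_done);              /* control data kept separately */
    strncpy(buf, input, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    printf("parsed: %s\n", buf);
    shadow_pop()();                      /* "return" via the shadow stack */
}

int main(void) {
    parse("hello");
    return 0;
}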

Anton Ertl
Aug 29, 2016, 2:03:35 PM

William Edwards <willv...@gmail.com> writes:
>Luckily for your friend, all the browser makers and mainstream OS makers
>have embraced exploit mitigations to the hilt.

Not that I have noticed, on the contrary: Firefox/Iceweasel has
reenabled one of the major entrances for exploits that I had disabled
(JavaScript) when I "upgraded" it (feels more like a downgrade to me).
The CVE statistics
<http://www.cvedetails.com/product/3264/Mozilla-Firefox.html?vendor_id=452>
also don't speak for them being very effective, but maybe they do all
this "exploit mitigation" instead of getting rid of vulnerabilities.
Google Chrome numbers look a little better, especially wrt code
execution, but it still has vulnerabilities.

Also, I am pretty sure that they do have writeable code pages, as they
have JavaScript JIT compilers.

>Now there are mitigations for ROP too, e.g. having separate call and data
>stack

This works only against certain simple attacks. In general, the
attacker can work with any arbitrarily indirect pointer through which
code is executed in the future. Ah, you noticed that yourself:

>Modern attacks are targeting other pointers instead, e.g. vtables.
>Mitigations are being developed to reduce that attack surface too, e.g. the
>Clang Code Pointer Integrity (CPI) and Code Flow Integrity (CFI) projects.

Of course, given Greenspun's tenth rule, checking that all the
indirect jumps/calls are to places that are possible targets of the
corresponding indirect jumps/calls in the source code still gives the
attacker a Turing-complete set of gadgets, as he just needs to get the
program to enter its ad-hoc, informally-specified, bug-ridden, slow
implementation of half of Common Lisp.
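
A toy illustration (the names are purely illustrative): even a perfect check
of that kind lets an attacker who corrupts an index or pointer choose freely
among the *legitimate* targets of matching type:

#include <stdio.h>

static void op_log(const char *s)    { printf("log: %s\n", s); }
static void op_delete(const char *s) { printf("deleting %s!\n", s); }

typedef void (*handler_t)(const char *);
static handler_t handlers[] = { op_log, op_delete };

int main(void) {
    int op = 0;              /* imagine this index is attacker-controlled */
    handlers[op]("account"); /* CFI accepts either handler: both are legal
                                targets of an indirect call of this type */
    return 0;
}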

Robert Wessel
Aug 31, 2016, 1:28:24 AM

Nah... (almost*) everyone fixed their software - you had to, or you'd
be stuck until the next generation. And you still need to avoid
roughly the same code/data area interactions.


*FSVO "almost". Certainly any ISVs with actively supported products
did. Also note that this wasn't really much of an issue for most
(any?) HLLs, unless they did something JIT-ish.

Robert Wessel
Aug 31, 2016, 2:04:17 AM

On Sun, 28 Aug 2016 05:02:06 -0700 (PDT), Quadibloc
<jsa...@ecn.ab.ca> wrote:

>On Sunday, August 28, 2016 at 4:11:18 AM UTC-6, Nick Maclaren wrote:
>
>> Yes. Some older FORTRAN compilers relied on self-modifying code,
>> and still allowed recursion; full (asynchronous) reentrancy is not
>> needed,
>
>This is interesting.
>
>> While I have written self-modifying code, and don't swallow the
>> claim that excluding writable and executable pages is critical for
>> security,
>
>You're quite correct that it isn't critical for security - the IBM 360 doesn't
>do that, and it's secure. (Maybe it isn't so simple, and the z/Architecture has
>switched to a different coding style, of course.)
>
>The x86 architecture, on the other hand, in the beginning, had separate code
>and data segment pointers. So making executable pages non-writable on it breaks
>less legitimate software.


Although real mode code segments were writable - this was one area
making the jump to protected mode more difficult than necessary. Given
the paucity of available segments, storing common stuff in CS was
fairly common. An extra segment register (the 386's FS or GS) would
have made life much simpler.

James Van Buskirk
Aug 31, 2016, 4:24:07 PM

"Nick Maclaren" wrote in message news:npudc3$ir2$1...@dont-email.me...

> Yes. Some older FORTRAN compilers relied on self-modifying code,
> and still allowed recursion; full (asynchronous) reentrancy is not
> needed, though it (arguably) is for OpenMP support. But, as Ivan
> says, nested functions do NOT need self-modifying code, nor does
> dynamic code generation, as was well-known in the 1960s.

While both you and Ivan, two sources whom I have cause to respect,
have asserted this, I still can't see how it works. I have rewritten my
previous example for greater clarity and flexibility and would like to
be informed how it might work with either the same address for all
functions or all different addresses without self-modifying code:

D:\gfortran\clf\nested>type Csub1.c
#include <stdio.h>

struct funs
{
    void (*fset)(int x);
    int (*fget)();
};

void Csub1(struct funs array[], int N)
{
    int i, value, total;
    for(i = 0; i < N; i++)
    {
        value = i+1;
        printf("Address of setter function %2d = %p. Value set = %d\n",
               i, (void *)array[i].fset, value);
        array[i].fset(value);
    }
    total = 0;
    for(i = 0; i < N; i++)
    {
        printf("Address of getter function %2d = %p. Value gotten = %d\n",
               i, (void *)array[i].fget, array[i].fget());
        total += array[i].fget();
    }
    printf("Sum of values gotten = %d\n", total);
}

D:\gfortran\clf\nested>gcc -Wall Csub1.c -c

D:\gfortran\clf\nested>type nested1.f90
module funcs
   use ISO_C_BINDING
   implicit none
   type, bind(C) :: funs
      type(C_FUNPTR) fset
      type(C_FUNPTR) fget
   end type funs
   interface
      subroutine Csub1(array, N) bind(C,name='Csub1')
         import
         implicit none
         type(funs) array(*)
         integer(C_INT), value :: N
      end subroutine Csub1
   end interface
contains
   RECURSIVE subroutine outer(array, N, i)
      type(funs) array(*)
      integer(C_INT), value :: N, i
      integer(C_INT) x
      array(i) = funs(C_FUNLOC(fset),C_FUNLOC(fget))
      call callback(array, N, i+1)
   contains
      subroutine fset(new) bind(C)
         integer(C_INT), value :: new
         x = new
      end subroutine fset
      function fget() bind(C)
         integer(C_INT) fget
         fget = x
      end function fget
   end subroutine outer
   RECURSIVE subroutine callback(array, N, i)
      type(funs) array(*)
      integer(C_INT), value :: N, i
      if(i <= N) then
         call outer(array, N, i)
      else
         call Csub1(array, N)
      end if
   end subroutine callback
end module funcs

program test
   use funcs
   implicit none
   integer N
   type(funs), allocatable :: array(:)
   write(*,'(a)',advance='no') 'Please enter the number of functions:> '
   read(*,*) N
   allocate(array(N))
   call callback(array, N, 1)
end program test

D:\gfortran\clf\nested>gfortran -Wall nested1.f90 Csub1.o -onested1
nested1.f90:28:12:

function fget() bind(C)
1
Warning: 'fget' declared at (1) may shadow the intrinsic of the same name. In
order to call the intrinsic, explicit INTRINSIC declarations may be required.
[-Wintrinsic-shadow]

D:\gfortran\clf\nested>nested1
Please enter the number of functions:> 10
Address of setter function 0 = 000000000023FB3C. Value set = 1
Address of setter function 1 = 000000000023FA2C. Value set = 2
Address of setter function 2 = 000000000023F91C. Value set = 3
Address of setter function 3 = 000000000023F80C. Value set = 4
Address of setter function 4 = 000000000023F6FC. Value set = 5
Address of setter function 5 = 000000000023F5EC. Value set = 6
Address of setter function 6 = 000000000023F4DC. Value set = 7
Address of setter function 7 = 000000000023F3CC. Value set = 8
Address of setter function 8 = 000000000023F2BC. Value set = 9
Address of setter function 9 = 000000000023F1AC. Value set = 10
Address of getter function 0 = 000000000023FB24. Value gotten = 1
Address of getter function 1 = 000000000023FA14. Value gotten = 2
Address of getter function 2 = 000000000023F904. Value gotten = 3
Address of getter function 3 = 000000000023F7F4. Value gotten = 4
Address of getter function 4 = 000000000023F6E4. Value gotten = 5
Address of getter function 5 = 000000000023F5D4. Value gotten = 6
Address of getter function 6 = 000000000023F4C4. Value gotten = 7
Address of getter function 7 = 000000000023F3B4. Value gotten = 8
Address of getter function 8 = 000000000023F2A4. Value gotten = 9
Address of getter function 9 = 000000000023F194. Value gotten = 10
Sum of values gotten = 55

Nick Maclaren
Aug 31, 2016, 4:44:38 PM

In article <nq7ed5$24c$1...@dont-email.me>,
James Van Buskirk <not_...@comcast.net> wrote:
>
>> Yes. Some older FORTRAN compilers relied on self-modifying code,
>> and still allowed recursion; full (asynchronous) reentrancy is not
>> needed, though it (arguably) is for OpenMP support. But, as Ivan
>> says, nested functions do NOT need self-modifying code, nor does
>> dynamic code generation, as was well-known in the 1960s.
>
>While both you and Ivan, two sources whom I have cause to respect,
>have asserted this, I still can't see how it works. I have rewritten my
>previous example for greater clarity and flexibility and would like to
>be informed how it might work with either the same address for all
>functions or all different addresses without self-modifying code:

The following is perhaps the simplest solution, but there are others:

Function addresses are pairs, containing the address proper and the
scope information, and are passed in two registers or a register
and a known location.
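
In C terms the pair might be spelled out like this (illustrative only; a
real implementation passes the pair in registers rather than in a struct):

#include <stdio.h>

struct closure {
    int (*code)(void *env, int arg);  /* the address proper */
    void *env;                        /* the scope information */
};

static int call_closure(struct closure c, int arg) {
    return c.code(c.env, arg);
}

/* A "nested" function compiled to take its environment explicitly. */
static int adder(void *env, int arg) {
    return *(int *)env + arg;
}

int main(void) {
    int captured = 42;                       /* the enclosing scope */
    struct closure add42 = { adder, &captured };
    printf("%d\n", call_closure(add42, 8));  /* prints 50 */
    return 0;
}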


Regards,
Nick Maclaren.

James Van Buskirk
Aug 31, 2016, 7:49:58 PM

"Nick Maclaren" wrote in message news:nq7fjk$6b4$1...@dont-email.me...
My concern about this kind of solution is that it breaks the C
language ABIs that I am aware of.

Ivan Godard
Aug 31, 2016, 8:10:17 PM

Yes, because the C ABI is insufficient. Call the two-pointer solution the A
(for Algol) ABI; the widely used Itanium ABI is an example. It is trivial
to implement the C-ABI on an A-ABI platform; just ignore the second pointer.
To run A-ABI code on a C-ABI requires passing two-pointer arguments when
the ABI expects one; you can do that with an extra "hidden" argument, as
in the hidden "this" pointer when calling class methods.

However, that doesn't deal with the memory layout of data containing
A-ABI functions in a C-ABI world; you can't fit 16 bytes of pointers
into an 8-byte hole. For that, the usual practice is for the compiler to
generate a small code fragment (called a thunk or trampoline) at every
point at which the address of a function is taken. It also allocates
data space for an environment pointer, and the generated thunk knows the
location of the corresponding pointer and initializes it with the
appropriate down-stack link.

The C-ABI function pointer is actually a pointer to the thunk code. That
code, which is unique to the particular callee/environment pair, can
load the environment pointer from the place it knows in data space and
then call the "real" function with the environment as a hidden argument.

There is some additional armwaving required when the address-take is
inside a recursive function, involving saving and restoring the
environment pointer; I leave that to you as an exercise.
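
A hand-written sketch of the shape of such a thunk, with illustrative names
(a real compiler emits the fragment and its environment slot automatically):

static int real_fn(void *env, int arg) {  /* expects the hidden argument */
    return *(int *)env * arg;
}

static void *thunk_env;                /* data space for the environment */

static int thunk(int arg) {            /* what the C-ABI pointer points at */
    return real_fn(thunk_env, arg);    /* re-supplies the hidden argument */
}

int main(void) {
    int scale = 3;
    thunk_env = &scale;                /* initialized with the link */
    int (*c_abi_ptr)(int) = thunk;     /* fits in an 8-byte hole */
    return c_abi_ptr(7) == 21 ? 0 : 1;
}

As the single static slot suggests, each thunk/slot pair represents one
environment at a time; the save/restore armwaving above is what makes that
safe under recursion.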

William Edwards
Sep 1, 2016, 8:20:27 AM

OpenBSD 6.0 was officially released today, with a major focus on W^X yet again: http://undeadly.org/cgi?action=article&sid=20160901090415

They have a great release song to go with it :) http://www.openbsd.org/lyrics.html#60a

James Van Buskirk
Sep 1, 2016, 2:45:10 PM

"Ivan Godard" wrote in message news:nq7rl7$clh$1...@dont-email.me...
Right, and all this was the point of my examples. I count the generation
of those trampolines as self-modifying code in that the program writes
them into memory somewhere and subsequently transfers control to
them. Perhaps that is a semantic issue where you and Nick differ from
my nomenclature?

Ivan Godard
Sep 1, 2016, 3:24:17 PM

On 9/1/2016 11:45 AM, James Van Buskirk wrote:
> "Ivan Godard" wrote in message news:nq7rl7$clh$1...@dont-email.me...


>> There is some additional armwaving required when the address-take is
>> inside a recursive function, involving saving and restoring the
>> environment pointer; I leave that to you as an exercise.
>
> Right, and all this was the point of my examples. I count the generation
> of those trampolines as self-modifying code in that the program writes
> them into memory somewhere and subsequently transfers control to
> them. Perhaps that is a semantic issue where you and Nick differ from
> my nomenclature?
>

No, the program doesn't create the thunks, and the creation is not done
at run time; thunk creation is compile-time, just like creation of the
code for the rest of the program. This is neither JIT nor SMC, but is
entirely static.

James Van Buskirk
Sep 1, 2016, 3:41:09 PM

"Ivan Godard" wrote in message news:nq9v8v$o7h$1...@dont-email.me...
Are we talking about the same example? I was thinking in terms of

https://groups.google.com/d/msg/comp.arch/-T_Gam9_TvY/nPqSxerlAQAJ

where thunks to 10 setter and 10 getter functions were generated
and a million would have been possible depending on user input.
How can this be achieved at compile time?

Anton Ertl
Sep 1, 2016, 4:19:42 PM

Thunk (as in routines for Algol 60 call-by-name) creation may be
static (but the environment needs to be in a separate pointer), but
trampoline (a code fragment that represents a combination of the code
for a function and its environment) creation is at run-time: For each
combination of environment pointer and code pointer, you need one
trampoline, and because the environments are created at run-time, and
can be created in arbitrary numbers, you need to do the same with
trampolines; the same trampoline pointer obviously cannot represent
two different combinations of code and environment.
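
This run-time creation is, by the way, exactly what libffi's closure API
packages up; a minimal sketch, assuming libffi is installed (link with
-lffi; error checking and deallocation omitted):

#include <stdio.h>
#include <ffi.h>

/* One handler shared by all trampolines; the per-closure environment
   arrives as the user_data pointer. */
static void getter_impl(ffi_cif *cif, void *ret, void **args, void *env)
{
    (void)cif; (void)args;
    *(ffi_arg *)ret = *(int *)env;
}

int main(void)
{
    enum { N = 3 };
    int envs[N] = { 10, 20, 30 };  /* three distinct run-time environments */
    int (*getters[N])(void);
    ffi_cif cif;

    /* the int (*)(void) call interface, shared by all the trampolines */
    ffi_prep_cif(&cif, FFI_DEFAULT_ABI, 0, &ffi_type_sint, NULL);

    for (int i = 0; i < N; i++) {
        void *code;
        ffi_closure *cl = ffi_closure_alloc(sizeof(ffi_closure), &code);
        /* one fresh executable trampoline per (code, environment) pair */
        ffi_prep_closure_loc(cl, &cif, getter_impl, &envs[i], code);
        getters[i] = (int (*)(void))code;
    }
    for (int i = 0; i < N; i++)
        printf("getter %d at %p returns %d\n",
               i, (void *)getters[i], getters[i]());
    return 0;
}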

Anton Ertl
Sep 3, 2016, 4:47:20 AM

Robert Wessel <robert...@yahoo.com> writes:
>On Sun, 28 Aug 2016 05:08:20 -0700 (PDT), Quadibloc
><jsa...@ecn.ab.ca> wrote:
>
>>On Sunday, August 28, 2016 at 1:15:22 AM UTC-6, robert...@yahoo.com wrote:
>>
>>> That support bit a bunch of people, performance wise, somewhere around
>>> the z10, where IBM introduced split I and D caches. Potential code
>>> modifications were tracked to a 256 block, and if you modified
>>> anything in that block, you got an expensive invalidation to the
>>> I-cache for any instructions cached in that block.
>>
>>Having separate I-caches and D-caches shouldn't have to make self-modifying
>>code expensive, at least in principle.

In practice, they do.

>>A bus linking the I and D caches, plus the ability to mark individual 16-bit
>>halfwords in a cache line of the I cache as invalid, come to mind as the
>>measures relevant to the System/360 architecture. So, once IBM found the
>>problems from experience in the z10 or whatever, they could ameliorate, if not
>>eliminate, them in the next iteration.
>
>
>Nah... (almost*) everyone fixed their software - you had to, or you'd
>be stuck until the next generation. And you still need to avoid
>roughly the same code/data area interactions.

In the Intel-compatible world, the problem has been there since the P5
(Pentium) in 1993 and K5 introduced separate I and D caches, so it is
here to stay (and in the P5 and K6 (don't know about K5), even data
reads from the cache line invalidated it in the I-cache; that was
fixed in P6 and K7).

Neither Intel nor AMD have implemented something like the remedy
suggested by Quadibloc. What he suggests is probably too expensive in
every way: Stores would have to be synchronized with instruction
fetching on every instruction (currently they synchronize with each
other through snooping on cache misses, which are much rarer); also,
instruction fetching and decoding in a pipelined manner with holes in
the cache line would complicate the process quite a bit. It would
probably be better to go back to a combined L1 cache, and even there
you would have to synchronize the instruction prefetch buffer with
the store buffer.

Concerning software: In the Forth world, these false-sharing problems
have been known for 20+ years, and most still interleave data and
code; the best that most did was additional padding, but what I see is
padding to 32-byte boundaries, while the cache lines are 64 bytes long
these days; also prefetching might even make alignment to 64-byte
boundaries insufficient (I have not seen prefetch effects since the
P55, but then I have not micro-benchmarked for that).
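
The fix, in C terms, is just to force the mutated data onto its own cache
line(s) (a sketch; 64 is an assumption about the line size, and a Forth
dictionary entry would pad analogously between its code and data fields):

#include <stdalign.h>

#define LINE 64                              /* assumed cache-line size */

struct dict_entry {
    alignas(LINE) unsigned char code[LINE];  /* machine code, executed */
    alignas(LINE) long data;                 /* stored to at run time */
};
/* sizeof(struct dict_entry) rounds up to a multiple of LINE, so the next
   entry's code does not share a line with this entry's data either. */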

I would be surprised if the System z software vendors were so much
more nimble than the Forth system vendors. My guess is that they only
fixed the really bad cases of false sharing (where an inner loop
writes to some data in the same cache line), and all other cases are
still there.

> Also note that this wasn't really much of an issue for most
>(any?) HLLs, unless they did something JIT-ish.

JIT is not an issue. The few writes until the JIT has filled the
cache line don't really hurt much even if the code in the earlier part
of the cache line is already executed. The issue is whether you mix
data and code. This has even been an issue with gcc without any
JITting going on, on the P5:
<http://www.complang.tuwien.ac.at/misc/pentium-switch/switch-on-pentium>
(graphic at
<http://www.complang.tuwien.ac.at/misc/pentium-switch/graph.ps.gz>).

As for the 286 segments eliminating the issue on Intel-compatible
CPUs, this obviously has not happened in the Forth world. Probably
not elsewhere, either. If your software was designed to mix code and
data, you set up the segments in a way that allows that.

Quadibloc
Sep 3, 2016, 11:35:41 AM

On Saturday, September 3, 2016 at 2:47:20 AM UTC-6, Anton Ertl wrote:

> Neither Intel nor AMD have implemented something like the remedy
> suggested by Quadibloc. What he suggests is probably too expensive in
> every way:

Actually, no, since for the x86 architecture, you don't really "need" to allow
self-modifying code. If *IBM* never implemented something like the remedy I
suggested for the processors for their System z mainframe, *then* you could
conclude that my suggested remedy is too expensive to try... even when it would
actually be useful for something.

John Savard

Anton Ertl
Sep 3, 2016, 11:54:43 AM

Quadibloc <jsa...@ecn.ab.ca> writes:
>On Saturday, September 3, 2016 at 2:47:20 AM UTC-6, Anton Ertl wrote:
>
>> Neither Intel nor AMD have implemented something like the remedy
>> suggested by Quadibloc. What he suggests is probably too expensive in
>> every way:
>
>Actually, no, since for the x86 architecture, you don't really "need" to allow
>self-modifying code.

Whatever '"need"' may mean. Self-modifying code is allowed and works
on 8086, IA-32 and AMD64, transparently, and is used there. If it was
not allowed, like on RISCs, the false sharing would not be a problem:
the cache line could just reside in both the I and the D cache, and
any inconsistency between these caches that results in the program not
doing what is intended (i.e., true sharing) would be the fault of the
programmer (like in RISCs).
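
The RISC-style contract, made explicit in GNU C (a sketch assuming POSIX
mmap; error handling omitted, and a W^X-respecting JIT would map the page
writable first and mprotect it to executable afterwards):

#include <string.h>
#include <sys/mman.h>

typedef int (*fn_t)(void);

fn_t install(const unsigned char *code, size_t len)
{
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE | PROT_EXEC,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    memcpy(buf, code, len);       /* the new code goes through the D-cache */
    /* a no-op on IA-32/AMD64; required on most RISCs before executing */
    __builtin___clear_cache((char *)buf, (char *)buf + len);
    return (fn_t)buf;
}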

>If *IBM* never implemented something like the remedy I
>suggested for the processors for their System z mainframe, *then* you could
>conclude that my suggested remedy is too expensive to try...

The case for IBM z is just the same as for IA-32 and AMD64:
Self-modifying code is allowed in all these architectures, and
therefore any implementation with separate I and D caches has to keep
these caches consistent, and in that context false sharing is a
problem. And Intel have not attempted your solution in the 23 years
that they have had this problem, AMD not in the 20 years, and AFAIK
IBM not in the 8 years.

Robert Wessel
Sep 7, 2016, 6:54:27 PM

On Sat, 03 Sep 2016 07:57:03 GMT, an...@mips.complang.tuwien.ac.at
(Anton Ertl) wrote:

>I would be surprised if the System z software vendors were so much
>more nimble than the Forth system vendors. My guess is that they only
>fixed the really bad cases of false sharing (where an inner loop
>writes to some data in the same cache line), and all other cases are
>still there.


Of course. Although in practice there tended to be fewer such problem
areas than originally expected - no code from HLLs, most code written
to the reentrant/reusable standards, etc., mostly couldn't have the
problem to begin with. Many other cases involved chunks big enough
that, even with code and data intertwined, you mostly avoided the
issue except at the function entry and epilog, and even then you
tended to have several functions in a row before an associated data
area, so only the entry of the first function in the block and the
epilog of the last were affected. In more than a few cases the fix
was just to put some padding before and after the storage areas. In
other cases, especially the more JIT-like areas, rather more movement
tended to be necessary.

But certainly the problem was usually fairly limited except for loops
- although with the right inner loop, very painful until fixed. Which
is not to say non-existent, and you only needed one hit in your
codebase to require you to issue an emergency patch.